1

I am trying to implement a hash cons in java, comparable to what String.intern does for strings. I.e., I want a class to store all distinct values of a data type T in a set and provide an T intern(T t) method that checks whether t is already in the set. If so, the instance in the set is returned, otherwise t is added to the set and returned. The reason is that the resulting values can be compared using reference equality since two equal values returned from intern will for sure also be the same instance.

Of course, the most obvious candidate data structure for a hash cons is java.util.HashSet<T>. However, it seems that its interface is flawed and does not allow efficient insertion, because there is no method to retrieve an element that is already in the set or insert one if it is not in there.

An algorithm using HashSet would look like this:

class HashCons<T>{
    HashSet<T> set = new HashSet<>();

    public T intern(T t){
        if(set.contains(t)) {
           return ???;  // <----- PROBLEM
        } else {
           set.add(t); // <--- Inefficient, second hash lookup
           return t;
    }
}

As you see, the problem is twofold:

  1. This solution would be inefficient since I would access the hash table twice, once for contains and once for add. But okay, this may not be a too big performance hit since the correct bucket will be in the cache after the contains, so add will not trigger a cache miss and thus be quite fast.
  2. I cannot retrieve an element already in the set (see line flagged PROBLEM). There is just no method to retrieve the element in the set. So it is just not possible to implement this.

Am I missing something here? Or is it really impossible to build a usual hash cons with java.util.HashSet?

2 Answers 2

1

I don't think it's possible using HashSet. You could use some kind of Map instead and use your value as key and as value. The java.util.concurrent.ConcurrentMap also happens to posess the quite convenient method

putIfAbsent(K key, V value)

that returns the value if it is already existent. However, I don't know about the performance of this method (compared to checking "manually" on non-concurrent implementations of Map).

Here is how you would do it using a HashMap:

class HashCons<T>{
    Map<T,T> map = new HashMap<T,T>();

    public T intern(T t){
        if (!map.containsKey(t))
            map.put(t,t);
        return map.get(t);
    }
}

I think the reason why it is not possible with HashSet is quite simple: To the set, if contains(t) is fulfilled, it means that the given t also equals one of the t' in the set. There is no reason for being able return it (as you already have it).

Sign up to request clarification or add additional context in comments.

8 Comments

Yes, I thought already about this, too. But this is a bit hacky, as the thing is not a map, it is a set. Of course, I could abuse a map that has no meaningful value type (as HashSet implementation does itself internally), but it would be a shame that a usual set implementation is lacking such a feature... In addition, I would still have to access the HashMap two times which is a shame :(
Look here stackoverflow.com/questions/23877776/… you need a Map to substitute any candidate with its first created identical instance. Also stackoverflow.com/questions/23744201/…
Just a little remark here: HashSet actually uses a HashMap to store the items, so if you change it to HashMap it won't have any negative effect. Eventually, you can do the same as a HashSet does: use only the keys of a map.
A map is such a waste, each T is stored twice, once as key and once as value. If I store millions of values, that is a considerable waste of space. It is hilarious that there is no space efficient set in java.util...
@cy3er: Yes, true. I will write my own hash table then, since the thing is to big and I cannot tolerate so much space wastage.
|
1

Well HashSet is implemented as HashMap wrapper in OpenJDK, so you won't win in memory usage comparing to solution suggested by aRestless.

10-min sketch

class HashCons<T> {
    T[] table;
    int size;
    int sizeLimit;
    HashCons(int expectedSize) {
        init(Math.max(Integer.highestOneBit(expectedSize * 2) * 2, 16));
    }

    private void init(int capacity) {
        table = (T[]) new Object[capacity];
        size = 0;
        sizeLimit = (int) (capacity * 2L / 3);
    }

    T cons(@Nonnull T key) {
        int mask = table.length - 1;
        int i = key.hashCode() & mask;
        do {
            if (table[i] == null) break;
            if (key.equals(table[i])) return table[i];
            i = (i + 1) & mask;
        } while (true);
        table[i] = key;
        if (++size > sizeLimit) rehash();
        return key;
    }

    private void rehash() {
        T[] table = this.table;
        if (table.length == (1 << 30))
            throw new IllegalStateException("HashCons is full");
        init(table.length << 1);
        for (T key : table) {
            if (key != null) cons(key);
        }
    }
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.