Hashing array of strings in javascript

Question

Just wondering if there is some other way than this.

var hashStringArray = function(array) {
    array.sort();
    return array.join('|');
};

I don't like sorting much and using that delimiter is not safe either if it's contained in one of the strings. In overall I need to produce same hash no matter the order of strings. It will be rather short arrays (up to 10 items), but it will be required very often so it shouldn't be too slow.

I intend to use it with ES6 Map object and I need to easily find same array collection.

Updated example of use

var theMap = new Map();
var lookup = function(arr) {
    var item = null;
    var hashed = hashStringArray(arr);
    if (item = theMap.get( hashed )) {
        return item;
    }
    theMap.set( hashed, itemBasedOnInput );
    return itemBasedOnInput;
}

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

lookup(arr1) === lookup(arr2)

Performance tests

http://jsperf.com/hashing-array-of-strings/5

I could be wrong but doesn't ES6 Map support non-string keys? i.e. theMap.set(arr1, "test"). Still, you could use JSON.stringify(arr1) or create a "real" hash function that converts a string into a number... — Bart
– Bart, Commented Aug 3, 2014 at 12:11
@FredyC Can you please give an example where the delimiter would be a problem? — thefourtheye
– thefourtheye, Commented Aug 3, 2014 at 12:17
On the other hand, you could just use arr.join("\0"), it seems unlikely anyone will put a null character in a string... — Bart
– Bart, Commented Aug 3, 2014 at 12:19
@thefourtheye If you use | as the delimiter, these two arrays would produce the same hash: ["abc", "def|ghi"] and ["abc","def","ghi"]. — Barmar
– Barmar, Commented Aug 3, 2014 at 12:22
Also, there's no guarantee that MD5 will be unique, although collisions are unlikely. — Barmar
– Barmar, Commented Aug 3, 2014 at 12:23

sifriday · Accepted Answer · 2014-08-03 18:55:17Z

6

Two things occurred to me as the basis of a solution:

summing doesn't depend on order, which is actually a flaw in simple checksums (they don't catch changes in block order within a word), and
we can convert strings to summable numbers using their charcodes

Here's a function to do (2) :

charsum = function(s) {
  var i, sum = 0;
  for (i = 0; i < s.length; i++) {
    sum += (s.charCodeAt(i) * (i+1));
  }
  return sum
}

Here's a version of (1) that computes an array hash by summing the charsum values:

array_hash = function(a) {
  var i, sum = 0
  for (i = 0; i < a.length; i++) {
    var cs = charsum(a[i])
    sum = sum + (65027 / cs)
  }
  return ("" + sum).slice(0,16)
}

Fiddle here: http://jsfiddle.net/WS9dC/11/

If we did a straight sum of the charsum values, then the array ["a", "d"] would have the same hash as the array ["b", "c"] - leading to undesired collisions. So based on using non-UTF strings, where charcodes go up to 255, and allowing for 255 characters in each string, then the max return value of charsum is 255 * 255 = 65025. So I picked the next prime number up, 65027, and used (65027 / cs) to compute the hash. I am not 100% convinced this removes collisions... perhaps more thought needed... but it certainly fixes the [a, d] versus [b, c] case. Testing:

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

console.log(array_hash(arr1))
console.log(array_hash(arr2))
console.log(array_hash(arr1) == array_hash(arr2))

Outputs:

443.5322979371356 
443.5322979371356
true

And testing a case that shows different hashes:

var arr3 = ['a', 'd'];
var arr4 = ['b', 'c'];

console.log(array_hash(arr3))
console.log(array_hash(arr4))
console.log(array_hash(arr3) == array_hash(arr4))

outputs:

1320.651443298969
1320.3792001649144
false

Edit:

Here's a revised version, which ignore duplicates from the arrays as it goes, and return the hash based on unique items only:

http://jsfiddle.net/WS9dC/7/

array_hash = function(a) {
  var i, sum = 0, product = 1
  for (i = 0; i < a.length; i++) {
    var cs = charsum(a[i])
    if (product % cs > 0) {
      product = product * cs
      sum = sum + (65027 / cs)  
    }
  }
  return ("" + sum).slice(0, 16)
}

testing:

var arr1 = ['alpha', 'beta', 'gama', 'delta', 'theta', 'alpha', 'gama'];
var arr2 = ["beta", "gama", "alpha", "theta", "delta", "beta"];

console.log(array_hash(arr1))
console.log(array_hash(arr2))
console.log(array_hash(arr1) === array_hash(arr2))

returns:

689.878503111701
689.878503111701
true

Edit

I've revised the answer above to account for arrays of words that have the same letters. We need these to return different hashes, which they now do:

var arr1 = ['alpha', 'beta']
var arr2 = ['alhpa', 'ateb']

The fix was to add a multiplier to the charsum func based on the char index:

sum += (s.charCodeAt(i) * (i+1));

edited Aug 3, 2014 at 18:55

answered Aug 3, 2014 at 14:22

sifriday

4,5011 gold badge16 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

FredyC Over a year ago

Could you please add this to that benchmark suite ? jsperf.com/hashing-array-of-strings

sifriday Over a year ago

yep, will do - it will be interesting to work out how it compares. Also, I'd like to think of a good testwrap that looks for collisions, but this might be beyond me! It might actually be easier to think through the maths and look for something that might break it.

FredyC Over a year ago

Not that big difference, but definitely faster. However it doesn't really solves collisions... jsfiddle.net/WS9dC/4

sifriday Over a year ago

sorry if I am being slow, but the example in your fiddle looks OK - the arrays have different hashes and the comparison returns false, as desired? for me it returns: 975.5782767769929, 847.7110273835465, false

sifriday Over a year ago

better fix and test cases: jsfiddle.net/WS9dC/11 ... the change was the addition line in charsum, which is now: sum += (s.charCodeAt(i) * (i+1));

|

Guffa · Accepted Answer · 2014-08-03 12:51:53Z

1

If you calculate a numeric hash code for each string, then you can combine them with an operator where the order doesn't matter, like the ^ XOR operator, then you don't need to sort the array:

function hashStringArray(array) {
  var code = 0;
  for (var i = 0; i < array.length; i++) {
    var n = 0;
    for (var j = 0; j < array[i].length; j++) {
      n = n * 251 ^ array[i].charCodeAt(j);
    }
    code ^= n;
  }
  return code
};

answered Aug 3, 2014 at 12:51

Guffa

703k111 gold badges760 silver badges1k bronze badges

9 Comments

FredyC Over a year ago

Well isn't that too slow ? As I said, I expect this function to be called pretty often. I can hardly imagine it would be performance wise to loop over these strings every time.

Guffa Over a year ago

@FredyC: Well, you would have to profile that to find out. Copying strings is also a way to loop through all data, and that also allocates memory, so it's not obvious which would be faster.

FredyC Over a year ago

Thanks for the effort @Guffa, I will give it a try when hunting for the performance. It's definitely interesting approach too.

sifriday Over a year ago

+1 for using charCodeAt to generate a numeric representation of the strings, then using an operator that doesn't care about ordering. However, Guffa - doesn't this lead to easy collisions? eg try it with var c = ['c', 'd', 'f']; var d = ['a', 'b', 'b'];

FredyC Over a year ago

@sifriday I am expecting to filter out duplicates before running this. I don't expect there is a way how to avoid that.

|

Marco Bonelli · Accepted Answer · 2014-08-03 12:33:02Z

0

You can do this:

var hashStringArray = function(array) {
    return array.sort().join('\u200b');
};

The \u200b character is an unicode character that also means null, but is not the same as the \0 character, which is most widely used.

'\u200b' == '\0'

> false

answered Aug 3, 2014 at 12:33

Marco Bonelli

71.1k21 gold badges129 silver badges153 bronze badges

Comments

GameAlchemist · Accepted Answer · 2014-08-03 19:04:44Z

0

An idea to have very fast hash if your set of possible string is less than 32 items long : hash the string with a built-in hash function that will return power-of two as hash :

function getStringHash(aString) {
   var currentPO2 = 0;
   var hashSet = [];
   getStringHash = function ( aString) {
       var aHash = hashSet[aString];
       if (aHash) return aHash;
       aHash = 1 << currentPO2++;
       hashSet[aString] = aHash; 
       return aHash;
   }
   return getStringHash(aString);
}

Then use this hash on your string array, ORing the hashes ( | ) :

function getStringArrayHash( aStringArray) {
    var aHash = 0;
    for (var i=0; i<aStringArray.length; i++) {
        aHash |= getStringHash(aStringArray[i]);
    }
    return aHash;
}

So to test a bit :

console.log(getStringHash('alpha'));  // 1
console.log(getStringHash('beta'));   // 2
console.log(getStringHash('gamma'));  // 4
console.log(getStringHash('alpha'));  // 1 again

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];
var arr3 = ['alpha', 'teta'];

console.log(getStringArrayHash(arr1)); // 11
console.log(getStringArrayHash(arr2)); // 11 also, like for arr1

var arr3 = ['alpha', 'teta'];
console.log(getStringArrayHash(arr3)); // 17 : a different array has != hashset

jsbin is here : http://jsbin.com/rozanufa/1/edit?js,console

RQ !!! with this method, arrays are considered as set, meaning that a repeated item won't change the hash of an array !!!

This HAS to be faster since it uses only 1) function call 2) lookup 3) integer arithmetic. So no sort, no (long) string, no concat.

jsperf confirms that : http://jsperf.com/hashing-array-of-strings/4

enter image description here

EDIT :

version with prime numbers, here : http://jsbin.com/rozanufa/3/edit?js,console

        // return the unique prime associated with the string.
    function getPrimeStringHash(aString) {
       var hashSet = [];
       var currentPrimeIndex = 0;
       var primes = [ 2, 3, 5, 7, 11, 13, 17 ];
       getPrimeStringHash = function ( aString) {
           var aPrime = hashSet[aString];
           if (aPrime) return aPrime;
           if (currentPrimeIndex == primes.length) aPrime = getNextPrime();
           else aPrime = primes[currentPrimeIndex]; 
           currentPrimeIndex++
           hashSet[aString] = aPrime; 
           return aPrime;
       };
       return getPrimeStringHash(aString);
       // compute next prime number, store it and returns it.
       function getNextPrime() {
         var pr = primes[primes.length-1];
         do {
             pr+=2;
             var divides = false;
             // discard the number if it divides by one earlier prime.
             for (var i=0; i<primes.length; i++) {
                 if ( ( pr % primes[i] ) == 0 ) {
                     divides = true;
                     break;
                 }
             }
          } while (divides == true)
          primes.push(pr);
         return pr;
        }
    }

    function getStringPrimeArrayHash( aStringArray) {
        var primeMul = 1;
        for (var i=0; i<aStringArray.length; i++) {
            primeMul *= getPrimeStringHash(aStringArray[i]);
        }
        return primeMul;
    }

    function compareByPrimeHash( aStringArray, anotherStringArray)  {
        var mul1 = getStringPrimeArrayHash ( aStringArray ) ;
        var mul2 = getStringPrimeArrayHash ( anotherStringArray ) ;
        return  ( mul1 > mul2 ) ? 
                                   ! ( mul1 % mul2 ) 
                                 : ! ( mul2 % mul1 );
      // Rq : just test for mul1 == mul2 if you are sure there's no duplicates
    }

Tests :

console.log(getPrimeStringHash('alpha'));  // 2
console.log(getPrimeStringHash('beta'));   // 3
console.log(getPrimeStringHash('gamma'));  // 5
console.log(getPrimeStringHash('alpha'));  // 2 again
console.log(getPrimeStringHash('a1'));  // 7 
console.log(getPrimeStringHash('a2'));  // 11


var arr1 = ['alpha','beta','gamma'];
var arr2 = ['beta','alpha','gamma'];
var arr3 = ['alpha', 'teta'];
var arr4 = ['alpha','beta','gamma', 'alpha']; // == arr1 + duplicate 'alpha'

console.log(getStringPrimeArrayHash(arr1)); // 30
console.log(getStringPrimeArrayHash(arr2)); // 30 also, like for arr1

var arr3 = ['alpha', 'teta'];
console.log(getStringPrimeArrayHash(arr3)); // 26 : a different array has != hashset

console.log(compareByPrimeHash(arr1, arr2) ); // true
console.log(compareByPrimeHash(arr1, arr3) ); // false
console.log(compareByPrimeHash(arr1, arr4) ); // true despite duplicate

edited Aug 3, 2014 at 19:04

answered Aug 3, 2014 at 15:27

GameAlchemist

19.3k7 gold badges38 silver badges59 bronze badges

11 Comments

FredyC Over a year ago

I haven't inspected it much, but would you please add this to the benchmark test so we have comparison with @sifriday solution ?

GameAlchemist Over a year ago

@FredyC : this is in deed much faster, see the jsperf test.

FredyC Over a year ago

Well that definitely looks even faster and solves duplicates, however you misunderstood the intentional slow down of first two tests. This way the filtering is not accounted for benchmark. Also your solution is not using the lookup from the test suite. I have updated it now ... jsperf.com/hashing-array-of-strings/5 ... It still beats everything else, awesome !

FredyC Over a year ago

Hmm, I just realized I cannot use this solution. There will be definitely more than 32 strings in total :(

GameAlchemist Over a year ago

no issue with that : use prime numbers instead of power of two and multiply instead of ORing. Two hashes are equal if their modulo is == 0.

|

Collectives™ on Stack Overflow

Hashing array of strings in javascript

Updated example of use

Performance tests

4 Answers 4

8 Comments

9 Comments

Comments

11 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Updated example of use

Performance tests

4 Answers 4

8 Comments

9 Comments

Comments

11 Comments

Your Answer

Sign up or log in

Post as a guest

Related