
I want to hash a string of length up to 30. What is the best way to do that if time is my concern? The function will be called over 100 million times. Currently I am using the following code:

static UInt64 CalculateHash(string read, bool lowTolerance)
{
    UInt64 hashedValue = 0;
    int i = 0;
    while (i < read.Length)
    {
        hashedValue += read.ElementAt(i) * (UInt64)Math.Pow(31, i);
        if (lowTolerance) i += 2;
        else i++;
    }
    return hashedValue;
}
  • 7
    Is there a reason that the Object.GetHashCode() method won't work for you? It seems like you're pretty much reimplementing the same concept. Commented Mar 3, 2012 at 11:20
  • 3
    Anything that doesn't use floating point math will be faster. Commented Mar 3, 2012 at 11:21
  • 1
    GetHashCode is not persistable, so if he needs to store the Hash Code into a database, it's not useful. Then again, neither is this. What is your usage? Do you only need to hash the string at runtime, or what do you need to do with the Hash? Adler-32 might be an option if you need to store it and don't run into too many collisions. Commented Mar 3, 2012 at 11:26
  • 2
    @Pbasak Then cast it to uint or mask it with 0x7FFFFF. Commented Mar 3, 2012 at 13:36
  • 26
    Run a profiler. That will tell you what the slow part is. Then fix the slow part. Commented Mar 3, 2012 at 15:24

3 Answers

static UInt64 CalculateHash(string read)
{
    UInt64 hashedValue = 3074457345618258791ul;
    for (int i = 0; i < read.Length; i++)
    {
        hashedValue += read[i];
        hashedValue *= 3074457345618258799ul;
    }
    return hashedValue;
}

This is a Knuth hash. You can also use Jenkins.
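For reference, a minimal sketch of the Jenkins one-at-a-time hash mentioned above (the class and method names here are illustrative, not from any answer):

```csharp
using System;

static class JenkinsHash
{
    // Jenkins one-at-a-time hash (32-bit): per-character add/shift/xor
    // mixing, followed by a final avalanche step.
    public static uint OneAtATime(string s)
    {
        uint hash = 0;
        foreach (char c in s)
        {
            hash += c;
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }
}
```

Like the Knuth-style hash, it needs only integer adds, shifts, and xors, but it mixes bits more aggressively per character.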


8 Comments

According to my own test, this function does not achieve avalanche. YMMV.
It's worse. But I should quantify my original statement. Toggling a single bit on the input results in about 49.40% of the output bits toggling (using your original constant), which is MUCH better than Bernstein-based functions. That's probably good enough for most uses. But, for instance, SuperFastHash (landman-code.blogspot.com/2009/02/…) is giving me 50.02%. And Murmur2 on the same page is giving me 50.04%.
It's not intended for applications where you care about that. It's just intended to be used to distribute strings in a hash table.
Could you please provide citation for this algorithm? I searched TAOCP Vol 3 but can't find these constants you have.
@ShitalShah I'm pretty sure it's from TAOCP, but I'm not sure which volume.

First of all, consider using GetHashCode().

A simple improvement on your existing implementation:

static UInt64 CalculateHash(string read, bool lowTolerance)
{
    UInt64 hashedValue = 0;
    int i = 0;
    ulong multiplier = 1;
    while (i < read.Length)
    {
        hashedValue += read[i] * multiplier;
        multiplier *= 37;
        if (lowTolerance) i += 2;
        else i++;
    }
    return hashedValue;
}

It avoids the expensive floating-point calculation and the per-call overhead of ElementAt.

Btw, (UInt64)Math.Pow(31, i) doesn't work well for longer strings: floating-point rounding gives wrong multipliers once 31^i exceeds the 53-bit precision of double (around i = 11), and the cast to UInt64 overflows entirely beyond i = 13.
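The divergence is easy to demonstrate with a small standalone check (names are illustrative) that compares wrapping integer multiplication against the pow-and-cast approach:

```csharp
using System;

// Compares 31^i computed with wrapping integer multiplication (mod 2^64,
// like the hash itself) against (UInt64)Math.Pow(31, i).
static class PowCheck
{
    // Returns the smallest exponent i in 0..30 where the two disagree,
    // or -1 if they always agree. Expected to disagree once the double
    // can no longer hold the product exactly.
    public static int FirstMismatch()
    {
        ulong exact = 1;
        for (int i = 0; i <= 30; i++)
        {
            if ((ulong)Math.Pow(31, i) != exact)
                return i;
            exact *= 31;  // wraps mod 2^64
        }
        return -1;
    }
}
```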

5 Comments

The multiplier must start at a prime value greater than 256 or this breaks horribly if the first byte is small.
@DavidSchwartz A larger prime is certainly better, but breaking horribly is a bit of an overstatement.
If a 64-bit hash function has numerous 2-byte inputs that collide, IMO it breaks horribly. (But given how awful the function the OP started with, maybe my standards are too high.)
Even with a prime >256 but <65536 there will be two char collisions. C# works on UTF-16 codepoints, not single byte chars.
Just a warning for everyone who's on .NET Core: in .NET Core, GetHashCode is randomized between app restarts! Which means you get a different hash for the same string every time the app is restarted/recycled.

To speed up your implementation, the (UInt64)Math.Pow(31, i) call should be replaced by a lookup: pre-calculate a table of the first 30 powers of 31, and use it at runtime. Since the limit on the length is 30, you need only 31 elements:

private static readonly ulong[] Pow31 = new ulong[31];

static HashCalc() {
    Pow31[0] = 1;
    for (int i = 1 ; i != Pow31.Length ; i++) {
        Pow31[i] = 31 * Pow31[i - 1];
    }
}

// In your hash function...
hashedValue += read.ElementAt(i) * Pow31[i];
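Putting the pieces together, a complete runnable sketch (the class name HashCalc is inferred from the static constructor above, and the lowTolerance handling follows the question's code):

```csharp
using System;

static class HashCalc
{
    // Precomputed powers of 31; index 0..30 covers strings of length up to 30.
    private static readonly ulong[] Pow31 = new ulong[31];

    static HashCalc()
    {
        Pow31[0] = 1;
        for (int i = 1; i != Pow31.Length; i++)
            Pow31[i] = 31 * Pow31[i - 1];
    }

    public static ulong CalculateHash(string read, bool lowTolerance)
    {
        ulong hashedValue = 0;
        // lowTolerance skips every other character, as in the question.
        for (int i = 0; i < read.Length; i += lowTolerance ? 2 : 1)
            hashedValue += read[i] * Pow31[i];
        return hashedValue;
    }
}
```

Indexing with read[i] instead of ElementAt also drops the LINQ overhead mentioned in the other answer.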

2 Comments

I wouldn't be so sure that a table lookup is faster than an integer multiplication.
@CodeInChaos It's certainly faster than Math.Pow(31, i). Also I'd need an additional multiplication when i goes up by 2 inside a condition, so I'd try the lookup first.
