226

I've implemented a BloomFilter in Python 3.3 and got different results in every session. Drilling down into this weird behavior led me to the built-in hash() function: it returns a different hash value for the same string in every session.

Example:

>>> hash("235")
-310569535015251310

----- opening a new python console -----

>>> hash("235")
-1900164331622581997

Why is this happening? Why is this useful?


4 Answers

262

Python uses a random hash seed to prevent attackers from tar-pitting your application by sending you keys designed to collide. See the original vulnerability disclosure. By offsetting the hash with a random seed (set once at startup) attackers can no longer predict what keys will collide.

You can set a fixed seed or disable the feature by setting the PYTHONHASHSEED environment variable; the default is random but you can set it to a fixed positive integer value, with 0 disabling the feature altogether.
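A minimal sketch of that effect, assuming a CPython 3.7+ interpreter is reachable as sys.executable. The helper name hash_in_fresh_process is made up for this illustration. Each call starts a new interpreter, so it picks up whatever PYTHONHASHSEED value is passed in:

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, seed):
    """Return hash(text) as computed by a new interpreter with the given PYTHONHASHSEED."""
    # Copy the current environment and override only the seed variable.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Same fixed seed: both fresh interpreters agree.
    print(hash_in_fresh_process("235", 42) == hash_in_fresh_process("235", 42))
    # "random" restores the default per-process salt, so the two values usually differ.
    print(hash_in_fresh_process("235", "random"), hash_in_fresh_process("235", "random"))
```

Setting the variable in the shell before starting Python has the same effect; the subprocess call is only there so that both cases can be compared from one script.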

Python versions 2.7 and 3.2 have the feature disabled by default (use the -R switch or set PYTHONHASHSEED=random to enable it); it is enabled by default in Python 3.3 and up.

If you were relying on the order of keys in a Python set, then don't. Python uses a hash table to implement these types and their order depends on the insertion and deletion history as well as the random hash seed. Note that in Python 3.5 and older, this applies to dictionaries, too.
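If you need a reproducible order, one simple option is to sort the elements before iterating. A small sketch, with a made-up set named tags:

```python
tags = {"gamma", "alpha", "beta"}

# Iteration order of a set of strings may change from one session to the next.
print(list(tags))

# Sorting gives an order that depends only on the contents, not on the hash seed.
stable_order = sorted(tags)
print(stable_order)
```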

Also see the object.__hash__() special method documentation:

Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

If you need a stable hash implementation, you probably want to look at the hashlib module; this implements cryptographic hash functions. The pybloom project uses this approach.
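As a rough sketch of how that could look for a Bloom filter (this is not pybloom's actual code; the function name bloom_indexes and the counter-prefix scheme are assumptions made for illustration), several bit positions can be derived from SHA-256 digests:

```python
import hashlib


def bloom_indexes(item, num_hashes, num_bits):
    """Yield num_hashes bit positions for item, identical in every Python session."""
    data = item.encode("utf-8")
    for i in range(num_hashes):
        # Prefix a 4-byte counter so each round hashes a different input.
        digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
        # Use the first 8 bytes of the digest as an integer, reduced to the filter size.
        yield int.from_bytes(digest[:8], "big") % num_bits


if __name__ == "__main__":
    print(list(bloom_indexes("235", num_hashes=3, num_bits=1024)))
```

Because the positions depend only on the input bytes, a filter built in one session can be saved and queried in another.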

Since the offset consists of a prefix and a suffix (start value and final XORed value, respectively) you cannot just store the offset, unfortunately. On the plus side, this does mean that attackers cannot easily determine the offset with timing attacks either.


9 Comments

Why don't they simply use a cryptographic hash, doesn't that mostly eliminate concerns with hash collision?
I think the point is that if an attacker did manage to compute inputs that produce the same hash value, then with the random seed those inputs no longer collide, so knowing such inputs no longer lets the attacker cause collisions. In any hash function the output range is much smaller than the input domain, isn't it? So collisions can always exist.
@profPlum because a cryptographic hash has a very different use case and actually would not prevent this attack. A cryptographic hash is useful when you want to make it unlikely that the original input is recovered, but you want to guarantee that the result is always the same. Here we don’t want the value to be the same in different processes so an attacker can’t force collisions to happen.
Yea I get the idea of the fix. But it seems like the only niche use cases that I had for hash() are now impossible because this “solution” made the feature almost useless (e.g. saving hash values to files). Except when it is implicitly used for dictionaries.
@matanox: yes, collisions always exist and are not usually a problem. It is only a problem when there are a very large number of collisions.
17

This behavior of hash() tripped me up when trying to compare records saved in a database between sessions.

The PYTHONHASHSEED solution was too complicated because I needed my program to work reliably, independent of environment variable settings.

So I created my own simple hash function that hashes strings (it's easy to convert anything to a string) and produces a 32-bit positive integer as the hash. It's not a cryptographically safe hash, but it's good enough for quick comparisons.

def myHash(text: str) -> int:
    # The accumulator is named h so it does not shadow the built-in hash().
    h = 0
    for ch in text:
        # Mix in each character, then mask to keep the value within 32 bits.
        h = (h * 281 ^ ord(ch) * 997) & 0xFFFFFFFF
    return h

The numbers in the multiplications are just arbitrarily chosen prime numbers in order to mix up the bits.

If you want the hash to be a hex string, you can replace the last line with:

return format(h, "08X")
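If a different algorithm is acceptable, the standard library's zlib.crc32 also gives a stable unsigned 32-bit value and runs in C. The sketch below swaps in CRC-32, so it does not produce the same numbers as the function above. The name stable_hash32 is made up for this example:

```python
import zlib


def stable_hash32(text):
    """Return an unsigned 32-bit checksum of text that is the same in every session."""
    # Encode explicitly so the result does not depend on any locale setting.
    return zlib.crc32(text.encode("utf-8"))


if __name__ == "__main__":
    print(stable_hash32("235"))
    print(format(stable_hash32("235"), "08X"))  # zero-padded uppercase hex string
```

Like the hand-written version, this is only suitable for quick comparisons, not for anything security-sensitive.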

2 Comments

How does its speed compare to the built-in hash()?
It must be much slower because it's pure Python code, which can't compete with the C-based code in the standard library. Test it for your use case and see whether it is usable or not.
16

Hash randomisation is turned on by default in Python 3.3 and later. This is a security feature:

Hash randomization is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict construction

In earlier versions, starting with 2.6.8, you could switch it on with the -R command-line option or the PYTHONHASHSEED environment variable.

You can switch it off by setting PYTHONHASHSEED to zero.

12

Just in case you want deterministic values from a hash function, you can use hash functions from hashlib:

import hashlib

hash_obj = hashlib.sha256(b"hello")
hex_hash = hash_obj.hexdigest()
print(hex_hash)
# Always prints: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

There are various kinds of hash functions available in the module; see the hashlib documentation for more.
