
Is there a message digest algorithm whose digests still make sense after you apply set operations to them? In other words, is there a hash function that does NOT break the concept of a "set", so that a set operation gives the same result whether you apply it before or after hashing?

I'm looking for a hash function that:

  1. hashes a set of data into a fixed-length (or bounded-length) string
  2. produces an identical hash whenever the input data set is the same
  3. commutes with subset selection: if you select a subset of your raw data, hashing that subset directly is equivalent to applying the same subset operation to the hash of the original data set, i.e. you get the same subset hash both ways.
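For context, requirements 1 and 2 on their own are easy to meet with an ordinary digest over a canonical ordering of the set; requirement 3 is the hard part. The following is only a sketch under assumptions: the name `canonical_set_hash`, the choice of SHA-256, and the restriction to string elements are all illustrative and not part of any particular scheme.

```python
import hashlib

def canonical_set_hash(items):
    """Hash a set of strings to a fixed-length hex digest (hypothetical helper).

    Sorting makes the result independent of insertion order, and the
    length prefix keeps element boundaries unambiguous.
    """
    h = hashlib.sha256()
    for item in sorted(set(items)):
        encoded = item.encode("utf-8")
        h.update(len(encoded).to_bytes(8, "big"))  # length prefix
        h.update(encoded)
    return h.hexdigest()

a = {"p1", "p2", "p3", "p4"}
b = {"p2", "p4"}  # a subset of a

hash_a = canonical_set_hash(a)
hash_b = canonical_set_hash(b)
# Requirements 1 and 2 hold, but there is no known way to derive
# hash_b from hash_a alone, which is what requirement 3 asks for.
```

Here `hash_a` and `hash_b` are both 64 hex characters, and the same set always yields the same digest regardless of the order in which its elements are supplied.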

As an example, in the following picture set A has several data points (red diamonds), and B is a subset of A. Is there a hash function such that:

data in A ---- hash function ----> _hashA ---- set operation ----> _hashB

data in B ---- hash function ----> _hashB

[Image: set A containing several data points (red diamonds), with B shown as a subset of A]

3 Answers

1

This looks a bit like http://en.wikipedia.org/wiki/Homomorphic_encryption and a bit like database privacy schemes like http://en.wikipedia.org/wiki/Differential_privacy - at least to me.

In both cases developers have run into problems: once users were allowed a few operations, they found clever ways to use those operations as building blocks to do anything they wanted, so the system ended up with no security at all.

In your case I think you want AndHash(hash(a), hash(b)) = hash(a and b), where "a and b" is the intersection of the two sets. This means that if hash(a) != hash(null set), then I can find out whether a is a member of any set from the hash value of that set alone: I compute AndHash(hash(a), hash(S)) and check whether the result equals hash(null set). If I repeat this for many candidates, I can work out many of the members of a hashed set given its hash value, which means that the hash value must be pretty much as big as the set, as it contains all the information in it.
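To make the argument concrete, here is a toy sketch of the one obvious "hash" for which AndHash works exactly: a characteristic bit vector with one bit per possible element. The universe, the names `set_hash`, `and_hash`, and `recover_members` are all made up for illustration. It shows both halves of the point: the membership test really does recover the set, and the digest needs as many bits as there are possible elements.

```python
# Hypothetical, tiny universe of possible elements (an assumption for the demo).
UNIVERSE = ["apple", "banana", "cherry", "date", "elderberry"]
INDEX = {item: i for i, item in enumerate(UNIVERSE)}

def set_hash(items):
    """Characteristic bit vector: bit i is set when UNIVERSE[i] is in the set."""
    h = 0
    for item in items:
        h |= 1 << INDEX[item]
    return h

def and_hash(h1, h2):
    """Intersection on digests: and_hash(hash(a), hash(b)) == hash(a & b)."""
    return h1 & h2

def recover_members(digest):
    """Probe every candidate element against the digest alone."""
    empty = set_hash([])
    return {x for x in UNIVERSE if and_hash(set_hash([x]), digest) != empty}

secret = {"banana", "date"}
digest = set_hash(secret)

# The homomorphic property holds exactly...
assert and_hash(set_hash({"banana", "cherry"}), digest) == set_hash({"banana"})
# ...and so the digest leaks the whole set to anyone who can probe it.
assert recover_members(digest) == secret
```

Any scheme that keeps the digest much shorter than this has to give up either exactness of the set operation or the ability to tell elements apart.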

Depending on what you want this for, it might be worth looking at http://en.wikipedia.org/wiki/Minhash.
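As a rough illustration of what MinHash gives you, here is a minimal sketch. It assumes string elements, derives its hash family by salting SHA-256 with a seed, and uses made-up names (`minhash`, `union_signature`, `estimate_jaccard`) and an arbitrary signature length. It does not provide the subset property asked for in the question; what it does provide is a fixed-length signature where the signature of a union can be computed from the two signatures alone, plus an estimate of how similar two sets are.

```python
import hashlib

NUM_HASHES = 64  # signature length; an arbitrary choice for this sketch

def _seeded_hash(seed, item):
    """One member of a salted hash family built from SHA-256 (an assumption)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash(items, num_hashes=NUM_HASHES):
    """Fixed-length signature of a non-empty set: the minimum under each hash."""
    items = list(items)
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def union_signature(sig_a, sig_b):
    """Signature of the union, computed from the two signatures alone."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates |A & B| / |A | B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"a", "b", "c"}
B = {"c", "d"}

# Union commutes with hashing exactly:
assert union_signature(minhash(A), minhash(B)) == minhash(A | B)
# Similarity is only an estimate (the true Jaccard value here is 1/4):
similarity = estimate_jaccard(minhash(A), minhash(B))
```

So MinHash is a set-aware digest for union and similarity, which may or may not be enough depending on what the subset operation is needed for.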


0

AFAIK, no. Hash functions generally (and I've seen many) operate on a single chunk of data without any regard whatsoever for what that data may actually represent, the primary concern being to reduce the probability of collisions. That said, it's certainly possible to come up with something like what you're wanting to do, but I imagine it would be exceedingly difficult, and the result would most likely be suboptimal in terms of collision avoidance.
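One example of such a trade-off, offered only as a sketch: XOR together the SHA-256 hashes of the individual elements. The names `element_hash` and `xor_set_hash` are made up, and string elements are assumed. The result is fixed-length and order-independent, and XOR on digests corresponds to symmetric difference on sets. It still does not let you select a subset from the digest alone (you need the digest of the elements being removed), and XOR-combined hashes are generally considered weak against deliberate collision-finding.

```python
import hashlib

def element_hash(item):
    """SHA-256 of a single string element, as an integer."""
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")

def xor_set_hash(items):
    """Order-independent set digest: XOR of the element hashes (empty set -> 0)."""
    h = 0
    for item in set(items):  # set() so duplicates do not cancel out
        h ^= element_hash(item)
    return h

A = {"a", "b", "c", "d"}
B = {"b", "d"}  # subset of A
C = {"c", "x"}

# XOR of digests equals the digest of the symmetric difference:
assert xor_set_hash(A) ^ xor_set_hash(C) == xor_set_hash(A ^ C)
# For a subset B of A, removing A - B from the digest of A yields the digest of B,
# but only if the digest of the removed elements is available:
assert xor_set_hash(A) ^ xor_set_hash(A - B) == xor_set_hash(B)
```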


0

The short answer is no, there isn't such an algorithm. What you might try is encrypting your data and then decrypting it when you need to apply your set function, then encrypting it again. Hashing algorithms, however, are by their very nature one way and involve the loss of data. There's a good explanation of the difference between hash and encryption algorithms here: Fundamental difference between Hashing and Encryption algorithms
