
I would like to compute a simple hash of n columns, e.g. address and name, and store that hash in another column. I want to use the hash as a quick way of synchronising data between 2 sources, by simply comparing the hash values. What is the best way of doing something like this? I don't need crypto functionality, just a way to create the hash.

e.g.

John Smith, 1 Long Lane Road, Village, Town, Postcode. Hash: AK67KS38

I would like the hash value to be simple enough to be readable (not drawn from the whole Unicode range).

Edit: OK, I would still like to do this in C# code and LINQ. The data is brought in externally by interrogating other sources. There is no other way to link it uniquely with my data, as the external source provides no 'key' to link on. So in this sense, a timestamp value would not be an option. I understand this method is not 'exact' but I can live with that. I might add the ability to manually review the possible hash matches and promote them further into the database if they are acceptable.

  • What Database? Do you want to do the hashing in the DB or in your code? Commented Feb 17, 2011 at 11:01
  • Hash is never guaranteed to be unique so syncing only by a (short) hash is dangerous. And you speak of columns but you tagged your question with c# and .Net. Are you working in a database or do you mean the fields/properties of an object? Commented Feb 17, 2011 at 11:04
  • @SemVanmeenen: What about a CRC check? I used to use this in a C program to check validity of files. Commented Feb 17, 2011 at 11:43
  • @Jon Just read the edit of your question. Hashing is usually used to speed up the process: first compare the hashes; if those are equal, compare the values the hash is based on; and only if those are equal too do you sync. If you really want to do it on a hash basis alone, you'll have to trade off the length of the hash against the chance of equal hashes. Wikipedia has a list of hash functions. Generally speaking, the longer the hash, the better. I wouldn't recommend it, however; I wouldn't feel safe implementing it that way myself. Commented Feb 17, 2011 at 12:24
  • The chances that multiple instances of the putatively "same" address will be slightly different from one another are very high, in my experience, at least with U.S. addresses. "Road" and "Avenue" can be abbreviated "Rd" or "Ave" or spelled out. "Apt 3B" might be "Apt. #3B". Etc. Joining on addresses (or on a hashed representation of address) is notoriously difficult. There are "address sanitation" measures that one can take to regularize the address format by reducing variants. A hash on sanitized addresses would work much more reliably than a hash on raw addresses. Commented Feb 17, 2011 at 13:12
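
The address clean-up suggested in the last comment could be sketched roughly as below. This is only an illustration: the AddressNormalizer class and its tiny abbreviation table are hypothetical, and a real sanitiser would need a far longer list and locale-specific rules. The idea is simply to lower-case the text, drop punctuation, and expand known abbreviations before any hash is computed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressNormalizer
{
    // Hypothetical abbreviation table; real data would need many more entries.
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>
        {
            { "rd", "road" },
            { "ave", "avenue" },
            { "apt", "apartment" }
        };

    public static string Normalize(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return string.Empty;

        // Lower-case and replace anything that is not a letter or digit with a space.
        string cleaned = new string(raw.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
            .ToArray());

        // Split on spaces, expand abbreviations, and re-join with single spaces.
        IEnumerable<string> tokens = cleaned
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Expand);

        return string.Join(" ", tokens);
    }

    private static string Expand(string token)
    {
        string full;
        return Abbreviations.TryGetValue(token, out full) ? full : token;
    }
}
```

With this, "Apt. #3B, 1 Long Lane Rd" and "Apt 3B 1 Long Lane Road" normalise to the same string, so they would also produce the same hash.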

2 Answers


Jon - an alternative approach would be to add a timestamp column to the table (or a property on the object), set it to UTC and be done with it. Of course, you could argue about concurrency on edits etc., but you'd have a far harder job determining the latest edit/diff if you were only comparing the hash column values.

However, if the hash column approach is a 'must', then be very careful when using it across database installations with different cultures, as the mechanism can vary. I'd carefully weigh the merits of the timestamp vs the hash in this case.

[edit] Jon, based on your recent edit, I'd suggest creating a custom GetHashCode() and Equals() on your object and letting the comparer do the grunt work for you. A reasonable (2 minute Google) starting point might be here:

http://www.eggheadcafe.com/community/aspnet/2/78458/hashcode.aspx

And here is a quick code example (based on your requirement of name and address being used for the hash [thank you ReSharper :-)]):

public class ContactDetails
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Village { get; set; }
    public string Town { get; set; }
    public string PostCode { get; set; }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != typeof (ContactDetails)) return false;
        return Equals((ContactDetails) obj);
    }

    public bool Equals(ContactDetails other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return Equals(other.Name, Name) && Equals(other.Address, Address);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            return ((Name != null ? Name.GetHashCode() : 0)*397) 
                ^ (Address != null ? Address.GetHashCode() : 0);
        }
    }
}

Typical usage:

// Equals() returns true when name and address match, so negate it to detect a change
bool isChanged = !ContactFromServer1.Equals(ContactFromServer2);
//etc..
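
One caveat if the value is going to be stored in a column: string.GetHashCode() is not guaranteed to return the same number across .NET versions, platforms, or even separate runs of a program, so it is best kept for in-memory comparisons like the one above. For a persisted, readable value, a hand-rolled hash is one option. The following is a minimal sketch using 32-bit FNV-1a (my choice for illustration, not something from the question); the StableHash class name and the field separator are assumptions.

```csharp
using System;
using System.Linq;
using System.Text;

public static class StableHash
{
    // 32-bit FNV-1a over the UTF-8 bytes of the joined fields,
    // returned as 8 upper-case hex characters (0-9, A-F).
    public static string Compute(params string[] fields)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        // Join with a control character (an assumed separator) so that
        // ("ab", "c") and ("a", "bc") do not produce the same input.
        string combined = string.Join("\u001F", fields.Select(f => f ?? string.Empty));

        uint hash = offsetBasis;
        unchecked
        {
            foreach (byte b in Encoding.UTF8.GetBytes(combined))
            {
                hash ^= b;      // FNV-1a: XOR the byte in first...
                hash *= prime;  // ...then multiply by the FNV prime
            }
        }
        return hash.ToString("X8");
    }
}
```

Usage would be along the lines of StableHash.Compute(contact.Name, contact.Address). At only 32 bits, collisions are to be expected on larger data sets, so a hash match is a candidate for review rather than proof of equality.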

hope this helps


1 Comment

Jon, ok, that edit does help :). have added to my answer above.

Assuming your columns are copied to a User instance (user):

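// Requires System.IO, System.Runtime.Serialization, System.Security.Cryptography and System.Xml
DataContractSerializer serializer = new DataContractSerializer(typeof(User));
byte[] serializedData;
using (MemoryStream memoryStream = new MemoryStream())
using (XmlDictionaryWriter writer = XmlDictionaryWriter.CreateBinaryWriter(memoryStream))
{
    // Serialize through the binary XML writer
    serializer.WriteObject(writer, user);
    writer.Flush(); // push buffered bytes into the stream before reading it
    serializedData = memoryStream.ToArray();
}

// Calculate the serialized data's hash value
byte[] hash;
using (SHA1CryptoServiceProvider sha = new SHA1CryptoServiceProvider())
{
    hash = sha.ComputeHash(serializedData);
}

// Convert the hash to a base 64 string
string hashAsText = Convert.ToBase64String(hash);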

Note that I've converted the hash to a base 64 string. You could use MD5 instead, since the hash is only being used as a checksum.
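
Once every record carries a value like hashAsText, the comparison between the two sources could be done in LINQ. The sketch below is only an illustration under assumptions: the Contact class, its Hash property and the ContactSync helper are hypothetical names. It joins on the hash first and then confirms on the underlying fields, as suggested in the comments on the question, so that a hash collision is never treated as a match on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; Hash holds the precomputed hash text.
public class Contact
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Hash { get; set; }
}

public static class ContactSync
{
    // Pairs up records whose hashes match AND whose fields are really equal.
    public static List<Tuple<Contact, Contact>> FindMatches(
        IEnumerable<Contact> local, IEnumerable<Contact> external)
    {
        return (from l in local
                join e in external on l.Hash equals e.Hash   // cheap first pass
                where l.Name == e.Name && l.Address == e.Address // confirm values
                select Tuple.Create(l, e)).ToList();
    }

    // External records whose hash is not present locally (candidates to import).
    public static List<Contact> FindNew(
        IEnumerable<Contact> local, IEnumerable<Contact> external)
    {
        var knownHashes = new HashSet<string>(local.Select(l => l.Hash));
        return external.Where(e => !knownHashes.Contains(e.Hash)).ToList();
    }
}
```

Pairs that share a hash but fail the field check are the ones worth routing to a manual review step.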
