We are trying to implement the following case. We have a invoice table and there is a column which has email address. We want to somehow generate a unique int value from this email address and store that in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the email string should always output the same int so each email address as a unique int representation). We can use a bigint as well
-
7This has been asked many times before, and the short answer is it's impossible to take an infinite (or relatively infinite) domain (string/varchar) and map it 1:1 with a finite domain (int, bigint). You need to make compromises on uniqueness or the output data type. My suggestion is that you just index on the e-mail address itself.Mark Peters– Mark Peters2011-09-27 19:11:42 +00:00Commented Sep 27, 2011 at 19:11
-
1I don't see why that would be so, @MarkPeters. Any string they get will be encoded in a finite number of bytes. Just interpret the same bytes as a bigint, and voila, you have a number.Tom Zych– Tom Zych2011-09-27 19:17:05 +00:00Commented Sep 27, 2011 at 19:17
-
@Tom: I'm not sure of the exact semantics of bigint. In Java, BigInteger has variable bit length so that would work (I already have a deleted answer along exactly those lines). My impression is that bigint in the context of SQL is still bounded (64 bit) and so it represents a finite domain.Mark Peters– Mark Peters2011-09-27 19:19:46 +00:00Commented Sep 27, 2011 at 19:19
-
I think the point @MarkPeters is making is that the number of possible email addresses is always going to be > the number of integers of a given length. For example, if using a 64-bit integer, there are 2^64 possible values. There are an infinite number of possible email addresses, which is > 2^64.daiscog– daiscog2011-09-27 19:21:26 +00:00Commented Sep 27, 2011 at 19:21
-
1@TomZych If you've got an integer type with defined precision (e.g. 32bit), it's trivial to generate 2**32 + 1 different E-Mail adresses, and there goes your uniqueness.Frank Schmitt– Frank Schmitt2011-09-27 19:22:18 +00:00Commented Sep 27, 2011 at 19:22
4 Answers
Simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identify field (a standard int), and you don't run into any issues with potential hash collisions, and no hashing overhead.
1 Comment
It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.
Some things to keep in mind:
convert everything to lower/uppercase before computing the hashcode (so [email protected] gets the same hashcode as [email protected])
apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the E-Mail adress; invoice should then contain only a foreign key customer_fk
4 Comments
MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get near guaranteed uniqueness with with just 32-bits.)
9 Comments
I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.
So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.
Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.
Sorry to be of so little help.