String to unique int algorithm

Question

We are trying to implement the following case. We have an invoice table with a column that holds an email address. We want to somehow generate a unique int value from this email address and store it in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the same email string should always produce the same int, so that each email address has a unique int representation). We can use a bigint as well.

This has been asked many times before, and the short answer is it's impossible to take an infinite (or relatively infinite) domain (string/varchar) and map it 1:1 with a finite domain (int, bigint). You need to make compromises on uniqueness or the output data type. My suggestion is that you just index on the e-mail address itself.
– Mark Peters, Commented Sep 27, 2011 at 19:11
I don't see why that would be so, @MarkPeters. Any string they get will be encoded in a finite number of bytes. Just interpret the same bytes as a bigint, and voila, you have a number.
– Tom Zych, Commented Sep 27, 2011 at 19:17
@Tom: I'm not sure of the exact semantics of bigint. In Java, BigInteger has variable bit length so that would work (I already have a deleted answer along exactly those lines). My impression is that bigint in the context of SQL is still bounded (64 bit) and so it represents a finite domain.
– Mark Peters, Commented Sep 27, 2011 at 19:19
I think the point @MarkPeters is making is that the number of possible email addresses is always going to be greater than the number of integers of a given length. For example, a 64-bit integer has 2^64 possible values. There are an infinite number of possible email addresses, which is more than 2^64.
– daiscog, Commented Sep 27, 2011 at 19:21
@TomZych If you've got an integer type with defined precision (e.g. 32 bit), it's trivial to generate 2**32 + 1 different email addresses, and there goes your uniqueness.
– Frank Schmitt, Commented Sep 27, 2011 at 19:22
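
The exchange above between Tom Zych and Mark Peters can be made concrete with a short sketch. It assumes Python, whose `int` is arbitrary-precision (like Java's `BigInteger`); the function names are hypothetical. Interpreting the bytes of an address as a number is lossless, but the result almost never fits in a 64-bit SQL `bigint`:

```python
def email_to_int(email: str) -> int:
    """Interpret the UTF-8 bytes of the address as one big-endian unsigned integer."""
    return int.from_bytes(email.encode("utf-8"), byteorder="big")


def int_to_email(value: int) -> str:
    """Inverse of email_to_int (assumes the address does not start with a NUL byte)."""
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, byteorder="big").decode("utf-8")


n = email_to_int("alice@example.com")

# The mapping is reversible, so it is unique by construction...
print(int_to_email(n))

# ...but any address longer than 8 bytes needs more than 64 bits.
print(n.bit_length(), n.bit_length() <= 64)
```

So the mapping is only 1:1 when the integer type is unbounded; with a fixed-width column the pigeonhole argument from the comments applies.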

Marc B · Accepted Answer · 2011-09-27 19:38:03Z

5

The simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identity field (a standard int), and you don't run into any issues with potential hash collisions, and there is no hashing overhead.
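
A minimal sketch of this lookup-table approach, using Python's built-in `sqlite3` module purely for illustration. The table names, column names and the `email_id` helper are hypothetical, and lower-casing the address before lookup is an assumption that is not part of this answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE email (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- the "unique int"
        address TEXT NOT NULL UNIQUE
    );
    CREATE TABLE invoice (
        id       INTEGER PRIMARY KEY,
        email_id INTEGER NOT NULL REFERENCES email(id)
    );
""")


def email_id(conn: sqlite3.Connection, address: str) -> int:
    """Return the id for an address, inserting the address on first sight."""
    address = address.strip().lower()  # assumption: addresses compare case-insensitively
    conn.execute("INSERT OR IGNORE INTO email (address) VALUES (?)", (address,))
    row = conn.execute(
        "SELECT id FROM email WHERE address = ?", (address,)
    ).fetchone()
    return row[0]


alice = email_id(conn, "alice@example.com")
bob = email_id(conn, "bob@example.com")

# The invoice row carries only the small integer key.
conn.execute("INSERT INTO invoice (email_id) VALUES (?)", (alice,))
```

The same address always maps to the same id because the database stores the mapping, rather than because of any property of the string.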

edited Sep 27, 2011 at 19:38

answered Sep 27, 2011 at 19:21

Marc B

1 Comment

woliveirajr Over a year ago

+1 Best answer, I think. Why complicate things when the OP hasn't stated any constraints?

Frank Schmitt · Accepted Answer · 2011-09-27 19:12:22Z

1

It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.

Some things to keep in mind:

Convert everything to lower/uppercase before computing the hashcode (so [email protected] gets the same hashcode as [email protected]).
Apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the email address; invoice should then contain only a foreign key customer_fk.
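
As a rough illustration of the first point, here is a sketch that normalizes case before hashing, using Python's standard `hashlib`. The function name and the choice of SHA-1 over MD5 are assumptions made for the example; a database-side equivalent would depend on the RDBMS:

```python
import hashlib


def email_hash(email: str) -> str:
    """Hex SHA-1 digest of the case-normalized address."""
    normalized = email.strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Both spellings normalize to the same string, so they hash identically.
print(email_hash("Alice@Example.com") == email_hash("alice@example.com"))
```

The digest is 160 bits (40 hex characters), so it would need a character or binary column rather than an int or bigint.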

answered Sep 27, 2011 at 19:12

Frank Schmitt

4 Comments

Frank Schmitt Over a year ago

@Raze2dust Yes, of course. But the probability for that is very, very small (esp. if using SHA1).

Michael McGowan Over a year ago

@FrankSchmitt Some people care about the difference between very, very small probability and impossible...and some don't.

Blastfurnace Over a year ago

@Frank Schmitt: Since the question mentions invoices, the people that care are customers, accountants, lawyers, etc. Potential collisions should be enough to disqualify any hashing scheme.

LiKao Over a year ago

@Blastfurnace: MD5 and SHA1 collisions are an accepted tradeoff for HTTPS certificates, of which there are far more than the customer email addresses a single company uses. Usually, unless somebody is attacking the system and deliberately aiming for a collision, this will not happen. Also note that the probability is influenced by the grammar for email addresses, as most collisions will appear for strings outside of the grammar or for very long (and hence unusable) addresses.

daiscog · Accepted Answer · 2011-09-27 19:08:15Z

0

MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get near-guaranteed uniqueness with just 32 bits.)
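
To put rough numbers on the 32-bit versus 128-bit remark, the sketch below converts an MD5 digest to an integer and estimates collision odds with the standard birthday approximation. It assumes hash outputs are uniformly distributed; the function names and the figure of 100,000 addresses are hypothetical:

```python
import hashlib
import math


def md5_as_int(email: str) -> int:
    """Interpret the 16-byte MD5 digest as a 128-bit unsigned integer."""
    digest = hashlib.md5(email.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = n_items * (n_items - 1) / (2 * 2**bits)
    return -math.expm1(-exponent)


value = md5_as_int("alice@example.com")
print(value < 2**128)  # always within 128 bits

# Chance that at least two of 100,000 distinct addresses share a hash value.
for bits in (32, 64, 128):
    print(bits, collision_probability(100_000, bits))
```

Under these assumptions a 32-bit hash is more likely than not to collide at that volume, while 64 and 128 bits make an accidental collision very unlikely but never impossible.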

answered Sep 27, 2011 at 19:08

daiscog

9 Comments

Michael McGowan Over a year ago

OP said he wanted a unique integer; MD5 does not guarantee uniqueness.

Frank Schmitt Over a year ago

You cannot guarantee unique integers (unless using integers of arbitrary length), since there's an infinite number of possible email addresses.

Michael McGowan Over a year ago

@FrankSchmitt There's a finite number of email addresses that will actually be used, so we could just index all of them as we see them.

ypercubeᵀᴹ Over a year ago

@Michael: daiscog does not state that MD5 guarantees uniqueness. The answer states that it gets you near-guaranteed uniqueness.

Tony Over a year ago

Depending on inputs vs outputs, NOTHING "guarantees" uniqueness.

Community · Accepted Answer · 2021-10-07 06:31:44Z

0

I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.

So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.
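
A sketch of what such 6-bit packing could look like. The alphabet below is an assumption made for illustration (lower-cased letters, digits, and the unquoted special characters from RFC 3696, plus `.` and `@`); it ignores quoted local parts and case-sensitive local parts, and all names are hypothetical:

```python
# Assumed alphabet: 57 symbols, so every symbol fits in 6 bits.
ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!#$%&'*+-/=?^_`{|}~"
    ".@"
)
# Codes start at 1 so that 0 never encodes a character.
CODE = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def pack(email: str) -> int:
    """Pack the address into one integer, 6 bits per character."""
    n = 0
    for ch in email.lower():
        n = (n << 6) | CODE[ch]  # KeyError if ch is outside the assumed alphabet
    return n


def unpack(n: int) -> str:
    """Inverse of pack."""
    chars = []
    while n:
        chars.append(ALPHABET[(n & 0x3F) - 1])
        n >>= 6
    return "".join(reversed(chars))


def to_words(n: int) -> list:
    """Split the packed integer into 64-bit words, least significant first."""
    words = []
    while n:
        words.append(n & (2**64 - 1))
        n >>= 64
    return words


packed = pack("alice@example.com")
print(unpack(packed))
print(len(to_words(pack("a" * 254))))  # number of 64-bit words for a 254-character address
```

The packing is reversible and therefore collision-free, but a maximum-length address still needs a couple of dozen 64-bit words, so it does not yield a single int or bigint key.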

Take a look at RFC 3696, which deals with email addresses in a way that's actually comprehensible.

Sorry to be of so little help.

edited Oct 7, 2021 at 6:31

answered Sep 27, 2011 at 19:23

Pete Wilson
