
I need to store thousands (and possibly soon hundreds of thousands, maybe even millions) of unique random strings of 12 characters in a database. Every time I generate a new code (actually done in batches of 10,000 or more) I need to compare it against the existing database to ensure there will be no duplicates -- and when a code is "redeemed" by a user, I need to ensure it exists.

Both of these tasks are likely to be very slow, so I wish to make them as streamlined as possible. To begin with I've ensured that the strings are stored in BINARY format on the DB with an INDEX on them. This is apparently quicker than CHAR, VARCHAR and VARBINARY.

I was thinking of trying to do further improvements, and I came up with this simple idea: Storing the first character as a TINYINT in an indexed column, and comparing that first -- thus hopefully finding matching records faster.

For example:

public function getFirstCharAsNum($code) {
    $firstChar = substr($code, 0, 1);  // just the first byte; substr($code, 0) would return the whole string
    $firstCharHex = bin2hex($firstChar);
    $prefix = hexdec($firstCharHex);   // integer 0-255, fits a TINYINT UNSIGNED
    return $prefix;
}

public function isDuplicate($generatedCode) {

    $params["code"] = $generatedCode;
    $params["prefix"] = $this->getFirstCharAsNum($generatedCode); // method call, not a bare function

    $STH = $this->_db->prepare("SELECT COUNT(*) FROM codes
        WHERE prefix = :prefix AND code = :code;");

    try {
        $STH->execute($params);
        $count = $STH->fetch(PDO::FETCH_COLUMN);
    } catch (PDOException $e) {
        throw new Exception($e->getMessage());
    }

    return (bool) $count;
}

The idea is that the second part of the AND operation will only be attempted on rows whose prefix matches, and searching through TINYINTs should be a lot quicker than searching a whole BINARY(12) column.

Is this actually quicker? Or is adding an additional lookup going to slow me down?

Thanks.

5 Comments

  • "Storing the first character as a TINYINT and comparing that first, thus hopefully finding matching records faster" -- MySQL will do the job (that's why you created an index). Commented Nov 16, 2013 at 14:11
  • You are overthinking this... Commented Nov 16, 2013 at 14:41
  • @hek2mgl But wouldn't placing an index on the TINYINT column make it even faster? :) Commented Nov 19, 2013 at 12:22
  • @TheMonk I would use an index, and I guess it will be faster (I said that). Have you tried benchmarking with and without the index? Commented Nov 19, 2013 at 12:25
  • That TINYINT index is naive (imo). I don't expect such tricks to lead to better performance, as MySQL is highly optimized. Try it -- I like it when someone shows me that I'm wrong. :) Commented Nov 19, 2013 at 12:38

2 Answers


I need to store 1000s (and possibly soon 100,000s maybe even millions) of unique random strings of 12 characters in a database

If they are truly random, the chance of a collision is {number of actual records}/{number of possible records}.

Even if the character set you are choosing from contains only digits, then with 10 million existing records the probability of a collision is 10,000,000/1,000,000,000,000 = 1/100,000 -- hence what you are describing is really a waste of time. Add a unique index on the values in the database; if you get a unique-constraint violation trying to add a new value, regenerate the value.

(With a 36-character repertoire, the probability of a collision is approx 1/473,838,000,000.)
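In code, that insert-and-retry approach might look like the following -- a minimal sketch, assuming PDO/MySQL in ERRMODE_EXCEPTION and a codes table with a UNIQUE index on its code column; the function name, the $generate callback and the retry limit are illustrative, not from the question:

// Sketch only: insert and regenerate on duplicate-key errors, assuming a
// UNIQUE index on codes.code; names and parameters are illustrative.
function insertUniqueCode(PDO $db, callable $generate, $maxRetries = 5) {
    $stmt = $db->prepare("INSERT INTO codes (code) VALUES (:code)");
    for ($i = 0; $i < $maxRetries; $i++) {
        $code = $generate();
        try {
            $stmt->execute(["code" => $code]);
            return $code;                    // insert succeeded, so the code is unique
        } catch (PDOException $e) {
            if ($e->getCode() != "23000") {  // SQLSTATE 23000 = integrity constraint violation
                throw $e;                    // anything else is a real error
            }
            // Duplicate key: fall through and regenerate. With 36^12 possible
            // codes this branch should be astronomically rare.
        }
    }
    throw new RuntimeException("No unique code generated after $maxRetries attempts");
}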


2 Comments

No -- you've not understood the birthday paradox. If you select 2 children at random, the probability of them having the same birthday is MUCH greater than if you selected a single child and looked for her birthday to be on a specific date. If you don't believe me, try running some simulations. Good try though.
Yes, now I see what you mean. You are saying that there will be a not-insignificant probability of an overall collision, but the index will take care of that, while the probability on each new ID will be negligible. I'm retracting my earlier comment -- but I'll quote the URL, en.wikipedia.org/wiki/… -- since I made an embarrassing mistake between overall and single-shot.

If you do it this way, code generation will slow down gradually over time, owing to the need to search an ever-larger database and the rising number of collisions on larger datasets.

You could instead prepare a table of pre-generated random codes and keep track of the current offset into that Codes table. Whenever you need a new code, just fetch the offset-th row from the Codes table and increment the offset by one; this of course needs to be done atomically, under a write lock.

An independent thread can generate random codes whenever suitable (e.g. whenever the system load is low enough; at night; etc.) and INSERT IGNORE them into the Codes table:

CREATE TABLE Codes (
    offset   INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT,
    sequence BINARY(12),
    UNIQUE KEY (sequence) -- required so that INSERT IGNORE silently skips duplicate codes
);
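A sketch of what that replenishment job might look like, assuming PDO/MySQL, the schema above and PHP 7+'s random_bytes(); the batch size and stock threshold are illustrative:

// Illustrative replenishment job -- not from the original answer.
// Relies on the UNIQUE key on `sequence` so INSERT IGNORE drops duplicates.
function replenishCodes(PDO $db, $batchSize = 10000, $maxStock = 100000) {
    // Decline to run if the table already holds enough codes
    $stock = (int) $db->query("SELECT COUNT(*) FROM Codes WHERE offset > 0")->fetchColumn();
    if ($stock >= $maxStock) {
        return;
    }

    $stmt = $db->prepare("INSERT IGNORE INTO Codes (sequence) VALUES (:seq)");
    for ($i = 0; $i < $batchSize; $i++) {
        $stmt->execute(["seq" => random_bytes(12)]); // 12 random bytes for the BINARY(12) column
    }
}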

To "generate" a code you now have to perform only one query which executes in O(1) since it is a fetch to a fixed address. Maybe two queries if you store the address into Code offset zero:

-- each alias needs its own lock when a locked table is referenced more than once
LOCK TABLES Codes WRITE, Codes AS datum WRITE, Codes AS ndx WRITE;
-- row 0's `sequence` holds the current offset; join on it to fetch the next unused code
SELECT datum.sequence FROM Codes AS datum
    JOIN Codes AS ndx ON ( datum.offset = ndx.sequence AND ndx.offset = 0 );
-- advance the stored offset
UPDATE Codes SET sequence = sequence + 1 WHERE offset = 0;
UNLOCK TABLES;
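Wrapped in PHP, the same fetch-and-increment might look like this (a sketch assuming PDO; the function name and the exhaustion check are my own additions):

// Sketch: atomic fetch of the next pre-generated code, using the table and
// statements above. UNLOCK TABLES runs in finally so the lock is always released.
function nextCode(PDO $db) {
    $db->exec("LOCK TABLES Codes WRITE, Codes AS datum WRITE, Codes AS ndx WRITE");
    try {
        $code = $db->query(
            "SELECT datum.sequence FROM Codes AS datum
               JOIN Codes AS ndx ON (datum.offset = ndx.sequence AND ndx.offset = 0)"
        )->fetchColumn();
        $db->exec("UPDATE Codes SET sequence = sequence + 1 WHERE offset = 0");
    } finally {
        $db->exec("UNLOCK TABLES");
    }
    if ($code === false) {
        throw new RuntimeException("Code pool exhausted -- replenishment is behind");
    }
    return $code;
}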

The thread that inserts new codes will experience a slowdown, but not much of one (it can also take a LOW_PRIORITY WRITE lock on each block of INSERTs). All processes requiring new codes, though, will go blazing fast.

Of course the "replenishment" thread will read the current offset and the COUNT(*) from the Codes table, and decline to run if there's more than a give number of codes available.

Checking and redeeming

To do this we can just add a "redeemed" boolean column. To further improve speed, you could use horizontal partitioning, dividing the Codes table into N hashed partitions. That way, not only would any search touch just a small subset of the data (which is not a great improvement over b-tree indexing...), but locking and updating could be spread between tables.

You can also do this partitioning "by hand" and spread the tables between different servers, based on the first letter of the code. That way you can scale up to billions of codes and still have fantastic speed -- provided you supply enough servers.
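The redeem step itself can be a single UPDATE, which verifies the code exists and marks it used in one atomic statement -- a minimal sketch, assuming the "redeemed" column above (the function name is illustrative):

// Sketch: atomic redemption, assuming Codes gains a `redeemed` column defaulting to 0.
function redeemCode(PDO $db, $code) {
    // One UPDATE checks existence and flips the flag; no race between check and update.
    $stmt = $db->prepare(
        "UPDATE Codes SET redeemed = 1 WHERE sequence = :code AND redeemed = 0"
    );
    $stmt->execute(["code" => $code]);
    return $stmt->rowCount() === 1; // false: unknown code, or already redeemed
}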

3 Comments

+1 This is a very helpful answer indeed. Thanks so much! Unfortunately it doesn't help with checking whether codes exist, though (which I should have mentioned earlier, sorry). While generation will definitely be improved by your suggestion, codes will also need to be checked for existence in the database when a user "redeems" one.
Also: Is there any reason you used CHAR(12) over BINARY(12)? Was it just an oversight?
Oversight. Correcting
