
I need to store thousands (and possibly soon hundreds of thousands, maybe even millions) of unique random strings of 12 characters in a database. Every time I generate a new code (actually done in batches of 10,000 or more) I need to compare it against the existing database to ensure there will be no duplicates -- and when a code is "redeemed" by a user, I need to ensure it exists.

Both of these tasks are likely to be very slow, so I wish to make them as streamlined as possible. To begin with I've ensured that the strings are stored in BINARY format on the DB with an INDEX on them. This is apparently quicker than CHAR, VARCHAR and VARBINARY.

I was thinking of trying to do further improvements, and I came up with this simple idea: Storing the first character as a TINYINT in an indexed column, and comparing that first -- thus hopefully finding matching records faster.

For example:

public function getFirstCharAsNum($code) {
    $firstChar = substr($code, 0, 1);  // just the first byte; substr($code, 0) would return the whole string
    $firstCharHex = bin2hex($firstChar);
    $prefix = hexdec($firstCharHex);   // integer 0-255, fits a TINYINT UNSIGNED
    return $prefix;
}

public function isDuplicate($generatedCode) {

    $params["code"] = $generatedCode;
    $params["prefix"] = $this->getFirstCharAsNum($generatedCode); // method call, not a bare function

    $STH = $this->_db->prepare("SELECT COUNT(*) FROM codes
        WHERE prefix = :prefix AND code = :code;");

    try {
        $STH->execute($params);
        $count = $STH->fetch(PDO::FETCH_COLUMN);
    } catch (PDOException $e) {
        throw new Exception($e->getMessage());
    }

    return (bool) $count;
}

The idea is that the second part of the AND operation will only be attempted on rows whose prefix matches, and searching through TINYINTs should be a lot quicker than searching a whole BINARY(12) column.

Is this actually quicker? Or is adding an additional lookup going to slow me down?

Thanks.

5 Comments

  • "Storing the first character as a TINYINT and comparing that first, thus hopefully finding matching records faster" -- MySQL will do the job (that's why you created an index). Commented Nov 16, 2013 at 14:11
  • You are overthinking this... Commented Nov 16, 2013 at 14:41
  • @hek2mgl But wouldn't placing an index on the TINYINT column make it even faster? :) Commented Nov 19, 2013 at 12:22
  • @TheMonk I would use an index, and I guess it will be faster (I said that). Have you tried benchmarking with and without the index? Commented Nov 19, 2013 at 12:25
  • That TINYINT index is naive (imo). I don't expect such tricks to lead to better performance, as MySQL is highly optimized. Try it -- I like it when someone shows me that I'm wrong. :) Commented Nov 19, 2013 at 12:38

2 Answers


I need to store 1000s (and possibly soon 100,000s maybe even millions) of unique random strings of 12 characters in a database

If they are truly random, the chance of a collision is {number of actual records}/{number of possible records}.

Even if the character set you are choosing from contains only digits, then with 10 million existing records the probability of a collision is 10,000,000/1,000,000,000,000 = 1/100,000 -- hence what you are describing is really a waste of time. Add a unique index on the values in the database; if you get a unique-constraint violation trying to add a new value, regenerate the value.

(With a 36-character repertoire, the probability of a collision is approx 1/473,838,000,000.)
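In code, that insert-and-retry approach might look like the following -- a minimal sketch, assuming PDO/MySQL in ERRMODE_EXCEPTION and a codes table with a UNIQUE index on its code column; the function name, the $generate callback and the retry limit are illustrative, not from the question:

// Sketch only: insert and regenerate on duplicate-key errors, assuming a
// UNIQUE index on codes.code; names and parameters are illustrative.
function insertUniqueCode(PDO $db, callable $generate, $maxRetries = 5) {
    $stmt = $db->prepare("INSERT INTO codes (code) VALUES (:code)");
    for ($i = 0; $i < $maxRetries; $i++) {
        $code = $generate();
        try {
            $stmt->execute(["code" => $code]);
            return $code;                    // insert succeeded, so the code is unique
        } catch (PDOException $e) {
            if ($e->getCode() != "23000") {  // SQLSTATE 23000 = integrity constraint violation
                throw $e;                    // anything else is a real error
            }
            // Duplicate key: fall through and regenerate. With 36^12 possible
            // codes this branch should be astronomically rare.
        }
    }
    throw new RuntimeException("No unique code generated after $maxRetries attempts");
}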


2 Comments

No -- you've not understood the birthday paradox. If you select 2 children at random, the probability of them having the same birthday is MUCH greater than if you selected a single child and looked for her birthday to be on a specific date. If you don't believe me, try running some simulations. Good try though.
Yes, now I see what you mean. You are saying that there will be a not-insignificant probability of an overall collision, but the index will take care of that, while the probability on each new ID will be negligible. I'm retracting my earlier comment -- but I'll quote the URL, en.wikipedia.org/wiki/… -- since I made an embarrassing mistake between overall and single-shot.

If you do it this way, code generation will slow down gradually over time, owing to the need to search an ever-larger database and the rising number of collisions on larger datasets.

You could instead prepare a table of pre-generated random codes and keep track of the current offset into that Codes table. Whenever you need a new code, just fetch the offset-th row from the Codes table and increment the offset by one; this of course needs to be done atomically, under a write lock.

An independent thread can generate random codes whenever suitable (e.g. whenever the system load is low enough; at night; etc.) and INSERT IGNORE them into the Codes table:

CREATE TABLE Codes (
    offset   INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT,
    sequence BINARY(12),
    UNIQUE KEY (sequence) -- required so that INSERT IGNORE silently skips duplicate codes
);
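A sketch of what that replenishment job might look like, assuming PDO/MySQL, the schema above and PHP 7+'s random_bytes(); the batch size and stock threshold are illustrative:

// Illustrative replenishment job -- not from the original answer.
// Relies on the UNIQUE key on `sequence` so INSERT IGNORE drops duplicates.
function replenishCodes(PDO $db, $batchSize = 10000, $maxStock = 100000) {
    // Decline to run if the table already holds enough codes
    $stock = (int) $db->query("SELECT COUNT(*) FROM Codes WHERE offset > 0")->fetchColumn();
    if ($stock >= $maxStock) {
        return;
    }

    $stmt = $db->prepare("INSERT IGNORE INTO Codes (sequence) VALUES (:seq)");
    for ($i = 0; $i < $batchSize; $i++) {
        $stmt->execute(["seq" => random_bytes(12)]); // 12 random bytes for the BINARY(12) column
    }
}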

To "generate" a code you now have to perform only one query which executes in O(1) since it is a fetch to a fixed address. Maybe two queries if you store the address into Code offset zero:

-- each alias needs its own lock when a locked table is referenced more than once
LOCK TABLES Codes WRITE, Codes AS datum WRITE, Codes AS ndx WRITE;
-- row 0's `sequence` holds the current offset; join on it to fetch the next unused code
SELECT datum.sequence FROM Codes AS datum
    JOIN Codes AS ndx ON ( datum.offset = ndx.sequence AND ndx.offset = 0 );
-- advance the stored offset
UPDATE Codes SET sequence = sequence + 1 WHERE offset = 0;
UNLOCK TABLES;
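Wrapped in PHP, the same fetch-and-increment might look like this (a sketch assuming PDO; the function name and the exhaustion check are my own additions):

// Sketch: atomic fetch of the next pre-generated code, using the table and
// statements above. UNLOCK TABLES runs in finally so the lock is always released.
function nextCode(PDO $db) {
    $db->exec("LOCK TABLES Codes WRITE, Codes AS datum WRITE, Codes AS ndx WRITE");
    try {
        $code = $db->query(
            "SELECT datum.sequence FROM Codes AS datum
               JOIN Codes AS ndx ON (datum.offset = ndx.sequence AND ndx.offset = 0)"
        )->fetchColumn();
        $db->exec("UPDATE Codes SET sequence = sequence + 1 WHERE offset = 0");
    } finally {
        $db->exec("UNLOCK TABLES");
    }
    if ($code === false) {
        throw new RuntimeException("Code pool exhausted -- replenishment is behind");
    }
    return $code;
}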

The thread that inserts new codes will experience a slowdown, but not much of one (it can also take a LOW_PRIORITY WRITE lock on each block of INSERTs). All processes requiring new codes, though, will go blazing fast.

Of course the "replenishment" thread will read the current offset and the COUNT(*) from the Codes table, and decline to run if there's more than a give number of codes available.

Checking and redeeming

To do this we can just add a "redeemed" boolean column. To further improve speed, you could use horizontal partitioning, dividing the Codes table into N hashed partitions. That way, not only would any search touch just a small subset of the data (which is not a great improvement over b-tree indexing...), but locking and updating could be spread between tables.

You can also do this partitioning "by hand" and spread the tables between different servers, based on the first letter of the code. That way you can scale up to billions of codes and still have fantastic speed -- provided you supply enough servers.
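The redeem step itself can be a single UPDATE, which verifies the code exists and marks it used in one atomic statement -- a minimal sketch, assuming the "redeemed" column above (the function name is illustrative):

// Sketch: atomic redemption, assuming Codes gains a `redeemed` column defaulting to 0.
function redeemCode(PDO $db, $code) {
    // One UPDATE checks existence and flips the flag; no race between check and update.
    $stmt = $db->prepare(
        "UPDATE Codes SET redeemed = 1 WHERE sequence = :code AND redeemed = 0"
    );
    $stmt->execute(["code" => $code]);
    return $stmt->rowCount() === 1; // false: unknown code, or already redeemed
}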

3 Comments

+1 This is a very helpful answer indeed. Thanks so much! Unfortunately it doesn't help with checking whether codes exist, though (which I should have mentioned earlier, sorry). While generation will definitely be improved by your suggestion, codes will also need to be checked for existence in the database when a user "redeems" one.
Also: Is there any reason you used CHAR(12) over BINARY(12)? Was it just an oversight?
Oversight. Correcting
