Generating Primary Keys
A primary key for each row of a table in a database is virtually a requirement of database design. Occasionally, the data for a table provides a natural primary key (e.g. a username or email address for an account table), but more commonly you need to generate primary key values yourself. Yet the tools for this in MySQL/Java are limited. MySQL offers auto_increment, but it has issues with replication, it can become a bottleneck for insert-heavy tables, it doesn’t provide globally unique ids, and displaying these ids publicly may expose sensitive information. Java offers java.util.UUID, which gives pseudo-random 128-bit values. The chance of a collision is minuscule, but non-zero. More troubling is the size of the string representation: 36 characters. Since InnoDB uses the primary key index as the storage structure for the data and uses primary keys as data pointers in secondary indexes, long keys not only waste space but also make the database less efficient.
After evaluating these options and a few ideas of our own for primary key generation, we settled on a simple algorithm motivated by group theory. The advantages of this algorithm are numerous:
- Short Keys (6 characters yield 57 billion unique keys using only alphanumeric characters)
- Universal Uniqueness (no guessing to which table a key value refers)
- Pseudo-randomness (keys don’t follow an obvious pattern)
- No Duplicate-Checking (keys are guaranteed to be unique until a limit is reached)
- Block Generation (keys are generated in blocks to minimize lock contention)
Our generator uses one tiny bit of group theory: if k and n are coprime (a.k.a. relatively prime), the sequence of numbers generated by successively adding k (mod n) will not repeat within the first n values. This leads to the following algorithm for generating unique keys:
- Pick a size n
- Pick a value k which is coprime with n
- To generate the next key: nextKey = (lastKey + k) % n
You’re guaranteed not to see duplicates until you’ve generated n keys. The sequence you’d see with n=5 and k=3 is { 0, 3, 1, 4, 2, 0, 3, … }.
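In Java, the step itself is one line. Here is a minimal sketch using the toy n=5, k=3 values above (the class and field names are just illustrative):

```java
// Minimal sketch of the generator step, using the tiny example values above.
// In practice n and k would be the large, fixed values described below.
public class KeySequence {
    private static final long N = 5;  // keyspace size (example only)
    private static final long K = 3;  // step size, coprime with N (example only)

    private long lastKey = 0;

    // Returns the next key; values repeat only after N keys have been generated.
    public long nextKey() {
        lastKey = (lastKey + K) % N;
        return lastKey;
    }
}
```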
Note that the choices of n and k are quite important: they must be fixed and can never change. However, selecting reasonable values is not difficult. For n, select a character set and string length, then set n to the number of possible unique strings. To get the 57 billion figure above, use a string length of 6 and a character set of [0-9a-zA-Z] (62 characters); 57 billion is simply the number of unique 6-character alphanumeric strings (62^6). If you grow to the point that you are worried about key collisions, switch to 7-character strings (where n = 62^7, approx. 3.5 trillion). Note that converting a key’s numeric value to its string value is simply a conversion from base 10 to base 62 (or however many characters you are using).
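Java’s built-in conversions (e.g. BigInteger.toString) top out at radix 36, so base 62 needs a small hand-rolled mapping. A sketch of that conversion; the alphabet ordering here is an arbitrary choice, but whatever you pick must never change once keys are issued:

```java
// Sketch of converting a non-negative key value to a fixed-width base-62 string.
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static String encode(long value, int width) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (value % 62)));
            value /= 62;
        } while (value > 0);
        while (sb.length() < width) {
            sb.append('0');           // pad so every key has the same length
        }
        return sb.reverse().toString();
    }
}
```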
For k, we need a value that is coprime with n. To achieve pseudo-randomness, k should also not be too small (the same order of magnitude as n is a good choice). Note that this “randomness” is quite weak in a mathematical sense, but it was sufficient for our purposes. One way to select such a k is to multiply together prime numbers larger than the character set size; since n is a power of the character set size, no prime larger than that size can divide n, so the product is automatically coprime with n. For our example, a reasonable choice would be k=67*71*73*79*83*89. If you don’t have your own prime number generator, consult the bear.
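A quick way to sanity-check a candidate k, using the example values from this post:

```java
import java.math.BigInteger;

// Sanity check that the example k is coprime with n = 62^6.
public class CoprimeCheck {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(62).pow(6);                   // ~57 billion
        BigInteger k = BigInteger.valueOf(67L * 71 * 73 * 79 * 83 * 89);
        // gcd(k, n) == 1 means the sequence visits all n values before repeating.
        System.out.println("coprime? " + k.gcd(n).equals(BigInteger.ONE));
    }
}
```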
To put this algorithm into practice, one needs to ensure that keys are generated serially. We did this by creating a table with a single row and a single column storing the last key value. When we want to generate a key (or block of keys), we start a SERIALIZABLE transaction, read the last key value, generate key(s) per the above algorithm, then write back the last key value we generated and close the transaction. To minimize contention (and because computing the next key is much faster than running a transaction), we generate keys in blocks and serve them out of memory via a synchronized HashMap. This means key values are occasionally lost for good when a webapp is shut down, but the loss is too small to be of any real concern.
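Roughly, the allocation step looks like the sketch below. The key_sequence table, column names, and JDBC plumbing are illustrative only, and it serves keys from a simple in-memory queue rather than reproducing our exact HashMap-based code:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of block allocation against a hypothetical single-row table
// key_sequence(last_key BIGINT). Not the actual production code.
public class BlockKeyGenerator {
    private static final long N = 56_800_235_584L;         // 62^6
    private static final long K = 67L * 71 * 73 * 79 * 83 * 89;

    private final Deque<Long> pool = new ArrayDeque<>();

    // Hands out the next key, refilling the in-memory block when it runs dry.
    public synchronized long nextKey(Connection conn, int blockSize) throws SQLException {
        if (pool.isEmpty()) {
            refill(conn, blockSize);
        }
        return pool.removeFirst();
    }

    private void refill(Connection conn, int blockSize) throws SQLException {
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            ResultSet rs = st.executeQuery("SELECT last_key FROM key_sequence FOR UPDATE");
            rs.next();
            long last = rs.getLong(1);
            for (int i = 0; i < blockSize; i++) {
                last = (last + K) % N;   // the core step of the algorithm
                pool.addLast(last);
            }
            st.executeUpdate("UPDATE key_sequence SET last_key = " + last);
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```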
We’ve been using this system for many months now and have yet to run into any problems. It satisfies all of our current needs and has the advantage that it can easily scale either by using longer character strings or increasing the key generation block size. Furthermore, it seems to be extremely lightweight, exerting minimal pressure on our database. We would love to hear what other solutions for primary key generation are used. How does ours compare?
It seems that the only benefit of this method over auto incrementing (i.e. k == 1) is pseudo-random keys, and I’m not really seeing what is so good about pseudo-random keys. Am I missing something?
You could make this a bit more cryptographically secure by adding a simple pooling method. Pull a set of numbers from your algorithm. Shuffle the numbers (see Fisher–Yates shuffle on Wikipedia). Then take numbers sequentially from this pool. After taking a number from the pool, take another from your algorithm, add it to the pool and reshuffle. Your RNG needs to be good…
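Something along these lines, as a rough sketch; the pool size, RNG, and the toy n and k values are all illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch of the pooling idea: keep a shuffled pool of keys, hand them out one
// at a time, and after each draw pull one more from the generator and reshuffle.
// Collections.shuffle performs a Fisher-Yates shuffle internally.
public class ShuffledKeyPool {
    private static final long N = 5, K = 3;     // illustrative values only
    private long lastKey = 0;

    private final List<Long> pool = new ArrayList<>();
    private final Random rng = new Random();    // use a stronger RNG if order must be hard to predict

    public ShuffledKeyPool(int poolSize) {
        for (int i = 0; i < poolSize; i++) {
            pool.add(nextSequentialKey());
        }
        Collections.shuffle(pool, rng);
    }

    public synchronized long take() {
        long key = pool.remove(pool.size() - 1);
        pool.add(nextSequentialKey());          // top the pool back up
        Collections.shuffle(pool, rng);
        return key;
    }

    private long nextSequentialKey() {
        lastKey = (lastKey + K) % N;            // the (lastKey + k) % n step from the post
        return lastKey;
    }
}
```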
Dear Redditers,
The main point of this is to generate short keys (i.e. not 36-char UUIDs) without having to put a lock on a table (i.e. auto increment). The pseudo-randomness part is really to keep people from scraping all of your content because they have an entire map of your keyspace.
Random numbers don’t guarantee that you won’t have collisions.
UUIDs look ugly in URLs.
Ergo, this solution.
Philip Jacob / Founder & CTO
I don’t understand how this method avoids having to lock the table. If two threads both go to make a new primary key, they need to make sure they get different values for lastKey or else it breaks, and the only way to do that is to have some sort of locking.
(I’m also of the school of thought that says you should never show a primary key to a user, but that is more subjective)
If I understand you, you want to have a unique ID for each record, that you would like to be short so you can include it in a URL, and you want them not to be sequential, so that people knowing one ID can’t guess the next.
And it’s this last requirement that keeps you from using the automatic sequencing provided by the database.
So you’re using modular arithmetic to generate a predictable sequence of non-sequential numbers. Seems to me that would work fine. But there is an alternative approach – provide a reversible mapping from the key value in the database to the ID you include in the URL.
Your database key values would be a sequence, 1, 2, 3, …, i, …
You’d need a modulus, n. And two integers, p and q, that are multiplicative inverses mod n. That is, p * q = 1 mod n.
You read the key, i, from the database and multiply it by p to create the ID you include in the URL. When you have a URL, you extract the ID and multiply it by q to recover the database key.
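A sketch of that mapping; the n and p values here are illustrative (reusing the prime product from the post for convenience), and any p coprime with n works, with q obtained as the modular inverse:

```java
import java.math.BigInteger;

// Sketch of the reversible mapping: publicId = key * p mod n, key = publicId * q mod n.
public class ReversibleId {
    private static final BigInteger N = BigInteger.valueOf(62).pow(6);
    private static final BigInteger P = BigInteger.valueOf(67L * 71 * 73 * 79 * 83 * 89).mod(N);
    private static final BigInteger Q = P.modInverse(N);   // exists precisely because gcd(P, N) = 1

    static long toPublicId(long dbKey) {
        return BigInteger.valueOf(dbKey).multiply(P).mod(N).longValueExact();
    }

    static long toDbKey(long publicId) {
        return BigInteger.valueOf(publicId).multiply(Q).mod(N).longValueExact();
    }
}
```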
Is this better than what you describe? In truth, it’s not much different. Unless the database’s built-in mechanisms for generating sequential key values are more efficient than the methods you are using to generate non-sequential keys. Which they might be.
Jeff, you’re leaving out one requirement:
# Block Generation (keys are generated in blocks to minimize lock contention)
This is important, especially in MySQL as you scale. And, yes, we’ve had to do all kinds of things because StyleFeeder is a big site. But otherwise, what you suggest is nice and simple and meets the other criteria.
I understand that an auto-increment field would only create a single new key at a time, and that wouldn’t scale for you. But I was under the impression that there were other auto-sequencing mechanisms in some rdbmses that would allow for the allocation of a number of keys at a time. I have no idea whether any of them are available to you in MySQL.
My suggestion could, of course, be implemented using the same single-row/single-field mechanism you have described. You could argue that it would be more efficient: to allocate 20 keys, you’d simply lock, read, increment the number by 20, then write and commit, whereas with yours you’d need to do the 20 (lastKey + k) mod n steps. That difference, in an I/O-bound application like a web app accessing a database, is vanishingly small.
If I were to write a new system, having the two methods in mind, I might go with yours. Simply because it’d save a step when I was bug-chasing, to be able to match the ID in the URL to a record in the database. I don’t think any increased efficiency would be measurable.
Step counting with a coprime modulus is not a safe way to obscure an ID series. You can usually notice it just by looking at a handful of IDs, and if you can find just two pairs of adjacent IDs in the database – two quick successive inserts, if they’re temporally ordered, should do it – then you can use those two pairs to find the coprimes.
This isn’t going to stop anyone. This is security-through-obscurity, and not even particularly well done at that. You’ve maybe slowed someone down for three minutes.
If your system treats ID traversal as a vulnerability, and also does not authorize access to IDs, then your system is fundamentally insecure, no matter how much magic (and this isn’t much magic) you use to obscure the sequence.
How do you do a conversion to base 62? Java’s max radix is 36 for BigInteger.toString(int radix). Is there any way you can share your algorithm for this?