Unique ID in Distributed Systems
Last updated
Last updated
Recently at work, We were looking for a way to generate unique IDs across a distributed system that could also be used as the primary keys in the MySQL tables.
We knew in a single MySQL database we can simply use an auto-increment ID as the primary key, But this wonβt work in a sharded MySQL database.
So I looked at various existing solutions for this and finally learned about Twitter Snowflake β a simple 64-bit unique ID generator.
UUIDs are 128-bit hexadecimal numbers that are globally unique. The chances of the same UUID getting generated twice are negligible.
The problem with UUIDs is that they are very big in size and donβt index well. When your dataset increases, the index size increases as well and the query performance degrades.
Another problem with UUIDs is related to the user experience. Eventually, our users will be needed that unique identifiers. Imagine that a customer calls Customer Service and is asked to provide the identifier. Having to spell a complete UUID is not a pleasant experience.
Twitter snowflake is a dedicated service for generating 64-bit unique identifiers used in distributed computing for objects within Twitter such as Tweets, Direct Messages, Lists, etc.
These IDs are unique 64-bit unsigned integers, which are based on time. The full IDs are made up of the following components:
Epoch timestamp in millisecond β 41 bits (gives us 69 years with respect to any custom epoch)
Configured machine/node/shard Id β 10 bits (gives us up to a total of 2^10 i.e 1024 Ids)
Sequence number β 12 bits (A local counter per machine that sets to zero after every 4096 values)
The extra 1 reserved bit at the beginning is set as 0 to make the overall number as positive.
Since these use the timestamp as the first component, therefore, they are time sortable as well. Another benefit is its High Availability.
By default, 64-bit unsigned integers (long) will generate an Id whose length is 19, but sometimes it may be too long, our use case needed an Id whose length should not be greater than 10.
This article will share a simplified version of the unique ID generator that will work for any use-case of generating unique IDs in a distributed environment based on the concepts outlined in the Twitter snowflake service.
In our case, the full ID will be composed of a 20-bit timestamp, 5-bit worker number, and 6-bit sequence number.
The remaining 1-bit is the signed bit and it is always set to 0 to make the final value positive.
Our microservices can use this Random number generator to generate IDs independently. This is efficient and fits in the size of a int
(4 Bytes or 32 bits).
Here is the complete code in Java (Inspired by Twitter snowflake, code credits) -
Step 1 β We initialize the number of bits that each component will require :
Here, we are taking custom epoch as of Fri, 21 May 2021 03:00:20 GMT.
EPOCH_BITS
will be 20 bits and is filled with a current timestamp in seconds (You can also use millisecond if there is a possibility of multiple numbers of requests per second).
NODE_ID_BITS
will be 5 bits and is filled using the Mac address.
SEQUENCE_BITS
will be 6 bits and will act as a local counter which will start from 0, goes till 63, and then resets back to 0.
Step 2 β Creating a synchronized
function to generate the IDs :
Left Shifts
& Logical OR
operations?This is because Integer is represented by 32 bits and initially all are set to 0. To fill these bits we have to take each component separately, so first we took the epoch timestamp and shift it to 5 + 6 i.e 11 bits to left. Doing this has filled the first 21 bits with the first component (remember the first bit is always set to zero to make the overall number positive). The remaining 11 bits are still 0 and hence again we repeat the same thing with logical OR & the other two components as well thereby filling all the 32 bits and forming the complete number.
Step 3 β Utility function to generate the node id using the systemβs MAC address:
Letβs now understand its working with an example -
Letβs say itβs Sun, 23 May 2021 00:00:00 GMT right now. The epoch timestamp for this particular time is 1621728000.
First of all, we adjust our timestamp with respect to the custom epoch-
currentTimestamp = 1621728000- 1621566020 = 161980(Adjust for custom epoch)
So to start our ID, the first 20 bits of the ID (after the signed bit) will be filled with the epoch timestamp. Letβs this value with a left-shift :
id = currentTimestamp << (NODE_ID_BITS + SEQUENCE_BITS )
Next, we take the configured node ID/shard ID and fill the next 10 bits with that
id = id | nodeId << SEQUENCE_BITS
Finally, we take the next value of our auto-increment sequence and fill out the remaining 6 bits -
id = id | sequence // 6149376
That gives us our final ID π
And thatβs it! Primary keys that are unique across our application!
This article showed you a simple solution of how to generate a snowflake id whose length is >=7 and <=10.
By the way, you can adjust the bit count of the 3 components to adapt to your work.
NOTE : We should keep the generator as a singleton, it means that we should only create the single instance of SequenceGenerator per node. If not, it may generate some duplicate Ids.
Not only did twitter used it, Discord also uses snowflakes, with their epoch set to the first second of the year 2015.
Instagram uses a modified version of the format, with 41 bits for a timestamp, 13 bits for a shard ID, and 10 bits for a sequence number.
I hope this will help you! Thanks for reading :))