TSIDs strike the perfect balance between integers and UUIDs for most databases
When designing a database schema, an important decision is deciding on the type used for the key columns. These are the primary key and foreign key columns in a database, such as
posts.user_id, respectively. The two most common choices are auto-incrementing integer types and Universally Unique Identifiers (UUIDs).
In this article, we’ll examine each of these to understand the trade offs between them, and then examine a third option that our experience dictates offers the best of both worlds: Time-Sorted Unique Identifiers (TSIDs). We’ll explain how TSIDs work, what the pros and cons versus the other two options are, and look at an implementation of TSIDs in PostgreSQL that we’re currently using in our production systems today.
Auto-incrementing integer keys
The default choice in most database engines is to use an integer type. This is undoubtedly the type that anyone learning the fundamentals of database schema design or looking at examples of
CREATE TABLE statements will see.
Using an integer type is sufficient for the vast majority of use cases, so long as the integer type selected for the primary key column is large enough to not cause overflow issues in the given business domain. For example, a mistake that someone unfamiliar with this potential issue can make is setting the primary key type to a signed 32 bit integer (called just
INT in many databases), which will result in roughly 2.147 billion keys; a number that can certainly be exceeded in certain contexts. Most SQL databases have a 64 bit integer type with a name like
2 ^ 63 - 1—about 9 billion billion—keys available per table. This is a number so large that it is unreachable for most databases.
Benefits of auto-incrementing integers
There are several obvious benefits to using auto-incrementing integers. One of the most basic benefits is that they will be natively supported by essentially every single SQL database, and they will more than likely have excellent performance due to their modest space requirements and the excellent indexing characteristics of sequential integers. The reason for this is data locality, meaning that similar records live together on disk, allowing significantly more efficient data writes as well as data retrievals in many contexts, which we’ll cover in more detail later.
Another benefit for the purpose of debugging and auditing is that auto-incrementing integers are chronologically sorted, with newer records always having larger values than older records. This issue can be mitigated even when not using auto-incrementing integers simply by defining a
created_at column in every table with the default value being set to the current time, but experience dictates that many developers and other schema designers tend to omit this column. Sequential primary keys allow us to at least discern the relative age of the records in a table.
An additional and perhaps underrated benefit is that auto-incrementing integers are human readable. Integers that are just a few digits long can be temporarily recalled by most people for long enough to transcribe them into another window or computer, such as when investigating an issue and searching for them in logs. Even ten or twelve digit numbers can be transcribed by most people with two or three quick glances. When printed with separator characters such as commas (e.g. 3,212,303,404 rather than 3212303404), they’re particularly easy to read.
Downsides of auto-incrementing integers
There are several downsides to using auto-incrementing integers, however. One problem is that they cannot be generated by multiple separate nodes in tandem. This is a non-issue for systems with a single database that is responsible for generating all of its primary keys. However, this also means that determining the primary key value of a new record requires waiting for the completion of the
In situations where the client is to generate keys or multiple write-database nodes are needed, using simple auto-incrementing stops being a viable option. In these situations, some sort of orchestration is required to safely generate distinct primary keys in multiple places at the same time.
Another issue is what is referred to as the German tank problem, which is the idea that sequential serial numbers (or primary keys) can allow an external observer to make inferences about the total number of records of a given type in a given database. This is particularly an issue in highly sensitive business or even government contexts, where revealing any information at all about the underlying data is considered a risk.
Consider the following scenario: you are a SaaS company that sends out monthly invoices that are tied to an auto-incrementing integer primary key (e.g. visible via URL such as
example.com/invoices/10555). In this case, a customer that gets an invoice each month can see the total number of invoices you issue each month. If in successive months they receive invoices with id values of
12698, then they can assume that just over 1,000 invoices were issued each month, likely corresponding to the number of customers you have.
This might not only tell them how many customers you have, but with some additional contextual information about your company, they may even be able to approximate your average invoice size. Then, with both the number of invoices you issue each month as well as your average invoice size, they can guess your monthly revenue with reasonable accuracy, which is something that most companies prefer to conceal.
The German tank problem can be alleviated through mechanisms such as hashing (e.g.
/invoices/AbW0D21), but this requires significant extra effort. Many companies just decide what little information they may be revealing through integer ids is not worth concealing, and don’t bother with any special treatment.
Universally Unique Identifier keys
UUIDs are another option for primary keys. A UUID is a 128 bit integer that is usually represented as a 32 character hexadecimal string, and is typically displayed in an 8-4-4-4-12 format, such as
cd6aefb6-5898-49d9-906d-f7443450cb39. UUIDs can be generated in many different ways, but for the sake of most of this article we’ll predominantly focus on UUIDv4, which is a random UUID generation scheme and very popular whenever UUIDs are used for primary keys.
Benefits of UUIDs
Using UUIDs as database primary keys brings with it several benefits over auto-incrementing integers. First, and often touted by proponents of UUIDs, since UUIDs are randomly generated, they can be generated without relying on a central authority—in the database, in the client’s browser, in the server side application, by some external service—and are assumed to always be unique due to the vastness of possible values in the 128 bit number space. Whereas auto-incrementing integers always increase by 1, UUID values can for all intents and purposes be any 128 bit number (for UUIDv4, the randomly generated portion is 122 bits, but we won’t go into the technical details of UUID generation here).
This characteristic has many additional benefits that are not immediately obvious. For example, if two completely separate systems have users that are keyed on UUIDs, the user records from those two systems can be merged without any conflicts, while allowing each user record to retain its original primary key. In a situation where auto-incrementing integers were used in both systems, one system’s set of users would have to be chosen as the incumbent and the other system’s users would need to be reassigned new primary keys during an import process in order to resolve all conflicts.
UUIDs, especially the completely randomly generated UUIDv4 values, cannot be predicted and do not leak information. Since UUIDs are not monotonically incrementing, the gap between successively generated values will vary. Given this, there’s no potential for the German-tank problem to be an issue for record sets where UUIDs are the primary key. There’s no possibility of inferring how many records there are in a system from a UUID given that there’s no meaningful sequence that correlates with the cardinality of the underlying record set that the keys are being created for.
Downsides of UUIDs
UUIDs have several major disadvantages as well. First, the 128 bit integers they consist of take significantly more space than auto-incrementing integers. In a large database, the larger keys can take up significantly more total space. The issue with using UUIDs for primary keys specifically is that the entire UUID is stored not only in the table but also in all indexes created for that table. In narrow tables (tables with only a few small columns), the size of the UUID may be larger than all the rest of the data combined. An analysis of UUIDs conducted by Percona gave the following example and observation about the characteristics of UUIDs in schemas:
Let’s assume a table of 1B rows having UUID values as primary key and five secondary indexes. If you read the previous paragraph, you know the primary key values are stored six times for each row. That means a total of 6B char(36) values representing 216 GB. That is just the tip of the iceberg, as tables normally have foreign keys, explicit or not, pointing to other tables. When the schema is based on UUID values, all these columns and indexes supporting them are char(36). I recently analyzed a UUID based schema and found that about 70 percent of storage was for these values.
UUID support also varies across databases, from acceptable to poor. As a result of this, UUIDs typically have worse performance characteristics than normal integers, and are comparable at best. This is especially true when UUIDs are stored in a string column type, such as
VARCHAR, where index performance will not be as good as
INTEGER column types. The decision to use
VARCHAR may be for a variety of reasons—no better
UUID type at all, complexity associated with compiling/loading a UUID handling extension, or a mistake on the part of the developer creating the database. The
BINARY(16) type is another option to use in the case of no native UUID support, but as the previously linked analysis of UUIDs from Percona says, changing the representation of UUIDs only offers a marginal benefit:
The use of a smaller representation for the UUID values just allows more rows to fit in the buffer pool but in the long run, it doesn’t really help the performance, as the random insertion order dominates. If you are using random UUID values as primary keys, your performance is limited by the amount of memory you can afford.
Even if UUID type support is good, randomly generated UUIDs lead to a variety of issues arising from the non-sequential distribution of data, which can be detrimental for performance. Insert performance tends to diminish significantly when data being written uses random primary keys on databases that use normal hard drives (HDDs) rather than solid-state drives (SSDs). Since most SQL databases use B-trees for indexing primary keys, sequential keys lead to a B-tree that grows in a predictable and sequential manner. Conversely, UUIDs, due to their random nature, can lead to a more scattered B-tree. This scattering necessitates keeping larger portions of the tree in memory, potentially reducing overall efficiency.
With an auto-incrementing primary key, records are usually inserted in chronological order. This ordering creates a correlation between temporal and memory locality. As a result, the database can efficiently manage memory by keeping only the frequently accessed (or “hot”) portions of the table in memory, while less frequently accessed (or “cold”) data can be stored on disk.
The contiguous storage of temporally related data with auto-incrementing keys can also enhance CPU prefetching. Prefetching is a process where the CPU anticipates the need for certain data and loads it into faster memory ahead of time. When data is stored non-contiguously, as with UUIDs, the benefits of prefetching are diminished, since the CPU cannot as easily predict which data will be needed next.
An issue with UUIDv4 is that their total randomness means there’s no way to discern which keys are newer than others. As discussed above in the section on auto-incrementing integers, sequential keys provide information into the relative creation times of various records, making debugging and auditing easier in certain circumstances.
The good news is that as of this writing, there is a draft of an upcoming UUIDv7 specification, which will use a timestamp as a component of each generated UUID to ensure that successive values are sequential. It’s unclear when the specification will be finalized, and how much longer it will take before there’s a practical means of generating UUIDv7 values in your database or application of choice. If you are starting a new project at the time of this writing, UUIDv7 is likely not a practical option for you.
Another significant downside of UUIDs, even if sequential, is their lack of readability. Whereas even relatively large integers are easy to read, easy to say, and easy to transcribe, this is largely untenable for UUIDs. Sharing UUIDs between programs, systems, log files, spreadsheets, and wherever else can only be practically done via copy-and-paste. Manually doing so is both error prone and laborious.
The readability of UUIDs is also an issue in other contexts. Consider the following URLs:
With UUIDs, the URL becomes so long that it’s likely hard to see the whole thing. Depending on screen size and the visible length of the browser’s address bar, even the
/invoice/ portion of the URL may be obscured, making it impossible to see at a glance that we are in fact looking at an invoice.
Moreover, as already mentioned, remembering an integer is generally feasible. “Oh yeah, user 281714, that’s the same one that received the other invoice I was looking at.” UUIDs do not lend themselves to the same. It’s certainly possible to do something like “Oh yeah, user that started with c011fc.” But the onus is on the person to remember they’re looking at prefixes rather than suffixes, and in the unlikely event of a prefix collision, things can become very confusing.
The readability concerns persist in other contexts as well, such as when viewing data in a database client or spreadsheet. UUIDs lead to very large and visually noisy data. Consider the following two tables, which use the same IDs employed in the URLs above.
Table with integer identifiers
Table with UUID identifiers
We can see that with UUIDs, the table that contains otherwise the same data is significantly wider whether it be viewed in a database client, a spreadsheet, a log file, or embedded on an HTML page like this.
The last issue we’ll mention with UUIDs as primary keys is that they are incompatible with integers. What this means is that there’s no easy way to switch a given table’s primary key from auto-incrementing integers to UUIDs or vice versa. Doing so may require a significant migration procedure, and may be totally untenable if the data is synchronized across multiple systems (e.g. a nightly data feed or ETL process that is commonly used in B2B integrations or for data analysis purposes). In other words, if you chose auto-incrementing integers as your primary key, odds are you aren’t ever going to be able to change that primary key to UUIDs in a production system. The reverse is also true.
Time-sorted Unique Identifier keys
Now, let’s look at TSIDs, which is the name of a particular specification published for implementing time sorted identifiers. The specification by Fabio Lima we referenced in our use mentions that TSIDs combine ideas from Snowflake IDs developed and used by Twitter, and ULIDs, another time-sorted identifier that touts a shorter canonical form than UUIDs, amongst several other benefits.
Some of the key features are:
- TSIDs are generated in time-sortable order (as with other time-sorted identifiers, including the aforementioned UUIDv7)
- TSIDs are a 64 bit integer
- TSIDs can be represented as a 13 character string through Crockford base32 encoding
- The TSID generation algorithm can optionally include node IDs, to ensure that TSIDs generated at multiple sources (e.g. multiple databases or application servers) stays unique
Benefits of TSIDs
TSIDs combine many of the benefits of both auto-incrementing integers and UUIDs. As already mentioned, one of the fundamental ideas is that TSIDs are time-sorted, so they are naturally sequenced. TSIDs generated at least one millisecond apart will always maintain their generation order when numerically sorted. This is a benefit that for the purpose of most applications is shared with auto-incrementing integers.
However, unlike auto-incrementing integers, and like UUIDs, TSIDs also include a random component, so successively generated TSIDs cannot be predicted ahead of time. Equally importantly, it is impossible to discern how many TSIDs have already been generated by looking at one or even many TSIDs.
We can demonstrate the above behaviors through examples. Consider the following example in our PostgreSQL database, using our
generate_tsid() function (the implementation of which we’ll show later in this article). We’ll run the following query that invokes the function four times in extremely quick succession, and certainly in less than a millisecond.
select generate_tsid() as tsid, 'first' as ord
select generate_tsid() as tsid, 'second' as ord
select generate_tsid() as tsid, 'third' as ord
select generate_tsid() as tsid, 'fourth' as ord
order by tsid asc;
We see that the results are returned out of order. This is because a TSID uses millisecond precision in the timestamp component. At the same time, even though we generated four TSIDs in the same millisecond, the difference between the largest and the smallest one is a numerical value of 3,064,831, making it completely untenable to discern how many TSIDs were generated in a given millisecond, let alone across time.
We can use the
pg_sleep() function to slow down each query just slightly:
select tsid, ord from (
select generate_tsid() as tsid, 'first' as ord
select tsid, ord from (
select pg_sleep(0.001), generate_tsid() as tsid, 'second' as ord
select tsid, ord from (
select pg_sleep(0.001), generate_tsid() as tsid, 'third' as ord
select tsid, ord from (
select pg_sleep(0.001), generate_tsid() as tsid, 'fourth' as ord
) aliasorder by tsid asc;
Which results in:
We can see that delaying the execution of each call to
generate_tsid() because the timestamp component that leads the TSID flips forward between successive calls spaced one millisecond apart. We also see that now the difference between the largest and smallest TSID values in the above sequence generated about three milliseconds apart is more than 25 million—utterly obliterating any concerns pertaining to leaking information related to how many records there are in our set.
Looking at the above examples, we can also see that TSIDs are short. Their integer representation is only 18 characters long. Encoded in the aforementioned Crockford base32, a TSID is only 13 URL-safe characters. That’s about 65% shorter than the 36 characters that the standard display format of UUIDs consists of. This means that TSIDs are readable. Let’s consider the above comparison against integers again:
This is a huge improvement over UUIDs. Plain integers still have the upper hand when it comes to recognition and short term recall or transcription, but the additional cost of the TSID in this respect is modest compared to the UUID.
One of the most important elements of TSIDs as compared to UUIDs is that they are stored as integers. This means that all of the space, performance, and database support benefits of auto-incrementing integers are true for TSIDs as well. From the perspective of the database, the only distinction between auto-incrementing integers and TSIDs is how each of them are generated, with auto-incrementing integers relying on a per-table sequence that the database stores, and TSIDs relying on their own generation function. The means of storage, of indexing, and of retrieval of both are identical.
There’s another benefit that comes from TSIDs being integers, which is that TSIDs can be a drop in replacement for auto-incrementing integers in any database table whose primary key value is less than about 500,000,000,000,000,000 (that’s a “5” followed by seventeen zeros). I think it’s safe to say that this will likely not be an issue in most tables in most databases.
Once the means of generating TSIDs is defined, it takes at around two queries to switch to TSIDs. Here’s what that looks like for our
users table in our PostgreSQL database:
ALTER TABLE public.users ALTER COLUMN id TYPE int8 USING id::int8;
ALTER TABLE public.users ALTER COLUMN id SET DEFAULT generate_tsid();
And here’s the resulting data (with two rows existing before the switch, and two more being added after):
The gap in values may look jarring, but should be inconsequential if the keys are used solely as identifiers throughout the system. For new tables, it’s just as easy. We simply set the
DEFAULT value for the primary key column to our TSID generation function.
CREATE TABLE public.users (
id int8 NOT NULL DEFAULT generate_tsid(),
"name" varchar NOT NULL,
CONSTRAINT users_pk PRIMARY KEY (id)
Another aspect of TSIDs that may be either an upside or a downside is that how they are generated is configurable. The TSID specification defines several variables that can be changed to adjust the generation behavior for a specific context. For example, the node id may be omitted and replaced with a longer random component. While this configuration gives flexibility, if it is written or altered incorrectly, can lead to errors and requires careful adjustment.
One of the other aspects of TSID configuration is that the implementation used in a given database can change over time. As long as the first 42 bits of the TSID are implementing as the timestamp component, the remaining 22 bits can be any combination of random bits, a node id, and a counter component depending on the needs of the deployment, and this portion can change as needed, given that any potential collisions arising from this combination are only a risk in a given millisecond.
Downsides of TSIDs
TSIDs do have a few downsides. The first is that they rely on a timestamp component that must fit into the first 42 bits of each standard TSID. The current standard way to generate TSIDs is to use the milliseconds from
2020-01-01 as the timestamp component. Depending on how the epoch is calculated, this means that the timestamp fits into the allocated space for about 70 or 140 years. This may not be a practical concern for anyone building software today, but is worth noting.
As we saw in the example above, it is possible to generate TSIDs that are out of order if they are generated within the same millisecond. For multi-node generation, clock drift between machines is also an issue, so it is possible that TSIDs generated across machines will be generated out of order due to deviations in the machine clock time. It is worth remarking that clock drift will be a concern for all multi-machine time-sorted identifiers, including UUIDv7 and Snowflake IDs, so TSIDs are no worse in this regard.
Distributed TSID generation is not as straightforward as it is with UUIDs. When generating TSIDs in multiple places, a node ID or machine ID must be maintained. In addition to devising an orchestration scheme to manage the node ID itself, the presence of this node ID subtracts the number of bits available in the random component of the TSID, increasing the possibility of collisions. The probability of a collision remains extremely small, and is only a potential issue when large numbers of TSIDs are generated in a very brief period on a single node—at least a million identifiers per second before the risk appears—but this risk must be considered and managed in high throughput systems, or in cases where large blocks of identifiers are generated before they are actually needed.
A final significant downside is that since TSID generation is highly configurable, managing the configuration can be a burden if it needs to be ported across databases or even across completely distinct system components (e.g. a SQL database and an auxiliary service that both need to generate TSIDs will each need to maintain their own copy of the implementation that stays in agreement to eliminate the risk of collisions). It’s also possible that not every database may make it easy to define a custom TSID generation function. Our PostgreSQL implementation example below is rather compact and self-contained, but other databases may not make it so simple.
TSID implementation example
Let’s look at an example implementation of TSIDs that we use for single-database systems. The implementation below is based on an example implementation by Fabio Lima. It defines the PostgreSQL function
generate_tsid(), which we’ve referenced several times earlier in this article. The following must be ran once per database instance before the function becomes available for use:
drop sequence if exists "generate_tsid_seq";
create sequence "generate_tsid_seq" maxvalue 1024 as smallint cycle;
create or replace function generate_tsid() returns bigint as $$
-- Milliseconds precision
C_MILLI_PREC bigint -- Random component bit length: 12 bits
C_RANDOM_LEN bigint -- TSID epoch: seconds since 2020-01-01Z
-- extract(epoch from '2020-01-01'::date)
-- 42 bits
:= floor((extract('epoch' from clock_timestamp()) - C_TSID_EPOCH) * C_MILLI_PREC);
C_TIMESTAMP_COMPONENT bigint -- 12 bits
:= floor(random() * C_RANDOM_LEN);
C_RANDOM_COMPONENT bigint -- 10 bits
:= nextval('generate_tsid_seq') - 1;
C_COUNTER_COMPONENT bigint begin
return ((C_TIMESTAMP_COMPONENT << 22) | (C_RANDOM_COMPONENT << 10) | C_COUNTER_COMPONENT);
end $$ language plpgsql;
Some details about this implementation:
- We don’t have a node id, and instead we have a 10 bit counter component.
- The counter component of TSIDs generated by this function relies on a PostgreSQL sequence (named
generate_tsid_seq) that cycles between integers from 1 to 1,024. This sequence is shared across all invocations of the
generate_tsid()function across the whole database. In high throughput databases, this can easily be modified to use a sequence per table, significantly reducing the already low chance of collisions even further.
- We have a 12 bit random component.
- Leading each TSID, we have the aforementioned 42-bit timestamp component that is calculated using the number of milliseconds from
With the above function defined in a database, TSIDs can start to be used as keys in table columns (see the earlier
CREATE TABLE and
ALTER TABLE statements). This is all that is required for a single-database implementation of TSIDs.
Feature shootout: Auto-incrementing integers vs. UUIDs vs. TSIDs
Having made extensive comparisons of auto-incrementing integers, UUIDs, and TSIDs in this article, let’s now look at a summary of what we’ve covered. The below table shows a breakdown of how well each type fares by each feature or trait we’ve examined. Each feature includes an annotation as to whether it is a relative positive, negative, or neutral compared to the other types. This comparison is performed in the context of a typical B2B or B2C SaaS application backed by a single SQL database (spanning one or a few nodes), which describes the vast majority of applications built today.
|Variable size integer
|Unique within a database
|Unique across nodes
(larger than integers but smaller than UUIDs)
(time-sorted with random component)
(efficient indexing, inserts, reads)
(inefficient inserts, scattered indexes, read penalty)
(similar to integers)
(32 character strings)
(13 character strings)
(based on sequence)
|No inherent order
(based on time component)
|Feasible with node IDs
|Security (Inference Risk)
(German Tank Problem)
|Ease of Implementation
(varies by database)
(least support, requires function implementation, managing node IDs)
(limited by integer type)
(no practical limit)
(at least ~70 years, limited by timestamp size)
(can change to larger integer type)
(hard to change key type)
(drop-in compatible with integers)
TSIDs: Have your cake and eat it too
In the projects we’re currently working on, including our last production system, we’re employing TSIDs using the PostgreSQL implementation shared above (along with a language-specific implementation of Crockford base32 for use of TSIDs in URLs). Our experience has been very positive, with a completely seamless switch from predominantly using auto-incrementing integers (along with some use of UUIDs) in older systems to TSIDs in our current work. This ease of adoption was especially true once we settled on the specific TSID configuration we wanted to use.
In considering the trade offs of auto-incrementing integers, UUIDs, and TSIDs, our experience dictates that TSIDs bring together many of the benefits of both auto-incrementing integers and UUIDs and minimizing the downsides. As an organization, we’ve seen success with using TSIDs in production, and we strongly recommend evaluating and considering whether TSIDs are an improved means of primary key generation in your systems, especially to anyone who is already considering a switch from auto-incrementing integers to UUIDs or vice versa. The TSID may be exactly the right balance of features you are looking for.
Christian Charukiewicz is a Partner at Foxhound Systems. We’re a small team of Software Engineering leaders that can lead your organization towards its software development goals. Want to improve the effectiveness of your development team? Take a look at our Technical Guidance subscriptions or for a larger project, contact us.