Monday, January 23, 2012

Database Talk - Unique Identifiers vs. Numeric identities

Wiki
Just to make sure we're all on the same page –
Unique Identifier (which from now on will be referred to as ‘GUID', a.k.a. 'UUID'), is an algorithm-based 32-byte hex string (128 bit integer) which is supposed to generate completely unique values.

I’d like to compare the chances of two identical generated GUIDs to the chance earth will be destroyed by a meteor in the next five minutes.
Now, if that happens – you won’t be here to prove me wrong! J

Here’s one: 5B5D6F63-1FF0-4574-9B77-BD77D716FD66.
It is probably unique and there’s no other like it in the world.


Why should I?
With today’s increasing demand for fully-distributed systems, GUID usage is increasing accordingly.
It allows managing two or more disconnected databases which later on need to merge, knowing that each row on desired shared tables is unique.

Here’s a simple example –

Consider a very large company with many branches across the world.

Each branch has local users, and there’s a sync process which merges the entire users list.
The ‘traditional’ way is to maintain a range of numbers for each branch, for example:
-       use a prefix on the username to maintain uniqueness
-       Assign a numeric range for each branch (easier today with SQL 2012's sequences support)
-       Make a combined key of two columns, for example - UserId+BranchId columns

This is a classic sample for using GUID instead – make the UserId column GUID and problem is solved.


Sounds like fun! Any reason why I shouldn’t use it?
Well, yes.
Before you get all excited and change your “int’s” and “bigint’s” to GUIDs, consider the following:
-       GUID column takes 16 bytes, which is double than Bigint (long).
o    This means that index on the column will also take more space
-       Index Fragmentation is an issue and it is most likely to happen.
o    (See more about it on the Tips section below)
-       Not all databases support GUIDs.
o    So, if you migrate to mysql, you'll have to change the datatype to CHAR(32), so it takes even more space as a result


Some Best-Practices Tips:

[Tip #1]

The first, and probably the most important tip, is to avoid using CLUSTERED INDEX on GUID's.

This is because the GUID values are random – a physical sort will cause the table to be re-sorted very frequently and will also result with an excessive I/O consumption.
So do one of the following:
Either create another identity on the table, which also serves as the PRIMARY KEY (also a generic best practice of avoiding 'heap' tables)
OR -
When possible - use the NEWSEQUENTIALID function.
This will create unique GUIDs on a table but will also make sure the new generated GUID is greater than the last. (It is still globally unique as long as the machine has its own MAC address).
Let’s review a quick sample of the differences between the NEWID() function and the NEWSEQUENTIALID default value.
We'll create two tables; each uses one of the techniques:

      -- Creating the tables:    
      IF OBJECT_ID('UsersNewId') IS NULL
      BEGIN
            CREATE TABLE UsersNewId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWID(),
                  UserName VarChar(255)
            )
      END
     
      IF OBJECT_ID('UsersNewSeqId') IS NULL
      BEGIN
            CREATE TABLE UsersNewSeqId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWSEQUENTIALID(),
                  UserName VarChar(255)
            )
      END;
     
      -- Now, let's fill the tables with some data
      -- Note: I'm using SQL2008+ syntax here; On SQL2005 you need to separate the insert statements
      INSERT INTO UsersNewId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
      INSERT INTO UsersNewSeqId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
     
      -- Now, let's see the tables
      SELECT * FROM UsersNewId
      SELECT * FROM UsersNewSeqId

The results:

Note the GUID on the 2nd table looks the same on first sight, but actually it is not.
These are increasing values that still maintain uniqueness.
This is obviously much better for index fragmentation of any time!

[Tip #2]
Related to the above tip – as a DBA, try to keep the GUID generation on your side!  
Otherwise, if the GUID’s are generated by a different layer in your application (DAL or so), the GUIDs cannot be sequential.  
Besides, try to think as if the GUID’s are actually a new version of your INT identities; you wouldn’t let anyone control the auto-increase now would you?
[Tip #3]
In some cases you may want to consider surrogate keys, which will be used as the foreign key between the tables.
To use the sample at the beginning of this article – each user will have its own GUID, but with an additional “UserInternalId” column (int) which will be the table's primary key.

[Tip #4]
Don’t use GUID if you don’t have to.
When a table needs an ID which does not have to be unique across the enterprise, keep using the regular numeric id.
It takes less space and easier to manage.


Final Words:
GUIDs are great and simple solution to maintain uniqueness across the enterprise.
However, don’t overuse it when not required as it might hurt your databases performances.
This is a very wide subject – hope this in-a-nutshell post helps or at least gives you a good head start.