Universally Unique Identifiers: A closer look

A common data type we deal with in software construction is the Universally Unique Identifier, also known as the Globally Unique Identifier to those of us in the Microsoft community. However, I often find that a lot of developers misunderstand how this type should be used and how these Ids are actually generated. The basic concept is well known: they offer a set of 32 hexadecimal characters (or 16 bytes/octets – 128 bits) that is unique – either guaranteed or overwhelmingly likely, depending on the generation method – and by all appearances completely random (more on this later). They are conventionally represented as 5 hyphen-separated groups, often enclosed in curly braces, e.g. “{8-4-4-4-12}”.

Unique Identifiers are not Strings

I believe this is where a lot of the confusion stems from. We take a casual glance at the character representation of an Id like:

{F23B928A-D7DB-4CEB-AAF4-10396FDB069B}

and our brains automatically associate it with a string data type – since it’s just a set of random alphanumeric characters, or by all appearances a string.

I don’t know how many times during code reviews I’ve seen developers use strings and Ids interchangeably: passing a string to a method only to immediately instantiate a new concrete identifier type in order to invoke another method with a “Guid” parameter in its signature, building string-delimited lists of Ids, and so on.

Besides the obvious flaws in these approaches, I don’t believe most developers actually realize the unnecessary overhead they incur in both computation time and memory.
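
To make that round trip concrete, here is a minimal sketch of the pattern I keep seeing, written against the FromStringA and ConvertGuidToString routines from the sample code further down; LookupRecord and LookupRecordByString are hypothetical, purely for illustration.

#include <stdio.h>
#include <windows.h>

/* Simplified prototypes for the routines shown in the listings below. */
GUID FromStringA( const char * idString );
char * ConvertGuidToString( const GUID * guid );

/* A hypothetical API that genuinely needs the concrete type. */
void LookupRecord( const GUID * id );

/* The anti-pattern: the Id travels around as a string, so every call
   pays to parse it back into a GUID (and often to format it again). */
void LookupRecordByString( const char * idString )
{
    GUID id = FromStringA( idString );              /* string -> GUID */

    LookupRecord( &id );

    printf( "%s\n", ConvertGuidToString( &id ) );   /* GUID -> string */
}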

The GUID Type

For example, when you use the string representation of an Id where you really need the concrete type, you incur the following computational overhead: a new Id value is created and placed on the stack; depending on the language specification this memory may be “zeroed”, i.e. = { 0 } (managed languages, for example, do this for all types); and if the string representation follows the standard convention (enclosing curly braces and hyphen group separators), it must first be normalized – leaving you with the bare set of 32 hexadecimal digits.

Each of these 32 digits represents 4 bits of the data structure, so to get the binary representation you need to iterate over the digits and look up each one’s value in a base 16 (hexadecimal) table – each pair of digits forms one byte.
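
The sample code further down delegates that lookup to a Base16Encode helper that isn’t shown in the listing, so here is a minimal sketch of what such a helper might look like, assuming ASCII hexadecimal digits and no error handling.

/* Convert a null-terminated string of hexadecimal digits into its numeric
   value, 4 bits per digit – e.g. "F2" -> 0xF2, "D7DB" -> 0xD7DB. */
static unsigned long Base16Encode( const wchar_t * digits )
{
    unsigned long value = 0;

    while( *digits )
    {
        wchar_t digit = *digits++;

        value <<= 4;    /* make room for the next nibble */

        if( digit >= L'0' && digit <= L'9' )
            value |= ( unsigned long ) ( digit - L'0' );
        else if( digit >= L'a' && digit <= L'f' )
            value |= ( unsigned long ) ( digit - L'a' + 10 );
        else if( digit >= L'A' && digit <= L'F' )
            value |= ( unsigned long ) ( digit - L'A' + 10 );
    }

    return value;
}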

The “Guid” data structure’s memory layout is segmented into 4 inline blocks: Data1 (32 bits), Data2 (16 bits), Data3 (16 bits) and Data4 (an 8-element array of 8-bit characters), giving us a grand total of 128 bits.

Here is the C declaration:


typedef struct _GUID
{
    unsigned long Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char Data4[ 8 ];
} GUID;

While this is not necessarily a computational nightmare, it’s not free either. Here is some sample C code that demonstrates the process of converting an Id to a string and back. It’s obviously not perfect, but you should get the general idea.

String Conversion


EXPORT GUID __stdcall FromStringA( const char * idString )
{
    wchar_t unicodeIdString[ 39 ];

    MultiByteToWideChar( CP_ACP, 0, idString, -1, unicodeIdString, 39 );

    return FromStringW( unicodeIdString );
}

EXPORT GUID __stdcall FromStringW( const wchar_t * idString )
{
    GUID id = { 0, 0, 0, { 0, 0, 0, 0, 0, 0, 0, 0 } };

    if( wcslen( idString ) == 36
    ||
    ( wcslen( idString ) == 38 && idString[ 0 ] == '{' && idString[ 37 ] == '}' )
    )
    {
        wchar_t buffer[ 33 ], data1[ 9 ], data2[ 5 ], data3[ 5 ], data4[ 3 ];
        int index, offset = 16;

        PurifyIdSet( idString, buffer );

        GetCharacterSet( data1, buffer, 0, 8 );
        id.Data1 = ( long ) Base16Encode( data1 );

        GetCharacterSet( data2, buffer, 8, 4 );
        id.Data2 = ( unsigned short ) Base16Encode( data2 );

        GetCharacterSet( data3, buffer, 12, 4 );
        id.Data3 = ( unsigned short ) Base16Encode( data3 );

        for( index = 0; index < 8; index++ )
        {
            GetCharacterSet( data4, buffer, offset, 2 );
            id.Data4[ index ] = ( char ) Base16Encode( data4 );
            offset += 2;
        }
    }

    return id;
}

static void PurifyIdSet( const wchar_t * characters, wchar_t * buffer )
{
    int index = 0;

    while( *characters )
    {
        if( *characters != '{' && *characters != '}' && *characters != '-' )
            buffer[ index++ ] = *characters;

        characters++;
    }

    buffer[ index ] = '\0';
}
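
The listing also relies on a GetCharacterSet helper that isn’t shown; presumably it just copies a fixed-length slice of the purified digit set into a null-terminated buffer. A minimal sketch under that assumption:

/* Copy 'length' characters starting at 'offset' from the purified digit set
   into 'destination' and null-terminate it – e.g. offset 8, length 4
   extracts the Data2 group. */
static void GetCharacterSet( wchar_t * destination, const wchar_t * source, int offset, int length )
{
    int index;

    for( index = 0; index < length; index++ )
        destination[ index ] = source[ offset + index ];

    destination[ index ] = '\0';
}

Going the other way, formatting a GUID back into its conventional string form is a matter of printing each block with the right field width: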
const char * GuidOutputFormat = "{%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x}";

static char * ConvertGuidToString( const GUID * guid )
{
	static char output[ 39 ];
	static size_t len;

	if( len == 0 )
	{
		len = sizeof( output );
		ZeroMemory( &output, len );
	}

	sprintf_s
	(
		output,
		len,
		GuidOutputFormat,
		guid->Data1, guid->Data2, guid->Data3,
		guid->Data4[ 0 ], guid->Data4[ 1 ], guid->Data4[ 2 ], guid->Data4[ 3 ],
		guid->Data4[ 4 ], guid->Data4[ 5 ], guid->Data4[ 6 ], guid->Data4[ 7 ]
	);

	return output;
}
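
Putting the two together, a quick round trip might look like this (assuming the listings above live in the same translation unit and <stdio.h> is included):

int main( void )
{
    GUID id = FromStringA( "{F23B928A-D7DB-4CEB-AAF4-10396FDB069B}" );

    printf( "Data1 = %08lX\n", id.Data1 );          /* F23B928A */
    printf( "%s\n", ConvertGuidToString( &id ) );   /* prints it back in {8-4-4-4-12} form */

    return 0;
}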

Then we have the additional memory allocation. The standard “Guid” data structure is a sequential 128 bit memory block, while the string representation – assuming 8 bits per character for standard ASCII – takes 256 bits (8×32 for the base characters), plus 32 bits for the four hyphens, 16 bits for the curly braces and 8 bits for the null terminating character to end the string: a total of 312 bits, nearly two and a half times the size of the standard numerical representation – and double that again for a Unicode representation. Again, not exactly devastating, but unnecessary overhead should be avoided when there is absolutely no reason for it.
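
You can see the size difference directly with sizeof – a tiny sketch, assuming the Windows GUID structure and a braced, hyphenated string:

#include <stdio.h>
#include <windows.h>

int main( void )
{
    /* 16 bytes (128 bits) for the binary structure... */
    printf( "GUID:  %zu bytes\n", sizeof( GUID ) );

    /* ...versus 39 bytes (312 bits) for the braced, hyphenated,
       null-terminated ASCII string – and 78 bytes as wide characters. */
    printf( "char:  %zu bytes\n", sizeof( "{F23B928A-D7DB-4CEB-AAF4-10396FDB069B}" ) );
    printf( "wchar: %zu bytes\n", sizeof( L"{F23B928A-D7DB-4CEB-AAF4-10396FDB069B}" ) );

    return 0;
}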

Unique does not mean Random

Another very common misuse of the “Guid” type is using it to generate “random” data. The misconception is that since these Ids are guaranteed to be “unique” they must also be random – this is most definitely not the case.

As I mentioned previously, a generated Id appears to be a completely random set of alphanumeric characters. However, depending on the actual implementation it’s not random at all – in fact the first iteration of the UUID protocol was so deterministic that you could practically track down where (the originating machine) and when (to within fractions of a microsecond) it was generated. The only factor common to the generation protocols is that they identify themselves by placing the version number in bits 49–52: {xxxxxxxx-xxxx-Vxxx-xxxx-xxxxxxxxxxxx}, where V is the version of the protocol that generated the Id and x is a hexadecimal digit.
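
In terms of the GUID structure shown earlier, the version sits in the top nibble of Data3, so extracting it is a one-liner (a small sketch):

/* Extract the generation protocol version: the most significant 4 bits
   of Data3, i.e. the first digit of the third hyphen-separated group. */
static unsigned int GetUuidVersion( const GUID * id )
{
    return ( id->Data3 >> 12 ) & 0x0F;
}

/* For {F23B928A-D7DB-4CEB-AAF4-10396FDB069B}, Data3 is 0x4CEB,
   so GetUuidVersion returns 4 – a pseudo-random (V4) Id. */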

The first iteration of the UUID protocol (V1) used a combination of temporal and spatial locality to compute Ids. The specification calls for a 60 bit time-stamp, counting 100-nanosecond intervals since the adoption of the Gregorian calendar (15 October 1582), to fill most of the first 64 bits; since any modern CPU can generate many Ids within a single interval, a clock sequence is also kept to preserve the uniqueness constraint on the time-stamp. Also contained in the first 64 bits is the 4 bit version number, which identifies the generation protocol, in this case 1 decimal or a “0001” nibble. The next 16 bits hold the variant bits and that clock sequence (incremented or re-randomized whenever the clock is set back), and the last 48 bits are the MAC address of the machine (or, if no MAC address is available, a 47 bit cryptographic quality random number with the least-significant bit of the first octet set to 1 – the multicast bit, which can never be set in a real MAC address). There are 4 other variations of this protocol, V2–5: DCE security, MD5 hash, pseudo-random and SHA-1 hash respectively.
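
Mapped onto the GUID structure from earlier, the V1 field layout looks roughly like this (a summary of RFC 4122, not code from this article):

/* V1 (time-based) layout, per RFC 4122:
 *
 *   Data1       time_low                    low 32 bits of the time-stamp
 *   Data2       time_mid                    middle 16 bits of the time-stamp
 *   Data3       time_hi_and_version         high 12 bits of the time-stamp
 *                                           plus the 4 bit version ("0001")
 *   Data4[0]    clock_seq_hi_and_reserved   variant bits + clock sequence (high)
 *   Data4[1]    clock_seq_low               clock sequence (low)
 *   Data4[2..7] node                        48 bit MAC address (or random)
 */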

Source Code & Example Implementation

For fun I’ve implemented a hybrid of V1 (temporal) and V4 (pseudo-random); the source code is available in my UUID repository on GitHub.
