February 9, 2021

Azure Tips & Tricks: Preventing Headaches with Cosmos Document IDs

Post by: Todd Taylor

Todd M. Taylor is a Solution Architect with a passion for producing quality software using Microsoft Azure technologies.

An alternate title for this blog post could be, “Learn from other people’s mistakes so you don’t repeat them”. The following simple tip regarding Cosmos DB’s document identifier property may help you avoid creating a hard-to-find bug in your code.

To Auto-Generate or Not to Auto-Generate a Document ID…That is the Question

The JSON sample below represents the most basic Cosmos document generated using C# and the Azure Cosmos DB .NET SDK:

{
    "id": "26d66af7-3d5a-4871-9b70-041ea23be18b",
    "_rid": "kc1WAKbP0oUMAAAAAAAAAA==",
    "_self": "dbs/kc1WAA==/colls/kc1WAKbP0oU=/docs/kc1WAKbP0oUMAAAAAAAAAA==/",
    "_etag": "\"00000000-0000-0000-e2cc-b90017ab01d6\"",
    "_attachments": "attachments/",
    "_ts": 1609787195
}

Note that the “id” property, a.k.a. the document ID, is a GUID value disguised as a string. That’s because I didn’t tell the Cosmos SDK what value the document ID should be and the SDK decided the value for me. This is the default behavior of the Cosmos SDK. All of the other properties that have underscore prefixes have values generated by Cosmos and cannot be set by the developer, so we’ll ignore them for the sake of this blog.

Letting the Cosmos SDK auto-generate a GUID for the document ID is easy to do early in a project when requirements are being sorted out and it’s not clear what the document ID should be. However, resist the temptation to auto-generate the document ID unless you have a darn good reason to do so! It is a better practice to set the document ID to a value that has actual business meaning. (Database architects reading this blog will probably get all teary-eyed reading the previous sentence.)

What’s Wrong with an Auto-Generated Document ID?

To illustrate the issue that can arise with an auto-generated ID, I’m going to use C# snippets since it is through code that one interacts with the database. The following C# model class represents the basic JSON document with the Customer.ID property mapping to the JSON document “id”:

public class Customer
{
    [JsonProperty("id")]
    public string ID { get; set; }
}

If the document ID is a meaningless auto-generated value from the business’s perspective, it’s only a matter of time before an actual ID value is needed. There is likely a system of record that assigns each customer an ID that the business understands and uses across many systems.

When the developer realizes that the Customer.ID field value is an auto-generated GUID that doesn’t actually represent the business’s definition of a customer ID, the developer will add another ID property, like Customer.CustomerID:

public class Customer
{
    [JsonProperty("id")]
    public string ID { get; set; }

    [JsonProperty("customerId")]
    public string CustomerID { get; set; }
}

Through tribal knowledge, the development team might know that the Customer.ID property is a meaningless GUID value, but tribal knowledge rarely propagates throughout the entire tribe. Unless you are the developer who created the Customer class, it’s likely that you don’t know whether Customer.ID is meaningful or not.

If both the Customer.ID and Customer.CustomerID properties contain the same value in the Cosmos document, it’s unlikely that bugs will occur in code regardless of which property is used. However, if Customer.ID is an auto-generated GUID and Customer.CustomerID is not, this is a bug just waiting to torment you. (Don’t ask me how I know.) One developer will get the customer ID via Customer.ID and another developer will get the customer ID via Customer.CustomerID because it’s not remotely clear which value to use.
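
To make the failure mode concrete, here is a minimal sketch. The Order class, the OrderLookup helper, and the sample values are hypothetical stand-ins for any downstream code that keys its data on the business customer ID, and the JSON attributes are left off so the snippet has no package dependencies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Customer
{
    public string ID { get; set; }          // auto-generated GUID (the document "id")
    public string CustomerID { get; set; }  // ID assigned by the system of record
}

// Hypothetical downstream record keyed by the business customer ID.
public class Order
{
    public string CustomerID { get; set; }
    public decimal Total { get; set; }
}

public static class OrderLookup
{
    // Returns the orders whose CustomerID equals the supplied key.
    public static List<Order> ForCustomer(IEnumerable<Order> orders, string customerKey) =>
        orders.Where(o => o.CustomerID == customerKey).ToList();
}

public static class Demo
{
    // Runs the same lookup with each of the two "customer ID" properties.
    public static (int viaCustomerId, int viaId) Run()
    {
        var customer = new Customer
        {
            ID = "26d66af7-3d5a-4871-9b70-041ea23be18b",
            CustomerID = "C-1001" // hypothetical business ID
        };
        var orders = new List<Order>
        {
            new Order { CustomerID = "C-1001", Total = 42.50m }
        };

        // Developer A uses the business ID and finds the order.
        int viaCustomerId = OrderLookup.ForCustomer(orders, customer.CustomerID).Count;
        // Developer B uses the document ID and silently finds nothing.
        int viaId = OrderLookup.ForCustomer(orders, customer.ID).Count;
        return (viaCustomerId, viaId);
    }
}
```

Both calls compile and neither throws, which is what makes this kind of bug hard to find: the second lookup simply comes back empty.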
 

Clean Code Violation! Creating a class with two properties that appear to have the same meaning is a small but potentially costly mistake. Developers will be forced to guess which value to use, which will inevitably result in a bug that could’ve easily been avoided.

Therefore, strive to set the document ID value, i.e. the Customer.ID value, to the business identifier of the object it represents and not an auto-generated value.
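
One way to follow this advice is to make the business identifier the only way to construct the object. The following is a sketch, not a prescription: it assumes the Newtonsoft.Json (Json.NET) package is referenced, as the [JsonProperty] attribute in the earlier snippets implies, and the constructor and its validation are my own illustrative additions.

```csharp
using System;
using Newtonsoft.Json;

public class Customer
{
    // The document "id" is the customer ID from the system of record,
    // so there is no second ID property to confuse it with.
    [JsonProperty("id")]
    public string ID { get; }

    // Assumption: Json.NET deserializes through this constructor,
    // matching the JSON "id" property to the parameter by name.
    public Customer(string id)
    {
        if (string.IsNullOrWhiteSpace(id))
            throw new ArgumentException("A business customer ID is required.", nameof(id));
        ID = id;
    }
}
```

With this shape, new Customer("C-1001") (a hypothetical ID) is the only way to create a customer, and forgetting to supply the ID fails immediately instead of quietly producing a GUID.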

What If It’s Too Late!?

If you’ve already created a Cosmos collection full of documents with auto-generated document ID values that don’t accurately represent the business’s unique identifier, then I recommend fixing this issue sooner rather than later, despite how unpopular you will be for even suggesting such a thing. The longer you wait, the more costly the fix will be as more developers start randomly picking which ID value to use in more parts of the code base.

The easiest way to flush out the problem is to change the setter on the property that represents the document ID to be internal, assuming that your data access code is isolated from the rest of the code that uses it.

public class Customer
{
    [JsonProperty("id")]
    public string ID { get; internal set; }

    [JsonProperty("customerId")]
    public string CustomerID { get; set; }
}

Once this small code change is made, any code in other assemblies that assigns a value to the document ID will bubble to the surface as compiler errors in Visual Studio. Because the setter is internal, the code responsible for inserting a Customer document into the database still has access to it, provided it lives in the same assembly as the Customer class, but externally dependent code does not. Code that only reads Customer.ID will still compile at this stage.

Once that small change has proliferated through the code base, one might consider changing the property’s access modifier from public to internal, which hides the property from external code entirely and eliminates the confusion.

Is It Ever OK to Auto-Generate the Document ID?

No! Just kidding ;). The default answer in software development is always, “It depends.”

One of the best uses of NoSQL databases like Cosmos DB is for logging data, such as data generated by IoT devices. In that case, auto-generating the document ID for log data may be a good option. Even then, it’s worth considering whether the ID should be a meaningful value, such as a concatenation of the device ID and the timestamp. Creating a meaningful ID can make troubleshooting and support much easier than sifting through hundreds of thousands of records with GUID identifiers.
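
As a rough illustration of such a concatenated ID, the sketch below joins a device ID and a UTC timestamp. The ReadingId class, the underscore separator, and the timestamp format are all hypothetical choices of mine. The restricted-character check reflects my understanding of the documented rule that a Cosmos document id may not contain '/', '\', '?', or '#'.

```csharp
using System;
using System.Globalization;

public static class ReadingId
{
    // Characters that, as far as I know, are not allowed in a Cosmos document id.
    private static readonly char[] Restricted = { '/', '\\', '?', '#' };

    // Builds an ID such as "thermostat-42_20210104T190635000Z".
    public static string Build(string deviceId, DateTimeOffset timestamp)
    {
        if (string.IsNullOrWhiteSpace(deviceId))
            throw new ArgumentException("A device ID is required.", nameof(deviceId));
        if (deviceId.IndexOfAny(Restricted) >= 0)
            throw new ArgumentException("The device ID contains a character not allowed in a document ID.", nameof(deviceId));

        // Fixed-width UTC timestamp with milliseconds, so IDs for one device sort by time.
        string stamp = timestamp.UtcDateTime.ToString(
            "yyyyMMdd'T'HHmmssfff'Z'", CultureInfo.InvariantCulture);

        return deviceId + "_" + stamp;
    }
}
```

Calling ReadingId.Build("thermostat-42", new DateTimeOffset(2021, 1, 4, 19, 6, 35, TimeSpan.Zero)) returns "thermostat-42_20210104T190635000Z". Two readings from the same device in the same millisecond would still collide, so a high-volume device may need an extra component such as a sequence number.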
