Trace Context and the road toward trace tool interoperability
The W3C Trace Context specification is a set of new standards being developed by open source and commercial tool providers that defines a unified approach to context and event correlation within distributed systems, such as microservices environments. Having such a standard will enable end-to-end transaction tracing within distributed applications across a range of monitoring tools.
To fully understand the value that a unified context propagation specification can realistically provide, you should first understand the concepts of distributed tracing, context propagation, and the related challenges.
Why we need distributed tracing
Distributed tracing is used to understand the control flow within distributed systems (i.e., how transactions flow through multiple distributed services). Distributed tracing has been around for over a decade, but it has gained renewed interest in recent years with the rise of microservices architectures. It may still be possible to track the execution flow of transactions by other means within traditional “monolithic application” environments. This is certainly not the case when working with numerous microservices, where control flows can become highly dynamic (for example, with service meshes or circuit breakers that change the execution flow of transactions at runtime).
Context propagation: Core building block of distributed tracing
In order to make distributed tracing work, we need a way to pass context information from one transaction to the next. Such a transaction context, or simply “context” for short, is represented by one or more unique identifiers that enable linkage between the client-side and the server-side of each transaction.
Without context propagation, distributed tracing is simply not possible, as there is no other reliable way of linking transactions together that preserves their context.
Consider a simple example of context being used to link two transactions together. The caller sends a header called context that contains two fields, transaction ID and parent ID. The receiving service can then use these two identifiers to link its part of the transaction to the caller’s part.
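The example above can be sketched in a few lines of Python. This is only an illustration: the header name context comes from the text, but the "transaction_id;parent_id" encoding, the 16-character IDs, and the function names are assumptions made for the sketch, not part of any specification.

```python
import uuid


def new_id() -> str:
    """Return a random 16-character hex identifier (illustrative length)."""
    return uuid.uuid4().hex[:16]


def start_transaction(incoming_headers: dict) -> dict:
    """Record this service's part of a transaction.

    If the caller sent a context header, reuse its transaction ID and
    remember the caller's ID as our parent. Otherwise start a new trace.
    """
    context = incoming_headers.get("context")
    if context:
        transaction_id, parent_id = context.split(";")
    else:
        transaction_id, parent_id = new_id(), None
    return {
        "transaction_id": transaction_id,  # shared by every part of the trace
        "parent_id": parent_id,            # who called us (None at the root)
        "own_id": new_id(),                # identifies this part of the trace
    }


def outgoing_headers(current: dict) -> dict:
    """Build the context header to send with a downstream call."""
    return {"context": f"{current['transaction_id']};{current['own_id']}"}


# Service A handles a request with no context, then calls service B.
part_a = start_transaction({})
part_b = start_transaction(outgoing_headers(part_a))
```

After this runs, part_b carries the same transaction ID as part_a, and its parent ID equals part_a's own ID, which is all a tracing backend needs to stitch the two parts together.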
Why context propagation breaks
Up to this point, the concept of Trace Context sounds pretty straightforward. It seems that all you need to do is forward a simple header — then distributed tracing works out-of-the-box, taking care of the details for you. Unfortunately, it’s not this simple. In the real world, there are challenges that must be addressed before distributed tracing can be deployed successfully within distributed environments.
There is currently no agreement as to what these tracing headers should be called. Each tool vendor uses its own HTTP header to store context information. This wasn’t an issue in the past, as traces were rarely monitored by multiple tools. Today, things are very different. In many cases, cloud applications are monitored both at the application level (by application developers) and by the cloud vendors themselves. If different tracing headers are used in such scenarios, traces are likely to break when they cross the boundaries of the respective tracing tools.
Incompatible tracing headers aren’t the only problem. As tracing headers aren’t standardized, they aren’t automatically forwarded by middleware such as routers, service meshes, or messaging systems. Again, when headers are dropped, traces break.
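To make the second failure mode concrete, here is a minimal sketch of a proxy that forwards only the headers it knows about. The allow-list and the vendor header name X-Vendor-A-Trace are hypothetical and chosen purely for illustration. Real middleware differs in how it decides what to forward.

```python
# Hypothetical allow-list of headers that a proxy passes on unchanged.
FORWARDED_HEADERS = {"content-type", "authorization", "accept"}


def forward_headers(incoming: dict) -> dict:
    """Keep only allow-listed headers (header names are case-insensitive)."""
    return {
        name: value
        for name, value in incoming.items()
        if name.lower() in FORWARDED_HEADERS
    }


request_headers = {
    "Content-Type": "application/json",
    "X-Vendor-A-Trace": "txn=42;parent=7",  # vendor-specific tracing header
}

# The proprietary tracing header is silently dropped, so the next
# service cannot link its work to the caller and the trace breaks.
forwarded = forward_headers(request_headers)
```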
TraceParent: An agreed-upon header
The challenges detailed above are why tool providers have agreed on a new standard header called TraceParent. This header won’t be dropped as it’s recognized as a standard header that must be forwarded by both tracing tools and middleware.
TraceParent might at first sound like a weird name for this header. Why not simply call it TraceContext? As always, there’s a story behind this. First of all, the Trace Context standard defines both the header itself and also the values that the header may contain. The TraceParent header accepts values that provide the essential information needed to enable distributed tracing: the transaction id and the parent id. Distributed traces can be reconstructed based on these two provided values. So, as the header identifies the parent, TraceParent isn’t such an odd name after all.
For completeness, it’s worth mentioning that there is a third part of the tracing header, which defines the sampling behavior that determines which traces are captured (or not). This information is required because most tracing systems capture only a fraction of all traces. It must be communicated to ensure that tracers within different application tiers capture the same traces and don’t create unnecessary overhead by capturing traces that will later be discarded.
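As a rough sketch of how these parts might be read from a header value, the following parser assumes the layout of the published W3C Recommendation, where the header is written in lowercase as traceparent and holds four dash-separated hexadecimal fields: a version, the trace ID (the transaction ID in the text above), the parent ID, and a flags byte whose lowest bit is the sampled flag. The draft discussed here may have differed in detail. The function and class names are illustrative, and the sketch accepts only version 00 as a simplification.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent value, or return None if it is invalid."""
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if match is None or match["version"] != "00":
        return None
    # All-zero identifiers are not valid.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    flags = int(match["flags"], 16)
    return TraceParent(
        trace_id=match["trace_id"],
        parent_id=match["parent_id"],
        sampled=bool(flags & 0x01),  # lowest bit: the caller sampled this trace
    )


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
```

A tracer that receives sampled=True knows the caller is recording this trace and can do the same. When a value does not parse, a tracer would typically start a fresh trace rather than fail the request.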
Also, TraceParent isn’t the only header used for tracing. There’s a second header called TraceState.
TraceState: Going beyond parent correlation
The TraceParent header enables parent-based correlation for the reconstruction of distributed traces. At first glance, this appears to be everything we need to maintain transaction context within distributed applications across tools. However, most implementations require more information than can be defined within a TraceParent header (for example, tenant data within a SaaS environment and other information a system needs to optimize the routing and processing of data). The TraceState header carries this additional, tool-specific information alongside TraceParent.
Using the TraceState and TraceParent headers in combination enables tools to collaborate on creating distributed traces as tools can then rely on all information being properly forwarded.
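A minimal sketch of how a tool might read and update this header follows. It assumes the format of the published W3C Recommendation, in which the header (written tracestate) is a comma-separated list of key=value entries, one per tool, and a tool that adds or changes its entry places it at the front. The vendor keys vendora and vendorb, their values, and the function names are hypothetical. Validation of key and value characters and of the maximum list length is omitted.

```python
from typing import List, Tuple

Entries = List[Tuple[str, str]]


def parse_tracestate(value: str) -> Entries:
    """Split a tracestate value into (key, value) pairs, skipping malformed members."""
    entries = []
    for member in value.split(","):
        key, sep, val = member.strip().partition("=")
        if sep and key:
            entries.append((key, val))
    return entries


def update_tracestate(entries: Entries, key: str, val: str) -> Entries:
    """Put our own entry first and keep every other tool's entry untouched."""
    others = [(k, v) for k, v in entries if k != key]
    return [(key, val)] + others


def format_tracestate(entries: Entries) -> str:
    """Serialize the entries back into a header value."""
    return ",".join(f"{k}={v}" for k, v in entries)


incoming = "vendora=tenant-17,vendorb=00f067aa0ba902b7"
entries = parse_tracestate(incoming)
outgoing = format_tracestate(update_tracestate(entries, "vendorb", "a1b2c3d4e5f60718"))
```

Here the tool behind vendorb replaces its own entry and moves it to the front, while vendora's entry passes through unchanged. That pass-through is what lets each tool rely on its information surviving hops through services instrumented by other tools.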
Tracing beyond backend systems
While Trace Context has primarily been defined to enable tracing within distributed server-side systems, it’s in no way limited to this. The advantages of starting traces on the client side in the browser are obvious. With this approach, instead of receiving only traces that begin at the web server, you receive true end-to-end traces that begin at the moment a user initiates a transaction in the browser.
This is already possible today, but again, there are no standardized means for forwarding this tracing information, resulting in the same challenges detailed above. So, eventually, trace context must be extended all the way to the browser.
The Trace Context specification
In short, the Trace Context specification is a collection of standardized HTTP headers that allow distributed tracers to communicate without dropping context information. Having the standard in place will be a milestone for the visibility that developers and operators have into distributed systems. Trace Context is only the first of the specifications needed to improve interoperability between tracing systems. An obvious next step is an agreed-upon data format that will make it possible to combine tracing information collected by different tools.