OpenTelemetry's Impact on Observability: Insights From Grafana Labs' Juraci Paixão Kröhling
OpenTelemetry is transforming observability by standardizing telemetry data. Juraci Paixão Kröhling shares insights on OTel's impact and future.
OpenTelemetry has taken cloud native by storm since its introduction in 2019. As one of the CNCF’s most active projects, “OTel” is providing much-needed standardization for telemetry data. By giving logs, traces, metrics, and profiles standard protocols and consistent attributes, OTel is reshaping the observability domain, making telemetry data a vastly more open, interoperable, and consistent experience for platform teams and developers. As OTel approaches graduation from the CNCF, DZone spoke with Juraci Paixão Kröhling, Principal Engineer at Grafana Labs and OpenTelemetry Governing Board Member, to learn more about the exciting progress of this standard and the implications of the “own your own telemetry data” movement it is driving in cloud native.
Q: What’s the general problem that OTel sought to solve when it was originally released, and why do you think this is a “right place, right time” type of situation for the project? How would you explain its popularity?
A: The origin of OpenTelemetry really began with the success of two different open-source projects that were on a collision course. OpenTracing was trying to make distributed tracing more popular by standardizing instrumentation APIs and semantic conventions, as well as best practices for distributed tracing, while OpenCensus was tackling a similar opportunity by having a “batteries-included” solution for distributed tracing and metrics instrumentation, without specifically aiming to be a standard. While the projects mostly complemented each other, there were parts that were competing, causing fragmentation in the wider distributed tracing (and observability) communities.
OpenTelemetry merged the OpenTracing and OpenCensus libraries and bridged the fragmentation of their communities, so there was a clear winner, and then gave a recipe for common semantic conventions and a common mental model, not just for tracing but for all telemetry types.
I think the reason why OTel has become so popular, so fast, is that distributed systems have such a deep requirement for open standards and interoperability. The industry has been racing forward in cloud native with heavy investment in Kubernetes and microservices, which are basically massive aggregations of infrastructure and polyglot applications, and platform and development teams need consistent signals they can trust for observing these systems in real time. OTel isn’t just giving us a common, standardized schema for working with telemetry data; it has also been a force multiplier, with databases, frameworks, and programming-language libraries all conforming to a standard approach to telemetry data.
Before OTel, observability vendors built monetization strategies around proprietary instrumentation and data formats, creating mazes that made it difficult for enterprises to switch to other providers. OTel has opened all of that up and made observability vendors compete instead on the strength of their platforms, while the underlying telemetry remains vendor-agnostic.
Q: As telemetry data today is spanning not just infrastructure, but applications, what are you seeing in the evolution of the signals that platform teams and engineers are working with, and why is polyglot support so important to OTel?
A: Things have really gotten much finer-grained. In the past, we would look at a specific application and measure things like how long a specific HTTP call was taking: very coarse-grained metrics like that. Today we can go much deeper into a business transaction that touches several microservices and understand what is happening. Why is it taking that long? Is it a specific algorithm? Is it a downstream service? We have the ability to know not only that an HTTP call is slow, but why.
As a natural extension of that evolution toward finer-grained telemetry, OTel libraries can be integrated with many of the most popular libraries and frameworks, so we can get deeper instrumentation data and see the runtime details of our applications. We are also seeing OTel being added natively to programming languages and frameworks. This is really interesting to watch because the instrumentation can more intelligently appreciate the primitives of the languages themselves and the performance characteristics expected of language-specific conventions. Think about languages like Python for AI or Java for concurrency: each language has its own native capabilities, so this standardization on OTel is pushing a lot more intelligence not only into how infrastructure and applications can be observed side by side, but also into deeper drill-downs into how applications written in specific languages are behaving.
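To make that library-level instrumentation concrete, here is a minimal sketch in Python. The service name and route are purely illustrative; it assumes a Flask app with the opentelemetry-sdk, opentelemetry-instrumentation-flask, and opentelemetry-instrumentation-requests packages installed, and in a real setup the console exporter would typically be swapped for an OTLP exporter pointed at a Collector.

```python
# Minimal sketch (illustrative names): wiring OTel's Flask and requests
# instrumentation into a service so incoming requests and outbound HTTP
# calls produce spans with standard semantic-convention attributes.
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Describe the service once; every span it emits carries these resource attributes.
resource = Resource.create({"service.name": "checkout-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = Flask(__name__)

# The instrumentation libraries hook into the framework and the HTTP client,
# creating and correlating spans without changes to the business logic.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/checkout")
def checkout():
    return "ok"
```

Running the app and hitting /checkout would emit a span for the request; changing the exporter changes where the data goes, not how it is produced.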
Q: Given all the activity around the project, can you summarize where the most active contributions have been in recent years and the main areas the community is evolving its capabilities?
A: The Collector is our biggest SIG (Special Interest Group) at the moment, but we have many contributions as well around our semantic conventions and specification repositories. SIGs related to popular programming languages, like the Java SIG, are also very active. We are seeing continued progress both on new fronts, like the new Profiling signal, and on stabilization of our current projects, like the Collector or specific semantic conventions. I’m also happy to see movement on important topics for our future, like establishing a new working group focused on standards for metrics and semantic conventions for environmental sustainability. We also have a growing End-User community, where our users share their experiences with other community members, including the maintainers of the code they use.
If you use OTel, you are invited not only to join our Monthly Discussion Groups, but also to regularly take our surveys, and, why not, start contributing to the project: it’s likely that the SIG producing the code you are running can use your help.
Q: What has been the disruptive impact of OTel on the observability vendor ecosystem?
A: In the past, we’d SSH into a server to get to the origin of a problem. Those days are long gone. Today, hundreds of pods run a distributed service, and it would be infeasible to log into all of them. So with distributed computing, we started to collect and ship telemetry data to central locations. But the way that was done before OTel, the data wasn’t aware of which machines it was coming from, and there wasn’t much cross-coordination between the telemetry data types, or even within the same telemetry type (logs, for instance) across programming languages or frameworks. Sometimes we’d record the URI of a request as “request.uri”, but sometimes it would be “URL”.
OTel came in with a very clear way to name and label telemetry. It also provides its own transport protocol, OTLP, which can optionally be used so that all signals are transmitted to different backends using the same basic mechanism. Now the specification makes it possible to tie the layers together, hop between infrastructure observability and application observability, and draw correlations that were very difficult before.
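As an illustration of that consistency, the sketch below creates a span whose attributes use OTel’s HTTP semantic-convention names (http.request.method, url.full, http.response.status_code) rather than ad-hoc keys like request.uri, and ships it over OTLP. The endpoint and service name are assumptions for the example; it presumes the opentelemetry-exporter-otlp-proto-grpc package is installed and a Collector or other OTLP-capable backend is listening on localhost:4317.

```python
# Minimal sketch (illustrative endpoint and service name): one tracer provider
# exporting spans over OTLP, with attribute names drawn from the OTel HTTP
# semantic conventions rather than ad-hoc keys.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

resource = Resource.create({"service.name": "payment-api"})
provider = TracerProvider(resource=resource)
# OTLP is the single transport: pointing the exporter at a different backend
# changes nothing about the instrumentation itself.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-api.manual")

with tracer.start_as_current_span("charge-card") as span:
    # Semantic-convention attribute names, so every backend and every other
    # service agree on what this telemetry means.
    span.set_attribute("http.request.method", "POST")
    span.set_attribute("url.full", "https://payments.example.com/charge")
    span.set_attribute("http.response.status_code", 201)
```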
Q: What do you see as the big new frontiers for OTel, beyond where it’s already thriving today? What’s around the corner?
A: We have made progress in many areas and are stabilizing others. While we have new “core” features being proposed and developed within the OTel community, I believe that what we have right now will enable us and other communities to go wider, expanding into domains that might not necessarily be our main focus. For instance, we are seeing new communities forming around CI/CD, environmental sustainability, cost tracking, and LLMs, among others.
Stabilization also opens the door for a much-needed time for reflection. What would we do differently with the knowledge we have today? The newly formed Entities specification SIG comes to mind in that context. Similarly, I can’t wait to see what’s next after we have a Collector v1.
Our profiling story is also just at the beginning: we have the specification for OTLP Profiles, and while we know that we need to integrate that with our current projects (SDKs, Collector, …), I’m eager to see what the community will come up with next. What else can we do now that we have the ability to do a deeper correlation between profiles and the other signals?
While we have Android and Swift SIGs already, I believe we’ll see more movement around mobile observability in the future as well. I hear quite frequently from developers working at retailers and FinTechs that while their backend is observable nowadays, their mobile applications still need some OTel love, given how important their apps are for their businesses today.
Of course, we can’t talk about the future without mentioning GenAI. To me, we have a vast exploration area for GenAI, starting with the obvious ones: does it make sense to create tooling that generates “manual” instrumentation for existing code? Can we use GenAI to improve existing instrumentation by ensuring it adheres to semantic conventions?