PromCon EU 2023: Observability Recap in Berlin
Explore a recap of PromCon EU 2023, the community-organized event focused on the technology and implementations around the open-source Prometheus project.
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany.
This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens.
Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the common discussions and chats that happen between talks in the breaks where you can connect with core maintainers of various aspects of the Prometheus project.
Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there.
Let's dive right in and see what the event had to offer this year in Berlin.
This overview covers my impressions from each day of the event, though not every session is included. Let's start with a short summary of the insights gathered from the sessions, hallway chats, and the social event:
- OpenTelemetry interoperability (in all flavors) is the hot topic of the year.
- Native Histograms were a big topic over the last two years; this year they showed up here and there as having a lot of promise, but they were not a major focus of the talks.
- Perses dashboard and visualization project presented their Alpha release as a truly open-source project based on the Apache 2.0 license.
- By my count, there were ~150 attendees. All talks and lightning talks were live-streamed and will be made available on the project's YouTube channel post-event.
Day 1
The day started with a lovely walk through the center of Berlin and to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline):
What's New in Prometheus and Its Ecosystem
- Native Histograms - Efficiency and more details
- Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated."
- stringlabels - Storing labels differently for a significant memory reduction
- keep_firing_for field added to alerting rules - How long an alert keeps firing after its triggering condition has cleared (see the configuration sketch after this list)
- scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding one big config file (also in the sketch below)
- OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics
- SNMP Exporter (v0.24) - Breaking changes: a new configuration format that splits connection settings from metrics details, making it simpler to change. Also added the ability to query multiple modules in a single scrape.
- MySQLd Exporter (v0.15) - Multi-target support, use a single exporter to monitor multiple MySQL-alike servers
- Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms
- Alertmanager - New receivers: MS Teams, Discord, Webex
- Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now
- Every Tuesday, the Prometheus team meets for a Bug Scrub at 11:00 UTC; see the calendar at https://prometheus.io/community.
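For readers who want to try the two configuration additions above, here is a minimal, hypothetical sketch of how they look; the file paths, metric name, and thresholds are made-up examples, so check the official Prometheus documentation before copying anything.

```yaml
# --- prometheus.yml (fragment) ---
# scrape_config_files pulls scrape configs in from separate files,
# keeping the main config small. The glob below is an example path.
scrape_config_files:
  - scrape_configs/*.yml

rule_files:
  - rules/*.yml

# --- rules/example-alerts.yml ---
# keep_firing_for keeps the alert in a firing state for a while after
# its expression stops being true, smoothing over flapping conditions.
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_errors_total[5m]) > 0.05   # example metric
        for: 5m
        keep_firing_for: 10m
```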
What’s Coming
- New AlertManager UI
- Metadata Improvements
- Exemplar Improvements
- Remote Write v2
Perses: The CNCF Candidate for Observability Visualization
Summary
The talk announced the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility: purpose-built for observability data and a truly open-source alternative under the Apache 2.0 license.
Perses was born because the CNCF landscape has been missing visualization tooling projects:
- Perses - An exploration of a standard dashboard format
- Chronosphere, Red Hat, and Amadeus were presented as founding members
- GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment
- Chronosphere supported its development and Red Hat is integrating the Perses package into the OpenShift Console.
- There is an exploration of its usage with Prometheus/PromLens.
- Currently only displays metrics, but Red Hat has ongoing work to integrate tracing with OpenTelemetry
- Logs are on the future wishlist.
- Feature details presented for the development of dashboards
- Includes Grafana migration tooling
I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for the CNCF Sandbox status.
Towards Making Prometheus OpenTelemetry Native
Summary
OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental.
Details on the Effort
- OTLP ingestion is there experimentally (see the wiring sketch after this list).
- The experience with target_info is a big pain point at the moment.
- Takes about half the bandwidth of remote write, but 30-40% more CPU due to gzip
- New Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; may inspire Prometheus remote write 2.0
- GitHub milestone to track
- Thinking about using collector remote config to solve "split configuration" between Prometheus server and OpenTelemetry clients
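To make the experimental OTLP path above concrete, here is a hypothetical wiring sketch: Prometheus v2.47+ started with the OTLP receiver feature flag, and an OpenTelemetry Collector forwarding metrics to it over HTTP. The hostname and ports are placeholders, and the flag name and endpoint should be verified against the current Prometheus documentation.

```yaml
# Prometheus side (assumption: v2.47+), started with the feature flag:
#   prometheus --enable-feature=otlp-write-receiver
#
# --- otel-collector-config.yaml ---
receivers:
  otlp:
    protocols:
      grpc:          # applications push OTLP metrics to the collector

exporters:
  otlphttp/prometheus:
    # Prometheus exposes its experimental OTLP ingestion under /api/v1/otlp;
    # the collector appends the /v1/metrics path itself.
    endpoint: http://prometheus.example.internal:9090/api/v1/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
```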
Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos
Summary
Shopify states they are running “highly scalable globally distributed and highly dynamic” cloud infrastructure, so they are on “Planet Scale” with Prometheus.
Details on the Effort
- Huge Ruby shop, latency-sensitive, large scaling events around the retail cycle and flash sales
- HPA struggles with scaling up quickly enough
- Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters
- Backend is Thanos-based, but have added a lot on top of it (custom work)
- Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
- Have a router layer on top of Thanos to decouple ingestion and storage; sounds like they're evolving into a Mimir-like setup
- Split the query layer into two deployments: one for short-term queries and one for longer-term queries
- Team and service-centric UI for alerting, integrated with SLO tracking
- Native histograms solved cardinality challenges and combined with Thanos' distributed querier to make very high cardinality queries work; as they stated, "This changed the game for us."
- When migrating from the previous observability vendor, they decided not to convert dashboards; instead, worked with developers to build new cleaner ones.
- Developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue.
Lightning Talks
Summary
It's always fun to end the day with a quick series of talks that are ad-hoc collected from the attendees. Below is a list of ones I thought were interesting as well as a short summary, should you want to find them in the recordings:
- AlertManager UI: Alertmanager will get a new UI in React. ELM didn't get traction as a common language; also considering alternatives to Bootstrap
- Implementing integrals with Prometheus and Grafana: Integrals in PromQL as the inverse of rates; a pure-PromQL version of their delta counter, using sum_over_time and Grafana variables to simplify getting all the right factors (see the rule sketch after this list)
- Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore production metrics; an interesting idea to integrate this deeply
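As a flavor of the integrals idea, here is a hypothetical recording rule using the Riemann-sum approach I understood from the talk: sample a gauge at a fixed subquery resolution with sum_over_time and multiply by the step to approximate its integral. The metric name and windows are invented for illustration.

```yaml
groups:
  - name: integrals
    rules:
      # Approximate the integral of power_watts over the last hour:
      # sum the samples taken every 30s and multiply by the 30s step,
      # yielding joules (watts * seconds).
      - record: job:power_joules_1h:integral
        expr: sum_over_time(power_watts[1h:30s]) * 30
```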
Day 2
After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline):
Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus
Summary
Overview of the metrics story at Shopify, with over 1k teams running it:
- Originally forwarding metrics "from observability vendor agent"
- Issues because that was multiplying the cardinality across exporter instances; same with sidecar model
- Built a StatsD protocol-aware load balancer
- Running as a sidecar also had ownership issues, stating, "We would be on call for every application"
- DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level
- Didn't want per-instance metrics because of cardinality and metrics are more domain-level
- Roughly one exporter per 50-100 nodes
- Load balancer sanitizes label values and drops labels
- Pre-aggregation on short time scales to deal with "hot loop instrumentation;" resulted in roughly 20x reduction in bandwidth use
- Compensating for lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
- "We have close to a thousand teams right now"
Prometheus Java Client 1.0.0
Summary
V1.0.0 was released last week. This talk was an overview of some of their updates featuring native histograms and OpenTelemetry support.
- Rewrote the underlying model, so there are breaking changes; a migration module is available for Prometheus simpleclient metrics
- JavaDoc is available for the new API.
- Adopting it is almost as simple as updating the imports in your Java app (see the sketch after this list); I'm going to update the Java instrumentation example in my workshop to the new API
- Includes good examples in the project
- Exposes native + classic histograms by default, scraper's choice
- A lot more configurations available as Java properties
- Callback metrics (this is great for writing exporters)
- OTel push support (on a configurable interval)
- Allows standard OTel names (with dots), automatically replaces dots with underscores for Prometheus format
- Integrates with OTel tracing client to make exemplars work - picks exemplars from tracing context, extends tracing context to mark that trace to not get sampled away
- Despite supporting OTel, this is still a performance-minded client library
- All metric types support concurrent updates
- Dropped Pushgateway support for now, but will port it forward
- Once the JMX exporter is updated to the new client, you get these improvements there as a side effect
- Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused
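To give a feel for the new API, here is a minimal sketch of 1.0.0-style instrumentation as I understand it from the talk and the project's published examples; the metric names and port are invented, and the exact builder calls should be double-checked against the client_java documentation.

```java
import io.prometheus.metrics.core.metrics.Counter;
import io.prometheus.metrics.core.metrics.Histogram;
import io.prometheus.metrics.exporter.httpserver.HTTPServer;

public class ExampleInstrumentation {
    public static void main(String[] args) throws Exception {
        // Counter with a label, registered with the default registry.
        Counter requests = Counter.builder()
                .name("example_requests_total")
                .help("Total requests handled")
                .labelNames("path")
                .register();

        // Histograms expose native and classic buckets by default;
        // the scraper decides which representation it wants.
        Histogram duration = Histogram.builder()
                .name("example_request_duration_seconds")
                .help("Request duration in seconds")
                .register();

        requests.labelValues("/hello").inc();
        duration.observe(0.042);

        // Expose the metrics endpoint on an example port.
        HTTPServer.builder()
                .port(9400)
                .buildAndStart();
        System.out.println("Metrics exposed on http://localhost:9400/metrics");
        Thread.currentThread().join();
    }
}
```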
Lightning Talks
Summary
Again, here is a list of lightning talks I thought were interesting from the final day and a short summary, should you want to find them in the recordings:
- Tracking object storage costs
- Trying to measure object storage costs, as they are the number 2 cost in their cloud bills; built a Prometheus Price Exporter
- Object storage cost is ~half of Grafana's cloud bill; varies by customer (can be as low as 2%)
- Trick for extending sparse metrics with zeroes: or on() vector(0) (see the rule sketch after this list)
- They have a prices exporter in the works; promised to open source it
- Prom operator - what's next?
- Tour of some more features coming in the Prometheus Operator: shard autoscaling, scrape classes, support for Kubernetes events, and Prometheus agent deployment as a DaemonSet
- Prometheus adoption stats
- 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled
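Returning to the zero-fill trick from the object storage talk above, here is a hypothetical recording rule showing how or on() vector(0) substitutes a zero sample when a sparse expression returns nothing, so downstream sums and graphs have no gaps; the metric name is an invented example.

```yaml
groups:
  - name: object-storage-costs
    rules:
      # If no cost samples exist for the window, fall back to a literal 0
      # instead of returning an empty result.
      - record: total:object_storage_cost_dollars:sum
        expr: sum(object_storage_cost_dollars) or on() vector(0)
```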
Final Impressions
For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling in the Prometheus ecosystem. This was not really a "getting started" event; most sessions assume you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in coming versions of Prometheus.
It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into where monitoring features are headed.