PromCon EU 2023: Observability Recap in Berlin
Explore a recap of PromCon EU 2023, the community-organized event focused on the technology and implementations around the open-source Prometheus project.
As previously mentioned, last week I was on-site at the PromCon EU 2023 event for two days in Berlin, Germany.
This is a community-organized event focused on the technology and implementations around the open-source Prometheus project, including, for example, PromQL and PromLens.
Below you'll find an overview covering insights into the talks given, often with a short recap if you don't want to browse the details. Along with the talks, it was invaluable to have the common discussions and chats that happen between talks in the breaks where you can connect with core maintainers of various aspects of the Prometheus project.
Be sure to keep an eye on the event video playlist, as all sessions were recorded and will appear there.
Let's dive right in and see what the event had to offer this year in Berlin.
This overview covers my impressions from each day of the event, though not every session is included. Let's start with a short summary of the insights gathered from the sessions, hallway chats, and the social event:
- OpenTelemetry interoperability (in all flavors) is the hot topic of the year.
- Native Histograms were a big topic over the last two years; this year they showed up here and there as having a lot of promise, but they were not a major focus of the talks.
- Perses dashboard and visualization project presented their Alpha release as a truly open-source project based on the Apache 2.0 license.
- By my count, there were ~150 attendees. All talks and lightning talks were live-streamed and will be made available on the project's YouTube channel post-event.
Day 1
The day started with a lovely walk through the center of Berlin and to the venue located on the Spree River. The event opened and jumped right into the following series of talks (insights provided inline):
What's New in Prometheus and Its Ecosystem
- Native Histograms - Efficiency and more details
- Documentation note on prometheus.io: "...Native histograms (added as an experimental feature in Prometheus v2.40). Once native histograms are closer to becoming a stable feature, this document will be thoroughly updated."
- stringlabels - Storing labels differently for a significant memory reduction
- keep_firing_for field added to alerting rules - How long an alert keeps firing after its triggering condition has cleared (see the configuration sketch after this list)
- scrape_config_files - Split Prometheus scrape configs into multiple files, avoiding one big config file (also in the sketch below)
- OTLP receiver (v2.47) - Experimental support for receiving OTLP metrics
- SNMP Exporter (v0.24) - Breaking changes: a new configuration format that splits connection settings from metrics details, making it simpler to change. Also added the ability to query multiple modules in a single scrape.
- MySQLd Exporter (v0.15) - Multi-target support, use a single exporter to monitor multiple MySQL-alike servers
- Java client (v1.0.0) - client_java with OpenTelemetry metrics and tracing support, Native Histograms
- Alertmanager - New receivers: MS Teams, Discord, Webex
- Windows Exporter - Now an official exporter; was delayed due to licensing but is in the final stages now
- Every Tuesday, the Prometheus team meets for a Bug Scrub at 11:00 UTC; see the calendar at https://prometheus.io/community.
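For readers who want to try the two configuration additions above, here is a minimal, hypothetical sketch of how they look; the file paths, metric name, and thresholds are made-up examples, so check the official Prometheus documentation before copying anything.

```yaml
# --- prometheus.yml (fragment) ---
# scrape_config_files pulls scrape configs in from separate files,
# keeping the main config small. The glob below is an example path.
scrape_config_files:
  - scrape_configs/*.yml

rule_files:
  - rules/*.yml

# --- rules/example-alerts.yml ---
# keep_firing_for keeps the alert in a firing state for a while after
# its expression stops being true, smoothing over flapping conditions.
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_errors_total[5m]) > 0.05   # example metric
        for: 5m
        keep_firing_for: 10m
```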
What’s Coming
- New AlertManager UI
- Metadata Improvements
- Exemplar Improvements
- Remote Write v2
Perses: The CNCF Candidate for Observability Visualization
Summary
The talk announced the Alpha launch of the Perses dashboard and visualization project with GitOps compatibility: purpose-built for observability data and a truly open-source alternative under the Apache 2.0 license.
Perses was born because the CNCF landscape has been missing visualization tooling projects:
- Perses - An exploration of a standard dashboard format
- Chronosphere, Red Hat, and Amadeus were presented as founding members
- GitOps friendly, static validation, Kubernetes support; you can use the Perses binary in your development environment
- Chronosphere supported its development and Red Hat is integrating the Perses package into the OpenShift Console.
- There is an exploration of its usage with Prometheus/PromLens.
- Currently only displays metrics, but Red Hat has ongoing work to integrate tracing with OpenTelemetry
- Logs are on the future wishlist.
- Feature details presented for the development of dashboards
- Includes Grafana migration tooling
I was chatting with core maintainer Augustin Husson after the talk, and they are interested in submitting Perses as an applicant for the CNCF Sandbox status.
Towards Making Prometheus OpenTelemetry Native
Summary
OpenTelemetry protocol (OTLP) support in Prometheus for metrics ingestion is experimental.
Details on the Effort
- OTLP ingestion is there experimentally (see the wiring sketch after this list).
- The experience with target_info is a big pain point at the moment.
- Takes about half the bandwidth of remote write, but 30-40% more CPU due to gzip
- New Arrow-based OTLP protocol promises half the bandwidth again at half the CPU cost; may inspire Prometheus remote write 2.0
- GitHub milestone to track
- Thinking about using collector remote config to solve "split configuration" between Prometheus server and OpenTelemetry clients
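To make the experimental OTLP path above concrete, here is a hypothetical wiring sketch: Prometheus v2.47+ started with the OTLP receiver feature flag, and an OpenTelemetry Collector forwarding metrics to it over HTTP. The hostname and ports are placeholders, and the flag name and endpoint should be verified against the current Prometheus documentation.

```yaml
# Prometheus side (assumption: v2.47+), started with the feature flag:
#   prometheus --enable-feature=otlp-write-receiver
#
# --- otel-collector-config.yaml ---
receivers:
  otlp:
    protocols:
      grpc:          # applications push OTLP metrics to the collector

exporters:
  otlphttp/prometheus:
    # Prometheus exposes its experimental OTLP ingestion under /api/v1/otlp;
    # the collector appends the /v1/metrics path itself.
    endpoint: http://prometheus.example.internal:9090/api/v1/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
```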
Planet Scale Monitoring: Handling Billions of Active Series With Prometheus and Thanos
Summary
Shopify states they are running “highly scalable globally distributed and highly dynamic” cloud infrastructure, so they are on “Planet Scale” with Prometheus.
Details on the Effort
- Huge Ruby shop, latency-sensitive, large scaling events around the retail cycle and flash sales
- HPA struggles with scaling up quickly enough
- Using StatsD to get around Ruby/Python/PHP-specific limitations on shared counters
- Backend is Thanos-based, but have added a lot on top of it (custom work)
- Have a custom operator to scale Prometheus agents by scraping the targets and seeing how many time series they have (including redistribution)
- Have a router layer on top of Thanos to decouple ingestion and storage; sounds like they're evolving into a Mimir-like setup
- Split the query layer into two deployments: one for short-term queries and one for longer-term queries
- Team and service-centric UI for alerting, integrated with SLO tracking
- Native histograms solved cardinality challenges and combined with Thanos' distributed querier to make very high cardinality queries work; as they stated, "This changed the game for us."
- When migrating from the previous observability vendor, they decided not to convert dashboards; instead, worked with developers to build new cleaner ones.
- Developers are not scoping queries well, so most fan out to all regional stores, but performance on empty responses is satisfactory, so it's not a big issue.
Lightning Talks
Summary
It's always fun to end the day with a quick series of talks that are ad-hoc collected from the attendees. Below is a list of ones I thought were interesting as well as a short summary, should you want to find them in the recordings:
- AlertManager UI: Alertmanager will get a new UI in React. ELM didn't get traction as a common language; also considering alternatives to Bootstrap
- Implementing integrals with Prometheus and Grafana: Integrals in PromQL as the inverse of rates; a pure-PromQL version of their delta counter, using sum_over_time and Grafana variables to simplify getting all the right factors (see the rule sketch after this list)
- Metrics have a DX Problem: Looking at how to do developer-focused metrics from the IDE using the autometrics-dev project on GitHub; a framework for instrumenting by function, with IDE integration to explore production metrics; an interesting idea to integrate this deeply
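As a flavor of the integrals idea, here is a hypothetical recording rule using the Riemann-sum approach I understood from the talk: sample a gauge at a fixed subquery resolution with sum_over_time and multiply by the step to approximate its integral. The metric name and windows are invented for illustration.

```yaml
groups:
  - name: integrals
    rules:
      # Approximate the integral of power_watts over the last hour:
      # sum the samples taken every 30s and multiply by the 30s step,
      # yielding joules (watts * seconds).
      - record: job:power_joules_1h:integral
        expr: sum_over_time(power_watts[1h:30s]) * 30
```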
Day 2
After the morning walk through the center of Berlin, day two provided us with some interesting material (insights provided inline):
Taming the Tsunami: Low Latency Ingestion of Push-Based Metrics in Prometheus
Summary
Overview of the metrics story at Shopify, with over 1k teams running it:
- Originally forwarding metrics "from observability vendor agent"
- Issues because that was multiplying the cardinality across exporter instances; same with sidecar model
- Built a StatsD protocol-aware load balancer
- Running as a sidecar also had ownership issues, stating, "We would be on call for every application"
- DaemonSet deployment meant resource usage and hot-spotting concerns; also cardinality, but at a lower level
- Didn't want per-instance metrics because of cardinality and metrics are more domain-level
- Roughly one exporter per 50-100 nodes
- Load balancer sanitizes label values and drops labels
- Pre-aggregation on short time scales to deal with "hot loop instrumentation;" resulted in roughly 20x reduction in bandwidth use
- Compensating for lack of per-instance metrics by looking at infrastructure metrics (KSM, cAdvisor)
- "We have close to a thousand teams right now"
Prometheus Java Client 1.0.0
Summary
V1.0.0 was released last week. This talk was an overview of some of their updates featuring native histograms and OpenTelemetry support.
- Rewrote the underlying model, so there are breaking changes; a migration module is available for Prometheus simpleclient metrics
- JavaDoc is available for the new API.
- Adopting it is almost as simple as updating the imports in your Java app (see the sketch after this list); I'm going to update the Java instrumentation example in my workshop to the new API
- Includes good examples in the project
- Exposes native + classic histograms by default, scraper's choice
- A lot more configurations available as Java properties
- Callback metrics (this is great for writing exporters)
- OTel push support (on a configurable interval)
- Allows standard OTel names (with dots), automatically replaces dots with underscores for Prometheus format
- Integrates with OTel tracing client to make exemplars work - picks exemplars from tracing context, extends tracing context to mark that trace to not get sampled away
- Despite supporting OTel, this is still a performance-minded client library
- All metric types support concurrent updates
- Dropped Pushgateway support for now, but will port it forward
- Once the JMX exporter is updated to the new client, you get these improvements there as a side effect
- Not aiming to become a full OTel library, only future-proofing your instrumentation; more lightweight and performance-focused
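To give a feel for the new API, here is a minimal sketch of 1.0.0-style instrumentation as I understand it from the talk and the project's published examples; the metric names and port are invented, and the exact builder calls should be double-checked against the client_java documentation.

```java
import io.prometheus.metrics.core.metrics.Counter;
import io.prometheus.metrics.core.metrics.Histogram;
import io.prometheus.metrics.exporter.httpserver.HTTPServer;

public class ExampleInstrumentation {
    public static void main(String[] args) throws Exception {
        // Counter with a label, registered with the default registry.
        Counter requests = Counter.builder()
                .name("example_requests_total")
                .help("Total requests handled")
                .labelNames("path")
                .register();

        // Histograms expose native and classic buckets by default;
        // the scraper decides which representation it wants.
        Histogram duration = Histogram.builder()
                .name("example_request_duration_seconds")
                .help("Request duration in seconds")
                .register();

        requests.labelValues("/hello").inc();
        duration.observe(0.042);

        // Expose the metrics endpoint on an example port.
        HTTPServer.builder()
                .port(9400)
                .buildAndStart();
        System.out.println("Metrics exposed on http://localhost:9400/metrics");
        Thread.currentThread().join();
    }
}
```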
Lightning Talks
Summary
Again, here is a list of lightning talks I thought were interesting from the final day and a short summary, should you want to find them in the recordings:
- Tracking object storage costs
- Trying to measure object storage costs, as they are the number 2 cost in their cloud bills; built a Prometheus Price Exporter
- Object storage cost is ~half of Grafana's cloud bill; varies by customer (can be as low as 2%)
- Trick for extending sparse metrics with zeroes: or on() vector(0) (see the rule sketch after this list)
- They have a prices exporter in the works; promised to open source it
- Prom operator - what's next?
- Tour of some more features coming in the Prometheus Operator: shard autoscaling, scrape classes, support for Kubernetes events, and Prometheus agent deployment as a DaemonSet
- Prometheus adoption stats
- 868k users in 2023 (up from 774k last year), based on Grafana instances which have at least one Prometheus data source enabled
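Returning to the zero-fill trick from the object storage talk above, here is a hypothetical recording rule showing how or on() vector(0) substitutes a zero sample when a sparse expression returns nothing, so downstream sums and graphs have no gaps; the metric name is an invented example.

```yaml
groups:
  - name: object-storage-costs
    rules:
      # If no cost samples exist for the window, fall back to a literal 0
      # instead of returning an empty result.
      - record: total:object_storage_cost_dollars:sum
        expr: sum(object_storage_cost_dollars) or on() vector(0)
```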
Final Impressions
For the second straight year, this event left me with the feeling that the attendees were both passionate and knowledgeable about the metrics monitoring tooling in the Prometheus ecosystem. This was not really a "getting started" event; most sessions assume you are coming for in-depth dives into the various elements of the Prometheus project, almost giving you glimpses into the research progress behind features being improved in coming versions of Prometheus.
It remains well worth your time if you are active in the monitoring world, even if you are not using open source or Prometheus: you will gain insights into where monitoring features are headed.