Now that we have Prometheus configured and up and running, let's start working with the key topics of data and alerts.
Collecting Data
Firstly, let's investigate how to collect data from targets.
Exporters
Prometheus' popularity as the de facto standard for monitoring is in no small part down to the richness of the exporter ecosystem. Out of the box, Prometheus supports several official exporters, whilst the community has created a portfolio of exporters that cover popular hardware, software, databases, logging tools, etc.
An exporter is responsible for extracting native metrics from the underlying system, converting them into the Prometheus exposition format, and exposing them over an HTTP endpoint so Prometheus can poll them. A good example is the node_exporter, which exports key operating system metrics such as the boot time, as shown below:
node_boot_time_seconds{instance="host.docker.internal:9100", job="m1pro"} 1657525619.404136
To debug exporters, navigate to the /metrics endpoint on the target host, where the metrics will be displayed in plain text in a browser, or use the Status | Targets panel on the Prometheus dashboard, as shown:
Figure 2
In the example above, two targets are configured, and both are currently up. This view also shows the associated labels, last scrape time, and scrape duration.
It is also relatively easy to write custom exporters if you are unable to find one that matches your exact needs. To discover the available exporters, use the ExporterHub portal, the PromCat portal, or the official GitHub repository.
Service Discovery
Once we have configured exporters, Prometheus needs to know how to discover these targets, and this is another area where Prometheus really shines. As discussed in the configuration section above, Prometheus can use static configurations; however, it is the variety of dynamic discovery options that are particularly powerful.
Some of the prominent integrated service discovery mechanisms include Docker, Kubernetes, OpenStack, Azure, EC2, and GCE. As an example, let's look at monitoring a Docker instance: first, Docker must be configured to emit metrics, and then a docker_sd_configs section referencing the Docker host must be added to the Prometheus configuration. These are the only two steps needed to monitor the Docker host and its containers.
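To make the second step concrete, here is a minimal sketch of such a scrape configuration; the job name, Docker socket address, and relabeling are illustrative assumptions rather than required values:

scrape_configs:
  - job_name: "docker"                           # illustrative job name
    docker_sd_configs:
      - host: "unix:///var/run/docker.sock"      # assumed local Docker daemon socket
    relabel_configs:
      # Optionally expose the discovered container name as a label.
      - source_labels: [__meta_docker_container_name]
        target_label: container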
Application Instrumentation
There is one more important source of metric data: via application instrumentation libraries. Since Prometheus is agnostic to the meaning of the metrics, application developers can use client libraries to emit any application metric of their choosing to Prometheus — for example, the number of failed accesses to a database. By monitoring application-level metrics, developers are able to diagnose both runtime errors and performance bottlenecks.
Official support is provided for the following languages: Go, Java or Scala, Python, Ruby, and Rust. Many other unofficial libraries also exist for additional popular languages.
The use of client libraries is best illustrated with a client code sample, in this case, Python:
from prometheus_client import start_http_server, Summary
import random
import time

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(8000)
    # Generate some requests.
    while True:
        process_request(random.random())
This simple code snippet shows what is necessary to emit metrics from Python, namely:
- Import the prometheus_client library
- Create a metric called 'request_processing_seconds'
- Use a Python decorator, @REQUEST_TIME.time(), to measure the process_request() function
- Call the start_http_server(8000) function to start the metric server
The metrics are now exposed at http://localhost:8000/.
Push Gateway
Finally, there is one more method to serve metrics to Prometheus: the Pushgateway. This solves a specific problem with the Prometheus pull model, namely, how to ingest metrics from short-lived tasks that disappear before Prometheus is able to poll them. To use the Pushgateway, it is necessary to run an instance of the Pushgateway server, configure Prometheus to scrape it, and then use the appropriate client code to push metrics to the Pushgateway (push model).
The Prometheus team does not recommend the widespread use of this model, as it adds complexity and a single point of failure.
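As a minimal sketch, a short-lived Python batch job could push a completion timestamp like this, assuming a Pushgateway is reachable at localhost:9091 (its default port) and using an illustrative job name of batchA:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Use a dedicated registry so only this job's metrics are pushed.
registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime',
          'Last time the batch job successfully finished',
          registry=registry)
g.set_to_current_time()

# Push the metric to the Pushgateway; Prometheus then scrapes the gateway.
push_to_gateway('localhost:9091', job='batchA', registry=registry)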
Working With Data
Now that we have looked at the various ways data can be ingested, let's explore how we can work with this data.
Querying Data With PromQL
Core to Prometheus is its dedicated query language, PromQL, which is designed specifically for querying a time series database. If you come from a background of using SQL, the query structure will take some getting used to, but it is easily mastered with practice or by adapting the many samples available.
A query returns one of four data types, namely:
- Instant vector – A set of time series, each containing a single sample, all sharing the same timestamp. Typically, this retrieves the current value.
- Range vector – A range of time series comprising a series of data points over a specified time for each time series. Think of this as a two-dimensional array.
- Scalar – A simple numeric floating-point value (e.g., sum, average).
- String – A simple string value (not in general use at present).
To illustrate basic PromQL queries and results, I will use an instance of Prometheus in Docker and the node_exporter running on the host to produce demo data.
The most basic query retrieves the last sample in each matching series, for example:
go_threads
A typical result is returned below, in this case indicating how many Go threads are active on each target:
go_threads{instance="host.docker.internal:9100", job="m1pro"} 16
go_threads{instance="localhost:9090", job="prometheus"} 11
Note that two time series are returned since two targets are being monitored, both of which export go_threads. We can apply a label selector to the query to further refine the result set, as follows:
go_threads{job="m1pro"}
In my demo instance, the following result is returned:
go_threads{instance="host.docker.internal:9100", job="m1pro"} 16
Selectors can be combined and support regular expressions.
Now, let's add a range selector to return a data series over a 1-minute window:
go_threads{job="m1pro"}[1m]
In my demo instance, the following result is returned:
go_threads{instance="host.docker.internal:9100", job="m1pro"}
16 @1657708139.523
16 @1657708154.523
16 @1657708169.525
16 @1657708184.526
We now have a time series of data spanning 1 minute (using a 15-second poll interval) rather than the single instant values received previously.
PromQL supports a number of standard binary operators (e.g., addition, subtraction, division, multiplication), which can be used to produce calculated metrics. Beware of mixing incompatible result types, for example, applying a binary operator between a range vector and a scalar. In addition to these binary operators, Prometheus also supports several aggregation operators, which can produce summary results such as sum(), min(), max(), avg(), etc.
Additionally, Prometheus has powerful built-in functions that can perform mathematical and statistical operations on time series data.
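As a brief illustration combining both, the following query sketch computes the average CPU utilization per instance over the last five minutes, assuming the node_exporter counter node_cpu_seconds_total is being scraped:

100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

Here, rate() converts the per-CPU idle counter into an idle fraction, avg by (instance) averages it across CPUs, and subtracting from 1 yields the busy fraction.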
Recording Rules
Prometheus provides a feature called recording rules to perform precomputed queries on a time series to produce a new, derived time series. The primary advantage of a precomputed time series is to avoid repeated, computationally intensive queries — it is more efficient to perform the computation once and store the result. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.
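As a sketch, the CPU utilization query above could be stored as a derived series with a rule file like the following (the group name and recorded series name are hypothetical); the file is then referenced from the rule_files section of the Prometheus configuration:

groups:
  - name: node_recording_rules                    # illustrative group name
    rules:
      - record: instance:node_cpu_busy:ratio_5m   # hypothetical derived series name
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))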
Visualizing Data With Grafana
The jewel in the crown for visualizing Prometheus metrics is via integration with the equally popular open-source tool, Grafana. Recent versions of Grafana provide native support for Prometheus servers, allowing developers to run Prometheus queries within Grafana to produce dashboards that include graphs, heatmaps, trends, dials, etc.
Here is an example Grafana dashboard showing local node_exporter metrics:
Figure 3
The only steps required for connecting Grafana to Prometheus are to add a data source and then to add the widgets or dashboards as needed. Grafana provides numerous ready-made dashboards to visualize common scenarios, and Prometheus is well supported.
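For a reproducible setup, the data source can also be provisioned from a file instead of through the UI. A minimal sketch, assuming Grafana's standard provisioning directory (provisioning/datasources/) and a Prometheus server at localhost:9090, might look like:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090    # assumed Prometheus address
    isDefault: true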
Working With Alerts
Key to the Prometheus philosophy is focusing on what it does best and delegating other concerns to dedicated tools, as with visualization and Grafana. This holds true for managing and generating alerts from the underlying Prometheus data series.
Using Alertmanager
The standalone component, Alertmanager, performs the key functions related to alerts, such as deduplicating alerts, grouping related alerts, routing them to the relevant receivers, integrating with standard notification providers, and suppressing noisy alerts as necessary. Additionally, as a separate component, Alertmanager is designed with high availability (HA) in mind.
Alertmanager is also configured via a configuration file that specifies the following:
- Global settings such as SMTP and various timeouts
- Configuration of routes for alerts to be sent
- Configuration of one or more receivers that will receive the alerts
Common receivers include email, Slack, OpsGenie, PagerDuty, Telegram, and standard webhooks.
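To make the structure concrete, here is a minimal alertmanager.yml sketch; the SMTP host, email addresses, receiver name, and timing values are illustrative assumptions:

global:
  smtp_smarthost: "smtp.example.org:587"   # assumed SMTP relay
  smtp_from: "alertmanager@example.org"

route:
  receiver: ops-email                      # default receiver for all alerts
  group_by: ["alertname", "job"]
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: ops-email
    email_configs:
      - to: "oncall@example.org"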
Alerting Rules
To generate alerts, Prometheus is configured with alerting rules, which live in the same rule files as the recording rules discussed previously; the alert condition itself is simply a Prometheus query expression. Templates, based on the Go templating engine, can be used to enrich and prettify the alert content that is sent. Again, the community shines with a fine selection of ready-made alerting rules.
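A minimal alerting rule sketch (the group name, alert name, and threshold duration are illustrative) that fires when a scrape target stops responding, and uses a Go template in its annotation, might look like:

groups:
  - name: node_alerts                        # illustrative group name
    rules:
      - alert: InstanceDown                  # hypothetical alert name
        expr: up == 0                        # true when a target fails its scrape
        for: 5m                              # must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 5 minutes."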
Using Prometheus in Production
Finally, let's review some considerations for using Prometheus in production.
Scaling and Augmenting Prometheus
Prometheus is used in major enterprise organizations and can scale to enormous capacity. To understand your scaling requirements, consider the following questions:
- How many metrics can your monitoring system ingest, and how many do you need?
- What is your metrics cardinality? Cardinality is the number of unique time series produced by the combinations of label values on each metric. High cardinality is a common issue with metrics coming from dynamic environments, where containers get a different ID or name every time they start, restart, or are moved between nodes.
- Do you need high availability?
- How long do you need to keep metrics, and with what resolution?
Implementing HA is tricky: clustering requires third-party add-ons to the Prometheus server, you need to deal with backups and restores, and storing metrics for an extended period will make your database grow very large.
Long-Term Storage for Prometheus
Prometheus servers provide local persistent storage, but Prometheus was not designed for distributed metric storage across multiple nodes in a cluster, with replication and healing capabilities. That capability, known as long-term storage, is a requirement in a few common use cases, such as:
- Capacity planning to monitor how your infrastructure needs to evolve
- Chargebacks, so you can account for and bill different teams or departments for their specific use of the infrastructure
- Analyzing usage trends
- High availability and resiliency
- Adhering to regulations for certain verticals like banking, insurance, etc.
Security
A note on security considerations for operating Prometheus in production environments: presently, Prometheus and its associated components allow open, unauthenticated access, including to raw data and administrative functionality. As such, users should ensure the following:
- Restrict access to the network segment associated with Prometheus.
- Restrict physical access to the servers hosting Prometheus.
At the time of writing, Prometheus had added experimental support for basic authentication and TLS on the web UI.
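This is enabled with a separate web configuration file passed via the --web.config.file flag. A minimal sketch, with illustrative certificate paths and a placeholder password hash, might look like:

tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt   # assumed certificate path
  key_file: /etc/prometheus/prometheus.key    # assumed key path
basic_auth_users:
  # The value must be a bcrypt hash of the chosen password.
  admin: <bcrypt-hash-of-password>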
According to the Prometheus team, "In the future, server-side TLS support will be rolled out to the different Prometheus projects. Those projects include Prometheus, Alertmanager, Pushgateway, and the official exporters. Authentication of clients by TLS client certs will also be supported."