Optimizing Prometheus Queries With PromQL
Count worker nodes and track resource changes in Prometheus using PromQL. Explore queries, best practices, and dynamic thresholds for Kubernetes monitoring.
Prometheus is a powerful monitoring tool that provides extensive metrics and insights into your infrastructure and applications, especially in Kubernetes (k8s) and OpenShift Container Platform (OCP, an enterprise Kubernetes distribution). When crafting PromQL (Prometheus Query Language) expressions, accuracy and compatibility are essential, particularly when comparing metrics or calculating thresholds.
In this article, we will explore how to count worker nodes and track changes in resources effectively using PromQL.
Counting Worker Nodes in PromQL
To get the number of worker nodes in your Kubernetes cluster, the kube_node_info metric is often used. However, this metric includes all nodes, such as master, infra, and logging nodes, in addition to worker nodes. To filter only the worker nodes, you can refine your query using label matchers.
Here is a query to count only worker nodes:
count(kube_node_info{node=~".*worker.*"})
Explanation
- kube_node_info is the metric that provides information about all nodes.
- {node=~".*worker.*"} filters nodes whose names contain the substring "worker".
- count() calculates the total number of matching nodes.
As long as worker node names follow this naming convention, the query counts only worker nodes, which is often what you need when scaling metrics or thresholds in PromQL.
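The matcher's behavior can be sketched in Python. This is only a rough model: PromQL regex matchers are fully anchored, which `re.fullmatch` approximates for simple patterns like this one. The node names and the `count_matching` helper are hypothetical and chosen for illustration.

```python
import re

# Hypothetical node names; real clusters will differ.
node_names = [
    "master-0",
    "infra-1",
    "worker-0",
    "worker-1",
    "logging-worker-2",
]


def count_matching(names, pattern):
    """Count names that match the pattern in full, as PromQL's =~ matcher does."""
    regex = re.compile(pattern)
    return sum(1 for name in names if regex.fullmatch(name))


# ".*worker.*" behaves as a substring match because of the leading/trailing ".*".
print(count_matching(node_names, ".*worker.*"))  # 3

# Without the wildcards, the whole name would have to equal "worker".
print(count_matching(node_names, "worker"))  # 0
```

Note that the substring pattern also counts the hypothetical "logging-worker-2" node, so it is worth checking the pattern against your own naming scheme.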
Tracking Changes in Resource Usage
A common use case in Kubernetes monitoring is tracking the change in the number of pods over time. For example, you might want to detect if pods have increased significantly within the last 30 minutes. Combining this with the worker node count allows you to set thresholds that scale with your cluster's size.
Consider the following query:
max(apiserver_storage_objects{resource="pods"}) - max(apiserver_storage_objects{resource="pods"} offset 30m) > (20 * count(kube_node_info{node=~".*worker.*"}))
Breakdown
1. Left-Hand Side
- max(apiserver_storage_objects{resource="pods"}) gets the maximum number of pods currently in the cluster.
- max(apiserver_storage_objects{resource="pods"} offset 30m) retrieves the maximum number of pods 30 minutes ago.
- The subtraction yields the change in the number of pods over the last 30 minutes.
2. Right-Hand Side
- count(kube_node_info{node=~".*worker.*"}) counts the number of worker nodes.
- Multiplying this by 20 sets a dynamic threshold based on the number of worker nodes.
3. Comparison
- The query checks if the change in pod count exceeds the calculated threshold.
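The arithmetic behind this comparison can be worked through with a small Python sketch. The function name and the sample values are hypothetical, and only the multiplier of 20 comes from the query above.

```python
PODS_PER_WORKER = 20  # multiplier used in the query above


def pod_growth_exceeds_threshold(pods_now, pods_30m_ago, worker_count):
    """Mirror the comparison: (now - 30m ago) > 20 * worker_count."""
    change = pods_now - pods_30m_ago
    threshold = PODS_PER_WORKER * worker_count
    return change > threshold


# Hypothetical sample values, not taken from a real cluster.
# Change is 150 pods; with 5 workers the threshold is 100, so the check passes.
print(pod_growth_exceeds_threshold(1150, 1000, 5))   # True

# Same change, but with 10 workers the threshold is 200, so it does not.
print(pod_growth_exceeds_threshold(1150, 1000, 10))  # False
```

The same pod growth triggers on the smaller cluster but not on the larger one, which is the point of scaling the threshold with the worker count.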
Addressing Syntax Issues in PromQL
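While crafting PromQL expressions, syntax errors or mismatched types can lead to unexpected results. A comparison between two instant vectors only pairs elements whose label sets match, so if the left-hand side returned multiple labeled time series (for example, if an aggregation were dropped or grouped with a by clause), it would not line up with the single, label-less value on the right-hand side. To make sure the left-hand side is reduced to a single value, you can wrap it in a max() function:
max(max(apiserver_storage_objects{resource="pods"}) - max(apiserver_storage_objects{resource="pods"} offset 30m)) > (20 * count(kube_node_info{node=~".*worker.*"}))
Why Use max()?
Without a by clause, max() collapses its input to a single element with no labels. Strictly speaking, this is a one-element instant vector rather than a PromQL scalar, but count() on the right-hand side has the same shape, so the two sides match and the comparison works. In the query above, the inner max() calls already reduce each operand, so the outer max() acts as a safeguard rather than a requirement.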
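A rough Python model of how PromQL pairs elements across a comparison may help show why reducing each side to a single value matters. It is a simplification: it ignores the on, ignoring, group_left/group_right, and bool modifiers. The instance labels, the values, and both helper functions are hypothetical.

```python
def aggregate_max(vector):
    """Like max() without a by clause: one element with an empty label set."""
    return {frozenset(): max(vector.values())} if vector else {}


def compare_gt(lhs, rhs):
    """Like `lhs > rhs` between two vectors: keep lhs elements whose label
    set also exists in rhs and whose value is greater."""
    return {
        labels: value
        for labels, value in lhs.items()
        if labels in rhs and value > rhs[labels]
    }


# Hypothetical per-instance series on the left, one label-less value on the right.
pods = {
    frozenset({("instance", "api-0")}): 150.0,
    frozenset({("instance", "api-1")}): 140.0,
}
threshold = {frozenset(): 100.0}

# Label sets differ, so nothing is paired and the result is empty.
print(compare_gt(pods, threshold))

# After aggregation both sides have an empty label set, so they match.
print(compare_gt(aggregate_max(pods), threshold))
```

In this model the unaggregated comparison silently returns nothing, which is the kind of unexpected result the aggregation guards against.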
General Best Practices
- Understand your metrics: Always familiarize yourself with the metrics you are querying. Use Grafana's label_values() template function or the Prometheus UI to inspect available labels and their values.
- Test incrementally: Start with smaller queries and validate their results before building complex expressions.
- Ensure scalar compatibility: When comparing values, ensure both sides of the comparison reduce to a single value. Use aggregation functions like max(), sum(), or avg() as needed.
- Dynamic thresholds: Use cluster-specific metrics (e.g., node count) to set thresholds that scale dynamically with your infrastructure.
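One way to test incrementally is to run each sub-expression on its own through Prometheus's instant-query HTTP endpoint (/api/v1/query). The sketch below only builds the request URLs and does not send them. The server address and the `instant_query_url` helper are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed address of a Prometheus server; adjust for your environment.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build a URL for Prometheus's instant-query HTTP endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


# Validate each piece of the final expression before combining them.
steps = [
    'count(kube_node_info{node=~".*worker.*"})',
    'max(apiserver_storage_objects{resource="pods"})',
    'max(apiserver_storage_objects{resource="pods"} offset 30m)',
]

for expr in steps:
    print(instant_query_url(expr))
```

Each printed URL can be opened with any HTTP client, and the returned values checked, before the pieces are assembled into the full comparison.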
Conclusion
PromQL is a powerful tool, but crafting accurate and efficient queries requires careful attention to detail. By using refined expressions like count(kube_node_info{node=~".*worker.*"}) to count worker nodes and dynamic thresholds based on cluster size, you can create robust monitoring solutions that adapt to your environment. Always test and validate your queries to ensure they provide meaningful insights.
Feel free to use the examples and best practices discussed here to enhance your monitoring setup and stay ahead of potential issues in your Kubernetes cluster.