Modern systems span numerous architectures and technologies and are becoming increasingly modular, dynamic, and distributed in nature. These complexities pose new challenges for the developers and SRE teams charged with ensuring the availability, reliability, and performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices needed to implement a strategic, holistic approach to system-wide observability and application monitoring.
DORA Metrics: Tracking and Observability With Jenkins, Prometheus, and Observe
From Observability To Architectural Observability: Shifting Left for Resiliency
When it comes to observability, Grafana is the go-to tool for visualization. A Grafana dashboard consists of various forms of visualizations, which are usually backed by a database. That is not always the case, though. Sometimes, instead of pushing the data from the database as is, you might want to refine it, and this cannot always be achieved through the functionality the database provides. For example, you might want to fetch results from a proprietary API. This is where the grafana-infinity-datasource plugin kicks in. With grafana-infinity-datasource, you can create visualizations based on JSON, XML, CSV, etc. You can issue an HTTP request to a REST API and plot the received data.

Tutorial

Let's assume we have an eShop application. We will create a simple Python API using FastAPI to manage the items of the eShop and the purchase volume. Through this API, we will add items and purchase-volume entries.

Python

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
from datetime import datetime

app = FastAPI()


class Item(BaseModel):
    id: int
    name: str
    description: Optional[str] = None
    price: float


class Purchase(BaseModel):
    price: float
    time: datetime


items = []
purchases = []


@app.post("/items/", response_model=Item)
def create_item(item: Item):
    items.append(item)
    return item


@app.get("/items/", response_model=List[Item])
def read_items():
    return items


@app.get("/items/{item_id}", response_model=Item)
def read_item(item_id: int):
    for item in items:
        if item.id == item_id:
            return item
    raise HTTPException(status_code=404, detail="Item not found")


@app.delete("/items/{item_id}", response_model=Item)
def delete_item(item_id: int):
    for idx, item in enumerate(items):
        if item.id == item_id:
            return items.pop(idx)
    raise HTTPException(status_code=404, detail="Item not found")


@app.post("/purchases/", response_model=Purchase)
def create_purchase(purchase: Purchase):
    purchases.append(purchase)
    return purchase


@app.get("/purchases/", response_model=List[Purchase])
def read_purchases():
    return purchases

We also need FastAPI to be added to requirements.txt:

Properties files

fastapi

We will host the application through Docker; thus, we create a Dockerfile:

Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py main.py

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

We can now proceed to the Grafana visualizations. Essentially, we have two different sources of data: the Item model will be visualized in a table, and the Purchase model will be visualized through a time series graph. I shall use Docker Compose to provision Grafana as well as the Python application:

YAML

version: '3.8'

services:
  app:
    build: .
    ports:
      - 8000:8000
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./grafana:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=test
      - GF_SECURITY_ADMIN_PASSWORD=infinity
      - GF_INSTALL_PLUGINS=yesoreyeram-infinity-datasource

Essentially, through the GF_INSTALL_PLUGINS environment variable, I enable the infinity-datasource plugin. We can get our instances up and running by issuing the following (Docker Compose V2 is out there with many good features):

Shell

docker compose up
We can now populate the application with some data:

Shell

$ curl -X POST "http://127.0.0.1:8000/purchases/" -H "Content-Type: application/json" -d '{"time": "2024-07-15T12:40:56","price":2.5}'
$ curl -X POST "http://127.0.0.1:8000/purchases/" -H "Content-Type: application/json" -d '{"time": "2024-07-15T12:41:56","price":4.0}'
$ curl -X POST "http://127.0.0.1:8000/purchases/" -H "Content-Type: application/json" -d '{"time": "2024-07-15T12:42:56","price":1.5}'
$ curl -X POST "http://127.0.0.1:8000/purchases/" -H "Content-Type: application/json" -d '{"time": "2024-07-15T12:43:56","price":3.5}'
$ curl -X POST "http://127.0.0.1:8000/items/" -H "Content-Type: application/json" -d '{"id": 1, "name": "Item 1", "description": "This is item 1", "price": 10.5}'

Moving onward, create a dashboard on Grafana with one visualization for the items and one visualization for the purchase volume. As you can see, in both cases I used the http://app:8000 endpoint, which is our application and a hostname that the Compose network can resolve. That's it! We plotted our data from a REST API using Grafana.
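If you prefer to configure the data source as code rather than through the Grafana UI, a minimal provisioning sketch might look like the one below. The file name and the provisioning mount (/etc/grafana/provisioning/datasources) are assumptions and are not part of the Compose file above; the plugin type id matches the GF_INSTALL_PLUGINS value.

YAML

# Hypothetical file: provisioning/datasources/infinity.yaml
# Mount it into the Grafana container under /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Infinity
    type: yesoreyeram-infinity-datasource   # same plugin id as in GF_INSTALL_PLUGINS
    access: proxy

With the data source in place, each panel query simply points at the REST endpoint (for example, http://app:8000/items/) and selects JSON as the format.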
When we, developers, find some bugs in our logs, this sometimes is worse than a dragon fight! Let's start with the basics. We have this order of log severity, from most detailed to no detail at all: TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF.

The default severity for your classes is INFO, so you don't need to change your configuration file (application.yaml):

YAML

logging:
  level:
    root: INFO

Let's create a sample controller to test some of the severity levels:

Java

@RestController
@RequestMapping("/api")
public class LoggingController {

    private static final Logger logger = LoggerFactory.getLogger(LoggingController.class);

    @GetMapping("/test")
    public String getTest() {
        testLogs();
        return "Ok";
    }

    public void testLogs() {
        System.out.println(" ==== LOGS ==== ");
        logger.error("This is an ERROR level log message!");
        logger.warn("This is a WARN level log message!");
        logger.info("This is an INFO level log message!");
        logger.debug("This is a DEBUG level log message!");
        logger.trace("This is a TRACE level log message!");
    }
}

We can test it with HTTPie or any other REST client:

Shell

$ http GET :8080/api/test
HTTP/1.1 200

Ok

Checking the Spring Boot logs, we will see something like this:

PowerShell

 ==== LOGS ==== 
2024-09-08T20:50:15.872-03:00 ERROR 77555 --- [nio-8080-exec-5] LoggingController : This is an ERROR level log message!
2024-09-08T20:50:15.872-03:00 WARN 77555 --- [nio-8080-exec-5] LoggingController : This is a WARN level log message!
2024-09-08T20:50:15.872-03:00 INFO 77555 --- [nio-8080-exec-5] LoggingController : This is an INFO level log message!

If we need to change all of the com.boaglio classes to DEBUG, we need to add this to the application.yaml file and restart the application:

YAML

logging:
  level:
    com.boaglio: DEBUG

Now, repeating the test, we will see a new debug line:

PowerShell

 ==== LOGS ==== 
2024-09-08T20:56:35.082-03:00 ERROR 81780 --- [nio-8080-exec-1] LoggingController : This is an ERROR level log message!
2024-09-08T20:56:35.082-03:00 WARN 81780 --- [nio-8080-exec-1] LoggingController : This is a WARN level log message!
2024-09-08T20:56:35.083-03:00 INFO 81780 --- [nio-8080-exec-1] LoggingController : This is an INFO level log message!
2024-09-08T20:56:35.083-03:00 DEBUG 81780 --- [nio-8080-exec-1] LoggingController : This is a DEBUG level log message!

This is good, but sometimes we are running in production and need to change from INFO to TRACE just for a quick investigation. This is possible with the LoggingSystem class. Let's add a POST API to our controller that changes all logs to TRACE:

Java

@Autowired
private LoggingSystem loggingSystem;

@PostMapping("/trace")
public void setLogLevelTrace() {
    loggingSystem.setLogLevel("com.boaglio", LogLevel.TRACE);
    logger.info("TRACE active");
    testLogs();
}

We are using the LoggingSystem.setLogLevel method to change all loggers in the com.boaglio package to TRACE. Let's call our POST API to enable TRACE:

Shell

$ http POST :8080/api/trace
HTTP/1.1 200

Now we can check that the trace was finally enabled:

PowerShell

2024-09-08T21:04:03.791-03:00 INFO 82087 --- [nio-8080-exec-3] LoggingController : TRACE active
 ==== LOGS ==== 
2024-09-08T21:04:03.791-03:00 ERROR 82087 --- [nio-8080-exec-3] LoggingController : This is an ERROR level log message!
2024-09-08T21:04:03.791-03:00 WARN 82087 --- [nio-8080-exec-3] LoggingController : This is a WARN level log message!
2024-09-08T21:04:03.791-03:00 INFO 82087 --- [nio-8080-exec-3] LoggingController : This is an INFO level log message!
2024-09-08T21:04:03.791-03:00 DEBUG 82087 --- [nio-8080-exec-3] LoggingController : This is a DEBUG level log message!
2024-09-08T21:04:03.791-03:00 TRACE 82087 --- [nio-8080-exec-3] LoggingController : This is a TRACE level log message!

And a bonus tip: to enable DEBUG or TRACE just for the Spring Boot framework (which is great sometimes to understand what is going on under the hood), we can simply add this to our application.yaml:

YAML

debug: true

or

YAML

trace: true

And one final aside, shown below: if Spring Boot Actuator is available, its loggers endpoint offers yet another way to change levels at runtime. Let the game of trace begin!
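If Spring Boot Actuator happens to be on the classpath and the loggers endpoint is exposed (management.endpoints.web.exposure.include=loggers), the same runtime change can be made without writing a custom endpoint. This is an assumption about your setup, not part of the example above:

Shell

$ http POST :8080/actuator/loggers/com.boaglio configuredLevel=TRACE

This sends {"configuredLevel": "TRACE"} to the built-in loggers endpoint and takes effect immediately, with no restart required.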
Logging is one of the most important parts of a distributed system. Many things can break, but when the logging breaks, then we are completely lost. In this blog post, we will understand log levels and how to log efficiently in distributed systems.

Logging Levels

Whenever we log a message, we need to specify the log level or log severity. It's an indicator of how important the message is and who should be concerned. Log levels have been with us for a long time; since the very early days of computer systems, we had to provide log files and define the levels. Typically, we can use the following levels:

- Error
- Warning
- Information
- Debug
- Trace

Sometimes a longer hierarchy is used and we may have additional categories like:

- Emergency
- Alert
- Critical
- Notice

We use the log levels in two ways:

- We define the level for each message that we log. In other words, when logging a message in the application source code, we need to specify the level of this particular message.
- We specify which messages are stored in the log storage. In other words, when adding a new log storage (like a log file or log table), we specify the maximum log level that we want to store.

No matter which levels we can use (which depends on the libraries and monitoring systems we use), there are some best practices that we should follow. Let's go through them one by one.

Trace Level

Logs at this level typically show "everything" that happened. They are supposed to explain everything needed to debug any issue. This includes showing raw data packets that are sent over the network or hexadecimal representations of memory bytes. Logs on this level may consume a lot of space due to the raw data that we log. For instance, if your application needs to send 1 GB of data over the network, it would store the content of the data in the log file, which would be at least 1 GB. This level may also pose a security risk, as we emit everything, including passwords or sensitive information. Therefore, this level shouldn't be used in production systems. Messages on this level are typically useful for developers as they help debug bugs. They are rarely helpful for database administrators and nearly never interesting for regular users.

Debug Level

This level is similar to the trace level but focuses on curated messages helpful for programmers. On this level, we rarely store the raw content of everything. We rather focus on very detailed pieces of information showing crucial data that may help the programmers. For instance, we wouldn't store the actual password sent over the network, but we would store the metadata of the connection and which encryption schemes are used. Logs on this level are still very big and shouldn't be enabled in production. They may also contain sensitive data, so they may pose a security risk. Messages are useful for programmers and very rarely useful for others.

Information Level

Logs on this level typically show "what happens." They just inform that something happened, typically presented from the business perspective. This is the level we typically use initially and only change if we generate too many logs. Messages on this level focus on business context and business signals. They very rarely show actual data sent by the users; more often they focus on the facts that happened. Messages on this level are very helpful for the typical users of the system who want to monitor its behavior. This includes database administrators and programmers. Messages on this level are rarely helpful for regular users.
This is the level that we should use by default unless we get too many logs (which may happen if our system handles thousands of transactions per second).

Warning Level

On this level, we log messages related to potentially erroneous situations that the system handles automatically. For instance, we may inform that the user logged in with an insecure password. The system still worked correctly and it's not an error in theory, but still, this situation requires some attention. Logs on this level are useful for database administrators and users. For instance, they may indicate issues that can be prevented if taken care of early enough or things that may break after updating the system to the next edition. We should always store these logs in the log file.

Error Level

This level shows situations that are errors and the system couldn't handle them automatically. Depending on the context, one situation may be an "error" from the user's perspective and not an "error" from the system's perspective or the other way around. For instance, if the user provides an incorrect password when logging in, the user should get an error message (and so the database client should log an error), whereas the system should only log a warning message (as this is potentially a brute-force attack). Logs on this level should always be stored in the log file. Depending on the error severity, we may want to build alerts around these logs and notify the users accordingly. Error logs are useful for users and administrators.

What to Log

When logging information, we should always include everything that "is needed" and not more. We need to find a balance between logging too much (which consumes the disk space) and logging not enough (so we can't investigate the problem later). Having distributed systems in mind, we should log the following:

- Timestamp: Human-friendly timestamp of the message; for instance, "2024.06.21 13:08:23.125422". It's better to start the timestamp with the year to make sorting easier. Depending on the system we're dealing with, we should log at least hours and minutes, but we can go as far down as nanoseconds.
- Application name: This is the name of the application that we understand. It shouldn't be generic like "database" but rather something with business meaning like "tickets database."
- Server identifier: When we scale to multiple servers, we should always log the server identifier to understand where the issue occurred.
- Process identifier: This is the identifier of the process as reported by the operating system. This is helpful when we scale out our application by forking it into multiple processes.
- Thread identifier: This is the identifier of the thread that logged the message. We may have multiple identifiers of this kind (the operating system level identifier and the runtime thread identifier) and we should log all of them. The name of the thread can be helpful as well. Thread identifiers are helpful when analyzing memory dumps.
- Log level: Obviously, the level of the message.
- Correlation identifier: The unique identifier of the operation we're executing.
  This can be the Trace ID and Span ID from OpenTelemetry or something equivalent.
- Activity: The human-friendly name of the workflow we're executing; for instance, "ticket purchase."
- Logical time: Any other ordering identifier that we have; for instance, Lamport's happened-before time or vector time in the distributed system.
- Logger ID: The identifier of the logger that we used (especially if we have multiple subsystems within the process).

There are more things that we can log depending on our needs. Consider this a good starting point. We should typically store the logs up to the Information level in the log files. If we face issues with logs becoming too big too early, or with too much disk activity due to log writing, we should consider limiting the log files to store messages only up to the Warning level.

How to Effectively Use Logging With Observability

There are some best practices that we should follow when configuring logging. Let's see them.

Rolling Old Files

We typically store the log messages in log files. They may easily become very big (tens or hundreds of gigabytes), so we need to roll the files often, typically each day or each hour. After we roll the file, we should compress the old file and send it over to a centralized place. Rolling files each hour gives us some more reliability (as in the worst case we will lose one hour of logs) but also increases the number of files.

Asynchronous Logging and Centralization

Logging is expensive. When your system handles thousands of transactions each second, we simply can't log all the data synchronously. Therefore, always offload logging to a background thread and log asynchronously. However, logging asynchronously may break the order of logs. To fix that, always include a logical ordering ID (like Lamport's happened-before) to be able to reconstruct the proper order. Remember that the timestamp is good enough only when compared within one thread (and even then it may be incorrect) and so should be avoided in distributed systems. It's better to log to local log files (to minimize the writing time), and then periodically extract the logs and send them to some external storage. This is typically achieved with a background daemon that reads the files, compresses them, and sends them over to some central monitoring place.

Anomaly Detection

Once we have the log files in a centralized store, we can look for anomalies. For instance, we can simply search the logs and look for words like "exception", "error", "issue", or "fatal". We can then correlate these logs with other signals like metrics (especially business dimensions) or traces and data distribution. We can make anomaly detection better with the help of artificial intelligence. We can analyze the patterns and look for suspicious trends in the data. This is typically provided out of the box in cloud infrastructures.

Filtering by the Correlation ID

The correlation ID can help us debug scenarios much more easily. Whenever there is an issue, we can show the correlation ID to the user and ask them to mention the ID when reaching back to us. Once we have the correlation ID, we simply filter the logs based on it, order the log entries by the logical clock (like a vector clock), then order them by the timestamp, and finally visualize them. This way, we can easily see all the interactions within our system.

Tracing on Demand

It's good to have the ability to enable tracing on demand, for instance, based on a cookie, a query string parameter, or some well-known session ID.
This helps us live-debug issues when they happen. With such a feature, we can dynamically reconfigure the log file to store all the log messages (up to the trace level) only for some of the transactions. This way, we don't overload the logging system, but we still get the crucial logs for the erroneous workflows.

Logging in Databases

Logging in databases doesn't differ much from logging in regular distributed systems. We need to configure the following:

- What log collector to use: This specifies where we log messages. This can be, for instance, the output stream or a file.
- Where to store logs: In the case of logging to a file, we need to specify the directory, names, and file formats (like CSV or JSON).
- How to handle log files: Whether they are rotated daily, hourly, or even more often, or whether the files are trimmed on system restart.
- Message formats: What information to include; for instance, the transaction identifier or some sequence numbering, what timezone to use, etc.
- Log level: Which messages to store in the log file.
- Sampling: For instance, whether to log only longer queries or slower transactions.
- Categories: What to log; for instance, whether to log connection attempts or only transactions.

As mentioned before, we need to find a balance between logging too much and not logging enough. For instance, take a look at the PostgreSQL documentation to understand which things can be configured and how they may affect the logging system (a minimal configuration sketch is included after the summary below). Typically, you can check your log sizes with SQL queries. For instance, for PostgreSQL, this is how you can check the size of the current log file:

SELECT size FROM pg_stat_file(pg_current_logfile());

size
94504316

This query shows you the total size of all log files:

SELECT sum(stat.size)
FROM pg_ls_dir(current_setting('log_directory')) AS logs
CROSS JOIN LATERAL pg_stat_file(current_setting('log_directory') || '/' || logs) AS stat;

sum
54875270066

You can verify the same with the command line. Go to your database log directory (that could be /var/lib/postgresql/data/log or similar; you can check it with SHOW data_directory;) and run this command:

du -cb
54893969026 .
54893969026 total

Depending on how you host the database, you may get your logs delivered to external systems out of the box. For instance, you can configure PostgreSQL on RDS to send logs to Amazon CloudWatch, and then you can see the logs within the log group /aws/rds/instance/instance_name/postgresql. There are many types of logs you can see, for instance:

- Startup and shutdown logs: These describe your database startup and shutdown procedures.
- Query logs: These show the SQL queries that are executed.
- Query duration logs: They show how long queries took to complete.
- Error logs: They show errors like invalid SQL queries, typos, constraint violations, etc.
- Connection and disconnection logs: They show how users connect to the database.
- Write-Ahead Logs (WAL): They show checkpoints and when the data is saved to the drive.

Depending on your configuration, your log files may easily grow in size. For instance, if you decide to log every query and you have thousands of transactions per second, then the log file will store all the history of the queries, which may be enormous after a couple of hours.

Summary

Logs are the most important part of a distributed system. Anything can break, but if logs break, then we are blind. Therefore, it's crucial to understand how to log, what to log, and how to deal with logs to debug issues efficiently.
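To make the database-logging options discussed above concrete, here is a minimal sketch of the corresponding PostgreSQL settings (postgresql.conf). The values are illustrative assumptions, not recommendations; consult the PostgreSQL documentation for the full list and the trade-offs of each setting.

Properties files

# Where and how to collect logs
logging_collector = on
log_destination = 'csvlog'           # log collector / file format
log_directory = 'log'                # where to store logs
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = '1d'              # how to handle log files
log_rotation_size = '100MB'

# What to log
log_min_messages = warning           # log level
log_min_duration_statement = 500     # sampling: only statements slower than 500 ms
log_connections = on                 # categories
log_disconnections = on
log_line_prefix = '%m [%p] %a %u@%d '  # message format: timestamp, pid, app, user, database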
Are you ready to start your journey on the road to collecting telemetry data from your applications? Great observability begins with great instrumentation! In this series, you'll explore how to adopt OpenTelemetry (OTel) and how to instrument an application to collect tracing telemetry. You'll learn how to leverage out-of-the-box automatic instrumentation tools and understand when it's necessary to explore more advanced manual instrumentation for your applications. By the end of this series, you'll have an understanding of how telemetry travels from your applications to the OpenTelemetry Collector, and be ready to bring OpenTelemetry to your future projects. Everything discussed here is supported by a hands-on, self-paced workshop authored by Paige Cruze. This first article takes time to lay the foundation, defining common observability and monitoring terms that help us understand the role of OpenTelemetry and its components. This introduction needs to start with some really basic definitions so that we're all on the same page when we move on to more complex topics.

Defining Our Needs

First, there is the term observability. This is how effectively you can understand system behavior from the outside using the data that the system generates. Monitoring is a bit more subtle of a term, where we are continuously in the process of watching and tracking system health based on a pre-defined set of data. This is often done using dashboards that represent queries on that set of data. When we talk about data, we are actually talking about telemetry, which is the process of recording and sending data from remote components to a backend. Examples of telemetry are data types that include metrics, logs, events, and traces. Finally, we have to look at instrumentation. This is the actual code that records and measures the behavior of an application or infrastructure component. There are two types of instrumentation at which we'll be looking: auto-instrumentation and manual instrumentation. The first is provided out of the box by an instrumentation library, usually just by adding it and flipping the switch. The second is achieved by custom code added to applications, usually narrowing the scope or specific focus of your data needs.

What Is OpenTelemetry?

The OTel project has been part of the Cloud Native Computing Foundation (CNCF) since 2019 and was born from the merging of the OpenCensus and OpenTracing projects. OTel is a set of standardized, vendor-agnostic Software Development Kits (SDKs), Application Programming Interfaces (APIs), and other tools for ingesting, transforming, and sending telemetry to observability backends. The basic architecture of OTel is a typical cloud-native environment with microservices, infrastructure components, and instrumented client applications pushing telemetry data through the OTel Collector to eventual observability backends. Included in the OTel tooling are the OTel API and SDK, used in the microservices for auto-instrumentation and manual instrumentation of those services. The API defines the data types available to us and how to generate our telemetry data, while the SDK is the language-specific implementation of the API that handles configuration, data processing, and exporting. The status of API/SDK support varies by language; see the OTel documentation for current details on specific language support. We can also make use of the client instrumentation libraries.
Check out the 770+ instrumentation libraries available in the OTel Registry for instrumenting our specific needs. It's always important to include in an introduction what a technology can't do. The first is that OTel is NOT only a tracing tool able to collect just tracing data; the specifications have been expanded to include metrics and logs processing. OTel does NOT provide a telemetry backend or storage system. It leaves this to other projects such as Jaeger and Prometheus. Finally, OTel does NOT provide any sort of observability UI. It focuses on the generation, collection, management, and export of telemetry. OTel leaves the storing, querying, and visualizing of telemetry data to other projects or vendors.

The Collector

The OpenTelemetry Collector is a proxy that can receive, process, and export telemetry data. It is a vendor-agnostic implementation that supports open-source observability data formats such as the CNCF projects Jaeger, Prometheus, Fluent Bit, and more. The Collector does this using the OpenTelemetry Protocol (OTLP), a specification describing the encoding, transportation, and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors, and telemetry backends. Simply stated, OTLP is a general-purpose telemetry data delivery protocol. Within OTel there is the concept of an OTel Resource, which represents the entity producing telemetry as resource attributes. Imagine a service running in a container on Kubernetes and producing telemetry such as service.name, pod.name, namespace, etc. All of these attributes can be included in the resource. This resource information is used to investigate interesting behavior, such as narrowing latency in your system down to a specific container, pod, or service.

Instrumentation Types

There are three types of instrumentation that you will be exploring, listed here. Each includes a code example from the previously mentioned hands-on workshop application.

An automatic agent runs alongside the application and adds instrumentation without code changes.

Python

# Nothing to see here, no code changes to my application,
# agent injects instrumentation at runtime.
#
from flask import Flask, request

app = Flask(__name__)

@app.route("/server_request")
def server_request():
    print(request.args.get("param"))
    return "served"

if __name__ == "__main__":
    app.run(port=8082)

Programmatic is a mix of both, where you pull in pre-instrumented dependencies and manually add metadata (e.g., labels).

Python

# Use framework specific instrumentation library to capture
# basic method calls, requiring configuration and code changes.
#
from flask import Flask, request
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.trace import get_tracer_provider, set_tracer_provider

set_tracer_provider(TracerProvider())
get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

instrumentor = FlaskInstrumentor()
app = Flask(__name__)
instrumentor.instrument_app(app)

@app.route("/server_request")
def server_request():
    print(request.args.get("param"))
    return "served"

if __name__ == "__main__":
    app.run(port=8082)

With manual, you set up an observability library and add instrumentation code.

Python

# Requires configuring OpenTelemetry libraries and
# instrumenting every method call you care about.
#
from flask import Flask, request
from opentelemetry.instrumentation.wsgi import collect_request_attributes
from opentelemetry.propagate import extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.trace import (
    SpanKind,
    get_tracer_provider,
    set_tracer_provider,
)

app = Flask(__name__)

set_tracer_provider(TracerProvider())
tracer = get_tracer_provider().get_tracer(__name__)
get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

@app.route("/server_request")
def server_request():
    with tracer.start_as_current_span(
        "server_request",
        context=extract(request.headers),
        kind=SpanKind.SERVER,
        attributes=collect_request_attributes(request.environ),
    ):
        print(request.args.get("param"))
        return "served"

if __name__ == "__main__":
    app.run(port=8082)

These three examples use code from a Python application that you can explore in the provided hands-on workshop.

What's Next?

This article defined some common observability and monitoring terms, helping you gain an understanding of the role OpenTelemetry and its components play in observability solutions. Next up: installing OpenTelemetry on our local machine, configuring the SDK, running the demo application, and viewing trace data in the console.
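As a preview of where this series is heading, the telemetry emitted by these instrumented applications is typically sent to an OpenTelemetry Collector. A minimal sketch of a Collector configuration might look like the following; the file name and the debug exporter are assumptions for illustration (older Collector releases call it the logging exporter), not part of the workshop code above:

YAML

# Hypothetical collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  debug:            # prints received telemetry to the Collector's console

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]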
In the world of software development, making data-driven decisions is crucial. However, the approaches to data collection are often ambiguous, leading to misunderstandings between technical developers and product management teams. While developers focus on application performance and outage management, product managers are more concerned with user interactions and identifying friction points. As a result, a comprehensive application monitoring strategy is often an afterthought in the development process. To avoid this pitfall, it is essential for development teams to build a shared understanding of telemetry objectives. A coherent team, aligned in its goals, can effectively measure and analyze data to drive meaningful insights. This article explores various data collection approaches for application telemetry, emphasizing its significance for both developers and product managers.

What Is Application Telemetry?

Application telemetry involves the automatic recording and transmission of data from multiple sources to an IT system for monitoring and analysis. It provides actionable insights by answering key questions for both developers and product managers.

For Developers

- How fast is the application responding to users in different geographic locations?
- Can the servers handle peak loads, such as during Black Friday?
- What infrastructure upgrades are needed next?
- How does the end-of-life of technologies impact application stability?
- What is the overall reliability of the application?

For Product Managers

- What are users primarily looking for?
- Should the company invest in building support for a particular operating system, browser, or form factor?
- What features should be developed or retired next?

Application Telemetry Objectives

It is important to note that developers and product managers have different objectives and expectations from telemetry. Having a particular toolset implemented doesn't cover it all. There are several monitoring tools available in the market, but one size doesn't fit all. It is alright to have multiple data collection tools added to the application depending on the data collection goal.

For Product Managers

- Product monitoring/feature usage: Understanding which features are most used and identifying areas for improvement.
- Backlog prioritization: Prioritizing development efforts based on user data and feedback.

For Developers

- Predicting future anomalies: Identifying patterns that may indicate future issues.
- Build failures: Monitoring build processes to detect and resolve failures quickly.
- Outage monitoring: Ensuring the application is available and performant.
- Security: Detecting and mitigating security vulnerabilities.

Techniques for Application Telemetry

Effective application telemetry involves a combination of tools and techniques tailored to the needs of both product managers and developers. Here are detailed descriptions of key techniques and examples of tools used for each:

Product Manager-Focused Techniques

1. Application analytics: Application analytics tools collect data on how users interact with the application, providing insights into user behavior, feature usage, and overall engagement. These tools help product managers understand which features are most popular, which parts of the application might be causing user frustration, and how users are navigating through the application.

- Google Analytics: Widely used for tracking user interactions and engagement.
  It provides detailed reports on user demographics, behavior flow, and conversion rates.
- Adobe Analytics: Offers advanced capabilities for real-time analytics and segmentation. It allows product managers to create detailed reports and visualizations of user data.
- Mixpanel: Focuses on user behavior analytics, providing insights into how users interact with specific features and tracking events to measure user engagement and retention.

2. Real User Monitoring (RUM): RUM tools monitor the actual experiences of users as they interact with the application in real time. These tools collect data on page load times, user interactions, and performance metrics from the user's perspective, helping product managers and developers understand the real-world performance of the application.

- New Relic Browser: Provides real-time insights into user experiences by monitoring page load times, JavaScript errors, and browser performance metrics.
- Dynatrace RUM: Offers detailed user session analysis, capturing every user interaction and providing insights into application performance from the user's perspective.
- Pingdom: Monitors website performance and user experience by tracking page load times, uptime, and user interactions.

Developer-Focused Techniques

1. Error tracking: Error tracking tools help developers detect, diagnose, and resolve errors and exceptions in real time. These tools provide detailed error reports, including stack traces, environment details, and user context, enabling developers to quickly identify and fix issues.

- Sentry: Provides real-time error tracking and performance monitoring. It captures exceptions, performance issues, and release health, allowing developers to trace problems back to their source.
- Rollbar: Offers real-time error monitoring and debugging tools. It provides detailed error reports and integrates with various development workflows.
- Airbrake: Monitors errors and exceptions, providing detailed error reports and notifications. It integrates with various development tools and platforms.
- Elastic Stack (Elasticsearch, Logstash, Kibana): Elasticsearch stores and searches large volumes of log data, Logstash collects and processes logs, and Kibana visualizes the data, making it easier to identify and resolve errors.

2. Server metrics: Server metrics tools monitor the performance and health of server infrastructure. These tools track metrics such as CPU usage, memory consumption, disk I/O, and network latency, helping developers ensure that the server environment can support the application's demands.

- Prometheus: An open-source monitoring system that collects and stores metrics as time series data. It provides powerful querying capabilities and integrates well with Grafana for visualization.
- Grafana: A visualization and analytics platform that works with Prometheus and other data sources. It allows developers to create interactive and customizable dashboards to monitor server metrics.
- Datadog: Provides comprehensive monitoring and analytics for servers, databases, applications, and other infrastructure. It offers real-time monitoring, alerting, and visualization of server metrics.
- Dynatrace: Provides full-stack monitoring and analytics, offering deep insights into application and infrastructure performance, including server metrics.

The Phases of Application Telemetry

Application telemetry is a continuous journey rather than a one-time task. It should evolve alongside the application and its users.
Implementing telemetry early in the development process is crucial to avoid compromising application stability and performance. Additionally, early integration provides valuable trend and timeline data for benchmarking.

1. Fundamental Step

- Track standard elements of application performance, such as loading times for most users.
- Collect descriptive statistics on users and usage patterns.

2. Business Insights

- Analyze user engagement and determine if users are interacting with the application as expected.

3. Prediction

- Predict trends and application failures using machine learning models.
- Analyze data to determine user needs and anticipate future requirements.

Early in the development cycle, a team can start by implementing basic telemetry tools like Google Analytics and Prometheus to monitor user visits, page load times, and server response rates, ensuring the app functions smoothly for most users. Don't limit yourself by thinking that early on the application won't have significant traffic. It is always beneficial to add these tools early so that application telemetry forms part of the application's core DNA. As the app gains traction, you can integrate Mixpanel and Dynatrace RUM to analyze user interactions with features like posting updates and commenting, identifying which features are most engaging and where users encounter difficulties. In this stage, be sure to include real-time user feedback collection so that you have both quantitative and qualitative data. In the final, mature stage, leverage tools such as Elastic Stack and Datadog to predict trends and potential issues, using machine learning to anticipate peak usage times and proactively address performance bottlenecks and potential failures, thus maintaining a seamless user experience and preparing for future growth.

Conclusion

Early implementation of an application telemetry strategy is vital for successful application development. By embedding telemetry tools at the onset, organizations can ensure that data collection is integral to the application's core DNA. This proactive approach enables both product and development teams to make informed, data-driven decisions that benefit the application and its users. Application telemetry bridges the gap between these teams, allowing developers to monitor performance and predict anomalies while product managers gain insights into user behavior and feature usage. As the application evolves, leveraging advanced tools like Elastic Stack and Datadog ensures that trends are anticipated and issues are addressed before they impact the user experience. Ultimately, a robust telemetry strategy ensures that customer interests remain at the forefront of development efforts, creating more reliable, performant, and user-centric applications. Teams should prioritize telemetry from the beginning to avoid compromising application stability and to harness the full potential of data-driven insights, benefiting both developers and product managers alike.
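To close this section with something concrete, here is a small sketch of the "fundamental step" described above: a Python service exposing basic request counts and latency for Prometheus to scrape, using the prometheus_client library. The metric names and the port are illustrative assumptions, not prescriptions:

Python

# Minimal sketch: expose request count and latency to Prometheus.
# Assumes `pip install prometheus_client`; metric names and port 9100 are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.inc()                              # count every request
    with LATENCY.time():                        # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)   # metrics served at http://localhost:9100/metrics
    while True:
        handle_request()

A Prometheus server can then scrape the /metrics endpoint, and the same counters can back Grafana dashboards as the application matures.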
Effective monitoring and troubleshooting are critical for maintaining the performance and reliability of Atlassian products like Jira and Confluence and software configuration management (SCM) tools like Bitbucket. This article explores leveraging various monitoring tools to identify, diagnose, and resolve issues in these essential development and collaboration platforms. Before we discuss the monitoring tools, let's clarify the importance of monitoring. Monitoring Atlassian tools is crucial for several reasons:

- Proactive issue detection
- Performance optimization
- Capacity planning
- Security and compliance
- Minimizing downtime

By implementing robust monitoring practices, IT teams can ensure smooth operations, enhance user experience, and maximize the value of Atlassian investments.

Essential Monitoring Tools

1. Atlassian's Built-in Monitoring Tools

Atlassian provides several built-in tools for monitoring and troubleshooting:

- Troubleshooting and Support Tools: This app, included by default in Atlassian products, offers features like log analysis, health checks, and support zip creation. It helps identify common issues and provides links to relevant knowledge base articles.
- Instance Health Check: This feature, available in the administration console, scans for potential problems and offers recommendations for resolving them.
- Application Metrics: Atlassian products expose various performance metrics via JMX (Java Management Extensions). External monitoring tools can be used to gather and examine these metrics.

2. Log Analysis

Log files contain a wealth of information for troubleshooting. Critical log files to monitor include:

- Application logs (e.g., atlassian-jira.log, atlassian-confluence.log)
- Tomcat logs (catalina.out)
- Database logs

Log aggregation tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk can centralize, search, and analyze log data from multiple sources.

3. Prometheus and Grafana

Prometheus and Grafana are popular open-source tools for monitoring and visualization:

- Prometheus: Collects and stores time-series data from configured targets
- Grafana: Creates dashboards and visualizations based on the collected metrics

Atlassian provides documentation on setting up Prometheus and Grafana to monitor Jira and Confluence. This combination allows for:

- Real-time performance monitoring
- Custom dashboards for different stakeholders
- Alerting based on predefined thresholds

4. Application Performance Monitoring (APM) Tools

APM solutions offer comprehensive visibility into how applications are functioning and how users are experiencing them. Popular options include:

- Dynatrace
- AppDynamics
- New Relic

These tools can help identify bottlenecks, trace transactions, and determine the root cause of performance issues across the application stack.

5. Infrastructure Monitoring

Monitoring the underlying infrastructure is crucial for maintaining optimal performance. Key areas to monitor include:

- CPU, memory, and disk usage
- Network performance
- Database performance

Monitoring tools like Nagios, Zabbix, or cloud-native solutions (e.g., AWS CloudWatch) can monitor infrastructure components.

6. Synthetic Monitoring and User Experience

Synthetic monitoring involves simulating user interactions to identify issues proactively.
Synthetic monitoring tools like Selenium or Atlassian's own Statuspage can be used to:

- Monitor critical user journeys
- Check availability from different geographic locations
- Measure response times for crucial operations

The section below examines some of the frequent issues with Atlassian tools and troubleshooting techniques for these common issues.

Troubleshooting Techniques

1. Performance Degradation

When facing performance issues:

- Check system resources (CPU, memory, disk I/O) for bottlenecks.
- Analyze application logs for errors or warnings.
- Review recent configuration changes.
- Examine database performance metrics.
- Use APM tools to identify slow transactions or API calls.

2. Out of Memory Errors

For out-of-memory errors:

- Analyze garbage collection logs.
- Review memory usage trends in monitoring tools.
- Check for memory leaks using profiling tools.
- Adjust JVM memory settings if necessary.

3. Database-Related Issues

When troubleshooting database problems:

- Monitor database connection pool metrics.
- Analyze slow query logs.
- Check for database locks or deadlocks.
- Review database configuration settings.

4. Integration and Plugin Issues

For issues related to integrations or plugins:

- Check plugin logs for errors.
- Review recent plugin updates or configuration changes.
- Disable suspect plugins to isolate the issue.
- Monitor plugin-specific metrics if available.

The section below covers some of the best practices for effective monitoring.

Best Practices for Effective Monitoring

- Establish baselines: Create performance baselines during normal operations to quickly identify deviations.
- Set up alerts: Configure alerts for critical metrics to enable rapid response to issues.
- Use dashboards: Create custom dashboards for different teams (e.g., operations, development, management) to provide relevant insights.
- Regular health checks: Perform periodic health checks using Atlassian's built-in tools and third-party monitoring solutions.
- Monitor trends: Look for long-term trends in performance metrics to address potential issues proactively.
- Correlate data: Use tools like PerfStack to correlate configuration changes with performance metrics.
- Continuous improvement: Review and refine your monitoring strategy based on lessons learned from past incidents.

Conclusion

Effective monitoring and troubleshooting of Atlassian tools necessitates a blend of built-in features, third-party tools, and best practices. Organizations can ensure optimal performance, minimize downtime, and provide the best possible user experience by implementing a comprehensive monitoring strategy. Remember that monitoring is an ongoing process. As your Atlassian environments evolve, so should your monitoring and troubleshooting approaches. Keep yourself updated on new tools and techniques, and be ready to adapt your strategy as necessary to align with your organization's evolving needs.
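As a concrete footnote to the Prometheus and Grafana setup mentioned above, here is a minimal sketch of a Prometheus scrape configuration. The job name, target host, and metrics path are assumptions: the actual endpoint depends on how you expose the metrics (for example, a Prometheus exporter app for Jira/Confluence or a JMX exporter running alongside the JVM), so follow Atlassian's documentation for the exact path.

YAML

# Hypothetical prometheus.yml fragment
scrape_configs:
  - job_name: "jira"
    metrics_path: /metrics        # placeholder; depends on the exporter you use
    scrape_interval: 30s
    static_configs:
      - targets: ["jira.example.com:9000"]   # placeholder host:port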
Distributed caching is a method for storing and managing data across multiple servers, ensuring high availability, fault tolerance, and improved read/write performance. In cloud environments like AWS (Amazon Web Services), distributed caching is pivotal in enhancing application performance by reducing database load, decreasing latency, and providing scalable data storage solutions.

Understanding Distributed Caching

Why Distributed Caching?

With applications increasingly requiring high-speed data processing, traditional single-node caching systems can become bottlenecks. Distributed caching helps overcome these limitations by partitioning data across multiple servers, allowing simultaneous read/write operations, and eliminating points of failure associated with centralized systems.

Key Components

In a distributed cache, data is stored in a server cluster. Each server in the cluster stores a subset of the cached data. The system uses hashing to determine which server will store and retrieve a particular data piece, thus ensuring efficient data location and retrieval.

AWS Solutions for Distributed Caching

Amazon ElastiCache

Amazon ElastiCache is a popular choice for implementing distributed caching on AWS. It supports key-value data stores and offers two engines: Redis and Memcached.

Redis

ElastiCache for Redis is a fully managed Redis service that supports data partitioning across multiple Redis nodes, a feature known as sharding. This service is well-suited for use cases requiring complex data types, data persistence, and replication.

Memcached

ElastiCache for Memcached is a high-performance, distributed memory object caching system. It is designed for simplicity and scalability and focuses on caching small chunks of arbitrary data from database calls, API calls, or page rendering.

DAX: DynamoDB Accelerator

DAX is a fully managed, highly available, in-memory cache for DynamoDB. It delivers up to a 10x read performance improvement, even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables without requiring developers to manage cache invalidation, data population, or cluster management.

Implementing Caching Strategies

Write-Through Cache

In this strategy, data is simultaneously written into the cache and the corresponding database. The advantage is that the data in the cache is never stale, and the read performance is excellent. However, write performance might be slower as the cache and the database must be updated together.

Lazy-Loading (Write-Around Cache)

With lazy loading, data is only written to the cache when it's requested by a client. This approach reduces the data stored in the cache, potentially saving memory space. However, it can result in stale data and cache misses, where requested data is unavailable.

Cache-Aside

In the cache-aside strategy, the application is responsible for reading from and writing to the cache. The application first attempts to read data from the cache. If the data is not found (a cache miss), it's retrieved from the database and stored in the cache for future requests.

TTL (Time-To-Live) Eviction

TTL eviction is crucial for managing the lifecycle of data in caches. Assigning a TTL value to each data item automatically evicts items from the cache once the TTL expires. This strategy is useful for ensuring that data doesn't occupy memory space indefinitely and helps manage cache size.
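As a rough sketch of the cache-aside pattern with TTL eviction described above, the following Python example uses the redis client library against an ElastiCache for Redis endpoint. The endpoint, key naming, and 300-second TTL are assumptions for illustration; load_user_from_db stands in for your real database query.

Python

# Minimal cache-aside sketch with a TTL. Assumes `pip install redis` and a
# reachable Redis/ElastiCache endpoint; values here are placeholders.
import json
import redis

cache = redis.Redis(host="my-cache.example.amazonaws.com", port=6379)  # placeholder endpoint
TTL_SECONDS = 300

def load_user_from_db(user_id: int) -> dict:
    # Stand-in for the real database call.
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)                              # 1. try the cache first
    if cached is not None:
        return json.loads(cached)                        # cache hit
    user = load_user_from_db(user_id)                    # 2. cache miss: read the database
    cache.setex(key, TTL_SECONDS, json.dumps(user))      # 3. populate the cache with a TTL
    return user

The same read-on-miss shape applies to Memcached; with DAX, the client library performs the read-through behavior for you, which is why no explicit cache-population code is needed there.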
Monitoring and Optimization

Monitoring With Amazon CloudWatch

Amazon CloudWatch provides monitoring services for AWS cloud resources. With CloudWatch, you can collect and track metrics, collect and monitor log files, and set alarms. For distributed caching, CloudWatch allows you to monitor metrics like cache hit-and-miss rates, memory usage, and CPU utilization.

Optimization Techniques

To maximize the efficiency of a distributed cache, consider data partitioning strategies, load balancing, read replicas for scaling read operations, and implementing failover mechanisms for high availability. Regular performance testing is also critical to identify bottlenecks and optimize resource allocation.

FAQs

How Do I Choose Between ElastiCache Redis and ElastiCache Memcached?

Your choice depends on your application's needs. Redis is ideal if you require support for rich data types, data persistence, and complex operational capabilities, including transactions and pub/sub messaging systems. It's also beneficial for scenarios where automatic failover from a primary node to a read replica is crucial for high availability. On the other hand, Memcached is suited for scenarios where you need a simple caching model and horizontal scaling. It's designed for simplicity and high-speed caching for large-scale web applications.

What Happens if a Node Fails in My Distributed Cache on AWS?

For ElastiCache Redis, AWS provides a failover mechanism. If a primary node fails, a replica is promoted to be the new primary, minimizing the downtime. However, for Memcached, the data in the failed node is lost, and there's no automatic failover. In the case of DAX, it's resilient because the service automatically handles the failover seamlessly in the background, redirecting the requests to a healthy node in a different Availability Zone if necessary.

How Can I Secure My Cache Data in Transit and at Rest on AWS?

AWS supports in-transit encryption via SSL/TLS, ensuring secure data transfer between your application and the cache. For data at rest, ElastiCache for Redis provides at-rest encryption to protect sensitive data stored in cache memory and backups. DAX also offers similar at-rest encryption. Additionally, both services integrate with AWS Identity and Access Management (IAM), allowing detailed access control to your cache resources.

How Do I Handle Cache Warming in a Distributed Caching Environment?

Cache warming strategies depend on your application's behavior. After a deployment or node restart, you can preload the cache with high-usage keys, ensuring hot data is immediately available. Automating cache warming processes through AWS Lambda functions triggered by specific events is another efficient approach. Alternatively, a gradual warm-up during the application's standard operations is simpler but may lead to initial performance dips due to cache misses.

Can I Use Distributed Caching for Real-Time Data Processing?

Yes, both ElastiCache Redis and DAX are suitable for real-time data processing. ElastiCache Redis supports real-time messaging and allows the use of data structures and Lua scripting for transactional logic, making it ideal for real-time applications. DAX provides microsecond latency performance, which is crucial for workloads requiring real-time data access, such as gaming services, financial systems, or online transaction processing (OLTP) systems. However, the architecture must ensure data consistency and efficient read-write load management for optimal real-time performance.
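Returning to the CloudWatch monitoring discussed at the beginning of this section, the same metrics can also be pulled programmatically. The sketch below uses boto3 to fetch hourly cache hit and miss statistics for an ElastiCache Redis cluster; the cluster ID and region are placeholders, and CacheHits/CacheMisses are the standard ElastiCache Redis metric names.

Python

# Minimal sketch: read ElastiCache cache-hit/miss metrics from CloudWatch.
# Assumes boto3 is installed and AWS credentials are configured; the cluster id
# and region are placeholders.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def average_metric(metric_name: str, cluster_id: str) -> list:
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric_name,
        Dimensions=[{"Name": "CacheClusterId", "Value": cluster_id}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,                 # 5-minute buckets
        Statistics=["Average"],
    )
    return response["Datapoints"]

for name in ("CacheHits", "CacheMisses"):
    print(name, average_metric(name, "my-redis-cluster"))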
Conclusion

Implementing distributed caching on AWS can significantly improve your application's performance, scalability, and availability. By leveraging AWS's robust infrastructure and services like ElastiCache and DAX, businesses can meet their performance requirements and focus on building and improving their applications without worrying about the underlying caching mechanisms. Remember, the choice of the caching strategy and tool depends on your specific use case, data consistency requirements, and your application's read-write patterns. Continuous monitoring and optimization are key to maintaining a high-performing distributed caching environment.
In the summer of 2023, my team entered a code yellow to stabilize the health of the service we own. This service powers the visualizations on the dashboard product. The decision was made following high-severity incidents impacting the availability of the service. For context, the service provides aggregation data to dashboard visualizations by gathering aggregations through the data pipeline. This is a critical-path service for dashboard rendering. Any impact on the availability of this service manifests itself as dashboard viewers experiencing delays in rendering visualizations and, in some cases, rendering failures.

Exit Criteria for Code Yellow

Fix the scaling issues on the service using the below mechanisms:

- Enhance observability into service metrics
- Implement service protection mechanisms
- Investigate and implement asynchronous execution of long-running requests
- Mitigation mechanisms to recover the service in under 10 minutes

Execution Plan

1. Enhance observability into service metrics

- Enhanced service request body logging
- Replaying traffic to observe patterns after the incident. Since this service was read-only and the underlying data was not being modified, we could rely on the replays.

2. Service protection mechanisms

- An auto restart of the child thread on request hangs
- Request throttling to protect against overwhelming the service

3. Investigate and implement asynchronous execution of long-running requests

- Replace deprecated packages (Request with Axios)
- Optimize slow-running operations

Key Takeaways

- Enhanced tooling helped to isolate problematic requests and requests with poor time complexity blocking the Node event loop. We set up traffic capture and replay of the traffic on demand, as well as CPU profiling and distributed tracing as observability improvements.
- Optimizations on the critical path: Efforts to optimize operations on the critical path have yielded ~40% improvement in the average latencies across DCs. These efforts include (but are not limited to) package upgrades such as the replacement of Request with Axios (a Promise-based HTTP client), caching optimizations such as fixing unintentional cache misses, and identification of caching opportunities.
- Scale testing and a continuous load testing framework are set up to monitor the service's scale needs.
- Mitigation mechanisms rolled out. This included node clustering mode with an auto restart of the thread when the event loop gets blocked. Request throttling is implemented to protect the service in case of bad requests.
- Better alert configuration, bringing the time to detect anomalies below 5 minutes in the most recent incidents.
- Clear definition of Sev-1/Sev-2 criteria: We now have clear sev-1/sev-2 criteria defined. This is to help on-calls quickly assess whether the system is in a degraded state and whether or not they need to pull the sev-2 trigger to get help.

Next Steps

- To further the momentum of operational excellence, the plan is to perform quarterly resiliency game days to find the system's weaknesses and respond gracefully in the event of failures.
- Re-evaluate the North Star architecture of the service to meet the scaling needs of the future.

At this point, I feel more confident in our overall operational posture and better equipped to deal with potential incidents in the future. At the same time, I recognize that operational improvements are a continuous process, and we will continue to build on top of the work done as part of Code Yellow. All the opinions are mine and are not affiliated with any product/company.
Recently, I encountered a task where a business was using AWS Elastic Beanstalk but was struggling to understand the system state due to the lack of comprehensive metrics in CloudWatch. By default, CloudWatch only provides a few basic metrics such as CPU and network utilization; Memory and Disk metrics are not included in the default collection. Fortunately, each Elastic Beanstalk virtual machine (VM) comes with a CloudWatch agent that can easily be configured to collect additional metrics. For example, if you need information about VM memory consumption, which AWS does not provide out of the box, you can configure the CloudWatch agent to collect this data. This can greatly enhance your visibility into the performance and health of your Elastic Beanstalk environment, allowing you to make informed decisions and optimize your application's performance.

How To Configure Custom Metrics in AWS Elastic Beanstalk
To accomplish this, you'll need to edit your Elastic Beanstalk zip bundle and include a cloudwatch.config file in the .ebextensions folder at the top of your bundle. Please note that the configuration file should be chosen based on your operating system, as described in this article. By doing so, you'll be able to customize the CloudWatch agent settings and enable the collection of additional metrics, such as memory consumption, to gain deeper insights into your Elastic Beanstalk environment. This will allow you to effectively monitor and optimize the performance of your application on AWS.

Linux-Based Config:

YAML
files:
  "/opt/aws/amazon-cloudwatch-agent/bin/config.json":
    mode: "000600"
    owner: root
    group: root
    content: |
      {
        "agent": {
          "metrics_collection_interval": 60,
          "run_as_user": "root"
        },
        "metrics": {
          "append_dimensions": {
            "InstanceId": "$${aws:InstanceId}"
          },
          "metrics_collected": {
            "mem": {
              "measurement": [
                "mem_total",
                "mem_available",
                "mem_used",
                "mem_free",
                "mem_used_percent"
              ]
            }
          }
        }
      }
container_commands:
  apply_config_metrics:
    command: /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

Windows-Based Config:

YAML
files:
  "C:\\Program Files\\Amazon\\AmazonCloudWatchAgent\\cw-memory-config.json":
    content: |
      {
        "agent": {
          "metrics_collection_interval": 60,
          "run_as_user": "root"
        },
        "metrics": {
          "append_dimensions": {
            "InstanceId": "$${aws:InstanceId}"
          },
          "metrics_collected": {
            "mem": {
              "measurement": [
                "mem_total",
                "mem_available",
                "mem_used",
                "mem_free",
                "mem_used_percent"
              ]
            }
          }
        }
      }
container_commands:
  01_set_config_and_reinitialize_cw_agent:
    command: powershell.exe cd 'C:\Program Files\Amazon\AmazonCloudWatchAgent'; powershell.exe -ExecutionPolicy Bypass -File ./amazon-cloudwatch-agent-ctl.ps1 -a append-config -m ec2 -c file:cw-memory-config.json -s; powershell.exe -ExecutionPolicy Bypass -File ./amazon-cloudwatch-agent-ctl.ps1 -a start; exit

As you may have noticed, I enabled only a few memory-related metrics such as mem_total, mem_available, mem_used, mem_free, and mem_used_percent; you can enable more metrics as needed. The complete list of available metrics can be found here. Once you have updated your application, it would be beneficial to create a CloudWatch dashboard to visualize these metrics. To do so, navigate to the AWS CloudWatch console, select Dashboards, and click Create dashboard. From there, you can add a widget by clicking the Add widget button and selecting Line to create a line chart that displays the desired metrics.
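If you prefer to script the dashboard rather than click through the console, the sketch below uses boto3 to create a single line-chart widget for the new CWAgent memory metric. It is only a sketch: the dashboard name, region, and instance ID are hypothetical placeholders and are not part of the original walkthrough.

Python
import json
import boto3

# Assumed region; use the region where your Beanstalk environment runs.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# One line-chart widget plotting memory usage reported by the CloudWatch agent.
dashboard_body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Beanstalk instance memory",
                "view": "timeSeries",
                "region": "us-east-1",
                # "i-0123456789abcdef0" is a placeholder instance ID.
                "metrics": [["CWAgent", "mem_used_percent", "InstanceId", "i-0123456789abcdef0"]],
                "period": 60,
                "stat": "Average"
            }
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="beanstalk-memory",  # hypothetical dashboard name
    DashboardBody=json.dumps(dashboard_body)
)

This is equivalent to the console flow described above and is convenient when you want the dashboard recreated automatically as part of your deployment pipeline.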
Customizing a dashboard with relevant metrics can provide valuable insights into the performance and health of your Elastic Beanstalk environment, making it easier to monitor and optimize your application on AWS. In the example above, we'll see five new metrics under the CWAgent namespace. Based on them, we can configure a memory widget for the dashboard.

Final Thoughts
Feel free to explore the wide variety of metrics and widgets available in CloudWatch to further customize your dashboard. If you have any questions or need assistance, don't hesitate to ask in the comments.
In today's world of IT digital transformation, more applications are hosted in cloud environments every day. Monitoring and maintaining these applications is challenging, and we need proper metrics in place to measure performance and act on it. This is where SLAs, SLOs, and SLIs come into the picture: they enable effective monitoring and help maintain system performance.

Defining SLA, SLO, SLI, and SRE

What Is an SLA? (Commitment)
A Service Level Agreement is an agreement between the cloud provider and the client/user about measurable metrics, such as uptime. It is normally handled by the company's legal department as per business and legal terms. It covers all the factors considered part of the agreement and the consequences if it is breached, such as credits or penalties. It typically applies to paid services, not free ones.

What Is an SLO? (Objective)
A Service Level Objective is an objective the cloud provider must meet to satisfy the agreement made with the client. It states the specific, individual metric targets (e.g., availability) that the provider must meet to satisfy the client's expectations. Clear SLOs help improve overall service quality and reliability.

What Is an SLI? (How Did We Do?)
A Service Level Indicator is the actual measurement used to check compliance with an SLO. It gives a quantified view of the service's performance (e.g., a measured availability of 99.92%).

Who Is an SRE?
A Site Reliability Engineer is an engineer who works to minimize the gap between software development and operations. The role is related to DevOps, which focuses on identifying those gaps. An SRE builds and uses automation tools to monitor and observe software reliability in production environments.

In this article, we will discuss the importance of SLAs/SLOs/SLIs and how an SRE implements them for production applications.

Implementation of SLOs and SLIs
Let's assume we have an application service that is up and running in a production environment. The first step is to determine what an SLO should be and what it should cover.

Example of SLOs
SLO = Target
- Above this target: GOOD
- Below this target: BAD; needs an action item

When setting a target, do not aim for 100% reliability. It is practically impossible, mostly because of patches, deployments, downtime, etc. This is where the Error Budget (EB) comes into the picture. The error budget is the maximum amount of time that a service can fail without contractual consequences. For example:

SLA = 99.99% uptime
EB = roughly 52 minutes and 35 seconds per year, or 4 minutes and 23 seconds per month, that the system can be down without consequences.

The next step is how to measure this SLO, and that is where the SLI comes into the picture: an indicator of the level of service that you are providing.

Example of SLIs
HTTP request SLI = number of successful requests / total requests

Common SLI Metrics
- Durability
- Response time
- Latency
- Availability
- Error rate
- Throughput

Leverage automated monitoring and reporting tools to track SLIs and detect deviations from SLOs in real time (e.g., Prometheus, Grafana).
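To make the SLI and error budget calculations concrete, here is a minimal Python sketch. It assumes you already export success/total request counts from your monitoring stack; the request counts and the 99.99% target are illustrative values, not figures from a real service.

Python
# Availability SLI and error budget: a minimal illustration.

SLO_TARGET = 0.9999            # 99.99% availability objective
PERIOD_MINUTES = 30 * 24 * 60  # a 30-day rolling window

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI = successful requests / total requests."""
    return successful_requests / total_requests if total_requests else 1.0

def error_budget_remaining(sli: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failure = 1.0 - SLO_TARGET
    actual_failure = 1.0 - sli
    return 1.0 - (actual_failure / allowed_failure)

# Illustrative request counts for the window.
sli = availability_sli(successful_requests=9_999_500, total_requests=10_000_000)
print(f"SLI: {sli:.4%}")
print(f"Error budget for the window: {(1 - SLO_TARGET) * PERIOD_MINUTES:.1f} minutes")
print(f"Error budget remaining: {error_budget_remaining(sli):.1%}")

When the remaining budget approaches zero, that is the usual signal to slow down releases and prioritize reliability work over new features.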
Category | SLO | SLI
Availability | 99.92% uptime per month | X% of the time the application is available
Latency | 92% of requests with a response time under 240 ms | X average response time for user requests
Error rate | Less than 0.8% of requests result in errors | X% of requests that fail

Challenges
SLA: Normally, SLAs are written by business or legal teams with no input from technical teams, which results in missing key aspects to measure.
SLO: Objectives that cannot be measured, or that are too broad to calculate.
SLI: There are too many metrics, and differences in how the measures are captured and calculated. This demands a lot of effort from SREs and yields less beneficial results.

Best Practices
SLA: Involve the technical team when SLAs are written by the company's business/legal team and the provider. This helps reflect the actual technical scenarios in the agreement.
SLO: Keep objectives simple and easily measurable, so you can check whether you are in line with them.
SLI: Define a standard set of metrics to monitor and measure. This helps SREs check the reliability and performance of the services.

Conclusion
Implementation of SLAs, SLOs, and SLIs should be included as part of system requirements and design, and it should be continuously improved. SREs need to understand and take responsibility for how their systems serve the business needs and take the necessary measures to minimize impact.
Joana Carvalho, Observability and Monitoring Specialist, Sage
Eric D. Schabell, Director Technical Marketing & Evangelism, Chronosphere
Chris Ward, Zone Leader, DZone