What Is a Performance Engineer and How to Become One: Part 2
Learn more about some of the essential skills required for performance engineers to meet the current expectations and needs of companies and stakeholders.
(Note: This is a continuation of the article "What Is a Performance Engineer and How to Become One: Part 1.")
Create a Monitoring Framework/Strategy
A monitoring strategy is a plan for monitoring, measuring trends, and analyzing your systems, apps, components, and websites. A good monitoring system provides real-time visibility into the application's availability, performance, and reliability as experienced by end users, while logs, traces, and metrics offer a first look inside the application so performance engineers can fix problems early. Performance engineers must understand how to:
- Define goals and objectives
- Identify what to monitor (e.g., critical metrics)
- Set benchmarks, thresholds, and alerts
- Create dashboards
- Choose the right monitoring tool
- Collect data on monitored metrics
- Analyze the data regularly
- Develop an effective incident response plan
- Implement real-time monitoring
- Identify trends and patterns
They should continuously optimize the monitoring framework to maximize its effectiveness, ensuring optimal performance, preventing costly downtime, and creating a positive user experience. Finally, if you keep your systems in check, they will keep your business ahead.
Know How to Set Up Dashboards for Server Metrics and Application Logs
Setting up dashboards for server metrics and logs depends on how structured your logs are, how deeply nested the metrics are, and how your data would correlate and provide business value. Understanding the key metrics and logs an organization wants/needs to track is probably the most critical aspect of creating meaningful dashboards. A solid dashboard can only be created by combining the needs and preferences of business users with the data, visualization, and automation capabilities of a dashboard solution.
Performance engineers must spend as much time asking questions and learning about the business's needs as they spend building the dashboard itself. They also need to understand how to monitor and collect statistics for their applications' custom metrics and for metrics not provided by Amazon Web Services (AWS) by default, for example. Creating a metric filter for the appropriate log group where the logs are captured, knowing the metric details, assigning the metrics, graphing the relevant metric, and adding it to the dashboard are key concepts for performance engineers to know. The most common approach is to start from one of the hundreds of pre-defined templates and use the system to correlate your logs, traces, and metrics.
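As a concrete illustration of the metric-filter-to-dashboard flow, here is a minimal sketch assuming AWS CloudWatch via boto3. The log group name, metric names, and dashboard name are hypothetical; the dashboard body follows CloudWatch's documented JSON widget structure, and the actual AWS calls are shown commented out since they require credentials:

```python
import json

def build_dashboard_body(metric_namespace, metric_name, region="us-east-1"):
    """Build a CloudWatch dashboard body with one line-graph widget
    plotting a custom metric (per the CloudWatch dashboard JSON schema)."""
    return json.dumps({
        "widgets": [{
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "metrics": [[metric_namespace, metric_name]],
                "period": 60,
                "stat": "Average",
                "region": region,
                "title": f"{metric_name} (1-min avg)",
            },
        }]
    })

# Deployment steps (require AWS credentials; names below are hypothetical):
# import boto3
# boto3.client("logs").put_metric_filter(
#     logGroupName="/app/orders",
#     filterName="OrderErrors",
#     filterPattern='"ERROR"',
#     metricTransformations=[{
#         "metricName": "OrderErrorCount",
#         "metricNamespace": "MyApp",
#         "metricValue": "1",
#     }],
# )
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="orders-performance",
#     DashboardBody=build_dashboard_body("MyApp", "OrderErrorCount"),
# )

body = build_dashboard_body("MyApp", "OrderErrorCount")
print(body)
```

The same body-building function can be reused to generate one widget per metric, which keeps dashboards consistent across services.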
Propose Effective Performance Optimization Solutions
Proposing effective performance optimization solutions for any application or system is not a straightforward exercise for performance engineers, and it requires considerable skill and experience. A common root cause of performance problems is choosing the wrong algorithm for the task, which is sometimes a consequence of choosing the wrong data structures. Performance engineers and developers need to figure out where the bottlenecks are and focus their time on them.
Once we know what to optimize, we need to figure out the limitations: is it algorithmic complexity, disk I/O, thrashing memory, etc.? Implementing a permanent performance fix requires a deep understanding of the architecture design and a thorough analysis of hardware and software components to identify areas that need improvement.
There is no point in optimizing parts of the code that do not contribute significantly to your application's overall CPU usage, disk I/O, or responsiveness. With the help of metrics, logs, and traces from monitoring tools, we can first isolate the guilty layer or component in the architecture and then start fine-tuning it for the desired performance.
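Finding where time actually goes, before optimizing anything, can be as simple as profiling. Here is a small self-contained sketch using Python's built-in cProfile; the quadratic string-building function is a deliberately contrived hotspot for illustration:

```python
import cProfile
import io
import pstats

def slow_concat(n):
    # Quadratic string building: a classic avoidable hotspot.
    s = ""
    for i in range(n):
        s += str(i)
    return s

def fast_concat(n):
    # Linear alternative using join.
    return "".join(str(i) for i in range(n))

def top_functions(func, *args):
    """Profile func and return the pstats report text so the
    hottest calls can be read off before touching any code."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return buf.getvalue()

report = top_functions(slow_concat, 5000)
print(report)
```

The report immediately shows which function dominates cumulative time, which is exactly the evidence needed to justify replacing it with the linear version.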
In today's constantly changing, complex distributed environments, a structured performance optimization approach must be taken for all aspects of the system to avoid quick-and-dirty performance fixes. Performance engineers should not overlook the initial analysis; they should focus on optimizing the common cases and, for example, consider the impact of the newest part of the system first.
Build a CI/CD Pipeline for Doing Performance Testing
A continuous integration and continuous deployment (CI/CD) pipeline in performance testing is an automated process that runs performance checks at each stage of software development. Performance engineers must choose the right tools (e.g., LoadRunner, JMeter, Gatling, BlazeMeter, etc.) to play nice with CI/CD performance testing, which integrates load, stress, and spike tests into your development process. This helps to catch performance issues early and prevent costly fixes and downtime.
CI/CD pipelines automate the process of merging code changes, testing, and deploying applications, which results in faster and more consistent software releases.
The primary goals of developing a CI/CD pipeline for performance testing are to:
- Understand the importance and benefits of incorporating performance tests into CI/CD workflows.
- Become more familiar with the tools and technologies that support CI/CD performance testing.
- Analyze test results before changes hit production.
Additionally, performance engineers should be ready to change the test approach as the application evolves and new features roll out. In many cases, the tests should change and be reviewed regularly.
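One way to wire the goals above into a pipeline is a performance gate: a script the CI stage runs after the load test, failing the build when an agreed budget is breached. The sketch below assumes a JMeter-style results CSV with an `elapsed` column; the column name and budget are illustrative:

```python
import csv
import io
import statistics

def performance_gate(results_csv, p95_budget_ms):
    """Return (passed, p95): fail the pipeline stage if the 95th-percentile
    response time in a JMeter-style results CSV exceeds the agreed budget."""
    latencies = [
        float(row["elapsed"])
        for row in csv.DictReader(io.StringIO(results_csv))
    ]
    latencies.sort()
    # statistics.quantiles with n=20 yields 19 cut points; index 18 is p95.
    p95 = statistics.quantiles(latencies, n=20)[18]
    return p95 <= p95_budget_ms, p95

# Hypothetical results exported by the load-test stage.
sample = "elapsed\n" + "\n".join(str(100 + i) for i in range(100))
passed, p95 = performance_gate(sample, p95_budget_ms=250)
print(f"p95={p95:.0f} ms, gate {'passed' if passed else 'FAILED'}")
```

In a real pipeline the script would read the results file from the test stage's artifacts and exit non-zero on failure so the build stops.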
Gain Knowledge of Microservices
Firstly, as performance engineers, we need to learn about an application's performance problems as early as possible in the software development life cycle. Software performance testing and engineering are challenging, and in some ways even more difficult than application correctness testing, which is itself extremely difficult.
If you're developing a new application or microservice, performance testing is a task to do a little of every single day, even before you begin writing code. Performance engineers and developers should profile code as they write it so that they are continuously aware of how it spends its users' time.
Microservice applications evolved from monolithic architectures to better address application responses to highly erratic user traffic. As applications have grown in size and complexity, development time and cost have also increased. Besides that, we can notice many major difficulties in scaling, synchronizing changes, delivering new features, and replacing frameworks or libraries. In this scenario, the microservices architecture is a possible solution for overcoming these difficulties.
In a microservices application, many supporting services operate faster or slower depending on total application traffic or the state of their resources. With microservices performance testing, it is important to verify that tests cover all functionality in the service and exercise it with a variety of valid and invalid arguments.
If the microservice is being used in a repetitive or bundled manner, it is good to check its performance as well. This ensures that the overall rendered time for any transaction is not affected. To test that, we must:
- Create a single-user script with the microservice.
- Execute it for a definitive period of time using your load-testing tool, such as LoadRunner or RPT.
- Check the average response times, throughput, and performance degradations.
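The three steps above can be sketched as a minimal single-user harness. This is a stand-in for a LoadRunner/RPT script, with a hypothetical stubbed endpoint where a real HTTP call to the microservice would go:

```python
import statistics
import time

def run_load(call, duration_s=1.0):
    """Invoke `call` repeatedly for duration_s seconds and report average
    response time, worst case, and throughput for a single virtual user."""
    latencies = []
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        t0 = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "avg_ms": statistics.mean(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
        "throughput_rps": len(latencies) / elapsed,
    }

def fake_endpoint():
    # Stand-in for a call to the microservice under test; swap in a real
    # HTTP request against the service's endpoint.
    time.sleep(0.002)  # simulate a ~2 ms service response

stats = run_load(fake_endpoint, duration_s=0.5)
print(stats)
```

Comparing `avg_ms` and `max_ms` across runs (and against the bundled/repetitive usage pattern) surfaces the degradations the bullet list asks us to check for.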
Mentor and Train People for a High-Performance Culture
High performance is about helping people become more of what they already are, not what we think they should be. One way we achieve a high-performance culture is by accomplishing big things together, which is why a shared mission, shared goals, and core values are so important for a team. This begins with hiring the right people for the job, but it goes far beyond that. A high-performance culture involves being passionate about creating an atmosphere that encourages and inspires employees to succeed. We must build teamship by grouping people in ways that maximize training and skill-building, and by occasionally spending fun time together. Keep reading and stay knowledgeable: staying up to date on everything around you determines whether you add a burden to your clients' shoulders or help strip it away.
Get Familiar With Chaos/Resilience Engineering Tools
How do you implement chaos engineering for your organization? There's lots of discussion along the lines of "You are not Netflix; you don't need it." I think that misses the point of chaos engineering. Your production systems will have failures and downtime, and if those failures impact your business, chaos engineering is about surfacing and correcting latent issues while everyone is around, rather than waiting for them to occur on their own.
Netflix and Gremlin are two companies I know to have strong chaos engineering practices, especially with tools like Gremlin, Chaos Monkey, and Kube-monkey. For example, Netflix runs over 1,000 chaos experiments daily, which has helped reduce outages by 70% since 2018. Another example is Amazon's 2018 Prime Day sale, which saw a 54% increase in orders per second compared to the previous year. Without rigorous performance testing and chaos testing, their systems might have buckled under the pressure.
I think it's very important for performance engineers to learn chaos tools when evaluating practices like chaos engineering, both to understand whether their own business can benefit from it and because such tools are especially appropriate for very large service providers with strict uptime requirements.
Why Should Every Performance Engineer Learn Chaos Engineering and Its Tools?
- If a performance testing team has a short turnaround for a deadline, corners get cut on the product/application, often at the expense of quality protocols and testing time.
- As performance engineers, sometimes we test the minimum features and hope they will work with the desired performance. This is unacceptable, and we often end up with bad performance.
- In software, once your application is labeled as bad with respect to performance, you rarely get a second chance to achieve the expected ROI.
In such scenarios, performance engineers can craft effective chaos tests, starting from minimal scenarios and building toward real-world complexity, to understand what will happen when you run a certain attack, what could go wrong, what can actually happen, and so on. While dedicated tooling can require a big investment, it's easy to get started with manual chaos testing as well. Simple attacks to start with include:
- Restarting a process
- Rebooting a host
- Introducing network latency or isolation to check the resilience of the systems
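The latency attack in the last bullet can be prototyped without any tooling at all. Below is an illustrative sketch: a wrapper that injects random delay into a dependency call (the `lookup` stub is a hypothetical stand-in for a cache, database, or downstream service), which lets you observe whether the caller's timeout and retry logic hold up:

```python
import random
import time

def with_latency(call, min_delay_s=0.01, max_delay_s=0.05, seed=None):
    """Wrap a service call so every invocation suffers a random extra
    delay: the smallest useful chaos attack for exercising timeouts."""
    rng = random.Random(seed)
    def chaotic(*args, **kwargs):
        time.sleep(rng.uniform(min_delay_s, max_delay_s))
        return call(*args, **kwargs)
    return chaotic

def lookup(key):
    # Stand-in for a downstream dependency (cache, database, other service).
    return {"a": 1, "b": 2}.get(key)

chaotic_lookup = with_latency(lookup, seed=42)
t0 = time.perf_counter()
result = chaotic_lookup("a")
elapsed = time.perf_counter() - t0
print(result, f"{elapsed * 1000:.1f} ms")
```

Production-grade equivalents (Gremlin, Chaos Monkey) apply the same idea at the network or host level rather than in-process.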
Know What the Operating Systems Are Doing
A performance engineer should have a good understanding of the operating system used by their applications. They need to know how changes in the OS settings can impact the application's performance under test.
The users using the applications will have multiple configurations; they vary based on different operating systems (Windows, RHEL, CentOS, Ubuntu, etc.), and all of these need to be tested to check for better performance.
As performance engineers, we should understand how operating systems work, especially when:
- Writing a multi-threaded app
- Running a cron job with the help of scheduling
- Building a distributed app
- Dealing with memory management
- Developing an Android app
These activities all require knowledge of the OS to ensure proper interaction with system resources.
For example, as many performance engineers now work on the engineering side, performing analysis across many servers to predict and fix performance failures, we need to understand how the OS functions and how it interacts with I/O to make sure data is collected properly.
When an application is load tested, the programs being interpreted, compiled, or executed inside browsers and servers are all managed by the operating system, which allocates the CPU cycles and memory each program needs to run.
Performance engineers need to understand how an OS manages resources like memory, CPU, and I/O devices. This knowledge helps in:
- Optimizing application performance issues
- Optimizing resource usage
- Debugging and troubleshooting issues related to system performance, crashes, and resource leaks
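The OS exposes these resource counters directly; the sketch below reads a process's own CPU time, peak memory, and page-fault counts via Python's standard `resource` module (Unix-only). These are the same counters a monitoring agent samples when tracking a server under load:

```python
import resource
import sys

def snapshot_usage():
    """Read the process's resource consumption from the OS: CPU time,
    peak resident set size, and major page faults."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux but bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return {
        "user_cpu_s": ru.ru_utime,
        "system_cpu_s": ru.ru_stime,
        "peak_rss_bytes": ru.ru_maxrss * scale,
        "major_page_faults": ru.ru_majflt,
    }

# Allocate some memory, then observe the effect in the counters.
_buf = [0] * 1_000_000
usage = snapshot_usage()
print(usage)
```

A spike in `major_page_faults` during a load test, for instance, points at memory pressure rather than CPU, which is exactly the kind of distinction this section argues performance engineers must be able to make.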
You do not generally need deep operating system expertise, but you will find it useful in many of the situations that need the most attention.
Experience With SQL and NoSQL Databases
If you are serious about advancing your career in performance engineering, especially on the technical ladder, then experience with both SQL and NoSQL databases is a must, as we work in distributed environments with a different tech stack every time. Learning more about databases is never a waste of time for performance engineers. One can't say that SQL is better or NoSQL is better; it depends entirely on your application's needs and requirements.
For example, if the data is continuously growing at a high rate, then we should use NoSQL; otherwise, we can go for SQL, though this is not a blanket recommendation. SQL databases are vertically scalable, which means that an increase in load can be managed by adding CPU, RAM, SSD, etc., on a single server. By contrast, NoSQL databases scale horizontally, by increasing the number of servers to handle the increase in traffic.
Performance engineers must gain extensive knowledge of how the application stack creates and manages database connections: how the application uses connection pools, properly disposes of connections, performs updates/inserts/deletes, and commits transactions. They should also become proficient in optimizing long-running queries (e.g., adding the right indexes, avoiding full table scans, using joins effectively).
From a performance engineering perspective, there may not be a single root cause, and problems often involve multiple issues. Hence, performance engineers need to understand database design, execution plans, the role of indexes and their tradeoffs (for example, between optimized read times and write times), and caching mechanisms (e.g., Redis or Memcached, and how caching can relieve load on the database), along with tools such as SQL Profiler, Oracle AWR, and the MySQL slow query log. Finally, although performance engineers don't handle data directly, being good with these concepts is essential for addressing database performance issues.
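Execution plans and index tradeoffs are easy to demonstrate concretely. The sketch below uses an in-memory SQLite database (table and index names are illustrative) and compares the query plan for the same query before and after adding an index:

```python
import sqlite3

# In-memory stand-in for a production table; the same EXPLAIN QUERY PLAN
# check works against any SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(f"cust{i % 100}", i * 1.5) for i in range(1000)],
)

def plan(sql):
    """Return SQLite's query plan so we can see whether a query
    scans the whole table or uses an index."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " / ".join(row[3] for row in rows)

query = "SELECT total FROM orders WHERE customer = 'cust7'"
before = plan(query)  # full table scan: every row examined
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = plan(query)   # index search: only matching rows touched
print("before:", before)
print("after: ", after)
```

The tradeoff the text mentions shows up here too: the index speeds up this read, but every insert now pays to maintain it.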
Cloud Knowledge
In today's world of cloud computing, performance is a very important requirement. If you don't measure it, how would you know if you're meeting the expectations of your consumers, managers, investors, and end users? Performance testing on a cloud doesn't necessarily guarantee desired performance.
Performance engineers need to understand the need for performance testing and how it changes with changes in technology. For instance, more and more applications are being moved to the cloud.
Many cloud services available in the market already provide comprehensive performance monitoring solutions as part of the package. The key benefits of moving performance and load-based application testing to the cloud include:
- Lower capital and operational costs
- Support for distributed development and testing teams
- The ability to simulate load tests with millions of concurrent users from multiple geographical locations
Cloud is a good choice for organizations that do not want to have a full dedicated investment in testing infrastructure, as it fulfills all test environment needs and requirements. Automation and scripting are key components in cloud performance testing and engineering. For example, we can:
- Develop a script that constantly monitors system performance
- Send alerts if any issues are detected, which helps minimize downtime and improve overall system stability
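The two bullets above can be sketched as a small polling script. The metric names, thresholds, and sample source below are hypothetical; in production the sample function would read from your APM or cloud monitoring API, and the print would be a pager, Slack, or SNS notification:

```python
import time

THRESHOLDS = {"cpu_pct": 85.0, "p95_latency_ms": 500.0, "error_rate_pct": 1.0}

def check(metrics, thresholds=THRESHOLDS):
    """Compare one metrics sample against alert thresholds and
    return the list of breaches (empty means healthy)."""
    return [
        f"ALERT: {name}={value} exceeds {thresholds[name]}"
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]

def monitor(sample_fn, interval_s=60, iterations=None):
    """Poll sample_fn forever (or `iterations` times), surfacing alerts."""
    n = 0
    alerts = []
    while iterations is None or n < iterations:
        for alert in check(sample_fn()):
            print(alert)
            alerts.append(alert)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)
    return alerts

# Hypothetical sample: CPU is over budget, latency is fine.
alerts = monitor(lambda: {"cpu_pct": 91.2, "p95_latency_ms": 120.0},
                 interval_s=0, iterations=1)
```

Keeping thresholds in a single data structure makes them reviewable and easy to tune per environment.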
We must keep ourselves updated with any cloud service (AWS, Azure, GCP, etc.) of our choice, as there should be no difference in the functionality. If we are moving our existing application from physical machines to cloud VMs, it's good to:
- Test and compare the results for both
- Use similar server configurations in VMs as your physical ones to find out which performs better and why
Be Familiar With the Networking Concepts
Whether you are a small enterprise or a large organization, the performance of your network infrastructure can make or break your success. In today's interconnected world, the smooth flow of data is essential for virtually every aspect of business. Performance testing is not just a way to assess network performance; it helps identify areas where throughput falls short of expectations, causing network issues. For example, you might want to measure the throughput, jitter, packet loss, or response time of your network, and you need to specify the baseline and target values for each metric along with an acceptable range of deviation. Another challenge with most protocol-based load-testing frameworks is writing dynamic load-test scripts that involve sessions or cookies, and identifying performance bottlenecks is just as challenging for performance engineers and network administrators. In most cases, accurately assessing and measuring network performance demands significant resources and manual effort.
Numerous network conditions affect the performance of an infrastructure, such as the specifications of its routers and switches, the way it is designed and configured, the type of internet connection, and so on. As a performance engineer, you don't need to learn everything about how networks work; instead, learn how to make your application traffic more resilient. Understand how application data flows between systems: what talks to what, on which ports, which host initiates the connections, the protocols involved, the OSI layer they operate at, and their method of transport (usually UDP or TCP). This helps performance engineers understand why an application fails or times out (it might be the network), and knowing how to use tools such as ping, traceroute, netcat, or Nmap will help diagnose network performance problems between applications.
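A scriptable cousin of ping and netcat for answering "is it the network between these two tiers?" is simply timing a TCP connect. The sketch below spins up a local listener to stand in for a remote service port, so it is self-contained; against real infrastructure you would point it at the target host and port:

```python
import socket
import threading
import time

def tcp_connect_ms(host, port, timeout_s=2.0):
    """Measure TCP connection-establishment time to host:port in ms."""
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass
    return (time.perf_counter() - t0) * 1000

# Local listener standing in for a remote service port.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(5)
port = server.getsockname()[1]
threading.Thread(target=lambda: server.accept(), daemon=True).start()

latency = tcp_connect_ms("127.0.0.1", port)
print(f"connect latency: {latency:.2f} ms")
server.close()
```

Sampling this repeatedly between application tiers quickly separates "the network is slow" from "the application is slow."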
Learn How to Leverage AI
The future of performance testing is bright, with AI playing a key role in changing performance engineering. Not every performance engineer needs to learn AI in depth; instead, they should learn how AI tools provide valuable insights that help them quickly identify bottlenecks and make recommendations on application, system, and network performance. The role of AI in performance testing and engineering is expected to grow, continuing to provide deeper insights and more efficient testing processes and helping customers and businesses deliver high-performance, reliable, and resilient systems. By leveraging the AI features in various performance testing tools, performance engineers can simplify complex processes, reduce errors, and accelerate development timelines.
Many companies and stakeholders are now focusing on building AI performance testing tools that automatically create scripts, run tests and analyze results, and provide in-depth explanations of performance metrics with intelligent monitoring practices. For example, when a performance engineer runs a load test, the AI tools with new features can interpret graphs and analyze test results, identifying bottlenecks in the system under various anticipated traffic conditions. This not only helps in understanding the current performance but also helps in identifying potential bottlenecks and anomalies.
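At their core, many of these analysis features rest on statistical anomaly detection over test metrics. Here is a deliberately tiny, illustrative version (not any specific tool's algorithm): flagging latency samples that deviate from the mean by more than a z-score threshold:

```python
import statistics

def anomalies(samples, z_threshold=3.0):
    """Flag samples more than z_threshold population standard deviations
    from the mean: a toy version of the statistical anomaly detection
    applied to load-test results."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return []
    return [
        (i, x) for i, x in enumerate(samples)
        if abs(x - mean) / stdev > z_threshold
    ]

# Steady ~100 ms latencies with one spike a human might miss in a long run.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 850, 100, 101, 99]
flagged = anomalies(latencies)
print(flagged)
```

Real AI-assisted tools go well beyond this, correlating anomalies across metrics and traffic conditions, but the example shows why machine-surfaced outliers save engineers from eyeballing thousands of data points.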
Conclusion
It is clear that learning an extensive list of basic to complex technologies is hardly possible. Still, performance testing and engineering is a challenging field for a number of reasons: it is subjective and complex, there may not be a single root cause, and problems often involve multiple issues.
A number of job roles contribute to performance, including system administrators, site reliability engineers, application developers, network engineers, database administrators, web administrators, and other support teams. For many of these roles, performance is only one aspect of the job, and performance analysis focuses on the role's area of responsibility: the network team checks the network, the database team checks the database, and so forth.
Companies are hiring multiple performance engineers, allowing individuals to specialize in one or more areas and provide deeper support with strong root cause analysis skills. For example, a large performance engineering team may include specialists in OS performance, client performance, network performance, cloud performance, language performance (e.g., Java, .NET), runtime performance (e.g., the JVM or CLR), performance tooling, and more. As performance engineers, we must be multi-skilled and regularly keep ourselves updated with emerging technology trends, best practices, and cutting-edge technologies to meet current market expectations and business requirements.