Maintenance Resources

DZone's Featured Maintenance Resources

Over-Architected? Maybe, Maybe Not

By Scott Sosna

An oft-heard criticism of way too many software solutions is that they're over-architected, implying that the design, abstractions, implementation, deployment, or whatever is unnecessarily complex, difficult to understand, unmaintainable, unnecessary, or wrong. Such criticisms are often thrown into the ether without context or supporting narrative, and they often stick.

So what's gained by labeling a solution as over-architected?

[Image: "Software architecture comparisons @obscuradigitalsf" by raganmd, licensed under CC BY-SA 2.0]

Often it's the ultimate get-out-of-jail card for software engineers, deflecting blame for larger-than-expected efforts and timelines without objective analysis or even understanding of the current implementation. Unfortunately, senior leadership often blindly accepts the explanation with a sympathetic "Ah, over-architected, I'm so sorry" shrug. The label is especially effective when the original engineers aren't available to fill in the blanks on requirements, decisions, implementation, etc.

Yes, the original architecture implemented may be orthogonal to your preferred architecture (the tech stack, the business assumptions, abstractions, etc.), but are you in a position to unequivocally say it's over-architected? Have your company's business goals or technical direction changed since the original implementation? What were the non-functional requirements that needed to be considered? Is there anything interesting in commit messages? Most importantly, do your concerns define over-architected or is something less sinister going on?

Hmmmm, I'm not sure!

Software Engineers Engineer Software

Shakespeare and the Internet:

To maintain, or to not maintain, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous implementation,
Or to take arms against a sea of troubles
And to start anew with a clean slate.

Hands up if you prefer maintaining existing code rather than writing new code.
Keep your hands up if that includes maintaining someone else's code. Ah, thought so.

While Hamlet's original soliloquy contemplates his potential suicide, my (poor) adaptation references software development: software engineers often go out of their way to favor the unfettered creation of new code over the long-term responsibility for someone else's code. And when given maintenance work, in whatever form it may take, engineers look for angles that allow them to revamp, restructure, rewrite, and in general make the work more interesting.

That's the correct decision if the blame shows excessive churn, the cyclomatic complexity is obnoxiously high, or the code has truly become unmaintainable. Granted, product owners and scrum masters won't be pleased, but at some point, the sins of the past can no longer be ignored. Publicly call the existing code over-architected and you may get your wish.

However, occasionally someone requests a deeper explanation that calls the reasons into question, and they turn out to be dubious because the engineer:

- Doesn't understand the current implementation, nor makes an effort to learn
- Doesn't understand business or technical requirements or other underlying assumptions
- Doesn't agree with design patterns or code structure used
- Doesn't like the code style or formatting, or even the programming language itself

And, to no one's surprise, the rewrite becomes larger than envisioned because, ironically, the engineer finally has to understand the existing solution to reimplement it: despite claims that you're not paving the cowpath, you may inevitably need to, which isn't apparent without overall comprehension. Is your organization really prepared to go API First, microservices, NoSQL, or whatever else is being added to the resume?

The timelines increase, leadership frustrations grow, and other essential work is delayed and deferred, all because the engineer insisted it was necessary. We've all seen it, many of us have been the instigators, and it's not usually a pretty story.
And the question remains: Was it truly over-architected?

Problematic Architectures

Architectures applied within software engineering disciplines are labeled in a myriad of ways: e.g., enterprise, system, application, software, cloud, and integration. To add to the confusion, organizations often define the architecture disciplines to suit their internal needs, making it difficult to clearly define each discipline's responsibilities and boundaries. It's definitely not an apples-to-apples comparison.

That said, defining problematic architectures is possible if you talk generically and don't try to define specifics within an architectural discipline. I see three basic types of potentially problematic architectures.

1. Differently-Architected

Differently-architected solutions are those where opinions differ on how the non-functional requirements are addressed within the solution. Should a solution intended for the cloud be cloud-native or cloud-agnostic? Are stability and reliability more or less important than throughput and performance? Are supporting resources selected based on cost, capabilities, or both?

Any concerns or complaints you have with the underlying architecture must be balanced against the non-functional requirements identified as important for a successful solution. It may be that you disagree with the non-functional requirements; therefore, your concerns or complaints may only be relevant when non-functional requirements are re-prioritized. The solution is valid if the non-functional requirements are met, regardless of your feelings.

Even when you are in complete agreement about the non-functional requirements, it's guaranteed that different engineers will create different solutions: synchronous vs. asynchronous, object-oriented inheritance vs. composition, functional or CRUD-based API endpoints, SQL vs. NoSQL, API First vs. MVC. It's subjective because each solution is essentially correct; it's just that your approach is different than mine.
"Hello, World!" can be implemented in thousands of different ways, each equally right or wrong, so different engineers and architects are going to approach each problem differently. Is that wrong? No. Is that over-architected? If the non-functional requirements are clearly identified and implemented, probably not. But you still likely disagree with what I designed and implemented.

2. Mis-Architected

Identifying a mis-architected solution usually requires deconstructing a flawed implementation from conception to deployment: poorly defined or undefined requirements, poor code quality, poor project execution, unrealistic timelines, and an unhelpful architecture. The problems, and the blame, are usually multi-faceted.

Setting those complexities aside, there are common characteristics of a mis-architected solution:

- Non-functional requirements aren't addressed, assuming any are identified.
- The technical skills and background necessary for successful implementation don't exist within the organization.
- The architecture requires building components outside the organization's core competency, especially when suitable components already exist.
- The deployed solution is unstable and requires regular attention to avoid outages.
- Maintenance and extension rely on individuals who are deemed irreplaceable.

You likely recognize one of those if you've ever participated in a train wreck project. You might also realize that what you thought was mis-architected was actually non-architected. Most importantly, these projects and their resultant solutions are often unsalvageable, and everything involved with them was a waste of time.

3. Over-Architected

Is any solution over-architected? Probably.

Is this light switch overkill? Perhaps for most of us, but the Rube Goldbergs and Makers in this world would beg to differ.

The following experience may represent an over-architected solution; it may also just be mis-architected, and it is definitely differently-architected.
Problem Statement

In my days as an independent software consultant, I was contracted as a backfill for a just-departed employee. The ex-employee had architected, designed, and (mostly) implemented a custom framework for building the company's first true web application: s/he left sample implementations and little (no?) documentation (design, usage, etc.), and the remaining staff had to try and pick up the pieces.

[For the Millennials reading this, remember that the early days of web development had no real standards or best practices, open source was in its infancy, and it was the wild west as everyone searched for answers. Underpowered users' computers struggled with simple browser-side scripting (pre-DOM), and each vendor's browser liberally interpreted the HTML standard. It was when Internet Explorer started its reign of terror (and yet somehow still exists today). So completely different than today.]

My assignment: own the framework, learn its secrets, and define a roadmap for developing the app.

Understand and Internalize

Conceptually simple from 100K feet/30.5K meters: server-side generation of HTML pages, orchestrated by a Java Servlet app (the framework) that handles requests, responses, navigation, etc. At 50K feet/15K meters, it became less simple. At 25K feet/7600 meters, it became downright scary.

The concerns increased the more I dug:

- Pages were tightly coupled, and seemingly small changes quickly cascaded to neighboring pages.
- Pages directly generated HTML, making it difficult to ensure consistency across the app.
- Changing default framework behavior relied on finding the corresponding hooks, which were often numerous and confusing.
- Orchestration worked in the original servlet engine (Sun Java Web Server) but wouldn't if changed (we were starting to look at Tomcat for performance).
- Local development required regular syncing of other engineers' code, made difficult with the version control of the day (CVS, Subversion, Visual SourceSafe, etc.).
I've forgotten other horrors (er, concerns), but the barriers were daunting, and the chance of success dimmed. Hard decisions needed to be made.

Decision Time

My conclusion: the framework was fit for purpose (barely) but not fit for development; any attempt to continue would overwhelm the remaining engineers and likely never be completed. The manager, to her credit, agreed and decided to cut her losses.

Long story short: a simpler, more focused framework was designed, and within three weeks, engineers could start app development on a working model, successfully demoing our progress after two months. Ultimately, it was very successful and was used for multiple apps before being retired.

The Verdict

Reasons to consider the original framework over-architected:

- Missing non-functional requirements: No guardrails to rein in design and implementation
- Ego-driven design: Attempt to show how smart you are by including everything but the kitchen sink
- Overly complex: Difficult to implement, maintain, and support
- Un-differentiator: Architecture functionality is not a competitive differentiator and substantially extended time-to-market

You could argue mis-architected and perhaps differently-architected, but that's semantics. Regardless, we did our due diligence before changing tack, objectively justifying the decision instead of just saying over-architected.

Final Thoughts

What passes as architecture for a software solution varies widely, based on any number of external factors: business vs. software shop, experience and skill set of technical staff, technologies available, organizational maturity, blah blah blah. It's expected.

Too often, an implementation is quickly dismissed without understanding the context in which the work was done. Legitimate reasons may justify a partial or complete re-implementation, but the push for one may also be self-serving for engineers. As responsible software engineers, we should justify work efforts that most benefit the organization, not software for software's sake.
This does not imply that nothing but features and technology be damned, but does imply that we should be able to justify the technical work proposed. Investigate, document, narrate, justify. And don't just say, "It's over-architected!"

Protect Your Alerts: The Importance of Independent Incident Alert Management

By Judit Sharon

In a world where IT infrastructure underpins countless businesses and organizations, maintaining operational integrity during critical failures or outages is non-negotiable. A key element in achieving this is ensuring that your incident alert management system remains active and accessible under all circumstances.

Unfortunately, a significant vulnerability can arise when the incident alert management system shares the same cloud provider as your primary services. If that cloud provider experiences an outage, your alert management system could become unavailable just when it is needed the most. This could lead to delayed responses, prolonged downtimes, and potentially catastrophic consequences for your business operations.

Understanding the Role of Redundancy in Incident Management

Redundancy is a fundamental principle in IT management, especially when it comes to ensuring continuous operations. Consider a scenario where your services are hosted on a major cloud provider like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. While these platforms are indeed robust and reliable, they are not infallible. They can fail, and have failed, because of factors such as Distributed Denial of Service (DDoS) attacks, major hardware failures, software bugs, or even human error resulting in misconfigurations.

In such situations, if your incident alert management system is also hosted on the same cloud, the very tools you rely on to notify you of the outage might be compromised as well. This could leave your IT team in the dark, unaware of the issues, and unable to respond promptly.

A real-world example occurred in 2024, around the time of the notable CrowdStrike incident, when a Microsoft Azure outage triggered by a DDoS attack led to delays in critical alerts and response efforts. The repercussions of this incident highlight the dangers of putting all your eggs in one basket.
If the incident alert management system had been hosted on an independent platform, the response to the incident could have been more timely and effective, potentially mitigating the overall impact.

The Benefits of Hosting Incident Management Separately

The decision to host your incident alert management system separately from your primary cloud provider is more than just a precaution; it's a strategic move that can greatly enhance your organization's operational resilience. Below are the key benefits of maintaining a separate incident management system:

1. Increased Reliability

By hosting your incident alert management system on a different cloud provider or in a redundant hosting facility, you ensure that it remains operational even if your primary cloud provider goes down. This independent setup significantly increases the reliability of your alerting system, ensuring that your team can always be informed of critical issues.

2. Faster Response Times

In the event of an outage, every second counts. With a separate alert management system, notifications are delivered promptly, enabling your on-call team to take immediate action. This reduces the time between incident detection and response, minimizing potential damage.

3. Improved Disaster Recovery

Redundancy is a cornerstone of an effective disaster recovery strategy. When your incident management system is hosted independently, you create a safety net that can catch failures before they escalate into full-blown crises.

4. Reduced Downtime

The ultimate goal of incident management is to minimize downtime and its associated impacts. By receiving timely alerts and having the tools to respond without delay, your organization can reduce the duration of outages. This not only preserves your business operations but also protects your reputation by ensuring that your customers experience minimal disruption.
Conclusion: Building Resilience Through Decoupled Incident Management

While cloud providers offer a powerful and flexible infrastructure for hosting services, they are not invincible. No system is completely immune to failures, and when outages occur, the consequences can be far-reaching. By decoupling your incident alert management from your primary cloud environment, you create a layer of protection that ensures your IT team remains informed and capable of responding to issues, even in the most challenging circumstances.

This approach not only enhances your organization's resilience but also demonstrates a proactive commitment to maintaining uptime and reliability. In a world where downtime can have serious financial and reputational costs, having a robust and independent incident alert management system is not just a good idea; it's essential. Protect your alerts by ensuring that your incident management system is always ready to do its job, even when the unexpected happens.

Safeguarding Democracy in the Digital Age: Insights from Day 1 at Black Hat 2024 and Las Vegas Officials

By Tom Smith

Digital Garden

By Anton Belyaev

Enhance IaC Security With Mend Scans

By Vidyasagar (Sarath Chandra) Machupalli FBCS

Poetry Explained: Perils of the Unpinned Dependencies

Imagine this: You are days away from a release, and your Python codebase is versioned, tagged, and marked as a Release Candidate. Hours from the release, a BUG appears out of nowhere! You set up a War Room and dig through the logs and the code while the team struggles… and then it hits you: A DEPENDENCY UPDATE!

A seemingly minor version bump in a library your code depends on introduced a breaking change, throwing your carefully crafted codebase into disarray. This problem compounds itself when your project relies on multiple libraries which in turn have their own sets of dependencies. In such cases, a single incompatible update can trigger an avalanche of conflicts leading to "Dependency Hell."

Introducing Poetry: Stability and Control

Poetry tackles dependency management issues with powerful features like version pinning, conflict resolution, environment-specific dependency declarations, and grouped dependencies. At the crux of Poetry's dependency management is a pyproject.toml file, which acts as a central hub to manage the project metadata and dependencies. Once you set this file up, Poetry helps you lock the dependencies and their specific versions using a poetry.lock file.
Advantages of Poetry

- Dependency resolution: Poetry resolves dependencies with a strict emphasis on compatibility between packages, avoiding conflicts.
- Locking: The poetry.lock file captures the exact state of your dependency graph, guaranteeing identical builds across multiple environments and infrastructures.
- Constrained updates: Using the poetry update command, you can update your dependencies while respecting the version constraints.

Getting Started With Poetry

Installation

Install Poetry across your system or inside a specific virtual environment using the command:

Shell
    pip install poetry

Initialize a Poetry Project

In any folder, you can initialize a Poetry environment by running the following command:

Shell
    poetry init

Here I started on a fresh project, went through the metadata configuration, and added main project dependencies. Note how you can let Poetry select the latest compatible versions or specify the required versions for each package.

Adding Dependencies

Now, we add development environment dependencies. See how Poetry groups these dependencies under a `dev` group later in the pyproject.toml. Once we confirm generation, here is the generated pyproject.toml that drives the project configuration and various dependencies.
TOML
    [tool.poetry]
    name = "my-smaple-app"
    version = "0.0.1"
    description = "Sample app to show poetry features"
    authors = ["princebose <boseprince11@gmail.com>"]
    license = "None"
    readme = "README.md"

    [tool.poetry.dependencies]
    python = "^3.11"
    requests = "^2.23.0"
    Flask = "^3.0.3"

    [tool.poetry.group.dev.dependencies]
    pytest = "^8.2.2"
    black = "^24.4.2"
    ruff = "^0.5.1"

    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"

Installing the Dependencies

Now, to install and initialize the Python environment on your machine, run the following command:

Shell
    poetry install

Once installed, Poetry creates a poetry.lock file that locks the exact dependency state and ensures identical environments wherever you install. You can inspect the poetry.lock file to find each dependency listed in the pyproject.toml along with the exact version that was resolved for it.

Adding More Dependencies

While developing, we are always implementing new features that may require additional dependencies. Can Poetry add dependencies after the pyproject.toml has been created and the versions locked? Absolutely.

Here, I add one main dependency and one dev-only dependency using the following commands:

Shell
    poetry add <package-name>
    poetry add --group <group-name> <package-name>

Final Result

You now have a dependency list with explicit version constraints, backed by a poetry.lock file that pins the exact versions, which greatly reduces tech debt.
TOML
    [tool.poetry]
    name = "my-smaple-app"
    version = "0.0.1"
    description = "Sample app to show poetry features"
    authors = ["princebose <boseprince11@gmail.com>"]
    license = "None"
    readme = "README.md"

    [tool.poetry.dependencies]
    python = "^3.11"
    requests = "^2.23.0"
    Flask = "^3.0.3"
    fastapi = "^0.111.0"

    [tool.poetry.group.dev.dependencies]
    pytest = "^8.2.2"
    black = "^24.4.2"
    ruff = "^0.5.1"
    isort = "^5.13.2"

    [build-system]
    requires = ["poetry-core"]
    build-backend = "poetry.core.masonry.api"

Advanced Features

Poetry offers a range of advanced features to efficiently manage complex dependency scenarios.

- Private/external repository: If you create and host your own Python packages in a personal or enterprise repository/registry, you can link a particular dependency to be pulled in from an external repository (e.g., Artifact Registry, JFrog, GitHub, etc.).
- Plugins: Poetry offers the ability to extend its functionality to suit your specific needs using plugins. Think of these as custom-defined shortcuts within Poetry.

Conclusion

By incorporating version pinning in your development lifecycle, you can avoid the common dependency management pitfalls that we come across while working with open-source packages, including, but not limited to:

- Automatic updates that backfire: As convenient as they sound, automatic updates can sometimes introduce breaking changes that cause chaos in your codebase.
- Minor bumps that cause major issues: Seemingly minor changes in a dependency that turn out to be breaking can range from a renamed function to a revamped API.
- Security vulnerabilities: Consumers often avoid jumping to the latest versions, since new releases may carry unknown security exploits.
- Technical debt: Avoid the headaches of unexpected breaking changes and dependency conflicts, thereby saving valuable development time and resources.
- Collaboration: Identical dependency graphs ensure everyone on the team works with the same, predictable codebase.
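
The caret constraints used in the pyproject.toml (for example `requests = "^2.23.0"`) allow a range of versions, while poetry.lock records the single version that was actually resolved. The following is a minimal sketch of how a caret constraint expands into a range, assuming plain MAJOR.MINOR.PATCH versions with no pre-release tags. The helper names `parse_version`, `caret_bounds`, and `satisfies` are hypothetical and are not part of Poetry's API.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a full 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    major, minor, patch = (int(p) for p in parts)
    return (major, minor, patch)


def caret_bounds(constraint: str) -> tuple[tuple[int, int, int], tuple[int, int, int]]:
    """Return (inclusive lower bound, exclusive upper bound) for '^X.Y.Z'."""
    if not constraint.startswith("^"):
        raise ValueError(f"not a caret constraint: {constraint!r}")
    lower = parse_version(constraint[1:])
    major, minor, patch = lower
    # The caret keeps the left-most non-zero component fixed.
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return lower, upper


def satisfies(version: str, constraint: str) -> bool:
    """True when `version` falls inside the range the caret constraint allows."""
    lower, upper = caret_bounds(constraint)
    return lower <= parse_version(version) < upper


if __name__ == "__main__":
    # Illustrative candidates only; not taken from any real lock file.
    for candidate in ("2.23.0", "2.31.0", "3.0.0"):
        print(candidate, satisfies(candidate, "^2.23.0"))
```

Under these assumptions, `^2.23.0` accepts any 2.x release at or above 2.23.0 but rejects 3.0.0, which is why a committed poetry.lock is still what guarantees identical installs across machines.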

By Prince Bose

Terraform Tips for Efficient Infrastructure Management

Terraform is a popular tool for defining and provisioning infrastructure as code (IaC), improving consistency, repeatability, and version control. But you need to know how to use it properly to extract maximum value from it as an infrastructure management tool. In this article, we explore best practices for using Terraform to streamline your infrastructure management effectively and avoid common pitfalls.

1. State Management

The state file is a vital element of Terraform that tracks the resources you've created. It maps Terraform configurations to real-world resources, allowing Terraform to assess and manage what's there. The state file contains sensitive information including resource configurations and metadata, so you need to prioritize its security.

Use Remote State Storage

A remote backend gives multiple users a central location to access the same state file and enables state locking to avoid concurrent modifications. As well as preventing conflicts, it ensures everyone is working with the latest infrastructure state.

The Terraform configuration for remote state storage in an S3 bucket would look like this:

HCL
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "5.50.0"
        }
      }

      backend "s3" {
        bucket  = "your-bucket-name"
        key     = "state/terraform.tfstate"
        region  = "us-west-2"
        encrypt = true
      }
    }

Secure the State Files

To keep Terraform state files secure, store them in a remote backend like Azure Blob Storage or Amazon S3 with encryption enabled and strict access controls. For files stored in S3 buckets, for example, you can encrypt them with keys managed by AWS Key Management Service (KMS). Never store state files in version control systems like Git because they may contain sensitive data or other Terraform secret values, such as passwords or API keys, in plain text.

2. Variables

It is advisable to use a separate .tfvars file to store Terraform variable values instead of hardcoding them in the Terraform configuration files or passing them via command-line arguments. Sensitive values, such as passwords or API keys, should be stored securely outside the Terraform configuration file to reduce the risk of unintended exposure.

Naming Conventions

Use a consistent naming convention when declaring variables to prevent naming conflicts. Variable names should describe their purpose or value. For example, aws_region or instance_type are better than x or y. If you're working on an existing project, follow the naming conventions in place to maintain consistency throughout the codebase.

3. Modules

Terraform modules simplify and standardize your infrastructure configurations. Using them, you can minimize duplication and enhance maintainability by encapsulating and reusing configurations.

Create Reusable Modules

To ensure your Terraform modules are reusable and easy to understand and maintain, you should store each one in its own directory and follow a logical structure. Here is an example of a well-structured module directory:

Text
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── README.md

Use semantic versioning for your modules to convey the compatibility and stability of changes.

4. Version Control

Version Locking

Version locking keeps your infrastructure environments consistent. By locking provider versions, you avoid unexpected changes and maintain predictable infrastructure behavior.

Lock Module Versions

When using modules, specify version constraints to ensure you are using the correct version. This practice helps prevent issues caused by breaking changes in module updates.
You can specify version constraints in your module source:

HCL
    module "ec2_instance" {
      source  = "terraform-aws-modules/ec2-instance/aws"
      version = "5.0.0"

      # other configurations
    }

Lock Provider Versions

Similarly, locking provider versions ties you to a specific version of a provider, preventing new releases from delivering surprises. Specify the provider version range in the required_providers block of your Terraform configuration (a version argument inside the provider block itself is deprecated):

HCL
    terraform {
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }

    provider "aws" {
      region = "us-west-2"
    }

5. Code Organization

Maintain Clean and Readable Code

Clean, readable code is easier to maintain, debug, and extend. Proper code organization makes it easier for teams to collaborate and for new members to understand your infrastructure configurations.

Code Organization

Organize your Terraform projects by separating environments (e.g., development, staging, production) into different directories. Use meaningful naming conventions for resources and variables to make your code self-explanatory.

Example Project Structure

Here is an example of a well-organized project structure:

Text
    .
    ├── environments
    │   ├── dev
    │   │   ├── main.tf
    │   │   ├── outputs.tf
    │   │   ├── provider.tf
    │   │   ├── terraform.tf
    │   │   ├── terraform.tfvars
    │   │   └── variables.tf
    │   └── prod
    │       ├── main.tf
    │       ├── outputs.tf
    │       ├── provider.tf
    │       ├── terraform.tf
    │       ├── terraform.tfvars
    │       └── variables.tf
    └── modules
        ├── ec2
        │   ├── main.tf
        │   ├── outputs.tf
        │   └── variables.tf
        └── network
            ├── main.tf
            ├── outputs.tf
            └── variables.tf

6. Automation

Efficient and reliable infrastructure management centers on automating wherever possible.

Create CI/CD Pipelines

To automate your infrastructure workflows, integrate Terraform with continuous integration/continuous deployment (CI/CD) pipelines. This minimizes the errors and extra effort created by manual work and gives you the flexibility to scale your infrastructure up or down as demand fluctuates. You can automate your Terraform workflows with CI/CD tools and platforms like Jenkins, GitLab CI, and GitHub Actions.
Before you decide on one specific solution for your workflows, you may need to experiment with trials of various options to get the process right.

Wrapping Up

Building secure, scalable, maintainable infrastructure is much easier when you follow best practices in Terraform. Take your infrastructure management processes to new levels by securely managing your state files, using reusable modules, locking versions, organizing your code, and integrating automation.

Implementing these practices may require a change of mindset and some initial effort, but once they are in place, your IaC management will be more efficient and reliable. Ultimately, successful Terraform is all about consistency, security, and continuous improvement.
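
The CI/CD automation described in the Automation section could be sketched as a small Python wrapper that a Jenkins, GitLab CI, or GitHub Actions job calls for each environment directory. This is only an illustration under stated assumptions: the function names `ci_commands` and `run_pipeline`, the plan file name `tfplan`, and the `environments/dev` path are hypothetical, and the step order is one reasonable choice rather than a prescribed workflow.

```python
import subprocess
from pathlib import Path


def ci_commands(plan_file: str = "tfplan") -> list[list[str]]:
    """Return the Terraform commands a CI job would run, in order."""
    return [
        # Fail the build when files are not canonically formatted.
        ["terraform", "fmt", "-check", "-recursive"],
        # Download providers and modules without prompting for input.
        ["terraform", "init", "-input=false"],
        # Check that the configuration is internally consistent.
        ["terraform", "validate"],
        # Save the plan so a later, approved step can apply exactly this plan.
        ["terraform", "plan", "-input=false", f"-out={plan_file}"],
    ]


def run_pipeline(env_dir: Path, runner=subprocess.run) -> None:
    """Run each step inside env_dir, stopping at the first failure.

    `runner` is injectable so the sequence can be exercised without
    Terraform installed; by default it is subprocess.run with check=True,
    which raises CalledProcessError on a non-zero exit code.
    """
    for command in ci_commands():
        runner(command, cwd=env_dir, check=True)


if __name__ == "__main__":
    # Hypothetical environment directory, matching the example layout.
    run_pipeline(Path("environments/dev"))
```

Keeping the command list separate from the runner makes the sequence easy to review and test, and the saved plan file lets a separate approval step apply exactly what was reviewed.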

By Mariusz Michalowski

How To Reduce MTTR

As a Site Reliability Engineer, one of the key metrics that I use to track the effectiveness of incident management is Mean Time To Recover (MTTR). According to Wikipedia, MTTR is the average time that a service or system takes to recover from a failure. Achieving a low MTTR is key to meeting service level objectives and, in turn, the service level agreements of any critical production service.

10 Things That Can Help Reduce the Mean Time to Recovery (MTTR)

1. Clearly Defined SLIs

Service level indicators, or SLIs, are the key indicators that measure the health of your service. A few examples of SLIs are error rate, latency, and throughput.

2. Actionable Alerts Based on SLIs

The alert strategy should include improving the signal-to-noise ratio of the alerts. The goal with alerting is that every alert your team gets should be actionable. Sending too many alerts causes alert fatigue and risks the on-call person ignoring alerts that indicate real issues with the service.

3. Troubleshooting Guides Associated With Alerts

Every alert should have a clearly defined troubleshooting guide (TSG) on how to triage and mitigate the issue the alert identifies. A good methodology to use while writing these troubleshooting guides is the USE method, suggested by Brendan Gregg in his book, "Systems Performance." USE stands for Utilization, Saturation, and Errors.

4. Practice Troubleshooting Guides

Practicing troubleshooting guides periodically will help mitigate incidents when they occur. It will also help identify gaps in the TSGs, since services evolve over time. A good time to practice troubleshooting guides is when a new team member joins the team, because they can give a fresh perspective on the TSG. This will reduce assumptions about the knowledge of the system.

5. Usable Dashboards

The observability strategy should include creating easy-to-use dashboards.
The dashboards should have panels that show the key metrics of the service and the health of dependent services, both upstream and downstream. Important metrics to include are the golden signals suggested by the Google SRE book: latency, traffic (throughput), errors, and saturation.

6. Automated Actions To Mitigate Issues

Automating certain actions based on metrics and events is key to reducing MTTR. An example of this is taking servers out of rotation if packet loss is observed from them. This will help reduce the impact on user experience and reduce MTTR.

7. Failover Rehearsals

In the case of multi-data center architectures, it is crucial to have failover plans defined so that you can recover quickly from an outage of a specific data center. Practicing these failover scenarios periodically will help you execute them quickly during an outage. It will also help identify gaps in the failover plans and give you the chance to update and fix them.

8. Automated Failovers

Once the failover plans are defined, implemented, and practiced, the next step is to automate these failover scenarios based on the health checks of the service in a given data center. This will help mitigate issues faster and thus reduce the MTTR.

9. Change Management Process

Changes to production systems are a major cause of outages. It is important to have a well-thought-out change management process in place. Key elements of the change management process include clearly defined checklists, change review and approval procedures, automated deployment pipelines with built-in monitoring, and the ability to quickly roll back changes if any issues are observed.

10. Easy To Identify Change List and Automated Rollbacks

Many changes can be in flight continuously in distributed systems where services are designed as microservices.
Having a central system where one can easily identify which changes have been done during a given period of time will help to identify if a specific change has caused an outage and is thus easy to roll back. Conclusion In this article, I have discussed 10 things that can help reduce the Mean Time To Recovery of any critical production service. This is not an exhaustive list, but a list of best practices based on my years of experience working as a Site Reliability Engineer on services such as TikTok, Microsoft Teams, Xbox, and Microsoft Dynamics.
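As a rough illustration of the MTTR definition given at the start of this article, the sketch below averages recovery times over a handful of incidents. The input format (one line per incident, with detection and recovery times as epoch seconds), the function name mttr_minutes, and the sample numbers are all assumptions made for this example; they do not come from any particular incident-management tool.

```shell
#!/bin/sh
# Hypothetical helper: reads "<detected_epoch_s> <recovered_epoch_s>" lines
# from stdin and prints the mean time to recover, in minutes.
mttr_minutes() {
  awk '
    NF == 2 && $2 >= $1 { total += $2 - $1; count++ }   # skip malformed lines
    END {
      if (count == 0) { print "no incidents" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", total / count / 60
    }'
}

# Made-up sample: three incidents lasting 30, 10, and 60 minutes.
printf '%s\n' "1000 2800" "5000 5600" "9000 12600" | mttr_minutes
```

Tracking this number per service and per quarter is one way to check whether the practices listed above are actually shortening recovery.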

By Krishna Vinnakota

Managing Architectural Tech Debt

When I think about technical debt, I still remember the first application I created that made me realize the consequences of an unsuitable architecture. It happened back in the late 1990s when I was first getting started as a consultant.

The client had requested the use of the Lotus Notes platform to build a procurement system for their customers. Using the Lotus Notes client and a custom application, end-users could make requests that would be tracked by the application and fulfilled by the product owner’s team. In theory, it was a really cool idea, especially since web-developed applications were not prevalent and everyone used Lotus Notes on a daily basis.

The core problem was that the data was very relational in design, and Lotus Notes was not a relational database. The solution’s design required schema management within every Lotus Notes document and leaned on a series of multi-value fields to simulate the relationships between data attributes. It was a mess.

A great deal of logic in the Lotus Notes application would not have been required if a better platform had been recommended. The source code was complicated to support. Enhancements to the data structure resulted in major refactoring of the underlying code, not to mention running server-based jobs to convert the existing data. Don’t get me started on the effort behind report creation.

Since I was early in my career, I focused on providing the solution the client wanted rather than offering a better one. This was certainly a lesson I learned early, but in the years since that project, I’ve come to realize that the consequence of architectural technical debt is an unfortunate reality we all face.

Let’s explore the concept of architecture tech debt a little more at a macro level.
Architectural Tech Debt (ATD)

The Architectural Technical Debt (ATD) Library at Carnegie Mellon University provides the following definition of ATD:

"Architectural technical debt is a design or construction approach that's expedient in the short term, but that creates a technical context in which the same work requires architectural rework and costs more to do later than it would cost to do now (including increased cost over time)."

In the “Quick Answer: How to Manage Architecture Technical Debt” (published 09/22/2023), Gartner Group defines ATD as follows:

"Architecture technical debt is that type of technical debt that is caused by architectural drift, suboptimal architectural decisions, violations of defined target product architecture and established industry architectural best practices, and architecture trade-offs made for faster software delivery."

In both cases, benefits that often yield short-term celebrations can be met with long-term challenges. This is similar to my Lotus Notes example mentioned in the introduction.

To further complicate matters, tooling to help identify and manage tech debt for software architecture has been missing in comparison to the other aspects of software development. For code quality, observability, and SCA, proven tooling exists with products like SonarQube, Datadog, New Relic, GitHub, and Snyk. However, the software architecture segment has lagged behind without any proven solutions.

This is unfortunate, given the fact that ATD is consistently the largest, and most damaging, type of technical debt, as found in the “Measure It? Manage It? Ignore It? Software Practitioners and Technical Debt” 2015 study published by Carnegie Mellon. The following illustration summarizes Figure 4 from that report, concluding that bad architecture choices were the clear leader in sources of technical debt.
If not managed, ATD can continue to grow over time at an increasing rate, as demonstrated in this simple illustration:

Without mitigation, architecture debt will eventually reach a breaking point for the underlying solution being measured.

Managing ATD

Before we can manage ATD, we must first understand the problem. Desmond Tutu once wisely said that “There is only one way to eat an elephant: a bite at a time.”

The shift-left approach embraces the concept of moving a given aspect closer to the beginning of a lifecycle rather than the end. This concept gained popularity with shift-left for testing, where the test phase became a part of the development process and not a separate event to be completed after development was finished.

Shift-left can be implemented in two different ways in managing ATD:

- Shift-left for resiliency: Identify sources that have an impact on resiliency, and then fix them before they manifest in performance.
- Shift-left for security: Detect and mitigate security issues during the development lifecycle.

Just like shift-left for testing, a prioritized focus on resilience and security during the development phase will reduce the potential for unexpected incidents.

Architectural Observability

Architectural observability gives engineering teams the ability to incrementally address architectural drift within their services at a macro level. In fact, the Wall Street Journal reported the cost to fix technical debt at $1.52 trillion earlier this year in the article “The Invisible $1.52 Trillion Problem: Clunky Old Software.”

To be successful, engineering leadership must be in full alignment with the following organizational objectives:

- Resiliency: To recover swiftly from unexpected incidents.
- Scalability: To scale appropriately with customer demand.
- Velocity: To deliver features and enhancements in line with product expectations.
- Cloud Suitability: To transform legacy solutions into efficient cloud-native service offerings.
I recently discovered vFunction’s AI-driven architectural observability platform, which is focused on the following deliverables:

- Discover the real architecture of solutions via static and dynamic analysis.
- Prevent architecture drift via real-time views of how services are evolving.
- Increase the resiliency of applications via the elimination of unnecessary dependencies and improvements between application domains and their associated resources.
- Manage and remediate tech debt via AI-driven observability.

Additionally, the vFunction platform provides the side benefit of a migration path from monoliths to cloud-native solutions. Once teams have modernized their platforms, they can continuously observe them for ongoing drift. If companies already have microservices, they can use vFunction to detect complexity in distributed applications and address dependencies that impact resiliency and scalability. In either case, once implemented, engineering teams can mitigate ATD well before reaching the breaking point.

In the illustration above, engineering teams are able to mitigate technical debt as a part of each release, due to the implementation of the vFunction platform and an underlying shift-left approach.

Conclusion

My readers may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional:

“Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.”
- J. Vester

The vFunction platform adheres to my mission statement by helping engineering teams employ a shift-left approach to the resiliency and security of their services at a macro level. This is an important distinction because, without such tooling, teams are likely to mitigate at a micro level, resolving tech debt that doesn’t really matter from an organizational perspective.
When I think back to that application that made me realize the challenges with tech debt, I can’t help but think about how that solution yielded more issues than it did benefits with each feature that was introduced. Certainly, the use of shift-left for resiliency alone would have helped surface issues with the underlying architecture at a point where the cost to consider alternatives would be feasible. If you are interested in learning more about the vFunction solution, you can read more about them here. Have a really great day!

By John Vester

CORE

Ansible Code Scanning and Quality Checks With SonarQube

You have probably heard of SonarQube as a code scanning and code quality check tool. SonarQube doesn't support Ansible by default; a plugin needs to be set up to scan Ansible playbooks or roles. In this article, you will learn how to set up and use SonarQube on your Ansible (YAML) code for linting and code analysis. This article uses the Community Edition of SonarQube.

What Is Ansible?

As explained in previous articles on Ansible (Ansible Beyond Automation and Automation Ansible AI), Ansible is a simple IT automation tool that helps you provision infrastructure, install software, and support application automation through advanced workflows.

Ansible playbooks are written in YAML format and define a series of tasks to be executed on remote hosts. Playbooks offer a clear, human-readable way to describe complex automation workflows. Using playbooks, you define the required dependencies and desired state for your application.

What Is SonarQube?

SonarQube is a widely used open-source platform for continuous code quality inspection and analysis. It is designed to help developers and teams identify and address potential issues in their codebase, such as bugs, code smells, security vulnerabilities, and technical debt. SonarQube supports a wide range of programming languages, including Java, C#, C/C++, Python, JavaScript, and many others.

The Community Edition of SonarQube can perform static code analysis for 19 languages, such as Terraform, CloudFormation, Docker, Ruby, Kotlin, and Go.

Figure: Comparison of SonarQube Editions

Code Scanning and Analysis

SonarQube performs static code analysis, which means it examines the source code without executing it. This analysis is performed by parsing the code and applying a set of predefined rules and patterns to identify potential issues. SonarQube covers various aspects of code quality, including:

- Code smells: SonarQube can detect code smells, which are indicators of potential maintainability issues or design flaws in the codebase.
Examples include duplicated code, complex methods, and excessive coupling.
- Bugs: SonarQube can identify potential bugs in the code, such as null pointer dereferences, resource leaks, and other common programming errors.
- Security vulnerabilities: SonarQube can detect security vulnerabilities in the code, such as SQL injection, cross-site scripting (XSS), and other security flaws.
- Technical debt: SonarQube can estimate the technical debt of a codebase, which represents the effort required to fix identified issues and bring the code up to a desired level of quality.

Importance of Code Scanning and Analysis

Code scanning and analysis with SonarQube offer several benefits to development teams:

- Improved code quality: By identifying and addressing issues early in the development process, teams can improve the overall quality of their codebase, reducing the likelihood of bugs and making the code more maintainable.
- Increased productivity: By automating the code analysis process, SonarQube saves developers time and effort that would otherwise be spent manually reviewing code.
- Consistent code standards: SonarQube can enforce coding standards and best practices across the entire codebase, ensuring consistency and adherence to established guidelines.
- Security awareness: By detecting security vulnerabilities early, teams can address them before they become exploitable in production environments, reducing the risk of security breaches.
- Technical debt management: SonarQube's technical debt estimation helps teams prioritize and manage the effort required to address identified issues, ensuring that the codebase remains maintainable and extensible.

Perform Static Application Security Testing

SonarQube is a leading tool for performing SAST, offering comprehensive capabilities to enhance code security and quality. Static Application Security Testing (SAST) is a method of security testing that analyzes source code to identify vulnerabilities and security flaws.
Unlike Dynamic Application Security Testing (DAST), which tests running applications, SAST examines the code itself, making it a form of white-box testing.

SonarQube integrates seamlessly with popular development tools and continuous integration/continuous deployment (CI/CD) pipelines, making it easy to incorporate code analysis into the development workflow. With its comprehensive analysis capabilities and support for various programming languages, SonarQube has become an essential tool for development teams seeking to improve code quality, maintain a secure and maintainable codebase, and deliver high-quality software products.

Install SonarQube on Your Local Machine

You can set it up using a zip file, or you can spin up a Docker container using one of SonarQube's Docker images.

1. Download and install Java 17 from Eclipse Temurin Latest Releases. If you are using macOS, you can install it using Homebrew with the command below.

    brew install --cask temurin@17

2. Download the SonarQube Community Edition zip file.

3. As mentioned in the SonarQube documentation, as a non-root user, unzip the downloaded SonarQube Community Edition zip file to C:\sonarqube on Windows or to /opt/sonarqube on Linux / macOS. On Linux / macOS, you may have to create the folder as root:

    sudo mkdir -p /opt/sonarqube

4. The folder structure in /opt/sonarqube should look similar to the image below. The key folders that you will use in this article are bin and extensions/plugins.

Figure: SonarQube Community Edition folder structure

5. To start the SonarQube server, change to the directory where you unzipped the Community Edition and run the commands below for your operating system.
For example, if you are running on macOS, change the directory to /opt/sonarqube/bin/macosx-universal-64.

    # On Windows, execute:
    C:\sonarqube\bin\windows-x86-64\StartSonar.bat

    # On other operating systems, as a non-root user execute:
    /opt/sonarqube/bin/<OS>/sonar.sh console

Figure: bin folder structure

6. On macOS, this is how it looks when you run the server with Java 17 set up:

    # Change to the directory and execute
    cd /opt/sonarqube/bin/macosx-universal-64
    ./sonar.sh console

Figure: SonarQube server up and running

If you are using a Docker image of the Community Edition from Docker Hub, run the command below:

    docker run -d --name sonarqube -e SONAR_ES_BOOTSTRAP_CHECKS_DISABLE=true -p 9000:9000 sonarqube:latest

7. You can access the SonarQube server at http://localhost:9000. The initial system administrator username is admin and the password is admin. You will be asked to reset the password once logged in.

Figure: SonarQube console

SonarQube Projects

A SonarQube project represents a codebase that you want to analyze. Each project is identified by a unique key and can be configured with various settings, such as the programming languages used, the source code directories, and the quality gates (thresholds for code quality metrics).

You can create a new project in SonarQube through the web interface or automatically during the first analysis of your codebase. When creating a project manually, you need to provide a project key and other details like the project name and visibility settings.

Scanner CLI for SonarQube

A scanner must be set up to run code analysis on SonarQube. Project configuration is read from the file sonar-project.properties or passed on the command line.

The SonarScanner CLI (Command Line Interface) is a tool that allows you to analyze your codebase from the command line.
It is the recommended scanner when there is no specific scanner available for your build system or when you want to run the analysis outside of your build process.

Download and Configure SonarScanner CLI

1. Based on the operating system on which you are running your SonarQube server, download the sonar-scanner from this link.
2. Unzip or expand the downloaded file into the directory of your choice. Let's refer to it as <INSTALL_DIRECTORY> in the next steps.
3. Update the global settings to point to your SonarQube server by editing <INSTALL_DIRECTORY>/conf/sonar-scanner.properties:

    # Configure here general information about the environment, such as the server connection details for example
    # No information about specific project should appear here
    #----- SonarQube server URL (default to SonarCloud)
    sonar.host.url=http://localhost:9000/
    #sonar.scanner.proxyHost=myproxy.mycompany.com
    #sonar.scanner.proxyPort=8002

4. Add the <INSTALL_DIRECTORY>/bin directory to your path. If you are using macOS or Linux, add it to your ~/.bashrc or ~/.zshrc and source the file:

    source ~/.bashrc

Set Up the Ansible Plugin

Before you set up the SonarQube plugin for Ansible, install ansible-lint. It is distributed as a Python package, so install it with pip:

    pip install ansible-lint

On macOS, if you have Homebrew installed, use this command:

    brew install ansible-lint

To install and set up the SonarQube plugin for Ansible, follow the instructions here:

1. Download the YAML and Ansible SonarQube plugins.
2. Copy them into the extensions/plugins directory of SonarQube and restart SonarQube.

    ├── README.txt
    ├── sonar-ansible-plugin-2.5.1.jar
    └── sonar-yaml-plugin-1.9.1.jar

3. Log into the SonarQube Server console.
4. Click on Quality Profiles to create a new quality profile for YAML.

Figure: Quality Profiles

5. Click Create.
6. Select Copy from an existing quality profile, fill in the details below, and click Create.
   - Language: YAML
   - Parent: YAML Analyzer (Built-in)
   - Name: ansible-scan

Figure: New quality profile

7. Activate the Ansible rules on the ansible-scan quality profile by clicking on the menu icon and selecting Activate More Rules.

Figure: Activate more rules for Ansible

8. Search with the tag "ansible" and, from Bulk Change, click Activate in ansible-scan.

Figure: Search and apply

9. Set ansible-scan as the Default. The Ansible rules will then also apply to other YAML files. You can now see that you have 20 rules for YAML and 38 rules for Ansible.

Figure: Set ansible-scan

Create a New Project and Run Your First Scan

1. Navigate to http://localhost:9000 in your browser to launch the SonarQube Server console.
2. Click Create Project and select Local project. For demo purposes, you can download Ansible code from this GitHub repository.

Figure: Create local project

3. Enter a project display name, project key, and branch name, and click Next.

Figure: Local project creation

4. Under Choose the baseline for new code for this project, select Use the global setting and click Create project. Read the information below the selection to understand why you should pick this choice.

Figure: Select settings

5. Select Locally under Analysis Method, as you will be running this locally on your machine.

Figure: Analysis method

6. Under Provide a token, select Generate a token. Give your token a name, click Generate, and click Continue. Under Run analysis on your project, select Other, and then select the operating system (OS).
7. Click on the Copy icon to save the commands to the clipboard.

Figure: Generate token

8. In a terminal or command prompt, navigate to your Ansible code folder, then paste and execute the commands there. You can see the ansible-lint rules called in the log.
    INFO: ansible version:
    INFO: ansible [core 2.17.0]
    INFO: config file = None
    INFO: configured module search path = ['/Users/vmac/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    INFO: ansible python module location = /usr/local/Cellar/ansible/10.0.1/libexec/lib/python3.12/site-packages/ansible
    INFO: ansible collection location = /Users/vmac/.ansible/collections:/usr/share/ansible/collections
    INFO: executable location = /usr/local/bin/ansible
    INFO: python version = 3.12.3 (main, Apr 9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)] (/usr/local/Cellar/ansible/10.0.1/libexec/bin/python)
    INFO: jinja version = 3.1.4
    INFO: libyaml = True
    INFO: ansible-lint version:
    INFO: ansible-lint 24.6.0 using ansible

9. On the SonarQube server console, you can see the analysis information.

Figure: Overview, Ansible code analyzed

Conclusion

In this article, you learned how to install, configure, and run the SonarQube plugin for Ansible, which allows developers and operations teams to analyze Ansible playbooks and roles for code quality, security vulnerabilities, and best practices. It leverages the YAML SonarQube plugin and adds rules specifically tailored for Ansible.

Suggested Reading

If you are new to Ansible and want to learn the tools and capabilities it provides, check my previous articles:

- Ansible Beyond Automation
- Automation Ansible AI
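The Scanner CLI section of this article notes that project configuration is read from sonar-project.properties or passed on the command line, while the walkthrough relies on commands copied from the console. For readers who prefer the file, a minimal sketch follows. The project key ansible-demo, the project name, and the function name write_sonar_properties are placeholders invented for this example; the property keys themselves (sonar.projectKey, sonar.projectName, sonar.sources, sonar.host.url) are standard analysis parameters.

```shell
#!/bin/sh
# Writes a minimal sonar-project.properties into the given directory.
# "ansible-demo" and "Ansible Demo" are placeholder values for this sketch.
write_sonar_properties() {
  target_dir="$1"
  cat > "$target_dir/sonar-project.properties" <<'EOF'
# Must match the project key entered when the project was created
sonar.projectKey=ansible-demo
sonar.projectName=Ansible Demo
# Analyze everything under the directory the scanner is run from
sonar.sources=.
sonar.host.url=http://localhost:9000
EOF
}

# Demo: generate the file in a scratch directory and print it.
demo_dir=$(mktemp -d)
write_sonar_properties "$demo_dir"
cat "$demo_dir/sonar-project.properties"

# With the file placed in your Ansible code folder, the scan would be run as
# (assumes a recent server; older versions use sonar.login instead):
#   sonar-scanner -Dsonar.token=<your-token>
```

Keeping the token on the command line or in an environment variable, rather than in the properties file, avoids committing it alongside the playbooks.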

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Why Is Kubernetes Debugging So Problematic?

Debugging application issues in a Kubernetes cluster can often feel like navigating a labyrinth. Containers are ephemeral by design and intended to be immutable once deployed. This presents a unique challenge when something goes wrong and we need to dig into the issue. Before diving into the debugging tools and techniques, it's essential to grasp the core problem: why modifying container instances directly is a bad idea. This blog post will walk you through the intricacies of Kubernetes debugging, offering insights and practical tips to effectively troubleshoot your Kubernetes environment.

The Problem With Kubernetes

The Immutable Nature of Containers

One of the fundamental principles of Kubernetes is the immutability of container instances. This means that once a container is running, it shouldn't be altered. Modifying containers on the fly can lead to inconsistencies and unpredictable behavior, especially as Kubernetes orchestrates the lifecycle of these containers, replacing them as needed. Imagine trying to diagnose an issue only to realize that the container you’re investigating has been modified, making it difficult to reproduce the problem consistently.

The idea behind this immutability is to ensure that every instance of a container is identical to any other instance. This consistency is crucial for achieving reliable, scalable applications. If you start modifying containers, you undermine this consistency, leading to a situation where one container behaves differently from another, even though they are supposed to be identical.

The Limitations of kubectl exec

We often start our journey in Kubernetes with commands such as:

    kubectl exec -ti <pod-name> -- sh

This logs into a container and feels like accessing a traditional server with SSH. However, this approach has significant limitations. Containers often lack basic diagnostic tools: no vim, no traceroute, sometimes not even a shell.
This can be a rude awakening for those accustomed to a full-featured Linux environment. Additionally, if a container crashes, kubectl exec becomes useless as there's no running instance to connect to. This tool is insufficient for thorough debugging, especially in production environments.

Consider the frustration of logging into a container only to find out that you can't even open a simple text editor to check configuration files. This lack of basic tools means that you are often left with very few options for diagnosing problems. Moreover, the minimalistic nature of many container images, designed to reduce their attack surface and footprint, exacerbates this issue.

Avoiding Direct Modifications

While it might be tempting to install missing tools on the fly using commands like apt-get install vim, this practice violates the principle of container immutability. In production, installing packages dynamically can introduce new dependencies, potentially causing application failures. The risks are high, and it's crucial to maintain the integrity of your deployment manifests, ensuring that all configurations are predefined and reproducible.

Imagine a scenario where a quick fix in production involves installing a missing package. This might solve the immediate problem but could lead to unforeseen consequences. Dependencies introduced by the new package might conflict with existing ones, leading to application instability. Moreover, this approach makes it challenging to reproduce the exact environment, which is vital for debugging and scaling your application.

Enter Ephemeral Containers

The solution to the aforementioned problems lies in ephemeral containers. Kubernetes allows the creation of these temporary containers within the same pod as the application container you need to debug. These ephemeral containers are isolated from the main application, ensuring that any modifications or tools installed do not impact the running application.
Ephemeral containers provide a way to bypass the limitations of kubectl exec without violating the principles of immutability and consistency. By launching a separate container within the same pod, you can inspect and diagnose the application container without altering its state. This approach preserves the integrity of the production environment while giving you the tools you need to debug effectively.

Using kubectl debug

The kubectl debug command is a powerful tool that simplifies the creation of ephemeral containers. Unlike kubectl exec, which logs into the existing container, kubectl debug creates a new container within the same pod (or, when --copy-to is used, within a copy of the pod). This container can run a different OS, mount the application container’s filesystem, and provide all necessary debugging tools without altering the application’s state. This method ensures you can inspect and diagnose issues even if the original container is not operational.

For example, let’s consider a scenario where we’re debugging a container using an Ubuntu debug container:

    kubectl debug <pod-name> -it --image=ubuntu --share-processes --copy-to=<myapp-debug>

This command creates a copy of the pod, named by --copy-to, and adds a new Ubuntu-based container to it with process namespace sharing enabled, providing a full-fledged environment to diagnose the application container. Even if the original container lacks a shell or crashes, the debug container remains operational, allowing you to perform necessary checks and install tools as needed. It relies on the fact that we can have multiple containers in the same pod; that way, we can inspect the filesystem of the debugged container without physically entering that container.

Practical Application of Ephemeral Containers

To illustrate, let’s delve deeper into how ephemeral containers can be used in real-world scenarios. Suppose you have a container that consistently crashes due to a mysterious issue.
By deploying an ephemeral container with a comprehensive set of debugging tools, you can monitor the logs, inspect the filesystem, and trace processes without worrying about the constraints of the original container environment.

For instance, you might encounter a situation where an application container crashes due to an unhandled exception. By using kubectl debug, you can create an ephemeral container that shares the same network namespace as the original container. This allows you to capture network traffic and analyze it to understand if there are any issues related to connectivity or data corruption.

Security Considerations

While ephemeral containers reduce the risk of impacting the production environment, they still pose security risks. It’s critical to restrict access to debugging tools and ensure that only authorized personnel can deploy ephemeral containers. Treat access to these systems with the same caution as handing over the keys to your infrastructure.

Ephemeral containers, by their nature, can access sensitive information within the pod. Therefore, it is essential to enforce strict access controls and audit logs to track who is deploying these containers and what actions are being taken. This ensures that the debugging process does not introduce new vulnerabilities or expose sensitive data.

Interlude: The Role of Observability

While tools like kubectl exec and kubectl debug are invaluable for troubleshooting, they are not replacements for comprehensive observability solutions. Observability allows you to monitor, trace, and log the behavior of your applications in real time, providing deeper insights into issues without the need for intrusive debugging sessions.

These tools aren't meant for everyday debugging: that role should be occupied by various observability tools. I will discuss observability in more detail in an upcoming post.
Command Line Debugging

While tools like kubectl exec and kubectl debug are invaluable, there are times when you need to dive deep into the application code itself. This is where we can use command line debuggers. Command line debuggers allow you to inspect the state of your application at a very granular level, stepping through code, setting breakpoints, and examining variable states. Personally, I don't use them much.

For instance, Java developers can use jdb, the Java Debugger, which is analogous to gdb for C/C++ programs. Here’s a basic rundown of how you might use jdb in a Kubernetes environment:

1. Set Up Debugging

First, you need to start your Java application with debugging enabled. This typically involves adding a debug flag to your Java command (as discussed in my post here, there's an even more powerful way that doesn't require a restart):

    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar myapp.jar

2. Port Forwarding

Since the debugger needs to connect to the application, you’ll set up port forwarding to expose the debug port of your pod to your local machine. This is important because JDWP is dangerous: the protocol has no authentication, so the debug port should only be reachable through the forward and never exposed publicly:

    kubectl port-forward <pod-name> 5005:5005

3. Connecting the Debugger

With port forwarding in place, you can now connect jdb to the remote application:

    jdb -attach localhost:5005

From here, you can use jdb commands to set breakpoints, step through code, and inspect variables. This process allows you to debug issues within the code itself, which can be invaluable for diagnosing complex problems that aren’t immediately apparent through logs or superficial inspection.

Connecting a Standard IDE for Remote Debugging

I prefer IDE debugging by far. I have never used jdb for anything other than a demo. Modern IDEs support remote debugging, and by leveraging Kubernetes port forwarding, you can connect your IDE directly to a running application inside a pod.
To set up remote debugging, we start with the same steps as for command line debugging: configuring the application and setting up the port forwarding.

1. Configure the IDE

In your IDE (e.g., IntelliJ IDEA, Eclipse), set up a remote debugging configuration. Specify the host as localhost and the port as 5005.

2. Start Debugging

Launch the remote debugging session in your IDE. You can now set breakpoints, step through code, and inspect variables directly within the IDE, just as if you were debugging a local application.

Conclusion

Debugging Kubernetes environments requires a blend of traditional techniques and modern tools designed for container orchestration. Understanding the limitations of kubectl exec and the benefits of ephemeral containers can significantly enhance your troubleshooting process. However, the ultimate goal should be to build robust observability into your applications, reducing the need for ad-hoc debugging and enabling proactive issue detection and resolution.

By following these guidelines and leveraging the right tools, you can navigate the complexities of Kubernetes debugging with confidence and precision. In the next installment of this series, we’ll delve into common configuration issues in Kubernetes and how to address them effectively.
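As a footnote to the ephemeral-container discussion in this post: when kubectl debug is given --copy-to, it works on a copy of the pod. To attach an ephemeral container to the running pod itself, the --target flag can be used instead. The wrapper below is only a sketch; the function name debug_pod, its argument order, and the default busybox image tag are choices made for this example, and the pod and container names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: attach an ephemeral debug container to a running pod.
# Usage: debug_pod <pod-name> <container-name> [image]
debug_pod() {
  pod="$1"
  container="$2"
  image="${3:-busybox:1.36}"   # any image that carries the tools you need
  # --target places the debug container in the process namespace of the
  # named container, so its processes are visible from the debug shell.
  kubectl debug -it "$pod" --image="$image" --target="$container"
}

# Example (requires cluster access; names are placeholders):
#   debug_pod my-app-pod my-app ubuntu
```

Because the ephemeral container joins the live pod, the access controls described under Security Considerations apply to anyone allowed to run it.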

By Shai Almog

CORE

Reliability Models and Metrics for Test Engineering

Tech teams do their best to develop amazing software products. They spend countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products.

The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to:

Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. It includes considerations for system uptime, fault tolerance, and recovery mechanisms.
Performance: The system should function within acceptable speed and resource usage parameters. It scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions.
Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior.
Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions.
Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. It includes mechanisms for data backup, recovery, and rollback.
Maintainability: The system should be easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements.

This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted with their pros and cons. A distinction between software and hardware failure models follows.
Finally, reliability growth models are explored, including a list of factors for choosing the right model.

Mean Time Metrics

Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time between failures), MTTR (mean time to recovery, repair, respond, or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents.

The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties.

Mean Time To Repair: This focuses on the time it takes to fix a failed component.
Mean Time To Recovery: This considers the time to restore full functionality after a failure.
Mean Time To Respond: This emphasizes the initial response time to acknowledge and investigate an incident.
Mean Time To Resolve: This encompasses the entire incident resolution process, including diagnosis, repair, and recovery.

While these metrics overlap, each provides a distinct perspective on how quickly a team resolves incidents.

MTTA, or Mean Time To Acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness.

MTBF, or Mean Time Between Failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, resource allocation, and predicting system uptime.
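As a rough illustration of how these averages could be derived from incident records, here is a minimal Python sketch. The Incident record and its three timestamps are hypothetical, MTTR is taken in its "recovery" sense (failure start to service restored), and MTBF is computed as the average gap between the starts of consecutive failures so that, as described above, it covers both operating and repair time. Other definitions are in use, so treat this as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime       # failure began / alert fired
    acknowledged: datetime  # someone started investigating
    resolved: datetime      # service restored


def mean_minutes(deltas):
    """Average a collection of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtta(incidents):
    """Mean time to acknowledge: alert trigger to initial investigation."""
    return mean_minutes(i.acknowledged - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recovery: failure start to service restored."""
    return mean_minutes(i.resolved - i.started for i in incidents)


def mtbf(incidents):
    """Mean time between failures: average gap between consecutive failure
    starts, which includes repair time (needs at least two incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]
    return mean_minutes(gaps)


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1)
sample = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(t0 + timedelta(hours=10), t0 + timedelta(hours=10, minutes=15),
             t0 + timedelta(hours=10, minutes=65)),
    Incident(t0 + timedelta(hours=30), t0 + timedelta(hours=30, minutes=10),
             t0 + timedelta(hours=30, minutes=50)),
]
print(mtta(sample), mttr(sample), mtbf(sample))  # 10.0 50.0 900.0 (minutes)
```

Keeping the three timestamps separate is what makes it possible to tell slow acknowledgement apart from slow repair when the totals look similar.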
For a system that cannot or should not be repaired, MTTF, or Mean Time To Failure, represents the average time that the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing. This makes MTTF particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity.

An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans.

MTBF: This would represent the average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road.
MTTF: This would represent the average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable or not.

Key Differentiators

| Feature | MTBF | MTTF |
|---|---|---|
| Repairable system | Yes | No |
| Repair time | Considered in the calculation | Not considered in the calculation |
| Failure focus | Time between subsequent failures | Time to the first failure |
| Application | Planning maintenance, resource allocation | Assessing inherent system reliability |

The Bigger Picture

MTTR, MTTA, MTTF, and MTBF can also be used together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues.
Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system.

Probability Distributions for Reliability

The following probability distributions are commonly used in reliability engineering to model the time until the failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time.

Exponential Distribution Model

This model assumes a constant failure rate over time. This means that the probability of a component failing is independent of its age or how long it has been operating.

Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited.
Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the Exponential Distribution Model wouldn't capture.

Weibull Distribution Model

This model offers more flexibility by allowing the failure rate to change over time. It can model situations where the failure rate is highest at an early stage and then decreases (infant mortality failures) or where it increases at a later stage (wear-out failures).

Infant mortality failures: This could represent new components with manufacturing defects that are more likely to fail early on.
Wear-out failures: This could represent components like mechanical parts that degrade with use and become more likely to fail as they age.

Applications: The Weibull Distribution Model is more versatile than the Exponential Distribution Model.
It's a good choice for analyzing a wider range of hardware components with varying failure patterns.
Limitations: The Weibull Distribution Model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the Exponential Distribution would suffice.

The Software vs Hardware Distinction

The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors:

1. Root Cause of Failures

Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects.
Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time.

2. Failure Patterns

Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out.
Software: The behavior of software failures over time can be very tricky and usually depends on the evolution of our code, among other factors.
A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running.

3. Failure Prediction, Prevention, Repairs

Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components.
Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement.

4. Interdependence and Interaction Failures

The level of interdependence between software and hardware varies for different systems, domains, and applications. Tight coupling between software and hardware may cause interaction failures. There can be software failures due to hardware and vice versa.

Here's a table summarizing the key differences:

| Feature | Hardware Reliability Models | Software Reliability Models |
|---|---|---|
| Root cause of failures | Physical degradation, defects, environmental factors | Code defects, design flaws, external dependencies |
| Failure patterns | Time-dependent (infant mortality, wear-out) | Non-time-dependent (bugs remain until fixed) |
| Prediction focus | Average times between failures (MTBF, MTTF) | Number of remaining defects |
| Prevention strategies | Preventive maintenance schedules | Code review, testing, bug fixes |

By understanding the distinct characteristics of hardware and software failures, we may be able to leverage tailored reliability models, whenever necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems.
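The time-dependent hardware patterns summarized above (infant mortality, constant random failures, wear-out) map directly onto the shape parameter of the Weibull distribution discussed earlier. The following Python sketch uses the standard two-parameter Weibull hazard rate h(t) = (k/λ)(t/λ)^(k-1) and reliability function R(t) = exp(-(t/λ)^k); the function names and the sample parameter values are illustrative assumptions, not taken from any particular dataset.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Probability of surviving past time t: R(t) = exp(-(t / lam) ** k)."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be non-negative; shape and scale positive")
    return math.exp(-((t / scale) ** shape))


# Illustrative values only: scale of 1000 hours, three different shapes.
for label, k in [("infant mortality", 0.5), ("constant (exponential)", 1.0), ("wear-out", 2.0)]:
    early = weibull_hazard(10, k, 1000.0)
    late = weibull_hazard(900, k, 1000.0)
    print(f"{label:>24}: h(10h)={early:.6f}  h(900h)={late:.6f}")
```

With a shape below 1 the hazard falls over time, with a shape of exactly 1 it stays constant (the exponential model is this special case), and with a shape above 1 it rises, which is why fitting the shape parameter needs more data than assuming a constant rate.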
Code Complexity

Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity:

SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics
Fortify: Provides static code analysis for security vulnerabilities and code complexity
CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases
PMD: An open-source tool for identifying common coding flaws and complexity metrics

Defect Density

Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product.

Reliability Growth Models

Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal.

Some models exhibit characteristics of exponential growth, others exhibit characteristics of power law growth, and some exhibit both. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults.
While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help for further study. Traditional growth models encompass the commonly used and foundational models, while the Bayesian approach represents a distinct methodology. The advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative and not exhaustive.

Traditional Growth Models

Musa-Okumoto Model

It assumes a logarithmic Poisson process for fault detection and removal, where the expected number of failures observed grows as a logarithmic function of execution time.

Jelinski-Moranda Model

It assumes the failure intensity is constant between successive failures. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system, so each fault removed lowers the failure rate by the same amount.

Goel-Okumoto Model

It incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection.

Non-Homogeneous Poisson Process (NHPP) Models

They assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time.

Bayesian Approach

Wall and Ferguson Model

It combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth.

Advanced Growth Models

Duane Model

This model assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate, and it reflects how quickly the reliability of the system is improving as testing and debugging occur.
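To make the Duane postulate concrete: if N(T) failures have been observed after cumulative test time T, the cumulative MTBF is T / N(T), and the postulate says it follows b·T^α, which is a straight line on log-log axes. The sketch below fits α (the growth rate) and b by ordinary least squares on the logarithms. The function names and the sample observations are made up for illustration, and a real analysis would also check goodness of fit.

```python
import math


def fit_duane(cum_times, cum_failures):
    """Fit cumulative MTBF = b * T**alpha by least squares in log-log space.

    cum_times:    cumulative test time at each observation point
    cum_failures: cumulative number of failures seen by that time
    Returns (alpha, b).
    """
    xs = [math.log(t) for t in cum_times]
    ys = [math.log(t / n) for t, n in zip(cum_times, cum_failures)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope of the log-log line
    b = math.exp(y_bar - alpha * x_bar)    # intercept, back on the linear scale
    return alpha, b


def cumulative_mtbf(t, alpha, b):
    """Cumulative MTBF predicted by the fitted power law."""
    return b * t ** alpha


def instantaneous_mtbf(t, alpha, b):
    """Current MTBF implied by the same power law (requires alpha < 1)."""
    return cumulative_mtbf(t, alpha, b) / (1 - alpha)


# Hypothetical test log: cumulative hours and cumulative failure counts.
hours = [100.0, 250.0, 500.0, 1000.0, 2000.0]
failures = [8, 14, 21, 31, 46]
alpha, b = fit_duane(hours, failures)
print(f"growth rate alpha={alpha:.3f}, projected MTBF at 4000h="
      f"{instantaneous_mtbf(4000.0, alpha, b):.1f}h")
```

A larger α means reliability is improving faster per unit of test time; a value near zero suggests testing and fixing are not moving the failure rate much.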
Coutinho Model

Based on the Duane model, it extends to the idea of an instantaneous failure rate. This rate involves the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth.

Gooitzen Model

It incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing. This model provides a more realistic representation of the fault detection and removal process by accounting for imperfect debugging.

Littlewood Model

It acknowledges that as system failures are discovered during testing, the underlying faults causing these failures are repaired. Consequently, the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth when a software repair introduces further errors.

Rayleigh Model

The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase. It provides an estimation of the number of defects that will occur in the future based on the observed data.

Choosing the Right Model

There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider.

Specific objectives: Determine the specific objectives and goals of reliability growth analysis. Whether the goal is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes.
Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms.
Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems.
Development stage: Consider the stage of development the system is in. Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors.
Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable.
Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models may require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders.
Assumptions and limitations: Understand the underlying assumptions and limitations of each reliability growth model. Choose a model whose assumptions align with the characteristics of the system and the available data.
Predictive capability: Assess the predictive capability of the model in accurately forecasting future reliability levels based on past data.
Flexibility and adaptability: Consider the flexibility and adaptability of the model to different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts.
Resource requirements: Evaluate the resource requirements associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the available resources and capabilities of the organization.
Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models.
Models that have been validated and verified against real-world data are more trustworthy and reliable.
Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of reliability growth model. Certain industries may have specific guidelines or recommendations for reliability analysis that need to be adhered to.
Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved.

Wrapping Up

Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health.

The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with our system's characteristics, we can create a comprehensive reliability assessment plan. This tailored approach may allow us to identify potential weaknesses and prioritize improvement efforts.

By Stelios Manioudakis, PhD

CORE

High-Volume Security Analytics: Splunk vs. Flink for Rule-Based Incident Detection

The amount of data generated by modern systems has become a double-edged sword for security teams. While it offers valuable insights, sifting through mountains of logs and alerts manually to identify malicious activity is no longer feasible. Here's where rule-based incident detection steps in, offering a way to automate the process by leveraging predefined rules to flag suspicious activity. However, the choice of tool for processing high-volume data for real-time insights is crucial. This article delves into the strengths and weaknesses of two popular options: Splunk, a leading batch search tool, and Flink, a powerful stream processing framework, specifically in the context of rule-based security incident detection.

Splunk: Powerhouse Search and Reporting

Splunk has become a go-to platform for making application and infrastructure logs readily available for ad-hoc search. Its core strength lies in its ability to ingest log data from various sources, centralize it, and enable users to explore it through powerful search queries. This empowers security teams to build comprehensive dashboards and reports, providing a holistic view of their security posture. Additionally, Splunk supports scheduled searches, allowing users to automate repetitive queries and receive regular updates on specific security metrics. This can be particularly valuable for configuring rule-based detections, monitoring key security indicators, and identifying trends over time.

Flink: The Stream Processing Champion

Apache Flink, on the other hand, takes a fundamentally different approach. It is a distributed processing engine designed to handle stateful computations over unbounded and bounded data streams. Unlike Splunk's batch processing, Flink excels at real-time processing, enabling it to analyze data as it arrives, offering near-instantaneous insights.
This makes it ideal for scenarios where immediate detection and response are paramount, such as identifying ongoing security threats or preventing fraudulent transactions in real time. Flink's ability to scale horizontally across clusters makes it suitable for handling massive data volumes, a critical factor for organizations wrestling with ever-growing security data.

Case Study: Detecting User Login Attacks

Let's consider a practical example: a rule designed to detect potential brute-force login attempts. This rule aims to identify users who experience a high number of failed login attempts within a specific timeframe (e.g., an hour). Here's how the rule implementation would differ in Splunk and Flink:

Splunk Implementation

    sourcetype=login_logs (result="failure" OR result="failed") earliest=-1h
    | stats count by user
    | where count > 5

This Splunk search filters login logs for failed attempts over the past hour, calculates the count of failed attempts per user, and keeps only the users whose count exceeds a predefined threshold (5). Saved as a scheduled search with an alert action, it raises an alert such as "Potential Brute Force Login Attempt" for each user it returns. While efficient for basic detection, it relies on batch processing, potentially introducing latency in identifying ongoing attacks.

Flink Implementation

SQL

    -- Identifiers that collide with reserved keywords (such as user) are quoted in backticks.
    SELECT `user`, COUNT(*) AS failed_attempts
    FROM login_logs
    WHERE `result` = 'failure' OR `result` = 'failed'
    GROUP BY `user`, TUMBLE(event_time, INTERVAL '1' HOUR)
    HAVING COUNT(*) > 5;

Flink takes a more real-time approach. As each login event arrives, Flink checks the user and result. If it's a failed attempt, a counter for that user's one-hour window is incremented, so no historical data has to be re-scanned. Note that with a tumbling window, the query above emits a row for a user when the window closes; raising an alert at the exact moment the count surpasses the threshold (5) requires additional logic, such as a custom process function. Either way, detection of suspicious login activity no longer waits on a scheduled search.

A Deep Dive: Splunk vs. Flink for Detecting User Login Attacks

The underlying processing models of Splunk and Flink lead to fundamental differences in how they handle security incident detection. Here's a closer look at the key areas:

Batch vs. Stream Processing

Splunk

Splunk operates on historical data. Security analysts write search queries that retrieve and analyze relevant logs. These queries can be configured to run automatically on a schedule. This is a batch processing approach, meaning Splunk needs to search through a potentially large volume of data to identify anomalies or trends. For the login attempt example, Splunk would need to query all login logs within the past hour every time the search is run to calculate the failed login count per user. This can introduce significant detection latency and increase compute cost, especially when dealing with large datasets.

Flink

Flink analyzes data streams in real time. As each login event arrives, Flink processes it immediately. This stream-processing approach allows Flink to maintain a continuous state and update it with each incoming event. In the login attempt scenario, Flink keeps track of failed login attempts per user within a one-hour window. With each new login event, Flink checks the user and result. If it's a failed attempt, the counter for that user's window is incremented. This eliminates the need to query a large amount of historical data every time a check is needed.

Windowing

Splunk

Splunk performs windowing calculations after retrieving all relevant logs. In our example, the search first retrieves all failed login attempts within the past hour, and only then does the stats count by user stage calculate the count for each user. This approach can be inefficient for real-time analysis, especially as data volume increases.

Flink

Flink maintains window state and continuously updates it based on incoming events. Flink uses a concept called "time windows" to partition the data stream into specific time intervals (e.g., one hour). For each window, Flink keeps track of relevant information, such as the number of failed login attempts per user.
As new data arrives, Flink updates the state for the current window. This eliminates the need for a separate post-processing step to calculate windowed aggregations.

Alerting Infrastructure

Splunk

Splunk relies on pre-configured alerting actions within the platform. Splunk allows users to define search queries that trigger alerts when specific conditions are met. These alerts can be delivered through various channels such as email, SMS, or integrations with other security tools.

Flink

Flink might require integration with external tools for alerts. While Flink can identify anomalies in real time, it may not have built-in alerting functionalities like Splunk. Security teams often integrate Flink with external Security Information and Event Management (SIEM) solutions for alert generation and management.

In essence, Splunk operates like a detective sifting through historical evidence, while Flink functions as a security guard constantly monitoring activity. Splunk is a valuable tool for forensic analysis and identifying historical trends. However, for real-time threat detection and faster response times, Flink's stream processing capabilities offer a significant advantage.

Choosing the Right Tool: A Balancing Act

While Splunk provides a user-friendly interface and simplifies rule creation, its batch processing introduces latency, which can be detrimental to real-time security needs. Flink excels in real-time processing and scalability, but it requires more technical expertise to set up and manage.

Beyond Latency and Ease of Use: Additional Considerations

The decision between Splunk and Flink goes beyond just real-time processing and ease of use. Here are some additional factors to consider:

Data Volume and Variety

Security teams are often overwhelmed by the sheer volume and variety of data they need to analyze.
Splunk excels at handling structured data like logs but struggles with real-time ingestion and analysis of unstructured data like network traffic or social media feeds. Flink, with its distributed architecture, can handle diverse data types at scale.

Alerting and Response

Both Splunk and Flink can trigger alerts based on rule violations. However, Splunk integrates seamlessly with existing Security Information and Event Management (SIEM) systems, streamlining the incident response workflow. Flink might require additional development effort to integrate with external alerting and response tools.

Cost

Splunk's licensing costs are based on data ingestion volume, which can become expensive for organizations with massive security data sets. Flink, being open-source, eliminates licensing fees. However, the cost of technical expertise for setup, maintenance, and rule development for Flink needs to be factored in.

The Evolving Security Landscape: A Hybrid Approach

The security landscape is constantly evolving, demanding a multifaceted approach. Many organizations find value in a hybrid approach, leveraging the strengths of both Splunk and Flink.

Splunk as the security hub: Splunk can serve as a central repository for security data, integrating logs from various sources, including real-time data feeds from Flink. Security analysts can utilize Splunk's powerful search capabilities for historical analysis, threat hunting, and investigation.
Flink for real-time detection and response: Flink can be deployed for real-time processing of critical security data streams, focusing on identifying and responding to ongoing threats.

This combination allows security teams to enjoy the benefits of both worlds:

Comprehensive security visibility: Splunk provides a holistic view of historical and current security data.
Real-time threat detection and response: Flink enables near-instantaneous identification and mitigation of ongoing security incidents.
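To make the real-time half of this picture more tangible, the per-event state update described earlier for the failed-login rule can be sketched in plain Python. This is not Flink's API; it only illustrates the idea of keeping a small counter per user and window instead of re-scanning an hour of logs. The class name, the event fields, and the choice to alert once per user per window are all assumptions made for the example, and a real deployment would also need to expire old window state.

```python
from collections import defaultdict


class FailedLoginDetector:
    """Toy stand-in for stateful stream processing (not Flink code).

    Keeps one counter per (user, tumbling window) and reports the moment
    a user's failed-login count first exceeds the threshold.
    """

    def __init__(self, window_seconds: int = 3600, threshold: int = 5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.counts = defaultdict(int)  # (user, window_start) -> failures

    def on_event(self, user: str, result: str, event_time: float):
        """Process one login event; return an alert string or None."""
        if result not in ("failure", "failed"):
            return None
        # Align the event to the start of its fixed, non-overlapping window.
        window_start = int(event_time // self.window_seconds) * self.window_seconds
        key = (user, window_start)
        self.counts[key] += 1
        # Fire exactly once, when the count first goes above the threshold.
        if self.counts[key] == self.threshold + 1:
            return f"Potential brute-force login attempt for user: {user}"
        return None


# Usage: feed events as they arrive (timestamps in seconds).
detector = FailedLoginDetector()
for second in range(6):
    alert = detector.on_event("alice", "failure", 100.0 + second)
    if alert:
        print(alert)
```

Each event touches a single dictionary entry, so the cost per event stays constant regardless of how much history has accumulated, which is the property the comparison above attributes to stream processing.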
Conclusion: Choosing the Right Tool for the Job

Neither Splunk nor Flink is a one-size-fits-all solution for rule-based incident detection. The optimal choice depends on your specific security needs, data volume, technical expertise, and budget. Security teams should carefully assess these factors and potentially consider a hybrid approach to leverage the strengths of both Splunk and Flink for a robust and comprehensive security posture.

By understanding the strengths and weaknesses of each tool, security teams can make informed decisions about how to best utilize them to detect and respond to security threats in a timely and effective manner.

By Mayank Singhi

If Software Quality Is Everybody’s Responsibility, So Is Failure

In many large organizations, software quality is primarily viewed as the responsibility of the testing team. When bugs slip through to production, or products fail to meet customer expectations, testers are the ones blamed. However, taking a closer look, quality (and likewise, failure) extends well beyond any one discipline. Quality is a responsibility shared across an organization.

When quality issues arise, the root cause is rarely something testing alone could have prevented. Typically, there were breakdowns in communication, unrealistic deadlines, inadequate design specifications, insufficient training, or corporate governance policies that incentivized rushing. In other words, quality failures tend to stem from broader organizational and leadership failures. Scapegoating testers for systemic issues is counterproductive. It obscures the real problems and stands in the way of meaningful solutions to quality failings.

Testing in Isolation

In practice, all too often, testing teams still work in isolation from the rest of the product development lifecycle. They are brought in at the end, given limited information, and asked to validate someone else’s work. Under these conditions, their ability to prevent defects is severely constrained.

For example, without access to product requirement documents, test cases may overlook critical functions that need validation. With short testing timelines, extensive test coverage becomes impossible. Without insight into design decisions or access to developers, some defects found in testing prove impossible to diagnose effectively.

Testers are often parachuted in when the time and cost of repairing a defect have grown to be unfeasible. In this isolated model, testing serves as little more than a final safety check before release. The burden of quality is passed almost entirely to the testers. When the inevitable bugs still slip through, testers then make for easy scapegoats.

Who Owns Software Quality?
In truth, responsibility for product quality is distributed across an organization. So, what can you do?

Quality is everyone’s responsibility. Image sources: Kharnagy (Wikipedia), under CC BY-SA 4.0 license, combined with an image from Pixabay.

Executives and leadership teams: Set the tone and policies around quality, balancing it appropriately against other priorities like cost and schedule. Meanwhile, provide the staffing, resources, and timescale needed for a mature testing effort.
Product managers: Gather user requirements, define expected functionality, and support test planning.
Developers: Follow secure coding practices, perform unit testing, enable automated testing, and respond to defects uncovered in testing.
User experience designers: Consider quality and testability during UX design. Conduct user acceptance testing on prototypes.
Information security: Perform security reviews of code, architectures, and configurations. Guide testing-relevant security use cases.
Testers: Develop test cases based on user stories, execute testing, log defects, regression-test fixes, and report on quality to stakeholders.
Operations: Monitor systems once deployed, gather production issues, and report data to inform future testing.
Customers: Voice your true quality expectations, participate in UAT, and report real-world issues once launched.

As this illustrates, no one functional area owns quality alone. Testers contribute essential verification, but quality is truly everyone’s responsibility.

Governance Breakdowns Lead to Quality Failures

In a 2023 episode of the "Why Didn’t You Test That?" podcast, Marcus Merrell, Huw Price, and I discussed how testing remains treated as a “janitorial” effort and cost center, and how you can align testing and quality.
When organizations fail to acknowledge the shared ownership of software quality, governance issues arise that enable quality failures:

Unrealistic deadlines: Attempting to achieve overly aggressive schedules often comes at the expense of quality and sufficient testing timelines. Leadership teams must balance market demands against release readiness.
Insufficient investment: Success requires appropriate staffing and support for all areas that influence quality. These range from design and development to testing. Underinvestment leads to unhealthy tradeoffs.
Lack of collaboration: Cross-functional coordination produces better quality than work done in silos. Governance policies should foster collaboration across product teams, not hinder it.
Misaligned priorities: Leadership should incentivize balanced delivery, not just speed or cost savings. Quality cannot be someone else’s problem.
Lack of transparency: Progress reporting should incorporate real metrics on quality. Burying or obscuring defects undermines governance.
Absence of risk management: Identifying and mitigating quality risks through appropriate action requires focus from project leadership. Lacking transparency about risk prevents proper governance.

When these governance breakdowns occur, quality suffers, and failures follow. However, the root causes trace back to organizational leadership and culture, not solely the testing function.
The Costs of Obscuring Systemic Issues

Blaming testers for failures caused by systemic organizational issues leads to significant costs:

Loss of trust: When testers become scapegoats, it erodes credibility and trust in the testing function, inhibiting their ability to advocate for product quality.
Staff turnover: Testing teams experience higher turnover when the broader organization fails to recognize their contributions and value.
Less collaboration: Other groups avoid collaborating with testers perceived as bottlenecks or impediments rather than partners.
Reinventing the wheel: Lessons from past governance breakdowns go unlearned, leading those issues to resurface in new forms down the line.
Poorer customer experiences: Ultimately, obscuring governance issues around quality leads to more negative customer experiences that damage an organization’s reputation and bottom line.

Taking Ownership of Software Quality

Elevating quality as an organization-wide responsibility is essential for governance, transparency, and risk management. Quality cannot be the burden of one isolated function, and leadership should foster a culture that values quality intrinsically, rather than viewing it as an afterthought or checkbox.

To build ownership, organizations need to shift testing upstream, integrating it earlier into requirements planning, design reviews, and development processes. It also requires modernizing the testing practice itself, utilizing the full range of innovation available: from test automation, shift-left testing, and service virtualization, to risk-based test case generation, modeling, and generative AI.

With a shared understanding of who owns quality, governance policies can better balance competing demands around cost, schedule, capabilities, and release readiness. Testing insights will inform smarter tradeoffs, avoiding quality failures and the finger-pointing that today follows them.
This future state reduces the likelihood of failures — but also acknowledges that some failures will still occur despite best efforts. In these cases, organizations must have a governance model to transparently identify root causes across teams, learn from them, and prevent recurrence. In a culture that values quality intrinsically, software testers earn their place as trusted advisors, rather than get relegated to fault-finders. They can provide oversight and validation of other teams’ work without fear of backlash. And their expertise will strengthen rather than threaten collaborative delivery. With shared ownership, quality ceases to be a “tester problem” at all. It becomes an organizational value that earns buy-in across functional areas. Leadership sets the tone for an understanding that if quality is everyone’s responsibility — so too is failure.

By Rich Jordan

Failure Is Required: Understanding Fail-Safe and Fail-Fast Strategies

Failures in software systems are inevitable. How these failures are handled can significantly impact system performance, reliability, and the business’s bottom line. In this post, I want to discuss the upside of failure: why you should seek failure, why failure is good, and why avoiding failure can reduce the reliability of your application. We will start with fail-fast vs. fail-safe, which will take us to a second discussion about failures in general.

As a side note, if you like the content of this and the other posts in this series, check out my Debugging book, which covers this subject. If you have friends who are learning to code, I'd appreciate a reference to my Java Basics book. If you want to get back to Java after a while, check out my Java 8 to 21 book.

Fail-Fast

Fail-fast systems are designed to immediately stop functioning upon encountering an unexpected condition. This immediate failure catches errors early and makes debugging more straightforward. For example, in the world of programming languages, Java embodies this approach by throwing a NullPointerException as soon as code dereferences a null reference; left uncaught, the exception halts the current thread and makes the error clear. This immediate response helps developers identify and address issues quickly, preventing them from becoming more serious.

By catching and stopping errors early, fail-fast systems reduce the risk of cascading failures, where one error leads to others. This makes it easier to contain and resolve issues before they spread through the system, preserving overall stability.

It is easy to write unit and integration tests for fail-fast systems. This advantage is even more pronounced when we need to understand a test failure: fail-fast systems usually point directly at the problem in the error stack trace.
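As a rough illustration of the fail-fast idea described above, here is a minimal Java sketch. The `Transfer` class and its fields are hypothetical and exist only for this example; the point is that invalid input is rejected in the constructor, so the stack trace points at the caller that supplied the bad value rather than at some later, unrelated line.

```java
import java.util.Objects;

// Hypothetical domain class, used only to illustrate fail-fast validation.
final class Transfer {
    private final String accountId;
    private final long amountCents;

    Transfer(String accountId, long amountCents) {
        // Fail fast: reject bad input at the boundary instead of letting it travel deeper.
        this.accountId = Objects.requireNonNull(accountId, "accountId must not be null");
        if (accountId.trim().isEmpty()) {
            throw new IllegalArgumentException("accountId must not be blank");
        }
        if (amountCents <= 0) {
            throw new IllegalArgumentException("amountCents must be positive, got " + amountCents);
        }
        this.amountCents = amountCents;
    }

    String accountId() {
        return accountId;
    }

    long amountCents() {
        return amountCents;
    }
}
```

A unit test for a class like this only needs to construct it with an out-of-range value and check that the expected exception is thrown, which is part of what makes fail-fast code straightforward to test.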
However, fail-fast systems carry their own risks, particularly in production environments:

Production disruptions: If a bug reaches production, it can cause immediate and significant disruptions, potentially impacting both system performance and the business’s operations.
Risk appetite: Fail-fast systems require a level of risk tolerance from both engineers and executives. They need to be prepared to handle and address failures quickly, often balancing this with potential business impacts.

Fail-Safe

Fail-safe systems take a different approach, aiming to recover and continue even in the face of unexpected conditions. This makes them particularly suited for uncertain or volatile environments.

Microservices are a prime example of fail-safe systems, embracing resiliency through their architecture. Circuit breakers, both physical and software-based, disconnect failing functionality to prevent cascading failures, helping the system continue operating.

Fail-safe systems ensure that systems can survive even harsh production environments, reducing the risk of catastrophic failure. This makes them particularly suited for mission-critical applications, such as in hardware devices or aerospace systems, where smooth recovery from errors is crucial.

However, fail-safe systems have downsides:

Hidden errors: By attempting to recover from errors, fail-safe systems can delay the detection of issues, making them harder to trace and potentially leading to more severe cascading failures.
Debugging challenges: This delayed nature of errors can complicate debugging, requiring more time and effort to find and resolve issues.

Choosing Between Fail-Fast and Fail-Safe

It's challenging to determine which approach is better, as both have their merits. Fail-fast systems offer immediate debugging, lower risk of cascading failures, and quicker detection and resolution of bugs. This helps catch and fix issues early, preventing them from spreading.
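To make the fail-safe side of this comparison more concrete, the software circuit breaker mentioned earlier can be sketched roughly as follows. This is a simplified, single-threaded illustration under stated assumptions, not a production implementation: the class name `SimpleCircuitBreaker`, its parameters, and the fallback mechanism are all hypothetical, and a real system would more likely rely on an established resilience library or on the infrastructure layer.

```java
import java.util.function.Supplier;

// Minimal, single-threaded circuit breaker sketch. All names are hypothetical.
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long to skip the dependency once open
    private int consecutiveFailures = 0;
    private long openedAt = 0L;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (open && System.currentTimeMillis() - openedAt < openMillis) {
            // Circuit is open: do not touch the failing dependency at all.
            return fallback.get();
        }
        try {
            // Circuit is closed, or the open period elapsed and this is a trial call.
            T result = action.get();
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            // A real system should report the exception here; swallowing it silently
            // is exactly the hidden-error risk of fail-safe designs.
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    boolean isOpen() {
        return open;
    }
}
```

With a threshold of two, a third call to a failing dependency is answered by the fallback without invoking the dependency again. That keeps the caller working, but it also shows the tradeoff: the caller never sees the original error unless it is reported somewhere.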
Fail-safe systems handle errors gracefully, making them better suited for mission-critical systems and volatile environments, where catastrophic failures can be devastating.

Balancing Both

To leverage the strengths of each approach, a balanced strategy can be effective:

Fail-fast for local services: When invoking local services like databases, fail-fast can catch errors early, preventing cascading failures.
Fail-safe for remote resources: When relying on remote resources, such as external web services, fail-safe can prevent disruptions from external failures.

A balanced approach also requires clear and consistent implementation throughout coding, reviews, tooling, and testing processes, ensuring it is integrated seamlessly. Fail-fast can integrate well with orchestration and observability. Effectively, this moves the fail-safe aspect to a different layer, the OPS layer, instead of the developer layer.

Consistent Layer Behavior

This is where things get interesting. It isn't about choosing between fail-safe and fail-fast. It's about choosing the right layer for them. For example, if an error is handled in a deep layer using a fail-safe approach, it won't be noticed. This might be OK, but if that error has an adverse impact (performance, garbage data, corruption, security, etc.), then we will have a problem later on and won't have a clue.

The right solution is to handle all errors in a single layer. In modern systems the top layer is the OPS layer, and it makes the most sense. It can report the error to the engineers who are most qualified to deal with it, and it can also provide immediate mitigation such as restarting a service, allocating additional resources, or reverting a version.

Retries Are Not Fail-Safe

Recently I was at a lecture where the speakers presented their updated cloud architecture. They chose to take a shortcut to microservices by using a framework that allows them to retry in the case of failure. Unfortunately, failure doesn't behave the way we would like.
You can't eliminate it completely through testing alone. Retry isn't fail-safe. In fact, it can mean catastrophe. They tested their system and "it works", even in production. But if a catastrophic situation does occur, their retry mechanism can act as a denial-of-service attack against their own servers. The number of ways in which ad-hoc architectures such as this can fail is mind-boggling.

This is especially important once we redefine failures.

Redefining Failure

Failures in software systems aren't just about crashes. A crash can be seen as a simple and immediate failure, but there are more complex issues to consider. In fact, crashes in the age of containers are probably the best failures: a system restarts seamlessly with barely an interruption.

Data Corruption

Data corruption is far more severe and insidious than a crash. It carries with it long-term consequences. Corrupted data can lead to security and reliability problems that are challenging to fix, requiring extensive reworking and potentially leaving data unrecoverable.

Cloud computing has led to defensive programming techniques, like circuit breakers and retries, emphasizing comprehensive testing and logging to catch and handle failures gracefully. In a way, this environment set us back in terms of quality. A fail-fast system at the data level could stop this from happening.

Addressing a bug goes beyond a simple fix. It requires understanding its root cause and preventing recurrence, extending into comprehensive logging, testing, and process improvements. This ensures that the bug is fully addressed, reducing the chances of it recurring.

Don't Fix the Bug

If it's a bug in production, you should probably revert rather than fix it in place. An instant revert of production should always be possible; if it isn't, that is something you should work on. Failures must be fully understood before a fix is undertaken.
In my own companies, I often skipped that step due to pressure; in a small startup, that is forgivable. In larger companies, we need to understand the root cause. A culture of debriefing for bugs and production issues is essential. The fix should also include process mitigation that prevents similar issues from reaching production.

Debugging Failure

Fail-fast systems are much easier to debug. They have inherently simpler architecture, and it is easier to pinpoint an issue to a specific area. It is crucial to throw exceptions even for minor violations (e.g., validations). This prevents the cascading types of bugs that prevail in loose systems.

This should be further enforced by unit tests that verify the limits we define and verify that the proper exceptions are thrown. Retries should be avoided in the code, as they make debugging exceptionally difficult; their proper place is in the OPS layer. To facilitate that further, timeouts should be short by default.

Avoiding Cascading Failure

Failure isn't something we can avoid, predict, or fully test against. The only thing we can do is soften the blow when a failure occurs. Often this "softening" is achieved by using long-running tests meant to replicate extreme conditions as much as possible, with the goal of finding our application's weak spots. This is rarely enough; robust systems need to revise these tests often, based on real production failures.

A great example of a fail-safe would be a cache of REST responses that lets us keep working even when a service is down. Unfortunately, this can lead to complex niche issues such as cache poisoning, or a situation in which a banned user still has access because of a cached response.

Hybrid in Production

Fail-safe is best applied only in production/staging and in the OPS layer. This reduces the number of changes between production and dev, which we want to be as similar as possible, yet it's still a change that can negatively impact production.
However, the benefits are tremendous, as observability can get a clear picture of system failures.

The discussion here is a bit colored by my more recent experience of building observable cloud architectures. However, the same principle applies to any type of software, whether embedded or in the cloud. In such cases we often choose to implement fail-safe in the code; when we do, I would suggest implementing it consistently and consciously in a specific layer.

There's also a special case of libraries/frameworks that often provide inconsistent and badly documented behaviors in these situations. I myself am guilty of such inconsistency in some of my work. It's an easy mistake to make.

Final Word

This is my last post in the theory of debugging series that's part of my book/course on debugging. We often think of debugging as the action we take when something fails, but it isn't. Debugging starts the moment we write the first line of code. We make decisions that will impact the debugging process as we code; often we're just unaware of these decisions until we get a failure.

I hope this post and series will help you write code that is prepared for the unknown. Debugging, by its nature, deals with the unexpected. Tests can't help. But as I illustrated in my previous posts, there are many simple practices we can undertake that would make it easier to prepare. This isn't a one-time process; it's an iterative process that requires re-evaluation of decisions made as we encounter failure.

By Shai Almog

CORE

Top Maintenance Experts

The Latest Maintenance Topics