Any Version of the Test Pyramid Is a Misconception – Don’t Use Any of Them
The ratio of test cases across test levels is an output: a consequence of your system and of the test design techniques selected based on risk analysis.
The test pyramid (testing pyramid, test automation pyramid) was originally published in Mike Cohn’s famous book Succeeding with Agile: Software Development Using Scrum (Cohn, 2010). The original figure shows three layers: unit tests at the base, service tests in the middle, and UI tests at the top.
The concept is simple: you should write more unit tests than service tests and only a few UI tests. The reason behind this is that:
- UI tests are slow.
- UI tests are brittle.
There have been many modifications of the original version:
- Adding manual tests at the top of the pyramid
- Renaming ‘service’ to ‘integration’
- Renaming ‘UI’ to ‘e2e’
- Adding more layers such as ‘API’ or ‘component’
An article that considers several alternatives is (Roth 2019). However, the main problem is that this concept:
- considers only some aspects of testing, and
- cannot account for the progress of test (design) automation.
In this article, we delve into a comprehensive approach to test design and test automation, focusing on the primary objective of testing, i.e., bug detection rather than execution speed optimization. Therefore, let's explore the effectiveness of test cases.
Mutation Testing
To measure the efficiency of test cases, we employ mutation testing, a technique that involves testing the tests themselves by introducing slight modifications to the original code, creating multiple mutants. A robust test dataset should be capable of distinguishing the original code from all carefully selected mutants. In mutation testing, we intentionally inject faults into the code to assess the reliability of our test design. A dependable test dataset must effectively "eliminate" all mutants. A test eliminates a mutant when there are discernible differences in behavior between the original code and the mutant. For instance, if the original code is y = x, and a mutant emerges as y = 2 * x, a test case like x = 0 fails to eliminate the mutant, whereas x = 1 succeeds in doing so.
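The y = x example above can be sketched directly in code (a minimal illustration; the function names are hypothetical):

```python
def original(x):
    """The original code: y = x."""
    return x

def mutant(x):
    """A mutant with a slight modification: y = 2 * x."""
    return 2 * x

def eliminates(test_input):
    """A test input eliminates the mutant if the two behaviors differ."""
    return original(test_input) != mutant(test_input)

print(eliminates(0))  # False: x = 0 cannot distinguish the two (both return 0)
print(eliminates(1))  # True: x = 1 eliminates the mutant (1 != 2)
```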
Unfortunately, the number of potential mutants is excessively high. However, a significant reduction can be achieved by concentrating on efficient first-order mutants. In the realm of first-order mutants, modifications are limited to a single location within the code. Conversely, second-order mutants involve alterations at two distinct locations, and during execution, both modifications come into play.
An investigation by Offutt demonstrated that when all first-order mutants are eliminated, only an exceptionally small fraction of second-order mutants remain unaddressed. This implies that if a test set is capable of eliminating all first-order mutants, it also eliminates second-order mutants with an efficacy of 99.94% to 99.99%, per Offutt's empirical study. It's essential to note that we consider only non-equivalent mutants, meaning that for each mutant, a test case eliminating it exists.
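The difference between first- and second-order mutants can be illustrated with a small sketch (the `total` function and its mutants are hypothetical):

```python
def total(quantity, unit_price, discount):
    """Hypothetical original code."""
    subtotal = quantity * unit_price
    return subtotal - discount

def total_first_order(quantity, unit_price, discount):
    """First-order mutant: a single change (* replaced by +)."""
    subtotal = quantity + unit_price
    return subtotal - discount

def total_second_order(quantity, unit_price, discount):
    """Second-order mutant: changes at two distinct locations,
    and both come into play during execution."""
    subtotal = quantity + unit_price
    return subtotal + discount

# The input (2, 3, 1) distinguishes the original (5) from the
# first-order mutant (4), and it also eliminates the second-order mutant (6).
print(total(2, 3, 1), total_first_order(2, 3, 1), total_second_order(2, 3, 1))
```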
We can further reduce the mutants, but first, we should consider the efficiency of the test cases. We consider test case efficiency with respect to the reduced mutation set. It’s only a very slight restriction as the imprecision is less than 0.1%.
- A test case is unreliable if it cannot find any bug in any mutants, i.e., it doesn’t eliminate any mutants.
- A test case is superfluous if there are other test cases that eliminate the same mutants.
- A test case T1 substitutes test case T2 if it eliminates all the mutants that T2 eliminates, and at least one more.
- A test case T1 is stronger than T2 if it eliminates more mutants than T2, but T1 doesn’t substitute T2.
- A test set is quasi-reliable if it eliminates all the mutants. A test set is reliable if it detects every defect in the code. A quasi-reliable test set is very close to a reliable test set.
- A test set is quasi-optimal if it is quasi-reliable and contains no more test cases than any other quasi-reliable test set.
- A test case is deterministic if it either passes or fails for all executions.
- A test case is non-deterministic if it passes for some executions and fails for others. Flaky tests are non-deterministic. Non-deterministic test cases should be improved to become deterministic or should be deleted.
Now, we can reduce the mutant set.
- A mutant is superfluous if no test case in a quasi-reliable test set eliminates only this mutant. For example, if only test case T1 eliminates this mutant, but T1 also eliminates another mutant, then we can remove this mutant.
- The reduced mutant set is called the optimal mutant set.
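The definitions above can be checked mechanically from a "kill matrix" that records which test case eliminates which mutant. A minimal sketch with made-up data (test and mutant names are hypothetical):

```python
# kills[t] = the set of mutants that test case t eliminates (hypothetical data)
kills = {
    "T1": {"M1", "M2"},
    "T2": {"M2"},       # everything T2 eliminates is also eliminated by T1
    "T3": set(),        # eliminates no mutant at all
    "T4": {"M3"},
}

def unreliable_tests(kills):
    """A test case is unreliable if it eliminates no mutants."""
    return {t for t, killed in kills.items() if not killed}

def superfluous_tests(kills):
    """A test case is superfluous if the other test cases together
    eliminate all the mutants it eliminates."""
    result = set()
    for t, killed in kills.items():
        others = set().union(*(m for s, m in kills.items() if s != t))
        if killed <= others:
            result.add(t)
    return result

def superfluous_mutants(kills):
    """A mutant is superfluous if no test case eliminates only that mutant."""
    all_mutants = set().union(*kills.values())
    uniquely_killed = {next(iter(m)) for m in kills.values() if len(m) == 1}
    return all_mutants - uniquely_killed

print(sorted(unreliable_tests(kills)))    # ['T3']
print(sorted(superfluous_tests(kills)))   # ['T2', 'T3']
print(sorted(superfluous_mutants(kills))) # ['M1']
```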
Ideal Mutant Set
In this manner, assuming we possessed flawless software, we could create an ideal mutant set from which we could derive a quasi-reliable test set. Crucially, the number of test cases required to detect nearly all defects (99.94% or more) would not exceed the count of optimal mutants. The author, with his co-author Attila Kovács, developed a website and a mutation framework with mutant sets that are close to optimal (i.e., there are only very few or zero superfluous mutants but no missing mutants). The following table shows the code size and the number of mutants in the near-optimal mutant sets (in the code, each parameter is on a different line):
| Program | Code size (LOC) | Number of reliable mutants |
|---|---|---|
| Pizza ordering | 142 | 30 |
| Tour competition | 141 | 24 |
| Extra holiday | 57 | 20 |
| Car rental | 49 | 15 |
| Grocery | 33 | 26 |
You can see that the code size and the number of mutants correlate, except for the Grocery app. We believe that the number of optimum mutants in the mutant sets is (close to) linear with the code, which means a reliable test set could be developed. Unfortunately, developing an optimal mutant set is difficult and time-consuming.
Don’t Use the Test Pyramid or its Alternatives
Why is this ‘artificial mutant creation’ important? We argue that during test automation, we should optimize the test design to find as many defects as we can while avoiding superfluous tests. Since the tests should eliminate the mutants, it is an attribute of the system whether a test eliminating a given mutant is a unit, an integration, or a system (e2e) test. We should optimize the test design for each level separately; that’s why there cannot be a pre-described shape of test automation.
You can argue that you can add more unit test cases since they are cheap to execute. However, tests have other cost factors as well. You have to design and code the tests, and the hard part is calculating the expected results, which can be time-consuming and error-prone. In creating e2e tests, the outputs (results) are merely checked rather than calculated, which is much easier (Forgacs and Kovacs, 2023). Another factor is maintenance. While maintaining e2e tests is cheap (see the book above), maintaining unit tests is unfortunately expensive (Ellims et al. 2006). One could still argue that if most defects can be found by unit testing, then the test pyramid is appropriate. However, this is not true: Runeson et al. (2006) showed in a case study that unit tests detected only 30-60% of the defects. In addition, Berling and Thelin (2004) showed that the ratio of bug detection at different test levels varies from program to program.
That’s why test design should be carried out for each level independently of the others. Don’t design fewer e2e test cases than needed, or your system’s quality will remain low and the bug-fixing costs will exceed the costs of the missing test design and execution. Don’t design more unit tests than needed, or your costs will increase significantly without improving quality.
But how can you decrease the number of unit tests? If you find a bug with a unit test and the bug could also have been detected by several other unit test cases, then you have included superfluous tests, and you should remove them.
Conclusion
We showed that any prescribed shape of test automation is faulty. The ratio of test cases across test levels is not an input but an output, determined by your system and the test design techniques selected based on risk analysis.
We can conclude other things as well. As the number of quasi-reliable test cases is probably linear with the code size, it’s enough to apply linear test design techniques. Thus, apply each-transition (0-switch) testing instead of n-switch testing with n > 0. Similarly, in most cases, avoid combinatorial techniques such as all-pairs testing, since a large part of the resulting tests will be superfluous. Instead, more efficient linear test design techniques should be developed (see Forgacs and Kovacs, 2023).
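To illustrate why each-transition testing stays linear, here is a sketch over a small hypothetical state machine: 0-switch coverage needs one test requirement per transition, while 1-switch coverage needs one per pair of consecutive transitions, which grows much faster in densely connected models.

```python
# Hypothetical state machine: state -> {event: next_state}
machine = {
    "idle":    {"start": "running"},
    "running": {"pause": "paused", "stop": "idle"},
    "paused":  {"resume": "running", "stop": "idle"},
}

def each_transition_tests(machine):
    """0-switch coverage: one requirement per individual transition."""
    return [(s, e, t) for s, row in machine.items() for e, t in row.items()]

def one_switch_tests(machine):
    """1-switch coverage: one requirement per pair of consecutive transitions."""
    return [(s, e, t, e2, t2)
            for s, e, t in each_transition_tests(machine)
            for e2, t2 in machine.get(t, {}).items()]

print(len(each_transition_tests(machine)))  # 5 requirements
print(len(one_switch_tests(machine)))       # 8 requirements
```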
There are cases when you still need to use a stronger test design. If a defect may cause more damage than the whole SDLC cost, then you should apply stronger methods. However, most systems do not fall into this category.