How To Make Test Reliability Engineering By Design


Test automation can feel like a never-ending stream of problems. We had to defend the initiative, implement the tests, and get the team to use them; now we have to keep them reliable so they stay useful.

Test Reliability Engineering is the practice of keeping automated tests valuable over time, so that we avoid yet another test campaign that becomes too slow and ends up bypassed. Test campaigns tend to become “too big and fail”, hence the importance of engineering them properly.

This article presents various techniques to improve test reliability, without being exhaustive in any category. We start with what we can do before any test is implemented, then move on to implementation practices.

Automate the right thing

We need clear objectives before implementing our automated tests to support the business. We usually refer to the value creation for our users in the context of digital experiences by fulfilling unmet needs.

Discovery techniques, questioning, and reformulation all have their place here. We need to balance data and intuition, working through facts, interviews, and observations. Our automation must become valuable.

Automating the right tests is about value, not pyramids.

One approach is to rate each key use case on attributes such as value, stability, change rate, complexity, and automation capability. Combining these ratings in a matrix gives us a prioritized list of scenarios to automate.
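The matrix described above can be sketched in a few lines. This is only an illustration: the weights, the 1-to-5 rating scale, and the use-case names are assumptions made for the example, not values from any particular tool or study.

```python
from dataclasses import dataclass

# Hypothetical weights: attributes that favor automation are positive,
# attributes that make a test costly or fragile are negative.
WEIGHTS = {
    "value": 3,
    "stability": 2,
    "automation_capability": 2,
    "change_rate": -1,
    "complexity": -1,
}


@dataclass
class UseCase:
    name: str
    value: int                  # each attribute rated 1 (low) to 5 (high)
    stability: int
    change_rate: int
    complexity: int
    automation_capability: int

    def score(self) -> int:
        # Weighted sum of all rated attributes.
        return sum(weight * getattr(self, attr) for attr, weight in WEIGHTS.items())


def prioritize(use_cases):
    """Return the use cases ordered from highest to lowest score."""
    return sorted(use_cases, key=lambda uc: uc.score(), reverse=True)


# Illustrative ratings only.
use_cases = [
    UseCase("promo banner", value=3, stability=1, change_rate=5,
            complexity=2, automation_capability=2),
    UseCase("checkout", value=5, stability=4, change_rate=2,
            complexity=3, automation_capability=4),
    UseCase("newsletter signup", value=2, stability=5, change_rate=1,
            complexity=1, automation_capability=5),
]

ranked = prioritize(use_cases)
for uc in ranked:
    print(f"{uc.score():>3}  {uc.name}")
```

The weights are where the team's judgment goes; changing them is a cheap way to test how sensitive the ranking is to our assumptions.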

From that point, we perform an additional round of triage, similar to backlog refinement, using the Pareto principle.

Apply the Pareto principle

The Pareto principle is well known for its broad applicability and its “80/20” shorthand. Quality assurance and test automation are no exception; the principle can even be a powerful way to improve the reliability of our tests.

Our test suite can quickly grow, creating design, execution, and maintenance issues. The variety of environments, devices, and countries to cover widens the scope of our tests; we can quickly end up with an extensive test suite that is too slow to provide value.

Figure 1: An example of test distribution with Pareto in QA 

The Pareto principle helps us step back and ask which tests and contexts are really necessary. Using matrices enables us to perform various analyses, such as finding the 20% most important tests, the unstable tests, or the slow ones.
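As a minimal sketch of one such analysis, the function below finds the smallest group of slowest tests that accounts for a given share of the total execution time. The test names and durations are invented for the example; the same approach would work with failure counts to find the unstable tests.

```python
def pareto_head(durations, share=0.8):
    """Return the slowest tests that together cover `share` of total duration."""
    total = sum(durations.values())
    head, running = [], 0.0
    # Walk the tests from slowest to fastest until the share is covered.
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        head.append(name)
        running += seconds
    return head


# Hypothetical durations in seconds, for illustration only.
durations = {
    "checkout_e2e": 120,
    "search_e2e": 60,
    "login_ui": 10,
    "profile_api": 5,
    "health_api": 5,
}

slowest = pareto_head(durations)
print(slowest)
```

In this invented data set, two of the five tests account for most of the run time, which is where optimization or pruning effort would pay off first.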

That way, we increase our test reliability by limiting tests to the essential ones and improving our performance. Pareto is an excellent way to focus on the value that we need to reach early, in a “just-enough” way.

Just-enough and least-effort automated test

Our test automation initiative has to start but also end somewhere. Matrices tend to create the desire to check all the boxes to feel progress and achievement; we need to change this perspective using the “just-enough” principle.

Our automation is fundamentally an emergent process, which is a good reason to start small and iterate with the Pareto principle. With the “right automated tests” identified, we implement them with a focus on end-to-end integration, measuring the value actually created.

Figure 2: The Just Barely Good Enough for Effectiveness 

Each iteration must deliver sufficient value, usually measured by our confidence to deliver changes while maintaining a successful user experience. We have to assess the additional value each new test creates, as it tends to decline as the suite grows.

In addition, the “just-enough” principle applies to the choice of test techniques. We can refer to the test pyramid to select the lowest test layer that fulfills our verification requirements; functional UI tests are not the only answer.

Once we are clear on our core valuable automated tests, we can move on to automating “things right”.

Have a good test design

Investing in design is fundamental to improving our implementation and its reliability afterward. Building software, like building a house, starts with a good architecture; so do automated tests.

Test design is the art of balancing usability, coupling, complexity, and effectiveness, among others. Because the software keeps evolving, the design never reaches a final state the way a work of art does, hence the need for continuous test design.

Design and its maintenance activities must be included in every sprint to make sure they happen. The best measure of our design's value lies first in how reliable and understandable its implementation is, and ultimately in how efficiently the tests can be maintained.

Figure 3: An example of test design principles 

A test design capability does not emerge overnight; organizations must invest in developing those skills in their teams, whatever the profile of the people implementing the automated tests. We can refer to some common patterns, such as the page object pattern or modular testing, to structure our work.
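To make the page object pattern concrete, here is a minimal sketch. The `FakeDriver` class is a hypothetical stand-in for a real browser driver so that the example runs on its own, and the locator strings are invented for illustration.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; records what the test did."""

    def __init__(self):
        self.fields = {}
        self.clicked = []

    def type(self, locator, text):
        self.fields[locator] = text

    def click(self, locator):
        self.clicked.append(locator)


class LoginPage:
    """Page object: the only place that knows the login page's locators."""

    USERNAME = "id=username"      # illustrative locators
    PASSWORD = "id=password"
    SUBMIT = "id=login-button"

    def __init__(self, driver):
        self.driver = driver

    def log_in(self, username, password):
        self.driver.type(self.USERNAME, username)
        self.driver.type(self.PASSWORD, password)
        self.driver.click(self.SUBMIT)


def test_user_can_log_in():
    # The test reads as user intent; it contains no locators of its own.
    driver = FakeDriver()
    LoginPage(driver).log_in("alice", "s3cret")
    assert driver.fields[LoginPage.USERNAME] == "alice"
    assert driver.clicked == [LoginPage.SUBMIT]


test_user_can_log_in()
```

When the login form changes, only `LoginPage` needs an update; every test that calls `log_in` keeps working unchanged, which is the maintenance benefit the pattern aims for.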

While design is more about structure, the internals of our automated tests are also important.

Improve implementation and execution reliability

The devil is in the details. A poor automated test implementation shows through various symptoms, including unreliable results. We can act at two levels: implementation and execution.

The actual test implementation must serve both the people who work on the tests and those who use them. Usability, readability, and documentation are therefore critical. Concretely, putting those criteria before purely technical optimizations will help us in the long run. Your tests must survive team turnover, be fast to debug, and be easy to understand.

Figure 4: A Cerberus Testing Dashboard example

On the execution side, we must first keep the user, customer, and value in mind. It is tempting to focus on getting all tests green, when the real objective is to verify value creation; a few orange and red results can be a better signal.

The reliability of our test execution can then be addressed at various levels. We should avoid fixed waits, replacing them with a “wait until” condition combined with a timeout. The timeout must make sense for the user, rather than being a purely technical default of 30 seconds. We can also use retries on specific use cases to avoid alert fatigue, as long as the retried failures stay visible so that we still deal with the root causes of the instability.
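The two ideas above, waiting on a condition with a timeout and retrying without hiding failures, can be sketched as follows. The function names, default values, and reporting callback are assumptions made for this illustration; most test frameworks ship their own equivalents.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it is truthy, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


def run_with_retries(test, attempts=2, on_retry=print):
    """Run `test`, retrying on failure, and report each retry via `on_retry`."""
    for attempt in range(1, attempts + 1):
        try:
            return test()
        except (AssertionError, TimeoutError) as error:
            if attempt == attempts:
                raise  # out of attempts: let the failure surface
            # Keep the instability visible instead of silently swallowing it.
            on_retry(f"attempt {attempt} failed: {error!r}")


# Usage: a condition that becomes true on its third check.
checks = []

def page_is_ready():
    checks.append(1)
    return len(checks) >= 3

# The timeout here would be chosen from what a user tolerates, not a technical default.
wait_until(page_is_ready, timeout=1.0, interval=0.01)
```

Unlike a fixed sleep, `wait_until` returns as soon as the condition holds, so the test is only as slow as the application. Routing retries through `on_retry` keeps a record of flaky runs that can feed a root-cause analysis later.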

Another implementation detail is essential: our locators.

Collaborate on a locator strategy

Locators are a common pain point in test automation; they accumulate pitfalls and anti-patterns that stem from testability requirements missed in the first place. A locator strategy makes the difference for our test reliability.

The typical case is to start test automation late in the process, only to discover that the software framework does not expose locators at all, or generates new ones on every execution. In that case, adding locators can be investigated if the team has the capacity; otherwise, image recognition serves as a backup plan.

Another case where testability was not discussed is ending up with unstable locators. This is usually a side effect of the software framework in use. The team can decide to use the most stable locator available, combine locators with image recognition, or implement a fixed locator strategy.

Figure 5: The various locators available in Cerberus Testing 

A fixed locator strategy is one way to keep identifiers stable as the application changes. This strategy can make sense for particular cases and requirements, not for every test case. It usually makes sense for non-regression tests on core systems, while exploratory testing, comparison tests, and other techniques remain useful.
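The sketch below illustrates why a fixed locator is more stable than a generated one. It assumes the team agrees to add a dedicated `data-testid` attribute to key elements, which is one possible convention and not something this article's tooling prescribes; the HTML snippets and helper names are invented for the example.

```python
from html.parser import HTMLParser


class _AttributeFinder(HTMLParser):
    """Collect the tags whose given attribute has the expected value."""

    def __init__(self, attribute, value):
        super().__init__()
        self.attribute = attribute
        self.value = value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get(self.attribute) == self.value:
            self.matches.append(tag)


def find_by_attribute(html, attribute, value):
    finder = _AttributeFinder(attribute, value)
    finder.feed(html)
    return finder.matches


# Two hypothetical builds: the framework regenerates `id`, the team fixes `data-testid`.
build_1 = '<form><button id="btn-4f2a" data-testid="checkout-submit">Pay</button></form>'
build_2 = '<form><button id="btn-9c1e" data-testid="checkout-submit">Pay</button></form>'

# The generated id recorded against build 1 no longer matches in build 2...
print(find_by_attribute(build_2, "id", "btn-4f2a"))
# ...while the fixed locator finds the button in both builds.
print(find_by_attribute(build_1, "data-testid", "checkout-submit"))
print(find_by_attribute(build_2, "data-testid", "checkout-submit"))
```

The cost of this approach is that developers have to add and maintain the attribute, which is why it tends to be reserved for the core scenarios rather than applied everywhere.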

The locator strategy is one example of testability requirements that must be identified upfront by the team.

Shift-left for Test Reliability Engineering

These practices take place throughout the software development lifecycle. As with bugs, the earlier testability requirements are identified, the more easily they can be built into the implementation. Shift-left is therefore vital to improve the reliability of our automated tests.

Our goal is to balance value, quality, and speed in our automated test implementation effort. Test reliability is as important as the reliability of our experiences. We can probably talk about “Test Reliability Engineering”.

We hope AI will help us along the way; it will be a long, incremental, and iterative process. We are working on a self-healing feature to improve the locator strategy for changing applications.

Stay tuned for this Cerberus Testing increment.
