Good Tests

A dissection of what makes good tests - 25th July 2019

Introduction

What should a good test look like? I'd like to come up with a rationally argued set of properties for tests, without bias towards any programming ideology and without trying to fit some acronym. Hopefully what results is a genuine set of properties that is independent of whichever language, paradigm or framework is used. I feel that testing, and especially the quality of tests, is more important than it is usually given credit for. And I'm sorry to say that, in my experience, the quality of tests doesn't really measure up to expectations. This sorry state seems to come about either because we have working tests and that is all that is expected, or because we don't have a framework for describing what we expect from tests. Having an idea of what really good tests look like should help us to identify where our tests are lacking.

So let's talk about what makes a great set of tests.

But before we get started... There are many different types of tests. I'm primarily focused on unit testing here. I'm not saying these arguments don't apply to other forms of testing, but I'm not making any arguments in those contexts here. Journey testing warrants a whole discussion by itself, and you should see the website on Cascade in order to explore that. Performance testing, again, is something else and requires an entirely different approach, which I go into in another article.

The Purpose of Tests

I would like to start my article by focusing on the many uses of tests so as to add context and motivation for what I feel is important in testing. This way, later chapters should be supported by a sound foundation.

Well, okay... Let's get the obvious out of the way. It's nice to have tests that prove the code works in some sense before we deploy it into production. In this way, we are using tests as a validation step before releasing disaster on our users.

To Prove That The Code Works

Moving on from that, we naturally find ourselves on the subject of Continuous Integration. This is where developers follow CI discipline. They follow these rules:

There is a build server that constantly monitors the codebase for changes to the master version of the code. It then executes the build and notifies the developers of the results. If a bad change is introduced, the CI system produces a RED build result. This prompts the developer who broke the build to go and fix things, while all further changes are suspended. What all this amounts to is a build system that spends most of its time GREEN, meaning that the code base is in working condition and that any code within that build can be understood to work and be reasoned about. The results of a GREEN build can be deployed and used to demonstrate stories that have been developed and are complete. Tests are needed to determine a RED or a GREEN result, so they are a crucial part of Continuous Integration.

To Support Continuous Integration

Now we arrive at the topic of Test-Driven Development. This is where developers use tests as a core part of their software design and development process. They write the test as they write the code, rather than afterwards. This has the effect of making sure the code is fully tested. But to stop there is to understate the other effects...

Writing the test at the same time as the code also has an effect on the design of the code. The developer focuses on writing a test that tells a clear story. The design of the code fits the requirements as described in the test. This results in the code reflecting that story, which means code with better focus and responsibilities, better composition of objects and overall better object-oriented design (or functional design if preferred). By comparison, when code is developed without tests, the implementation grows organically within functions or objects as the developer sees fit. He could just add it where it is needed, the minimal approach, or he could adopt some grand design, using whatever patterns he feels are necessary. Then the tests are written, and they now have to match the existing design. Clarity suffers a great deal at this point, as the test is coerced into fitting the subject arbitrarily.

The second major benefit of having tests as a part of good development practice is that tests form a structure that supports refactoring. It is no longer a challenging and risky proposition to change the code. Before testing was a thing, I remember working in software development environments where it was viewed as good practice to not touch the 'old' code since 'we know it works'. Rather, if you had a new requirement to support, you put a check right up front and branched out to your new code. I kid you not! The risk of introducing unexpected bugs into production outweighed any considerations as to the quality or readability of the code. It's not hard to see why: career developers are much more likely to lose their jobs over an absolute, the system failing in production, than over something subjective, such as the health of the code base.

Something particularly elegant happens when you express your requirements explicitly in tests. You introduce new tests to meet new requirements, and as you develop, you run the tests. The tests keep a running tally of the requirements that you are meeting. You may find that while developing new tests, and changing the subject to suit, old tests start to fail. This immediately tells you what you have broken. You might find that you cannot have all your tests pass at the same time, because they are mutually incompatible. This means you have a conflict in requirements. You are now getting feedback on the requirements analysis process, and not just on development. You learn through code.

As A Technique For Developing Good Code

I find tests invaluable in learning what is intended in code I have never seen before. It is widely assumed that developers can 'just read the code' to know what is going on. While that is true, the reality of this condition is far less than optimal for the following reasons.

The first reason is that, in most cases, we are dealing with some bug. By reading the code, we can tell what it is doing, but not what it is intended to do. I'm sure the reporter of the bug will have clear ideas about what the code should do, but that still leaves you very far from a solution which you will appreciate after I've included my other reasons.

The second reason is that the code needs to support many requirements, which makes it necessarily complex. Each test only needs to test one or a limited set of requirements, so the test should be simpler to understand.

My final reason on this point is that, in a test, the developer has an opportunity to add all the context that he is in possession of while he is doing the development. When he writes the actual code, he can try for good names in the code base, but he can't tell a story for others to learn from.

So 'just reading the code' is a very poor alternative to having a good set of tests that describe requirements in clear and simple terms.

To Show How The System Is Intended To Be Used

Properties of Tests

These properties come in groups that are naturally discussed together.

Assertive, Valuable and Explicit

The most basic thing about a test is that it should contain assertions. These are true or false statements about the end state of the system. A test with no assertions is not a test. Tests should be Assertive.

So, okay, we have assertions. Are we good? Well, that depends on whether the assertions are of any value. Tests should be Valuable. This property of tests also places a theoretical limit on the number of tests you will have. It is not always true that the more tests you have, the better. More on that later. So my second property of good tests is that they are Valuable.

If having fewer tests is valuable in some way that I haven't gone into yet, then why not assert as many things as we can, explicitly and implicitly, in as few tests as possible?

What do I mean by an 'implicit' assertion? This is the kind of condition that must be true given some other condition that is true. For example, I wish to test the functionality of a car. I write a test that asserts that the car can be driven. I can now infer that someone can sit in the driver's seat, since the car cannot be driven otherwise.

Implicit assertions are bad. I want a relationship between my requirements and my tests. As I develop code, the tests are giving me a catalogue of requirements that I have met. If a test fails, I want to know which part of the system is failing. Taking the car example again: if the car fails the test, then it might be the engine, or it might be that the driver cannot sit in the driver's seat for some reason. I don't know. The tests are not specific enough. I have lost specific feedback. Also, an implicit assertion does not tell the story. It therefore loses clarity. So tests should be Explicit.
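
The car example can be sketched in Python as follows. This is only an illustration: the Car class, its fields and the test names are all hypothetical, invented here to show the idea. Each requirement gets its own explicit test, so a failure points at the seat or at the engine rather than at 'the car'.

```python
from dataclasses import dataclass


@dataclass
class Car:
    """Hypothetical car model, invented purely for this illustration."""
    seat_accessible: bool = True
    engine_starts: bool = True

    def can_seat_driver(self) -> bool:
        return self.seat_accessible

    def can_drive(self) -> bool:
        # Driving needs both a seated driver and a running engine.
        return self.can_seat_driver() and self.engine_starts


def test_driver_can_sit_in_driver_seat():
    # Explicit: fails only when the seat requirement is broken.
    assert Car().can_seat_driver()


def test_engine_starts():
    # Explicit: fails only when the engine requirement is broken.
    assert Car().engine_starts


def test_car_can_be_driven():
    # On its own, this test only implies the two requirements above.
    assert Car().can_drive()
```

If only the last test existed, a blocked seat and a dead engine would produce exactly the same failure.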

Closed, Deterministic and Repeatable

Closed? What do I mean by that? What I mean is that the test must not receive input or be affected by implicit sources.
An implicit source is a source of input that is not described in the test. I could have said 'pure', to coincide with functional coding ideas, but the word 'pure' does not really communicate the point. The word Closed clearly denotes that the test is not 'open' in some sense, so it is the term I prefer. Every test should start by describing all its inputs.
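
As a rough sketch of the difference, assuming a hypothetical greeting function and a made-up APP_USER environment variable: the first version below takes input from a source that no test describes, while the second receives everything through its parameters, so the test can state all of its inputs up front.

```python
import os


def greeting_implicit() -> str:
    # Implicit source: the result depends on the process environment,
    # which a test calling this function never mentions.
    # APP_USER is a made-up variable name used only for illustration.
    return f"Hello, {os.environ.get('APP_USER', 'stranger')}"


def greeting(user: str) -> str:
    # Closed: the only input is the parameter.
    return f"Hello, {user}"


def test_greeting_names_the_user():
    user = "Ada"  # every input to the test is declared here
    assert greeting(user) == "Hello, Ada"
```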

Which leads me to Determinism. A deterministic function is a function that will always produce the same output given the same input. So, in particular, certain types of tests are very problematic: tests involving random numbers, for a start, and tests involving dates. Tests that exercise multi-threaded behaviour and are subject to race conditions are also not always deterministic. Tests must be Deterministic. At this point, you will realise that Closed and Deterministic tests are complete in a certain sense. They describe explicitly all inputs and outputs.
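
One common way to tame dates and random numbers is to pass them in as explicit inputs. The sketch below assumes two hypothetical functions, is_expired and pick_winner, invented for this illustration: the current date is a parameter rather than a call to the system clock, and the random source is a seeded random.Random instance supplied by the test.

```python
import random
from datetime import date


def is_expired(expiry: date, today: date) -> bool:
    # 'today' is passed in rather than read from the system clock.
    return today > expiry


def pick_winner(entrants, rng: random.Random):
    # The random source is an explicit input, so a test can seed it.
    return rng.choice(entrants)


def test_voucher_expires_the_day_after_its_expiry_date():
    expiry = date(2019, 7, 25)
    assert not is_expired(expiry, today=date(2019, 7, 25))
    assert is_expired(expiry, today=date(2019, 7, 26))


def test_same_seed_picks_the_same_winner():
    entrants = ["ann", "bob", "cy"]
    first = pick_winner(entrants, random.Random(42))
    second = pick_winner(entrants, random.Random(42))
    assert first == second
```

Race conditions are harder to deal with and are not covered by this sketch.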

In conjunction with the Closed property, the Deterministic property sets the stage for Repeatable tests. This means that the test will pass now, and will pass in the future. Tests don't break for arbitrary reasons, and when a test fails, something has actually gone wrong. So my tests must be Repeatable.

Complete

Completeness is a property of your entire set of tests. After executing all your tests, you want to be able to state that the system is good. This could mean deploying the system into production, or merely that others can accept your version of the code so that they can make changes of their own.

There are two dimensions to consider when ascertaining the level of completeness you have achieved.

The first dimension is in having test coverage that is Broad. What we are considering here is whether the tests cover all the code that is going to be used in production. It's of little value to be able to claim that only 80% of the code is known to work. This is where code coverage metrics are useful, but before I get into that, let's clarify exactly what we should test. What is code? If I write a configuration system that allows me to declare behaviour without actually writing procedural code, do I need to write tests? Well, that stuff can still fail in production, so yes, you test it. If my code uses a third party library, or an external system, do I test it? Yes.

I have to add a caveat here, and an explanation. I started the article by stating that I was focusing on unit testing, but I have just contradicted that statement. It's not possible to test everything with unit tests alone, so other forms of testing are used at this point. And the caveat, as always, is that the tests should be Valuable, and Timely, which I will get to in the next chapter. For now though, back to the discussion on code coverage. I was saying how code coverage was useful, but...

Code coverage has some major drawbacks. Code coverage only measures code that is under active development. So all artefacts that are not actually code are not measured, and this includes configuration files, data that drives logic, third party libraries, and off the shelf applications that interface with your system.

If we restrict ourselves to considering the value of code coverage over just the code being developed, there are still problems.
It is possible to have very good code coverage of your code base and still have worthless tests. All I have to do is remove all the assertions in all the tests to illustrate this point. There is a qualitative aspect to the tests that is missed in this measure.

Another aspect to consider is the false sense of security the measure engenders. I came across a quote a while ago that seems apt.
It is known as Goodhart's law, and it goes: when a measure becomes a target, it ceases to be a good measure. Code coverage offers a quantitative measure that is cheap to produce, consistent, objective and inarguable. Because of all these properties, code coverage metrics are tempting to accept as the control over code quality. Having measured quality, it becomes very tempting to move on. True quality cannot be measured in this way.

What is needed is for tests to be Comprehensive. The tests need to test all aspects of the system under test in an intelligent, deep and thoughtful way. There needs to be a cognitive assessment as to whether the tests are testing the requirements supported by the subject of the test. This is the role of Peer Review.

How should this peer review be conducted? The biggest weakness in peer review is just how subjective it is. It is necessary at this point for the reviewer to restrain himself in order to produce a positive result. It is very tempting for the reviewer to start imposing all of his own biases onto the tests.

Each team of developers that works on a system should develop a set of patterns that they use to solve common problems.
The patterns should be adaptive and malleable. They should change based on a consensus within the team. They cannot be imposed, because this imposition will solidify the patterns, which means they stop adapting to current conditions. You see these patterns in tests just the same as in the code itself. What should a test look like? This is driven in large part by these patterns.

So now we have an outsider who is reviewing the test. He comes from a team with its own set of patterns. He will be most comfortable with those patterns, and will start to judge the weaknesses in the patterns used within the test. This is not constructive, as it's an imposition. It solidifies the codebase and prevents the patterns from evolving. What should happen is that the reviewer observes some patterns that are better than the patterns he sees in his own team. He will take those ideas back to his team and improve that way. And because the review process is reciprocal, the current team will review other teams' tests, and will grow the patterns used in their own testing.

So how should the review be conducted? It's necessary to evaluate the tests without prescribing solutions unless invited. This document should really help, since it makes it possible to identify that, for example, the tests are not Complete. Are the requirements Explicitly tested? Is error handling covered? Are there tests that cover interfaces to other systems?

Timely

I have worked in environments where the time for all tests to execute is measured in hours. Imagine it: sixty or so developers all working off the same code base and all trying to feed changes through the same CI system, and each build is taking 2 hours. That means 4 builds per day.

The purpose of having a RED build is to identify the set of changes that have broken the build. These changes are limited to all the changes that have been introduced since the build was last GREEN. Naturally, the tighter the window of changes is, the more specific the feedback. The more specific the feedback, the easier it is to find the root cause of the broken build. In an ideal situation you would have one build per change. But in our earlier scenario we only get 4 builds per day. Three weeks to run 60 builds! Each developer gets one chance every three weeks to merge. Clearly there is a scalability problem here.
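
The arithmetic behind those figures can be spelled out in a few lines of Python. The 8-hour working day and 5-day working week are assumptions made here to reproduce the numbers; the article only gives the build time and the team size.

```python
HOURS_PER_WORKING_DAY = 8   # assumption, not stated in the article
WORKING_DAYS_PER_WEEK = 5   # assumption, not stated in the article
BUILD_HOURS = 2             # from the scenario above
DEVELOPERS = 60             # from the scenario above

builds_per_day = HOURS_PER_WORKING_DAY // BUILD_HOURS
# One build per change, and one change per developer:
days_for_one_build_each = DEVELOPERS / builds_per_day
weeks_for_one_build_each = days_for_one_build_each / WORKING_DAYS_PER_WEEK

print(builds_per_day, days_for_one_build_each, weeks_for_one_build_each)
# prints: 4 15.0 3.0
```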

On the other hand, why practise CI discipline? It seems to be a development system that is fundamentally flawed. Well, let's suppose that we are not practising CI discipline. And some bad changes have been introduced:

The first point I wish to make is that committing code to master is the primary method by which developers communicate code to one another. It is also the primary mechanism for managing diverging code bases, as sixty odd developers all decide to pull the code in different directions to support their changes. The code tends to diverge quickly, and none of it can go into production until it's merged into master. Management's response to this scenario is typically that developers must communicate more. Sorry, but no, this isn't practical. So diverging code is not delivered code, and to counter this the developers regularly take changes from master in order to minimize how much divergence there is. This keeps the problem of merging small.

There is now a bug. The developers all start pulling broken code from master. And they start finding their tests are failing and they all start solving those issues. The same issues are then solved by ten different developers in possibly different ways that all need to be merged together. And in the meantime, they have all been delayed. The smart ones will identify who is fixing the issue, and then down-tools until the fix is in. This scenario isn't at all hypothetical. I've encountered this as an ongoing problem with development in my career.

The second point to consider is that if delivering broken code is permissible, then the developers who deliver these changes can go off and start working on other problems. They are no longer on hand to address the issues they have created and are most familiar with. The person who introduces the break is the most knowledgeable about the business case for the change, and is also the most familiar with all the code changes that have been introduced. Therefore he is the most efficient person to fix the issue. So our problem is in identifying which change caused the issue, and then in identifying the best person to fix the issue. Which brings me onto my third point.

If you are not using CI, or if CI is RED, the build stops giving you any information. Any changes committed to master could be introducing bugs of their own. The more of this that happens, the larger the set of changes that might have problems.
And the situation only gets worse as time goes on. While all of this is going on, you no longer have the capability to deploy the most recent changes into the demo environments. That means stories no longer get signed off, and business analysts can no longer analyse the current behaviour of the system. They lose touch with the system, and the quality of stories coming through suffers.

And finally, we have this massive build headache to deal with that is un-planned, un-estimated, and un-sized. Delivery dates start to slip. A team is set aside to focus on just getting the build GREEN, and consequently they are no longer available for new development. And they hate their lives, and some of them will start to move on. Who wants to deal with broken builds all day?

So tests must be Timely, in order to keep the window of changes tested in CI as tight as possible, in order to keep builds GREEN. This keeps all the developers working on working code, keeps them delivering changes into master, and keeps the demo environments up to date, and allows stories to be signed off. This is the reason I wrote an entire testing framework for managing Journey Tests, called Cascade. Have a look.

Which brings us to a point of controversy. What is more important: that CI builds be fast, or accurate? This question seems to have a very simple answer: accurate. Accuracy is then used to argue that everything must be tested as soon as possible and be included as exhaustively as possible in the immediate CI pipeline. This accuracy increases build times. And the view becomes that if the builds have to take a while to run, then so be it.

This is not acceptable. I am not proposing that the system is tested with any less rigour than is necessary. What I advocate is that there is a system of progression. Earlier stages of the CI pipeline focus on building specific components. Those components belong to specific teams, who use that pipeline to communicate effectively. Builds sacrifice accuracy for timeliness at this stage in order to achieve efficient development. As successful build artefacts are produced, they are combined with the output of other pipelines and tested in more comprehensive ways. Databases, deployment mechanisms, and security infrastructure are incorporated in a progressive manner, which builds the accuracy of testing up until you have a build system that is both very accurate and still produces a build environment that supports the developers. I go into all of this at great length in my article on Scaling CI.

Pedagogical

Finally, what I consider most important: that the tests be Pedagogical. That is, that they help others to learn about the system, and that they describe in an understandable way what the system is about.

This point is terribly underestimated. It's the old problem again. The tests pass, so they are finished, right? The general attitude seems to be that the tests need to pass, and if they do, they are sufficient.

Well, let me put it to you this way. When tests break, you need to understand two things. The first is how the code works. The second is how it's meant to work. After these facts have been established, you can ascertain whether the production code is the problem, or the test is the problem.

So how do you go about learning how the system is supposed to work? Well you could read the documentation. If the documentation of your system is a thousand page novel, then which page will relate to the currently broken test? Perhaps you need to read the whole novel, understand the system in its entirety, even as some of those pages are still being written... And then after reading the whole novel, you find that some sections are out of date.

Okay, what about understanding the code as it is currently written? The code must support all requirements, while your tests can focus on specific requirements. The code is therefore a lot more complex, while tests are a whole lot easier to understand.
Looking at the code shows you what will happen when it is used. But how is it used? The interactions between the methods that are called in the code base are not explicitly expressed within the code, but rather in the tests. If you are trying to understand the code by only looking at the code, you have to reverse-engineer this part of the picture. This makes determining what is actually happening challenging. And now you are trying to tell how the system should behave by reading code which could be written incorrectly.

So reading the documentation is hopeful at best, and reading the code is only seeing half the picture and terribly complex. So that leaves... reading the test. This is the moment when there is a bug. Things are falling apart, and the users and product owners are looking at you to get things sorted; the sooner the better.

The Pedagogical property implies not only that you can learn about the business requirements of the system in simple and explicit terms, but also, necessarily, that tests be understandable. And that readability will help you tremendously when you are trying to fix tests and code, neither of which you have seen before.
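
As a small illustration of the difference readability makes, consider a hypothetical shipping_cost function, with a business rule and thresholds that are invented for this sketch. The first test passes but teaches the reader nothing; the other two name the requirement they cover and lay out their inputs, so someone who has never seen the code can learn the rule from the tests alone.

```python
def shipping_cost(order_total: float) -> float:
    """Hypothetical rule: orders of 50.00 or more ship free, else 4.99."""
    return 0.0 if order_total >= 50.0 else 4.99


def test_1():
    # Passes, but the name and layout say nothing about the requirement.
    assert shipping_cost(50.0) == 0.0 and shipping_cost(49.99) == 4.99


def test_orders_at_the_free_shipping_threshold_ship_free():
    order_total = 50.0
    assert shipping_cost(order_total) == 0.0


def test_orders_below_the_threshold_pay_the_flat_rate():
    order_total = 49.99
    assert shipping_cost(order_total) == 4.99
```

When one of the named tests fails, its name already says which requirement has been broken.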

In Conclusion

In my opinion, testing makes the difference between Programmers and Developers. That is to say, between someone fresh out of university and someone who has developed and lived with enterprise software for a few years. It's also more important than language experience, or opinions on ideology. I feel that a sound commitment to testing should be the first thing required from any developer, because it speaks directly to quality. With sound tests in place, anyone can program effectively in any language. With sound tests in place, the code can be refactored often and in dramatic fashion, so that any programming paradigm can be adopted.

And tests are a core part of a functioning continuous integration system. CI is the cornerstone of having a successful group of developers coding on the same codebase at the same time. Therefore having a CI system is crucial in having a successful project. The ability to make changes safely and prove those changes and then deploy those changes into production is a core survival attribute for any business that uses software to support its core business.

Anyway, hope you enjoyed reading.