Observations on Code Coverage Tools and Unit Tests
At my current client, there’s a drive to put more unit tests on the Java code. It’s been a long hard process as few of the developers do strict, or even just “mostly,” test-driven development. As an attempt to try and drive this home, they’ve also tried to implement code coverage (specifically EMMA) to try to encourage unit test coverage. Unfortunately this doesn’t improve the quality of the tests, or even make the coverage meaningful.
The big problem isn’t that code coverage reporting is wrong. The problem is the false sense of security that having an acceptable number gives. Even when that number is 100%, just because everything is “covered,” it doesn’t mean that it will do any good.
Of course, unit tests in Java code are “programs” that attempt to call the production or base code methods to ensure they “work.” Most popular (by non-scientific polling of frameworks in use) is probably JUnit and its just-likes. A more specific idea is to write a test that calls a target method and ensure that the method does what is expected.
There are quite a few reasons for doing this. And quite a few arguments against it.
My favorite reason for supporting unit tests is that it helps defend the code from errant changes. It happens often that after something is written, something changes and causes it to be rewritten. Tests can help ensure that the re-writer has an awareness of other uses of the method, hopefully leading to either a more compatible change wherein the other uses are allowed to remain unchanged, or a more comprehensive change wherein the other uses are also updated.
My least favorite argument against unit testing is that it takes time, and therein costs more. I contend that a proper test suite will make the initial effort worth the investment just in avoiding “pulling the tarp” too tightly. This is when one developer makes a change to something to suit their needs, but in the process breaks some other bit of functionality that used the same bit they just changed.
In fewer shops than should probably have them, an agile methodology called “Test-Driven Development” (TDD) is used to try to ensure test coverage. The idea here is that production code doesn’t get written until a test is written against it and fails. There are many degrees of strictness that go around this.
Assuming there’s already been some design, we know there are reasons for the development. Sometimes this comes out of development in progress that we need to continue towards the end of the software. Sometimes this is new development we need to begin to start the software.
In a very strict TDD environment, and using JUnit as the test suite behind it, a test class is created that will be the first tests of our new class. The first test we create will fail compilation when the test method uses the new class, as it doesn’t yet exist. This gives reason to create the class. The test is expanded to attempt to call the desired method on the new class, which will fail as the method doesn’t exist. This gives reason to create the method. This continues through the method parameters and return types; test is expanded, fails, class is updated.
A good test writer will evaluate the tested classes and methods by having multiple tests, varying inputs, and evaluating multiple results. This starts to get more complicated (and outside the scope of this discussion) as the methods affect or use external resources, such as databases, files, or services, to name just a few.
Following TDD, a test will be written for an expected input, and the returned values or external resources will be evaluated to ensure the expected events and results occur. Each step along the way, each failed test, will justify the software.
One problem with a some strict TDD is that any failure is justification for writing code. This includes when base code requires something, such as handling an Exception. In this case, the base code itself is responsible for justifying the new code to catch the Exception or declare the throws clause. For a variety of reasons this is allowed, the least of which is that it’s hard to force an Exception to be thrown in a lot more situations than when it’s easy.
Test During Development
More common than Test-Driven Development, and in part as a loose extension of TDD, is the process of writing tests while, or shortly after, writing base code. Even during TDD this will occur. A method is completed, at least to some point, and a new test will be written to test the method. It may be the case that it is a parallel of an existing test, trying different parameters. It may be the case that some code had been written without a test, after which it’s corresponding test is being written.
This is often allowed as it’s sometimes arguably necessary. An example might be when the code causes a failure, such as the Exception example earlier. Then other development is paused and a test for that new code is written.
Some times its necessary to make up for shortcomings, such as developers who don’t write unit tests, or who don’t write good unit tests. This is a necessary evil, and whether desired or not, we put up with it.
Any Test Development
Far too often, this is used, even just sometimes. Bad tests often just to meet the goal of having a test are written. These are the tests that simply call methods, but don’t evaluate the the results.
These tests are often written just to make code coverage tools happy.
Code coverage tools in general, and EMMA in particular, watch the base code as tests are being run and confirm that each line is executed. Sometimes a complex line is even noted as partially hit if not all parts of it are executed. The idea is to confirm that code is covered.
Code coverage tools can be a good second-check to validate that TDD is indeed covering all of the base code. This is an acceptable use of code coverage tools. It should be the case that TDD (during or driven) coverage is 100%. Some claim this is a lofty goal, but if 100% isn’t reached (or reachable), then unit tests are just falling short.
Code coverage tools should not be used to drive tests, however. This is too often what happens. Code will get written, code coverage will be checked, and then tests will be written to make the coverage tool happy. The resulting tests far too often simply try to coax the code down every required branch to hit the coverage goal. Just as far too often, these tests don’t evaluate the results.
The latter is the reason my current client is trying to use code coverage, and unfortunately the result is exactly bad tests just trying to achieve the lowly 80% target.
I believe they are selling themselves short, and have suggested instead perhaps it would benefit them to teach some proper unit testing to the developers, especially those that aren’t writing tests at all, and more especially to those who write bad tests.
The Mythical 100%
There is much discussion elsewhere on the Internet saying that 100% is unreasonable and unrealistic. I contend that it is not only possible, but desirable, and should be required. For this end, the lack of a 100% from code coverage reports should trigger a flurry of activity.
Some believe it’s acceptable to allow getters and setters to not be tested. It is possible that one can agree that could be allowed, but I am not among them. With proper tests, all of the properties of an object eventually should be tested, so at least their getters will be tested. A proper suite of tests also should preset the properties to ensure they’re changed (or not) as expected, so that should cover all of the setters. The setters and getters should be considered automatically covered, and any missing getter or setter arguably shouldn’t exist.
Using code coverage to show where tests don’t hit should lead to Test During Development addition of unit tests. Those tests still should be quality tests and the results should be evaluated. However, since the only confirmation used is code coverage reporting, there’s no validation that the test does anything, and therefore no protection to future changes by the test.
Even the elusive caught Exceptions can be tested. With a variety of mock objects, either from a framework such as EasyMock or generated as a part of the test suite, it is possible to fake a thrown Exception, hitting the catch blocks. One of our associates has gone one better and written an Exception injector called JEXIN, allowing Exception injections where mock objects can’t be used.
Unacceptable at 99%
So if 100% is achievable, should 99% be allowed? Other than caught Exceptions, everything should be easy to create tests for, especially with mock objects. Allowing for the difficulty of writing tests that force Exceptions to occur, should they be allowed to be untested?
I contend that it is not acceptable. If a test can’t be written for it, then perhaps the catching of Exceptions needs to be avoided; make the methods throw the Exception instead. Certainly this puts the burden of handling the Exception on the caller, but arguably it more correctly meets the unit test and coverage report goals. Unfortunately, it removes a lot of the stuff we like to leave in the methods we’re writing. The right answer is to get better at test writing, but in the meantime, what can be done? Accept a slightly lower code coverage report value?
So, after all of that, what does it mean?
Low code coverage reports show clearly a lack of concern for (or belief in) unit tests. One project at my client is covered to 18% of the lines. Obviously the unit tests aren’t important to those developers. Clearly many methods could be changed without breaking any tests at all, so the justification for the code existing isn’t there, and neither is any protection.
Another has over 90% coverage, but does that mean it’s any better? Surely a lot more methods have potentially been justified and protected. It is more likely that changing code may break existing tests. This is good as we want to have developers make sure that their changes either respect existing code, or correct that, too.
Should coverage hit 100%, the code is still not certain to be well protected. It may be the case that there are tests created that simply call every method. Simple tests could be written so that setter and getter are tested together to make sure that the getter returns what the setter got.
Perhaps a metric needs to be introduced to count the number of test methods that call base methods. Similarly, perhaps the line counts of test cases should be checked against the lines they test; they should probably be double or triple for even base testing. This could help encourage testers to write “good,” “also good,” “expected bad,” and “unexpected bad” kinds of tests. It could also encourage bad testers to simply repeat test calls in different test methods. Any of these kinds of metrics would simply reinforce the false sense of security that good code coverage gives.
Really, the only sure way is to have some hands-on code review. Get multiple sets of eyes on the tests. Validate that methods are tested for multiple sets of data, and that the data is adequately evaluated after the tests.
Some Good Testing Ideas
Unfortunately, the bulk of bad unit tests are found by user testing of the software. Some of this user testing is done by the developer. The more embarrassing failures are found after the developers release the code. Either of these should return the developer to the unit tests first, not the base code. Sure, some review of the base code, but fixes should also be started in the unit tests.
When the developer finds a failure, they should immediately follow up with a unit test that tries to recreate the failure; the new test should not try to validate the expected result, but that the failure happened. It should be the case that a good test writer can create a test that does get the failure to occur. The test should then be reviewed for the flaw and corrected; it should fail, and the corrections should be made to the base code. If other tests fail, they should be evaluated, and either the fix should be made more cooperative, or the tests should be corrected to respect the fix.
Good Test Tenets
Tests should never be disabled. Never. Not ever. If they break, fix ’em or the code they’re protecting. It’s a common practice to “get tests out of the way” while working on something. If changing one thing breaks another, then the tarp is being pulled to tightly, and both (or all) broken tests need to be corrected.
Tests should be able to stand alone. It should not matter what order tests are executed. No test should depend on the set-up done by another test. If there is common configuration, make a set-up method.
Tests should leave environments in their starting state. Database transactions should be rolled-back. Sockets should be closed. The hardest is that singletons and statics should be reinitialized as necessary; a good understanding of ClassLoader and Reflection makes this easy.
The tested method should never be the end of the test. It should always be the case that there is some post-call validation to every method call, otherwise why would the call need to be made? Something had to have been done, right? The database transaction (mock or real) should be checked to be sure the correct actions occurred. Passed objects should be verified that the correct changes were made, and that undesired changes were not made. Return values should certainly be verified. Expected Exceptions should be caught and tested for, and unexpected Exceptions should cause test failures. Everything a method is expected to do or not do should be validated after it is called; therefore the call of the tested method should not be the end of the test.
Remove Untested Code
A quick aside for a tool that can come in handy when trying to make the case for good code coverage. It still does not do anything to validate tests are doing anything useful, or to ensure that code coverage means anything.
An unfortunately named code coverage tool Guantanamo will do quite different than just report where code isn’t covered. It rewrites the code without covered code. Again, it doesn’t validate that the tests are doing any good, but it goes a little further in the argument that tests should exist. If one is going to try to use unit tests to justify, or more importantly to protect code, this tool will show the developers just what is being protected.
Like the other code coverage tools, it will evaluate what is tested, but then it makes a copy of the source without the covered lines. This code will either result in compile-time and run-time errors as code that isn’t protected is removed. What is a try without a catch? What happens when untested branches are removed?
I’m not sure one could use this to clean the code, but it can be used as a second-check to validate the usefulness of some code, and certainly to educate about unit tests as code protection. Try building and running the code as cleaned by Guantanamo and see if it makes a difference. If it compiles and runs then there was indeed unnecessary code. If it won’t compile or it doesn’t run, then the argument is strengthened that coverage is weak.