Wednesday, June 04, 2025

Have LLMs made code-coverage a meaningless statistic?

TL;DR: If AI can easily generate tests to increase code coverage, has coverage become a meaningless metric?

[Image: example code coverage report output]

I used to like code coverage (the percentage of the code executed while testing) as a metric.

I was interested in whether it was very high or very low.

Either of these was a flag for further investigation.

Very low would indicate a lack of testing.

Very high would be either suspicious or, if the code was written following TDD, encouraging.

Neither was a deal breaker, as neither was an indication of the quality or value of the tests.


Now tests are easy. Anyone can ask an AI tool to create tests for a codebase.


This means very low code coverage now indicates that AI isn't being used as a coding tool, which probably also suggests an absence of other productivity tools and time-saving techniques.

And very high code coverage can mean nothing. There may well be lots of tests, or tests that cover most of the code, but they are likely to be only unit tests, and low-value ones at that.
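To illustrate, here's a minimal sketch in Python (the function and test names are hypothetical) of a test that executes every line, and so scores 100% coverage under a tool like coverage.py, while asserting almost nothing:

    # discount.py (hypothetical)
    def apply_discount(price: float, code: str) -> float:
        """Return the price after applying a discount code."""
        if code == "SAVE10":
            return price * 0.9
        return price

    # test_discount.py - executes every line (100% coverage)
    # but never checks that the discount is actually correct.
    def test_apply_discount_runs():
        apply_discount(100.0, "SAVE10")   # result ignored
        assert apply_discount(100.0, "OTHER") is not None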


There are two approaches to tests, each asking a different question:

  1. Are there inputs or options that cause the code to break in unexpected or unintended ways?
  2. Does the code do what it's supposed to? (What the person/user/business wants?)


Type 1 tests are easy, and they're the type AI can produce, because they can be written just by looking at the code. These are tests like: "What if this function is passed an empty string?"
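For example, here's what a Type 1 test might look like in Python with pytest (the slugify function and its module are hypothetical):

    import pytest
    from myapp.text import slugify  # hypothetical module and function

    # Type 1: edge-case tests that can be written purely by reading
    # the code, with no knowledge of the business.
    def test_slugify_empty_string():
        assert slugify("") == ""

    def test_slugify_none_raises():
        # Assuming the function rejects None with a TypeError.
        with pytest.raises(TypeError):
            slugify(None)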

Type 2 tests verify that the code behaves as intended. These are the kind that can't be written without knowledge that exists outside the codebase. These are tests like: "Are all the business rules met?"
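A Type 2 test, by contrast, encodes a rule that lives outside the code. A sketch, again with hypothetical names, assuming the business has told us that orders of £50 or more ship free:

    from myapp.orders import calculate_shipping  # hypothetical

    # Type 2: you can't write these by reading the code alone -
    # the expected values come from the business, not the codebase.
    def test_orders_of_50_or_more_ship_free():
        assert calculate_shipping(order_total=50.00) == 0.00

    def test_orders_under_50_pay_standard_rate():
        assert calculate_shipping(order_total=49.99) == 4.95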


Type 1 tests are about the reliability of the code. Type 2 tests are about whether you have the right code.

Type 1 tests are useful and necessary. Type 2 tests require understanding the business, the app, and the people who will be using it.

Type 1 tests are generic. Type 2 tests will vary for each piece of software.

Type 1 tests are boring. Type 2 tests are where a lot of the challenge of software development lives. That's the fun bit.


Them: "We've got loads of tests."

Me: "But are they useful?"

Them: "Umm..."


I've recently started experimenting with keeping AI-generated tests separate from the ones I write myself. I'm hoping this will help me identify where the value comes from AI and where it comes from me.
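One way to do this, sketched here with pytest (the marker name is my own choice): register a custom marker, tag the AI-generated tests with it, and run the two sets independently.

    # pytest.ini - register the marker so pytest doesn't warn about it:
    #
    #   [pytest]
    #   markers =
    #       ai_generated: tests produced by an AI tool

    import pytest

    @pytest.mark.ai_generated
    def test_discount_with_empty_code():
        ...

Running pytest -m ai_generated then selects only the AI-generated tests, and pytest -m "not ai_generated" runs only the handwritten ones.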



