Wednesday, June 04, 2025

Have LLMs made code-coverage a meaningless statistic?

TL;DR: If AI can easily generate tests to increase code coverage, has coverage become a meaningless metric?

[Image: example code coverage report output]

I used to like code coverage (the percentage of the code executed while testing) as a metric.

I was interested in whether it was very high or very low.

Either of these was a flag for further investigation.

Very low would indicate a lack of testing.

Very high would be either suspicious or, if the code was written following TDD, encouraging.

Neither was a deal breaker, as neither was an indication of the quality or value of the tests.


Now tests are easy. Anyone can ask an AI tool to create tests for a codebase.


This means very low code coverage now indicates that AI isn't being used as a coding tool, which probably also suggests an absence of other productivity tools and time-saving techniques.

And very high code coverage can mean nothing. There may well be lots of tests, or tests that cover most of the code, but they are likely to be only unit tests, and low-value ones at that.
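To illustrate, here's a minimal sketch in Python (the function and test names are hypothetical) of a test that executes every line, and so scores 100% coverage under a tool like coverage.py, while asserting almost nothing:

    # discount.py (hypothetical)
    def apply_discount(price: float, code: str) -> float:
        """Return the price after applying a discount code."""
        if code == "SAVE10":
            return price * 0.9
        return price

    # test_discount.py - executes every line (100% coverage)
    # but never checks that the discount is actually correct.
    def test_apply_discount_runs():
        apply_discount(100.0, "SAVE10")   # result ignored
        assert apply_discount(100.0, "OTHER") is not None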


There are two approaches to tests, each asking a different question:

  1. Are there inputs or options that cause the code to break in unexpected or unintended ways?
  2. Does the code do what it's supposed to? (What the person/user/business wants?)


Type 1 tests are easy, and they're the type AI can produce, because they can be written just by looking at the code. These are tests like: "What if this function is passed an empty string?"
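For example, here's what a Type 1 test might look like in Python with pytest (the slugify function and its module are hypothetical):

    import pytest
    from myapp.text import slugify  # hypothetical module and function

    # Type 1: edge-case tests that can be written purely by reading
    # the code, with no knowledge of the business.
    def test_slugify_empty_string():
        assert slugify("") == ""

    def test_slugify_none_raises():
        # Assuming the function rejects None with a TypeError.
        with pytest.raises(TypeError):
            slugify(None)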

Type 2 tests verify that the code behaves as intended. These are the kind that can't be written without knowledge that exists outside the codebase. These are tests like: "Are all the business rules met?"
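A Type 2 test, by contrast, encodes a rule that lives outside the code. A sketch, again with hypothetical names, assuming the business has told us that orders of £50 or more ship free:

    from myapp.orders import calculate_shipping  # hypothetical

    # Type 2: you can't write these by reading the code alone -
    # the expected values come from the business, not the codebase.
    def test_orders_of_50_or_more_ship_free():
        assert calculate_shipping(order_total=50.00) == 0.00

    def test_orders_under_50_pay_standard_rate():
        assert calculate_shipping(order_total=49.99) == 4.95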


Type 1 tests are about the reliability of the code. Type 2 tests are about whether you have the right code.

Type 1 tests are useful and necessary. Type 2 tests require understanding the business, the app, and the people who will be using it.

Type 1 tests are generic. Type 2 tests will vary for each piece of software.

Type 1 tests are boring. Type 2 tests are where a lot of the challenge of software development lives. That's the fun bit.


Them: "We've got loads of tests."

Me: "But are they useful?"

Them: "Umm..."


I've recently started experimenting with keeping AI-generated tests separate from the ones I write myself. I'm hoping this will help me identify where the value comes from AI and where it comes from me.
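One way to do this, sketched here with pytest (the marker name is my own choice): register a custom marker, tag the AI-generated tests with it, and run the two sets independently.

    # pytest.ini - register the marker so pytest doesn't warn about it:
    #
    #   [pytest]
    #   markers =
    #       ai_generated: tests produced by an AI tool

    import pytest

    @pytest.mark.ai_generated
    def test_discount_with_empty_code():
        ...

Running pytest -m ai_generated then selects only the AI-generated tests, and pytest -m "not ai_generated" runs only the handwritten ones.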



