Wednesday, June 04, 2025

Have LLMs made code coverage a meaningless statistic?

TLDR: If AI can easily generate code to increase test code coverage, has it become a meaningless metric?

Example code coverage report output

I used to like code coverage (the percentage of the code executed while testing) as a metric.

I was interested in whether it was very high or very low.

Either of these was a flag for further investigation.

Very low would indicate a lack of testing.

Very high would be suspicious or encouraging (if the code was written following TDD).

Neither was a deal breaker, as neither was an indication of the quality or value of the tests.


Now tests are easy. Anyone can ask an AI tool to create tests for a codebase.


This means very low code coverage indicates a lack of use of AI as a coding tool, which probably also suggests a lack of other productivity tools and time-saving techniques.

Now, very high code coverage can mean nothing. There may well be lots of tests covering most of the code, but they're likely to be only unit tests, and low-value ones at that.


There are two approaches to tests, each asking a different question:

  1. Are there inputs or options that cause the code to break in unexpected or unintended ways?
  2. Does the code do what it's supposed to? (What the person/user/business wants?)


Type 1 tests are easy, and they're the type AI can produce, because they can be written just by looking at the code. These are tests like: "What if this function is passed an empty string?"

Type 2 tests verify that the code behaves as intended. These are the kind that can't be written without knowledge that exists outside the codebase. These are tests like: "Are all the business rules met?"
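The two kinds can be sketched side by side. This is a hypothetical example (the function, names, and business rule are all mine, not the author's): a type 1 test falls straight out of reading the code, while a type 2 test encodes a rule that the code alone can't tell you.

```python
# Hypothetical function under test: apply a discount code to a price.
def apply_discount(price: float, code: str) -> float:
    """'SAVE10' gives 10% off; an empty code leaves the price unchanged."""
    if not code:
        return price
    if code == "SAVE10":
        return round(price * 0.9, 2)
    raise ValueError(f"Unknown discount code: {code}")


# Type 1: derivable purely from reading the code.
# "What if this function is passed an empty string?"
def test_empty_code_leaves_price_unchanged():
    assert apply_discount(100.0, "") == 100.0


def test_unknown_code_raises():
    try:
        apply_discount(100.0, "NOPE")
        assert False, "expected ValueError"
    except ValueError:
        pass


# Type 2: requires knowledge from outside the codebase.
# That "SAVE10" must mean exactly 10% off is a business decision;
# nothing in the code confirms the rule is the *right* one.
def test_save10_gives_ten_percent_off():
    assert apply_discount(100.0, "SAVE10") == 90.0
```

An AI reading only `apply_discount` could generate the first two tests; it could also generate the third, but only by assuming the implementation is correct, which is exactly the circularity the post is describing.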


Type 1 tests are about the reliability of the code. Type 2 tests are about whether you have the right code.

Type 1 tests are useful and necessary. Type 2 tests require understanding the business, the app, and the people who will be using it.

Type 1 tests are generic. Type 2 tests will vary for each piece of software.

Type 1 tests are boring. Type 2 tests are where a lot of the challenge of software development lives. That's the fun bit.


Them: "We've got loads of tests."

Me: "But are they useful?"

Them: "Umm..."


I've recently started experimenting by keeping AI-generated tests separate from the ones I write myself. I'm hoping this will help me identify where value is created by AI and where it's from me.
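The post doesn't say how the AI-generated tests are kept separate. One minimal convention (my assumption, not the author's setup) is a filename prefix such as `test_ai_*.py`, which pytest still discovers alongside `test_*.py` and which a small script can tally:

```python
from pathlib import Path


def count_tests_by_origin(test_dir: str) -> dict:
    """Count test files in test_dir, split by a hypothetical naming
    convention: test_ai_*.py for AI-generated, test_*.py otherwise."""
    paths = list(Path(test_dir).glob("test_*.py"))
    ai = [p for p in paths if p.name.startswith("test_ai_")]
    return {"ai": len(ai), "human": len(paths) - len(ai)}
```

With this convention, something like `pytest -k "ai_"` selects only the AI-generated tests, making it possible to compare what each group actually catches.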




Tuesday, June 03, 2025

The problem with multi-word terms (including "vibe coding")

TLDR: I think it's worth being clear about the meaning of the words we use. Maybe compound terms deserve extra suspicion.

Not wanting to sound too pessimistic, but I think it's fair to say that we are lazier than we realise and not as smart as we think.

We hear a term that's comprised of multiple words we recognise, and assume a meaning of the overall term based on our individual understanding of the individual words.
Let me give you three examples.

1. "Vibe coding"

Originally, it was defined to describe people "going with the vibe" and letting the AI/LLM do all the work. You just tell the AI what you want and keep going until it has produced all the code and deployed the resultant software without having a care or knowledge about how it works.
But some developers heard the term and presumably thought, "I know what coding is and I know what good vibes are, so if I put them together, that must mean 'using AI to produce code that gives me good vibes.'"
The result: there are lots of different understandings of the meaning, and so whenever it's used, it's necessary to clarify what's meant. Yes, there can be lots of different meanings and I'm not going to argue that one is more valid than the others.

2. "Agile development" 

The original manifesto had some flexibility and left some things open to interpretation, or to implementation appropriate to specific circumstances. However, I suspect there were a lot of people who thought, "I know what development is and I know what it means to be agile, so I'll just combine the two."
The result: everyone has their own understanding of what it means to "do agile development". Some of those variations are small and some are massive. I've yet to meet two different teams "doing agile development" who do things exactly the same. Does that matter? Probably not. It's just important to clarify what people mean when they use the term.


3. "Minimal viable product" (MVP)

Yes, you may know what all the words mean individually. You may even have an idea about the term as a whole, but the internet is bursting with explanations of what it actually means. My experience also tells me that if you have a development background, your understanding is highly likely to be very different from someone in product or marketing.
Does it matter? It depends on whether all the people using the term are in agreement. It might be fine if you're using it as an alternative term for "beta", or you mean it must have a particular set of features, or it requires a certain level of visual polish. I think being able to prove it's viable based on customer actions is more important. But, again, if all the people on your project can agree on the meaning, I trust you'll work it out. (Confession: I left one job because the three people in charge (an issue for another time) all had a different understanding of what MVP meant, but refused to give their definition or acknowledge that their definitions differed. It made the work impossible.)



Some people (or maybe all people, sometimes; I have done this myself) will hear a word, assume a meaning, and not ask any questions.

I've observed a similar thing with headlines. People make assumptions based on headlines or TLDRs, and so don't get to appreciate the nuance. Or maybe don't even realise that there might be more to it than a simple explanation.

Nuance matters. It's the detail where the devil hides. It's the 80% of edge cases accompanying the 20% of the obvious in a 'simple' scenario.

Words matter. I probably spend far too much of my time thinking about words because they're a foundation of communication.

Yes, for many people, words don't matter.

But, going back to thinking about "vibe coding", words are how we communicate with machines. While the trend has always been towards "higher-level" languages, we didn't previously go all the way to our spoken languages because of:
A) technical limitations 
B) the lack of precision in our spoken/written languages 

AI/LLMs overcome some of the technical limitations and can make some reasonable guesses to work around the lack of precision.

Relying solely on natural language, in only a few sentences or even paragraphs, to express all the subtle details and specifics that software requires doesn't seem appropriate.

Some people think 'Nuance doesn't matter' until the software doesn't do exactly what they expect in an edge case scenario.

Producing software that isn't as good as I want/expect may just be part of the enshittification of life.

I think many people believe (or think they can get away with) acting like "Close enough" is the new "Good enough".

Magpies are very vocal. And maybe they're right. Perhaps we should just focus on the new and shiny.

Or if using AI/LLMs saves money and cuts costs, that's all that matters. Well, matters to some. I definitely don't think it's all that's important.



Then I wonder about choosing names for things.
If there are such potential problems when combining existing words, maybe the answer is using made-up words, or words with no direct correlation to the thing being named...




Now, I'll just wait for the comments that tell me I don't understand the above terms correctly...