Monday, March 01, 2021

You must be insane to be a software developer

"The definition of insanity is doing the same thing over and over again and expecting a different result." - Probably not Albert Einstein.

When debugging, I repeat myself frequently. Doing the same thing over and over again and hoping for a different result.

There are times I wish I had alternatives, but I know no other way, and so I rerun the code. Each time hoping that something will be different.

It rarely is.

But it's those rare occasions that make this approach acceptable, if not necessary.

Note. I'm not talking about Heisenbugs. They're a separate, additional thing.

Sometimes repetition is the only option. An issue is observed or reported as only happening "some of the time." In such scenarios, the only available option is to run the code or perform the action repeatedly until the bad thing occurs, then try to identify what was different when it happened.

This is important because, without consistent reproducibility of an issue, it's impossible to be 100% confident of an applied fix.
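To put a rough number on that confidence, here is a minimal sketch in Python. It assumes something that is often not true of intermittent bugs: that each run fails independently with a fixed probability. Under that assumption, if the bug were still present, the chance of seeing n clean runs in a row is (1 - p)^n. The function name and parameters are my own, purely for illustration.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Return how many consecutive passing runs are needed after a fix.

    Assumes (hypothetically) that before the fix each run failed
    independently with probability `failure_rate`. If the bug were still
    present, n clean runs in a row would happen with probability
    (1 - failure_rate) ** n, so we pick the smallest n that pushes that
    probability down to 1 - confidence or lower.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1 (exclusive)")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

For a bug seen in roughly 1 run out of 10, `clean_runs_needed(0.1)` asks for 29 clean runs. The number never reaches 100% confidence, and it grows quickly as the bug gets rarer, which is why finding a consistent reproduction is worth the effort.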

I know many developers and organizations who dismiss reports of occasional issues. If the problem doesn't happen all the time, so their thinking goes, it must not be with their software. Or they assume that the responsibility to fully identify the cause of the bug is not theirs. Or, they think their time would be wasted trying to reproduce such bugs consistently.

This comes down to perspective and priority.

Is the developer tasked only with writing the code? Or is the code a means to an end?  

I think the goal of writing code is to deliver value. I don't see any value being delivered by dismissing a bug report because it's not consistently reproducible. Instead, I see this as a challenge.

I assume that the person reporting the problem has given all the information they can. Now the challenge is for me to investigate the situation and find the solution. I'm a detective. 🔍

Being a bug detective might (and often does) mean repeating tasks to see when there is an inconsistent response and then trying to work out the cause. Such problems usually come down to one of four things.

  • It's time (or date) dependent.
  • It's machine/OS/configuration dependent.
  • It's a race condition.
  • It's dependent on the number of times the function is executed.

It failed that time, yet it worked the time before, even though the code hasn't changed. What else has changed? And how can I test that?
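One way to make the repetition less blind is to record what was different on each run. The following is a minimal sketch in Python, not a finished tool: `run_repeatedly`, `summarize`, and `make_flaky` are hypothetical names, and the fields recorded are only a starting point that maps loosely onto the four causes above.

```python
import platform
import sys
import time
from collections import Counter


def run_repeatedly(func, runs=100):
    """Call `func` `runs` times and record the context of every call."""
    records = []
    for attempt in range(1, runs + 1):
        started = time.time()
        try:
            func()
            error = None
        except Exception as exc:  # record the failure and keep going
            error = repr(exc)
        records.append({
            "attempt": attempt,                # depends on execution count?
            "started": started,                # depends on time or date?
            "elapsed": time.time() - started,  # odd timings can hint at races
            "error": error,
        })
    return records


def summarize(records):
    """Collect the failing runs so they can be compared with passing ones."""
    failures = [r for r in records if r["error"] is not None]
    return {
        "runs": len(records),
        "failures": len(failures),
        "failing_attempts": [r["attempt"] for r in failures],
        "errors": Counter(r["error"] for r in failures),
        # Machine/OS/configuration details to compare across environments.
        "machine": f"{platform.system()} {platform.release()}, "
                   f"Python {sys.version.split()[0]}",
    }


def make_flaky(every=7):
    """Build a made-up function that fails on every `every`-th call."""
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] % every == 0:
            raise RuntimeError(f"failed on call {calls['n']}")

    return flaky
```

Running `summarize(run_repeatedly(make_flaky(7), runs=20))` reports failures on attempts 7 and 14, which points straight at a dependence on the number of executions. Real bugs are rarely that tidy, but having the attempt number, timestamp, and machine details side by side gives the detective something to compare.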

Note. I know there are obscure edge-case bugs that aren't the best use of anybody's time to address. Let's assume that here I'm talking about bugs that the business deems worth spending time/money/effort fixing.

I wonder if there are better ways to investigate such bugs.

I wonder if there are tools to help with this. (I've found references to a few things that no longer exist but nothing current. 😢)

I will frequently run unaltered code and hope for a different result. Surely I'm not the only one...

1 comment:

  1. I turn to memory dumps and WPR to sort out the hard things, with plenty of tooling around them that I packaged together into an installer. It saves a lot of time when finding the hard issues in production.