What Is Fragility?

Sometimes the software we write seems to break for no specific reason. For example, a new bug appears in an area of our system that hasn’t been changed for a while. We may feel perplexed and frustrated by this program misbehaviour.

Such problems are often signs of Software Fragility—also known as Software Brittleness.

Here’s my favourite metaphor for fragility:

Turning on the kitchen lights sets our bed on fire.

No sane person wants their bed to burn; that is the undesirable outcome. Weirder still, we cause the problem (bed on fire) by turning on the lights in a room presumably far removed from where we sleep. The problem’s cause is non-local.

Naturally, an isolated instance of software system fragility would not prove overly troubling. We’d fix the issue and move on. Calling the whole system ‘fragile’ would seem a little unfair.

However, in software, problems seldom come alone. We’ve probably seen systems with a more pervasive condition firsthand: almost every time we make a minor code change, a cascade of broken behaviour ripples through the program.

If every modification—never mind how small—endangers expected behaviour, working with this system’s code will be demanding.

Welcome to fragile software!

How does system fragility manifest itself? Let’s find out by continuing with our ‘faulty kitchen light switch sets the bed on fire’ example:

After extinguishing the flaming bed, we diagnose a fault in the light switch and make the appropriate repairs. The problem is gone—turning on the kitchen lights no longer ignites our mattress. We’re winning. Or are we?

A new problem has appeared: The washing machine water sensor is not turning off at the preset water level. Water is overflowing onto the laundry floor and pouring into the hallway and the rest of our house!

As we turn off the water intake tap, we wonder whether the misbehaving water level sensor has anything to do with our kitchen light repair. It seems unlikely, but it is possible.

While inspecting how the washing machine is wired up, we discover a link between it and the kitchen lights. That’s not right, so let’s remedy it.

Unbeknownst to us, this latest repair affects the garage door opener: every time we start a laundry wash cycle, the mechanism also opens the garage door. It takes us a few days to discover this latest issue; by then, our top-of-the-line mountain bike has been stolen from the garage.

Well, you get the idea.

Is this Gedankenexperiment-house a fragile system? It certainly seems that way. Every time we fix a system defect, we instigate more problems. It’s like playing crisis whack-a-mole.

Fragility

We’re ready to define System Fragility:

A software system is fragile when localised changes cause undesirable non-local behaviour.

In other words, fixing bugs causes more bugs. Often those bugs are in other parts of the system.
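
To make the definition concrete, here is a minimal Python sketch of the house example. Everything in it (the `CIRCUITS` table, the device names, the function names) is invented for illustration. It assumes the root of the trouble is a shared wiring table that every device reads, so a repair to one entry silently changes what happens elsewhere.

```python
from __future__ import annotations

# Hypothetical house wiring: every device looks itself up in one shared table.
CIRCUITS = {"kitchen_light": 1, "bedroom_heater": 1, "garage_door": 2}


def powered_devices(circuit: int) -> list[str]:
    """Return every device that receives power when `circuit` is switched on."""
    return sorted(name for name, wired_to in CIRCUITS.items() if wired_to == circuit)


def switch_on_kitchen_light() -> list[str]:
    """Turn on the kitchen light and report everything that was powered."""
    return powered_devices(CIRCUITS["kitchen_light"])


def fix_kitchen_light() -> None:
    """A seemingly local repair: move the kitchen light to circuit 2."""
    CIRCUITS["kitchen_light"] = 2
```

Before the repair, `switch_on_kitchen_light()` also powers `bedroom_heater`. After calling `fix_kitchen_light()`, the bedroom is safe, but the same call now powers `garage_door`. The change was local; the new misbehaviour is not.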

Here is a graphical representation of software fragility:

We start with a single bug in a fragile program:

The light-blue background represents the system, or more accurately, all the behaviour this program expresses—its ‘behaviour space’.

Let’s fix this bug and see what happens:

Indeed, we’ve fixed the bug: it’s no longer causing any problems. Yet correcting the original defect has spawned a cascade of three bugs in other parts of the program: Bugs 2, 3 and 4. (Note: Longer arrows depict behaviour problems further removed from the bug fix, that is, non-local ones, while shorter arrows indicate problems more similar to the program behaviour near the bug fix.)

When we now proceed to fix one of these new bugs—Bug 3—in our fragile system, we get another cascade of bugs—Bugs 5, 6, 7, 8:

Wow! One fixed bug fanning out to 4 new bugs!

Let’s continue a little further. We fix Bug 4. Luckily, our efforts here only create one more bug—Bug 9:

Thankfully, when we remedy Bug 9, no further bugs are produced:

Initially, in this hypothetical fragile system, we had one bug to fix, while in the end, all our efforts resulted in 5 outstanding program defects. All our work seemingly made things worse! In fragile systems, fixing bugs the traditional way does not work.
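
The tally is easy to check with a few lines of Python. This sketch simply replays the walkthrough; the `SPAWNED_BY_FIX` table and the function name are illustrative, and the numbers come straight from the example above.

```python
from __future__ import annotations

# Which new bugs each fix spawned in the walkthrough.
SPAWNED_BY_FIX = {1: [2, 3, 4], 3: [5, 6, 7, 8], 4: [9], 9: []}


def outstanding_after(fix_order: list[int]) -> set[int]:
    """Replay the fixes in order and return the bugs that are still open."""
    open_bugs = {1}  # we start with a single known bug
    for bug in fix_order:
        open_bugs.remove(bug)                            # the fix closes this bug...
        open_bugs.update(SPAWNED_BY_FIX.get(bug, []))    # ...and may open others
    return open_bugs


remaining = outstanding_after([1, 3, 4, 9])
print(sorted(remaining))  # [2, 5, 6, 7, 8]
```

Four fixes later, Bugs 2, 5, 6, 7 and 8 are still open: five defects where we started with one.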

Instead, how should we make changes in fragile systems?

We’ll come back to that question. First, let’s examine the causes of fragility.

Out Of Sight

We haven’t even talked about the most troubling aspect of software fragility:

Fragility is a hidden phenomenon.

The diagrams above, where bug fixes breed more bugs, are misleading. They imply that the resultant bugs are immediately known to the software developer. In reality, that is not the case—if we knew we had created a cascade of child defects, we would modify our code until it was bug-free.

With fragility, our code fails silently. We may test a bug fix to validate its correct working and even manually run through a sample of adjacent code which may have been affected by our code changes. However, most non-trivial production systems are too large to test in their entirety. And ultimately, because we could not validate all existing software dependencies, the end user or customer discovers functionality that no longer works as it should—which is not a good look.

The Cause of Fragility

Why do we have Fragility? How do software systems end up in a place where we developers fear making minor changes because they could break the expected behaviour of the program?

Fragility is caused by excessive interconnectedness between parts of the software that should be independent of one another, or only minimally dependent.

We’re talking about Coupling.

Fragility shows up when we have too much inappropriate dependency in our codebase. For example, when the UI depends on the database schema, changes to this schema may endanger the error-free operation of the UI. We probably should isolate the UI as much as possible from changes to the database and its schema.
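
Here is a rough Python sketch of that UI and schema example. The column names, the `Customer` type and the function names are all hypothetical. The first function reads schema columns directly; the second version routes everything through one small mapping function, which is one possible way to isolate the UI, not the only one.

```python
from dataclasses import dataclass

# A database row as the (hypothetical) schema currently defines it.
row = {"first_name": "Ada", "last_name": "Lovelace"}


def render_greeting_coupled(db_row: dict) -> str:
    """UI code that reads schema columns directly; a renamed column breaks it."""
    return f"Hello, {db_row['first_name']} {db_row['last_name']}!"


@dataclass(frozen=True)
class Customer:
    """What the UI is allowed to know about, independent of column names."""
    display_name: str


def customer_from_row(db_row: dict) -> Customer:
    """The single place that knows the schema; only this changes with it."""
    return Customer(display_name=f"{db_row['first_name']} {db_row['last_name']}")


def render_greeting(customer: Customer) -> str:
    """UI code that depends only on the Customer type."""
    return f"Hello, {customer.display_name}!"
```

If `first_name` is later renamed, `render_greeting_coupled` raises a `KeyError` wherever it is used, while the decoupled version needs a change in `customer_from_row` only.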

Mitigating Techniques

Fast Feedback

How do we tackle systems that exhibit signs of fragility? Since most of fragility’s damage stems from its being hidden, maybe we can use this property as a starting point. What if we could immediately see bugs caused by a code change? Such visibility would be a powerful remedy: we could change code so that all the program behaviour we want to preserve is actually preserved.

Unit tests can help us in this regard. They provide insight into whether our software is working as expected or not. When we unintentionally break behaviour, one or more unit tests will fail and highlight the problem.
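
As a small illustration, here is a sketch using Python’s built-in `unittest` module. The invoicing domain, the constants and the function names are made up for this example. It assumes a free-shipping rule that quietly depends on the tax rate used by invoicing, which is exactly the kind of hidden link a test can expose.

```python
import unittest

TAX_PERCENT = 10                   # used by invoicing...
FREE_SHIPPING_FROM_CENTS = 5500    # ...and, indirectly, by the shipping rule


def invoice_total_cents(net_cents: int) -> int:
    """Gross amount in cents: the net amount plus tax."""
    return net_cents * (100 + TAX_PERCENT) // 100


def qualifies_for_free_shipping(net_cents: int) -> bool:
    """Shipping is free once the gross total reaches the threshold."""
    return invoice_total_cents(net_cents) >= FREE_SHIPPING_FROM_CENTS


class FreeShippingTest(unittest.TestCase):
    def test_fifty_net_ships_free(self):
        # Behaviour we want to preserve: a 50.00 net order ships free.
        self.assertTrue(qualifies_for_free_shipping(5000))
```

Saved in a test module and run with `python -m unittest`, this passes today. If someone later lowers `TAX_PERCENT` to 8 while working on invoicing, the gross total drops to 5400 cents and the test fails straight away, long before a customer notices the missing free shipping.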

Reducing Fragility

With our unit tests in place, we can start refactoring the code and reduce the degree of Coupling throughout our software.

Now we’re onto a virtuous cycle that will reduce fragility and improve our ability to change the software more reliably.