In the realm of quantum physics, Schrödinger’s Cat is a famous thought experiment: a cat is placed in a sealed box with a radioactive source and a poison that will be triggered if an atom decays. Until the box is opened and the cat is observed, it exists in a superposition of states: both alive and dead at the same time.
In the modern enterprise, data backups exist in a disturbingly similar state.
On the surface, everything looks perfect. The dashboard shows a sea of green checkmarks. The policy manual is impressive. The Disaster Recovery (DR) plan has been signed off by the auditors. According to every reporting metric available to the Board, the organisation is protected. Yet, until a full-scale restoration is attempted under pressure, the integrity of those backups is unknowable. They are simultaneously “working” and “broken” until the moment of observation.
The uncomfortable truth for leadership is this:
A backup you haven’t restored is optimism, not strategy.
The Illusion of Resilience
Many organisations mistake “having backups” for “being resilient.” This is a fundamental category error. Resilience is not a static state achieved by purchasing a software license; it is a dynamic capability that must be proven.
The IT industry has become exceptionally good at providing “false signals” of safety. These signals create a dangerous sense of complacency among C-suite executives and Board members:
- The Green Checkmark Syndrome: Backup dashboards provide a daily sense of achievement. “Job completed successfully” simply means the data was copied from point A to point B. It does not mean point B is readable, consistent, or useful.
- Offsite Replication Status: Knowing that data is stored in a secondary cloud region or a remote data centre is comforting, but replication often replicates corruption just as efficiently as it replicates healthy data.
- The “Paper” DR Plan: An annual DR plan document approved by an audit is a statement of intent, not a proof of capability.
None of these metrics validate what actually matters during a crisis: Recovery Time Objectives (RTO), Data Consistency, and Application Functionality. Technical recovery is not a success if the servers are “on” but the applications cannot talk to the database, or if the identity management system is inaccessible, preventing anyone from logging in to the restored environment.
Why This Is a Governance Issue (Not an IT Issue)
For too long, backup and recovery have been relegated to the “basement” of IT operations: a checkbox for system administrators to manage. This is a profound misunderstanding of risk. Untested backups represent a systemic governance failure for three primary reasons:
1. The Creation of Known-But-Unvalidated Safeguards
Governance is about the oversight of safeguards. If a Board is told a safeguard exists but that safeguard has never been tested, the Board is making decisions based on unverified assumptions. This is the equivalent of a shipping company assuming its lifeboats will float because they were purchased recently, without ever putting them in the water.
2. False Assurance at the Executive Level
When IT reports “100% backup success” to the Risk Committee, they are providing a metric that masks the actual risk. This produces a “veneer of safety” that prevents the Board from allocating necessary resources to true resilience.
3. The Transfer of Operational Risk into Strategic Blind Spots
In the event of a total systemic failure, such as a sophisticated ransomware attack, the inability to restore data moves from being an “IT problem” to a “going concern” issue. It threatens the very existence of the company.
The Regulatory Context: Beyond Best Practice
This is no longer just a matter of “good hygiene”; it is a matter of law. Under frameworks like NIS2, resilience and incident response are becoming mandatory pillars of corporate governance.
These regulations emphasise accountability at the management level. If an organisation suffers a catastrophic data loss because its backups were never tested, the Board cannot claim they “didn’t know.” Compliance now requires demonstrable capability, not just documented intention. A DR plan sitting in a PDF on a corrupted server does not meet your resilience obligations.
The Hidden Failure Modes Boards Rarely See
In a boardroom, “Backup” sounds like a single, simple thing. In reality, it is a complex chain of dependencies. When that chain breaks, it usually happens in ways that catch unprepared executives off guard. Here are the “failure modes” that rarely make it into a status report:
- Encryption Key Paradox: In a crisis, you find your backups are encrypted (to protect them from hackers), but the keys required to unlock them were stored on the very system that just crashed.
- SaaS Data: Many Boards assume that because the company uses Microsoft 365, “the cloud” handles the backups. Most SaaS providers operate on a “shared responsibility” model: they protect the infrastructure, but you are responsible for the data. If a user deletes a critical folder, it may be gone forever.
- The Identity Deadlock: You have the data backups, but the system that authenticates users (Identity Provider) is down. Without the “keys to the front door,” you cannot access the restored data.
- Logical Corruption: A virus or a database error may have been silently corrupting data for months. Your backups are “successful,” but they have been faithfully backing up that corruption the entire time, leaving you with no clean version to return to.
In the words of Nassim Taleb, this is Fragility. The system appears stable because it hasn’t been hit yet. But because it has never been stressed, the failure, when it comes, will be non-linear and catastrophic.
Optimisation vs. Redundancy: The Efficiency Trap
Modern IT is driven by the mandate to be “Lean.” We optimise for cost efficiency, performance, and minimal waste. We remove “unnecessary” overhead.
However, resilience requires a controlled form of inefficiency. True resilience requires “slack”: extra time for staff to run drills, extra storage space for multiple versions of data, and redundant systems that sit idle until needed. When an organisation over-optimises for cost, resilience is usually the first thing sacrificed at the altar of the budget.
This is made worse by a “firefighting” IT culture. Most IT teams are so overwhelmed by day-to-day tickets and reactive maintenance that they never have the “luxury” of running a full-scale restoration exercise. Restore drills are postponed indefinitely. As a result, the gap between assumed and actual recovery capability widens quietly in the dark. Confidence in the system increases simply because it hasn’t failed yet, while the actual probability of a successful recovery decreases every day.
The Organisational Blind Spot
Why is restore testing so rare?
- No Immediate ROI: You don’t make money by testing a restore. It is a cost centre.
- No Visible Crisis: Until the building is on fire, no one cares about the fire extinguishers.
- Fear of Exposing Weakness: IT leadership may be hesitant to run a test because they suspect it might fail, and a failed test requires explaining to the Board why “their” systems aren’t working.
This leads to the Illusion of Resilience: the act of going through the motions of security and backup compliance to satisfy auditors, without ever actually possessing the ability to recover from a real incident.
The Fix: What Mature Governance Looks Like
To move from “Hope” to “Strategy,” Boards and C-suite executives must change the way they measure and manage data resilience. It requires moving beyond the dashboard and into the realm of verification.
1. Regular Restore Drills
Do not ask IT if the backups “ran.” Ask when the last full-scale restoration of a critical business service was completed.
2. Documented RTO/RPO Validation
If the business claims it can be back online in 4 hours (Recovery Time Objective), that number must be tested. If the test reveals it actually takes 18 hours, the Board needs to know that gap exists now, not during a ransomware negotiation.
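That gap only becomes visible if the drill is measured rather than estimated. The sketch below is illustrative only: `run_restore_drill()` is a hypothetical placeholder for whatever your actual restore procedure and smoke tests are, and the 4-hour target mirrors the example above.

```python
# Minimal RTO validation sketch. run_restore_drill() is a hypothetical
# placeholder for the real procedure (restore infrastructure, databases,
# application tier, then smoke-test until the service answers).
import time
from datetime import timedelta

RTO_TARGET = timedelta(hours=4)  # the figure the business has committed to

def run_restore_drill() -> None:
    """Placeholder: perform the full restore of one critical business service."""
    ...  # wire this up to your actual restore tooling

start = time.monotonic()
run_restore_drill()
achieved = timedelta(seconds=time.monotonic() - start)

print(f"RTO target:   {RTO_TARGET}")
print(f"RTO achieved: {achieved}")
if achieved > RTO_TARGET:
    print(f"Gap to report to the Board: {achieved - RTO_TARGET}")
```

The point of the script is not automation for its own sake; it is that the achieved number gets recorded and can be compared against the target in the next Board report.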
3. Application-Level Recovery Testing
Testing a server restore is easy. Testing an application restore (where the database, the web front-end, the API layer, and the user permissions all have to sync up) is hard. This is the only level of testing that matters to the end-user.
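To make “the application actually works” concrete, a post-restore smoke test should probe each layer the business depends on rather than just the hypervisor. The hostnames, ports, and health endpoints below are purely illustrative placeholders, not a real environment:

```python
# Minimal post-restore smoke test sketch. All hostnames and endpoints
# are invented placeholders for a restored test environment.
import socket
import urllib.request

CHECKS = [
    ("database reachable",   lambda: socket.create_connection(("db.restored.local", 5432), timeout=5)),
    ("API healthy",          lambda: urllib.request.urlopen("https://api.restored.local/health", timeout=5)),
    ("identity provider up", lambda: urllib.request.urlopen("https://idp.restored.local/health", timeout=5)),
]

failures = []
for name, check in CHECKS:
    try:
        check()
        print(f"PASS  {name}")
    except OSError as exc:
        failures.append(name)
        print(f"FAIL  {name}: {exc}")

# A restore only counts as successful if every layer answers,
# including the identity provider that lets people log in at all.
print("Restore verified" if not failures else f"Restore incomplete: {failures}")
```

Note that the identity provider check is deliberately on the list: it is the “keys to the front door” problem described earlier, and it is the check most often missing from drill plans.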
4. Board-Level Reporting that Makes Sense
Board reports should no longer feature “Backup Success %.” Instead, they should include:
- Date of the last successful full-environment restore test.
- Actual recovery time achieved vs. target.
- Identified gaps and the specific budget allocated to remediate them.
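As a rough illustration of what this looks like when it reaches the Board as data rather than a percentage, the sketch below captures those three items in one structure; the field names and example values are invented for illustration, not taken from any real report.

```python
# Sketch of a board-level resilience report as structured data
# rather than a "backup success %" figure. Example values only.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class ResilienceReport:
    last_full_restore_test: date          # when a full-environment restore was last proven
    rto_target: timedelta                 # what the business has committed to
    rto_achieved: timedelta               # what the last drill actually took
    gaps: list = field(default_factory=list)
    remediation_budget_eur: int = 0       # budget allocated to close the gaps

report = ResilienceReport(
    last_full_restore_test=date(2024, 11, 5),
    rto_target=timedelta(hours=4),
    rto_achieved=timedelta(hours=18),
    gaps=["identity provider has no tested restore path"],
    remediation_budget_eur=50_000,
)
print(report)
```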
5. Treat Restore Testing Like a Fire Drill
Fire drills are inconvenient. They disrupt the workday. They are loud. But we do them so that when the alarm is real, people don’t die. Data recovery testing must be treated the same.
Resilience Is Demonstrated, Not Declared
Any disaster recovery plan is a work of fiction until it is rehearsed.
The Board of Directors is not responsible for preventing every possible cyberattack or system failure; in the modern world, that is an impossible standard. However, the Board is responsible for ensuring that when a crisis arrives, the organisation has the demonstrated capability to survive it.
The time to find out that your backups don’t work is on a Tuesday morning during a scheduled drill, not at 3:00 AM on a Sunday during a ransomware attack.
Resilience is not a status update. It is a muscle that needs to be exercised. If you don’t, it will fail you when you need it most.


