Recovery Is the First Systems Problem

Systems May 4, 2026

Most systems get their first serious engineering pass when they start to grow. The database needs indexes. The queue needs more workers. The API needs caching. The deployment target needs more capacity.

Those are useful moves. But they often arrive before the system can answer a more basic question: what happens after something fails?

That question is not secondary. Every production system is already dealing with failure. Requests time out. Workers stop halfway through a job. Webhooks arrive twice. A deploy leaves old and new code running at the same time. Someone needs to fix a customer record without making the problem larger.

Scale makes those events happen more often. Recovery decides whether they become normal maintenance or incidents.

What Recovery Mode Means

Recovery mode is the part of the system that takes over when normal flow is no longer trustworthy.

It does not mean the whole product goes down. It means the system changes posture. It stops pretending everything completed cleanly. It protects the important state, records what is unfinished, and gives the product or operator a safe next action.

A simple recovery mode might do four things:

Save the user's action before calling an outside service.
Mark the work as pending, failed, or needs review.
Let the same work be retried without creating duplicates.
Give someone a safe way to roll back, cancel, or repair the item.

This is different from simply showing an error page. An error page tells the user something went wrong. Recovery mode tells the system what to do next.
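The four behaviors listed above can be sketched as a small command record. This is a minimal illustration rather than a prescribed design: the names `CommandStatus`, `Command`, and `allowed_actions`, along with the exact set of statuses and actions, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CommandStatus(Enum):
    PENDING = "pending"            # stored, outside call not yet confirmed
    COMPLETE = "complete"          # outside call confirmed
    FAILED = "failed"              # outside call failed; may be retried
    NEEDS_REVIEW = "needs_review"  # a person has to decide the next step


@dataclass
class Command:
    key: str                       # stable key so a repeated attempt is recognized
    action: str
    status: CommandStatus = CommandStatus.PENDING
    attempts: int = 0
    last_error: Optional[str] = None

    def allowed_actions(self) -> list[str]:
        """Return the safe next actions for the product or an operator."""
        if self.status is CommandStatus.COMPLETE:
            return []
        if self.status is CommandStatus.NEEDS_REVIEW:
            # Automatic retry is withheld until someone has looked at the item.
            return ["cancel", "rollback", "mark_resolved"]
        return ["retry", "cancel", "rollback"]
```

The useful property is that "what can happen next" is derived from stored state, so the product and the operator see the same answer.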

A system that cannot recover has a hidden limit: the first failure no one knows how to repair. More throughput above that limit only moves broken work faster.

A GitHub Integration Example

Imagine a product that connects to GitHub. A user asks it to create a pull request, attach some metadata, and update the local project status to "ready for review."

The happy path is easy to picture. The user clicks a button. The app calls GitHub. GitHub creates the pull request. The app stores the pull request URL. The UI updates.

The recovery path is where the real design starts.

What if GitHub creates the pull request, but your app times out before it receives the response? What if your app updates the local status first, then GitHub rejects the request? What if the user clicks twice because the first attempt looked stuck? What if GitHub sends a webhook before your own database has finished updating?

Without recovery design, the team ends up guessing. Maybe there is a pull request in GitHub but no record in the product. Maybe the product says "ready for review" even though no pull request exists. Maybe two pull requests were created. Maybe the only fix is a manual database edit.

With recovery design, the flow looks different. The app first stores a durable record of the user's command: "create a pull request for this project." That record gets a stable key, tied to the project and action, so the same command can be recognized if it is tried again. The app then calls GitHub. If the call succeeds, the command is marked complete with the pull request URL. If the call fails or times out, the command stays visible as pending or failed.

Now the product has a recovery mode. It can show "GitHub sync pending" instead of lying about the final state. It can check GitHub to see whether the pull request was actually created. It can retry the same command without creating a duplicate. If the local status moved too far ahead, it can roll back that local change to the last safe state.
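A minimal sketch of that store-then-call flow follows, using Python's built-in `sqlite3` module as a stand-in for the product database. The table layout, the key format, and the functions `open_store`, `set_status`, and `create_pull_request` are assumptions for illustration, and `call_github` is a hypothetical callable supplied by the caller, not a real GitHub client. The sketch also assumes one such command per project.

```python
import sqlite3
from typing import Callable, Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory command table; a real product would use its own database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE commands ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " pr_url TEXT,"
        " last_error TEXT)"
    )
    return db


def set_status(db: sqlite3.Connection, key: str, status: str,
               pr_url: Optional[str] = None, last_error: Optional[str] = None) -> None:
    db.execute(
        "UPDATE commands SET status = ?, pr_url = ?, last_error = ? WHERE key = ?",
        (status, pr_url, last_error, key),
    )
    db.commit()


def create_pull_request(db: sqlite3.Connection, project_id: str,
                        call_github: Callable[[str], str]) -> str:
    """Record the command first, then call the outside service. Returns the status."""
    key = f"{project_id}:create_pull_request"  # stable key: project plus action
    try:
        db.execute("INSERT INTO commands (key, status) VALUES (?, 'pending')", (key,))
        db.commit()
    except sqlite3.IntegrityError:
        # The same command is already recorded (for example, a second click).
        # Report its current state instead of calling the outside service again.
        db.rollback()
        return db.execute(
            "SELECT status FROM commands WHERE key = ?", (key,)
        ).fetchone()[0]
    try:
        pr_url = call_github(project_id)
    except TimeoutError as exc:
        # Outcome unknown: the pull request may or may not exist. Stay pending.
        set_status(db, key, "pending", last_error=str(exc))
        return "pending"
    except Exception as exc:
        set_status(db, key, "failed", last_error=str(exc))
        return "failed"
    set_status(db, key, "complete", pr_url=pr_url)
    return "complete"
```

A timeout is deliberately recorded as pending rather than failed, because the product does not yet know whether GitHub created the pull request.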

That is the shape to aim for: do the local work in a way that can be confirmed, retried, or rolled back before it leaks into the rest of the product.

Design The Try-Again Path

Retries are often treated as a small technical detail. Add a delay, try three times, and move on.

That is not enough. A retry is only safe when the system knows what it is retrying.

If the operation creates a GitHub pull request, the retry should be tied to the user's original command, not to whichever HTTP request happened to carry it. If the operation sends a notification, the system should know whether sending two notifications is acceptable. If the operation changes account access, the retry must not grant access twice or skip a required check.

A practical baseline is:

Give each user command a stable key.
Store the command before doing work that cannot be easily undone.
Make repeated attempts land on the same result when possible.
Separate "try again later" from "this request was rejected."
Make failed commands searchable by customer, project, or product object.

The point is not to make every action perfectly reversible. The point is to make the next attempt understandable.
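One way to sketch that baseline is a retry loop that passes the stable key to the work function and treats "try again later" and "rejected" as different outcomes. The exception names `RetryLater` and `Rejected`, the function `run_with_retries`, and the dictionary fields are all assumptions for this example; how a real outside service signals each case will differ.

```python
import time
from typing import Callable


class RetryLater(Exception):
    """Temporary problem such as a timeout; the same command may be tried again."""


class Rejected(Exception):
    """The outside service refused the request; retrying will not help."""


def run_with_retries(command: dict, do_work: Callable[[str], str],
                     max_attempts: int = 3, base_delay: float = 0.5) -> dict:
    """Run one stored command, recording every attempt on the command itself."""
    for attempt in range(1, max_attempts + 1):
        command["attempts"] = attempt
        try:
            # The stable key travels with the work so the receiver can
            # recognize a repeated attempt instead of creating a duplicate.
            command["result"] = do_work(command["key"])
            command["status"] = "complete"
            return command
        except Rejected as exc:
            command["status"] = "rejected"
            command["last_error"] = str(exc)
            return command  # stop immediately: this is not a retry case
        except RetryLater as exc:
            command["status"] = "pending"
            command["last_error"] = str(exc)
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    command["status"] = "failed"  # out of attempts; stays visible for follow-up
    return command
```

The loop never discards the command. Whatever happens, the record ends up in a named state with the last error and the attempt count attached.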

Make Rollback Local And Bounded

Rollback is safest when it is small.

In the GitHub example, a local-first rollback means the product can move its own state back to a safe local version without pretending it controls GitHub. If creating the pull request fails, the app can change the local project from "ready for review" back to "draft" or "GitHub sync failed." If GitHub actually created the pull request but the product did not record it, the product can reconcile that one command instead of scanning every repository.
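Reconciling a single command might look like the sketch below. Both `find_pull_request` and `create_pull_request` are hypothetical callables passed in by the caller; how the lookup would actually identify a matching pull request on GitHub is left open here, and the command fields are assumptions for the example.

```python
from typing import Callable, Optional


def reconcile(command: dict,
              find_pull_request: Callable[[str], Optional[str]],
              create_pull_request: Callable[[str], str]) -> dict:
    """Settle one unfinished command without creating a duplicate pull request.

    Returns an updated copy; the original record is left untouched.
    """
    if command["status"] == "complete":
        return command
    # Ask the outside service first: an earlier attempt may have succeeded
    # even though the product never saw the response.
    existing_url = find_pull_request(command["key"])
    if existing_url is not None:
        return {**command, "status": "complete", "pr_url": existing_url}
    # Nothing exists on the other side, so trying again is safe.
    try:
        url = create_pull_request(command["key"])
    except Exception as exc:
        return {**command, "status": "failed", "last_error": str(exc)}
    return {**command, "status": "complete", "pr_url": url}
```

The scope is one command and one lookup, which is what keeps the repair bounded.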

The important boundary is scope. Recovery should usually answer a question like: "What happened to this project, this command, this pull request, or this customer?"

It should not require a broad repair script that might touch the entire customer base.

Good rollback records enough history to make the safe move obvious. What was the previous local status? Which user started the action? Which external request was attempted? Did the outside service return an identifier? Has anyone already retried it?

When that information is stored with the command, rollback becomes a normal product behavior instead of a risky production maneuver.
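As a rough sketch, a command that carries its own history can make the rollback decision mechanical. The `SyncCommand` fields mirror the questions above, but the field names, the status strings, and the rule "do not roll back if the outside service returned an identifier" are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncCommand:
    key: str
    project_id: str
    started_by: str                      # which user started the action
    previous_local_status: str           # where a rollback should land
    external_id: Optional[str] = None    # identifier from the outside service, if any
    attempts: int = 0                    # how many times it has been tried
    status: str = "pending"


def roll_back_local(command: SyncCommand, projects: dict[str, str]) -> bool:
    """Restore one project's local status. Returns False when rollback is not the safe move."""
    if command.external_id is not None:
        # The outside service did create something. Undoing the local state
        # would hide that, so hand the item to a person instead.
        command.status = "needs_review"
        return False
    # Only this command's project is touched; nothing else is scanned or changed.
    projects[command.project_id] = command.previous_local_status
    command.status = "rolled_back"
    return True
```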

Keep Failed Work Visible

A failed job hidden in logs is not really recoverable. Someone has to know it exists, understand what it affected, and decide what to do next.

Recoverable systems give failed work a durable shape. A GitHub sync failure should be visible as a record with a project, a user, a timestamp, a last error, and an allowed next action. It should not exist only as a stack trace from last Tuesday.

This does not require a large internal platform. A small admin view over failed commands is often enough. The view should answer plain questions:

What failed?
Who or what did it affect?
Is it safe to retry?
Did the outside service receive the request?
What happened last time we tried?
Can this be canceled, rolled back, or marked resolved?

The goal is not to remove human judgment. The goal is to give human judgment a safe surface to work on.
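The data behind such a view can be a single query over stored commands. The sketch below uses Python's built-in `sqlite3` module; the `sync_commands` table, its columns, and the two derived flags are assumptions for illustration. In particular, treating a stored `external_id` as confirmation that the outside service received the request is a simplification.

```python
import sqlite3
from typing import Optional

# Hypothetical table: one row per command.
SCHEMA = """
CREATE TABLE sync_commands (
    key TEXT PRIMARY KEY,
    project_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    status TEXT NOT NULL,
    last_error TEXT,
    external_id TEXT,
    attempts INTEGER NOT NULL DEFAULT 0,
    updated_at TEXT NOT NULL
)
"""

COLUMNS = ["key", "project_id", "user_id", "status",
           "last_error", "external_id", "attempts", "updated_at"]


def failed_work(db: sqlite3.Connection, project_id: Optional[str] = None) -> list[dict]:
    """List unfinished commands, newest first, optionally for one project."""
    query = (
        f"SELECT {', '.join(COLUMNS)} FROM sync_commands "
        "WHERE status IN ('failed', 'needs_review')"
    )
    params: list[str] = []
    if project_id is not None:
        query += " AND project_id = ?"
        params.append(project_id)
    query += " ORDER BY updated_at DESC"
    items = []
    for row in db.execute(query, params).fetchall():
        item = dict(zip(COLUMNS, row))
        # Derived answers to the plain questions an operator asks.
        item["outside_service_confirmed"] = item["external_id"] is not None
        item["safe_to_retry"] = item["status"] == "failed"
        items.append(item)
    return items
```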

Decide What Can Wait

A system that cannot recover often tries to stay fully available until it damages state. A recoverable system knows which work must stop and which work can wait.

If GitHub is unavailable, the product might still let users keep editing locally while marking GitHub sync as pending. If search indexing is down, writes can continue while search results become temporarily stale. If analytics is delayed, the core user workflow should usually continue.

Each outside dependency needs a clear failure posture:

Block: do not continue because the action would be unsafe.
Defer: accept the user's action and finish the outside work later.
Skip: continue and record that a noncritical side effect did not happen.
Read stale: show the last known value with a clear freshness boundary.

A timeout value is not a policy. The policy is the product decision: what should the user see, what state should be stored, and what work should be waiting for recovery?
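One way to make that policy explicit is a small table that maps each dependency to a posture, and each posture to a concrete outcome. The dependency names, the user-facing messages, and the `Outcome` fields below are hypothetical; the choice to block when a dependency has no declared posture is also an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    BLOCK = "block"
    DEFER = "defer"
    SKIP = "skip"
    READ_STALE = "read_stale"


@dataclass
class Outcome:
    user_message: str          # what the user should see
    proceed: bool              # whether the user's action continues
    queue_for_recovery: bool   # whether outside work waits to be finished later
    record_skip: bool          # whether to note a side effect that did not happen


# Hypothetical policy table: one declared posture per outside dependency.
POSTURES = {
    "access_check": Posture.BLOCK,
    "github_sync": Posture.DEFER,
    "analytics": Posture.SKIP,
    "search": Posture.READ_STALE,
}


def on_dependency_failure(dependency: str) -> Outcome:
    """Translate a dependency failure into the product decision for it."""
    posture = POSTURES.get(dependency, Posture.BLOCK)  # undeclared: fail closed
    if posture is Posture.BLOCK:
        return Outcome("This action cannot be completed right now.", False, False, False)
    if posture is Posture.DEFER:
        return Outcome("Saved. GitHub sync pending.", True, True, False)
    if posture is Posture.SKIP:
        return Outcome("Saved.", True, False, True)
    return Outcome("Showing the last known results.", True, False, False)
```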

Build Repair Into The Product

Many teams eventually collect private repair paths: console snippets, one-off SQL updates, local scripts, admin-only endpoints, and messages that say "run this after deploy."

Some of that is unavoidable. But when every repair is improvised, recovery has not been modeled.

The better version is to make repair part of the product's internal surface. Failed work should have state, ownership, timestamps, last error, retry eligibility, and a small set of safe actions. Operators should be able to inspect why an item failed, retry it, cancel it, roll back a local change, or attach a note. Dangerous actions should be scoped and logged.

For the GitHub integration, this might be as simple as an admin page for sync commands. One row per command. One project. One GitHub repository. One status. One last error. One retry button that runs the same code path as the original user action.

That last part matters. Repair scripts drift. They skip checks, bypass normal rules, and encode one person's memory into code nobody wants to run twice. Whenever possible, repair should call the same service boundary the product uses.
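A sketch of that idea: the user action and the operator's retry button both go through one private method, so the repair path cannot skip checks that the normal path applies. `SyncService`, its method names, and the in-memory dictionaries are assumptions for illustration, and `github_call` is a hypothetical callable standing in for the real integration.

```python
from typing import Callable


class SyncService:
    """One service boundary shared by the product and by operator repair."""

    def __init__(self, github_call: Callable[[str], str]):
        self.github_call = github_call
        self.commands: dict[str, dict] = {}
        self.audit_log: list[tuple[str, str, str]] = []

    def submit(self, project_id: str, actor: str) -> dict:
        """Entry point for the original user action."""
        key = f"{project_id}:create_pull_request"
        command = self.commands.setdefault(
            key,
            {"key": key, "project_id": project_id, "status": "pending",
             "pr_url": None, "last_error": None},
        )
        return self._run(command, actor, source="user")

    def retry(self, key: str, operator: str) -> dict:
        """Entry point for the admin retry button."""
        command = self.commands[key]
        if command["status"] != "failed":
            raise ValueError(f"{key} is {command['status']}, not eligible for retry")
        return self._run(command, operator, source="operator_retry")

    def _run(self, command: dict, actor: str, source: str) -> dict:
        # Every attempt is logged with who triggered it and from where.
        self.audit_log.append((source, actor, command["key"]))
        if command["status"] == "complete":
            return command  # already done; do not call the outside service again
        try:
            command["pr_url"] = self.github_call(command["project_id"])
        except Exception as exc:
            command["status"] = "failed"
            command["last_error"] = str(exc)
        else:
            command["status"] = "complete"
            command["last_error"] = None
        return command
```

Because `retry` adds only an eligibility check before delegating, any rule added to `_run` later applies to repairs automatically.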

Recovery Changes What Scale Means

Once recovery is designed first, scale work becomes more honest.

More workers are useful because retrying the same command is safe. More queue throughput is useful because failed items do not disappear. More caching is useful because stale data has a defined boundary. More automation is useful because operators can see and correct what automation could not finish.

The order matters. If you scale before recovery, every optimization increases the number of partial states the team does not understand. If you design recovery first, scale is no longer a bet that failures stay rare. It is a way to process more work through a system that already knows how to get back to a safe state.

The first version does not need elaborate machinery. It needs stored commands, safe retries, visible failed work, bounded rollback, clear fallback behavior, and repair tools that match the product's real workflows.

Those are not afterthoughts. They are the architecture.