
Break-glass procedures: a practical guide for small ops teams

"In case of emergency, break glass" sounds tidy. In practice, half the break-glass procedures at small companies are a single password in a sealed envelope nobody's checked in two years, in an office nobody goes to anymore. Here's how to build one that holds up.

10 min read · Updated 2026-05-25

TL;DR

A break-glass credential is a tier-zero secret that grants unusual authority, intended for rare, deliberate emergency use. The procedure should require collaboration (not a single keyholder), produce an audit trail (the act of using it is itself an event), and trigger automatic rotation afterward. Threshold-share the credential, store portions with on-call rotations and leadership, rehearse quarterly, and rotate every time the glass actually breaks.

[Figure: the break-glass cycle in five steps. Step 1: trigger condition met. Step 2: convene the threshold. Step 3: reconstruct on a clean device. Step 4: use in-scope only. Step 5: rotate, re-split, retire. Every step is recorded; the act of running this procedure is itself an event.]
The procedure is loud by design. Every step generates signal — channel post, incident ticket, audit log — and the cycle closes with mandatory rotation so the same credential is never used twice.

What "break-glass" actually means

A break-glass credential exists to handle scenarios where your normal access path is broken: your IdP is down, your usual admin accounts are locked or compromised, your bastion host is unreachable, your CI/CD pipeline has stopped issuing credentials, your auto-renewing certificates have lapsed. In those moments you need a way in that doesn't depend on the same infrastructure that's currently on fire.

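Three properties define a real break-glass procedure: it requires collaboration rather than a single keyholder, it produces an audit trail because using it is itself an event, and it triggers rotation afterward.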

Why "the admin password in a sealed envelope" fails

The classic small-shop break-glass is a printed credential in a sealed envelope in a safe in the office. The envelope is signed across the seal so tampering is detectable. The procedure is: open the envelope, log in, rotate the credential, reseal a new envelope.

It fails for predictable reasons:

The shape of a better procedure

A break-glass procedure that actually works in a small ops team has four moving parts:

  1. The credential is threshold-shared — typically 2-of-3 or 2-of-4 — so any one person convening with one teammate can reach it, but no single person can.
  2. Portions are distributed across roles and locations that fail independently: primary on-call, secondary on-call, leadership, off-team custodian.
  3. The act of reconstructing is itself an event — recorded in an incident channel, ticketed in your incident tracker, time-stamped by the tool used to combine portions.
  4. Rotation is mandatory after every use, automated or runbook-driven, so the same break-glass credential is never used twice.
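
One way to picture the threshold-sharing step is Shamir's secret sharing, sketched here in Python. This is an illustration under stated assumptions, not the method any particular tool uses: the names `split` and `combine`, the choice of field prime, and the sample secret are all hypothetical. For a real credential, an audited tool is a safer choice than hand-rolled code.

```python
import secrets

# Assumption: 2**521 - 1 (a Mersenne prime) as the field modulus.
# The secret, read as an integer, must be smaller than this prime.
PRIME = 2**521 - 1


def split(secret: bytes, threshold: int, portions: int) -> list[tuple[int, int]]:
    """Split `secret` into `portions` points; any `threshold` of them recover it."""
    if not 1 < threshold <= portions:
        raise ValueError("need 1 < threshold <= portions")
    value = int.from_bytes(secret, "big")
    if value >= PRIME:
        raise ValueError("secret too long for this field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, portions + 1)]


def combine(points: list[tuple[int, int]], length: int) -> bytes:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total.to_bytes(length, "big")


# Hypothetical 2-of-3 split: any two portions recover the secret.
secret = b"example-root-password"
shares = split(secret, threshold=2, portions=3)
recovered = combine([shares[0], shares[2]], len(secret))
```

With a threshold of two, a single portion is just one point on a random line, which is consistent with every possible secret. That is the property the "no single person can reach it" rule relies on.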

Picking a threshold

2-of-3 — the small-team default

Three portions: primary on-call, secondary on-call, and an engineering manager or VP. Any two can recover the credential. This survives one custodian being unreachable, asleep, or compromised, and suits ops teams of 5 to 20 people.

2-of-4 — when on-call coverage is uneven

Useful when your "secondary on-call" is sometimes a single person whose own availability isn't reliable. Adds a fourth custodian (a leader or peer team's on-call) for redundancy without raising the threshold above two. The two-person ceremony stays the same; you just have more options for the second person.

3-of-5 — for higher-stakes credentials

For tier-zero break-glass (e.g. the database root credential whose misuse could exfiltrate every customer's data), raising the threshold to three is appropriate. The tradeoff: slower convening, harder to execute under genuine time pressure. Use this for credentials whose misuse is worse than their unavailability.

Heuristic: if silent use of the secret would be worse than delayed use, raise the threshold. If you would rather a tired engineer and one teammate reach it in three minutes than wait fifteen for a larger quorum, lower it.
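
To put rough numbers on the availability side of that tradeoff, here is a small sketch that computes the chance of reaching a quorum. It assumes each custodian is reachable independently with the same probability. The 0.8 figure is made up for illustration, and the function name `quorum_probability` is hypothetical.

```python
from math import comb


def quorum_probability(threshold: int, custodians: int, p_reachable: float) -> float:
    """P(at least `threshold` of `custodians` respond), assuming independence."""
    return sum(
        comb(custodians, k) * p_reachable**k * (1 - p_reachable) ** (custodians - k)
        for k in range(threshold, custodians + 1)
    )


# Assumed reachability of 0.8 per custodian; substitute your own estimate.
for k, n in [(2, 3), (2, 4), (3, 5)]:
    print(f"{k}-of-{n}: {quorum_probability(k, n, 0.8):.3f}")
# 2-of-3: 0.896
# 2-of-4: 0.973
# 3-of-5: 0.942
```

This models only whether a quorum can be reached at all, not how long convening takes. A 3-of-5 scheme can score well here and still be slower under time pressure, because three people have to respond instead of two.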

Custodian selection

Bad custodian choices kill break-glass procedures quietly. Some lessons from the wreckage:

How portions actually live on people's machines

The point of a break-glass is reachability under stress. The portions should be storable somewhere a sleepy engineer can find them at 3am — without being so accessible that they leak.

Practical patterns:

What to avoid: portions stored in the same SSO/IdP-protected system the break-glass exists to bypass. If your normal access is down, your portion storage should still be reachable.
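
That rule can be checked mechanically if you keep an inventory of where each portion lives. The sketch below is one possible shape. The custodian names, storage systems, and the function `depends_on_bypassed` are all hypothetical.

```python
# Hypothetical inventory: custodian -> systems their portion storage depends on.
PORTION_STORAGE = {
    "primary-oncall": {"local-password-manager"},
    "secondary-oncall": {"company-sso", "cloud-drive"},
    "eng-manager": {"printed-copy"},
}


def depends_on_bypassed(storage: dict[str, set[str]], bypassed: set[str]) -> list[str]:
    """Return custodians whose portion storage relies on a system break-glass bypasses."""
    return sorted(name for name, deps in storage.items() if deps & bypassed)


print(depends_on_bypassed(PORTION_STORAGE, {"company-sso"}))
# ['secondary-oncall']
```

Any custodian the check returns holds a portion that would become unreachable in the same outage the break-glass is meant to survive.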

The runbook

Write the runbook for someone sleep-deprived, mildly panicked, and possibly not the most senior person on the team. The shape that holds up:

  1. Trigger conditions. Explicit: "use this if you cannot reach AWS via SSO and the SSO outage has been confirmed for 15+ minutes" — not "use in case of emergency."
  2. Authorization. Who is allowed to invoke; whose permission, if any, is required first; how that permission is recorded.
  3. How to convene the threshold. Which two (or three) custodians; primary contact channels; alternates if a primary is unreachable within X minutes.
  4. How to reconstruct. Which tool (e.g. shattr's decrypt) on which type of device (a clean machine with a recent browser version).
  5. What to do with the recovered credential. Use only for the in-scope action. Do not save. Do not paste into messaging tools.
  6. Mandatory post-use steps. Rotate the credential. Re-split. Redistribute portions. File the incident report.
  7. Escalation path if reconstruction fails: who to call, in what order, with what authority to bypass the procedure.
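
An explicit trigger condition like the one in step 1 is precise enough to write down as a check. The sketch below encodes that example. The function name `trigger_met` and its parameters are assumptions for illustration, and the 15-minute value comes from the sample wording above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Taken from the example trigger in step 1; adjust to your own runbook.
MIN_CONFIRMED_OUTAGE = timedelta(minutes=15)


def trigger_met(sso_reachable: bool, outage_confirmed_at: Optional[datetime], now: datetime) -> bool:
    """True only if SSO is down and the outage was confirmed 15+ minutes ago."""
    if sso_reachable or outage_confirmed_at is None:
        return False
    return now - outage_confirmed_at >= MIN_CONFIRMED_OUTAGE


confirmed = datetime(2026, 1, 1, 3, 0, tzinfo=timezone.utc)
too_early = trigger_met(False, confirmed, confirmed + timedelta(minutes=10))
allowed = trigger_met(False, confirmed, confirmed + timedelta(minutes=15))
```

Here `too_early` is False and `allowed` is True. The point is less the code than the test it forces: if a trigger cannot be written as a yes-or-no check, it is probably still "use in case of emergency" in disguise.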

Print one copy. Keep one in your incident-management system. Keep one in your team's wiki. All three should match.
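
Checking that the copies match is easy to automate for the two digital ones. A minimal sketch follows, assuming the runbook is stored as plain text in each place. The function names and the whitespace normalization are illustrative choices, not a required format.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 of the runbook text, ignoring trailing whitespace on each line."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def copies_match(copies: dict[str, str]) -> bool:
    """True if every copy has the same fingerprint."""
    return len({fingerprint(text) for text in copies.values()}) == 1


# Hypothetical copies: the wiki version has drifted from the incident system.
copies = {
    "incident-system": "1. Trigger conditions\n2. Authorization\n",
    "wiki": "1. Trigger conditions\n2. Authorisation (old draft)\n",
}
in_sync = copies_match(copies)
```

Printing the fingerprint in the footer of the paper copy gives you a quick way to compare it against the digital versions during a rehearsal.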

Audit, noise, and the "this is itself an event" property

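The use of a break-glass should generate signal the rest of the organization can see. Specifically, each step should leave a post in the incident channel, a ticket in the incident tracker, and an entry in the audit log.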

The goal isn't to make break-glass painful. It's to make it loud — so a malicious or compromised user can't quietly trigger it without anyone noticing.
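
One way to make the procedure loud is to send the same structured event to every destination at each step. The sketch below assumes three destinations (incident channel, incident tracker, audit log) and stands in plain Python lists for them. The `announce` function and the event fields are hypothetical, not any real tool's API.

```python
import json
from typing import Callable


def announce(step: str, actor: str, timestamp: str,
             sinks: dict[str, Callable[[str], None]]) -> str:
    """Send one identical break-glass event to every sink and return it."""
    event = json.dumps(
        {"event": "break-glass", "step": step, "actor": actor, "timestamp": timestamp},
        sort_keys=True,
    )
    for send in sinks.values():
        send(event)
    return event


# Stand-ins for real integrations; each list just collects what it was sent.
channel, tickets, audit_log = [], [], []
announce(
    "reconstruct",
    "primary-oncall",
    "2026-01-01T03:20:00Z",
    {
        "incident-channel": channel.append,
        "incident-tracker": tickets.append,
        "audit-log": audit_log.append,
    },
)
```

In a real setup at least one of these destinations should not depend on the systems that are down, otherwise the signal disappears in the same outage.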

Rehearsal — actually do this one

Quarterly, more often if your team is changing fast:

  1. Pick a non-prod credential or a freshly-generated test secret.
  2. Run the full procedure end-to-end. Trigger the runbook. Convene the threshold. Reconstruct. "Use" the credential (or simulate use). Rotate. Redistribute portions.
  3. Time it. Note where the bottlenecks were: a custodian who didn't see the page, a runbook step that referenced a tool nobody had installed, an incident-channel name that had changed.
  4. Fix the bottlenecks. Update the runbook. Re-run the most-broken segment to verify the fix.

Teams that have never rehearsed a break-glass procedure cannot, in practice, use it under load. The drill is the difference between a security control and a wish.
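
For the "time it" step, a tiny timer that records each segment of the drill makes the bottleneck obvious. This is a sketch only: the `DrillTimer` class and the step names are made up, and the `sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Record how long each named step of a rehearsal takes."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def step(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self.durations[name] = time.monotonic() - start

    def bottleneck(self) -> str:
        """Name of the slowest recorded step."""
        return max(self.durations, key=self.durations.get)


timer = DrillTimer()
with timer.step("convene"):
    time.sleep(0.05)  # placeholder for paging custodians
with timer.step("reconstruct"):
    time.sleep(0.01)  # placeholder for combining portions
slowest = timer.bottleneck()
```

Keeping the recorded durations from each quarter lets you see whether a fix actually shortened the step it targeted.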

After the glass breaks: rotation and review

Every real use of a break-glass should end with:

  1. Rotate the credential immediately. The reconstructed value should not be in active use beyond the in-scope action.
  2. Re-split the new credential and redistribute portions. Old portions are discarded (you'll never reconstruct the old credential again).
  3. Postmortem the underlying outage. Why did normal access fail? What would prevent that next time? Was break-glass the right call, or did we miss a less-disruptive option?
  4. Review whether the procedure itself behaved as expected. Was anything in the runbook stale? Was a custodian unreachable? Where did latency creep in?
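
The four post-use steps can be tracked as a checklist that blocks closing the incident until all are done. Below is a minimal sketch, with hypothetical step labels and function names. The credential generator uses Python's standard `secrets` module; how the new value is installed depends on the system being protected.

```python
import secrets

# Labels for the four post-use steps listed above (names are illustrative).
POST_USE_STEPS = ("rotate", "re-split", "postmortem", "procedure-review")


def new_credential(nbytes: int = 32) -> str:
    """Generate a fresh random credential; the old value is never reused."""
    return secrets.token_urlsafe(nbytes)


def outstanding(done: set[str]) -> list[str]:
    """Return the post-use steps not yet completed, in order."""
    unknown = done - set(POST_USE_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in POST_USE_STEPS if step not in done]


print(outstanding({"rotate", "re-split"}))
# ['postmortem', 'procedure-review']
```

An incident would only be closed once `outstanding` returns an empty list.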

Build a real break-glass in an afternoon

Pick one tier-zero credential. Split it 2-of-3 in your browser, hand out the portions, write the runbook, and run a tabletop next week.