Break-glass procedures: a practical guide for small ops teams
"In case of emergency, break glass" sounds tidy. In practice, half the break-glass procedures at small companies are a single password in a sealed envelope nobody's checked in two years, in an office nobody goes to anymore. Here's how to build one that holds up.
TL;DR
A break-glass credential is a tier-zero secret that grants unusual authority, intended for rare, deliberate emergency use. The procedure should require collaboration (not a single keyholder), produce an audit trail (the act of using it is itself an event), and trigger automatic rotation afterward. Threshold-share the credential, store portions with on-call rotations and leadership, rehearse quarterly, and rotate every time the glass actually breaks.
What "break-glass" actually means
A break-glass credential exists to handle scenarios where your normal access path is broken: your IdP is down, your usual admin accounts are locked or compromised, your bastion host is unreachable, your CI/CD pipeline has stopped issuing credentials, your auto-renewing certificates have lapsed. In those moments you need a way in that doesn't depend on the same infrastructure that's currently on fire.
Three properties define a real break-glass procedure:
- It bypasses normal access controls — that's its purpose. A break-glass that still requires the usual SSO is not break-glass.
- It's rare. If it's used weekly, it's an access path, not break-glass. The whole rationale rests on each use being an exceptional, noisy event.
- Its use is auditable and ceremonial. Anyone using it knows they're triggering something out-of-band; the rest of the team knows when it happens.
Why "the admin password in a sealed envelope" fails
The classic small-shop break-glass is a printed credential in a sealed envelope in a safe in the office. The envelope is signed across the seal so tampering is detectable. The procedure is: open the envelope, log in, rotate the credential, reseal a new envelope.
It fails for predictable reasons:
- Geography: a remote-first or distributed team can't physically reach the envelope at 3am from another time zone.
- One-person control: the person who reaches the envelope has unilateral authority. No two-person ceremony. No friction.
- Decay: credentials change; envelopes don't get reissued; the envelope's contents drift out of date silently.
- No rehearsal: nobody's tested the envelope path. When the moment comes, the envelope contains last year's password.
- Audit trail is paper-only: opening the envelope generates no signal that any alerting or logging system can see.
The shape of a better procedure
A break-glass procedure that actually works in a small ops team has four moving parts:
- The credential is threshold-shared — typically 2-of-3 or 2-of-4 — so any one person convening with one teammate can reach it, but no single person can.
- Portions are distributed across roles and locations that fail independently: primary on-call, secondary on-call, leadership, off-team custodian.
- The act of reconstructing is itself an event — recorded in an incident channel, ticketed in your incident tracker, time-stamped by the tool used to combine portions.
- Rotation is mandatory after every use, automated or runbook-driven, so the same break-glass credential is never used twice.
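To make "threshold-shared" concrete, here is a toy sketch of Shamir's secret sharing in Python, assuming a 2-of-3 split over a prime field. The names `split` and `combine` and the example secret are illustrative, not taken from any particular tool. Treat it as a way to see the mechanics; for a real credential, use a vetted implementation.

```python
import secrets

# Assumption: a fixed prime field. 2**521 - 1 is a Mersenne prime,
# large enough for any secret of up to 65 bytes.
PRIME = 2**521 - 1


def split(secret: bytes, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` portions; any `threshold` of them recover it."""
    if not 2 <= threshold <= shares:
        raise ValueError("need 2 <= threshold <= shares")
    value = int.from_bytes(secret, "big")
    if value >= PRIME:
        raise ValueError("secret too long for this field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    # Each portion is a point (x, f(x)) with x = 1..shares.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine(portions: list[tuple[int, int]], length: int) -> bytes:
    """Recover the secret by Lagrange interpolation at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(portions):
        num, den = 1, 1
        for j, (xj, _) in enumerate(portions):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    # `length` restores any leading zero bytes lost in the integer encoding.
    return total.to_bytes(length, "big")
```

Calling `split(b"example-root-password", threshold=2, shares=3)` yields three portions, and passing any two of them to `combine` along with the secret's length returns the original bytes. A single portion on its own reveals nothing about the secret, which is the property the two-person ceremony depends on.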
Picking a threshold
2-of-3 — the small-team default
Three portions: primary on-call, secondary on-call, and an engineering manager or VP. Any two recover. Survives one custodian being unreachable, asleep, or compromised. Right for 5–20 person ops teams.
2-of-4 — when on-call coverage is uneven
Useful when your "secondary on-call" is sometimes a single person whose own availability isn't reliable. Adds a fourth custodian (a leader or peer team's on-call) for redundancy without raising the threshold above two. The two-person ceremony stays the same; you just have more options for the second person.
3-of-5 — for higher-stakes credentials
For tier-zero break-glass (e.g. the database root credential whose misuse could exfiltrate every customer's data), raising the threshold to three is appropriate. The tradeoff: slower convening, harder to execute under genuine time pressure. Use this for credentials whose misuse is worse than their unavailability.
Heuristic: if the secret being protected is something whose silent use is worse than its delayed use, raise the threshold. If you'd rather a tired engineer reach it with one teammate in three minutes than wait fifteen for a larger quorum, lower it.
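One way to put rough numbers on the availability side of that tradeoff is a short binomial calculation. This is a sketch under two loud assumptions: every custodian is reachable with the same probability, and custodians fail independently. The function name `quorum_probability` and the 80% figure are hypothetical.

```python
from math import comb


def quorum_probability(threshold: int, custodians: int, p_reachable: float) -> float:
    """Probability that at least `threshold` of `custodians` can be reached.

    Assumes each custodian is reachable independently with probability
    `p_reachable`, which real on-call rotations only approximate.
    """
    return sum(
        comb(custodians, k) * p_reachable**k * (1 - p_reachable) ** (custodians - k)
        for k in range(threshold, custodians + 1)
    )


# Hypothetical reachability of 80% per custodian at 3am.
for threshold, custodians in [(2, 3), (2, 4), (3, 5)]:
    p = quorum_probability(threshold, custodians, 0.8)
    print(f"{threshold}-of-{custodians}: {p:.3f}")
```

With that hypothetical 80% figure, the results are roughly 0.896 for 2-of-3, 0.973 for 2-of-4, and 0.942 for 3-of-5. The calculation says nothing about how long convening takes, which is the cost the higher threshold actually adds.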
Custodian selection
Bad custodian choices kill break-glass procedures quietly. Some lessons from the wreckage:
- The primary on-call must hold a portion. If they don't, every break-glass starts with the primary waking someone up to ask for a portion. That latency makes the procedure fragile.
- Include at least one custodian outside the on-call rotation. Otherwise an outage that takes out an on-call's laptop and the secondary's WiFi simultaneously can leave you below threshold.
- Don't make the threshold "the manager and one direct report." Reporting lines create coercion paths; the manager can functionally demand the report's portion. Cross-functional custodians neutralize this.
- Rotate custodians when on-call rotations change. A custodian who has rotated off on-call still has their portion — that's fine for the next 30 days; longer than that, redistribute.
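The 30-day grace period in the last point is easy to check mechanically. The sketch below assumes you keep a simple roster recording when each custodian left the on-call rotation; the `Custodian` type, its fields, and the names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Grace period from the guidance above: a portion held by someone who has
# rotated off on-call is acceptable for 30 days, then should be redistributed.
GRACE_PERIOD = timedelta(days=30)


@dataclass
class Custodian:
    name: str
    # None means the custodian is still on rotation, or holds a portion
    # in a non-rotating role (for example, leadership).
    left_rotation_on: Optional[date] = None


def overdue_for_redistribution(custodians: list[Custodian], today: date) -> list[str]:
    """Return names of custodians who rotated off more than 30 days ago."""
    return [
        c.name
        for c in custodians
        if c.left_rotation_on is not None
        and today - c.left_rotation_on > GRACE_PERIOD
    ]
```

Run against the roster on a schedule, a non-empty result is the cue to re-split and redistribute. For example, a custodian who left the rotation on 1 May is flagged on 30 June, while one who left on 15 June is not.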
How portions actually live on people's machines
The point of a break-glass is reachability under stress. The portions should be storable somewhere a sleepy engineer can find them at 3am — without being so accessible that they leak.
Practical patterns:
- Personal password manager with hardware-key unlock, in a vault explicitly labeled "BREAK-GLASS · DO NOT TOUCH UNLESS INVOKED." Searchable, accessible from any of the holder's devices, gated behind the hardware key.
- Printed and tucked into an on-call go-bag for the primary, alongside the YubiKey and the cached runbook. This only makes sense for people who work from a physical desk.
- The portion itself can be a QR code so reconstruction is a phone-camera operation, not a typing operation. Useful when the engineer's laptop is the thing that's broken.
What to avoid: portions stored in the same SSO/IdP-protected system the break-glass exists to bypass. If your normal access is down, your portion storage should still be reachable.
The runbook
Write the runbook for someone sleep-deprived, mildly panicked, and possibly not the most senior person on the team. The shape that holds up:
- Trigger conditions. Explicit: "use this if you cannot reach AWS via SSO and the SSO outage has been confirmed for 15+ minutes" — not "use in case of emergency."
- Authorization. Who is allowed to invoke; whose permission, if any, is required first; how that permission is recorded.
- How to convene the threshold. Which two (or three) custodians; primary contact channels; alternates if a primary is unreachable within X minutes.
- How to reconstruct. Which tool (e.g. shattr's decrypt) on which type of device (a clean machine with a recent browser version).
- What to do with the recovered credential. Use only for the in-scope action. Do not save. Do not paste into messaging tools.
- Mandatory post-use steps. Rotate the credential. Re-split. Redistribute portions. File the incident report.
- Escalation path if reconstruction fails: who to call, in what order, with what authority to bypass the procedure.
Print one copy. Keep one in your incident-management system. Keep one in your team's wiki. All three should match.
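An explicit trigger condition like the SSO example in the runbook outline is precise enough to write down as a check. The sketch below assumes the condition "SSO unreachable and the outage confirmed for at least 15 minutes"; the function name and parameters are hypothetical, and how you determine reachability is left to your own monitoring.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 15-minute figure comes from the example trigger condition;
# substitute whatever your own runbook specifies.
MIN_CONFIRMED_OUTAGE = timedelta(minutes=15)


def trigger_met(
    sso_reachable: bool,
    outage_confirmed_at: Optional[datetime],
    now: datetime,
) -> bool:
    """True only if SSO is down and the outage has been confirmed long enough."""
    if sso_reachable or outage_confirmed_at is None:
        return False
    return now - outage_confirmed_at >= MIN_CONFIRMED_OUTAGE
```

The point is not to automate the decision. Writing the condition this way forces the runbook to name its inputs, so nobody has to interpret "emergency" at 3am.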
Audit, noise, and the "this is itself an event" property
The use of a break-glass should generate signal the rest of the organization can see. Specifically:
- An entry in your incident channel — automatic if your runbook says "post here first," manual otherwise.
- An incident ticket with the trigger condition, the people involved, the action taken, and the rotation status.
- A monitoring alert on the use of the credential itself, where the platform supports it (CloudTrail event on root login, audit log on database root, etc.).
- A retrospective after the fact, even if everything went smoothly. Break-glass uses are precious data about your normal access controls' failure modes.
The goal isn't to make break-glass painful. It's to make it loud — so a malicious or compromised user can't quietly trigger it without anyone noticing.
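A structured record makes the incident ticket consistent from one use to the next. This is a minimal sketch assuming the four ticket fields listed above plus a UTC timestamp; the `BreakGlassEvent` type and its field names are hypothetical, and posting the record to your incident channel or tracker is left out because it depends on your tooling.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class BreakGlassEvent:
    trigger_condition: str   # which runbook trigger was met
    people_involved: list    # custodians who convened
    action_taken: str        # the in-scope action the credential was used for
    rotation_status: str = "pending"
    # Recorded in UTC at creation so the record is time-stamped automatically.
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_incident_record(event: BreakGlassEvent) -> str:
    """Serialize the event as JSON for an incident ticket or channel post."""
    return json.dumps(asdict(event), indent=2, sort_keys=True)
```

Creating the event before reconstruction starts, with `rotation_status` left as "pending", gives the team a visible open item that only closes once rotation is done.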
Rehearsal — actually do this one
Quarterly, more often if your team is changing fast:
- Pick a non-prod credential or a freshly-generated test secret.
- Run the full procedure end-to-end. Trigger the runbook. Convene the threshold. Reconstruct. "Use" the credential (or simulate use). Rotate. Redistribute portions.
- Time it. Note where the bottlenecks were: a custodian who didn't see the page, a runbook step that referenced a tool nobody had installed, an incident-channel name that had changed.
- Fix the bottlenecks. Update the runbook. Re-run the most-broken segment to verify the fix.
Teams that have never rehearsed a break-glass procedure cannot, in practice, use it under load. The drill is the difference between a security control and a wish.
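Timing a drill does not need special tooling. The sketch below is one possible approach, assuming someone runs it on a laptop during the rehearsal and wraps each phase as it happens; the `DrillTimer` class and the step names are hypothetical.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records how long each rehearsal step takes, in seconds."""

    def __init__(self) -> None:
        self.durations: dict = {}

    @contextmanager
    def step(self, name: str):
        # monotonic() is unaffected by wall-clock adjustments mid-drill.
        start = time.monotonic()
        try:
            yield
        finally:
            self.durations[name] = time.monotonic() - start

    def bottleneck(self) -> str:
        """Name of the slowest recorded step."""
        return max(self.durations, key=self.durations.get)


timer = DrillTimer()
with timer.step("convene threshold"):
    pass  # in a real drill: page custodians and wait for two to respond
with timer.step("reconstruct"):
    pass  # in a real drill: combine portions with your chosen tool
```

After the drill, `timer.durations` holds the per-step timings and `timer.bottleneck()` names the segment to fix and re-run first.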
After the glass breaks: rotation and review
Every real use of a break-glass should end with:
- Rotate the credential immediately. The reconstructed value should not be in active use beyond the in-scope action.
- Re-split the new credential and redistribute portions. Old portions are discarded (you'll never reconstruct the old credential again).
- Postmortem the underlying outage. Why did normal access fail? What would prevent that next time? Was break-glass the right call, or did we miss a less-disruptive option?
- Review whether the procedure itself behaved as expected. Was anything in the runbook stale? Was a custodian unreachable? Where did latency creep in?
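The four closing steps above can be tracked as a simple gate on the incident ticket. This is a sketch only: the step labels are shorthand for the list above, and the function names are hypothetical.

```python
# Shorthand labels for the closing steps listed above, in order.
REQUIRED_STEPS = (
    "rotate credential",
    "re-split and redistribute portions",
    "postmortem underlying outage",
    "review procedure",
)


def outstanding_steps(completed: set) -> list:
    """Return the required steps not yet marked complete, in order."""
    return [step for step in REQUIRED_STEPS if step not in completed]


def can_close_incident(completed: set) -> bool:
    """The incident may close only when every required step is done."""
    return not outstanding_steps(completed)
```

For example, with only "rotate credential" marked complete, `can_close_incident` returns `False` and `outstanding_steps` lists the remaining three, which keeps the re-split from being forgotten once the outage itself is over.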
Build a real break-glass in an afternoon
Pick one tier-zero credential. Split it 2-of-3 in your browser, hand out the portions, write the runbook, and run a tabletop next week.