Break-glass procedures: a practical guide for small ops teams

"In case of emergency, break glass" sounds tidy. In practice, half the break-glass procedures at small companies are a single password in a sealed envelope nobody's checked in two years, in an office nobody goes to anymore. Here's how to build one that holds up.

10 min read · Updated 2026-05-25

TL;DR

A break-glass credential is a tier-zero secret that grants unusual authority, intended for rare, deliberate emergency use. The procedure should require collaboration (not a single keyholder), produce an audit trail (the act of using it is itself an event), and trigger automatic rotation afterward. Threshold-share the credential, store portions with on-call rotations and leadership, rehearse quarterly, and rotate every time the glass actually breaks.

The procedure is loud by design. Every step generates signal — channel post, incident ticket, audit log — and the cycle closes with mandatory rotation so the same credential is never used twice.

What "break-glass" actually means

A break-glass credential exists to handle scenarios where your normal access path is broken: your IdP is down, your usual admin accounts are locked or compromised, your bastion host is unreachable, your CI/CD pipeline has stopped issuing credentials, your auto-renewing certificates have lapsed. In those moments you need a way in that doesn't depend on the same infrastructure that's currently on fire.

Three properties define a real break-glass procedure:

It bypasses normal access controls — that's its purpose. A break-glass that still requires the usual SSO is not break-glass.
It's rare. If it's used weekly, it's an access path, not break-glass. The whole rationale rests on its being a noisy event.
Its use is auditable and ceremonial. Anyone using it knows they're triggering something out-of-band; the rest of the team knows when it happens.

Why "the admin password in a sealed envelope" fails

The classic small-shop break-glass is a printed credential in a sealed envelope in a safe in the office. The envelope is signed across the seal so tampering is detectable. The procedure is: open the envelope, log in, rotate the credential, reseal a new envelope.

It fails for predictable reasons:

Geography: a remote-first or distributed team can't physically reach the envelope at 3am from another time zone.
One-person control: the person who reaches the envelope has unilateral authority. No two-person ceremony. No friction.
Decay: credentials change; envelopes don't get reissued; the envelope's contents drift out of date silently.
No rehearsal: nobody's tested the envelope path. When the moment comes, the envelope contains last year's password.
Audit trail is paper-only: the act of opening doesn't generate a signal anywhere a non-physical alerting system can see.

The shape of a better procedure

A break-glass procedure that actually works in a small ops team has four moving parts:

The credential is threshold-shared — typically 2-of-3 or 2-of-4 — so any one person convening with one teammate can reach it, but no single person can.
Portions are distributed across roles and locations that fail independently: primary on-call, secondary on-call, leadership, off-team custodian.
The act of reconstructing is itself an event — recorded in an incident channel, ticketed in your incident tracker, time-stamped by the tool used to combine portions.
Rotation is mandatory after every use, automated or runbook-driven, so the same break-glass credential is never used twice.
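
To make "threshold-shared" concrete, here is a minimal sketch of Shamir's secret sharing, one common way to split a credential so that any k of n portions recover it. It is an illustration only, not the implementation of any particular tool: the function names, the choice of field, and the (x, y) portion format are assumptions, and a real deployment should use an audited implementation.

```python
import secrets

# Assumption: a prime field big enough for secrets up to 65 bytes.
# 2**521 - 1 is a Mersenne prime.
PRIME = 2**521 - 1


def split_secret(secret: bytes, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` portions; any `threshold` of them recover it."""
    if not 1 < threshold <= shares:
        raise ValueError("need 2 <= threshold <= shares")
    value = int.from_bytes(secret, "big")
    if value >= PRIME:
        raise ValueError("secret too large for the field")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    portions = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        portions.append((x, y))
    return portions


def recover_secret(portions: list[tuple[int, int]], length: int) -> bytes:
    """Lagrange-interpolate the polynomial at x = 0 from the supplied portions."""
    total = 0
    for i, (xi, yi) in enumerate(portions):
        num, den = 1, 1
        for j, (xj, _) in enumerate(portions):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total.to_bytes(length, "big")
```

With a 2-of-3 split, any pair of portions passed to `recover_secret` returns the original bytes, while a single portion on its own reveals nothing about the secret.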

Picking a threshold

2-of-3 — the small-team default

Three portions: primary on-call, secondary on-call, and an engineering manager or VP. Any two recover. Survives one custodian being unreachable, asleep, or compromised. Right for 5–20 person ops teams.

2-of-4 — when on-call coverage is uneven

Useful when your "secondary on-call" is sometimes a single person whose own availability isn't reliable. Adds a fourth custodian (a leader or peer team's on-call) for redundancy without raising the threshold above two. The two-person ceremony stays the same; you just have more options for the second person.

3-of-5 — for higher-stakes credentials

For the highest-stakes break-glass credentials (e.g. the database root credential whose misuse could exfiltrate every customer's data), raising the threshold to three is appropriate. The tradeoff: slower convening, harder to execute under genuine time pressure. Use this for credentials whose misuse is worse than their unavailability.

Heuristic: if the secret being protected is something whose silent use is worse than its delayed use, raise the threshold. If you'd rather a tired engineer and one teammate reach it in three minutes than wait fifteen for a larger quorum, keep it at two.
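
One way to put rough numbers on the availability side of that tradeoff is a binomial estimate of how often a quorum can be reached at all. The sketch below assumes each custodian is reachable independently with the same probability, which is a simplification (real outages correlate), and the 0.8 figure in the usage note is purely illustrative.

```python
from math import comb


def quorum_probability(threshold: int, custodians: int, p_reachable: float) -> float:
    """Chance that at least `threshold` of `custodians` can be reached,
    assuming each is independently reachable with probability `p_reachable`."""
    return sum(
        comb(custodians, k) * p_reachable**k * (1 - p_reachable) ** (custodians - k)
        for k in range(threshold, custodians + 1)
    )
```

At an assumed 0.8 reachability per custodian, this gives roughly 0.896 for 2-of-3, 0.973 for 2-of-4, and 0.942 for 3-of-5. It measures only whether a quorum exists, not how long convening three people takes compared with two.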

Custodian selection

Bad custodian choices kill break-glass procedures quietly. Some lessons from the wreckage:

The primary on-call must hold a portion. If they don't, every break-glass starts with the primary waking someone up to ask for a portion. That latency makes the procedure fragile.
Include at least one custodian outside the on-call rotation. Otherwise an outage that takes out an on-call's laptop and the secondary's WiFi simultaneously can leave you below threshold.
Don't make the threshold "the manager and one direct report." Reporting lines create coercion paths; the manager can functionally demand the report's portion. Cross-functional custodians neutralize this.
Rotate custodians when on-call rotations change. A custodian who has rotated off on-call still has their portion — that's fine for the next 30 days; longer than that, redistribute.
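
The first three rules above can be checked mechanically whenever the custodian list changes. This is a sketch under assumed names: the `Custodian` fields and the role label "primary-oncall" are hypothetical, and the reporting-line check is a strict reading that flags any manager who could reach the threshold with their own direct reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Custodian:
    name: str
    role: str                      # hypothetical labels, e.g. "primary-oncall"
    on_call: bool                  # currently in the on-call rotation?
    manager: Optional[str] = None  # name of this person's manager, if known


def check_custodians(custodians: list[Custodian], threshold: int) -> list[str]:
    """Return a list of human-readable problems; empty means the set passes."""
    problems = []
    if not any(c.role == "primary-oncall" for c in custodians):
        problems.append("primary on-call holds no portion")
    if all(c.on_call for c in custodians):
        problems.append("no custodian outside the on-call rotation")
    for m in custodians:
        reports = [c.name for c in custodians if c.manager == m.name]
        # A manager plus enough direct reports could meet the threshold alone.
        if reports and 1 + len(reports) >= threshold:
            problems.append(f"{m.name} can reach threshold with direct reports: {reports}")
    return problems
```

Running it in CI against the custodian roster, or by hand after each on-call change, turns the lessons above into something that fails loudly instead of quietly.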

How portions actually live on people's machines

The point of a break-glass is reachability under stress. The portions should be storable somewhere a sleepy engineer can find them at 3am — without being so accessible that they leak.

Practical patterns:

Personal password manager with hardware-key unlock, in a vault explicitly labeled "BREAK-GLASS · DO NOT TOUCH UNLESS INVOKED." Searchable, accessible from any of the holder's devices, gated behind the hardware key.
Printed and tucked into an on-call go-bag for the primary, alongside the YubiKey and the cached runbook. Worth doing only for people who work from physical desks, where a go-bag is meaningful.
The portion itself can be a QR code so reconstruction is a phone-camera operation, not a typing operation. Useful when the engineer's laptop is the thing that's broken.

What to avoid: portions stored in the same SSO/IdP-protected system the break-glass exists to bypass. If your normal access is down, your portion storage should still be reachable.

The runbook

Write the runbook for someone sleep-deprived, mildly panicked, and possibly not the most senior person on the team. The shape that holds up:

Trigger conditions. Explicit: "use this if you cannot reach AWS via SSO and the SSO outage has been confirmed for 15+ minutes" — not "use in case of emergency."
Authorization. Who is allowed to invoke; whose permission, if any, is required first; how that permission is recorded.
How to convene the threshold. Which two (or three) custodians; primary contact channels; alternates if a primary is unreachable within X minutes.
How to reconstruct. Which tool (e.g. shattr's decrypt) and on what kind of device: a clean machine with a recent browser version.
What to do with the recovered credential. Use only for the in-scope action. Do not save. Do not paste into messaging tools.
Mandatory post-use steps. Rotate the credential. Re-split. Redistribute portions. File the incident report.
Escalation path if reconstruction fails: who to call, in what order, with what authority to bypass the procedure.

Print one copy. Keep one in your incident-management system. Keep one in your team's wiki. All three should match.
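
"All three should match" is easy to verify if you compare digests of the runbook text rather than eyeballing it. A minimal sketch, assuming you can get each copy as plain text; the location names and the whitespace normalization are illustrative choices.

```python
import hashlib
from collections import Counter


def runbook_digest(text: str) -> str:
    """SHA-256 of the runbook with line endings and trailing spaces normalized,
    so cosmetic differences do not count as drift."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drifted_copies(copies: dict[str, str]) -> list[str]:
    """Names of copies whose content differs from the most common version.
    On a tie, the version seen first is treated as the reference."""
    if not copies:
        return []
    digests = {name: runbook_digest(text) for name, text in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority)
```

Writing the digest at the bottom of the printed copy gives the paper version a quick check against the other two.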

Audit, noise, and the "this is itself an event" property

The use of a break-glass should generate signal the rest of the organization can see. Specifically:

An entry in your incident channel: effectively automatic if the runbook's first step is "post here first," a manual afterthought otherwise.
An incident ticket with the trigger condition, the people involved, the action taken, and the rotation status.
A monitoring alert on the use of the credential itself, where the platform supports it (CloudTrail event on root login, audit log on database root, etc.).
A retrospective after the fact, even if everything went smoothly. Break-glass uses are precious data about your normal access controls' failure modes.

The goal isn't to make break-glass painful. It's to make it loud — so a malicious or compromised user can't quietly trigger it without anyone noticing.
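
As an illustration of the "monitoring alert on the credential itself" item, the sketch below filters already-parsed AWS CloudTrail records for root console sign-ins. It assumes the records arrive as Python dicts by some delivery path not shown here, and the alert wording is made up; only the `eventName`, `userIdentity.type`, `eventTime`, and `sourceIPAddress` fields come from CloudTrail's record format.

```python
def is_root_console_login(record: dict) -> bool:
    """True for a CloudTrail record describing a root-user console sign-in."""
    identity = record.get("userIdentity") or {}
    return record.get("eventName") == "ConsoleLogin" and identity.get("type") == "Root"


def break_glass_alerts(records: list[dict]) -> list[str]:
    """Build one alert line per root sign-in found in `records`."""
    return [
        f"BREAK-GLASS: root console login at {r.get('eventTime', 'unknown time')} "
        f"from {r.get('sourceIPAddress', 'unknown address')}"
        for r in records
        if is_root_console_login(r)
    ]
```

Posting these lines to the incident channel means the credential announces its own use even if the person holding it forgets to.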

Rehearsal — actually do this one

Quarterly, more often if your team is changing fast:

Pick a non-prod credential or a freshly generated test secret.
Run the full procedure end-to-end. Trigger the runbook. Convene the threshold. Reconstruct. "Use" the credential (or simulate use). Rotate. Redistribute portions.
Time it. Note where the bottlenecks were: a custodian who didn't see the page, a runbook step that referenced a tool nobody had installed, an incident-channel name that had changed.
Fix the bottlenecks. Update the runbook. Re-run the most-broken segment to verify the fix.
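
Timing the drill does not need special tooling. The sketch below is one hypothetical way to record how long each runbook step took and surface the slowest one; the class and step names are assumptions, and a stopwatch and a shared doc work just as well.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records wall-clock duration per named drill step."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the timer can be tested
        self.durations: dict[str, float] = {}

    @contextmanager
    def step(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the step raises, so failed steps still show up.
            self.durations[name] = self._clock() - start

    def bottleneck(self):
        """Name of the slowest recorded step, or None if nothing was timed."""
        return max(self.durations, key=self.durations.get, default=None)
```

Wrap each phase in `with timer.step("convene"):` during the rehearsal, then compare `timer.durations` against the previous quarter's numbers.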

Teams that have never rehearsed a break-glass procedure cannot, in practice, use it under load. The drill is the difference between a security control and a wish.

After the glass breaks: rotation and review

Every real use of a break-glass should end with:

Rotate the credential immediately. The reconstructed value should not be in active use beyond the in-scope action.
Re-split the new credential and redistribute portions. Old portions are discarded (you'll never reconstruct the old credential again).
Postmortem the underlying outage. Why did normal access fail? What would prevent that next time? Was break-glass the right call, or did we miss a less-disruptive option?
Review whether the procedure itself behaved as expected. Was anything in the runbook stale? Was a custodian unreachable? Where did latency creep in?
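
These closing steps are easy to skip once the outage is over, so it can help to gate incident closure on them. A small sketch, with hypothetical step labels; generating the replacement value is shown, but installing it on the target system is platform-specific and left out.

```python
import secrets

# Hypothetical labels for the post-use steps listed above.
POST_USE_STEPS = (
    "rotate-credential",
    "re-split",
    "redistribute-portions",
    "outage-postmortem",
    "procedure-review",
)


def new_credential(num_bytes: int = 32) -> str:
    """Generate a fresh random secret to replace the one that was reconstructed."""
    return secrets.token_urlsafe(num_bytes)


def outstanding_steps(completed: set[str]) -> list[str]:
    """Post-use steps not yet done, in the order they should happen."""
    unknown = completed - set(POST_USE_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [step for step in POST_USE_STEPS if step not in completed]


def can_close_incident(completed: set[str]) -> bool:
    """The incident stays open until every post-use step is recorded."""
    return not outstanding_steps(completed)
```

If your incident tracker supports required checklist items, the same list can live there instead of in code.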

Build a real break-glass in an afternoon

Pick one tier-zero credential. Split it 2-of-3 in your browser, hand out the portions, write the runbook, and run a tabletop next week.
