Shadow to Enforce: Rollout Playbook
By Platform Engineering
Teams break production when they jump directly to blocking behavior.
The safer path is a staged rollout with measurable gates. "We tested it a bit" is not a rollout strategy. For governance systems, promotion criteria need to be explicit before you touch live traffic.
The goal of shadow mode
Shadow mode is not just a comfort blanket. It answers a specific question:
If this policy had been active on real traffic, what would it have done?
That lets you evaluate blast radius before users or internal operators pay the price.
Recommended sequence
1. Replay and simulate first
Start with historical traces and targeted test cases. This catches obvious regressions before the policy sees current production traffic.
Look for:
- known risky inputs that should BLOCK
- malformed but safely fixable inputs that could MODIFY
- known-good tasks that must remain ALLOW
If replay is noisy or inconclusive, shadow mode will not magically solve that.
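The replay step can be sketched as a small harness that runs recorded cases against the candidate policy and reports mismatches. This is a minimal sketch, assuming a policy is callable and returns one of the decision strings above; `evaluate_policy`, the toy policy, and the case shapes are all hypothetical stand-ins for whatever your platform exposes.

```python
def evaluate_policy(policy, request):
    # Hypothetical entry point: returns "BLOCK", "MODIFY", or "ALLOW".
    return policy(request)

def replay(policy, cases):
    """Run historical or targeted cases; return (request, expected, actual) mismatches.

    `cases` pairs each request with the decision it should produce:
    known risky inputs -> BLOCK, fixable inputs -> MODIFY, known-good -> ALLOW.
    """
    mismatches = []
    for request, expected in cases:
        actual = evaluate_policy(policy, request)
        if actual != expected:
            mismatches.append((request, expected, actual))
    return mismatches

# Toy policy and cases, purely illustrative.
toy_policy = lambda req: "BLOCK" if req.get("risky") else "ALLOW"
cases = [
    ({"risky": True}, "BLOCK"),   # known risky input must be caught
    ({"risky": False}, "ALLOW"),  # known-good task must remain allowed
]
print(replay(toy_policy, cases))  # an empty list means the baseline passes
```

An empty mismatch list is the exit criterion for this phase; a noisy one is the signal to fix the policy or the cases before moving on.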
2. Run in shadow mode on live traffic
Once the candidate version looks sane offline, attach it in shadow mode. Let the platform record what it would have done while still forwarding traffic unchanged.
This phase is where you learn whether your policy understands the messiness of real usage:
- unexpected payload shapes
- missing context fields
- tool descriptions that are less precise than you assumed
- edge-case behavior that never appeared in test fixtures
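One way to picture shadow mode is a wrapper that records the would-be decision and then forwards the request unchanged. This is a sketch under assumed names: `handler` is today's production path, `policy` is the candidate evaluator, and neither is a real platform API.

```python
import json
import logging

log = logging.getLogger("policy.shadow")

def shadow_wrap(handler, policy):
    """Record what `policy` would have decided, then forward traffic unchanged."""
    def wrapped(request):
        try:
            decision = policy(request)
        except Exception:
            # A crashing policy in shadow must never affect live traffic.
            log.exception("shadow policy raised")
            decision = "ERROR"
        log.info("shadow decision: %s", json.dumps({"decision": decision}))
        return handler(request)  # the caller never sees the shadow result
    return wrapped
```

The key property is that the `return` value depends only on `handler`; the policy can be wrong, slow, or broken and the worst outcome is a bad log line, which is exactly the learning this phase is for.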
3. Enforce narrowly before enforcing broadly
The first active rollout should be scoped.
Good first targets:
- high-risk tools with well-understood inputs
- high-confidence rules with clear reason codes
- a single environment or workflow
Bad first targets:
- broad natural-language policies with ambiguous triggers
- business-critical workflows with no replay coverage
- every tool in production at once
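Scoped first enforcement can be as simple as an allow-list: enforce only for named high-risk tools in one environment, and stay in shadow everywhere else. The tool names and environment strings below are illustrative, not a real schema.

```python
# Hypothetical scope: two well-understood high-risk tools, one environment.
ENFORCED_TOOLS = {"delete_records", "transfer_funds"}
ENFORCED_ENV = "staging"

def mode_for(tool: str, env: str) -> str:
    """Return "enforce" only inside the scoped first rollout; "shadow" otherwise."""
    if env == ENFORCED_ENV and tool in ENFORCED_TOOLS:
        return "enforce"
    return "shadow"
```

Keeping the scope in data rather than scattered conditionals makes the later widening step a reviewable diff instead of a code change.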
4. Widen enforcement only after review
Policy promotion is not complete when the toggle flips. It is complete when the metrics support the change.
Keep widening scope only if the data says the system is helping more than it is harming.
Metrics that actually matter
You need a short list of metrics with stop conditions.
- Catastrophic failure rate: Did clearly unsafe actions still get through?
- Safe task success: Did legitimate work continue to complete?
- False positive rate: How often did the system interrupt safe behavior?
- Enforcement latency (p95): Did the control become operationally expensive?
A useful rule of thumb: if you cannot define the rollback trigger before rollout, you are not ready to enforce.
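Defining the rollback trigger before rollout can mean literally writing the stop conditions down as data. A minimal sketch, with placeholder thresholds that are illustrative rather than recommendations:

```python
# Stop conditions defined BEFORE enforcement. Thresholds are placeholders.
STOP_CONDITIONS = {
    "catastrophic_failure_rate": 0.0,   # any clearly unsafe action getting through
    "false_positive_rate": 0.02,        # interruptions of safe behavior
    "safe_task_failure_rate": 0.01,     # legitimate work failing to complete
    "enforcement_latency_p95_ms": 250,  # operational cost ceiling
}

def should_rollback(metrics: dict) -> list:
    """Return the names of every stop condition the observed metrics violate."""
    return [name for name, limit in STOP_CONDITIONS.items()
            if metrics.get(name, 0) > limit]
```

If `should_rollback` returns anything non-empty, the rollout pauses or reverts; arguing about thresholds happens before the toggle flips, not during an incident.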
A practical gating model
Use a simple phase gate instead of a vague confidence statement.
| Phase | Evidence required to advance |
|---|---|
| Replay | Critical risky cases caught, benign baseline still passes |
| Shadow | False positives understood, misses reviewed, reason codes usable |
| Narrow active | On-call comfortable with rollback, latency acceptable, no major regressions |
| Broader active | Same pattern holds across more tools or workflows |
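The gate table above can be encoded as an ordered checklist so that advancement is mechanical rather than a judgment call made under pressure. The evidence keys are illustrative booleans an operator would set after review.

```python
PHASES = ["replay", "shadow", "narrow_active", "broader_active"]

# Evidence required to leave each phase, mirroring the table above.
GATES = {
    "replay": ["risky_cases_caught", "benign_baseline_passes"],
    "shadow": ["false_positives_understood", "misses_reviewed", "reason_codes_usable"],
    "narrow_active": ["rollback_rehearsed", "latency_acceptable", "no_major_regressions"],
}

def next_phase(current: str, evidence: dict) -> str:
    """Advance one phase only when every gate for the current phase is satisfied."""
    required = GATES.get(current, [])
    if all(evidence.get(key) for key in required):
        idx = PHASES.index(current)
        return PHASES[min(idx + 1, len(PHASES) - 1)]
    return current
```

Missing evidence keeps you where you are, which is the conservative default the table is arguing for.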
This is intentionally conservative. Governance systems should earn trust by being predictable.
Operator checklist
- pin policy versions instead of loading latest implicitly
- keep fail-closed enabled for production-like environments
- record reason codes and human-readable reason text for every intervention
- define rollback before promotion, not after
- start with the smallest scope that still gives meaningful learning
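The checklist above translates naturally into a rollout config object with safe defaults. This is a sketch, not a real platform schema; every field name and value here is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RolloutConfig:
    policy_version: str               # pinned explicitly, never "latest"
    fail_closed: bool = True          # block on evaluator errors in prod-like envs
    record_reason_codes: bool = True  # machine code plus human-readable reason text
    rollback_owner: str = ""          # named before promotion, not after
    scope: tuple = ()                 # smallest scope that still teaches you something

# Illustrative values only.
cfg = RolloutConfig(
    policy_version="policy-v1.4.2",
    rollback_owner="oncall-platform",
    scope=("delete_records",),
)
```

Making the dataclass frozen means a running rollout's configuration cannot be mutated in place; changing scope or version forces a new, reviewable config.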
What usually causes failed rollouts
Most bad policy rollouts are not caused by the classifier being slightly wrong. They fail because the operating model is weak:
- no replay baseline
- no shadow observation period
- no agreed false-positive threshold
- no scoped first deployment
- no clear rollback owner
That is why rollout discipline matters as much as model quality.
Read next
- Shadow to active rollout guide for the product runbook
- Policy management for versioning and activation workflow
- Core concepts for the underlying enforcement architecture