Pyde Chain Halt + Recovery Procedures
Version 0.1
The HotStuff lesson made operational: explicit halt detection → investigation → recovery procedures. No live-patching under pressure.
Three Halt Types
| Type | Trigger | Severity | Authority | Recovery |
|---|---|---|---|---|
| Soft stall | Network/quorum issues | Liveness only | Emergent (any node detects) | Wait (auto-resume) |
| Hard halt | Detected inconsistency (state root divergence, equivocation cluster) | Safety risk | Protocol-detected automatic | Manual investigation |
| Emergency halt | Critical bug, active exploit, hard-fork prep | High (intentional) | Governance multisig (7-of-12) | Per-incident, max 30 days |
Detection Mechanisms
Soft Stall (Automatic)
- No commit for > 5 rounds (~1s expected per round, so a ~5s threshold)
- <85 vertices certified for last K rounds
- Active committee count drops below safety threshold (86)
Response: Validators enter "stall mode": they keep producing vertices and wait for quorum. Mempool keeps accepting txs (queued). Validators auto-recover when conditions improve.
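The three soft-stall conditions can be sketched as a single check. This is a minimal illustration, not the node's actual logic: `RoundStats`, `soft_stall_reasons`, and the constant names are hypothetical, K is passed in as a parameter because its value is not specified here, and the certified-vertex condition is assumed to mean that every one of the last K rounds fell below 85.

```python
from dataclasses import dataclass

# Thresholds taken from the list above; the names are illustrative.
MAX_ROUNDS_WITHOUT_COMMIT = 5
QUORUM_VERTICES = 85
MIN_ACTIVE_COMMITTEE = 86


@dataclass
class RoundStats:
    certified_vertices: int  # vertices certified in this round
    committed: bool          # whether a commit landed in this round


def soft_stall_reasons(recent_rounds, active_committee, k):
    """Return the soft-stall conditions that currently hold.

    recent_rounds is ordered oldest to newest; k is the lookback window
    for the certified-vertex check (an assumed parameter).
    """
    reasons = []

    # Count consecutive rounds without a commit, newest first.
    rounds_without_commit = 0
    for stats in reversed(recent_rounds):
        if stats.committed:
            break
        rounds_without_commit += 1
    if rounds_without_commit > MAX_ROUNDS_WITHOUT_COMMIT:
        reasons.append("no_commit")

    # Assumption: every one of the last k rounds certified fewer than 85.
    window = recent_rounds[-k:] if k > 0 else []
    if k > 0 and len(window) == k and all(
        s.certified_vertices < QUORUM_VERTICES for s in window
    ):
        reasons.append("low_certification")

    if active_committee < MIN_ACTIVE_COMMITTEE:
        reasons.append("committee_below_threshold")

    return reasons
```

An empty result means no stall condition holds; any non-empty result would put the validator into stall mode until the list is empty again.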
Hard Halt (Automatic)
- State root divergence detected (2+ signed contradictory roots for same commit)
- Equivocation cluster (10+ validators in single epoch)
- DKG output mismatch
- Execution layer critical invariant violation
- DAG fork detected (impossible per protocol, indicates bug)
Response: All validators stop producing vertices. All commits halted. Halt event broadcast. Forensic state preserved. Manual intervention required.
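Two of the hard-halt triggers above are simple enough to sketch. The example below assumes signatures have already been verified upstream and that evidence arrives as plain tuples; the function names and the tuple layout are hypothetical, and the remaining triggers (DKG mismatch, invariant violation, DAG fork) are not modeled.

```python
from collections import defaultdict

EQUIVOCATION_CLUSTER_SIZE = 10  # "10+ validators in single epoch"


def divergent_commits(signed_roots):
    """Find commits for which validators signed contradictory state roots.

    signed_roots is an iterable of (commit_id, validator_id, state_root)
    tuples whose signatures are assumed to be verified already.
    Returns {commit_id: {state_root: set_of_validators}} for divergent commits.
    """
    by_commit = defaultdict(lambda: defaultdict(set))
    for commit_id, validator_id, state_root in signed_roots:
        by_commit[commit_id][state_root].add(validator_id)
    return {
        commit_id: {root: set(vals) for root, vals in roots.items()}
        for commit_id, roots in by_commit.items()
        if len(roots) >= 2  # two or more distinct roots for one commit
    }


def is_equivocation_cluster(equivocators_in_epoch):
    """True when 10 or more distinct validators equivocated in one epoch."""
    return len(set(equivocators_in_epoch)) >= EQUIVOCATION_CLUSTER_SIZE


def hard_halt_triggers(signed_roots, equivocators_in_epoch):
    """Return the names of the modeled hard-halt triggers that fired."""
    triggers = []
    if divergent_commits(signed_roots):
        triggers.append("state_root_divergence")
    if is_equivocation_cluster(equivocators_in_epoch):
        triggers.append("equivocation_cluster")
    return triggers
```

Keeping the per-root validator sets in the result is deliberate: the same data identifies which validators signed which root during the investigation.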
Emergency Halt (Manual)
- Critical bug discovered (off-chain, e.g., security researcher)
- Active exploit being mitigated
- Hard-fork coordination needed
- State recovery from previous incident
Response: Governance multisig signs HaltMessage with timestamp + reason. Halt activated for max 30 days (constitutional limit).
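The authorization and expiry rules for an emergency halt can be illustrated as follows. Only the name `HaltMessage`, the 7-of-12 threshold, and the 30-day limit come from this document; the field layout and helper functions are assumptions, and cryptographic signature verification is assumed to happen before this point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MULTISIG_THRESHOLD = 7          # 7-of-12 governance multisig
MAX_HALT = timedelta(days=30)   # constitutional limit


@dataclass(frozen=True)
class HaltMessage:
    reason: str
    timestamp: datetime
    signers: frozenset  # IDs of signers whose signatures were verified upstream


def is_authorized(msg, multisig_members):
    """At least 7 distinct signers that belong to the multisig."""
    valid = msg.signers & frozenset(multisig_members)
    return len(valid) >= MULTISIG_THRESHOLD


def halt_expiry(msg):
    """Latest moment the halt may remain in force."""
    return msg.timestamp + MAX_HALT


def halt_is_active(msg, multisig_members, now):
    """Authorized, already started, and not yet past the 30-day limit."""
    return is_authorized(msg, multisig_members) and msg.timestamp <= now < halt_expiry(msg)
```

Signers outside the multisig are ignored rather than rejected in this sketch, so a message with six member signatures plus one outsider does not reach the threshold.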
What Happens During Halt
| Activity | Soft Stall | Hard Halt | Emergency Halt |
|---|---|---|---|
| Vertex production | Continues (no quorum) | Stops | Stops |
| Commits | Paused | Paused | Paused |
| Tx submission | Accepted, queued | Accepted, queued | Accepted, queued |
| Decryption ceremonies | Paused | Stopped | Stopped |
| DKG ceremonies | Continues unless triggered | Stopped | Stopped |
| State queries | Continue | Continue (forensic) | Continue |
| Slashing evidence acceptance | Continues | Continues | Continues |
| Gossip | Continues | Continues | Continues |
Key invariant: slashing evidence is accepted during every halt type, so attackers cannot escape consequences by triggering a halt.
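One way to keep the table and the invariant from drifting apart is to encode the table as data and check the invariant against it. The sketch below is a transcription of the table; the enum, the status strings, and the helper names are illustrative choices rather than protocol identifiers.

```python
from enum import Enum


class Halt(Enum):
    SOFT_STALL = "soft_stall"
    HARD_HALT = "hard_halt"
    EMERGENCY_HALT = "emergency_halt"


def row(soft, hard, emergency):
    """Build one table row keyed by halt type."""
    return {Halt.SOFT_STALL: soft, Halt.HARD_HALT: hard, Halt.EMERGENCY_HALT: emergency}


# Transcription of the activity table: activity -> status per halt type.
ACTIVITY_MATRIX = {
    "vertex_production": row("continues", "stopped", "stopped"),
    "commits": row("paused", "paused", "paused"),
    "tx_submission": row("queued", "queued", "queued"),
    "decryption_ceremonies": row("paused", "stopped", "stopped"),
    # Soft stall: the table says "continues unless triggered".
    "dkg_ceremonies": row("continues", "stopped", "stopped"),
    "state_queries": row("continues", "continues", "continues"),
    "slashing_evidence": row("continues", "continues", "continues"),
    "gossip": row("continues", "continues", "continues"),
}


def is_accepted(activity, halt):
    """True if the activity still runs or is queued under this halt type."""
    return ACTIVITY_MATRIX[activity][halt] in ("continues", "queued")


def slashing_invariant_holds():
    """Slashing evidence must be accepted under every halt type."""
    return all(ACTIVITY_MATRIX["slashing_evidence"][h] == "continues" for h in Halt)
```

A check like `slashing_invariant_holds()` could run in a unit test so that any future edit to the matrix that breaks the invariant fails loudly.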
Investigation Procedure (Hard / Emergency)
Phase 1: Triage (within 1 hour)
- Confirm halt type + trigger
- Identify affected commits / validators
- Snapshot forensic state (preserve)
- Public incident report (initial)
Phase 2: Root Cause Analysis (within 6-24 hours)
- Bug / attack / infrastructure failure?
- Determine scope of impact
- Coordinate with validator operators
- Develop fix or recovery plan
Phase 3: Recovery Plan (within 24-72 hours)
- Propose recovery strategy
- Validate plan with multisig + community
- Coordinate validator updates if needed
- Schedule resume timing
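The phase windows translate directly into deadlines that an incident tracker could monitor. This sketch uses the upper bound of each window and assumes all windows are measured from the moment the halt is detected, which the procedure above does not state explicitly; the phase keys and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Upper bound of each phase window.
# Assumption: every window is measured from halt detection.
PHASE_DEADLINES = [
    ("triage", timedelta(hours=1)),
    ("root_cause_analysis", timedelta(hours=24)),
    ("recovery_plan", timedelta(hours=72)),
]


def phase_schedule(halt_detected_at):
    """Return [(phase, absolute_deadline)] in procedure order."""
    return [(name, halt_detected_at + limit) for name, limit in PHASE_DEADLINES]


def overdue_phases(halt_detected_at, completed, now):
    """Phases that are not completed and whose deadline has passed."""
    return [
        name
        for name, deadline in phase_schedule(halt_detected_at)
        if name not in completed and now > deadline
    ]
```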
Recovery Procedures (5 Paths)
1. Wait It Out (Soft Stalls)
- Network/validator issues resolve naturally
- 85+ validators come back online
- Quorum forms, commits resume
- No intervention needed
- Typical duration: <30 minutes; a stall lasting >1 hour is escalated
2. Software Update + Replay (Hard Halts from Bugs)
- Identify the deterministic bug causing state divergence
- Patch validator software
- Validators verify they're at consistent state
- Coordinate restart from last verified commit
- Replay txs from mempool
3. Rollback (Controversial, Severe Bugs)
- Roll back to last "clean" commit (max 1 epoch back, ~3 hours)
- Discard commits after rollback point
- Re-execute affected txs
- Apply slashing to bad actors
- Limited window prevents catastrophic finality violations
4. Hard Fork (Irreconcilable Issues)
- Manual coordination via governance multisig
- Agreement on canonical state
- All validators update software
- Resume from agreed genesis-of-new-fork state
- Old chain abandoned
5. Emergency Unhalt (False-Positive Halts)
- Investigation reveals no actual issue
- Multisig releases halt
- Resume normally
Rollback Policy
Bounded operational pragmatism:
- Maximum rollback window: 1 epoch (~3 hours)
- Within window: governance multisig can authorize rollback
- Beyond window: only hard fork (community coordination required)
Philosophy: weak finality with a sunset.
- Within 1 epoch: finality is "almost certain but reversible via emergency"
- After 1 epoch: finality is "irreversible without coordinated hard fork"
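This is a common pattern in practice rather than a formal standard: Solana has recovered from outages through coordinated validator restarts, and Ethereum's one reversal of exploited state (the 2016 DAO incident) went through a hard fork.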
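The rollback policy reduces to a single decision on the age of the commit being reverted. The sketch below measures the window in wall-clock time using the approximate 3-hour epoch length; a real implementation would more likely compare epoch numbers, and the function name and return labels are hypothetical.

```python
from datetime import timedelta

# 1 epoch, approximately 3 hours. Using wall-clock time instead of
# epoch numbers is a simplifying assumption.
ROLLBACK_WINDOW = timedelta(hours=3)


def rollback_path(commit_age):
    """Return which mechanism may revert a commit of the given age."""
    if commit_age < timedelta(0):
        raise ValueError("commit age cannot be negative")
    if commit_age <= ROLLBACK_WINDOW:
        return "multisig_rollback"   # governance multisig can authorize
    return "hard_fork_only"          # community coordination required
```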
State Reconciliation After Rollback
1. All validators agree on rollback target (commit C)
2. Validators roll back state to C
3. Commits after C are discarded
4. Txs in those commits returned to mempool (if still valid)
5. Slashing applied to validators who produced bad-state-root sigs
6. Software updates applied if needed
7. Resume normal operation from C
8. New canonical fork is the post-rollback chain
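Steps 2 to 4 can be sketched on a toy chain where each commit is just a list of transaction IDs. The validity predicate is supplied by the caller and stands in for re-checking against the rolled-back state; placing requeued transactions ahead of those that arrived during the halt is a design assumption of this example, not something the procedure specifies.

```python
def reconcile_after_rollback(chain, target_index, mempool, is_still_valid):
    """Roll the chain back to chain[target_index] and requeue discarded txs.

    chain is a list of commits, each a list of tx IDs, oldest first.
    is_still_valid is a caller-supplied predicate (hypothetical) that checks
    a tx against the rolled-back state.
    Returns (new_chain, new_mempool, dropped_txs).
    """
    if not 0 <= target_index < len(chain):
        raise IndexError("rollback target is not on the chain")

    kept = chain[: target_index + 1]        # state up to and including C
    discarded = chain[target_index + 1 :]   # commits after C

    requeued, dropped = [], []
    for commit in discarded:
        for tx in commit:
            (requeued if is_still_valid(tx) else dropped).append(tx)

    # Assumption: requeued txs go ahead of txs that arrived during the halt.
    return kept, requeued + list(mempool), dropped
```

For example, rolling a four-commit chain back to its second commit discards the last two commits and returns their still-valid transactions to the front of the mempool.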
Specific Scenario Playbooks
Scenario A: State Root Divergence in Commit N
- Detection: 2+ validators signed contradictory roots for commit N
- Action: hard halt automatic
- Investigation: which validators? what tx caused? bug or attack?
- Recovery: identify cause, patch validators, rollback to N-1, resume
- Slashing: validators with wrong root get bad-state-root-sig slash (10%+)
Scenario B: 43+ Committee Members Offline Simultaneously
- Detection: fewer than 85 validators online, so quorum cannot form
- Action: soft stall
- Investigation: coordinated (attack) or correlated (datacenter outage)?
- Recovery: correlated → wait; coordinated → governance emergency halt to remove the coordinating validators
- Slashing: extended downtime + possibly coordination evidence
Scenario C: Critical Bug Discovered (Off-Chain)
- Detection: human report to foundation
- Action: emergency halt via multisig
- Investigation: assess exploit, develop patch
- Recovery: coordinate validator update, resume after patch
- Slashing: none (no on-chain evidence)
Scenario D: DKG Ceremony Failed (Multiple Times)
- Detection: DKG round 4 fails more than 3 consecutive times
- Action: partial halt (encryption disabled for epoch)
- Investigation: which members not contributing? bug or attack?
- Recovery: rotate problematic members + retry DKG, OR continue without encryption
- Slashing: DKG-failure for non-participants
Scenario E: Detected DAG Fork
- Detection: contradictory subdags after commit
- Action: hard halt (this should be impossible per protocol)
- Investigation: deep protocol bug
- Recovery: hard fork to canonical chain, coordinate community
- Slashing: equivocation slashing for forking actors
Communication & Coordination
Halt detected → On-chain "ChainHalted" event emitted
↓
Validator dashboards display halt status
↓
Foundation publishes incident page (initial within 1 hour)
↓
Coordination channels active:
- Discord/Telegram: real-time
- Validator email list: critical comms
- Twitter/X: public status
↓
Resolution proposed
↓
Multisig signs ResumeMessage when ready
↓
On-chain "ChainResumed" event
↓
Public post-mortem within 7 days
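The flow above is strictly linear, which makes it easy to enforce as a small state machine. In this sketch only the event names "ChainHalted" and "ChainResumed" come from the flow; the stage names and the `advance` helper are hypothetical.

```python
from enum import Enum, auto


class Stage(Enum):
    RUNNING = auto()
    HALTED = auto()
    INCIDENT_PUBLISHED = auto()
    RESOLUTION_PROPOSED = auto()
    RESUME_SIGNED = auto()
    RESUMED = auto()
    POST_MORTEM_PUBLISHED = auto()


ORDER = list(Stage)  # Enum iteration follows definition order

# On-chain events named in the flow, keyed by the stage that emits them.
ON_CHAIN_EVENTS = {Stage.HALTED: "ChainHalted", Stage.RESUMED: "ChainResumed"}


def advance(current, nxt):
    """Allow only the next stage in the flow; reject skips and reversals."""
    if ORDER.index(nxt) != ORDER.index(current) + 1:
        raise ValueError(f"cannot move from {current.name} to {nxt.name}")
    return nxt
```

Rejecting skipped stages means, for instance, that a resume cannot be recorded before a resolution has been proposed and the multisig has signed.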
Re-Entry After Halt
1. Multisig signals resume (or auto-resume for soft stalls)
2. Validators verify they're at consistent state
3. Mempool processes queued txs (validity re-checked against current state)
4. Commits resume normal cadence
5. Slashing evidence from halt period processed
6. System returns to normal operation
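Step 3 matters because state may have changed while transactions sat in the queue. The sketch below re-checks queued transactions in arrival order against a toy account model (sequential nonces, and a balance that must cover the cost); that model and every name in the example are illustrative assumptions, not Pyde's actual validity rules.

```python
def revalidate_queue(queued_txs, nonces, balances):
    """Re-check queued txs against current state, in arrival order.

    Each tx is a dict with 'sender', 'nonce' and 'cost'. The account model
    (sequential nonces, balance must cover cost) is an illustrative assumption.
    Returns (accepted, rejected); the input state dicts are not modified.
    """
    nonces, balances = dict(nonces), dict(balances)  # work on copies
    accepted, rejected = [], []
    for tx in queued_txs:
        sender = tx["sender"]
        nonce_ok = tx["nonce"] == nonces.get(sender, 0)
        funds_ok = tx["cost"] <= balances.get(sender, 0)
        if nonce_ok and funds_ok:
            # Apply the tx so later txs are checked against updated state.
            nonces[sender] = tx["nonce"] + 1
            balances[sender] = balances.get(sender, 0) - tx["cost"]
            accepted.append(tx)
        else:
            rejected.append(tx)
    return accepted, rejected
```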
Test Plan / Drills
Mandatory before mainnet:
- Soft stall drills: deliberately take 43 validators offline, verify recovery
- Hard halt drills: inject state divergence, verify detection + flow
- Emergency halt drills: practice multisig coordination
- Rollback drills: practice 1-epoch rollback procedure
- Hard fork drills: practice coordinated upgrade
Frequency: quarterly in testnet, annually in mainnet.
Documentation: runbooks for each scenario; updated after every drill.
The HotStuff Lesson Applied
HotStuff broke under wedges/stalls because there was no clear halt → investigate → recover procedure. The team patched live, accumulating safety subtleties.
Pyde's design EXPLICITLY:
- Separates the three halt types
- Defines authority + procedure for each
- Builds drills into the operational plan
This is the lesson learned from the pivot.
References
- Threat model: see THREAT_MODEL.md
- Failure scenarios (operational walk-through): see FAILURE_SCENARIOS.md
- Slashing: see SLASHING.md
Document version: 0.1
License: See repository root