Pyde Failure Scenarios
Version 0.1
Operational walk-throughs of failure modes. Complements THREAT_MODEL.md (what attacks exist) with step-by-step recovery procedures.
General Incident Response Timeline
T+0:00 Detection (auto or manual)
T+0:05 On-call notified
T+0:15 Triage call initiated
T+0:30 Initial incident page published
T+1:00 Root cause investigation begins
T+6:00 Recovery plan proposed
T+24:00 Recovery executed (straightforward cases)
T+72:00 Resolution + initial post-mortem
T+7d Full public post-mortem published
T+30d Drill that scenario in testnet
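As an illustration, the timeline can be turned into concrete deadlines once a detection timestamp is known. This is a minimal sketch: the offsets are copied from the list above, while `MILESTONES` and `deadlines` are hypothetical names, not part of any Pyde tooling.

```python
from datetime import datetime, timedelta, timezone

# Offsets taken from the timeline above (T+H:MM, then days).
MILESTONES = [
    ("On-call notified", timedelta(minutes=5)),
    ("Triage call initiated", timedelta(minutes=15)),
    ("Initial incident page published", timedelta(minutes=30)),
    ("Root cause investigation begins", timedelta(hours=1)),
    ("Recovery plan proposed", timedelta(hours=6)),
    ("Recovery executed (straightforward cases)", timedelta(hours=24)),
    ("Resolution + initial post-mortem", timedelta(hours=72)),
    ("Full public post-mortem published", timedelta(days=7)),
    ("Drill that scenario in testnet", timedelta(days=30)),
]


def deadlines(detected_at):
    """Return (milestone, due time) pairs relative to the detection time."""
    return [(name, detected_at + offset) for name, offset in MILESTONES]


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
    for name, due in deadlines(t0):
        print(f"{due.isoformat()}  {name}")
```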
Communication Protocol
- Authoritative source: foundation incident page + Discord #incidents
- Status page: pyde.network/status (always updated)
- Validator coordination: private email list + dedicated Discord channel
- Public: Twitter/X status updates every 30 min during active incident
The 12 Scenarios
Scenario 1: Single Validator Offline (Hardware Failure)
- Trigger: Validator's server crashes (disk, power, etc.)
- Detection: Auto-detected within 2 rounds (no vertex from validator)
- Initial Response: None needed; the other 127 validators continue normally
- Investigation: Operator diagnoses (off-chain)
- Recovery: Operator replaces hardware, runs state sync, resumes
- Time to Recovery: 4-24 hours
- Slashing: Downtime penalty accumulates (~0.05% per round)
- Drill Frequency: Quarterly
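The detection rule and downtime accrual for Scenario 1 could be sketched as follows. This assumes a validator is flagged once it has produced no vertex for 2 rounds, and that the ~0.05% per-round penalty accrues linearly and is capped at the full bond. Both are simplifications, and SLASHING.md remains the authority. All names are hypothetical.

```python
MISSED_ROUNDS_THRESHOLD = 2   # "within 2 rounds" from the detection rule
DOWNTIME_BPS_PER_ROUND = 5    # ~0.05% per round, in basis points
FULL_BOND_BPS = 10_000        # 100% of the bond


def offline_validators(last_vertex_round, current_round):
    """Validators whose latest vertex is at least 2 rounds old.

    last_vertex_round maps validator id -> round of its latest vertex.
    """
    return sorted(
        v for v, r in last_vertex_round.items()
        if current_round - r >= MISSED_ROUNDS_THRESHOLD
    )


def downtime_penalty_bps(missed_rounds):
    """Accrued downtime penalty in basis points (assumed linear, capped)."""
    return min(FULL_BOND_BPS, missed_rounds * DOWNTIME_BPS_PER_ROUND)
```

Working in basis points keeps the arithmetic in integers and avoids floating-point rounding when penalties are summed over many rounds.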
Scenario 2: Validator Key Compromise
- Trigger: Operator's key stolen (phishing, server intrusion)
- Detection: Unusual signing patterns OR operator reports
- Initial Response:
  - Operator: rotate to new key immediately
  - Foundation: investigate scope
  - Other validators: monitor for collusion
- Investigation: Forensic analysis, attribution if possible
- Recovery: Key rotation, possibly fresh validator slot if old one slashed
- Time to Recovery: 1-7 days
- Slashing: Applies to any slashable offense the attacker committed with the key
- Lessons: HSM strongly recommended; key rotation procedures documented
- Drill Frequency: Annual paper drill
Scenario 3: Network Partition (30% Split for 1 Hour)
- Trigger: BGP routing issue, undersea cable cut, ISP outage
- Detection:
  - Active committee count drops below 85 (quorum threshold 2f+1 = 85, with f = 42)
  - Soft stall triggered automatically
  - Downtime slashing PAUSES (partition-aware)
- Initial Response:
  - Validators in majority partition: keep producing vertices
  - Validators in minority: cannot reach quorum, stall
  - No coordination needed (automatic handling)
- Investigation: Root cause analysis (network team)
- Recovery:
  - Network heals
  - Minority validators rejoin gossip
  - DAG resynchronizes
  - Slashing resumes
- Time to Recovery: Hours (depends on network)
- Slashing: None during partition (partition-aware pause)
- Drill Frequency: Quarterly (simulate in testnet)
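The quorum arithmetic behind this scenario can be checked in a few lines. The sketch assumes a 128-member committee (implied by the "other 127" in Scenario 1) and the usual BFT bound f = floor((n - 1) / 3). Function names are illustrative.

```python
def max_faulty(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3


def quorum(n):
    """Quorum threshold 2f + 1."""
    return 2 * max_faulty(n) + 1


def partition_status(n, reachable):
    """'live' if this side of the partition can still reach quorum."""
    return "live" if reachable >= quorum(n) else "stalled"


if __name__ == "__main__":
    n = 128
    print(max_faulty(n), quorum(n))   # 42 85
    # A roughly 30% split: 38 validators on one side, 90 on the other.
    print(partition_status(n, 90))    # live
    print(partition_status(n, 38))    # stalled
```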
Scenario 4: State Root Divergence Detected
- Trigger: Bug in WASM execution layer or non-determinism
- Detection: Automatic. 2+ validators sign contradictory state roots for the same commit → hard halt
- Initial Response:
  - All validators halt
  - Forensic state preserved
  - Incident page published
- Investigation:
  - Identify which validators signed which root
  - Determine which root is "correct"
  - Identify bug causing divergence
  - Expected duration: 6-24 hours
- Recovery:
  - Patch the bug
  - Validators update software
  - Roll back to last consistent commit (within 1-epoch window)
  - Resume from rolled-back state
  - Slash validators who signed wrong roots
- Time to Recovery: 24-72 hours
- Slashing: Bad-state-root-sig (~10%) to validators on wrong fork
- Lessons: WASM execution determinism testing must improve; add new test cases
- Drill Frequency: Quarterly (inject in testnet)
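The detection rule for this scenario (contradictory signed roots for the same commit) can be sketched as a grouping step. The `(commit_id, validator_id, state_root)` tuple layout is an assumption for illustration and does not reflect Pyde's actual wire format. The output also answers the first investigation question: which validators signed which root.

```python
from collections import defaultdict


def find_divergent_commits(signatures):
    """Group signed state roots by commit and return the commits with >1 root.

    signatures: iterable of (commit_id, validator_id, state_root) tuples.
    Returns {commit_id: {state_root: [validator_id, ...]}}.
    """
    roots_by_commit = defaultdict(lambda: defaultdict(set))
    for commit_id, validator_id, state_root in signatures:
        roots_by_commit[commit_id][state_root].add(validator_id)
    return {
        commit: {root: sorted(signers) for root, signers in roots.items()}
        for commit, roots in roots_by_commit.items()
        if len(roots) > 1   # more than one distinct root means divergence
    }
```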
Scenario 5: DKG Ceremony Fails Repeatedly
- Trigger:
  - Several committee members go offline mid-DKG
  - DKG round 3 messages don't reach validators
  - Bug in DKG implementation
- Detection: DKG round 4 verification fails for >3 consecutive attempts → partial halt (encryption disabled for this epoch)
- Initial Response:
  - Identify which members are not contributing valid shares
  - Decide: retry vs. continue without encryption
- Investigation:
  - Per-member: offline, buggy, or malicious?
  - Network issues vs. software bug
- Recovery (options):
  - A: Retry DKG with backup committee members
  - B: Continue without encryption for this epoch
  - C: Replace problematic members from the validators-awaiting-selection pool
- Time to Recovery: Same epoch (~3 hours) or next epoch
- Slashing: DKG-failure for non-contributors (~5%)
- Drill Frequency: Annual
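The ">3 consecutive attempts" detection rule could be sketched as a check on the trailing run of failed verifications. It assumes each DKG attempt is recorded as a boolean (True when round 4 verification succeeded). The function names are hypothetical.

```python
MAX_CONSECUTIVE_FAILURES = 3   # partial halt triggers on MORE than 3


def trailing_failures(results):
    """Count failed attempts at the end of the history (most recent last)."""
    count = 0
    for verified in reversed(results):
        if verified:
            break
        count += 1
    return count


def should_disable_encryption(results):
    """True once more than 3 consecutive verification attempts have failed."""
    return trailing_failures(results) > MAX_CONSECUTIVE_FAILURES
```

Under this reading a successful attempt resets the count, so three failures, one success, and three more failures would not trigger the partial halt.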
Scenario 6: Critical Execution Layer Bug (Off-Chain Disclosure)
- Trigger: Security researcher reports vulnerability via responsible disclosure
- Detection: Email to security@pyde.network
- Initial Response:
  - Within 1 hour: foundation reviews + confirms severity
  - If critical + active exploit risk: emergency halt via multisig
  - If critical + no immediate risk: 24-72 hour disclosure window
- Investigation:
  - Reproduce the bug
  - Develop patch
  - Test patch
  - Coordinate validator updates
- Recovery:
  - All validators update software simultaneously
  - Coordinated restart if needed
  - Public disclosure + acknowledgment + bounty payment
- Time to Recovery: 24-72 hours
- Slashing: None (no on-chain offense)
- Lessons: Strong bug bounty program; clear disclosure policy
- Drill Frequency: Annual paper drill
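The initial-response branching for a disclosed bug can be written as a small decision function. The two critical branches come from the scenario text. The non-critical fallback is an assumption, since the scenario does not cover that case.

```python
def disclosure_response(critical, active_exploit_risk):
    """Map a confirmed severity assessment to the initial response."""
    if critical and active_exploit_risk:
        return "emergency halt via multisig"
    if critical:
        return "24-72 hour disclosure window"
    # Assumption: the scenario does not specify non-critical handling.
    return "handle through the normal release process"
```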
Scenario 7: Active Exploit Being Used
- Trigger: Foundation observes attacker draining funds
- Detection: On-chain monitoring tools, validator reports
- Initial Response: Emergency halt within 15 minutes via multisig
- Investigation:
  - Identify exploit mechanism (fast)
  - Calculate scope of damage
  - Identify attacker addresses if possible
- Recovery:
  - Patch the exploit
  - Validator update
  - Rollback if within 1-epoch window (controversial)
  - OR resume without rollback (user funds lost)
  - Compensation plan from treasury if available
- Time to Recovery: 24-72 hours
- Slashing: None (off-chain attack)
- Lessons: Better monitoring; multisig response speed critical
- Drill Frequency: Annual simulated
Scenario 8: Foundation Multisig Key Lost / Compromised
- Trigger: Holder loses key (HW failure) OR key stolen
- Detection: Holder reports loss OR unusual multisig activity observed
- Initial Response:
  - Lost: holder coordinates with other multisig members for replacement
  - Stolen: investigate scope, secure remaining keys
- Investigation: Verify identity of remaining holders; forensic if stolen
- Recovery:
  - Replace lost/compromised key via multisig vote
  - May need genesis-update if all keys at risk
  - Update on-chain multisig configuration
- Time to Recovery: Days to weeks
- Slashing: None (operational)
- Lessons: Diverse holders, geographic distribution, HSM
- Drill Frequency: Annual paper drill
Scenario 9: Major Cloud Provider Outage (AWS us-east-1)
- Trigger: Cloud provider region outage
- Detection: 30-60% of validators in that region go offline
- Initial Response: Validators outside affected region continue if quorum maintained
- Investigation: Identify cause (provider's issue, not Pyde's)
- Recovery:
  - Cloud provider recovers
  - Validators come back online
  - Network catches up
  - Slashing PAUSED during partition
- Time to Recovery: Hours (depends on provider)
- Slashing: None (partition-aware)
- Lessons: Validator diversity matters; encourage multi-provider, multi-region
- Drill Frequency: Quarterly multi-region resilience test
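Whether a single-region outage stalls the network depends on how many validators that region hosts. The sketch below flags any region whose loss would leave fewer than 2f+1 validators online. The region labels and the `validator_region` mapping are hypothetical inputs, not an existing Pyde registry.

```python
from collections import Counter


def quorum(n):
    """Quorum threshold 2f + 1 with f = floor((n - 1) / 3)."""
    return 2 * ((n - 1) // 3) + 1


def regions_that_would_stall(validator_region):
    """Regions whose complete outage would drop the network below quorum.

    validator_region maps validator id -> region label.
    """
    n = len(validator_region)
    counts = Counter(validator_region.values())
    return sorted(
        region for region, hosted in counts.items()
        if n - hosted < quorum(n)
    )
```

With 128 validators the quorum is 85, so a region hosting 38 validators (about 30%) can fail without a stall, while one hosting 77 (about 60%) cannot.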
Scenario 10: Coordinated 43-Validator Attack
- Trigger: 43 validators coordinate to attack (offline or equivocate)
- Detection: Real-time monitoring shows coordinated behavior
- Initial Response:
  - 43 offline: stall (auto), need governance to remove if persistent
  - 43 equivocating: massive slashing events
- Investigation: Identify coordinator; collect cryptographic evidence
- Recovery:
  - 43 offline: emergency halt + governance removal
  - 43 equivocating: slash all 43 (correlation multiplier = 2× → full bond)
  - Network resumes with remaining 85+ honest
- Time to Recovery: 24-72 hours
- Slashing: Up to 100% × 43 validators (correlation max)
- Lessons: This is the BFT boundary; the design defends against it, but at a cost
- Drill Frequency: Annual paper-only (too disruptive for testnet)
Scenario 11: Memory Leak Causing Rolling Restarts
- Trigger: Bug causes validator memory to grow unbounded
- Detection:
  - Operator notices RSS growing
  - Performance dashboards show abnormal memory
  - OOM crashes
- Initial Response:
  - Identify affected validators
  - Restart each affected validator individually
- Investigation:
  - Heap profiling
  - Identify leaked structure
  - Patch the bug
- Recovery:
  - Software update
  - Rolling restart (not simultaneous)
- Time to Recovery: Hours to days
- Slashing: Downtime for extended restarts
- Lessons: Better memory profiling, soak testing
- Drill Frequency: Continuous (every soak test)
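One way to turn "RSS growing" into an alert is to fit a least-squares slope to periodic memory samples and compare it with a threshold. The 50 MiB/hour default below is an arbitrary placeholder, not a Pyde-recommended value, and the function names are illustrative.

```python
def rss_growth_per_hour(samples):
    """Least-squares slope of RSS over time.

    samples: list of (hours_since_start, rss_mib) pairs, at least two
    with distinct timestamps.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t


def looks_like_leak(samples, threshold_mib_per_hour=50.0):
    """True if fitted growth exceeds the (placeholder) threshold."""
    return rss_growth_per_hour(samples) > threshold_mib_per_hour
```

A fitted slope is less sensitive to a single noisy sample than comparing the first and last readings, which matters for soak tests that run for many hours.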
Scenario 12: Genesis State Inconsistency Discovered
- Trigger: After mainnet launch, discrepancy found in genesis state
- Detection: Foundation review, validator report
- Initial Response:
  - Determine if functional or cosmetic
  - If functional: emergency halt
- Investigation:
  - Identify cause (founder error, hardcoded discrepancy)
  - Calculate impact
- Recovery:
  - Cosmetic: document the discrepancy; no further action
  - Functional: hard fork required (re-genesis or state correction)
- Time to Recovery: Days to weeks (hard fork is coordination-heavy)
- Slashing: None (genesis issue)
- Lessons: Genesis review must be thorough; multiple parties verify
- Drill Frequency: Pre-launch paper review only (irreversible post-launch)
Generalized Lessons
| Pattern | Recommendation |
|---|---|
| Multiple validators affected together | Encourage geographic + provider + ISP diversity |
| Operational mistakes | HSM, multisig for critical ops, runbooks |
| Software bugs | Bug bounty, formal verification, extensive testing |
| Network issues | Partition-aware slashing, sentry nodes, diverse routes |
| Time to recovery | Pre-rehearsed drills > improvising under pressure |
Runbook Library Structure
Each scenario should have a written runbook:
runbooks/
├── 01-validator-offline-single.md
├── 02-validator-key-compromise.md
├── 03-network-partition.md
├── 04-state-root-divergence.md
├── 05-dkg-failure.md
├── 06-pvm-bug-disclosed.md
├── 07-active-exploit.md
├── 08-multisig-key-event.md
├── 09-cloud-provider-outage.md
├── 10-coordinated-attack.md
├── 11-memory-leak.md
├── 12-genesis-discrepancy.md
└── README.md (decision tree → which runbook)
Each runbook contains: trigger conditions, detection criteria, step-by-step response (commands to run, calls to make), recovery procedures, escalation paths, communication templates, post-incident checklist.
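A lookup from scenario number to runbook file is the simplest form the README decision tree could take. The filenames below are copied from the directory listing above, while the `runbook_path` helper is a hypothetical illustration.

```python
RUNBOOKS = {
    1: "01-validator-offline-single.md",
    2: "02-validator-key-compromise.md",
    3: "03-network-partition.md",
    4: "04-state-root-divergence.md",
    5: "05-dkg-failure.md",
    6: "06-pvm-bug-disclosed.md",
    7: "07-active-exploit.md",
    8: "08-multisig-key-event.md",
    9: "09-cloud-provider-outage.md",
    10: "10-coordinated-attack.md",
    11: "11-memory-leak.md",
    12: "12-genesis-discrepancy.md",
}


def runbook_path(scenario, root="runbooks"):
    """Return the runbook path for a scenario number (1-12)."""
    try:
        name = RUNBOOKS[scenario]
    except KeyError:
        raise ValueError(f"no runbook for scenario {scenario}") from None
    return f"{root}/{name}"
```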
Drill Schedule
| Drill | Frequency | Format |
|---|---|---|
| Validator restart | Quarterly | Live (testnet) |
| Network partition | Quarterly | Live (testnet) |
| State root divergence | Quarterly | Live (testnet, injection) |
| DKG failure | Annual | Live (testnet) |
| Active exploit | Annual | Simulated |
| Coordinated attack | Annual | Paper only |
| Key compromise | Annual | Paper only |
| Multisig key event | Annual | Paper only |
| Genesis discrepancy | Pre-launch only | Paper review |
| Cloud outage | Quarterly | Live (testnet, region isolation) |
Track every drill: time-to-detect, time-to-respond, time-to-recover. Improve runbooks based on observed gaps.
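A drill record that derives the three tracked metrics might look like the sketch below. The field names are hypothetical, and so is the choice of interval boundaries (detection measured from injection, response from detection, recovery from injection). A team should fix its own definitions before comparing drills.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillRecord:
    scenario: str
    injected_at: datetime    # when the fault was introduced
    detected_at: datetime    # when monitoring or an operator noticed
    responded_at: datetime   # when the first response action started
    recovered_at: datetime   # when normal operation resumed

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.injected_at

    @property
    def time_to_respond(self) -> timedelta:
        return self.responded_at - self.detected_at

    @property
    def time_to_recover(self) -> timedelta:
        return self.recovered_at - self.injected_at
```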
Integration with Other Documents
- Threat model: see THREAT_MODEL.md for which attacks exist
- Chain halt: see CHAIN_HALT.md for halt mechanics
- Performance harness: see PERFORMANCE_HARNESS.md for chaos testing infrastructure
- Slashing: see SLASHING.md for slashing details
Document version: 0.1
License: See repository root