Skip to main content

Incident Response

Goal

Contain user-facing or operator-facing failures quickly and document what happened.

When to use

production outage
bad deploy
auth failure
incorrect historical data
simulation or copy-trading workflow malfunction

Prerequisites

access to logs
access to deployment state
access to contract and platform metadata when data quality is involved

Steps

Identify the affected surface: docs, web, API, worker, or historical data.
Stop further blast radius if needed.
Check current release and infrastructure health.
Inspect logs and the most recent related change.
Roll back or reconfigure if that is the fastest safe recovery path.
Document what failed, what was impacted, and what follow-up is required.

Verification

affected user path is restored
no repeated crash loops
follow-up reindex or migration work is scheduled if data correctness was impacted

Troubleshooting

separate availability problems from data-correctness problems
treat simulation and live-execution incidents differently

Goal
When to use
Prerequisites
Steps
Verification
Troubleshooting