Internal
Operator runbook
This page is an index. The canonical steps live in server/docs/RUNBOOK.md. Click into any section to jump to the relevant heading.
- Production deploy
Tag a server release, watch the GitHub Actions deploy, verify Cloud Run revision health and migration status.
- Cloud Run rollback
Switch traffic back to the previous revision when a deploy regresses. Skip-tags-on-rollback is intentional.
- MongoDB Atlas alerts
Triage replica-lag, slow-query, and disk-full alerts. Escalate to ops if Atlas reports a primary failover.
- Stripe webhook outage
Replay missed events from the Stripe dashboard. Confirm signing-secret rotation has not orphaned the endpoint.
- DPDPA breach (72h clock)
Start the 72-hour clock as soon as a breach is reasonably confirmed. Loop in the DPO and security within the same hour; do not wait for full forensics.
- Self-hosted runner recovery
Bring the backup runner online; revoke and reissue the runner registration token; replay queued workflows.
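For the Stripe webhook outage item above, one way to check whether a rotated signing secret has orphaned the endpoint is to verify a captured event against the secret the server currently holds. The sketch below is illustrative only and is not taken from server/docs/RUNBOOK.md. It follows Stripe's documented scheme (a `Stripe-Signature` header carrying `t=` and `v1=` fields, with an HMAC-SHA256 over `"{timestamp}.{payload}"`). The function names, the 300-second tolerance default, and the sample secret are assumptions made for this example.

```python
import hashlib
import hmac
import time
from typing import List, Optional, Tuple


def parse_signature_header(header: str) -> Tuple[int, List[str]]:
    """Split a Stripe-Signature header into its timestamp and v1 signatures."""
    timestamp = None
    signatures = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = int(value)
        elif key == "v1":
            signatures.append(value)
    if timestamp is None or not signatures:
        raise ValueError("header is missing t= or v1= fields")
    return timestamp, signatures


def signature_matches(
    payload: bytes,
    header: str,
    secret: str,
    tolerance_s: int = 300,  # assumed default for this sketch
    now: Optional[float] = None,
) -> bool:
    """Return True if the payload was signed with `secret` and is recent enough."""
    timestamp, signatures = parse_signature_header(header)
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance_s:
        return False  # too old or too far in the future to trust
    signed_payload = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected, sig) for sig in signatures)


if __name__ == "__main__":
    # Hypothetical values: build a signed event locally, then check it.
    secret = "whsec_example"
    payload = b'{"id": "evt_example"}'
    ts = int(time.time())
    sig = hmac.new(secret.encode(), f"{ts}.".encode() + payload, hashlib.sha256).hexdigest()
    print(signature_matches(payload, f"t={ts},v1={sig}", secret))
```

If a freshly delivered event fails this check against the deployed secret, the endpoint is probably still holding the pre-rotation secret. In the server itself, the official Stripe library's webhook verification is the better choice; this standalone version is only meant for triage.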
Can't find what you need?
Post in the #ops Slack channel. The runbook is updated within 14 days of any real incident, so if a topic isn't here, it just hasn't happened yet (or it has, and the post-mortem is in flight).