Internal

Operator runbook

This page is an index. The canonical steps live in server/docs/RUNBOOK.md. Click into any section to jump to the relevant heading.

Production deploy
Tag a server release, watch the GitHub Actions deploy, verify Cloud Run revision health and migration status.
Cloud Run rollback
Switch traffic back to the previous revision when a deploy regresses. Skip-tags-on-rollback is intentional.
MongoDB Atlas alerts
Triage replica-lag, slow-query, and disk-full alerts. Escalate to ops if Atlas reports primary failover.
Stripe webhook outage
Replay missed events from the Stripe dashboard. Confirm signing-secret rotation has not orphaned the endpoint.
DPDPA breach (72h clock)
Start the 72-hour clock as soon as a breach can reasonably be considered confirmed. Loop in the DPO and security within the same hour; do not wait for full forensics.
Self-hosted runner recovery
Bring the backup runner online; revoke and reissue the runner registration token; replay queued workflows.
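
For the DPDPA breach entry above, the only arithmetic involved is adding 72 hours to the moment of confirmation. The sketch below shows one way to compute the deadline and the time left. The names `breach_deadline` and `time_remaining` are hypothetical and not taken from server/docs/RUNBOOK.md, which remains the canonical source for the actual procedure.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the "72h clock" entry in this index.
BREACH_WINDOW = timedelta(hours=72)


def _require_aware(ts: datetime, name: str) -> None:
    # Naive datetimes are rejected so a local-time value is never
    # silently treated as UTC.
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError(f"{name} must be timezone-aware")


def breach_deadline(confirmed_at: datetime) -> datetime:
    """Return the UTC instant at which the 72-hour window closes (hypothetical helper)."""
    _require_aware(confirmed_at, "confirmed_at")
    return confirmed_at.astimezone(timezone.utc) + BREACH_WINDOW


def time_remaining(confirmed_at: datetime, now: datetime) -> timedelta:
    """Return the time left before the deadline; negative once it has passed."""
    _require_aware(now, "now")
    return breach_deadline(confirmed_at) - now.astimezone(timezone.utc)


if __name__ == "__main__":
    confirmed = datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc)  # illustrative timestamp
    print("Deadline (UTC):", breach_deadline(confirmed).isoformat())
    print("Remaining:", time_remaining(confirmed, datetime.now(timezone.utc)))
```

Working in UTC avoids off-by-hours mistakes when the person who confirms the breach and the person tracking the deadline are in different time zones.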

Can't find what you need?

Post in the #ops Slack channel. The runbook is updated within 14 days of any real incident, so if a topic isn't here, either it hasn't happened yet or its post-mortem is still in flight.