Disaster Recovery Plan - Sport Tech Club
Status: Active | Version: 1.0 | Last Review: 2026-01-09 | Owner: DevOps Team | Approved by: CTO
Table of Contents
- Objectives and Metrics
- Disaster Classification
- High Availability Architecture
- Backup Strategy
- Recovery Procedures
- Runbooks
- DR Testing
- Crisis Communication
- Degraded Mode
- Costs
1. Objectives and Metrics
Service Level Objectives (SLOs)
| Metric | Target | Rationale |
|---|---|---|
| RTO (Recovery Time Objective) | 1 hour | Maximum acceptable downtime for arena operations |
| RPO (Recovery Point Objective) | 15 minutes | Maximum tolerable data loss (bookings, payments) |
| SLA (Service Level Agreement) | 99.9% | ~8.76h of downtime per year (~43min/month) |
| MTTR (Mean Time To Recovery) | < 30 minutes | Average time to recover from an incident |
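As a sanity check, the downtime budget in the SLA row falls straight out of the availability target; a minimal sketch (values match the table):
# Downtime budget implied by a 99.9% availability target
awk 'BEGIN {
budget_h = (100 - 99.9) / 100 * 24 * 365              # 8.76 h/year
printf "Yearly budget: %.2f h\n", budget_h
printf "Monthly budget: %.1f min\n", budget_h * 60 / 12   # ~43.8 min/month
}'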
Business Impact
One hour of downtime:
- 20 arenas x 8 courts x 4 bookings/hour = 640 bookings affected
- ~R$ 80,000 in lost revenue
- Reputational damage
- Contractual penalties (SLA with arenas)
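The figures above imply an average ticket of R$ 125 per booking; a back-of-the-envelope check, not a pricing reference:
# Bookings at risk during one hour of downtime, and the implied average ticket
ARENAS=20; COURTS=8; BOOKINGS_PER_HOUR=4
AFFECTED=$((ARENAS * COURTS * BOOKINGS_PER_HOUR))    # 640 bookings
echo "Affected bookings: $AFFECTED"
echo "Implied average ticket: R\$ $((80000 / AFFECTED))"   # R$ 125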
2. Disaster Classification
Severity Matrix
| Level | Description | Scope | RTO | Team |
|---|---|---|---|---|
| Level 1 | Single-component failure | 1 instance/service | 5 min | Auto-healing |
| Level 2 | Availability Zone failure | 1 AWS AZ | 15 min | On-call DevOps |
| Level 3 | Regional failure | AWS region | 1 hour | DevOps + Infra Lead |
| Level 4 | Catastrophic disaster | Multi-region | 4 hours | War Room (CTO, CEO) |
Examples by Level
Level 1 - Isolated Failure
- Crashing container
- Unhealthy EC2 instance
- Redis worker offline
- Elevated cache miss rate
Level 2 - Zone Failure
- AWS AZ unavailable
- Subnet/Security Group misconfiguration
- Database replica offline
- Network partition
Level 3 - Regional Failure
- AWS region outage
- DNS/Route53 failure
- Global CDN degradation
- Multi-service cascading failure
Level 4 - Total Disaster
- Ransomware/Cryptolocker
- Full credential compromise
- Cascading data corruption
- Accidental deletion of critical resources
- Volumetric DDoS attack
3. High Availability Architecture
Multi-AZ Topology
┌─────────────────────────────────────────────────────────────────┐
│ AWS Route 53 (DNS) │
│ Cloudflare CDN │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────┐
│ Application Load Balancer (Multi-AZ) │
│ Health Check: /health (interval: 10s) │
└───────────┬───────────────┬───────────────┘
│ │
┌────────────┴───────┐ ┌───┴─────────────────┐
│ AZ us-east-1a │ │ AZ us-east-1b │
│ │ │ │
│ ┌──────────────┐ │ │ ┌──────────────┐ │
│ │ ECS Tasks │ │ │ │ ECS Tasks │ │
│ │ (3-10 inst.)│ │ │ │ (3-10 inst.)│ │
│ └──────┬───────┘ │ │ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌──────▼───────┐ │ │ ┌──────▼───────┐ │
│ │Redis Cluster │ │ │ │Redis Cluster │ │
│ │ (Master) │◄─┼──┼──┤ (Replica) │ │
│ └──────────────┘ │ │ └──────────────┘ │
└────────┬───────────┘ └───────┬────────────┘
│ │
└──────────┬───────────┘
│
┌───────────▼────────────┐
│ RDS PostgreSQL 14 │
│ Primary (us-east-1a) │
│ Read Replica (1b) │
│ Automated Backups │
└────────────────────────┘
Critical Components
3.1 Application Layer
# ECS Service Auto-Scaling
MinInstances: 3
MaxInstances: 10
TargetCPU: 70%
TargetMemory: 80%
# Health Check
Endpoint: /api/health
Interval: 10s
Timeout: 5s
UnhealthyThreshold: 2
HealthyThreshold: 2
# Deployment Strategy
Type: Rolling Update
MaxUnavailable: 25%
MaxSurge: 50%
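The scaling bounds above map onto an Application Auto Scaling target-tracking policy; a minimal sketch using the cluster and service names that appear elsewhere in this plan (adjust to the real resource IDs):
# Register the ECS service as a scalable target (min 3 / max 10 tasks)
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/sport-tech-prod/api-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 3 --max-capacity 10
# Target-tracking on average CPU (70%, per the config above)
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/sport-tech-prod/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name api-cpu-target \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration \
'{"TargetValue":70.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"ECSServiceAverageCPUUtilization"}}'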
3.2 Database Layer (PostgreSQL RDS)
Primary Instance:
- Type: db.r6g.xlarge (4 vCPU, 32GB RAM)
- Storage: 500GB GP3 (12000 IOPS)
- Multi-AZ: Enabled (synchronous replica)
- Automated Backups: Daily, retention 35 days
- Point-in-Time Recovery: 5-minute granularity
Read Replica:
- Cross-AZ for read scaling
- Promotion time: ~1 minute
- Lag monitoring: < 100ms
Connection Pooling:
- PgBouncer (100 connections)
- Application pool: 20 connections/instance
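The < 100ms lag objective above can be checked against the RDS ReplicaLag CloudWatch metric (reported in seconds); a hedged sketch, with the replica identifier being an assumption:
# Average replica lag over the last 15 minutes (ReplicaLag is in seconds)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=sport-tech-prod-replica \
--statistics Average --period 60 \
--start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"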
3.3 Cache Layer (Redis)
ElastiCache Redis Cluster:
- Mode: Cluster Enabled
- Nodes: 6 (3 shards x 2 replicas)
- Instance: cache.r6g.large
- Multi-AZ: Enabled
- Automatic Failover: < 60s
Persistence:
- AOF (Append-Only File): Every second
- RDB Snapshot: Daily
- Backup retention: 7 days
3.4 Object Storage (S3)
Bucket: sport-tech-club-media
- Versioning: Enabled
- Cross-Region Replication: us-west-2
- Lifecycle:
- Transition to IA after 30 days
- Transition to Glacier after 90 days
- Delete after 7 years (compliance)
Bucket: sport-tech-club-backups
- MFA Delete: Enabled
- Encryption: AES-256
- Object Lock: Compliance mode (immutable)
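These protections are worth auditing periodically; a minimal sketch with standard s3api calls:
# Audit the backup bucket's protection settings
aws s3api get-bucket-versioning --bucket sport-tech-club-backups        # expect MFADelete: Enabled
aws s3api get-object-lock-configuration --bucket sport-tech-club-backups
aws s3api get-bucket-encryption --bucket sport-tech-club-backups        # expect AES256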
4. Backup Strategy
4.1 Database Backups
Automated Backups (RDS)
# Configuration
BACKUP_WINDOW="03:00-04:00 UTC" # 00:00-01:00 BRT
RETENTION_PERIOD=35 # days (compliance)
PITR_ENABLED=true # Point-in-Time Recovery
# Restore time
FULL_RESTORE_TIME="~30 minutes" # 500GB database
PITR_RESTORE_TIME="~45 minutes" # + log replay
Manual Snapshots
# Pre-deploy snapshot (critical)
aws rds create-db-snapshot \
--db-instance-identifier sport-tech-prod \
--db-snapshot-identifier prod-pre-deploy-$(date +%Y%m%d-%H%M)
# Retention: indefinite (manual delete)
WAL Archiving (Write-Ahead Logs)
-- PostgreSQL Configuration
archive_mode = on
archive_command = 'aws s3 cp %p s3://sport-tech-wal-archive/%f'
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB
-- RPO: 1 minute (WAL shipping interval)
-- Storage: S3 (replicated to us-west-2)
-- Retention: 7 days
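Whether archiving is keeping up can be verified from the archive bucket and the standard pg_stat_archiver view; a hedged sketch (the endpoint placeholder follows the pattern used elsewhere in this document):
# Confirm WAL segments are landing in the archive bucket
aws s3 ls s3://sport-tech-wal-archive/ --recursive | tail -5
# Confirm the archiver is keeping up on the primary
psql -h sport-tech-prod.xxx.rds.amazonaws.com -U admin -d sporttechdb -c \
"SELECT archived_count, last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;"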
4.2 Application Backups
Code and Configuration
Repository: GitHub Enterprise
- Branch protection: main, develop
- Signed commits: Mandatory
- Review requirement: 2 approvers
Infrastructure as Code:
- Terraform state: S3 + State locking (DynamoDB)
- Versioning: Enabled
- Encryption: At-rest + in-transit
Kubernetes Manifests:
- ArgoCD: GitOps (auto-sync disabled)
- Helm charts: versioned in Git
- Secrets: Sealed Secrets (encrypted)
Secrets and Credentials
# HashiCorp Vault
BACKUP_FREQUENCY="Daily (automated)"
BACKUP_LOCATION="S3 encrypted bucket"
BACKUP_ENCRYPTION="PGP key (offline master key)"
# Backup command
vault operator raft snapshot save backup-$(date +%Y%m%d).snap
aws s3 cp backup-*.snap s3://sport-tech-vault-backup/ \
--sse AES256 --storage-class STANDARD_IA
# Retention: 90 days
# Test restore: monthly
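For the monthly restore test, the snapshot can be loaded into a scratch Vault; a minimal sketch (the scratch address is an assumption, and -force is needed when restoring into a different cluster):
# Restore the most recent snapshot into a non-production Vault
export VAULT_ADDR="https://vault-dr-test.sporttechclub.internal"   # scratch instance (assumed)
LATEST=$(aws s3 ls s3://sport-tech-vault-backup/ | tail -1 | awk '{print $4}')
aws s3 cp "s3://sport-tech-vault-backup/$LATEST" ./latest.snap
vault operator raft snapshot restore -force latest.snap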
4.3 Object Storage (Media)
Arena Videos and Photos:
- Bucket: sport-tech-club-media
- Size: ~2TB (growth: 50GB/month)
- Versioning: Enabled (last 3 versions)
- Cross-Region Replication: us-west-2 (async)
- Lifecycle: IA (30d) → Glacier (90d)
RPO: ~5 minutes (S3 replication time)
RTO: immediate (multi-region access)
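Replication health can be spot-checked per object via head-object, which reports a ReplicationStatus field (the object key here is only an example):
# Spot-check replication status of a recently uploaded object
aws s3api head-object \
--bucket sport-tech-club-media \
--key arenas/example.jpg \
--query ReplicationStatus        # expect COMPLETED (PENDING right after upload)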
4.4 Logs and Metrics
CloudWatch Logs:
- Retention: 90 days (compliance)
- Export to S3: daily (analysis)
- Compression: gzip
Prometheus Metrics:
- Retention: 15 days (local)
- Long-term storage: Thanos (S3)
- Retention: 1 year
Jaeger Traces:
- Hot storage: 7 dias
- Cold storage: 30 days (S3)
4.5 Backup Schedule
| Component | Frequency | Retention | RPO | Location |
|---|---|---|---|---|
| Database | Continuous (WAL) + daily snapshot | 35 days | 1 min | RDS + S3 (multi-region) |
| Redis AOF | Continuous | 7 days | 1 sec | ElastiCache + S3 |
| S3 Media | Continuous (replication) | Versioning (3x) | 5 min | us-east-1 + us-west-2 |
| Vault | Daily | 90 days | 24h | S3 encrypted |
| Code | Every commit | Indefinite | 0 | GitHub |
| Logs | Continuous | 90 days | 1 min | CloudWatch + S3 |
5. Recovery Procedures
5.1 EC2 Instance/Container Failure
Scenario: container crashing, health checks failing
Detection
Alerts:
- ECS Service: UnhealthyHostCount > 0 (1 min)
- ALB: UnhealthyTargetCount > 0 (1 min)
- APM: Error rate > 5% (30s)
Monitoring:
- CloudWatch Container Insights
- Datadog APM
- PagerDuty escalation
Automatic Action (Auto-Healing)
# ECS Service auto-restart
# Triggered by health check failure (2 consecutive fails)
# Force a manual restart (if needed)
aws ecs update-service \
--cluster sport-tech-prod \
--service api-service \
--force-new-deployment
# Recovery time: ~2 minutes
# Impact: zero (other instances keep serving traffic)
Validation
# Check service health
aws ecs describe-services \
--cluster sport-tech-prod \
--services api-service \
--query 'services[0].runningCount'
# Expected: runningCount == desiredCount
# Check ALB targets
aws elbv2 describe-target-health \
--target-group-arn $TG_ARN
# Expected: All targets "healthy"
# Application smoke test
curl -f https://api.sporttechclub.com.br/health
# Expected: 200 OK, response time < 500ms
5.2 Database Corruption
Scenario: data corruption, dropped table, ransomware
Detection
Alerts:
- Data integrity check failure
- Unexpected DELETE/TRUNCATE queries
- Application errors: "relation does not exist"
- Monitoring: Row count drop > 10%
Tools:
- pg_stat_statements (query monitoring)
- CloudWatch RDS logs
- Application error tracking (Sentry)
Recovery Procedure - Point-in-Time Recovery (PITR)
# 1. Identify a timestamp BEFORE the corruption
RESTORE_TIME="2026-01-09T14:30:00Z"
# 2. Create a new instance from the backup
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sport-tech-prod \
--target-db-instance-identifier sport-tech-restored \
--restore-time $RESTORE_TIME \
--db-subnet-group-name sport-tech-db-subnet \
--publicly-accessible false \
--multi-az
# Estimated time: 30-45 minutes (500GB database)
# 3. Wait for availability
aws rds wait db-instance-available \
--db-instance-identifier sport-tech-restored
# 4. Validate the restored data
psql -h sport-tech-restored.abc123.us-east-1.rds.amazonaws.com \
-U admin -d sporttechdb -c "SELECT COUNT(*) FROM bookings;"
# 5. Promote the restored instance (once validated)
# Update DNS/connection string to point to restored instance
# 6. Investigate the root cause
# Check application logs, audit RDS query logs
Validation
-- Check data integrity
SELECT
schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
-- Validate business rules
SELECT COUNT(*) FROM bookings WHERE status = 'confirmed';
SELECT SUM(amount) FROM payments WHERE created_at >= NOW() - INTERVAL '7 days';
-- Check foreign key constraints
SELECT conname, conrelid::regclass, confrelid::regclass
FROM pg_constraint WHERE contype = 'f';
Communication
T+0min: Detect the anomaly
T+5min: Confirm corruption, open war room
T+10min: Update status page: "Investigating a data inconsistency"
T+45min: Database restored, validation in progress
T+60min: Service restored, post-mortem scheduled
5.3 Availability Zone Failure
Scenario: AWS AZ us-east-1a unavailable
Detection
AWS Health Dashboard:
- AZ status: Degraded/Unavailable
- ECS Tasks: Unhealthy in AZ-a
- RDS: Automatic failover initiated
Alerts:
- CloudWatch: AZ-a metrics flatlined
- Datadog: Host availability < 50%
- PagerDuty: High-priority alert
Automatic Action
RDS Multi-AZ Failover:
- Detection: 60-120 seconds
- Automatic failover to standby (us-east-1b)
- DNS update: 30-60 seconds
- Total downtime: ~2 minutes
ECS Service:
- ALB reroute traffic to healthy AZ
- Auto-scaling spawn new tasks in AZ-b
- Downtime: ~0 seconds (rolling)
Redis Cluster:
- Automatic failover to replica
- Promotion time: 30-60 seconds
- Application retry handles transition
Manual Action (If Needed)
# 1. Check the failover status
aws rds describe-db-instances \
--db-instance-identifier sport-tech-prod \
--query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]'
# Expected: ["available", "us-east-1b"]
# 2. Force ECS task redistribution
aws ecs update-service \
--cluster sport-tech-prod \
--service api-service \
--force-new-deployment
# 3. Validate distribution (count the ARNs as words, not JSON lines)
aws ecs list-tasks \
--cluster sport-tech-prod \
--service-name api-service \
--query 'taskArns' --output text | wc -w
# Expected: tasks distributed across AZ-b and AZ-c
Validation
# Health check all components
./scripts/health-check-all.sh
# Expected output:
# ✓ ALB: Healthy (all targets)
# ✓ ECS: 8/8 tasks running (us-east-1b, us-east-1c)
# ✓ RDS: Available (us-east-1b - PRIMARY)
# ✓ Redis: Cluster healthy (3/3 shards)
# ✓ API Response time: 180ms (avg)
5.4 Ransomware Attack
Scenario: cryptolocker, compromised credentials, malicious deletion
Detection
Indicators:
- Mass deletion of S3 objects
- Truncated database tables
- Suspicious access logs (unusual IP, atypical hours)
- Alerts: GuardDuty, CloudTrail anomalies
- Ransom note in S3 buckets
CloudTrail Events:
- s3:DeleteObject (batch)
- rds:DeleteDBInstance
- iam:CreateAccessKey (unauthorized)
Immediate Containment (T+0 to T+15min)
# 1. ISOLATE COMPROMISED SYSTEMS
# Revoke suspicious credentials
aws iam update-access-key \
--access-key-id $SUSPECTED_KEY \
--status Inactive \
--user-name $USER
# Disable IAM users
aws iam delete-login-profile --user-name $COMPROMISED_USER
aws iam list-access-keys --user-name $COMPROMISED_USER | \
jq -r '.AccessKeyMetadata[].AccessKeyId' | \
xargs -I {} aws iam delete-access-key --user-name $COMPROMISED_USER --access-key-id {}
# 2. BLOCK SUSPICIOUS TRAFFIC
# Update Security Groups (quote the shorthand so the shell doesn't expand the brackets)
aws ec2 revoke-security-group-ingress \
--group-id $SG_ID \
--ip-permissions "IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=$MALICIOUS_IP/32}]"
# 3. IMMEDIATE SNAPSHOT (forensics)
aws rds create-db-snapshot \
--db-instance-identifier sport-tech-prod \
--db-snapshot-identifier incident-ransomware-$(date +%Y%m%d-%H%M)
# 4. ACTIVATE WAR ROOM
# Notify: CTO, CEO, DevOps Lead, Security Lead
Recovery (T+15min to T+60min)
# 1. RESTORE DATABASE (clean copy)
# Use a backup from BEFORE the attack (CloudTrail timestamp)
ATTACK_TIME="2026-01-09T10:45:00Z"
SAFE_RESTORE_TIME="2026-01-09T10:30:00Z" # 15 min earlier
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sport-tech-prod \
--target-db-instance-identifier sport-tech-clean \
--restore-time $SAFE_RESTORE_TIME \
--multi-az
# 2. RESTORE S3 OBJECTS (versioning)
# Script to remove delete markers and expose the prior versions
aws s3api list-object-versions \
--bucket sport-tech-club-media \
--prefix "arenas/" \
--query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' \
--output text | \
while read key version; do
aws s3api delete-object \
--bucket sport-tech-club-media \
--key "$key" \
--version-id "$version"
done
# 3. REDEPLOY APPLICATION (clean images)
# Rebuild from a clean Git commit
git checkout main
git reset --hard $LAST_KNOWN_GOOD_COMMIT
docker build -t sport-tech-api:clean .
docker push sport-tech-api:clean
# Deploy
kubectl set image deployment/api-deployment \
api=sport-tech-api:clean
# 4. ROTATE ALL SECRETS
# Database passwords (record the new master password in Vault, as in runbook 6.5)
aws rds modify-db-instance \
--db-instance-identifier sport-tech-clean \
--master-user-password $(openssl rand -base64 32)
# API keys, JWT secrets
vault kv put secret/sport-tech/prod \
jwt_secret=$(openssl rand -base64 64) \
api_key=$(openssl rand -base64 32)
Communication
T+0min: Detect attack, activate war room
T+5min: Containment started, systems isolated
T+15min: Status page: "Emergency maintenance"
T+30min: Recovery in progress (do not mention ransomware publicly)
T+60min: Systems restored, validation under way
T+90min: Status page: "Services restored. Investigation continues."
T+24h: Customer email: "Incident resolved, no data leaked"
T+7d: Public post-mortem (optional, depending on the case)
5.5 Human Error (Accidental Delete)
Scenario: a developer runs DROP TABLE in production
Detection
Alerts:
- Application errors: "relation does not exist"
- Monitoring: Table row count = 0
- CloudWatch RDS Events: DDL statement executed
Notification:
- Developer reports the mistake immediately
- Slackbot: #incidents channel
Procedure
# 1. STOP NEW WRITES (prevent data loss)
# Put the application in read-only mode
kubectl set env deployment/api-deployment \
READ_ONLY_MODE=true
# 2. IDENTIFY THE TIMESTAMP OF THE DELETE
# RDS query logs
aws rds download-db-log-file-portion \
--db-instance-identifier sport-tech-prod \
--log-file-name error/postgresql.log.2026-01-09-14 \
--output text | grep "DROP TABLE"
# Output: 2026-01-09 14:32:15 UTC [12345]: DROP TABLE bookings CASCADE;
# 3. RESTORE THE TABLE (PITR)
RESTORE_TIME="2026-01-09T14:30:00Z" # 2 min before the DROP
# Create a temporary instance
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier sport-tech-prod \
--target-db-instance-identifier sport-tech-temp-restore \
--restore-time $RESTORE_TIME
# 4. EXTRACT THE TABLE DATA
pg_dump -h sport-tech-temp-restore.xxx.rds.amazonaws.com \
-U admin -d sporttechdb \
--table=bookings --data-only \
--file=bookings_restore.sql
# 5. RECREATE THE STRUCTURE (if needed)
psql -h sport-tech-prod.xxx.rds.amazonaws.com \
-U admin -d sporttechdb < schema/bookings_table.sql
# 6. IMPORT THE DATA
psql -h sport-tech-prod.xxx.rds.amazonaws.com \
-U admin -d sporttechdb < bookings_restore.sql
# 7. DISABLE READ-ONLY MODE (the trailing "-" unsets the env var)
kubectl set env deployment/api-deployment \
READ_ONLY_MODE-
# 8. CLEANUP
aws rds delete-db-instance \
--db-instance-identifier sport-tech-temp-restore \
--skip-final-snapshot
# Total time: ~45 minutes
5.6 Integration Failure (Ziggy Offline)
Scenario: Ziggy (turnstiles, payments) unreachable for 2+ hours
Detection
Alerts:
- HTTP 503 errors from the Ziggy API
- Payment webhooks not received (15+ min)
- Arena check-ins failing
- Monitoring: Ziggy uptime < 99%
Impact:
- Customers unable to access courts
- Payments not processed
- Bookings not validated
Degraded Mode (Failover)
// Activate offline mode
// app/config/feature-flags.ts
export const FeatureFlags = {
ZIGGY_OFFLINE_MODE: true, // Manual toggle or auto-detect
OFFLINE_GRACE_PERIOD: 60 * 60 * 2, // 2 hours
}
// app/services/access-control.service.ts
async validateAccess(bookingId: string): Promise<AccessResult> {
// Try Ziggy first
try {
return await this.ziggyClient.validateAccess(bookingId);
} catch (error) {
if (!FeatureFlags.ZIGGY_OFFLINE_MODE) throw error;
// Fallback: validate locally
const booking = await this.bookingRepo.findById(bookingId);
if (this.isValidForOfflineAccess(booking)) {
// Log for later synchronization
await this.offlineAccessLog.create({
bookingId,
timestamp: new Date(),
method: 'offline_validation'
});
return { allowed: true, mode: 'offline' };
}
return { allowed: false, reason: 'offline_validation_failed' };
}
}
isValidForOfflineAccess(booking: Booking): boolean {
// Business rules for offline mode
return (
booking.status === 'confirmed' &&
booking.paymentStatus === 'paid' &&
booking.startTime <= new Date() &&
booking.endTime >= new Date()
);
}
Communication
T+0min: Ziggy outage detected
T+5min: Degraded mode activated automatically
T+10min: Notify arenas: "Turnstiles in offline mode. Automatic release enabled."
T+15min: Status page: "Payment processing temporarily unavailable"
T+30min: Escalate to the Ziggy support team
T+2h: Evaluate alternatives (backup payment gateway)
T+4h: Ziggy restored, automatic synchronization
T+24h: Validate that all check-ins are synchronized
6. Runbooks
6.1 Database Restore (Full)
#!/bin/bash
# runbook-db-restore.sh
set -euo pipefail
# Configuration
SOURCE_DB="sport-tech-prod"
RESTORE_TIME="${1:-}" # Pass as argument or use latest
TARGET_DB="sport-tech-restored-$(date +%Y%m%d-%H%M)"
# Validation
if [ -z "$RESTORE_TIME" ]; then
echo "Usage: ./runbook-db-restore.sh '2026-01-09T14:30:00Z'"
exit 1
fi
# 1. Create the restored instance
echo "[1/7] Creating restored DB instance..."
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier $SOURCE_DB \
--target-db-instance-identifier $TARGET_DB \
--restore-time "$RESTORE_TIME" \
--db-subnet-group-name sport-tech-db-subnet \
--vpc-security-group-ids sg-0abc123def456 \
--multi-az \
--no-publicly-accessible
# 2. Wait for availability
echo "[2/7] Waiting for DB to become available (30-45 min)..."
aws rds wait db-instance-available \
--db-instance-identifier $TARGET_DB
# 3. Get the endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
--db-instance-identifier $TARGET_DB \
--query 'DBInstances[0].Endpoint.Address' \
--output text)
echo "[3/7] Restored DB endpoint: $RESTORED_ENDPOINT"
# 4. Validate the data
echo "[4/7] Validating restored data..."
psql -h $RESTORED_ENDPOINT -U admin -d sporttechdb -c "
SELECT
(SELECT COUNT(*) FROM bookings) as bookings_count,
(SELECT COUNT(*) FROM arenas) as arenas_count,
(SELECT MAX(created_at) FROM bookings) as latest_booking
"
# 5. Run integrity tests
echo "[5/7] Running integrity tests..."
./scripts/db-integrity-tests.sh $RESTORED_ENDPOINT
# 6. Next steps
echo "[6/7] Restore complete!"
echo ""
echo "Next steps:"
echo "1. Validate data thoroughly"
echo "2. Update application config to point to: $RESTORED_ENDPOINT"
echo "3. Test application with restored DB"
echo "4. Promote to production (rename endpoint or update DNS)"
echo ""
echo "Rollback: Keep old instance for 24h before deleting"6.2 Failover Manual (RDS)
#!/bin/bash
# runbook-db-failover.sh
set -euo pipefail
DB_INSTANCE="sport-tech-prod"
# Pre-checks
echo "[Pre-check] Verifying Multi-AZ configuration..."
MULTI_AZ=$(aws rds describe-db-instances \
--db-instance-identifier $DB_INSTANCE \
--query 'DBInstances[0].MultiAZ' \
--output text)
if [ "$MULTI_AZ" != "True" ]; then
echo "ERROR: Multi-AZ not enabled. Cannot perform failover."
exit 1
fi
echo "[Pre-check] Current status..."
aws rds describe-db-instances \
--db-instance-identifier $DB_INSTANCE \
--query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]' \
--output table
# Execute the failover
echo "[Failover] Initiating reboot with failover..."
START_TIME=$(date +%s)
aws rds reboot-db-instance \
--db-instance-identifier $DB_INSTANCE \
--force-failover
# Wait for completion
echo "[Failover] Waiting for completion..."
aws rds wait db-instance-available \
--db-instance-identifier $DB_INSTANCE
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
# Validation
echo "[Validation] New status..."
aws rds describe-db-instances \
--db-instance-identifier $DB_INSTANCE \
--query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]' \
--output table
echo ""
echo "Failover complete in ${DURATION}s"6.3 Rollback de Deploy
#!/bin/bash
# runbook-deploy-rollback.sh
set -euo pipefail
# Configuration
NAMESPACE="production"
DEPLOYMENT="api-deployment"
# 1. Get the current and previous revisions
echo "[1/5] Getting deployment history..."
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE
# The rollout revision lives in the deployment.kubernetes.io/revision annotation
# (kubectl rollout history has no JSON output, and .metadata.generation is not the revision)
CURRENT_REVISION=$(kubectl get deployment/$DEPLOYMENT -n $NAMESPACE \
-o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
PREVIOUS_REVISION=$((CURRENT_REVISION - 1))
echo "Current revision: $CURRENT_REVISION"
echo "Will rollback to: $PREVIOUS_REVISION"
# 2. Snapshot the current state (for analysis)
echo "[2/5] Capturing current state..."
kubectl get deployment/$DEPLOYMENT -n $NAMESPACE -o yaml > \
/tmp/deployment-before-rollback-$(date +%Y%m%d-%H%M).yaml
# 3. Execute the rollback
echo "[3/5] Rolling back deployment..."
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE \
--to-revision=$PREVIOUS_REVISION
# 4. Wait for completion
echo "[4/5] Waiting for rollback to complete..."
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE \
--timeout=5m
# 5. Validation
echo "[5/5] Validating rolled back deployment..."
kubectl get pods -n $NAMESPACE -l app=api -o wide
# Health check
echo "Waiting 30s for pods to stabilize..."
sleep 30
HEALTHY_PODS=$(kubectl get pods -n $NAMESPACE -l app=api \
--field-selector=status.phase=Running \
--output=json | jq -r '.items | length')
TOTAL_PODS=$(kubectl get deployment/$DEPLOYMENT -n $NAMESPACE \
-o jsonpath='{.spec.replicas}')
echo "Healthy pods: $HEALTHY_PODS / $TOTAL_PODS"
if [ "$HEALTHY_PODS" -eq "$TOTAL_PODS" ]; then
echo "✓ Rollback successful"
./scripts/smoke-tests.sh
else
echo "✗ Rollback incomplete. Manual intervention required."
exit 1
fi
6.4 Cache Invalidation
#!/bin/bash
# runbook-cache-invalidation.sh
set -euo pipefail
# Configuration
REDIS_HOST="sport-tech-redis.abc123.cache.amazonaws.com"
REDIS_PORT=6379
CLOUDFRONT_DIST="E1234ABCDEF5G"
# 1. Invalidate Redis
echo "[1/3] Flushing Redis cache..."
PATTERN="${1:-arena:*}" # Default: arena-related keys
# Cluster mode is enabled (section 3.3): a multi-key DEL across hash slots
# fails with CROSSSLOT, so delete one key per call; -c follows MOVED redirects
redis-cli -h $REDIS_HOST -p $REDIS_PORT --scan --pattern "$PATTERN" | \
xargs -n 1 redis-cli -c -h $REDIS_HOST -p $REDIS_PORT DEL
echo "✓ Redis keys matching '$PATTERN' deleted"
# 2. Invalidate CloudFront
echo "[2/3] Invalidating CloudFront distribution..."
PATHS="${2:-/*}" # Default: all paths
INVALIDATION_ID=$(aws cloudfront create-invalidation \
--distribution-id $CLOUDFRONT_DIST \
--paths "$PATHS" \
--query 'Invalidation.Id' \
--output text)
echo "Invalidation ID: $INVALIDATION_ID"
# Wait for completion (optional)
echo "Waiting for invalidation to complete (5-15 min)..."
aws cloudfront wait invalidation-completed \
--distribution-id $CLOUDFRONT_DIST \
--id $INVALIDATION_ID
echo "✓ CloudFront invalidation complete"
# 3. Validation
echo "[3/3] Validating cache invalidation..."
REDIS_KEYS=$(redis-cli -h $REDIS_HOST -p $REDIS_PORT --scan --pattern "$PATTERN" | wc -l)
echo "Redis keys matching '$PATTERN': $REDIS_KEYS (should be 0)"
RESPONSE_TIME=$(curl -o /dev/null -s -w '%{time_total}' https://cdn.sporttechclub.com.br/)
echo "CloudFront response time: ${RESPONSE_TIME}s (higher = cache miss)"
echo "✓ Cache invalidation complete"6.5 Rotação de Secrets
#!/bin/bash
# runbook-secrets-rotation.sh
set -euo pipefail
# Configuration
ENVIRONMENT="production"
VAULT_ADDR="https://vault.sporttechclub.internal"
NAMESPACE="production"
# 1. Database password
echo "[1/4] Rotating database password..."
NEW_DB_PASSWORD=$(openssl rand -base64 32)
# Update RDS
aws rds modify-db-instance \
--db-instance-identifier sport-tech-prod \
--master-user-password "$NEW_DB_PASSWORD" \
--apply-immediately
# Update Vault
vault kv put secret/sport-tech/$ENVIRONMENT/database \
password="$NEW_DB_PASSWORD"
# Update Kubernetes secret
kubectl create secret generic db-credentials \
--from-literal=password="$NEW_DB_PASSWORD" \
--namespace=$NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
echo "✓ Database password rotated"
# 2. JWT Secret
echo "[2/4] Rotating JWT secret..."
NEW_JWT_SECRET=$(openssl rand -base64 64)
vault kv put secret/sport-tech/$ENVIRONMENT/jwt \
secret="$NEW_JWT_SECRET"
kubectl create secret generic jwt-secret \
--from-literal=secret="$NEW_JWT_SECRET" \
--namespace=$NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
echo "✓ JWT secret rotated"
# 3. API Keys (external services)
echo "[3/4] Rotating external API keys..."
echo "⚠ Manual step: Generate new Ziggy API key and update vault"
echo "⚠ Manual step: Rotate Stripe API key in dashboard and update vault"
# 4. Restart deployments to pick up the new secrets
echo "[4/4] Restarting deployments to apply new secrets..."
kubectl rollout restart deployment/api-deployment -n $NAMESPACE
kubectl rollout restart deployment/worker-deployment -n $NAMESPACE
# Wait for completion
kubectl rollout status deployment/api-deployment -n $NAMESPACE --timeout=5m
kubectl rollout status deployment/worker-deployment -n $NAMESPACE --timeout=5m
echo "✓ All secrets rotated successfully"
echo ""
echo "Next steps:"
echo "1. Update manual API keys (Ziggy, Stripe)"
echo "2. Validate application functionality"
echo "3. Monitor for authentication errors (24h)"7. Testes de DR
7.1 Test Schedule
| Test | Frequency | Duration | Participants | Next Test |
|---|---|---|---|---|
| RDS Failover | Quarterly | 30 min | DevOps | 2026-03-15 |
| Database Restore | Quarterly | 2h | DevOps + QA | 2026-03-20 |
| Deploy Rollback | Monthly | 30 min | DevOps | 2026-02-05 |
| War Room Simulation | Semiannual | 4h | All of engineering | 2026-06-10 |
| Degraded Mode | Quarterly | 1h | DevOps + Product | 2026-04-12 |
7.2 Test Plan - Database Restore
Objectives
- Validate RPO (< 15 minutes of data loss)
- Validate RTO (< 1 hour of recovery time)
- Exercise the restore runbook
- Train the team
Procedure
1. Preparation (T-30min)
- Schedule a maintenance window (Sunday 03:00-06:00 BRT)
- Notify stakeholders
- Prepare the test environment
- Clone production to staging
2. Disaster Simulation (T+0min)
- Mark the "incident" timestamp
- Delete nothing (test environment)
3. Restore Execution (T+5min)
- Follow the runbook: runbook-db-restore.sh
- Document every step
- Time the duration
4. Validation (T+45min)
- Compare record counts
- Validate referential integrity
- Run critical business queries
- Test the application connection
5. Metrics (T+90min)
- RTO met? (target: < 60min)
- RPO met? (target: < 15min of loss)
- Does the runbook need adjustments?
- Did the team feel confident?
6. Cleanup (T+120min)
- Delete the test instance
- Document lessons learned
- Update the runbook if needed
Success Criteria
Success Criteria:
- Database restored within 60 minutes: PASS/FAIL
- Data loss < 15 minutes: PASS/FAIL
- Application connects successfully: PASS/FAIL
- All foreign keys valid: PASS/FAIL
- Runbook complete and clear: PASS/FAIL
- Team confident in procedure: PASS/FAIL
Minimum passing: 5/6 PASS
7.3 War Room Simulation (Full DR Drill)
Scenario: Multi-Component Failure
Scenario: AWS region us-east-1 degraded
- RDS Primary: Unresponsive
- ECS Tasks: 50% failing health checks
- Redis: Cluster degraded
- S3: Elevated error rates
Business Impact:
- 100% of users unable to access platform
- Bookings cannot be created
- Payments failing
- Arenas calling support
Simulation Timeline
T+0min (08:00): Inject failures into the staging environment
- Simulate an AZ failure (shut down 50% of resources)
- Introduce database latency
- Rate limit S3 API calls
T+2min: Alerts begin to fire
- PagerDuty pages the on-call engineer
- Automatic post in Slack #incidents
T+5min: On-call engineer joins the war room
- Assess situation
- Declare P0 incident
- Notify stakeholders
T+10min: Execute recovery runbooks
- Database failover
- Force ECS redeployment
- Activate degraded mode
T+30min: Recovery validation
- Health checks green
- Smoke tests passing
- User flows working
T+45min: External communication
- Status page updated
- Emails to affected customers
T+60min: Post-mortem started
- Document the timeline
- Identify gaps
- Action items
8. Crisis Communication
8.1 Status Page
URL: https://status.sporttechclub.com.br
Tool: Atlassian Statuspage
States
Operational: all systems operating normally
Degraded Performance: slow, but functional
Partial Outage: some components unavailable
Major Outage: platform completely unavailable
Under Maintenance: scheduled maintenance
Monitored Components
Components:
- name: "API - Reservas"
impact: high
sla: 99.9%
- name: "API - Pagamentos"
impact: critical
sla: 99.9%
- name: "Acesso às Quadras (Catracas)"
impact: critical
sla: 99.5%
- name: "App Mobile"
impact: medium
sla: 99.0%
- name: "Painel Administrativo"
impact: low
sla: 99.0%
- name: "Integração Ziggy"
impact: high
sla: 99.0% (external SLA)
Message Templates
Investigating
🔍 Investigating: API Slowness
We are investigating reports of slow response times on the
booking API. Users may experience delays when creating or
viewing reservations.
We will provide an update in 15 minutes.
Posted: 2026-01-09 14:35 BRT
Identified
🛠️ Identified: Database Performance Issue
We have identified the root cause as elevated database load
due to a long-running query. Our team is implementing a fix.
ETA for resolution: 30 minutes
Workaround: Mobile app may still be slow. Please retry if
bookings fail.
Posted: 2026-01-09 14:50 BRT
Monitoring
👀 Monitoring: Fix Implemented
The performance issue has been resolved. We are monitoring
the system to ensure stability.
All systems operational. Response times back to normal.
Posted: 2026-01-09 15:25 BRT
Resolved
✅ Resolved: API Performance Restored
The booking API performance issue has been fully resolved.
Timeline:
- 14:30 BRT: Issue detected
- 14:50 BRT: Root cause identified
- 15:20 BRT: Fix deployed
- 15:30 BRT: Confirmed resolution
Impact: ~5% of booking attempts experienced delays (retry successful)
Root cause: Unoptimized database query
Prevention: Query optimization + monitoring alerts
We apologize for the inconvenience.
Posted: 2026-01-09 15:35 BRT
8.2 Internal Escalation
┌─────────────────────────────────────────────────────────┐
│ INCIDENT SEVERITY │
├─────────────┬───────────────────────────────────────────┤
│ P0 │ Platform Down / Data Loss / Security │
│ CRITICAL │ Notify: CTO, CEO, DevOps Lead (immediate)│
│ │ War Room: Zoom + #war-room Slack │
├─────────────┼───────────────────────────────────────────┤
│ P1 │ Major Feature Down / Payment Failing │
│ HIGH │ Notify: DevOps Lead, Product Lead (15min) │
│ │ Response: On-call engineer │
├─────────────┼───────────────────────────────────────────┤
│ P2 │ Degraded Performance / Minor Feature Down │
│ MEDIUM │ Notify: On-call engineer (30min) │
│ │ Response: During business hours │
├─────────────┼───────────────────────────────────────────┤
│ P3 │ Cosmetic Issue / Low Impact │
│ LOW │ Notify: Backlog (next sprint) │
└─────────────┴───────────────────────────────────────────┘
On-Call Schedule
Week 1 (Jan 01-07): João Silva (DevOps)
Week 2 (Jan 08-14): Maria Santos (Backend Lead)
Week 3 (Jan 15-21): Pedro Costa (DevOps)
Week 4 (Jan 22-28): Ana Oliveira (Backend)
Backup: Thiago Nicolussi (DevOps Lead) - Always available
Escalation:
1. On-call engineer (PagerDuty)
2. Backup on-call (+15min)
3. DevOps Lead (+30min)
4. CTO (+1h for P0)
9. Degraded Mode
9.1 Essential Functionality
Priority 0 (must always work):
- Court access (check-in)
- View existing bookings
- Emergency contact with the arena
Priority 1 (may degrade):
- Create new bookings (fallback: phone call)
- Process payments (fallback: pay at the arena)
- Push notifications
Priority 2 (may go offline):
- Analytics/reports
- Profile settings
- Photo uploads
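One way to wire these tiers into the platform is per-tier environment flags flipped at incident time; a hypothetical sketch (the variable names are illustrative, not an existing contract):
# Flip degraded-mode flags per priority tier (variable names are illustrative)
kubectl set env deployment/api-deployment -n production \
DEGRADED_P1_BOOKINGS=fallback_phone \
DEGRADED_P1_PAYMENTS=pay_at_arena \
DEGRADED_P2_ANALYTICS=off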
9.2 Offline Fallback for Turnstiles
Architecture
┌─────────────────────────────────────────────────────┐
│ Online Mode │
│ Arena Catraca ──https──> Sport Tech API ──> Ziggy │
└─────────────────────────────────────────────────────┘
│
Timeout / 503
▼
┌─────────────────────────────────────────────────────┐
│ Offline Mode │
│ Arena Catraca ──local cache──> Approve/Deny │
│ │ │
│ └──sync queue──> Sport Tech API (when back) │
└─────────────────────────────────────────────────────┘
Implementation - Local Gateway
// catraca-gateway/src/offline-mode.ts
interface CachedBooking {
bookingId: string;
userId: string;
arenaId: string;
startTime: Date;
endTime: Date;
status: 'confirmed' | 'pending';
lastSynced: Date;
}
class OfflineAccessController {
private cache: Map<string, CachedBooking> = new Map();
private syncQueue: AccessLog[] = [];
private offlineMode = false;
// Sync bookings every 5 minutes
@Cron('*/5 * * * *')
async syncBookings() {
try {
const upcomingBookings = await this.api.getUpcomingBookings({
arenaId: this.arenaId,
startTime: new Date(),
endTime: addHours(new Date(), 24) // next 24h (addHours as in date-fns)
});
// Refresh the local cache
upcomingBookings.forEach(booking => {
this.cache.set(booking.id, {
...booking,
lastSynced: new Date()
});
});
this.offlineMode = false;
} catch (error) {
this.offlineMode = true;
}
}
// Validate access (online or offline)
async validateAccess(bookingId: string): Promise<AccessResult> {
// Try online first
if (!this.offlineMode) {
try {
return await this.api.validateAccess(bookingId);
} catch (error) {
this.offlineMode = true;
}
}
// Offline fallback
return this.validateAccessOffline(bookingId);
}
private validateAccessOffline(bookingId: string): AccessResult {
const booking = this.cache.get(bookingId);
if (!booking) {
return {
allowed: false,
reason: 'Booking not found in offline cache',
mode: 'offline'
};
}
const now = new Date();
const gracePeriod = 15 * 60 * 1000; // 15 minutes
const isValidTime = (
now >= new Date(booking.startTime.getTime() - gracePeriod) &&
now <= booking.endTime
);
const isConfirmed = booking.status === 'confirmed';
if (isValidTime && isConfirmed) {
this.syncQueue.push({
bookingId,
timestamp: now,
decision: 'allowed',
mode: 'offline'
});
return {
allowed: true,
mode: 'offline',
warning: 'Validated offline - pending sync'
};
}
return {
allowed: false,
reason: 'Invalid time or status (offline mode)',
mode: 'offline'
};
}
}
10. Costs
10.1 DR Infrastructure
| Component | DR Configuration | Monthly Cost (USD) | Rationale |
|---|---|---|---|
| RDS Multi-AZ | db.r6g.xlarge | $550 | Automatic failover < 2min |
| RDS Backups | 35 days, 500GB | $75 | Compliance + recovery |
| S3 Replication | Cross-region (2TB) | $120 | Media redundancy |
| S3 Versioning | 3 versions (6TB total) | $90 | Object rollback |
| WAL Archive (S3) | 100GB/month | $10 | Granular PITR |
| ElastiCache Replica | Multi-AZ (6 nodes) | $400 | Cache HA |
| ECS Auto-scaling | 3-10 instances | $300-$1000 | Elasticity |
| CloudFront | Global CDN | $150 | Regional failover |
| Monitoring (Datadog) | Infra + APM | $300 | Fast detection |
| MONTHLY TOTAL | | $2,000 - $2,700 | |
| ANNUAL TOTAL | | ~$24,000 - $32,000 | |
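A quick check that the monthly total matches the line items (auto-scaling is the only variable row):
# Fixed line items plus the ECS auto-scaling range
FIXED=$((550 + 75 + 120 + 90 + 10 + 400 + 150 + 300))   # $1,695
echo "Low: \$$((FIXED + 300))"     # $1,995 -> rounded to $2,000
echo "High: \$$((FIXED + 1000))"   # $2,695 -> rounded to $2,700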
10.2 Cost vs Recovery Time Trade-offs
┌──────────────────────────────────────────────────────┐
│ RTO vs Cost (Database) │
├──────────────────┬──────────────┬────────────────────┤
│ RTO < 1 min │ $1,200/month │ Multi-AZ + Read │
│ (Hot Standby) │ │ Replica + Aurora │
├──────────────────┼──────────────┼────────────────────┤
│ RTO < 5 min │ $600/month │ Multi-AZ only │
│ (Warm Standby) │ │ (current) │
├──────────────────┼──────────────┼────────────────────┤
│ RTO < 1 hour │ $150/month │ Automated │
│ (Cold Backup) │ │ snapshots only │
├──────────────────┼──────────────┼────────────────────┤
│ RTO < 4 hours │ $50/month │ Manual restore │
│ (Manual) │ │ from S3 export │
└──────────────────┴──────────────┴────────────────────┘
Current decision: RTO < 5 min (Multi-AZ) = $600/month
Rationale: balance between cost and the 99.9% SLA
11. Appendices
11.1 Emergency Contacts
Internal:
- name: Thiago Nicolussi
role: DevOps Lead
phone: +55 11 99999-0001
email: thiago.nicolussi@sporttechclub.com.br
timezone: America/Sao_Paulo
External:
- name: AWS Support
level: Enterprise
phone: +1 (877) 662-0100
portal: https://console.aws.amazon.com/support/
- name: Ziggy Support
level: Premium
phone: +55 11 3333-4444
email: suporte@ziggy.com.br
sla: 2h response time
11.2 DR Tooling
Monitoring & Alerting:
- Datadog: https://app.datadoghq.com/
- PagerDuty: https://sporttechclub.pagerduty.com/
- Statuspage: https://manage.statuspage.io/
Cloud Management:
- AWS Console: https://console.aws.amazon.com/
- AWS CLI: installed on bastion hosts
- Terraform Cloud: https://app.terraform.io/
Communication:
- Slack: #incidents, #war-room
- Zoom: War room link (pinned in Slack)
- Email: incidents@sporttechclub.com.br
Documentation:
- Runbooks: GitHub (sport-tech-club/docs/runbooks/)
- Post-mortems: GitHub (docs/post-mortems/)
11.3 Changelog
Version 1.0 (2026-01-09)
- Initial DR plan created
- Defined RTO (1h) and RPO (15min)
- Documented runbooks for common scenarios
- Established testing schedule
Future Improvements
- [ ] Implement automated DR drills (Chaos Engineering)
- [ ] Add multi-region failover (us-west-2)
- [ ] Integrate Vault auto-unseal with AWS KMS
- [ ] Implement blue-green deployment strategy
- [ ] Create disaster recovery mobile app
- [ ] Improve observability with distributed tracing
12. Next Steps
Short Term (Q1 2026)
□ Run the first full DR drill (War Room Simulation)
□ Automate critical runbooks (Terraform, Ansible)
□ Implement automatic cache warming
□ Configure cross-region replication (S3)
□ Train the entire engineering team on the runbooks
□ Validate the Ziggy integration in degraded mode
Medium Term (Q2-Q3 2026)
□ Multi-region deployment (us-west-2 standby)
□ Chaos Engineering: monthly GameDays
□ Advanced observability (Honeycomb, Grafana)
□ Automated failover to the secondary region
□ Service mesh (Istio) for traffic management
□ Granular feature flags (LaunchDarkly)
Long Term (2027)
□ ISO 27001 certification
□ Global load balancing (GeoDNS)
□ Zero-downtime deployment (blue-green + canary)
□ Self-healing infrastructure (Kubernetes operators)
□ AI-powered incident detection (AIOps)
Living document. Last review: 2026-01-09
Next scheduled review: 2026-04-09 (quarterly)
Approvals:
- [x] Thiago Nicolussi - DevOps Lead
- [ ] [CTO Name] - CTO
- [ ] [CEO Name] - CEO
"Hope is not a strategy. Prepare for the worst, expect the best."