
Disaster Recovery Plan - Sport Tech Club

Status: Active | Version: 1.0 | Last Review: 2026-01-09 | Owner: DevOps Team | Approved by: CTO


Table of Contents

  1. Objectives and Metrics
  2. Disaster Classification
  3. High Availability Architecture
  4. Backup Strategy
  5. Recovery Procedures
  6. Runbooks
  7. DR Testing
  8. Crisis Communication
  9. Degraded Mode
  10. Costs

1. Objectives and Metrics

Service Level Objectives (SLOs)

Metric | Target | Rationale
RTO (Recovery Time Objective) | 1 hour | Maximum acceptable downtime for arena operations
RPO (Recovery Point Objective) | 15 minutes | Maximum tolerable data loss (bookings, payments)
SLA (Service Level Agreement) | 99.9% | ~8.76h of downtime per year (~43min/month)
MTTR (Mean Time To Recovery) | < 30 minutes | Average time to recover from incidents
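The SLA figure maps to a concrete downtime budget; the arithmetic can be checked with a quick shell sketch (the 99.9% target comes from the table above; everything else is derived):

```shell
# Downtime budget implied by a 99.9% availability target.
SLA=99.9
YEARLY_MIN=$(awk -v s="$SLA" 'BEGIN { printf "%.1f", (100 - s) / 100 * 365 * 24 * 60 }')
MONTHLY_MIN=$(awk -v y="$YEARLY_MIN" 'BEGIN { printf "%.1f", y / 12 }')
YEARLY_H=$(awk -v y="$YEARLY_MIN" 'BEGIN { printf "%.2f", y / 60 }')
echo "Allowed downtime: ${YEARLY_H} h/year (~${MONTHLY_MIN} min/month)"
# -> Allowed downtime: 8.76 h/year (~43.8 min/month)
```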

Business Impact

One hour of downtime:
- 20 arenas x 8 courts x 4 bookings/hour = 640 affected bookings
- ~R$ 80,000 in lost revenue
- Reputational damage
- Contractual penalties (SLA with arenas)
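The affected-bookings figure follows from the capacity math above; a small sketch (the R$125 average ticket is an assumption implied by dividing R$80,000 by 640 bookings, not a number stated in the plan):

```shell
# Bookings affected by one hour of downtime, per the capacity figures above.
ARENAS=20
COURTS_PER_ARENA=8
BOOKINGS_PER_COURT_HOUR=4
AFFECTED=$((ARENAS * COURTS_PER_ARENA * BOOKINGS_PER_COURT_HOUR))

AVG_TICKET_BRL=125  # assumption: implied by R$80,000 / 640 bookings
LOST_REVENUE=$((AFFECTED * AVG_TICKET_BRL))
echo "${AFFECTED} bookings affected, ~R\$ ${LOST_REVENUE} in lost revenue"
```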

2. Disaster Classification

Severity Matrix

Level | Description | Scope | RTO | Team
Level 1 | Single-component failure | 1 instance/service | 5 min | Auto-healing
Level 2 | Availability Zone failure | 1 AWS AZ | 15 min | On-call DevOps
Level 3 | Regional failure | AWS region | 1 hour | DevOps + Infra Lead
Level 4 | Catastrophic disaster | Multi-region | 4 hours | War Room (CTO, CEO)

Examples by Level

Level 1 - Isolated Failure

  • Container crash-looping
  • Unhealthy EC2 instance
  • Redis worker offline
  • Elevated cache-miss rate

Level 2 - Zone Failure

  • AWS AZ unavailable
  • Subnet/Security Group misconfiguration
  • Database replica offline
  • Network partition

Level 3 - Regional Failure

  • AWS region outage
  • DNS/Route53 failure
  • Global CDN degradation
  • Multi-service cascading failure

Level 4 - Total Disaster

  • Ransomware/Cryptolocker
  • Full credential compromise
  • Cascading data corruption
  • Accidental deletion of critical resources
  • Volumetric DDoS attack

3. High Availability Architecture

Multi-AZ Topology

┌─────────────────────────────────────────────────────────────────┐
│                        AWS Route 53 (DNS)                        │
│                          Cloudflare CDN                          │
└────────────────────────────┬────────────────────────────────────┘


         ┌───────────────────────────────────────────┐
         │   Application Load Balancer (Multi-AZ)    │
         │   Health Check: /health (interval: 10s)   │
         └───────────┬───────────────┬───────────────┘
                     │               │
        ┌────────────┴───────┐  ┌───┴─────────────────┐
        │   AZ us-east-1a    │  │   AZ us-east-1b     │
        │                    │  │                     │
        │  ┌──────────────┐  │  │  ┌──────────────┐  │
        │  │  ECS Tasks   │  │  │  │  ECS Tasks   │  │
        │  │  (3-10 inst.)│  │  │  │  (3-10 inst.)│  │
        │  └──────┬───────┘  │  │  └──────┬───────┘  │
        │         │          │  │         │          │
        │  ┌──────▼───────┐  │  │  ┌──────▼───────┐  │
        │  │Redis Cluster │  │  │  │Redis Cluster │  │
        │  │  (Master)    │◄─┼──┼──┤  (Replica)   │  │
        │  └──────────────┘  │  │  └──────────────┘  │
        └────────┬───────────┘  └───────┬────────────┘
                 │                      │
                 └──────────┬───────────┘

                ┌───────────▼────────────┐
                │   RDS PostgreSQL 14    │
                │   Primary (us-east-1a) │
                │   Read Replica (1b)    │
                │   Automated Backups    │
                └────────────────────────┘

Critical Components

3.1 Application Layer

yaml
# ECS Service Auto-Scaling
MinInstances: 3
MaxInstances: 10
TargetCPU: 70%
TargetMemory: 80%

# Health Check
Endpoint: /api/health
Interval: 10s
Timeout: 5s
UnhealthyThreshold: 2
HealthyThreshold: 2

# Deployment Strategy
Type: Rolling Update
MaxUnavailable: 25%
MaxSurge: 50%

3.2 Database Layer (PostgreSQL RDS)

Primary Instance:
- Type: db.r6g.xlarge (4 vCPU, 32GB RAM)
- Storage: 500GB GP3 (12000 IOPS)
- Multi-AZ: Enabled (synchronous replica)
- Automated Backups: Daily, retention 35 days
- Point-in-Time Recovery: 5-minute granularity

Read Replica:
- Cross-AZ for read scaling
- Promotion time: ~1 minute
- Lag monitoring: < 100ms

Connection Pooling:
- PgBouncer (100 connections)
- Application pool: 20 connections/instance

3.3 Cache Layer (Redis)

ElastiCache Redis Cluster:
- Mode: Cluster Enabled
- Nodes: 6 (3 shards x 2 replicas)
- Instance: cache.r6g.large
- Multi-AZ: Enabled
- Automatic Failover: < 60s

Persistence:
- AOF (Append-Only File): Every second
- RDB Snapshot: Daily
- Backup retention: 7 days

3.4 Object Storage (S3)

Bucket: sport-tech-club-media
- Versioning: Enabled
- Cross-Region Replication: us-west-2
- Lifecycle:
  - Transition to IA after 30 days
  - Transition to Glacier after 90 days
  - Delete after 7 years (compliance)

Bucket: sport-tech-club-backups
- MFA Delete: Enabled
- Encryption: AES-256
- Object Lock: Compliance mode (immutable)

4. Backup Strategy

4.1 Database Backups

Automated Backups (RDS)

bash
# Configuration
BACKUP_WINDOW="03:00-04:00 UTC"  # 00:00-01:00 BRT
RETENTION_PERIOD=35  # days (compliance)
PITR_ENABLED=true    # Point-in-Time Recovery

# Restore time
FULL_RESTORE_TIME="~30 minutes"  # database 500GB
PITR_RESTORE_TIME="~45 minutes"  # + replay logs

Manual Snapshots

bash
# Pre-deploy snapshot (critical)
aws rds create-db-snapshot \
  --db-instance-identifier sport-tech-prod \
  --db-snapshot-identifier prod-pre-deploy-$(date +%Y%m%d-%H%M)

# Retention: indefinite (manual deletion)

WAL Archiving (Write-Ahead Logs)

sql
-- PostgreSQL Configuration
archive_mode = on
archive_command = 'aws s3 cp %p s3://sport-tech-wal-archive/%f'
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB

-- RPO: 1 minute (WAL shipping interval)
-- Storage: S3 (replicated to us-west-2)
-- Retention: 7 days

4.2 Application Backups

Code and Configuration

Repository: GitHub Enterprise
- Branch protection: main, develop
- Signed commits: Required
- Review requirement: 2 approvers

Infrastructure as Code:
- Terraform state: S3 + State locking (DynamoDB)
- Versioning: Enabled
- Encryption: At-rest + in-transit

Kubernetes Manifests:
- ArgoCD: GitOps (auto-sync disabled)
- Helm charts: versioned in Git
- Secrets: Sealed Secrets (encrypted)

Secrets and Credentials

bash
# HashiCorp Vault
BACKUP_FREQUENCY="Daily (automated)"
BACKUP_LOCATION="S3 encrypted bucket"
BACKUP_ENCRYPTION="PGP key (offline master key)"

# Backup command
vault operator raft snapshot save backup-$(date +%Y%m%d).snap
aws s3 cp backup-*.snap s3://sport-tech-vault-backup/ \
  --sse AES256 --storage-class STANDARD_IA

# Retention: 90 days
# Test restore: monthly

4.3 Object Storage (Media)

Arena Videos and Photos:
- Bucket: sport-tech-club-media
- Size: ~2TB (growth: 50GB/month)
- Versioning: Enabled (last 3 versions)
- Cross-Region Replication: us-west-2 (async)
- Lifecycle: IA (30d) → Glacier (90d)

RPO: ~5 minutes (S3 replication time)
RTO: Immediate (multi-region access)

4.4 Logs and Metrics

CloudWatch Logs:
- Retention: 90 days (compliance)
- Export to S3: daily (for analysis)
- Compression: gzip

Prometheus Metrics:
- Retention: 15 days (local)
- Long-term storage: Thanos (S3)
- Retention: 1 year

Jaeger Traces:
- Hot storage: 7 days
- Cold storage: 30 days (S3)

4.5 Backup Schedule

Component | Frequency | Retention | RPO | Location
Database | Continuous (WAL) + daily snapshot | 35 days | 1 min | RDS + S3 (multi-region)
Redis AOF | Continuous | 7 days | 1 sec | ElastiCache + S3
S3 Media | Continuous (replication) | Versioning (3x) | 5 min | us-east-1 + us-west-2
Vault | Daily | 90 days | 24h | S3 encrypted
Code | Every commit | Indefinite | 0 | GitHub
Logs | Continuous | 90 days | 1 min | CloudWatch + S3
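A schedule like this is only trustworthy if backup freshness is verified continuously. A minimal local sketch of such a check (the directory layout and threshold are illustrative; a production check would query the RDS and S3 APIs instead):

```shell
# Alert if the newest file in a backup directory is older than the RPO budget.
check_backup_freshness() {
  dir=$1
  max_age_min=$2
  # find prints files modified within the last max_age_min minutes
  newest=$(find "$dir" -type f -mmin -"$max_age_min" 2>/dev/null | head -n 1)
  if [ -n "$newest" ]; then
    echo "OK: recent backup found: $newest"
    return 0
  fi
  echo "ALERT: no backup newer than ${max_age_min} min in $dir" >&2
  return 1
}
```

In production this logic would typically run from a scheduled job and page via PagerDuty on failure.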

5. Recovery Procedures

5.1 EC2 Instance/Container Failure

Scenario: Container crash-looping, health checks failing

Detection

Alerts:
- ECS Service: UnhealthyHostCount > 0 (1 min)
- ALB: UnhealthyTargetCount > 0 (1 min)
- APM: Error rate > 5% (30s)

Monitoring:
- CloudWatch Container Insights
- Datadog APM
- PagerDuty escalation

Automatic Action (Auto-Healing)

bash
# ECS Service auto-restart
# Triggered by health check failure (2 consecutive fails)

# Force a manual restart (if needed)
aws ecs update-service \
  --cluster sport-tech-prod \
  --service api-service \
  --force-new-deployment

# Recovery time: ~2 minutes
# Impact: zero (the other instances keep serving traffic)

Validation

bash
# Check service health
aws ecs describe-services \
  --cluster sport-tech-prod \
  --services api-service \
  --query 'services[0].runningCount'

# Expected: runningCount == desiredCount

# Check ALB targets
aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN

# Expected: All targets "healthy"

# Application smoke test
curl -f https://api.sporttechclub.com.br/health
# Expected: 200 OK, response time < 500ms

5.2 Database Corruption

Scenario: Data corruption, dropped table, ransomware

Detection

Alerts:
- Data integrity check failure
- Unexpected DELETE/TRUNCATE queries
- Application errors: "relation does not exist"
- Monitoring: Row count drop > 10%

Tools:
- pg_stat_statements (query monitoring)
- CloudWatch RDS logs
- Application error tracking (Sentry)

Recovery Procedure - Point-in-Time Recovery (PITR)

bash
# 1. Identify a timestamp from BEFORE the corruption
RESTORE_TIME="2026-01-09T14:30:00Z"

# 2. Create a new instance from the backup
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sport-tech-prod \
  --target-db-instance-identifier sport-tech-restored \
  --restore-time $RESTORE_TIME \
  --db-subnet-group-name sport-tech-db-subnet \
  --no-publicly-accessible \
  --multi-az

# Estimated time: 30-45 minutes (500GB database)

# 3. Wait for availability
aws rds wait db-instance-available \
  --db-instance-identifier sport-tech-restored

# 4. Validate the restored data
psql -h sport-tech-restored.abc123.us-east-1.rds.amazonaws.com \
     -U admin -d sporttechdb -c "SELECT COUNT(*) FROM bookings;"

# 5. Promote the restored instance (once validated)
# Update DNS/connection string to point to the restored instance

# 6. Investigate root cause
# Check application logs, audit RDS query logs

Validation

sql
-- Check data integrity
SELECT
  schemaname, tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Validate business rules
SELECT COUNT(*) FROM bookings WHERE status = 'confirmed';
SELECT SUM(amount) FROM payments WHERE created_at >= NOW() - INTERVAL '7 days';

-- Check foreign key constraints
SELECT conname, conrelid::regclass, confrelid::regclass
FROM pg_constraint WHERE contype = 'f';

Communication

T+0min: Anomaly detected
T+5min: Corruption confirmed, war room started
T+10min: Status page update: "Investigating a data inconsistency"
T+45min: Database restored, validation in progress
T+60min: Service restored, post-mortem scheduled

5.3 Availability Zone Failure

Scenario: AWS AZ us-east-1a unavailable

Detection

AWS Health Dashboard:
- AZ status: Degraded/Unavailable
- ECS Tasks: Unhealthy in AZ-a
- RDS: Automatic failover initiated

Alerts:
- CloudWatch: AZ-a metrics flatlined
- Datadog: Host availability < 50%
- PagerDuty: High-priority alert

Automatic Action

RDS Multi-AZ Failover:
- Detection: 60-120 seconds
- Automatic failover to standby (us-east-1b)
- DNS update: 30-60 seconds
- Total downtime: ~2 minutes

ECS Service:
- ALB reroutes traffic to the healthy AZ
- Auto Scaling spawns new tasks in AZ-b
- Downtime: ~0 seconds (rolling)

Redis Cluster:
- Automatic failover to the replica
- Promotion time: 30-60 seconds
- Application retries handle the transition
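The "application retries handle the transition" bullet assumes a retry-with-backoff pattern in the clients; a minimal shell sketch of that pattern (attempt count and delays are illustrative):

```shell
# Retry a command with exponential backoff (1s, 2s, 4s, ...), up to 5 attempts.
retry_with_backoff() {
  max_attempts=5
  delay=1
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))            # exponential backoff
    attempt=$((attempt + 1))
  done
  return 1                          # exhausted all attempts
}
```

The same idea applies inside the application itself, where the Redis client library's built-in reconnect/backoff settings would normally be used.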

Manual Action (If Needed)

bash
# 1. Check failover status
aws rds describe-db-instances \
  --db-instance-identifier sport-tech-prod \
  --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]'

# Expected: ["available", "us-east-1b"]

# 2. Force redistribution of ECS tasks
aws ecs update-service \
  --cluster sport-tech-prod \
  --service api-service \
  --force-new-deployment

# 3. Validate the distribution
aws ecs list-tasks \
  --cluster sport-tech-prod \
  --service-name api-service \
  --query 'length(taskArns)' \
  --output text

# Expected: tasks distributed across AZ-b and AZ-c

Validation

bash
# Health-check all components
./scripts/health-check-all.sh

# Expected output:
# ✓ ALB: Healthy (all targets)
# ✓ ECS: 8/8 tasks running (us-east-1b, us-east-1c)
# ✓ RDS: Available (us-east-1b - PRIMARY)
# ✓ Redis: Cluster healthy (3/3 shards)
# ✓ API Response time: 180ms (avg)

5.4 Ransomware Attack

Scenario: Cryptolocker, compromised credentials, malicious deletion

Detection

Indicators:
- Mass S3 object deletion
- Truncated database tables
- Suspicious access logs (unusual IPs, atypical hours)
- Alerts: GuardDuty, CloudTrail anomalies
- Ransom note in S3 buckets

CloudTrail Events:
- s3:DeleteObject (batch)
- rds:DeleteDBInstance
- iam:CreateAccessKey (unauthorized)

Immediate Containment (T+0 to T+15min)

bash
# 1. ISOLATE COMPROMISED SYSTEMS
# Revoke suspect credentials
aws iam update-access-key \
  --access-key-id $SUSPECTED_KEY \
  --status Inactive \
  --user-name $USER

# Disable IAM users
aws iam delete-login-profile --user-name $COMPROMISED_USER
aws iam list-access-keys --user-name $COMPROMISED_USER | \
  jq -r '.AccessKeyMetadata[].AccessKeyId' | \
  xargs -I {} aws iam delete-access-key --user-name $COMPROMISED_USER --access-key-id {}

# 2. BLOCK SUSPICIOUS TRAFFIC
# Update Security Groups
aws ec2 revoke-security-group-ingress \
  --group-id $SG_ID \
  --ip-permissions "IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=${MALICIOUS_IP}/32}]"

# 3. IMMEDIATE SNAPSHOT (forensics)
aws rds create-db-snapshot \
  --db-instance-identifier sport-tech-prod \
  --db-snapshot-identifier incident-ransomware-$(date +%Y%m%d-%H%M)

# 4. ACTIVATE THE WAR ROOM
# Notify: CTO, CEO, DevOps Lead, Security Lead

Recovery (T+15min to T+60min)

bash
# 1. RESTORE THE DATABASE (clean)
# Use a backup from BEFORE the attack (per the CloudTrail timestamp)
ATTACK_TIME="2026-01-09T10:45:00Z"
SAFE_RESTORE_TIME="2026-01-09T10:30:00Z"  # 15 min earlier

aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sport-tech-prod \
  --target-db-instance-identifier sport-tech-clean \
  --restore-time $SAFE_RESTORE_TIME \
  --multi-az

# 2. RESTORE S3 OBJECTS (versioning)
# Remove the delete markers to restore the previous versions
aws s3api list-object-versions \
  --bucket sport-tech-club-media \
  --prefix "arenas/" \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' \
  --output text | \
while read key version; do
  aws s3api delete-object \
    --bucket sport-tech-club-media \
    --key "$key" \
    --version-id "$version"
done

# 3. REDEPLOY THE APPLICATION (clean images)
# Rebuild from a clean Git commit
git checkout main
git reset --hard $LAST_KNOWN_GOOD_COMMIT

docker build -t sport-tech-api:clean .
docker push sport-tech-api:clean

# Deploy
kubectl set image deployment/api-deployment \
  api=sport-tech-api:clean

# 4. ROTATE ALL SECRETS
# Database passwords
aws rds modify-db-instance \
  --db-instance-identifier sport-tech-clean \
  --master-user-password $(openssl rand -base64 32)

# API keys, JWT secrets
vault kv put secret/sport-tech/prod \
  jwt_secret=$(openssl rand -base64 64) \
  api_key=$(openssl rand -base64 32)

Communication

T+0min: Attack detected, war room activated
T+5min: Containment started, systems isolated
T+15min: Status page: "Emergency maintenance"
T+30min: Recovery in progress (do not mention ransomware publicly)
T+60min: Systems restored, validation underway
T+90min: Status page: "Services restored. Investigation continues."
T+24h: Customer email: "Incident resolved, no data leaked"
T+7d: Public post-mortem (optional, case-dependent)

5.5 Human Error (Accidental Delete)

Scenario: A developer runs DROP TABLE in production

Detection

Alerts:
- Application errors: "relation does not exist"
- Monitoring: table row count = 0
- CloudWatch RDS Events: DDL statement executed

Notification:
- The developer reports the mistake immediately
- Slackbot: #incidents channel

Procedure

bash
# 1. STOP NEW WRITES (prevent further data loss)
# Put the application into read-only mode
kubectl set env deployment/api-deployment \
  READ_ONLY_MODE=true

# 2. IDENTIFY THE TIMESTAMP OF THE DELETE
# RDS query logs
aws rds download-db-log-file-portion \
  --db-instance-identifier sport-tech-prod \
  --log-file-name error/postgresql.log.2026-01-09-14 \
  --output text | grep "DROP TABLE"

# Output: 2026-01-09 14:32:15 UTC [12345]: DROP TABLE bookings CASCADE;

# 3. RESTORE THE TABLE (PITR)
RESTORE_TIME="2026-01-09T14:30:00Z"  # 2 min before the DROP

# Create a temporary instance
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier sport-tech-prod \
  --target-db-instance-identifier sport-tech-temp-restore \
  --restore-time $RESTORE_TIME

# 4. EXTRACT THE TABLE DATA
pg_dump -h sport-tech-temp-restore.xxx.rds.amazonaws.com \
  -U admin -d sporttechdb \
  --table=bookings --data-only \
  --file=bookings_restore.sql

# 5. RECREATE THE STRUCTURE (if needed)
psql -h sport-tech-prod.xxx.rds.amazonaws.com \
  -U admin -d sporttechdb < schema/bookings_table.sql

# 6. IMPORT THE DATA
psql -h sport-tech-prod.xxx.rds.amazonaws.com \
  -U admin -d sporttechdb < bookings_restore.sql

# 7. REMOVE READ-ONLY MODE
kubectl set env deployment/api-deployment \
  READ_ONLY_MODE-

# 8. CLEANUP
aws rds delete-db-instance \
  --db-instance-identifier sport-tech-temp-restore \
  --skip-final-snapshot

# Total time: ~45 minutes

5.6 Integration Failure (Ziggy Offline)

Scenario: Ziggy (turnstiles, payments) unreachable for 2+ hours

Detection

Alerts:
- HTTP 503 errors from the Ziggy API
- Payment webhooks not received (15+ min)
- Arena check-ins failing
- Monitoring: Ziggy uptime < 99%

Impact:
- Customers cannot access the courts
- Payments not processed
- Bookings not validated

Degraded Mode (Failover)

typescript
// Activate offline mode
// app/config/feature-flags.ts

export const FeatureFlags = {
  ZIGGY_OFFLINE_MODE: true,  // Manual toggle or auto-detect
  OFFLINE_GRACE_PERIOD: 60 * 60 * 2,  // 2 hours
}

// app/services/access-control.service.ts
async validateAccess(bookingId: string): Promise<AccessResult> {
  // Try Ziggy first
  try {
    return await this.ziggyClient.validateAccess(bookingId);
  } catch (error) {
    if (!FeatureFlags.ZIGGY_OFFLINE_MODE) throw error;

    // Fallback: validate locally
    const booking = await this.bookingRepo.findById(bookingId);

    if (this.isValidForOfflineAccess(booking)) {
      // Log for later synchronization
      await this.offlineAccessLog.create({
        bookingId,
        timestamp: new Date(),
        method: 'offline_validation'
      });

      return { allowed: true, mode: 'offline' };
    }

    return { allowed: false, reason: 'offline_validation_failed' };
  }
}

isValidForOfflineAccess(booking: Booking): boolean {
  // Business rules for offline mode
  return (
    booking.status === 'confirmed' &&
    booking.paymentStatus === 'paid' &&
    booking.startTime <= new Date() &&
    booking.endTime >= new Date()
  );
}

Communication

T+0min: Ziggy outage detected
T+5min: Degraded mode activated automatically
T+10min: Notify arenas: "Turnstiles in offline mode. Automatic release enabled."
T+15min: Status page: "Payment processing temporarily unavailable"
T+30min: Escalate to the Ziggy support team
T+2h: Evaluate alternatives (backup payment gateway)
T+4h: Ziggy restored, automatic synchronization
T+24h: Validate that all check-ins were synchronized
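The T+24h reconciliation can be sketched as a set difference between the locally logged offline check-ins and the IDs Ziggy reports after recovery (file-based sketch; real inputs would come from the offline access log table and the Ziggy API):

```shell
# Find booking IDs that were released offline but never confirmed by Ziggy.
reconcile_checkins() {
  local_log=$1    # one booking ID per line, logged during offline mode
  remote_log=$2   # booking IDs Ziggy reports after synchronization
  sort -u "$local_log" > /tmp/local.sorted
  sort -u "$remote_log" > /tmp/remote.sorted
  # comm -23 prints lines only in the first file: check-ins missing on Ziggy's side
  comm -23 /tmp/local.sorted /tmp/remote.sorted
}
```

Any IDs this prints would need to be re-pushed to Ziggy or handled manually.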

6. Runbooks

6.1 Database Restore (Full)

bash
#!/bin/bash
# runbook-db-restore.sh

set -euo pipefail

# Configuration
SOURCE_DB="sport-tech-prod"
RESTORE_TIME="${1:-}"  # Pass as argument or use latest
TARGET_DB="sport-tech-restored-$(date +%Y%m%d-%H%M)"

# Validation
if [ -z "$RESTORE_TIME" ]; then
  echo "Usage: ./runbook-db-restore.sh '2026-01-09T14:30:00Z'"
  exit 1
fi

# 1. Create the restored instance
echo "[1/6] Creating restored DB instance..."
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier $SOURCE_DB \
  --target-db-instance-identifier $TARGET_DB \
  --restore-time "$RESTORE_TIME" \
  --db-subnet-group-name sport-tech-db-subnet \
  --vpc-security-group-ids sg-0abc123def456 \
  --multi-az \
  --no-publicly-accessible

# 2. Wait for availability
echo "[2/6] Waiting for DB to become available (30-45 min)..."
aws rds wait db-instance-available \
  --db-instance-identifier $TARGET_DB

# 3. Get the endpoint
RESTORED_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier $TARGET_DB \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

echo "[3/6] Restored DB endpoint: $RESTORED_ENDPOINT"

# 4. Validate the data
echo "[4/6] Validating restored data..."
psql -h $RESTORED_ENDPOINT -U admin -d sporttechdb -c "
  SELECT
    (SELECT COUNT(*) FROM bookings) as bookings_count,
    (SELECT COUNT(*) FROM arenas) as arenas_count,
    (SELECT MAX(created_at) FROM bookings) as latest_booking
"

# 5. Run integrity tests
echo "[5/6] Running integrity tests..."
./scripts/db-integrity-tests.sh $RESTORED_ENDPOINT

# 6. Next steps
echo "[6/6] Restore complete!"
echo ""
echo "Next steps:"
echo "1. Validate data thoroughly"
echo "2. Update application config to point to: $RESTORED_ENDPOINT"
echo "3. Test application with restored DB"
echo "4. Promote to production (rename endpoint or update DNS)"
echo ""
echo "Rollback: Keep old instance for 24h before deleting"

6.2 Manual Failover (RDS)

bash
#!/bin/bash
# runbook-db-failover.sh

set -euo pipefail

DB_INSTANCE="sport-tech-prod"

# Pre-checks
echo "[Pre-check] Verifying Multi-AZ configuration..."
MULTI_AZ=$(aws rds describe-db-instances \
  --db-instance-identifier $DB_INSTANCE \
  --query 'DBInstances[0].MultiAZ' \
  --output text)

if [ "$MULTI_AZ" != "True" ]; then
  echo "ERROR: Multi-AZ not enabled. Cannot perform failover."
  exit 1
fi

echo "[Pre-check] Current status..."
aws rds describe-db-instances \
  --db-instance-identifier $DB_INSTANCE \
  --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]' \
  --output table

# Execute failover
echo "[Failover] Initiating reboot with failover..."
START_TIME=$(date +%s)

aws rds reboot-db-instance \
  --db-instance-identifier $DB_INSTANCE \
  --force-failover

# Wait for completion
echo "[Failover] Waiting for completion..."
aws rds wait db-instance-available \
  --db-instance-identifier $DB_INSTANCE

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# Validation
echo "[Validation] New status..."
aws rds describe-db-instances \
  --db-instance-identifier $DB_INSTANCE \
  --query 'DBInstances[0].[DBInstanceStatus,AvailabilityZone]' \
  --output table

echo ""
echo "Failover complete in ${DURATION}s"

6.3 Deploy Rollback

bash
#!/bin/bash
# runbook-deploy-rollback.sh

set -euo pipefail

# Configuration
NAMESPACE="production"
DEPLOYMENT="api-deployment"

# 1. Get the current and previous revisions
echo "[1/5] Getting deployment history..."
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE

# The current revision number lives in the deployment's revision annotation
CURRENT_REVISION=$(kubectl get deployment/$DEPLOYMENT -n $NAMESPACE \
  -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')

PREVIOUS_REVISION=$((CURRENT_REVISION - 1))

echo "Current revision: $CURRENT_REVISION"
echo "Will rollback to: $PREVIOUS_REVISION"

# 2. Snapshot the current state (for later analysis)
echo "[2/5] Capturing current state..."
kubectl get deployment/$DEPLOYMENT -n $NAMESPACE -o yaml > \
  /tmp/deployment-before-rollback-$(date +%Y%m%d-%H%M).yaml

# 3. Execute the rollback
echo "[3/5] Rolling back deployment..."
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE \
  --to-revision=$PREVIOUS_REVISION

# 4. Wait for completion
echo "[4/5] Waiting for rollback to complete..."
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE \
  --timeout=5m

# 5. Validation
echo "[5/5] Validating rolled back deployment..."
kubectl get pods -n $NAMESPACE -l app=api -o wide

# Health check
echo "Waiting 30s for pods to stabilize..."
sleep 30

HEALTHY_PODS=$(kubectl get pods -n $NAMESPACE -l app=api \
  --field-selector=status.phase=Running \
  --output=json | jq -r '.items | length')

TOTAL_PODS=$(kubectl get deployment/$DEPLOYMENT -n $NAMESPACE \
  -o jsonpath='{.spec.replicas}')

echo "Healthy pods: $HEALTHY_PODS / $TOTAL_PODS"

if [ "$HEALTHY_PODS" -eq "$TOTAL_PODS" ]; then
  echo "✓ Rollback successful"
  ./scripts/smoke-tests.sh
else
  echo "✗ Rollback incomplete. Manual intervention required."
  exit 1
fi

6.4 Cache Invalidation

bash
#!/bin/bash
# runbook-cache-invalidation.sh

set -euo pipefail

# Configuration
REDIS_HOST="sport-tech-redis.abc123.cache.amazonaws.com"
REDIS_PORT=6379
CLOUDFRONT_DIST="E1234ABCDEF5G"

# 1. Invalidate Redis
echo "[1/3] Flushing Redis cache..."

PATTERN="${1:-arena:*}"  # Default: arena-related keys

redis-cli -h $REDIS_HOST -p $REDIS_PORT --scan --pattern "$PATTERN" | \
  xargs -L 1000 redis-cli -h $REDIS_HOST -p $REDIS_PORT DEL

echo "✓ Redis keys matching '$PATTERN' deleted"

# 2. Invalidate CloudFront
echo "[2/3] Invalidating CloudFront distribution..."

PATHS="${2:-/*}"  # Default: all paths

INVALIDATION_ID=$(aws cloudfront create-invalidation \
  --distribution-id $CLOUDFRONT_DIST \
  --paths "$PATHS" \
  --query 'Invalidation.Id' \
  --output text)

echo "Invalidation ID: $INVALIDATION_ID"

# Wait for completion (optional)
echo "Waiting for invalidation to complete (5-15 min)..."
aws cloudfront wait invalidation-completed \
  --distribution-id $CLOUDFRONT_DIST \
  --id $INVALIDATION_ID

echo "✓ CloudFront invalidation complete"

# 3. Validation
echo "[3/3] Validating cache invalidation..."

REDIS_KEYS=$(redis-cli -h $REDIS_HOST -p $REDIS_PORT --scan --pattern "$PATTERN" | wc -l)
echo "Redis keys matching '$PATTERN': $REDIS_KEYS (should be 0)"

RESPONSE_TIME=$(curl -o /dev/null -s -w '%{time_total}' https://cdn.sporttechclub.com.br/)
echo "CloudFront response time: ${RESPONSE_TIME}s (higher = cache miss)"

echo "✓ Cache invalidation complete"

6.5 Secrets Rotation

bash
#!/bin/bash
# runbook-secrets-rotation.sh

set -euo pipefail

# Configuration
ENVIRONMENT="production"
VAULT_ADDR="https://vault.sporttechclub.internal"
NAMESPACE="production"

# 1. Database password
echo "[1/4] Rotating database password..."

NEW_DB_PASSWORD=$(openssl rand -base64 32)

# Update RDS
aws rds modify-db-instance \
  --db-instance-identifier sport-tech-prod \
  --master-user-password "$NEW_DB_PASSWORD" \
  --apply-immediately

# Update Vault
vault kv put secret/sport-tech/$ENVIRONMENT/database \
  password="$NEW_DB_PASSWORD"

# Update Kubernetes secret
kubectl create secret generic db-credentials \
  --from-literal=password="$NEW_DB_PASSWORD" \
  --namespace=$NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -

echo "✓ Database password rotated"

# 2. JWT Secret
echo "[2/4] Rotating JWT secret..."

NEW_JWT_SECRET=$(openssl rand -base64 64)

vault kv put secret/sport-tech/$ENVIRONMENT/jwt \
  secret="$NEW_JWT_SECRET"

kubectl create secret generic jwt-secret \
  --from-literal=secret="$NEW_JWT_SECRET" \
  --namespace=$NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -

echo "✓ JWT secret rotated"

# 3. API Keys (external services)
echo "[3/4] Rotating external API keys..."
echo "⚠ Manual step: Generate new Ziggy API key and update vault"
echo "⚠ Manual step: Rotate Stripe API key in dashboard and update vault"

# 4. Restart deployments to apply the new secrets
echo "[4/4] Restarting deployments to apply new secrets..."

kubectl rollout restart deployment/api-deployment -n $NAMESPACE
kubectl rollout restart deployment/worker-deployment -n $NAMESPACE

# Wait for completion
kubectl rollout status deployment/api-deployment -n $NAMESPACE --timeout=5m
kubectl rollout status deployment/worker-deployment -n $NAMESPACE --timeout=5m

echo "✓ All secrets rotated successfully"
echo ""
echo "Next steps:"
echo "1. Update manual API keys (Ziggy, Stripe)"
echo "2. Validate application functionality"
echo "3. Monitor for authentication errors (24h)"

7. DR Testing

7.1 Test Schedule

Test | Frequency | Duration | Participants | Next Test
RDS Failover | Quarterly | 30 min | DevOps | 2026-03-15
Database Restore | Quarterly | 2h | DevOps + QA | 2026-03-20
Deploy Rollback | Monthly | 30 min | DevOps | 2026-02-05
War Room Simulation | Semiannual | 4h | All engineering | 2026-06-10
Degraded Mode | Quarterly | 1h | DevOps + Product | 2026-04-12

7.2 Test Plan - Database Restore

Objectives

  • Validate the RPO (< 15 minutes of data loss)
  • Validate the RTO (< 1 hour of recovery time)
  • Exercise the restore runbook
  • Train the team

Procedure

1. Preparation (T-30min)
   - Schedule a maintenance window (Sunday 03:00-06:00 BRT)
   - Notify stakeholders
   - Prepare the test environment
   - Clone production to staging

2. Disaster Simulation (T+0min)
   - Mark the "incident" timestamp
   - Delete nothing (test environment)

3. Restore Execution (T+5min)
   - Follow the runbook: runbook-db-restore.sh
   - Document each step
   - Time the duration

4. Validation (T+45min)
   - Compare record counts
   - Validate referential integrity
   - Run critical business queries
   - Test the application connection

5. Metrics (T+90min)
   - RTO met? (target: < 60min)
   - RPO met? (target: < 15min of loss)
   - Does the runbook need adjustments?
   - Did the team feel confident?

6. Cleanup (T+120min)
   - Delete the test instance
   - Document lessons learned
   - Update the runbook if needed

Success Criteria

yaml
Success Criteria:
  - Database restored within 60 minutes: PASS/FAIL
  - Data loss < 15 minutes: PASS/FAIL
  - Application connects successfully: PASS/FAIL
  - All foreign keys valid: PASS/FAIL
  - Runbook complete and clear: PASS/FAIL
  - Team confident in procedure: PASS/FAIL

Minimum passing: 5/6 PASS
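The RTO/RPO criteria can be computed mechanically from timestamps recorded during the drill; a sketch with illustrative epoch-second values:

```shell
# Illustrative drill measurements (epoch seconds recorded during the test).
INCIDENT_TS=1000         # moment the simulated failure was injected
LAST_RECOVERABLE_TS=700  # last WAL position recoverable before the failure
RESTORED_TS=3400         # service fully validated and restored

RTO_MIN=$(( (RESTORED_TS - INCIDENT_TS) / 60 ))         # recovery time
RPO_MIN=$(( (INCIDENT_TS - LAST_RECOVERABLE_TS) / 60 )) # data-loss window

[ "$RTO_MIN" -lt 60 ] && echo "RTO: PASS (${RTO_MIN} min)" || echo "RTO: FAIL (${RTO_MIN} min)"
[ "$RPO_MIN" -lt 15 ] && echo "RPO: PASS (${RPO_MIN} min)" || echo "RPO: FAIL (${RPO_MIN} min)"
```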

7.3 War Room Simulation (Full DR Drill)

Scenario: Multi-Component Failure

Scenario: AWS region us-east-1 degraded
- RDS Primary: Unresponsive
- ECS Tasks: 50% failing health checks
- Redis: Cluster degraded
- S3: Elevated error rates

Business Impact:
- 100% of users unable to access platform
- Bookings cannot be created
- Payments failing
- Arenas calling support

Simulation Timeline

T+0min (08:00): Inject failures into the staging environment
  - Simulate an AZ failure (shut down 50% of resources)
  - Introduce latency on the database
  - Rate-limit S3 API calls

T+2min: Alerts start firing
  - PagerDuty notifies the on-call engineer
  - Automatic post in Slack #incidents

T+5min: On-call engineer joins the war room
  - Assess the situation
  - Declare a P0 incident
  - Notify stakeholders

T+10min: Execute recovery runbooks
  - Database failover
  - Force ECS redeployment
  - Activate degraded mode

T+30min: Validate recovery
  - Health checks green
  - Smoke tests passing
  - User flows working

T+45min: External communication
  - Status page updated
  - Emails to affected customers

T+60min: Post-mortem started
  - Document the timeline
  - Identify gaps
  - Assign action items

8. Crisis Communication

8.1 Status Page

URL: https://status.sporttechclub.com.br | Tool: Atlassian Statuspage

States

Operational: All systems functioning normally
Degraded Performance: Slow, but functional
Partial Outage: Some components unavailable
Major Outage: Platform fully unavailable
Under Maintenance: Scheduled maintenance

Monitored Components

yaml
Components:
  - name: "API - Reservas"
    impact: high
    sla: 99.9%

  - name: "API - Pagamentos"
    impact: critical
    sla: 99.9%

  - name: "Acesso às Quadras (Catracas)"
    impact: critical
    sla: 99.5%

  - name: "App Mobile"
    impact: medium
    sla: 99.0%

  - name: "Painel Administrativo"
    impact: low
    sla: 99.0%

  - name: "Integração Ziggy"
    impact: high
    sla: 99.0% (SLA externo)

Message Templates

Investigating
🔍 Investigating: API Slowness

We are investigating reports of slow response times on the
booking API. Users may experience delays when creating or
viewing reservations.

We will provide an update in 15 minutes.

Posted: 2026-01-09 14:35 BRT
Identified
🛠️ Identified: Database Performance Issue

We have identified the root cause as elevated database load
due to a long-running query. Our team is implementing a fix.

ETA for resolution: 30 minutes

Workaround: Mobile app may still be slow. Please retry if
bookings fail.

Posted: 2026-01-09 14:50 BRT
Monitoring
👀 Monitoring: Fix Implemented

The performance issue has been resolved. We are monitoring
the system to ensure stability.

All systems operational. Response times back to normal.

Posted: 2026-01-09 15:25 BRT
Resolved
✅ Resolved: API Performance Restored

The booking API performance issue has been fully resolved.

Timeline:
- 14:30 BRT: Issue detected
- 14:50 BRT: Root cause identified
- 15:20 BRT: Fix deployed
- 15:30 BRT: Confirmed resolution

Impact: ~5% of booking attempts experienced delays (retry successful)
Root cause: Unoptimized database query
Prevention: Query optimization + monitoring alerts

We apologize for the inconvenience.

Posted: 2026-01-09 15:35 BRT
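The four templates follow a fixed lifecycle (Investigating → Identified → Monitoring → Resolved). A sketch of a transition guard that could back the status page updates — the backward edges (falling back to Investigating when a fix doesn't hold) are our assumption:

```typescript
type IncidentStatus = 'investigating' | 'identified' | 'monitoring' | 'resolved';

// Allowed moves between incident states. Forward edges follow the
// templates above; falling back to 'investigating' is assumed for
// cases where the suspected root cause or fix is ruled out.
const transitions: Record<IncidentStatus, IncidentStatus[]> = {
  investigating: ['identified'],
  identified: ['monitoring', 'investigating'],
  monitoring: ['resolved', 'investigating'],
  resolved: [],
};

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return transitions[from].includes(to);
}
```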

8.2 Internal Escalation

┌─────────────────────────────────────────────────────────┐
│                    INCIDENT SEVERITY                     │
├─────────────┬───────────────────────────────────────────┤
│   P0        │ Platform Down / Data Loss / Security      │
│  CRITICAL   │ Notify: CTO, CEO, DevOps Lead (immediate) │
│             │ War Room: Zoom + #war-room Slack          │
├─────────────┼───────────────────────────────────────────┤
│   P1        │ Major Feature Down / Payment Failing      │
│    HIGH     │ Notify: DevOps Lead, Product Lead (15min) │
│             │ Response: On-call engineer                │
├─────────────┼───────────────────────────────────────────┤
│   P2        │ Degraded Performance / Minor Feature Down │
│   MEDIUM    │ Notify: On-call engineer (30min)          │
│             │ Response: During business hours           │
├─────────────┼───────────────────────────────────────────┤
│   P3        │ Cosmetic Issue / Low Impact               │
│    LOW      │ Notify: Backlog (next sprint)             │
└─────────────┴───────────────────────────────────────────┘

On-Call Schedule

Week 1 (Jan 01-07): João Silva (DevOps)
Week 2 (Jan 08-14): Maria Santos (Backend Lead)
Week 3 (Jan 15-21): Pedro Costa (DevOps)
Week 4 (Jan 22-28): Ana Oliveira (Backend)

Backup: Thiago Nicolussi (DevOps Lead) - Always available

Escalation:
1. On-call engineer (PagerDuty)
2. Backup on-call (+15min)
3. DevOps Lead (+30min)
4. CTO (+1h for P0)
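The escalation chain above can be encoded as a lookup from minutes-without-acknowledgement to who gets paged (a sketch; the role labels follow the list above, and the function name is ours):

```typescript
// Who to page, given minutes since the first unacknowledged page.
// Per the escalation chain: backup at +15min, DevOps Lead at +30min,
// and the CTO only for P0 incidents after 1 hour.
function escalationTargets(
  minutes: number,
  severity: 'P0' | 'P1' | 'P2' | 'P3'
): string[] {
  const targets = ['on-call engineer'];
  if (minutes >= 15) targets.push('backup on-call');
  if (minutes >= 30) targets.push('DevOps Lead');
  if (minutes >= 60 && severity === 'P0') targets.push('CTO');
  return targets;
}
```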

9. Degraded Mode

9.1 Essential Features

Priority 0 (must always work):

  • Court access (check-in)
  • View existing bookings
  • Emergency contact with the arena

Priority 1 (may degrade):

  • Create new bookings (fallback: phone call)
  • Process payments (fallback: pay at the arena)
  • Push notifications

Priority 2 (may go offline):

  • Analytics/Reports
  • Profile settings
  • Photo uploads
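These priorities map naturally onto feature flags: in degraded mode, only features at or below the current tolerance level stay enabled. A minimal sketch (feature keys follow the lists above; the gating function is ours, not an existing API):

```typescript
// 0 = must always work, 1 = may degrade, 2 = may go offline
const featurePriority: Record<string, 0 | 1 | 2> = {
  'check-in': 0,
  'view-bookings': 0,
  'emergency-contact': 0,
  'create-booking': 1,
  'payments': 1,
  'push-notifications': 1,
  'analytics': 2,
  'profile-settings': 2,
  'photo-upload': 2,
};

// maxPriority 0 → keep only Priority-0 features; 2 → everything on.
function enabledFeatures(maxPriority: 0 | 1 | 2): string[] {
  return Object.entries(featurePriority)
    .filter(([, priority]) => priority <= maxPriority)
    .map(([name]) => name);
}
```

During a severe incident the platform would switch to `enabledFeatures(0)`, keeping check-in and booking lookup alive while everything else is shed.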

9.2 Offline Fallback for Turnstiles (Catracas)

Architecture

┌─────────────────────────────────────────────────────┐
│                   Online Mode                        │
│  Arena Catraca ──https──> Sport Tech API ──> Ziggy  │
└─────────────────────────────────────────────────────┘

                   Timeout / 503

┌─────────────────────────────────────────────────────┐
│                  Offline Mode                        │
│  Arena Catraca ──local cache──> Approve/Deny        │
│       │                                              │
│       └──sync queue──> Sport Tech API (when back)   │
└─────────────────────────────────────────────────────┘

Implementation - Local Gateway

typescript
// catraca-gateway/src/offline-mode.ts
import { Cron } from '@nestjs/schedule';  // assumed: gateway runs on NestJS
import { addHours } from 'date-fns';

interface CachedBooking {
  bookingId: string;
  userId: string;
  arenaId: string;
  startTime: Date;
  endTime: Date;
  status: 'confirmed' | 'pending';
  lastSynced: Date;
}

interface AccessLog {
  bookingId: string;
  timestamp: Date;
  decision: 'allowed' | 'denied';
  mode: 'online' | 'offline';
}

interface AccessResult {
  allowed: boolean;
  mode: 'online' | 'offline';
  reason?: string;
  warning?: string;
}

// Minimal shape of the platform API client (assumed for illustration)
interface SportTechApi {
  getUpcomingBookings(query: {
    arenaId: string;
    startTime: Date;
    endTime: Date;
  }): Promise<Array<Omit<CachedBooking, 'lastSynced'> & { id: string }>>;
  validateAccess(bookingId: string): Promise<AccessResult>;
}

class OfflineAccessController {
  private cache: Map<string, CachedBooking> = new Map();
  private syncQueue: AccessLog[] = [];
  private offlineMode = false;

  constructor(
    private readonly api: SportTechApi,
    private readonly arenaId: string
  ) {}

  // Sync bookings every 5 minutes
  @Cron('*/5 * * * *')
  async syncBookings() {
    try {
      const upcomingBookings = await this.api.getUpcomingBookings({
        arenaId: this.arenaId,
        startTime: new Date(),
        endTime: addHours(new Date(), 24)  // next 24h
      });

      // Refresh the local cache
      upcomingBookings.forEach(booking => {
        this.cache.set(booking.id, {
          ...booking,
          lastSynced: new Date()
        });
      });

      this.offlineMode = false;
    } catch (error) {
      this.offlineMode = true;
    }
  }

  // Validate access (online or offline)
  async validateAccess(bookingId: string): Promise<AccessResult> {
    // Try online first
    if (!this.offlineMode) {
      try {
        return await this.api.validateAccess(bookingId);
      } catch (error) {
        this.offlineMode = true;
      }
    }

    // Offline fallback
    return this.validateAccessOffline(bookingId);
  }

  private validateAccessOffline(bookingId: string): AccessResult {
    const booking = this.cache.get(bookingId);

    if (!booking) {
      return {
        allowed: false,
        reason: 'Booking not found in offline cache',
        mode: 'offline'
      };
    }

    const now = new Date();
    const gracePeriod = 15 * 60 * 1000;  // 15 minutes of early entry

    const isValidTime = (
      now >= new Date(booking.startTime.getTime() - gracePeriod) &&
      now <= booking.endTime
    );

    const isConfirmed = booking.status === 'confirmed';

    if (isValidTime && isConfirmed) {
      this.syncQueue.push({
        bookingId,
        timestamp: now,
        decision: 'allowed',
        mode: 'offline'
      });

      return {
        allowed: true,
        mode: 'offline',
        warning: 'Validated offline - pending sync'
      };
    }

    return {
      allowed: false,
      reason: 'Invalid time or status (offline mode)',
      mode: 'offline'
    };
  }
}
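When connectivity returns, the offline decisions queued in `syncQueue` still need to be replayed to the API. The controller above leaves that step out; a hypothetical sketch of the flush (the `postLog` callback stands in for an assumed access-log endpoint):

```typescript
interface OfflineLog {
  bookingId: string;
  timestamp: Date;
  decision: string;
  mode: string;
}

// Replay offline access decisions once the API is reachable again.
// Drains from the front; stops (keeping the remainder) on the first
// failure so the next sync cycle can retry.
async function flushSyncQueue(
  queue: OfflineLog[],
  postLog: (log: OfflineLog) => Promise<void>
): Promise<number> {
  let flushed = 0;
  while (queue.length > 0) {
    try {
      await postLog(queue[0]);
      queue.shift();
      flushed++;
    } catch {
      break; // still offline — retry later
    }
  }
  return flushed;
}
```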

10. Costs

10.1 DR Infrastructure

| Component | DR Configuration | Monthly Cost (USD) | Rationale |
|---|---|---|---|
| RDS Multi-AZ | db.r6g.xlarge | $550 | Automatic failover < 2 min |
| RDS Backups | 35 days, 500 GB | $75 | Compliance + recovery |
| S3 Replication | Cross-region (2 TB) | $120 | Media redundancy |
| S3 Versioning | 3 versions (6 TB total) | $90 | Object rollback |
| WAL Archive (S3) | 100 GB/month | $10 | Granular PITR |
| ElastiCache Replica | Multi-AZ (6 nodes) | $400 | Cache HA |
| ECS Auto-scaling | 3-10 instances | $300-$1,000 | Elasticity |
| CloudFront | Global CDN | $150 | Regional failover |
| Monitoring (Datadog) | Infra + APM | $300 | Fast detection |
| TOTAL MONTHLY | | $2,000 - $2,700 | |
| TOTAL ANNUAL | | ~$24,000 - $32,400 | |

10.2 Cost vs Recovery Time Trade-offs

┌──────────────────────────────────────────────────────┐
│                RTO vs Cost (Database)                │
├──────────────────┬──────────────┬────────────────────┤
│  RTO < 1 min     │  $1,200/mo   │  Multi-AZ + Read   │
│  (Hot Standby)   │              │  Replica + Aurora  │
├──────────────────┼──────────────┼────────────────────┤
│  RTO < 5 min     │  $600/mo     │  Multi-AZ only     │
│  (Warm Standby)  │              │  (current)         │
├──────────────────┼──────────────┼────────────────────┤
│  RTO < 1 hour    │  $150/mo     │  Automated         │
│  (Cold Backup)   │              │  snapshots only    │
├──────────────────┼──────────────┼────────────────────┤
│  RTO < 4 hours   │  $50/mo      │  Manual restore    │
│  (Manual)        │              │  from S3 export    │
└──────────────────┴──────────────┴────────────────────┘

Current decision: RTO < 5 min (Multi-AZ) = $600/month
Rationale: balance between cost and the 99.9% SLA
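The 99.9% SLA translates directly into a downtime budget, which is what makes the Warm Standby tier sufficient: even a couple of <5-minute failovers per month fit comfortably inside it. The arithmetic, as a sketch:

```typescript
// Downtime budget (in minutes) implied by an availability SLA
// over a period of `days` days.
function downtimeBudgetMinutes(sla: number, days: number): number {
  return (1 - sla) * days * 24 * 60;
}

const monthlyBudget = downtimeBudgetMinutes(0.999, 30);  // ≈ 43.2 min/month
const yearlyBudget = downtimeBudgetMinutes(0.999, 365);  // ≈ 525.6 min ≈ 8.76 h/year
```

These match the ~43 min/month and ~8.76 h/year figures used for the SLA objective in section 1.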


11. Appendices

11.1 Emergency Contacts

yaml
Internal:
  - name: Thiago Nicolussi
    role: DevOps Lead
    phone: +55 11 99999-0001
    email: thiago.nicolussi@sporttechclub.com.br
    timezone: America/Sao_Paulo

External:
  - name: AWS Support
    level: Enterprise
    phone: +1 (877) 662-0100
    portal: https://console.aws.amazon.com/support/

  - name: Ziggy Support
    level: Premium
    phone: +55 11 3333-4444
    email: suporte@ziggy.com.br
    sla: 2h response time

11.2 DR Tools

yaml
Monitoring & Alerting:
  - Datadog: https://app.datadoghq.com/
  - PagerDuty: https://sporttechclub.pagerduty.com/
  - Statuspage: https://manage.statuspage.io/

Cloud Management:
  - AWS Console: https://console.aws.amazon.com/
  - AWS CLI: installed on bastion hosts
  - Terraform Cloud: https://app.terraform.io/

Communication:
  - Slack: #incidents, #war-room
  - Zoom: War room link (pinned in Slack)
  - Email: incidents@sporttechclub.com.br

Documentation:
  - Runbooks: GitHub (sport-tech-club/docs/runbooks/)
  - Post-mortems: GitHub (docs/post-mortems/)

11.3 Changelog

markdown
## Changelog

### Version 1.0 (2026-01-09)
- Initial DR plan created
- Defined RTO (1h) and RPO (15min)
- Documented runbooks for common scenarios
- Established testing schedule

### Future Improvements
- [ ] Implement automated DR drills (Chaos Engineering)
- [ ] Add multi-region failover (us-west-2)
- [ ] Integrate Vault auto-unseal with AWS KMS
- [ ] Implement blue-green deployment strategy
- [ ] Create disaster recovery mobile app
- [ ] Improve observability with distributed tracing

12. Next Steps

Short Term (Q1 2026)

□ Run the first full DR drill (War Room Simulation)
□ Automate critical runbooks (Terraform, Ansible)
□ Implement automatic cache warming
□ Configure cross-region replication (S3)
□ Train the whole engineering team on the runbooks
□ Validate the Ziggy integration in degraded mode

Medium Term (Q2-Q3 2026)

□ Multi-region deployment (us-west-2 standby)
□ Chaos Engineering: monthly GameDays
□ Advanced observability (Honeycomb, Grafana)
□ Automated failover to the secondary region
□ Service mesh (Istio) for traffic management
□ Granular feature flags (LaunchDarkly)

Long Term (2027)

□ ISO 27001 certification
□ Global load balancing (GeoDNS)
□ Zero-downtime deployment (blue-green + canary)
□ Self-healing infrastructure (Kubernetes operators)
□ AI-powered incident detection (AIOps)

Living document. Last reviewed: 2026-01-09. Next scheduled review: 2026-04-09 (quarterly).

Approvals:

  • [x] Thiago Nicolussi - DevOps Lead
  • [ ] [CTO Name] - CTO
  • [ ] [CEO Name] - CEO

"Hope is not a strategy. Prepare for the worst, expect the best."