Disaster Recovery and Backup

RTO/RPO Objectives

Component	RPO (Recovery Point Objective)	RTO (Recovery Time Objective)	Priority
MongoDB Atlas (operational data)	1 second (continuous backup)	1 hour	Critical
PostgreSQL DWH (analytics)	24 hours (daily dump)	4 hours	High
Redis Cloud (cache)	N/A (rebuildable)	15 minutes (cold start)	Medium
NATS (transporter)	N/A (stateless)	5 minutes (pod restart)	Medium
Keycloak (authentication)	24 hours (realm export)	2 hours	Critical
AWS S3 (media/archive)	No data loss (11 nines of durability)	Immediate	Low
K8s configuration	N/A (GitOps)	30 minutes (re-apply)	High
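
The RPO targets in the table can be turned into an automated freshness check. The following is a minimal sketch, not part of the original runbook: the function name `backup_within_rpo` is hypothetical, and it assumes the caller already knows the Unix timestamp of the newest backup.

```shell
# Hypothetical helper: succeed when the newest backup is younger than the RPO.
# Arguments: <rpo_hours> <backup_epoch_seconds> [now_epoch_seconds]
backup_within_rpo() {
  local rpo_hours="$1" backup_epoch="$2"
  local now_epoch="${3:-$(date +%s)}"
  local age=$(( now_epoch - backup_epoch ))
  # A negative age means the timestamp is in the future: treat it as invalid.
  [ "$age" -ge 0 ] && [ "$age" -le $(( rpo_hours * 3600 )) ]
}

# Usage (illustrative): a PostgreSQL dump taken 20 hours ago, 24-hour RPO.
# now=$(date +%s)
# backup_within_rpo 24 $(( now - 20 * 3600 )) "$now" && echo "RPO met"
```

The optional third argument exists only to make the check reproducible in tests; in normal use the current time is taken from `date`.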

MongoDB Atlas

Continuous Backup (Oplog-based)

MongoDB Atlas provides continuous backup with point-in-time recovery:

# Check backup status by listing the cluster's snapshots
# (the cluster name is a positional argument)
atlas backups snapshots list pos-production

# List the cluster's restore jobs, including those in progress
atlas backups restores list pos-production

Active configuration:

Continuous backup: enabled, 7-day retention
Daily snapshots: 30-day retention
Weekly snapshots: 12-week retention
Monthly snapshots: 12-month retention
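
Because continuous backup is retained for 7 days, a point-in-time restore target has to fall inside that window. A minimal sketch of such a pre-check follows; the function name is illustrative, and the 7-day default is taken from the configuration listed above.

```shell
# Hypothetical pre-check: is the requested restore time inside the
# continuous-backup retention window?
# Arguments: <target_epoch_seconds> [now_epoch_seconds] [retention_days]
pit_target_in_window() {
  local target="$1"
  local now="${2:-$(date +%s)}"
  local retention_days="${3:-7}"
  local oldest=$(( now - retention_days * 86400 ))
  [ "$target" -ge "$oldest" ] && [ "$target" -le "$now" ]
}

# Usage (illustrative): refuse a restore target older than the window.
# pit_target_in_window "$(date -u -d '2 hours ago' +%s)" || echo "out of window"
```

A target outside the window would have to be served from a daily, weekly, or monthly snapshot instead.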

MongoDB Restore Procedure

Scenario 1: Point-in-Time Restore (data corruption)

# 1. Identify the target timestamp (UTC)
# Example: restore to 2 hours ago (the -d option requires GNU date)
RESTORE_TIME=$(date -u -d "2 hours ago" +"%Y-%m-%dT%H:%M:%SZ")

# 2. Start the restore onto a temporary cluster
# (pos-restore-temp must already exist: Atlas restores into an existing cluster)
# The project lookup takes the first project returned; pass the ID explicitly
# if the organization has more than one project.
atlas backups restores start pointInTime \
  --clusterName pos-production \
  --pointInTimeUTCSeconds "$(date -d "$RESTORE_TIME" +%s)" \
  --targetClusterName pos-restore-temp \
  --targetProjectId "$(atlas projects list --output json | jq -r '.results[0].id')"

# 3. Monitor progress
atlas backups restores list pos-production

# 4. Verify the data on the temporary cluster
mongosh "mongodb+srv://pos-restore-temp.xxxxx.mongodb.net" --eval "
  db.getSiblingDB('t_ristorante_roma_01').orders.countDocuments({ status: 'open' })
"

# 5. If the data is correct, copy the required databases
# Option A: mongodump/mongorestore for specific collections
mongodump --uri="mongodb+srv://pos-restore-temp.xxxxx.mongodb.net" \
  --db=t_ristorante_roma_01 --collection=orders \
  --out=/tmp/restore/

mongorestore --uri="mongodb+srv://pos-production.xxxxx.mongodb.net" \
  --db=t_ristorante_roma_01 --collection=orders \
  --drop /tmp/restore/t_ristorante_roma_01/orders.bson

# 6. Delete the temporary cluster
atlas clusters delete pos-restore-temp --force
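
Step 3 above is a manual check. A generic polling helper can make the wait scriptable; the sketch below is an assumption-laden illustration, not an Atlas feature: `wait_until` is a hypothetical name, and the command passed to it is whatever check signals readiness in your environment.

```shell
# Hypothetical polling helper: re-run a check command until it succeeds
# or the timeout expires.
# Arguments: <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (illustrative): wait up to 1 hour, checking every minute, until the
# temporary cluster answers a ping.
# wait_until 3600 60 mongosh "mongodb+srv://pos-restore-temp.xxxxx.mongodb.net" \
#   --quiet --eval "db.runCommand({ ping: 1 })"
```

A successful ping only shows that the cluster accepts connections; the data check in step 4 is still needed before copying anything back to production.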

Scenario 2: Full Restore (cluster disaster)

# 1. Create a new cluster with the same specifications
atlas clusters create pos-production-new \
  --tier M30 \
  --provider AWS \
  --region EU_WEST_1 \
  --diskSizeGB 100

# 2. Restore from the latest snapshot
atlas backups restores start automated \
  --clusterName pos-production \
  --snapshotId <snapshot-id> \
  --targetClusterName pos-production-new \
  --targetProjectId <project-id>

# 3. Update the connection string in the K8s Secret
kubectl edit secret -n pos-enterprise mongodb-credentials
# Update MONGO_URI (values under data are base64-encoded)

# 4. Rolling restart of the pods
kubectl rollout restart deployment -n pos-enterprise -l tier=backend
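
Step 3 uses an interactive editor, which is easy to get wrong under pressure because Secret values are base64-encoded. As a non-interactive alternative, the sketch below builds a JSON merge patch for `kubectl patch`. The helper name `mongo_uri_patch` is hypothetical; the Secret name and the `MONGO_URI` key are the ones used above.

```shell
# Hypothetical helper: build a JSON merge patch that sets MONGO_URI in the
# Secret's base64-encoded data map.
# Argument: <new_connection_string>
mongo_uri_patch() {
  local encoded
  # base64 output contains only JSON-safe characters, so no escaping is needed.
  encoded=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '{"data":{"MONGO_URI":"%s"}}\n' "$encoded"
}

# Usage (not executed here):
# kubectl patch secret mongodb-credentials -n pos-enterprise --type merge \
#   -p "$(mongo_uri_patch 'mongodb+srv://pos-production-new.xxxxx.mongodb.net')"
```

The rolling restart in step 4 is still required, since running pods do not pick up a changed Secret value injected through environment variables.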

PostgreSQL DWH

Backup Strategy

The PostgreSQL DWH is backed up with a daily pg_dump:

# Daily backup (run by the K8s CronJob at 02:00 UTC)
# Compute the file name once, in UTC, so the dump and the upload always agree
BACKUP_FILE="pos_analytics_$(date -u +%Y%m%d).dump"

pg_dump -h "$PGHOST" -U "$PGUSER" -d pos_analytics \
  --format=custom \
  --compress=9 \
  --file="/tmp/$BACKUP_FILE"

# Upload to S3
aws s3 cp "/tmp/$BACKUP_FILE" \
  "s3://impronto-backups/postgresql/$BACKUP_FILE" \
  --storage-class STANDARD_IA

# Clean up local backups
rm -f /tmp/pos_analytics_*.dump

# Retention: 30 days on S3 (managed by an S3 Lifecycle Policy)
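
The lifecycle policy itself is not shown in this page. As an illustration only, a rule that expires objects under the `postgresql/` prefix after 30 days could look like the output of the sketch below. The helper name and the rule ID are made up; the actual policy on `impronto-backups` may differ.

```shell
# Hypothetical helper: print an S3 lifecycle configuration that expires
# objects under a prefix after N days.
# Arguments: <prefix> <days>
lifecycle_policy_json() {
  cat <<EOF
{
  "Rules": [
    {
      "ID": "expire-after-$2-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "$1" },
      "Expiration": { "Days": $2 }
    }
  ]
}
EOF
}

# Usage (not executed here):
# lifecycle_policy_json "postgresql/" 30 > /tmp/lifecycle.json
# aws s3api put-bucket-lifecycle-configuration \
#   --bucket impronto-backups \
#   --lifecycle-configuration file:///tmp/lifecycle.json
```

Note that `put-bucket-lifecycle-configuration` replaces the bucket's whole lifecycle configuration, so any existing rules would need to be included in the same document.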

PostgreSQL Restore Procedure
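
The procedure in this section uses a fixed dump date as an example. When the most recent dump is the one to restore, its file name can be resolved instead of typed by hand. The sketch below is one possible way to do that; `latest_dump_name` is a hypothetical helper, and it relies on the `pos_analytics_YYYYMMDD.dump` naming produced by the backup job.

```shell
# Hypothetical helper: print the file name of the newest daily dump.
# With YYYYMMDD in the name, a plain lexical sort is also chronological.
# Call `aws s3 ls` without --human-readable so the key is always field 4.
latest_dump_name() {
  aws s3 ls "s3://impronto-backups/postgresql/" \
    | awk '{ print $4 }' \
    | grep -E '^pos_analytics_[0-9]{8}\.dump$' \
    | sort \
    | tail -n 1
}

# Usage (not executed here):
# DUMP=$(latest_dump_name)
# aws s3 cp "s3://impronto-backups/postgresql/$DUMP" /tmp/
```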

# 1. List the available backups
aws s3 ls s3://impronto-backups/postgresql/ --human-readable

# 2. Download the chosen backup
aws s3 cp "s3://impronto-backups/postgresql/pos_analytics_20260328.dump" /tmp/

# 3. Create a temporary database for verification
psql -h $PGHOST -U $PGUSER -c "CREATE DATABASE pos_analytics_restore;"

# 4. Restore
pg_restore -h $PGHOST -U $PGUSER -d pos_analytics_restore \
  --clean --if-exists \
  --jobs=4 \
  /tmp/pos_analytics_20260328.dump

# 5. Verify the data
psql -h $PGHOST -U $PGUSER -d pos_analytics_restore -c "
  SELECT COUNT(*) FROM analytics.fact_orders
  WHERE date_key >= '2026-03-01';
"

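# 6. If the data is correct, swap the databases
# Connect to the postgres maintenance database: a database cannot be renamed
# while it is the current one or while other sessions are connected to it
psql -h $PGHOST -U $PGUSER -d postgres -c "
  ALTER DATABASE pos_analytics RENAME TO pos_analytics_old;
  ALTER DATABASE pos_analytics_restore RENAME TO pos_analytics;
"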

# 7. Refresh materialized views
psql -h $PGHOST -U $PGUSER -d pos_analytics -c "
  REFRESH MATERIALIZED VIEW CONCURRENTLY analytics.mv_daily_sales;
  REFRESH MATERIALIZED VIEW CONCURRENTLY analytics.mv_product_performance;
  REFRESH MATERIALIZED VIEW CONCURRENTLY analytics.mv_tenant_kpis;
"

AWS S3 (Media and Archive)

Durability and Protection

S3 is designed for 99.999999999% (11 nines) durability. Additional measures:

Versioning: enabled on the impronto-media bucket
Cross-region replication: from the primary bucket (eu-west-1) to the DR bucket (eu-central-1)
Object Lock: enabled on the impronto-archive bucket (fiscal data, WORM compliance)

# Check versioning
aws s3api get-bucket-versioning --bucket impronto-media

# Check replication
aws s3api get-bucket-replication --bucket impronto-media

# Recover a deleted object (with versioning): list its versions,
# then download the version to recover
aws s3api list-object-versions --bucket impronto-media \
  --prefix "tenants/ristorante_roma_01/logo.png"

aws s3api get-object \
  --bucket impronto-media \
  --key "tenants/ristorante_roma_01/logo.png" \
  --version-id "xxxxx" \
  /tmp/logo_restored.png
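
The commands above download an earlier version to a local file. In a versioned bucket, a deleted object can also be restored in place by removing its latest delete marker, which makes the previous version current again. The sketch below illustrates that approach; `undelete_object` is a hypothetical helper, and it assumes the key contains no single quote (which would break the JMESPath filter).

```shell
# Hypothetical helper: undelete an object in a versioned bucket by removing
# its latest delete marker.
# Arguments: <bucket> <key>
undelete_object() {
  local bucket="$1" key="$2" marker
  marker=$(aws s3api list-object-versions \
    --bucket "$bucket" --prefix "$key" \
    --query "DeleteMarkers[?IsLatest && Key=='${key}'].VersionId | [0]" \
    --output text)
  # The CLI prints "None" when the query matches nothing.
  if [ -z "$marker" ] || [ "$marker" = "None" ]; then
    echo "no delete marker found for $key" >&2
    return 1
  fi
  aws s3api delete-object --bucket "$bucket" --key "$key" --version-id "$marker"
}

# Usage (not executed here):
# undelete_object impronto-media "tenants/ristorante_roma_01/logo.png"
```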

Keycloak

Realm Export

# Export the realm (run monthly or before every upgrade)
# Note: a monthly export cannot meet the 24-hour RPO listed for Keycloak;
# that target requires a daily export
kubectl exec -n pos-enterprise deployment/keycloak -- \
  /opt/keycloak/bin/kc.sh export \
  --dir /tmp/keycloak-export \
  --realm impronto \
  --users realm_file

# Copy the export locally (use the pod that ran the export)
kubectl cp pos-enterprise/keycloak-xxxxx:/tmp/keycloak-export ./keycloak-export/

# Upload to S3
aws s3 sync ./keycloak-export/ \
  "s3://impronto-backups/keycloak/$(date +%Y%m%d)/"
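
An empty or partial export is only discovered at restore time unless it is checked before upload. The sketch below is a minimal sanity check under one assumption: that a directory export with `--users realm_file` writes a single file named `<realm>-realm.json`. Verify that file name against your Keycloak version; `realm_export_ok` is a hypothetical helper.

```shell
# Hypothetical sanity check: the export directory should contain a
# non-empty <realm>-realm.json before it is uploaded.
# Arguments: <export_dir> <realm_name>
realm_export_ok() {
  local file="$1/$2-realm.json"
  if [ ! -s "$file" ]; then
    echo "missing or empty realm export: $file" >&2
    return 1
  fi
  echo "found $file"
}

# Usage (not executed here): upload only if the check passes.
# realm_export_ok ./keycloak-export impronto && \
#   aws s3 sync ./keycloak-export/ "s3://impronto-backups/keycloak/$(date +%Y%m%d)/"
```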

Keycloak Restore

# 1. Download the export
aws s3 sync "s3://impronto-backups/keycloak/20260328/" ./keycloak-restore/

# 2. Copy it into the Keycloak pod
kubectl cp ./keycloak-restore/ pos-enterprise/keycloak-xxxxx:/tmp/keycloak-import/

# 3. Import the realm
kubectl exec -n pos-enterprise deployment/keycloak -- \
  /opt/keycloak/bin/kc.sh import \
  --dir /tmp/keycloak-import \
  --override true

NATS (Stateless)

NATS is a stateless transporter and requires no backup. In case of failure:

# Restart the NATS pods
kubectl rollout restart statefulset -n pos-enterprise nats

# Verify that all Moleculer nodes have reconnected
kubectl logs -n pos-enterprise -l node-type=gateway --tail=20 | grep "NATS"

Messages in transit during the restart are lost. Moleculer's retry policy, when enabled, resends failed requests. For events, the sync service handles reconciliation.
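
To confirm that the restart finishes within the 5-minute RTO listed for NATS, the restart can be followed by a bounded wait on the rollout. A minimal sketch, with a hypothetical wrapper name:

```shell
# Hypothetical wrapper: restart NATS, then wait for the rollout to finish
# or fail after 5 minutes (the RTO listed for NATS).
restart_nats_and_wait() {
  kubectl rollout restart statefulset/nats -n pos-enterprise &&
    kubectl rollout status statefulset/nats -n pos-enterprise --timeout=5m
}

# Usage (not executed here):
# restart_nats_and_wait || echo "NATS did not become ready within the RTO"
```

A non-zero exit status means either the restart was rejected or the pods did not become ready in time.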

Redis Cloud (Cache)

Redis is used as a cache: after a total failure, the data is rebuilt automatically as cache misses repopulate it.

# After a Redis restart, the services repopulate the cache gradually
# Monitor the cache hit rate, which should rise from 0% to >70% in ~15 minutes

# If necessary, pre-warm the cache (warm-up)
kubectl exec -n pos-enterprise deployment/pos-core -- \
  node -e "require('./scripts/cache-warmup.js').run()"
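
The hit rate mentioned above can be derived from the `keyspace_hits` and `keyspace_misses` counters reported by `redis-cli INFO stats`. The sketch below shows one way to compute it; `cache_hit_rate` is a hypothetical helper that reads the INFO output on standard input.

```shell
# Hypothetical helper: compute the cache hit rate (integer percent) from the
# output of `redis-cli INFO stats` read on standard input.
cache_hit_rate() {
  # INFO lines end with CRLF, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print 0 } else { printf "%d\n", hits * 100 / total }
    }'
}

# Usage (not executed here; connection options omitted):
# redis-cli INFO stats | cache_hit_rate
```

Both counters are cumulative since the server started, so right after a restart the value reflects only post-restart traffic, which is what the warm-up monitoring needs.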

Redis Streams

Unlike the cache, Redis Streams hold events that have not yet been processed. Redis Cloud takes periodic snapshots that preserve the streams. After a restore, check XLEN on each stream to confirm that no messages were lost; entries added after the most recent snapshot are not included in it.
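
XLEN is only meaningful when compared with a known earlier value. The sketch below records the length of each stream so that a periodic capture can later be diffed against the post-restore state. Everything here is illustrative: `stream_lengths` and the `REDIS_CLI` override are hypothetical, and this page does not define any stream names.

```shell
# Hypothetical helper: print "<stream> <length>" for each stream name given.
# REDIS_CLI may hold the client command plus connection options.
stream_lengths() {
  local stream
  for stream in "$@"; do
    # Left unquoted on purpose so options inside REDIS_CLI are split into words.
    printf '%s %s\n' "$stream" "$(${REDIS_CLI:-redis-cli} XLEN "$stream")"
  done
}

# Usage (not executed here; "orders-events" is a made-up stream name):
# stream_lengths orders-events > /tmp/xlen_baseline.txt   # captured periodically
# stream_lengths orders-events > /tmp/xlen_restored.txt   # after the restore
# diff /tmp/xlen_baseline.txt /tmp/xlen_restored.txt
```

A shorter stream after the restore indicates entries that were written after the snapshot and need reconciliation.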

Kubernetes Configuration (GitOps)

All K8s configuration is versioned in the Git repository:

# Re-apply the whole configuration
kubectl apply -f k8s/namespaces/
kubectl apply -f k8s/secrets/        # From the vault, not from the repo
kubectl apply -f k8s/configmaps/
kubectl apply -f k8s/deployments/
kubectl apply -f k8s/services/
kubectl apply -f k8s/ingress/
kubectl apply -f k8s/hpa/
kubectl apply -f k8s/monitoring/
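
When the commands above are pasted one by one, a failed step is easy to miss. The sketch below wraps them in a loop that stops at the first error and forwards extra arguments to kubectl, so the same function can do a server-side dry run. `apply_in_order` is a hypothetical name; Secrets are left out on purpose because they come from the vault and have to be provisioned before the deployments are applied.

```shell
# Hypothetical wrapper: apply the manifest directories in dependency order
# and stop at the first failure. Extra arguments are passed to kubectl.
apply_in_order() {
  local dir
  for dir in namespaces configmaps deployments services ingress hpa monitoring; do
    echo "applying k8s/$dir/"
    kubectl apply -f "k8s/$dir/" "$@" || return 1
  done
}

# Usage (not executed here):
# apply_in_order --dry-run=server   # rehearsal, nothing is persisted
# apply_in_order                    # real re-apply
```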

Communication Plan

For an incident that requires DR:

Phase	Action	Owner	Channel
T+0	Incident detection (automatic alert)	Monitoring system	Email + Slack
T+5min	Incident confirmation and impact assessment	On-call engineer	Slack #incidents
T+15min	Initial communication to resellers	Platform manager	Email template
T+30min	Recovery status update	On-call engineer	Slack #incidents
Every 1h	Progress update	On-call engineer	Slack + Email
Recovery	Service restoration notice	Platform manager	Email + Status page
T+48h	Post-mortem	Team	Shared document

DR Verification (Periodic Tests)

Run a quarterly DR test in the staging environment:

Restore MongoDB point-in-time onto a separate cluster
Restore PostgreSQL from the S3 dump
Import the Keycloak realm from an export
Verify that the application works on the restored data
Measure the actual RTO against the target
Document problems and update the procedures
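
Measuring the actual RTO is easier when each recovery step is run through a timer. The sketch below is one possible approach, not an established tool: `measure_rto` is a hypothetical helper that runs a command, reports the elapsed time, and compares it with a target in minutes.

```shell
# Hypothetical timer: run a recovery command and compare the elapsed time
# with an RTO target expressed in minutes.
# Arguments: <rto_minutes> <command> [args...]
# Exit status: 0 = within target, 1 = target exceeded, 2 = command failed.
measure_rto() {
  local target_minutes="$1"
  shift
  local start end elapsed
  start=$(date +%s)
  "$@" || { echo "recovery command failed" >&2; return 2; }
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed=${elapsed}s target=$(( target_minutes * 60 ))s"
  [ "$elapsed" -le $(( target_minutes * 60 )) ]
}

# Usage (not executed here; restore_postgres.sh is a made-up script name):
# measure_rto 240 ./restore_postgres.sh   # 4-hour RTO for the PostgreSQL DWH
```

The printed line can be pasted into the test report for the quarter.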

Cross-References

Database Maintenance - Routine backup procedures
Infrastructure Troubleshooting - Pre-DR diagnosis
Alerting System - Alerts that can trigger DR
Scaling Guide - Post-DR cluster rebuild
