Alerting System: Configuration and Management

Alerting Architecture

The alerting system operates on two complementary levels:

Prometheus + Alertmanager: metric-based alerts (infrastructure and application)
infrastructure-monitor.service.js: application alerts from the Moleculer backend, forwarded to Alertmanager as webhooks

Prometheus ──(evaluates rules)──> Alertmanager ──> Email / Slack / Webhook
                                      ▲
infrastructure-monitor.service.js ────┘ (webhook bridge)

Alertmanager Configuration

Configuration File

Alertmanager is configured through a Kubernetes Secret:

# k8s/monitoring/alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.sellogic.cloud:587'
      smtp_from: 'alerts@platform.sellogic.cloud'
      smtp_auth_username: 'alerts@platform.sellogic.cloud'
      smtp_auth_password: '<from-vault>'
      smtp_require_tls: true

    route:
      receiver: 'default-receiver'
      group_by: ['alertname', 'namespace', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical: sent immediately (no group_wait delay)
        - match:
            severity: critical
          receiver: 'critical-email'
          group_wait: 0s
          group_interval: 1m
          repeat_interval: 1h
          continue: true

        # Warning: grouped by service, 15m interval
        - match:
            severity: warning
          receiver: 'warning-email'
          group_wait: 2m
          group_interval: 15m
          repeat_interval: 6h

        # Info: visible in Alertmanager only, no notification
        - match:
            severity: info
          receiver: 'null-receiver'

    receivers:
      - name: 'default-receiver'
        email_configs:
          - to: 'ops-team@sellogic.cloud'
            send_resolved: true

      - name: 'critical-email'
        email_configs:
          - to: 'ops-critical@sellogic.cloud'
            send_resolved: true
            headers:
              Subject: '[CRITICAL] {{ .GroupLabels.alertname }} - Impronto Enterprise'

      - name: 'warning-email'
        email_configs:
          - to: 'ops-team@sellogic.cloud'
            send_resolved: true
            headers:
              Subject: '[WARNING] {{ .GroupLabels.alertname }} - Impronto Enterprise'

      - name: 'null-receiver'

    inhibit_rules:
      # If the node is down, suppress all pod alerts on that node
      - source_match:
          alertname: NodeDown
        target_match_re:
          alertname: '.+'
        equal: ['node']

# Apply the configuration
kubectl apply -f k8s/monitoring/alertmanager-config.yaml

# Check that Alertmanager reloaded the configuration
kubectl logs -n monitoring alertmanager-0 --tail=10
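
To make the routing tree above easier to reason about, the following sketch models how the severity label selects a receiver. It is a simplified illustration, not Alertmanager's implementation: rootRoute, matches and resolveReceivers are hypothetical names, and only exact-match `match` blocks and the `continue` flag are modelled.

```javascript
// Simplified model of the routing tree from alertmanager.yaml (illustration only).
const rootRoute = {
  receiver: "default-receiver",
  routes: [
    { match: { severity: "critical" }, receiver: "critical-email", continue: true },
    { match: { severity: "warning" }, receiver: "warning-email" },
    { match: { severity: "info" }, receiver: "null-receiver" },
  ],
};

// A route matches when every label in its `match` block equals the alert's label.
function matches(route, labels) {
  return Object.entries(route.match || {}).every(([key, value]) => labels[key] === value);
}

// Child routes are checked in order; the first match wins unless `continue` is set.
// If no child matches, the alert falls back to the root receiver.
function resolveReceivers(root, labels) {
  const receivers = [];
  for (const route of root.routes) {
    if (!matches(route, labels)) continue;
    receivers.push(route.receiver);
    if (!route.continue) break;
  }
  return receivers.length > 0 ? receivers : [root.receiver];
}

console.log(resolveReceivers(rootRoute, { severity: "warning" }));
console.log(resolveReceivers(rootRoute, { severity: "debug" }));
```

In this model an alert without a known severity ends up at default-receiver, which matches the role of the top-level `receiver` field in the configuration.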

PrometheusRule: Format and Custom Rules

Structure of a PrometheusRule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: impronto-platform-rules
  namespace: monitoring
  labels:
    release: prometheus  # Required for discovery
spec:
  groups:
    - name: impronto.platform
      interval: 30s
      rules:
        - alert: NomeAlert
          expr: <espressione_promql>
          for: <durata_minima>
          labels:
            severity: critical|warning|info
            team: platform
          annotations:
            summary: "Short description"
            description: "Details with {{ $labels.service }} and {{ $value }}"
            runbook_url: "https://docs.platform.sellogic.cloud/operations/runbooks/..."
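
The `for` field requires the expression to stay true continuously for the given duration before the alert moves from pending to firing. The sketch below illustrates that state transition under simplifying assumptions: alertStates is a hypothetical helper, and each sample stands for one rule evaluation at time t (in seconds).

```javascript
// Illustrative model of the `for` clause: inactive -> pending -> firing.
// samples: [{ t: <seconds>, exprTrue: <boolean> }] in evaluation order.
function alertStates(samples, forSeconds) {
  let activeSince = null; // time of the first evaluation in the current "true" streak
  return samples.map(({ t, exprTrue }) => {
    if (!exprTrue) {
      activeSince = null; // any false evaluation resets the streak
      return "inactive";
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? "firing" : "pending";
  });
}

// Evaluations every 30s with `for: 1m`; the expression drops out once at t=90.
const states = alertStates(
  [
    { t: 0, exprTrue: true },
    { t: 30, exprTrue: true },
    { t: 60, exprTrue: true },
    { t: 90, exprTrue: false },
    { t: 120, exprTrue: true },
  ],
  60
);
console.log(states);
```

A single false evaluation sends the alert back to the start, so a short `for` value reacts faster while a longer one filters out brief spikes.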

Predefined Platform Rules

# k8s/monitoring/rules/platform-rules.yaml
spec:
  groups:
    - name: impronto.services
      rules:
        # Frequent pod restarts
        - alert: PodRestartLoop
          expr: increase(kube_pod_container_status_restarts_total{namespace="pos-enterprise"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} in restart loop"
            description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour."

        # High error rate on a service
        - alert: HighServiceErrorRate
          expr: |
            (sum(rate(moleculer_action_errors_total{namespace="pos-enterprise"}[5m])) by (service)
            / sum(rate(moleculer_action_requests_total{namespace="pos-enterprise"}[5m])) by (service))
            > 0.05
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "Error rate >5% on {{ $labels.service }}"
            description: "Service {{ $labels.service }} has an error rate of {{ $value | humanizePercentage }}."

        # High P95 latency
        - alert: HighActionLatency
          expr: |
            histogram_quantile(0.95,
              sum by (le, service, action) (
                rate(moleculer_action_duration_seconds_bucket{namespace="pos-enterprise"}[5m])
              )
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "P95 latency >2s for {{ $labels.service }}.{{ $labels.action }}"

    - name: impronto.infrastructure
      rules:
        # MongoDB connection pool nearly exhausted
        - alert: MongoPoolExhaustion
          expr: pos_mongodb_pool_available / pos_mongodb_pool_size < 0.1
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "MongoDB pool nearly exhausted on {{ $labels.cluster }}"
            description: "Only {{ $value | humanizePercentage }} of connections are available."

        # High Redis memory usage
        - alert: RedisHighMemory
          expr: pos_redis_memory_used_bytes / pos_redis_memory_max_bytes > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis memory >85% on {{ $labels.instance }}"

        # Frequent NATS reconnections
        - alert: NatsReconnectionStorm
          expr: increase(pos_nats_reconnections_total[10m]) > 5
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "NATS reconnection storm on {{ $labels.node }}"

        # Low cache hit rate
        - alert: LowCacheHitRate
          expr: |
            sum(rate(pos_cache_hits_total[5m])) by (service)
            / (sum(rate(pos_cache_hits_total[5m])) by (service) + sum(rate(pos_cache_misses_total[5m])) by (service))
            < 0.5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cache hit rate <50% for {{ $labels.service }}"

# Apply the rules
kubectl apply -f k8s/monitoring/rules/

# Check that Prometheus loaded the rules
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then open: http://localhost:9090/rules

infrastructure-monitor.service.js Bridge

The infrastructure-monitor.service.js service runs application-level checks that are not easy to express in PromQL and sends the resulting alerts to Alertmanager via webhook:

// Example: sending an alert to Alertmanager
async sendAlert(alertName, severity, labels, annotations) {
  // Delegate delivery to the pushAlert action of the same service
  await this.broker.call("infrastructure-monitor.pushAlert", {
    alerts: [{
      labels: {
        alertname: alertName,
        severity,
        source: "infrastructure-monitor",
        // Caller-supplied labels are merged last and can override the keys above
        ...labels
      },
      annotations: {
        summary: annotations.summary,
        description: annotations.description
      },
      // Alertmanager expects an RFC 3339 timestamp
      startsAt: new Date().toISOString()
    }]
  });
}

The Alertmanager endpoint for webhooks is:

http://prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts
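
As a rough illustration of what delivering alerts to this endpoint involves, the sketch below builds and sends the HTTP request. It is not the platform's actual pushAlert implementation: buildAlertRequest and postAlerts are hypothetical names, and the code assumes Node.js 18 or later for the global fetch function. The v2 API accepts a JSON array of alert objects in the request body.

```javascript
// Sketch only: one way a pushAlert-style action might deliver alerts.
// buildAlertRequest and postAlerts are hypothetical names, not the platform's API.
const ALERTMANAGER_URL =
  "http://prometheus-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts";

// Build the HTTP request; the body is a JSON array of alert objects.
function buildAlertRequest(alerts, url = ALERTMANAGER_URL) {
  return {
    url,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(alerts),
    },
  };
}

// Send the request; fetchImpl is injectable so the function can be exercised offline.
async function postAlerts(alerts, fetchImpl = fetch) {
  const { url, options } = buildAlertRequest(alerts);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Alertmanager returned HTTP ${response.status}`);
  }
}
```

Separating request construction from sending keeps the payload format easy to unit test without a running Alertmanager.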

Deduplication

Alertmanager automatically deduplicates alerts based on:

group_by: alerts with the same grouping labels are merged into one notification
group_interval: new alerts that join an existing group are sent after this interval
repeat_interval: an unresolved alert is re-sent after this period

For alerts from infrastructure-monitor.service.js, it is essential to include stable labels (service, alertname) to avoid duplicates.
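
Alertmanager identifies an alert by its label set, so any label whose value changes between sends produces a new alert instead of updating the existing one. The sketch below illustrates the effect with a simple identity key; labelKey is a hypothetical helper and only approximates the idea, it is not the fingerprint algorithm Alertmanager uses.

```javascript
// Illustrative identity key: label pairs sorted by name and joined into one string.
function labelKey(labels) {
  return Object.keys(labels)
    .sort()
    .map((name) => `${name}=${JSON.stringify(labels[name])}`)
    .join(",");
}

// Stable labels: two sends of the same problem share one identity.
const first = labelKey({ alertname: "QueueBacklog", service: "commerce" });
const second = labelKey({ service: "commerce", alertname: "QueueBacklog" });
console.log(first === second); // true

// Unstable label: a per-send value (here a hypothetical requestId) breaks deduplication.
const noisyA = labelKey({ alertname: "QueueBacklog", service: "commerce", requestId: "a1" });
const noisyB = labelKey({ alertname: "QueueBacklog", service: "commerce", requestId: "b2" });
console.log(noisyA === noisyB); // false
```

Values that vary per occurrence, such as timestamps or request IDs, fit better in annotations, which do not take part in alert identity.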

Silencing During Maintenance

Via CLI

# Create a 2-hour silence for all alerts in the pos-enterprise namespace
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="ops-team" \
  --comment="Planned DB maintenance" \
  --duration=2h \
  namespace="pos-enterprise"

# Silence a specific alert
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="ops-team" \
  --comment="Tuning in progress" \
  --duration=1h \
  alertname="HighActionLatency" service="commerce"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Remove a silence (by ID)
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

Via Grafana UI

Navigate to Alerting → Silences
Click New Silence
Configure the matchers (label = value)
Set the duration and comment
Click Create

Maintenance Procedure

Before any planned maintenance, ALWAYS create a silence and verify that it is active before proceeding. Document the silence in the ops channel.
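
For teams that script this procedure, the sketch below shows how a silence payload for Alertmanager's v2 API (POST /api/v2/silences) could be built and how a returned silence could be checked for the active state. buildSilence and isSilenceActive are hypothetical helpers, and the author and comment values are only examples.

```javascript
// Build a silence payload for POST /api/v2/silences (sketch, hypothetical helper).
// matchers: plain object of label -> exact value.
function buildSilence(matchers, durationMs, createdBy, comment, now = new Date()) {
  return {
    matchers: Object.entries(matchers).map(([name, value]) => ({
      name,
      value,
      isRegex: false, // exact match, like label="value" in amtool
    })),
    startsAt: now.toISOString(),
    endsAt: new Date(now.getTime() + durationMs).toISOString(),
    createdBy,
    comment,
  };
}

// A silence returned by GET /api/v2/silences carries its state in status.state.
function isSilenceActive(silence) {
  return Boolean(silence.status) && silence.status.state === "active";
}

const twoHours = 2 * 60 * 60 * 1000;
const payload = buildSilence(
  { namespace: "pos-enterprise" },
  twoHours,
  "ops-team",
  "Planned DB maintenance",
  new Date("2024-01-01T00:00:00.000Z")
);
console.log(payload.endsAt); // 2024-01-01T02:00:00.000Z
```

Checking the state before starting work covers the case where the silence was created with a future start time and is still pending.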

Checking Alert Status

# Currently firing alerts
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093
# Open: http://localhost:9093/#/alerts

# Alerts from Prometheus (pending + firing)
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/alerts

# Quick query: firing alerts (URL quoted so the shell does not interpret the "?")
curl -s 'http://localhost:9093/api/v2/alerts?active=true' | jq '.[].labels.alertname'
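
When the list of firing alerts is long, a per-alert count can be easier to read than raw names. The sketch below summarizes a response from /api/v2/alerts; countActiveByName is a hypothetical helper and the sample data is invented for illustration.

```javascript
// Count active alerts per alertname from an /api/v2/alerts response (sketch).
function countActiveByName(alerts) {
  const counts = {};
  for (const alert of alerts) {
    // Skip silenced or inhibited alerts, which report the state "suppressed"
    if (!alert.status || alert.status.state !== "active") continue;
    const name = alert.labels.alertname;
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// Invented sample data in the shape returned by the API.
const sample = [
  { labels: { alertname: "HighActionLatency", service: "commerce" }, status: { state: "active" } },
  { labels: { alertname: "HighActionLatency", service: "orders" }, status: { state: "active" } },
  { labels: { alertname: "RedisHighMemory" }, status: { state: "suppressed" } },
];
console.log(countActiveByName(sample)); // { HighActionLatency: 2 }
```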

Cross-References

Monitoring Stack: Overview - Architecture of the stack
Grafana Dashboards - Viewing the metrics associated with alerts
Infrastructure Troubleshooting - Resolution procedures for when alerts fire
Health Checks and Probes - Alerts related to probe failures
