Observability And Alerts

The current implementation emits OpenTelemetry signals to the OpenTelemetry Collector. The reference stack routes metrics to Prometheus, logs to Loki, and traces to Tempo; Grafana provides dashboards and alerts.

The backend does not expose a Prometheus /metrics endpoint by default. Instead, Prometheus scrapes the Prometheus exporter of the OpenTelemetry Collector.

Local URLs

| Component | URL |
| --- | --- |
| Grafana | http://localhost:3000 |
| Prometheus | http://localhost:9090 |
| Loki | http://localhost:3100 |
| Tempo | http://localhost:3200 |
| OTLP gRPC | localhost:4317 |
| OTLP HTTP | localhost:4318 |
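
As a quick sanity check of the local stack, the sketch below tries a plain TCP connection to each host and port listed in the Local URLs table. It only shows whether something is listening, not whether the component is healthy. The helper names `port_is_open` and `check_local_stack` are illustrative and not part of the backend.

```python
import socket

# Host/port pairs copied from the Local URLs table.
LOCAL_ENDPOINTS = {
    "Grafana": ("localhost", 3000),
    "Prometheus": ("localhost", 9090),
    "Loki": ("localhost", 3100),
    "Tempo": ("localhost", 3200),
    "OTLP gRPC": ("localhost", 4317),
    "OTLP HTTP": ("localhost", 4318),
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_local_stack(endpoints=LOCAL_ENDPOINTS) -> dict:
    """Map each component name to whether its port accepts connections."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in endpoints.items()
    }
```

Calling `check_local_stack()` returns a dictionary such as `{"Grafana": True, ...}`; any `False` entry points at a component that is not running or not published on the expected port.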

Signal Fields

HTTP logs include:

| Field | Meaning |
| --- | --- |
| method, path, status, duration_ms | Request summary. |
| request_id | Generated or client-supplied request ID. |
| trace_id | Trace link when OTel span exists. |
| tenant_id, user_id | Present when a principal is authenticated. |

Use request_id for support tickets and trace_id for cross-service triage.
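
To make the field rules concrete, here is a minimal sketch of how such a log record could be assembled. The JSON encoding, the UUID-based fallback for request_id, and the function names are assumptions for illustration; the backend's actual log format may differ.

```python
import json
import uuid
from typing import Optional


def build_http_log(
    method: str,
    path: str,
    status: int,
    duration_ms: float,
    request_id: Optional[str] = None,
    trace_id: Optional[str] = None,
    tenant_id: Optional[str] = None,
    user_id: Optional[str] = None,
) -> dict:
    """Assemble one HTTP log record using the fields listed above."""
    record = {
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        # Reuse the client-supplied ID, otherwise generate one (assumed format).
        "request_id": request_id or uuid.uuid4().hex,
    }
    # trace_id is only present when a span exists for the request.
    if trace_id is not None:
        record["trace_id"] = trace_id
    # Principal fields are only present for authenticated requests.
    if tenant_id is not None:
        record["tenant_id"] = tenant_id
    if user_id is not None:
        record["user_id"] = user_id
    return record


def to_json_line(record: dict) -> str:
    """Serialize a record as a single JSON log line."""
    return json.dumps(record, sort_keys=True)
```

An unauthenticated request without a span therefore produces only the request summary and request_id, which is why request_id is the safer key to quote in support tickets.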

Alerts

| Alert | First checks |
| --- | --- |
| WipeAPITelemetryMissing | API container health, OTel endpoint, collector health, process heartbeat. |
| WipeHTTP5xxRate | Failing route/status in Grafana, Loki logs by route, Tempo trace, dependency readiness. |
| WipeQueueBacklogHigh | Queue depth by queue/status, matching worker health, in-flight jobs, dependency saturation. |
| WipeQueueJobFailures | Worker logs, queue job outcomes, signer/storage/SMTP/chain dependency health. |
| WipeQueueDeadLetters / WipeQueueDLQNotEmpty | Newest DLQ rows, root cause, replay eligibility. |
| WipeProofProcessingStalled | Pending proof status, proof worker logs, signer/storage/TSA/chain signals. |
| WipeProofsFailed | Failed proof rows, rejection reason, dependency health. |
| WipeProofsAwaitingLicense | License grants, allocations, quotas, revocation state. |
| WipeSignerFailures | Signer operation labels: decrypt, hmac, hmac_verify, pades_sign, receipt_sign. |
| WipeAnchorFailures | Anchor worker logs, chain ID, provider health, authorized sender funding/permission. |
| WipeDBConnectionsSaturated | DB pool gauges, active queries, slow routes/workers, PostgreSQL health. |

Common Triage Flow

  1. Confirm /readyz and dependency container/pod health.
  2. Open the Grafana Wipe Backend Overview dashboard.
  3. Filter Loki by service_name and request_id or trace_id.
  4. Follow the trace into Tempo when available.
  5. Inspect relevant queue and DLQ rows.
  6. Fix the underlying dependency or configuration issue before replaying jobs.
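
Step 3 can be sketched as a small helper that builds a LogQL query for Grafana Explore. It assumes service_name is available as a Loki stream label and matches the ID with a plain line filter, so it does not depend on the log encoding. The helper name and the example service name `wipe-api` are hypothetical.

```python
from typing import Optional


def _quote(value: str) -> str:
    """Quote a value as a LogQL string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(
    service_name: str,
    request_id: Optional[str] = None,
    trace_id: Optional[str] = None,
) -> str:
    """Build a LogQL query filtering one service's logs by request or trace ID.

    If both IDs are given, a line must contain both (line filters are chained).
    """
    if not (request_id or trace_id):
        raise ValueError("provide request_id, trace_id, or both")
    # Stream selector: assumes service_name is a stream label in Loki.
    query = "{service_name=" + _quote(service_name) + "}"
    for needle in (request_id, trace_id):
        if needle:
            # Line filter: keep only log lines containing the ID.
            query += " |= " + _quote(needle)
    return query
```

For example, `build_logql("wipe-api", request_id="req-123")` yields `{service_name="wipe-api"} |= "req-123"`, which can be pasted into the Loki data source in Grafana.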

Production Settings

| Variable | Expectation |
| --- | --- |
| OBSERVABILITY_ENABLED=true | Enable telemetry on API and workers. |
| OBSERVABILITY_OTLP_ENDPOINT | Collector or managed OTel endpoint reachable from all services. |
| OBSERVABILITY_ENVIRONMENT | Stable environment label such as prod, staging, or site name. |
| OBSERVABILITY_OTLP_HEADERS | Required only when the collector/backend needs auth headers. |
| OBSERVABILITY_METRICS_INTERVAL | Balance signal freshness and overhead. |
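
A pre-deploy check for these settings might look like the sketch below. It only verifies the expectations stated above: telemetry is enabled, and the endpoint and environment label are set. It does not parse the headers or interval values, since their formats are not specified here. The function name is hypothetical, and the backend may accept other spellings of `true` than this check does.

```python
from typing import List, Mapping

# Variables that must be non-empty in production, per the settings above.
REQUIRED_VARS = ("OBSERVABILITY_OTLP_ENDPOINT", "OBSERVABILITY_ENVIRONMENT")


def check_observability_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the observability settings."""
    problems = []
    # Assumption: only the literal value "true" (any case) counts as enabled.
    if env.get("OBSERVABILITY_ENABLED", "").strip().lower() != "true":
        problems.append("OBSERVABILITY_ENABLED should be 'true'")
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(name + " is not set")
    return problems
```

Passing `os.environ` checks the current process; an empty list means the required settings are present. The same check should pass for the API and for every worker, since all of them need to reach the endpoint.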

Keep logs free of secrets, raw proof payloads, canonical certificate JSON, API key secrets, JWTs, and private tenant data.
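
One way to enforce part of this rule is to redact log fields before they are emitted. The sketch below masks values under sensitive-looking keys and replaces anything shaped like a JWT inside string values. The key list and function name are hypothetical, and pattern matching of this kind is a safety net rather than a guarantee; it will not catch raw proof payloads or tenant data, which should simply never be logged.

```python
import re

# Hypothetical key names; adjust to the fields the services actually log.
SENSITIVE_KEYS = {"authorization", "api_key", "password", "secret", "token"}

# Three base64url segments separated by dots, the first starting with "eyJ".
JWT_PATTERN = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def redact(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            # Mask the whole value when the key itself is sensitive.
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Mask JWT-shaped substrings inside free-text values.
            clean[key] = JWT_PATTERN.sub("[REDACTED_JWT]", value)
        else:
            clean[key] = value
    return clean
```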