Observability And Alerts
Use OpenTelemetry, Grafana, Prometheus, Loki, Tempo, and alert runbooks.
The current implementation emits OpenTelemetry signals to the collector. The reference stack routes metrics to Prometheus, logs to Loki, and traces to Tempo; Grafana provides the dashboards and alerts on top of them.
The backend does not expose a Prometheus /metrics endpoint by default.
Prometheus scrapes the OpenTelemetry Collector Prometheus exporter.
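Because metrics reach Prometheus only through the collector's exporter, a failing scrape target looks like missing backend metrics. The following is a minimal sketch, not part of the reference stack, that lists unhealthy scrape targets using the Prometheus HTTP API endpoint `/api/v1/targets`. The helper names are hypothetical, and the base URL assumes the local Prometheus address used by the reference stack.

```python
import json
import urllib.request

# Assumption: local Prometheus address from the reference stack.
PROMETHEUS_URL = "http://localhost:9090"


def unhealthy_targets(targets_payload: dict) -> list[dict]:
    """Return active scrape targets whose health is not "up".

    `targets_payload` is the decoded JSON body of GET /api/v1/targets.
    """
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", ""),
            "scrape_url": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "last_error": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the scrape target list from the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `unhealthy_targets(fetch_targets())` against a running stack should return an empty list when the collector's exporter is being scraped successfully.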
Local URLs
| Component | URL |
|---|---|
| Grafana | http://localhost:3000 |
| Prometheus | http://localhost:9090 |
| Loki | http://localhost:3100 |
| Tempo | http://localhost:3200 |
| OTLP gRPC | localhost:4317 |
| OTLP HTTP | localhost:4318 |
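As a quick sanity check of the local stack, the sketch below probes each component from the table. The HTTP paths are the readiness or health endpoints those projects document (`/api/health` for Grafana, `/-/ready` for Prometheus, `/ready` for Loki and Tempo); the OTLP ports are checked with a plain TCP connect. The function names are hypothetical and the probes are injectable so the logic can be exercised without a running stack.

```python
import http.client
import socket
import urllib.error
import urllib.request

# Hosts and ports come from the Local URLs table.
HTTP_CHECKS = {
    "Grafana": "http://localhost:3000/api/health",
    "Prometheus": "http://localhost:9090/-/ready",
    "Loki": "http://localhost:3100/ready",
    "Tempo": "http://localhost:3200/ready",
}
TCP_CHECKS = {
    "OTLP gRPC": ("localhost", 4317),
    "OTLP HTTP": ("localhost", 4318),
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(http_probe=http_ok, tcp_probe=tcp_open) -> dict[str, bool]:
    """Probe every component and return a name -> reachable mapping."""
    results = {name: http_probe(url) for name, url in HTTP_CHECKS.items()}
    for name, (host, port) in TCP_CHECKS.items():
        results[name] = tcp_probe(host, port)
    return results
```

A TCP connect only shows that something is listening on the OTLP ports; it does not prove the collector accepts telemetry.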
Signal Fields
HTTP logs include:
| Field | Meaning |
|---|---|
| method, path, status, duration_ms | Request summary. |
| request_id | Generated or client-supplied request ID. |
| trace_id | Trace link when OTel span exists. |
| tenant_id, user_id | Present when a principal is authenticated. |
Use request_id for support tickets and trace_id for cross-service triage.
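To make the field rules concrete, here is an illustrative sketch of how such a record could be assembled. It is not the backend's actual logging code: the function names are hypothetical, and generating the request ID as a UUID hex string is an assumption. The one fixed convention used is that an OpenTelemetry trace ID is a 128-bit value rendered as 32 lowercase hex characters.

```python
import uuid
from typing import Optional


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit OpenTelemetry trace ID as 32 lowercase hex characters."""
    return format(trace_id, "032x")


def http_log_record(
    method: str,
    path: str,
    status: int,
    duration_ms: float,
    request_id: Optional[str] = None,
    trace_id: Optional[int] = None,
    tenant_id: Optional[str] = None,
    user_id: Optional[str] = None,
) -> dict:
    """Build an HTTP log record with the fields described in the table."""
    record = {
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        # Keep a client-supplied ID; otherwise generate one (format assumed).
        "request_id": request_id or uuid.uuid4().hex,
    }
    # trace_id appears only when a span exists; 0 is the invalid trace ID.
    if trace_id:
        record["trace_id"] = format_trace_id(trace_id)
    # tenant_id and user_id appear only for authenticated principals.
    if tenant_id is not None:
        record["tenant_id"] = tenant_id
    if user_id is not None:
        record["user_id"] = user_id
    return record
```

An unauthenticated request outside any span therefore carries only the request summary and `request_id`, which is why `request_id` is the safer key to ask for in support tickets.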
Alerts
| Alert | First checks |
|---|---|
| WipeAPITelemetryMissing | API container health, OTel endpoint, collector health, process heartbeat. |
| WipeHTTP5xxRate | Failing route/status in Grafana, Loki logs by route, Tempo trace, dependency readiness. |
| WipeQueueBacklogHigh | Queue depth by queue/status, matching worker health, in-flight jobs, dependency saturation. |
| WipeQueueJobFailures | Worker logs, queue job outcomes, signer/storage/SMTP/chain dependency health. |
| WipeQueueDeadLetters / WipeQueueDLQNotEmpty | Newest DLQ rows, root cause, replay eligibility. |
| WipeProofProcessingStalled | Pending proof status, proof worker logs, signer/storage/TSA/chain signals. |
| WipeProofsFailed | Failed proof rows, rejection reason, dependency health. |
| WipeProofsAwaitingLicense | License grants, allocations, quotas, revocation state. |
| WipeSignerFailures | Signer operation labels: decrypt, hmac, hmac_verify, pades_sign, receipt_sign. |
| WipeAnchorFailures | Anchor worker logs, chain ID, provider health, authorized sender funding/permission. |
| WipeDBConnectionsSaturated | DB pool gauges, active queries, slow routes/workers, PostgreSQL health. |
Common Triage Flow
- Confirm /readyz and dependency container/pod health.
- Open Grafana Wipe Backend Overview.
- Filter Loki by service_name and request_id or trace_id.
- Follow the trace into Tempo when available.
- Inspect relevant queue and DLQ rows.
- Fix dependency or configuration before replaying jobs.
Production Settings
| Variable | Expectation |
|---|---|
| OBSERVABILITY_ENABLED=true | Enable telemetry on API and workers. |
| OBSERVABILITY_OTLP_ENDPOINT | Collector or managed OTel endpoint reachable from all services. |
| OBSERVABILITY_ENVIRONMENT | Stable environment label such as prod, staging, or site name. |
| OBSERVABILITY_OTLP_HEADERS | Required only when the collector/backend needs auth headers. |
| OBSERVABILITY_METRICS_INTERVAL | Balance signal freshness and overhead. |
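A pre-deploy check along these lines can catch a missing endpoint or environment label before telemetry silently goes dark. This is a hypothetical sketch, not the backend's own validation: it assumes the literal value `true` enables telemetry and treats an empty endpoint or environment label as a problem. It does not parse OBSERVABILITY_OTLP_HEADERS or OBSERVABILITY_METRICS_INTERVAL, because their value formats are not described here.

```python
import os
from typing import Mapping

# Variables this sketch expects to be non-empty when telemetry is enabled.
REQUIRED_WHEN_ENABLED = (
    "OBSERVABILITY_OTLP_ENDPOINT",
    "OBSERVABILITY_ENVIRONMENT",
)


def observability_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the observability settings."""
    # Assumption: only the literal "true" (case-insensitive) enables telemetry.
    enabled = env.get("OBSERVABILITY_ENABLED", "").strip().lower() == "true"
    if not enabled:
        return ["OBSERVABILITY_ENABLED is not set to true"]
    return [
        f"{name} is empty"
        for name in REQUIRED_WHEN_ENABLED
        if not env.get(name, "").strip()
    ]


def check_current_environment() -> list[str]:
    """Run the check against the current process environment."""
    return observability_problems(os.environ)
```

Running the same check in the API and every worker container matters because the endpoint must be reachable from all services, not only from the API.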
Keep logs free of secrets, raw proof payloads, canonical certificate JSON, API key secrets, JWTs, and private tenant data.
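One defensive layer for this rule is a log filter that masks obvious secrets before records are emitted. The sketch below is illustrative only: the key names in the pattern are hypothetical, it catches only `key=value` or `key: value` text and JWT-shaped tokens, and it does not replace the discipline of never logging proof payloads or tenant data in the first place.

```python
import logging
import re

# JWTs are three base64url segments; the header segment starts with "eyJ".
JWT_RE = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*")
# Hypothetical list of sensitive key names; adjust to the real field names.
SECRET_KV_RE = re.compile(
    r"(?i)\b(api_key|authorization|password|secret|token)\b(\s*[=:]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Mask JWT-shaped tokens and values of sensitive-looking keys."""
    text = JWT_RE.sub("[REDACTED_JWT]", text)
    return SECRET_KV_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter with `handler.addFilter(RedactingFilter())` applies it to every record that handler emits.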