Incident Response

Use this page for first response. Record the incident, capture request IDs, trace IDs, affected tenant/organization IDs where permitted, and the exact time window before replaying or deleting anything.

Keycloak Outage

Impact:

  • New login/token requests fail.
  • Existing API calls may continue briefly while issued tokens and the cached JWKS remain valid.
  • User, invitation, IdP, and provisioning operations should pause.

Triage:

  1. Check Keycloak and Keycloak PostgreSQL health.
  2. Query realm metadata from the API network.
  3. Check API logs for JWT/JWKS errors and an elevated rate of 401 responses.
  4. Confirm that the public issuer URLs still match the configured JWT iss claim.
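
Steps 2 and 4 can be combined into one scripted check. This is a minimal sketch, assuming a recent Keycloak that serves OIDC discovery at /realms/<realm>/.well-known/openid-configuration (older deployments prefix the path with /auth). The function names, base URL, and expected issuer are placeholders, not part of the platform:

```python
import json
import urllib.request


def discovery_url(base_url: str, realm: str) -> str:
    """Build the OIDC discovery URL for a realm (assumes no /auth prefix)."""
    return f"{base_url.rstrip('/')}/realms/{realm}/.well-known/openid-configuration"


def issuer_matches(metadata: dict, expected_iss: str) -> bool:
    """Compare the advertised issuer with the configured JWT iss, exactly."""
    return metadata.get("issuer") == expected_iss


def check_realm(base_url: str, realm: str, expected_iss: str,
                timeout: float = 5.0) -> bool:
    """Fetch realm metadata and report whether the issuer still matches.

    Raises urllib.error.URLError if Keycloak is unreachable, which is
    itself a useful triage signal.
    """
    with urllib.request.urlopen(discovery_url(base_url, realm), timeout=timeout) as resp:
        metadata = json.load(resp)
    return issuer_matches(metadata, expected_iss)
```

Run check_realm from a host on the API network so the result reflects what the API itself sees. The comparison is deliberately exact: a trailing slash or an http/https difference is enough to fail JWT validation.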

Recovery:

  1. Restore the Keycloak database first, then Keycloak itself.
  2. Restart API/workers only if issuer or realm import changed.
  3. Run password-token smoke tests for the master and exawipe realms.
  4. Retry failed onboarding/admin work after confirming idempotency behavior.
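
The password-token smoke test in step 3 could look like the sketch below. It assumes the standard Keycloak token endpoint (/realms/<realm>/protocol/openid-connect/token, without an /auth prefix) and a client that has direct access grants enabled. The client ID, user, and function names are hypothetical:

```python
import json
import urllib.parse
import urllib.request


def token_request(base_url: str, realm: str, client_id: str,
                  username: str, password: str) -> urllib.request.Request:
    """Build a form-encoded password-grant request for one realm."""
    url = f"{base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    form = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,   # hypothetical smoke-test client
        "username": username,     # hypothetical smoke-test user
        "password": password,
    }).encode("ascii")
    return urllib.request.Request(url, data=form, method="POST")


def smoke_test(req: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Return True if the response carries an access token.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses; treat
    that as a failed smoke test as well.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return bool(body.get("access_token"))
```

Call token_request once per realm (master, then exawipe) and pass each result to smoke_test. Use a dedicated low-privilege account and keep its password out of shell history.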

Signer Outage

Impact:

  • Proof processing can fail during decrypt, HMAC, PAdES signing, receipt signing, or future transaction signing.
  • Proof jobs may retry and eventually move to DLQ.

Triage:

  1. Check WipeSignerFailures by operation.
  2. Inspect proof worker logs for affected proof IDs.
  3. Confirm SIGNER_MODE, endpoint, timeout, bearer/mTLS material, key IDs, and TSA settings.
  4. Inspect proofs.validated, proofs.retry, and DLQ depth.

Recovery:

  1. Restore signer availability and key material.
  2. Run one proof upload smoke test.
  3. Replay only jobs whose referenced proof is still eligible.
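
Step 3 amounts to splitting the DLQ into jobs to replay and jobs to hold. The sketch below shows that split only. The job shape (a dict with a proof_id key) and the set of eligible proof IDs are assumptions; what makes a proof eligible is not defined on this page and must come from the proof state rules:

```python
from typing import Iterable


def partition_for_replay(jobs: Iterable[dict],
                         eligible_proof_ids: set[str]) -> tuple[list[dict], list[dict]]:
    """Split jobs into (replay, hold) by whether their proof is eligible.

    Jobs with a missing or unknown proof_id are held, never replayed.
    """
    replay: list[dict] = []
    hold: list[dict] = []
    for job in jobs:
        target = replay if job.get("proof_id") in eligible_proof_ids else hold
        target.append(job)
    return replay, hold
```

Holding by default keeps a job with an unreadable or stale reference out of the signer until someone has looked at it.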

Storage Outage

Impact:

  • Proof upload, PDF generation, canonical JSON download, billing exports, and monthly reports can fail.

Triage:

  1. Check MinIO/S3 endpoint health, credentials, bucket existence, and network path.
  2. Inspect API/proof/report worker logs for object keys and error classes.
  3. Confirm bucket names match the environment.
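
For MinIO deployments, the endpoint health part of step 1 can be scripted against MinIO's unauthenticated liveness path, /minio/health/live. This is a sketch only: the endpoint value is a placeholder, and the path does not exist on AWS S3:

```python
import urllib.request


def minio_live_url(endpoint: str) -> str:
    """Build the MinIO liveness URL for an endpoint such as http://minio:9000."""
    return f"{endpoint.rstrip('/')}/minio/health/live"


def is_live(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the liveness probe answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(minio_live_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

A live endpoint proves only that the server process answers. Credentials, bucket existence, and bucket names still need the checks in steps 1 and 3.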

Recovery:

  1. Restore storage and verify read/write permissions.
  2. Re-run failed proof/report/export jobs through queue replay rules.
  3. Sample certificate PDF and canonical JSON downloads.

Chain Or Anchor Outage

Impact:

  • Anchor jobs can retry or DLQ.
  • Certificates may remain CERTIFIED_NO_ANCHOR when anchoring is disabled or unavailable by policy.
  • Public verification can validate certificate integrity but may report chain unavailability or missing anchor status.

Triage:

  1. Check anchor worker logs and WipeAnchorFailures.
  2. Confirm BLOCKCHAIN_ENABLED, default chain, explorer URL, lookup metadata, and provider reachability.
  3. Inspect unanchored certificates:

     -- Also matches NULL in case anchor_tx_hash is nullable.
     SELECT id, tenant_id, organization_id, status, anchor_chain_id, anchor_tx_hash, created_at
     FROM certificates
     WHERE anchor_tx_hash IS NULL OR anchor_tx_hash = ''
     ORDER BY created_at DESC
     LIMIT 50;

Recovery:

  1. Restore chain connectivity, authorized sender, and funding/permission.
  2. Replay certificates.to_anchor DLQ jobs only after verifying certificate state.
  3. Sample /verify for anchored and unanchored certificates.

Public Verification Abuse

Impact:

  • Increased RATE_LIMITED, UNKNOWN_CERTIFICATE, or anomaly counts.
  • Potential scanning of public verification identifiers.

Triage:

  1. Check /admin/api/v1/verification-log and anomaly metrics.
  2. Confirm rate limits and captcha settings.
  3. Check reverse proxy logs for source distribution.
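
The source distribution in step 3 can be approximated by counting client addresses on verification requests. The sketch assumes the proxy writes common or combined log format with the client address as the first field, and that verification traffic is under /verify. Adjust both assumptions to the actual proxy configuration:

```python
from collections import Counter
from typing import Iterable


def source_distribution(lines: Iterable[str], path_prefix: str = "/verify") -> Counter:
    """Count requests per client address for paths under path_prefix."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        # Expected layout: ip ident user [time zone] "METHOD path proto" status size
        if len(parts) < 7:
            continue  # skip lines that do not look like access-log entries
        if parts[6].startswith(path_prefix):
            counts[parts[0]] += 1
    return counts
```

Calling source_distribution(open(path)).most_common(20) gives a quick view of whether the traffic comes from a few sources or is widely distributed, which informs the choice between edge rate limits and captcha policy.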

Recovery:

  1. Tighten edge rate limits or captcha policy.
  2. Keep response bodies minimal; do not add internal identifiers, even to support debugging.
  3. Export logs for SIEM review if the pattern persists.

License Or Billing Incident

Impact:

  • Proofs enter AWAITING_LICENSE.
  • Billing reports or exports fail.
  • Consumption receipt chain may alert.

Triage:

  1. Check active grants, allocation hierarchy, quotas, validity dates, and revocations.
  2. Inspect billing.exports and reports.monthly jobs.
  3. Check report worker logs and storage permissions.
  4. For receipt-chain alerts, inspect maintenance worker output before changing data.

Recovery:

  1. Import or correct license grants as a PLATFORM_ADMIN.
  2. Create or adjust allocations through the tenant admin path.
  3. Retry affected proofs or billing export jobs.
  4. Preserve receipt-chain evidence for audit.