Workers And Queues

The backend uses a PostgreSQL table-backed queue. Jobs are claimed with locking semantics and dead-lettered after retry exhaustion. The operator source of truth is the database plus structured worker logs and OpenTelemetry metrics.
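
The claim, retry, and dead-letter lifecycle described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `MAX_ATTEMPTS`, the `in_flight` status name, and the `InMemoryQueue` class are hypothetical, and the model ignores the row locking that the real PostgreSQL-backed queue relies on for concurrent workers.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # hypothetical retry budget; the real limit is not stated here


@dataclass
class Job:
    id: int
    queue: str
    payload: dict
    status: str = "pending"
    attempts: int = 0


@dataclass
class InMemoryQueue:
    """Toy stand-in for the jobs and jobs_dlq tables (no locking, single process)."""
    jobs: list = field(default_factory=list)
    dlq: list = field(default_factory=list)

    def claim(self, queue, limit):
        """Mark up to `limit` pending jobs as in-flight and return them."""
        claimed = [j for j in self.jobs
                   if j.queue == queue and j.status == "pending"][:limit]
        for job in claimed:
            job.status = "in_flight"  # assumed status name
            job.attempts += 1
        return claimed

    def fail(self, job, reason):
        """Return a failed job to pending, or dead-letter it once retries run out."""
        if job.attempts >= MAX_ATTEMPTS:
            job.status = "dead"
            self.dlq.append({"original_job_id": job.id, "queue": job.queue,
                             "payload": job.payload, "reason": reason})
        else:
            job.status = "pending"


q = InMemoryQueue(jobs=[Job(1, "proofs.validated", {"proof_id": "p-1"})])
for _ in range(MAX_ATTEMPTS):
    (job,) = q.claim("proofs.validated", limit=10)
    q.fail(job, "signer unavailable")
```

After the loop the job has used its whole retry budget, so it is marked dead and a copy sits in the toy DLQ; a further `claim` call returns nothing for that queue.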

Queues

| Queue | Producer | Consumer | Purpose |
|---|---|---|---|
| proofs.validated | Proof submit routes | worker-proof | Validate proof, decrypt payload, create certificate, consume license, sign PDF/receipt. |
| certificates.to_anchor | Certificate service | worker anchor | Submit or update blockchain anchor metadata and regenerate certificate PDF. |
| notifications.to_send | User, billing, webhook flows | worker notifications | Send SMTP/webhook notifications, retry failures, write delivery metadata. |
| billing.exports | Billing export routes | worker reports | Generate requested export artifacts. |
| reports.monthly | Report scheduler | worker reports | Generate monthly billing report artifacts. |
| proofs.retry | Retry routes/scheduler | worker retry | Re-enqueue due failed or awaiting-license proofs. |

The local PostgreSQL/PGMQ setup also seeds underscore-separated variants of the queue names for compatibility: proofs_validated, certificates_to_anchor, notifications_to_send, billing_exports, reports_monthly, and proofs_retry.
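
The relationship between the dotted queue names and the seeded local names is a plain dot-to-underscore substitution. A minimal sketch, where `local_queue_name` and the two constants are hypothetical helper names rather than anything from the codebase:

```python
# Dotted queue names as listed in the Queues table.
DOTTED_QUEUES = [
    "proofs.validated",
    "certificates.to_anchor",
    "notifications.to_send",
    "billing.exports",
    "reports.monthly",
    "proofs.retry",
]


def local_queue_name(name: str) -> str:
    """Replace dots with underscores to get the name seeded in the local setup."""
    return name.replace(".", "_")


# Mapping from dotted name to local underscore name.
LOCAL_QUEUES = {name: local_queue_name(name) for name in DOTTED_QUEUES}
```

A lookup such as `LOCAL_QUEUES["certificates.to_anchor"]` yields `certificates_to_anchor`, which helps when checking local queue contents against production names.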

Worker Commands

| Worker | Command | Useful variables |
|---|---|---|
| Proof | /app/worker-proof | WORKER_ID, WORKER_POLL_INTERVAL, WORKER_CLAIM_LIMIT, WORKER_ONCE. |
| Anchor | /app/worker anchor | Same worker variables plus blockchain/signer/storage settings. |
| Notifications | /app/worker notifications | Same worker variables plus notification/URL policy settings. |
| Reports | /app/worker reports | Same worker variables plus storage, billing, PAdES settings. |
| Retry | /app/worker retry | Same worker variables; enqueues due proof retries before claims. |
| Maintenance | /app/worker maintenance | Uses retention interval when WORKER_POLL_INTERVAL is unset. |

Set WORKER_ONCE=true for a controlled single pass:

  WORKER_ONCE=true WORKER_CLAIM_LIMIT=10 /app/worker retry
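
When single passes are scripted, for example from a runbook helper, the same invocation can be assembled programmatically. The sketch below only builds the argument list and environment. `single_pass_invocation` is a hypothetical helper, and the assumption that the proof worker is a separate `/app/worker-proof` binary comes from the command table above.

```python
import os


def single_pass_invocation(worker: str, claim_limit: int = 10):
    """Return (argv, env) for one controlled pass of the named worker."""
    # The proof worker is listed as its own binary; the others are subcommands.
    argv = ["/app/worker-proof"] if worker == "proof" else ["/app/worker", worker]
    # Copy the current environment and overlay the single-pass settings.
    env = dict(os.environ, WORKER_ONCE="true", WORKER_CLAIM_LIMIT=str(claim_limit))
    return argv, env


argv, env = single_pass_invocation("retry")
# To actually run the pass: subprocess.run(argv, env=env, check=True)
```

Building the command separately from running it lets an operator log or review the exact invocation before anything touches the queue.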
  

Queue Inspection

Use SQL for baseline triage:

  SELECT queue, status, count(*)
  FROM jobs
  GROUP BY queue, status
  ORDER BY queue, status;
  

DLQ summary:

  SELECT queue, reason, count(*), max(created_at) AS last_seen
  FROM jobs_dlq
  GROUP BY queue, reason
  ORDER BY last_seen DESC;
  

Newest DLQ entries:

  SELECT id, original_job_id, queue, payload, reason, created_at
  FROM jobs_dlq
  ORDER BY created_at DESC
  LIMIT 20;
  

Replay Rules

  1. Confirm the downstream dependency is healthy before replaying anything.
  2. Inspect representative DLQ rows and the referenced business record.
  3. Replay by inserting a new jobs row with the same queue and payload.
  4. Keep the original DLQ row for audit.
  5. Do not update a dead job back to pending.

If a payload is malformed or references a deleted resource, leave it in DLQ and record the incident decision in the operations log.
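
The replay rules above can be sketched end to end as follows. SQLite stands in for PostgreSQL purely so the example is self-contained. The `jobs` columns beyond `queue` and `status` (a `payload` column and a `'pending'` default), the JSON payload encoding, and the `replay` helper are assumptions for illustration, not the actual schema or tooling.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL so the sketch runs anywhere.
# The table layouts below are assumed, not copied from the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY, queue TEXT NOT NULL, payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    );
    CREATE TABLE jobs_dlq (
        id INTEGER PRIMARY KEY, original_job_id INTEGER, queue TEXT NOT NULL,
        payload TEXT NOT NULL, reason TEXT
    );
    INSERT INTO jobs (id, queue, payload, status)
        VALUES (1, 'notifications.to_send', '{"notification_id": 7}', 'dead'),
               (2, 'notifications.to_send', '{"notification_id', 'dead');
    INSERT INTO jobs_dlq (id, original_job_id, queue, payload, reason)
        VALUES (10, 1, 'notifications.to_send', '{"notification_id": 7}', 'smtp timeout'),
               (11, 2, 'notifications.to_send', '{"notification_id', 'decode error');
""")


def replay(conn, dlq_id):
    """Insert a fresh jobs row from one DLQ row; return the new job id or None."""
    row = conn.execute(
        "SELECT queue, payload FROM jobs_dlq WHERE id = ?", (dlq_id,)
    ).fetchone()
    if row is None:
        return None
    queue, payload = row
    try:
        json.loads(payload)  # malformed payloads stay in the DLQ
    except json.JSONDecodeError:
        return None
    with conn:  # commit the insert, or roll back on error
        cur = conn.execute(
            "INSERT INTO jobs (queue, payload) VALUES (?, ?)", (queue, payload)
        )
    return cur.lastrowid


new_id = replay(conn, 10)   # healthy payload: a new pending job is created
skipped = replay(conn, 11)  # truncated JSON: left in the DLQ
```

The helper never updates or deletes existing rows: the dead job keeps its status, both DLQ rows remain for audit, and only a new `jobs` row is added for the payload that parses.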

Proof Retry Checks

  SELECT id, tenant_id, organization_id, status, attempt_count, next_retry_at,
         rejection_reason, received_at
  FROM proofs
  WHERE status IN ('FAILED', 'AWAITING_LICENSE', 'REJECTED')
  ORDER BY received_at DESC
  LIMIT 50;
  
| Status | Replay guidance |
|---|---|
| FAILED | Replay only after fixing signer, storage, PDF, validation config, or other dependency failures. |
| AWAITING_LICENSE | Import/allocate licenses or fix quota scope first, then retry. |
| REJECTED | Terminal validation failure; replay only if policy or code was wrong. |

Worker Scaling

Proof, notifications, reports, retry, and anchor workers are stateless and can scale horizontally where the workload allows it. Base scaling decisions on queue depth, in-flight jobs, dependency capacity, and idempotency guarantees. Scale the anchor and monthly report schedulers conservatively until production chain and scheduler ownership are finalized.