Observability
beampipe-core exposes liveness, readiness, Prometheus metrics, provenance events, alerting resources, and backend debug fields. Use metrics for dashboards and events/run records for per-run forensics.
Endpoints
| Endpoint | Purpose |
|---|---|
| `GET /api/v2/health` | Process liveness |
| `GET /api/v2/ready` | Postgres, queue, worker, and dependency readiness |
| `GET /api/v2/metrics` | API process metrics |
| `:9090/metrics` | Prometheus metrics server for API/worker containers |
| `GET /api/v2/executions/{id}/events` | Execution provenance stream |
| `GET /api/v2/sources/{id}/events` | Source provenance stream |
| `GET /api/v2/projects/{module}/events` | Project provenance stream |
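As a minimal sketch of how the endpoints above might be exercised from a script, the helpers below build provenance event URLs and report the HTTP status of a liveness or readiness probe. The base URL `http://127.0.0.1:8000` and the helper names `events_url` and `probe` are assumptions for illustration; only the paths come from the table.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the API is reachable here; adjust for your deployment.
BASE_URL = "http://127.0.0.1:8000"

# Path templates mirroring the endpoint table ({key} is the id or module).
EVENT_PATHS = {
    "execution": "/api/v2/executions/{key}/events",
    "source": "/api/v2/sources/{key}/events",
    "project": "/api/v2/projects/{key}/events",
}


def events_url(kind: str, key: str, base_url: str = BASE_URL) -> str:
    """Build the provenance events URL for an execution, source, or project."""
    quoted = urllib.parse.quote(key, safe="")
    return base_url.rstrip("/") + EVENT_PATHS[kind].format(key=quoted)


def probe(path: str, base_url: str = BASE_URL, timeout: float = 5.0) -> int:
    """Return the HTTP status code of GET <base_url><path>."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code worth reporting.
        return err.code
```

For example, `probe("/api/v2/health")` followed by `probe("/api/v2/ready")` separates "the process is down" from "a dependency is not ready". A connection failure raises `urllib.error.URLError`, which this sketch deliberately lets propagate.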
Debug order
For a stuck source or run, inspect in this order:
- `GET /api/v2/ready` for database, queue, worker, and dependency state.
- Prometheus gauges for queue depth, oldest queued job age, dependency health, and execution state counts.
- Provenance events for source/execution milestones.
- Execution response run record for DIM/Slurm submit and poll details.
Metrics flow

    +-------------+      +-------------+      +-------------+
    | API routes  | ---> | recorder    | ---> | /metrics    |
    +-------------+      +-------------+      +-------------+
    | workers     | ---> | recorder    | ---> | :9090 scrape|
    +-------------+      +-------------+      +-------------+
    | readiness   | ---> | gauges      | ---> | Prometheus  |
    +-------------+      +-------------+      +-------------+

Key metric families include API request counters, job queue gauges, dependency health, source processing counts, scheduler tick duration, and execution state counts.
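To inspect these families without a dashboard, the output of a metrics endpoint can be parsed directly. The sketch below reads the Prometheus text exposition format into `(name, labels, value)` tuples. The metric names in `SAMPLE` are hypothetical placeholders, not the names beampipe-core actually exports, and the parser is intentionally minimal (it ignores timestamps and leaves the label string unparsed).

```python
# Hypothetical sample; real metric names will differ.
SAMPLE = """\
# HELP beampipe_job_queue_depth Jobs waiting in the queue.
# TYPE beampipe_job_queue_depth gauge
beampipe_job_queue_depth 7
beampipe_execution_state_count{state="running"} 3
beampipe_execution_state_count{state="queued"} 12
"""


def parse_metrics(text: str) -> list[tuple[str, str, float]]:
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            start = line.index("{")
            end = line.rindex("}")
            name = line[:start]
            labels = line[start + 1 : end]
            rest = line[end + 1 :].split()
        else:
            name, *rest = line.split()
            labels = ""
        # rest[0] is the value; an optional timestamp may follow and is ignored.
        samples.append((name, labels, float(rest[0])))
    return samples


def select(samples, prefix: str):
    """Keep only samples whose metric name starts with the given prefix."""
    return [s for s in samples if s[0].startswith(prefix)]
```

Filtering with `select(parse_metrics(text), "beampipe_job_queue")` is one way to watch a single family while debugging a stuck queue.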
Prometheus stack
Prometheus listens on http://127.0.0.1:9099 and scrapes API/worker metrics through the Compose network.
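One way to confirm that the scrape targets are healthy is an instant query against Prometheus's standard HTTP API (`/api/v1/query`). The sketch below assumes the address given above and uses the built-in `up` series, which Prometheus records for every scrape target; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://127.0.0.1:9099"  # address from the paragraph above


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def run_query(expr: str, base_url: str = PROMETHEUS_URL, timeout: float = 5.0):
    """Run an instant query and return the list of result samples."""
    url = instant_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]
```

Calling `run_query("up == 0")` should return the targets Prometheus currently fails to scrape; an empty list means every target responded on its last scrape.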
Alerts
Alerting resources are API-managed:
| Resource | Endpoints |
|---|---|
| Notification channels | /api/v2/notification-channels |
| Alert rules | /api/v2/alert-rules |
| Deliveries | /api/v2/alert-deliveries |
Use POST /api/v2/notification-channels/{id}/test before enabling production alerts.
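A channel test could be scripted roughly as follows. This is a sketch under assumptions: the base URL is a placeholder, the request is sent with an empty body, and no authentication header is set, since the request schema and auth requirements are not described here.

```python
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: adjust for your deployment


def channel_test_request(channel_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that triggers a notification channel test."""
    url = f"{base_url.rstrip('/')}/api/v2/notification-channels/{channel_id}/test"
    # Assumption: the endpoint accepts an empty body.
    return urllib.request.Request(url, data=b"", method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Running `send(channel_test_request("<channel-id>"))` and checking for a 2xx status, then confirming the message arrived at its destination, gives reasonable confidence before alert rules are pointed at the channel.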
Execution debug fields
Execution responses include backend-specific fields when available:
| Field | Backend |
|---|---|
| `dim_session_status_url` | REST/DIM |
| `dim_graph_status_url` | REST/DIM |
| `slurm_session_dir` | Slurm |
| `slurm_login_node` | Slurm |
| `slurm_remote_user` | Slurm |
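When triaging a run, it can help to pull just these fields out of an execution response and group them by backend. The sketch below assumes the fields appear as top-level keys of the decoded JSON object and that absent fields are missing or null; the sample values are invented for illustration.

```python
# Field-to-backend mapping taken from the table above.
DEBUG_FIELDS = {
    "dim_session_status_url": "REST/DIM",
    "dim_graph_status_url": "REST/DIM",
    "slurm_session_dir": "Slurm",
    "slurm_login_node": "Slurm",
    "slurm_remote_user": "Slurm",
}


def backend_debug_fields(execution: dict) -> dict[str, dict[str, str]]:
    """Group the debug fields present in an execution response by backend."""
    grouped: dict[str, dict[str, str]] = {}
    for field, backend in DEBUG_FIELDS.items():
        value = execution.get(field)
        if value:  # fields are only included when available
            grouped.setdefault(backend, {})[field] = value
    return grouped


# Hypothetical execution response, trimmed to the relevant keys.
example = {
    "id": "42",
    "slurm_session_dir": "/scratch/run-42",
    "slurm_login_node": "login1",
    "dim_graph_status_url": None,
}
```

Here `backend_debug_fields(example)` yields only a `"Slurm"` group, which is a quick signal of which backend handled the run and where to look next.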
Next: keep the operational checklist in Production runbook.