Health API
Probe Prelude TE for liveness and readiness, and scrape Prometheus metrics — when to use each endpoint, common pitfalls, and worked examples.
The Health API surfaces the live state of Prelude TE and its subsystems. After this first mention, the rest of this page refers to Prelude TE as "the engine". The endpoints are read-only and split along an authentication boundary: the JSON snapshot is public (safe for any probe), and the Prometheus metrics endpoint is Bearer-protected (safe for a scraper holding a token). For the operator-facing companion of this page, see Monitoring: Health and Monitoring: Metrics.
When to use this API
- Kubernetes / Docker probes. GET /api/health?verbose=false is the right liveness signal: small payload, no auth.
- Operator dashboards. GET /api/health (verbose) returns a per-subsystem status that drives an at-a-glance overview.
- Prometheus scraping. GET /api/health/metrics exposes process, BGP, topology, output, and licensing metrics in the text exposition format.
- Post-incident review. Pair the JSON snapshot with Prometheus history to determine when a subsystem dropped to degraded and what dependency was responsible.
Endpoints
| Method | Path | Auth | Purpose |
|---|---|---|---|
| GET | /api/health | none | JSON snapshot of engine and per-subsystem health. |
| GET | /api/health/metrics | Bearer | Prometheus text exposition (version 0.0.4) of operational metrics. |
Common pitfalls
- Treating degraded as "broken". The endpoint returns HTTP 200 for both ok and degraded, and HTTP 503 only when the database is unreachable. A liveness probe that flips on the HTTP status will not restart a pod just because BGP peers are flapping; that is intentional.
- Swallowing the 503. A down response means the database is unreachable. Surface it as a page-out, not a soft warning.
- Authenticating the snapshot. GET /api/health is public by design. Adding a token to your probe does not break anything, but it does mean that rotating that secret affects your liveness check.
- Treating metrics as a TSDB. /api/health/metrics exposes point-in-time gauges and monotonic counters. Store history in Prometheus, not by polling and journaling the response.
- High-frequency snapshot polling. The JSON endpoint loads peer and output state on every call. A 10-second cadence is fine for dashboards; sub-second polling is wasteful.
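The first two pitfalls come down to how a probe interprets the HTTP status code. The following is a minimal sketch, assuming a Bash-like shell with curl installed. The function names health_action and probe_health and the action labels (none, page, investigate) are hypothetical and not part of the engine.

```shell
# Map the HTTP status of GET /api/health to an operator action.
# The action labels are illustrative only.
health_action() {
  case "$1" in
    200) echo "none" ;;         # ok or degraded: do not restart
    503) echo "page" ;;         # database unreachable: page someone
    *)   echo "investigate" ;;  # anything else, e.g. 000 when curl cannot connect
  esac
}

# Fetch only the status code of the URL given as $1, then classify it.
# curl is run without -f so that a 503 is reported as a code, not as a failure.
probe_health() {
  local code
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code="000"
  health_action "$code"
}
```

For example, `probe_health "https://te.example.com/api/health?verbose=false"` prints one of the three labels, which a wrapper script could route to a pager or a log line.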
Worked example: dashboard plus alert
# 1. Liveness probe (small payload, no auth).
curl -fsSL "https://te.example.com/api/health?verbose=false"
# {"status":"ok","service":"prelude-te","uptime_seconds":...}
# 2. Full snapshot for an operator dashboard.
curl -fsSL https://te.example.com/api/health
# {"status":"degraded","checks":{"bgp":{"status":"degraded",...},...}}
# 3. Scrape metrics for Prometheus / Grafana.
curl -fsSL https://te.example.com/api/health/metrics \
-H "Authorization: Bearer $TOKEN"
# # HELP prelude_te_bgp_peers Number of configured BGP peers by FSM state.
# # TYPE prelude_te_bgp_peers gauge
# prelude_te_bgp_peers{state="established"} 4
# ...
Bruno: Health / Get health, Health / Get metrics
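When checking a single series by hand, for instance during the metrics scrape in step 3, it can help to pull one value out of the exposition text. The helper below is a rough sketch, assuming a POSIX shell with awk. The name metric_value is hypothetical, and it matches the series string exactly, so it will not handle reordered labels or label values containing spaces.

```shell
# Print the value of one series from Prometheus text exposition read on stdin.
# Usage: metric_value '<name>{<labels>}'
metric_value() {
  awk -v series="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    $1 == series { print $2; exit }    # first exact match wins
  '
}
```

Piping the output of the step 3 curl command into `metric_value 'prelude_te_bgp_peers{state="established"}'` would print the current gauge value. For anything beyond spot checks, query Prometheus instead.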
Status aggregation
The global status field of the JSON payload is computed from
the five per-subsystem checks:
| Global status | Returned when… | HTTP |
|---|---|---|
| ok | Every subsystem is healthy. | 200 |
| degraded | One or more non-critical checks are unhealthy (BGP, topology, outputs, licensing). | 200 |
| down | The database check is down. | 503 |
The database is the only critical dependency. The split lets
you safely wire a Kubernetes liveness probe to restart only on
503, while readiness probes and dashboards can react to
degraded independently.
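The aggregation rule can be expressed as a small function. This is a sketch of the table above, not the engine's actual implementation. The name aggregate_status is hypothetical, and two behaviours are assumptions: any per-check value other than ok counts as unhealthy, and a database value that is neither ok nor down is treated as degraded.

```shell
# Usage: aggregate_status <database_status> <other_check_status>...
# Prints "<global status> <HTTP code>".
aggregate_status() {
  local database="$1"
  shift
  # The database is the only critical dependency: down always wins.
  if [ "$database" = "down" ]; then
    echo "down 503"
    return 0
  fi
  # Assumption: any value other than "ok" counts as unhealthy.
  local check
  for check in "$database" "$@"; do
    if [ "$check" != "ok" ]; then
      echo "degraded 200"
      return 0
    fi
  done
  echo "ok 200"
}
```

Calling `aggregate_status ok degraded ok ok ok` models a flapping BGP check with a healthy database: the result is degraded with HTTP 200, so a liveness probe keyed on 503 leaves the pod alone.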
Reference
The per-check shape (database, BGP, topology, outputs, licensing)
and the trimmed ?verbose=false payload are documented in detail
in Monitoring: Health. The list of
exposed metrics, their types, labels, and PromQL examples lives
in Monitoring: Metrics.