Health API

Probe Prelude TE for liveness and readiness, and scrape Prometheus metrics — when to use each endpoint, common pitfalls, and worked examples.

The Health API surfaces the live state of Prelude TE and its subsystems. After this first mention, the rest of this page refers to Prelude TE as "the engine". The endpoints are read-only and split along an authentication boundary: the JSON snapshot is public (safe for any probe), whereas the Prometheus metrics endpoint is Bearer-protected (intended for a scraper holding a token). For the operator-facing companions to this page, see Monitoring: Health and Monitoring: Metrics.

When to use this API

Kubernetes / Docker probes. GET /api/health?verbose=false is the right liveness signal — small payload, no auth.
Operator dashboards. GET /api/health (verbose) returns a per-subsystem status that drives an at-a-glance overview.
Prometheus scraping. GET /api/health/metrics exposes process, BGP, topology, output, and licensing metrics in the text exposition format.
Post-incident review. Pair the JSON snapshot with Prometheus history to determine when a subsystem dropped to degraded and which dependency was responsible.

Endpoints

Method	Path	Auth	Purpose
`GET`	`/api/health`	none	JSON snapshot of engine and per-subsystem health.
`GET`	`/api/health/metrics`	Bearer	Prometheus text exposition (version 0.0.4) of operational metrics.
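
The authentication split in the table can be sketched in Python with the standard library alone. This is an illustration, not an official client: the helper names `health_request` and `metrics_request` are hypothetical, and `BASE_URL` is the placeholder host used in the examples on this page.

```python
import urllib.request

BASE_URL = "https://te.example.com"  # placeholder host, not a real deployment


def health_request(base_url: str = BASE_URL, verbose: bool = True) -> urllib.request.Request:
    """Build an unauthenticated GET for the JSON snapshot (hypothetical helper)."""
    url = f"{base_url}/api/health"
    if not verbose:
        # Trimmed payload, as recommended for liveness probes.
        url += "?verbose=false"
    return urllib.request.Request(url, method="GET")


def metrics_request(token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET for the metrics endpoint, which requires a Bearer token (hypothetical helper)."""
    req = urllib.request.Request(f"{base_url}/api/health/metrics", method="GET")
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

Either request can be sent with `urllib.request.urlopen(req, timeout=5)`. Note that `urlopen` raises `urllib.error.HTTPError` for a 503, so a caller that wants to inspect the `down` response needs to catch that exception.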

Common pitfalls

Treating degraded as "broken". The endpoint returns HTTP 200 for both ok and degraded, and HTTP 503 only when the database is unreachable. A liveness probe that flips on the HTTP status will not restart a pod just because BGP peers are flapping — that is intentional.
Swallowing the 503. A 503 carries the down status, which means the database is unreachable. Route it to a paging alert, not a soft warning.
Authenticating the snapshot. GET /api/health is public by design. Adding a token to your probe does not break anything, but it does mean that rotating that secret affects your liveness check.
Treating metrics as a TSDB. /api/health/metrics exposes point-in-time gauges and monotonic counters. Store history in Prometheus, not by polling and journaling the response.
High-frequency snapshot polling. The JSON endpoint loads peer and output state on every call. A 10-second cadence is fine for dashboards; sub-second polling is wasteful.
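
The first two pitfalls come down to how a client maps one response to an action. A minimal sketch of that mapping follows. The function name `probe_action` and the action labels `healthy`, `warn`, and `page` are hypothetical, and treating any unexpected HTTP status or unknown `status` value as page-worthy is an assumption rather than documented engine behavior.

```python
import json


def probe_action(http_status: int, body: str) -> str:
    """Map one /api/health response to 'healthy', 'warn', or 'page' (hypothetical labels)."""
    if http_status != 200:
        # 503 means the database is unreachable. Paging on any other
        # non-200 status as well is an assumption of this sketch.
        return "page"
    status = json.loads(body).get("status")
    if status == "ok":
        return "healthy"
    if status == "degraded":
        # Still HTTP 200: worth a warning, not a restart.
        return "warn"
    # Assumption: an unrecognized status is treated like an outage.
    return "page"
```

For example, `probe_action(200, '{"status":"degraded"}')` yields `warn`, while any 503 yields `page` without the body being parsed.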

Worked example: dashboard plus alert

# 1. Liveness probe (small payload, no auth).
curl -fsSL "https://te.example.com/api/health?verbose=false"
# {"status":"ok","service":"prelude-te","uptime_seconds":...}

# 2. Full snapshot for an operator dashboard.
curl -fsSL https://te.example.com/api/health
# {"status":"degraded","checks":{"bgp":{"status":"degraded",...},...}}

# 3. Scrape metrics for Prometheus / Grafana.
curl -fsSL https://te.example.com/api/health/metrics \
  -H "Authorization: Bearer $TOKEN"
# # HELP prelude_te_bgp_peers Number of configured BGP peers by FSM state.
# # TYPE prelude_te_bgp_peers gauge
# prelude_te_bgp_peers{state="established"} 4
# ...

Bruno: Health / Get health, Health / Get metrics
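
If you need to read a value out of the metrics response outside Prometheus (for instance in a smoke test), the sample lines can be parsed with a few lines of Python. This is a rough sketch, not a complete parser for the text exposition format: it ignores `# HELP` and `# TYPE` comments and does not handle escaped quotes inside label values. `SAMPLE` reuses the output shown in step 3, and `parse_samples` is a hypothetical helper name.

```python
import re

SAMPLE = """\
# HELP prelude_te_bgp_peers Number of configured BGP peers by FSM state.
# TYPE prelude_te_bgp_peers gauge
prelude_te_bgp_peers{state="established"} 4
"""

# metric_name, optional {labels}, then the sample value.
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list:
    """Return a list of (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Calling `parse_samples(SAMPLE)` returns a single tuple for `prelude_te_bgp_peers` with the label `state="established"`. For anything beyond spot checks, a maintained Prometheus client library is the safer choice.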

Status aggregation

The global status field of the JSON payload is computed from the five per-subsystem checks:

Global status	Returned when…	HTTP
`ok`	Every subsystem is healthy.	200
`degraded`	One or more non-critical checks are unhealthy (BGP, topology, outputs, licensing).	200
`down`	The database check is `down`.	503

The database is the only critical dependency. Because a Kubernetes HTTP probe treats any 2xx or 3xx response as success, this split lets you safely wire a liveness probe that restarts the pod only on 503, while readiness probes and dashboards can react to degraded independently.
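
The rule in the table can be written out as a short Python sketch. It illustrates the documented mapping and is not the engine's implementation. Only the `bgp` key appears in the sample payload on this page; the other check names (`database`, `topology`, `outputs`, `licensing`) are assumed key spellings, and a missing check is assumed to count as `ok`.

```python
from typing import Mapping, Tuple

# Assumed key names for the four non-critical checks.
NON_CRITICAL = ("bgp", "topology", "outputs", "licensing")


def aggregate(checks: Mapping[str, str]) -> Tuple[str, int]:
    """Return (global_status, http_status) from per-subsystem statuses."""
    # The database is the only critical dependency: down means 503.
    if checks.get("database") == "down":
        return "down", 503
    # Any unhealthy non-critical check degrades the global status but keeps HTTP 200.
    if any(checks.get(name, "ok") != "ok" for name in NON_CRITICAL):
        return "degraded", 200
    return "ok", 200
```

For example, `aggregate({"database": "ok", "bgp": "degraded"})` gives `("degraded", 200)`, and any input whose database check is `down` gives `("down", 503)` regardless of the other checks.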

Reference

The per-check shape (database, BGP, topology, outputs, licensing) and the trimmed ?verbose=false payload are documented in detail in Monitoring: Health. The list of exposed metrics, their types, labels, and PromQL examples lives in Monitoring: Metrics.