Health
Inspect Prelude Collector health — severity levels, what a degraded vs healthy state means, and how to query the health API for alerting.
Prelude Collector includes a built-in health engine that
continuously evaluates the state of every active subscription. The
collector re-runs its checks every 10 seconds, keeps a 7-day rolling
history for trend inspection, and exposes the current state through
the REST API on port 4030.
The engine is the single source of truth for "is collection working right now?" — both the web UI and any external alerting you wire up should read from it rather than guessing from logs.
How it works
For each active subscription, the collector evaluates a panel of checks across four pipeline stages:
| Stage | What it covers |
|---|---|
| collect | Protocol connection and raw data arrival from the device |
| parse | Field parsing, type coercion, and transforms |
| cache | In-memory data storage and freshness |
| output | Channel buffering and delivery to the configured backend |
Each evaluation produces a per-subscription health record. Device
health is the worst severity across its subscriptions, and overall
system health is the worst severity across all devices. When the
collect stage is critical, downstream stage issues are treated as
secondary symptoms of the same root cause.
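The worst-of rollup described above can be sketched in a few lines of Python. This is an illustration, not the collector's implementation: the numeric ranking, the `worst` and `rollup` helpers, and the device and subscription names are all assumptions made for the example, as is treating an empty set as healthy.

```python
# Assumed ordering, least to most severe. The source names the four levels
# but does not publish a numeric rank.
SEVERITY_RANK = {"healthy": 0, "info": 1, "warning": 2, "critical": 3}


def worst(severities):
    """Return the most severe entry; an empty input is assumed to be healthy."""
    return max(severities, key=SEVERITY_RANK.__getitem__, default="healthy")


def rollup(devices):
    """Map {device: {subscription: severity}} to (per-device severity, system severity)."""
    per_device = {name: worst(subs.values()) for name, subs in devices.items()}
    return per_device, worst(per_device.values())


# Hypothetical device and subscription names, for illustration only.
devices = {
    "router-core-01": {"interfaces": "healthy", "bgp": "warning"},
    "router-edge-02": {"interfaces": "info"},
}
per_device, system = rollup(devices)
print(per_device)  # {'router-core-01': 'warning', 'router-edge-02': 'info'}
print(system)      # warning
```

The sketch covers only the aggregation step; it does not model how the engine treats downstream issues as secondary when the collect stage is critical.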
Severity levels
Every check result, every subscription, every device, and the system as a whole carry one of four severities:
| Severity | Meaning |
|---|---|
| healthy | All checks passed. Nothing to do. |
| info | Non-blocking advisory, for example a model field that has no source mapping. Safe to ignore in alerting. |
| warning | Degraded but still functional: stale data, retrying connection, partial parse errors, channel above 85% full. Worth investigating. |
| critical | Collection is broken or data is unreliable: disconnected protocol, authentication failure, channels saturated, all entries failing to parse. Page someone. |
A practical rule of thumb: alert on critical, watch warning on a
dashboard, and let info flow into the UI without bothering anyone.
What the checks look at
The engine groups its checks into six families. You do not need to memorise the individual check names — the API returns them so your alert payloads can include the specific reason — but it helps to know the categories:
- Connection: protocol disconnected, auth failure, retry in progress, connection timeout. Drives the collect stage.
- Data quality: parse errors, empty fields, missing key fields, type coercion mismatches, stale data (no update for several collection intervals).
- Output pipeline: message drops, channels approaching capacity, backend unavailable.
- Performance: high end-to-end latency, low or zero message rate, mismatch between received and exported rates.
- Configuration: unmapped model fields, missing transforms.
- Cross-vendor consistency: for a model deployed across multiple vendors or device OS versions, flags fields whose values diverge between them (e.g. enum encoding differences), which may indicate a mapping or normalisation gap rather than a real-world difference.
Most of these have a warning threshold and a critical threshold.
For example, a channel above 85% capacity is a warning; above 95% is
critical. A subscription with no update for 5× its interval is a
warning; 10× is critical.
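To make the two example thresholds concrete, here is a minimal Python sketch. The function names are hypothetical, and the boundary handling is an assumption: the text says "above" 85% and 95%, so the channel comparison is strict, while exactly 5× or 10× the interval is treated here as already over the limit.

```python
def channel_severity(fill_ratio):
    """Classify channel fill (0.0 to 1.0) against the 85% / 95% thresholds."""
    if fill_ratio > 0.95:
        return "critical"
    if fill_ratio > 0.85:
        return "warning"
    return "healthy"


def staleness_severity(seconds_since_update, interval_seconds):
    """Classify data age against the 5x / 10x collection-interval thresholds."""
    missed_intervals = seconds_since_update / interval_seconds
    # Assumption: reaching exactly 5x or 10x counts as crossing the threshold.
    if missed_intervals >= 10:
        return "critical"
    if missed_intervals >= 5:
        return "warning"
    return "healthy"


print(channel_severity(0.90))       # warning
print(staleness_severity(300, 30))  # critical
```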
Querying health from the API
All health endpoints live under the collector REST API on port
4030, return JSON, and require a Bearer token. See the
API reference for the full endpoint list and request
details. The endpoints break down into five shapes:
- System snapshot: the overall state, with aggregate counts of healthy / info / warning / critical subscriptions and a devices array.
- All devices: just the per-device array from the system snapshot, useful when you want to iterate without the system rollup.
- Single device: one device's record, including every subscription health entry under it.
- Single subscription: drill down to one subscription, with the individual check results.
- History: recent snapshots for trend display. Snapshots are retained for 7 days on a rolling window.
A typical system response looks like this:
{
"timestamp": "2026-04-28T10:30:00Z",
"system-severity": "warning",
"critical-count": 0,
"warning-count": 2,
"info-count": 1,
"healthy-count": 14,
"total-count": 17,
"devices": [
{
"device-id": 1,
"hostname": "router-core-01.example.com",
"severity": "warning",
"subscriptions": [ ]
}
]
}
A subscription-level result inside that tree carries the failing check name, its category, the pipeline stage, the message that the UI displays, and (where applicable) the value and threshold that triggered it.
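As a rough illustration of consuming the system snapshot, the Python sketch below turns the example response above into a one-line status string suitable for a dashboard or log line. The `summarize` helper and its output format are assumptions for the example; only the field names come from the sample response.

```python
import json

# The system snapshot shown above, with the empty subscriptions array kept as is.
snapshot = json.loads("""
{
  "timestamp": "2026-04-28T10:30:00Z",
  "system-severity": "warning",
  "critical-count": 0,
  "warning-count": 2,
  "info-count": 1,
  "healthy-count": 14,
  "total-count": 17,
  "devices": [
    {"device-id": 1, "hostname": "router-core-01.example.com",
     "severity": "warning", "subscriptions": []}
  ]
}
""")


def summarize(snapshot):
    """Build a one-line status summary from the aggregate counts."""
    levels = ("critical", "warning", "info", "healthy")
    counts = ", ".join(f"{snapshot[level + '-count']} {level}" for level in levels)
    return f"{snapshot['system-severity']}: {counts} of {snapshot['total-count']} subscriptions"


print(summarize(snapshot))
# warning: 0 critical, 2 warning, 1 info, 14 healthy of 17 subscriptions
```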
Curl examples
Replace the host and token with your own:
export COLLECTOR=https://collector.example.com:4030
export TOKEN="<your-api-token>"
# System-wide health
curl -s -H "Authorization: Bearer $TOKEN" \
"$COLLECTOR/api/v1/health"
# One device
curl -s -H "Authorization: Bearer $TOKEN" \
"$COLLECTOR/api/v1/health/devices/1"
# Recent history (system-level)
curl -s -H "Authorization: Bearer $TOKEN" \
"$COLLECTOR/api/v1/health/history"
Bruno: 11 Health / System health, 11 Health / Device health, 11 Health / Health history
For local installs, point COLLECTOR at https://collector.example.com
instead.
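If you poll from a script rather than the shell, the same calls can be made with the Python standard library. This is a minimal sketch that mirrors the curl examples above; `health_request` and `fetch_health` are hypothetical helper names, and the 10-second timeout is an arbitrary choice for the example.

```python
import json
import urllib.request


def health_request(base_url, token, path="/api/v1/health"):
    """Build an authenticated GET request for a health endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_health(base_url, token, path="/api/v1/health", timeout=10):
    """Send the request and decode the JSON body."""
    request = health_request(base_url, token, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a reachable collector and a valid token):
# snapshot = fetch_health("https://collector.example.com:4030", "<your-api-token>")
# device = fetch_health("https://collector.example.com:4030", "<your-api-token>",
#                       "/api/v1/health/devices/1")
```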
Healthy vs degraded — how to read it
A single field is enough to drive a binary alert: system-severity
on the system snapshot.
- healthy or info: the collection pipeline is working. No alert.
- warning: something is off but data is still flowing. Page if it persists across multiple evaluation cycles.
- critical: at least one subscription has a broken pipeline. Page immediately.
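One way to turn these rules into a paging decision is a small stateful policy, sketched below in Python. The `AlertPolicy` class is hypothetical, and the default of three consecutive warning evaluations is an assumed value, since the text only says "multiple evaluation cycles".

```python
class AlertPolicy:
    """Page at once on critical, and on warning only after it persists."""

    def __init__(self, warning_cycles=3):
        # Assumed default: the source does not say how many cycles count as persistent.
        self.warning_cycles = warning_cycles
        self.consecutive_warnings = 0

    def should_page(self, system_severity):
        """Feed one system-severity value per evaluation; returns True to page."""
        if system_severity == "critical":
            return True  # the warning streak is left untouched
        if system_severity == "warning":
            self.consecutive_warnings += 1
            return self.consecutive_warnings >= self.warning_cycles
        self.consecutive_warnings = 0  # healthy or info resets the streak
        return False


policy = AlertPolicy(warning_cycles=3)
readings = ["warning", "warning", "warning", "info", "warning", "critical"]
print([policy.should_page(s) for s in readings])
# [False, False, True, False, False, True]
```

Call `should_page` once per poll with the `system-severity` value from the system snapshot.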
If you want a more granular alert (per-device, per-subscription,
per-check), iterate devices[*].subscriptions[*] and filter on
severity == "critical" and issue-count > 0. The results array
on each subscription gives you the check name, the category, and a
human-readable message that you can paste into the alert payload so
the on-call engineer knows what to look at.
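The granular filter described above might look like the following Python sketch. The top-level names (`devices`, `subscriptions`, `severity`, `issue-count`, `results`, `hostname`) come from this page; the keys inside each subscription and result (`subscription-id`, `check`, `category`, `message`) and the sample data are hypothetical, so check the API reference for the real field names.

```python
def critical_findings(snapshot):
    """Yield (hostname, subscription, result) for every critical subscription."""
    for device in snapshot.get("devices", []):
        for sub in device.get("subscriptions", []):
            if sub.get("severity") == "critical" and sub.get("issue-count", 0) > 0:
                for result in sub.get("results", []):
                    yield device.get("hostname"), sub, result


def alert_lines(snapshot):
    """Format one human-readable line per finding for an alert payload."""
    return [
        f"{host} / {sub.get('subscription-id')}: "
        f"[{result.get('category')}] {result.get('check')}: {result.get('message')}"
        for host, sub, result in critical_findings(snapshot)
    ]


# Hypothetical sample data; the nested field names are assumptions.
sample = {
    "system-severity": "critical",
    "devices": [{
        "hostname": "router-core-01.example.com",
        "severity": "critical",
        "subscriptions": [
            {"subscription-id": "interfaces", "severity": "critical", "issue-count": 1,
             "results": [{"check": "protocol-disconnected", "category": "connection",
                          "message": "Protocol session is down"}]},
            {"subscription-id": "bgp", "severity": "warning", "issue-count": 1,
             "results": [{"check": "stale-data", "category": "data-quality",
                          "message": "No update for 5 intervals"}]},
        ],
    }],
}
print(alert_lines(sample))
# ['router-core-01.example.com / interfaces: [connection] protocol-disconnected: Protocol session is down']
```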
See also
- Metrics — the Prometheus endpoint for time-series monitoring.
- Snapshots — inspect the live cached data when health is degraded.
- API reference — full request and response details for the health endpoints.