Health

Inspect Prelude Collector health — severity levels, what a degraded vs healthy state means, and how to query the health API for alerting.

Prelude Collector includes a built-in health engine that continuously evaluates the state of every active subscription. The collector re-runs its checks every 10 seconds, keeps a 7-day rolling history for trend inspection, and exposes the current state through the REST API on port 4030.

The engine is the single source of truth for "is collection working right now?" — both the web UI and any external alerting you wire up should read from it rather than guessing from logs.

How it works

For each active subscription, the collector evaluates a panel of checks across four pipeline stages:

Stage	What it covers
`collect`	Protocol connection and raw data arrival from the device
`parse`	Field parsing, type coercion, and transforms
`cache`	In-memory data storage and freshness
`output`	Channel buffering and delivery to the configured backend

Each evaluation produces a per-subscription health record. Device health is the worst severity across its subscriptions, and overall system health is the worst severity across all devices. When the collect stage is critical, downstream stage issues are treated as secondary symptoms of the same root cause.
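The worst-of rollup described above can be sketched in a few lines. This is an illustrative model only, not the collector's implementation: the function names are hypothetical, and treating an empty list as healthy is an assumption. The severity ordering follows the four levels listed under Severity levels.

```python
# Severity order from least to most severe, matching the four documented levels.
SEVERITY_ORDER = ["healthy", "info", "warning", "critical"]


def worst(severities):
    """Return the most severe value; an empty input counts as healthy (assumption)."""
    return max(severities, key=SEVERITY_ORDER.index, default="healthy")


def device_severity(subscription_severities):
    """Device health is the worst severity across its subscriptions."""
    return worst(subscription_severities)


def system_severity(devices):
    """System health is the worst severity across all devices.

    `devices` maps a device name to the list of its subscription severities.
    """
    return worst(device_severity(subs) for subs in devices.values())


devices = {
    "router-core-01": ["healthy", "warning", "info"],
    "router-edge-02": ["healthy", "healthy"],
}
print(system_severity(devices))
```

With this input, one warning subscription on one device is enough to make the whole system report warning.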

Severity levels

Every check result, every subscription, every device, and the system as a whole carries one of four severities:

Severity	Meaning
`healthy`	All checks passed. Nothing to do.
`info`	Non-blocking advisory — for example, a model field that has no source mapping. Safe to ignore in alerting.
`warning`	Degraded but still functional — stale data, retrying connection, partial parse errors, channel above 85% full. Worth investigating.
`critical`	Collection is broken or data is unreliable — disconnected protocol, authentication failure, channels saturated, all entries failing to parse. Page someone.

A practical rule of thumb: alert on critical, watch warning on a dashboard, and let info flow into the UI without bothering anyone.

What the checks look at

The engine groups its checks into six families. You do not need to memorise the individual check names — the API returns them so your alert payloads can include the specific reason — but it helps to know the categories:

Connection — protocol disconnected, auth failure, retry in progress, connection timeout. Drives the collect stage.
Data quality — parse errors, empty fields, missing key fields, type coercion mismatches, stale data (no update for several collection intervals).
Output pipeline — message drops, channels approaching capacity, backend unavailable.
Performance — high end-to-end latency, low or zero message rate, mismatch between received and exported rates.
Configuration — unmapped model fields, missing transforms.
Cross-vendor consistency — for a model deployed on multiple device OS versions, flag fields whose values diverge across vendors (e.g. enum encoding differences) that may indicate a mapping or normalisation gap rather than a real-world difference.

Most of these have a warning threshold and a critical threshold. For example, a channel above 85% capacity is a warning; above 95% is critical. A subscription with no update for 5× its interval is a warning; 10× is critical.
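The two example thresholds can be expressed as small classifier functions. This is a sketch under assumptions: the function names are hypothetical, the channel comparison is strict ("above"), and the staleness comparison treats reaching 5× or 10× the interval as enough to trip the threshold. The collector's exact boundary handling is not documented here.

```python
def channel_severity(fill_ratio):
    """Classify channel fill level, given as a fraction between 0.0 and 1.0."""
    if fill_ratio > 0.95:
        return "critical"
    if fill_ratio > 0.85:
        return "warning"
    return "healthy"


def staleness_severity(seconds_since_update, interval_seconds):
    """Classify staleness by how many collection intervals have passed without an update."""
    missed_intervals = seconds_since_update / interval_seconds
    if missed_intervals >= 10:
        return "critical"
    if missed_intervals >= 5:
        return "warning"
    return "healthy"


print(channel_severity(0.90))        # between the two thresholds
print(staleness_severity(120, 10))   # 12 intervals without an update
```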

Startup grace period

A subscription that has just been created — or every subscription right after a collector restart — has not had time to collect its first sample yet. To avoid a burst of false alarms, the collector gives each subscription a short startup grace period before the "no data" check can flag it. During that window an empty response is reported as a benign info state ("warming up") rather than critical.

The window is max(30s, 2× the collection interval), so subscriptions with a longer interval get proportionally more time. Once the window passes, a subscription that is still returning no data is flagged critical as usual. Startup grace only softens the "no data yet" case; a connection failure or authentication error is reported immediately regardless.
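As a worked example of the grace window formula, the sketch below computes the window and applies it to the "no data" case only. The function names are hypothetical, and measuring elapsed time from subscription start is an assumption about how the collector tracks it.

```python
def grace_window_seconds(interval_seconds):
    """Startup grace window: max(30s, 2x the collection interval)."""
    return max(30.0, 2.0 * interval_seconds)


def no_data_severity(seconds_since_start, interval_seconds):
    """Severity for a subscription that has returned no data so far.

    Inside the grace window the state is a benign info ("warming up");
    after it, the subscription is flagged critical.
    """
    if seconds_since_start < grace_window_seconds(interval_seconds):
        return "info"
    return "critical"


for interval in (10, 60):
    print(interval, grace_window_seconds(interval))
```

A 10-second interval gets the 30-second floor, while a 60-second interval gets 120 seconds. Connection and authentication failures are outside this function because they are reported immediately.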

Per-protocol status rollup

A subscription can collect over more than one protocol at once (for example gNMI plus CLI against the same model). Each protocol tracks its own run state — running, stopped, or errored — and the collector rolls those up into a single subscription status so you read one value instead of reconciling several.

The rollup uses simple precedence: error wins over running, which wins over stopped. If any protocol has errored the subscription shows as errored (with that protocol's message); if none have errored but at least one is running, it shows as running; otherwise it is stopped. The subscriptions list shows this single rolled-up status — the older per-subscription Received and Output Rate columns were removed, since the health severity and metrics already cover data flow in more detail.
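The precedence rule is small enough to sketch directly. The data shape here (a mapping of protocol name to a state and message pair) and the state strings are assumptions for illustration; the collector's internal representation may differ.

```python
def rollup_status(protocols):
    """Roll per-protocol run states up into one subscription status.

    `protocols` maps a protocol name to a (state, message) tuple, where state
    is "running", "stopped", or "errored". Precedence: errored > running > stopped.
    """
    # Any errored protocol wins, and its message is surfaced.
    for name, (state, message) in protocols.items():
        if state == "errored":
            return "errored", f"{name}: {message}"
    # No errors: running wins over stopped.
    if any(state == "running" for state, _ in protocols.values()):
        return "running", None
    return "stopped", None


print(rollup_status({"gnmi": ("running", None), "cli": ("errored", "auth failed")}))
print(rollup_status({"gnmi": ("running", None), "cli": ("stopped", None)}))
```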

Querying health from the API

All health endpoints live under the collector REST API on port 4030, return JSON, and require a Bearer token. See the API reference for the full endpoint list and request details. The endpoints break down into five shapes:

System snapshot — the overall state, with aggregate counts of healthy / info / warning / critical subscriptions and a devices array.
All devices — just the per-device array from the system snapshot, useful when you want to iterate without the system rollup.
Single device — one device's record, including every subscription health entry under it.
Single subscription — drill down to one subscription, with the individual check results.
History — recent snapshots for trend display. Snapshots are retained for 7 days on a rolling window.

A typical system response looks like this:

{
  "timestamp": "2026-04-28T10:30:00Z",
  "system-severity": "warning",
  "critical-count": 0,
  "warning-count": 2,
  "info-count": 1,
  "healthy-count": 14,
  "total-count": 17,
  "devices": [
    {
      "device-id": 1,
      "hostname": "router-core-01.example.com",
      "severity": "warning",
      "subscriptions": [ ]
    }
  ]
}

A subscription-level result inside that tree carries the failing check name, its category, the pipeline stage, the message that the UI displays, and (where applicable) the value and threshold that triggered it.
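To show how a client might read the system snapshot, the sketch below parses the sample response above and pulls out the severity and counts. The `summarize` helper is hypothetical, and the consistency check assumes the four counts always add up to `total-count`, which holds for the sample but is not stated as a guarantee.

```python
import json

# The sample system response from this page, embedded for a self-contained run.
SAMPLE = """
{
  "timestamp": "2026-04-28T10:30:00Z",
  "system-severity": "warning",
  "critical-count": 0,
  "warning-count": 2,
  "info-count": 1,
  "healthy-count": 14,
  "total-count": 17,
  "devices": [
    {
      "device-id": 1,
      "hostname": "router-core-01.example.com",
      "severity": "warning",
      "subscriptions": []
    }
  ]
}
"""

LEVELS = ("critical", "warning", "info", "healthy")


def summarize(snapshot):
    """Return (system severity, per-level counts, whether counts sum to the total)."""
    counts = {level: snapshot[f"{level}-count"] for level in LEVELS}
    consistent = sum(counts.values()) == snapshot["total-count"]
    return snapshot["system-severity"], counts, consistent


severity, counts, consistent = summarize(json.loads(SAMPLE))
print(severity, counts, consistent)
```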

Curl examples

Replace the host and token with your own:

export COLLECTOR=https://collector.example.com:4030
export TOKEN="<your-api-token>"

# System-wide health
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health"

# One device
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health/devices/1"

# Recent history (system-level)
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health/history"

The equivalent requests in the Bruno collection are 11 Health / System health, 11 Health / Device health, and 11 Health / Health history.

For local installs, point COLLECTOR at https://collector.example.com instead.
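For scripted polling, the same calls can be made from Python's standard library. This is a minimal sketch: the helper names are hypothetical, the paths are the ones used in the curl examples, and error handling, retries, and TLS settings are left out.

```python
import json
import urllib.request


def build_health_request(base_url, token, path="/api/v1/health"):
    """Build a GET request for a health endpoint with the Bearer token attached."""
    url = base_url.rstrip("/") + path
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def fetch_health(base_url, token, path="/api/v1/health", timeout=10):
    """Send the request and decode the JSON body. Raises on HTTP or network errors."""
    request = build_health_request(base_url, token, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a reachable collector and a valid token):
#   snapshot = fetch_health("https://collector.example.com:4030", "<your-api-token>")
#   device = fetch_health("https://collector.example.com:4030", "<your-api-token>",
#                         "/api/v1/health/devices/1")
```

Splitting request construction from the network call keeps the URL and header logic easy to check without a live collector.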

Healthy vs degraded — how to read it

A single field is enough to drive a binary alert: system-severity on the system snapshot.

healthy or info — the collection pipeline is working. No alert.
warning — something is off but data is still flowing. Page if it persists across multiple evaluation cycles.
critical — at least one subscription has a broken pipeline. Page immediately.
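The three rules above translate into a small decision function. This is a sketch with assumptions: the function name and return values are hypothetical, and "persists across multiple evaluation cycles" is modelled as three consecutive warning readings, a number you would tune for your own environment.

```python
def alert_action(recent_severities, persist_cycles=3):
    """Decide what to do from recent system-severity readings, oldest first.

    Returns "page", "watch", or "none". `persist_cycles` is an assumed
    threshold for how long a warning must last before paging.
    """
    if not recent_severities:
        return "none"
    current = recent_severities[-1]
    if current == "critical":
        return "page"
    if current == "warning":
        tail = recent_severities[-persist_cycles:]
        if len(tail) == persist_cycles and all(s == "warning" for s in tail):
            return "page"
        return "watch"
    # healthy or info: the pipeline is working.
    return "none"


print(alert_action(["healthy", "warning"]))
print(alert_action(["warning", "warning", "warning"]))
```

With checks re-run every 10 seconds, three consecutive readings correspond to roughly 30 seconds of sustained warning if you poll once per cycle.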

If you want a more granular alert (per-device, per-subscription, per-check), iterate devices[*].subscriptions[*] and filter on severity == "critical" and issue-count > 0. The results array on each subscription gives you the check name, the category, and a human-readable message that you can paste into the alert payload so the on-call engineer knows what to look at.
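A granular filter along those lines might look like the sketch below. The `severity`, `issue-count`, `results`, `devices`, `subscriptions`, and `hostname` fields come from this page; the keys inside each result (`check`, `category`, `message`) are assumed names, so check them against the API reference before relying on them.

```python
def critical_findings(snapshot):
    """Collect one alert-ready entry per check result on critical subscriptions."""
    findings = []
    for device in snapshot.get("devices", []):
        for sub in device.get("subscriptions", []):
            if sub.get("severity") != "critical" or sub.get("issue-count", 0) <= 0:
                continue
            for result in sub.get("results", []):
                findings.append({
                    "hostname": device.get("hostname"),
                    "check": result.get("check"),        # assumed key name
                    "category": result.get("category"),  # assumed key name
                    "message": result.get("message"),    # assumed key name
                })
    return findings


# Hand-written sample data for illustration, not real collector output.
snapshot = {
    "devices": [
        {
            "hostname": "router-core-01.example.com",
            "subscriptions": [
                {"severity": "healthy", "issue-count": 0, "results": []},
                {
                    "severity": "critical",
                    "issue-count": 1,
                    "results": [
                        {"check": "auth-failure", "category": "connection",
                         "message": "Authentication failed"},
                    ],
                },
            ],
        }
    ]
}

for finding in critical_findings(snapshot):
    print(f"{finding['hostname']}: [{finding['category']}] {finding['message']}")
```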