Documentation Prelude Collector 1.0.0

Health

Inspect Prelude Collector health — severity levels, what a degraded vs healthy state means, and how to query the health API for alerting.

Prelude Collector includes a built-in health engine that continuously evaluates the state of every active subscription. The collector re-runs its checks every 10 seconds, keeps a 7-day rolling history for trend inspection, and exposes the current state through the REST API on port 4030.

The engine is the single source of truth for "is collection working right now?" — both the web UI and any external alerting you wire up should read from it rather than guessing from logs.

How it works

For each active subscription, the collector evaluates a panel of checks across four pipeline stages:

Stage    What it covers
collect  Protocol connection and raw data arrival from the device
parse    Field parsing, type coercion, and transforms
cache    In-memory data storage and freshness
output   Channel buffering and delivery to the configured backend

Each evaluation produces a per-subscription health record. Device health is the worst severity across its subscriptions, and overall system health is the worst severity across all devices. When the collect stage is critical, downstream stage issues are treated as secondary symptoms of the same root cause.
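
The rollup rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the collector's implementation: the function names and the `SEVERITY_ORDER` list are hypothetical, and ranking `info` between `healthy` and `warning` is inferred from the severity descriptions on this page.

```python
# Severity names as documented on this page, ordered best to worst.
# The exact ordering is an assumption inferred from their descriptions.
SEVERITY_ORDER = ["healthy", "info", "warning", "critical"]


def worst(severities):
    """Return the worst severity in an iterable.

    An empty input is treated as "healthy" (an assumption).
    """
    return max(severities, key=SEVERITY_ORDER.index, default="healthy")


def device_severity(subscription_severities):
    """Device health is the worst severity across its subscriptions."""
    return worst(subscription_severities)


def system_severity(devices):
    """System health is the worst severity across all devices.

    devices: mapping of device name -> list of subscription severities.
    """
    return worst(device_severity(subs) for subs in devices.values())


devices = {
    "router-core-01": ["healthy", "warning", "info"],
    "router-edge-02": ["healthy", "healthy"],
}
print(system_severity(devices))  # prints "warning"
```

One warning subscription on one device is enough to turn the whole system snapshot to warning, which is why the per-severity counts in the API response are worth reading alongside the rollup.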

Severity levels

Every check result, every subscription, every device, and the system as a whole carries one of four severities:

Severity  Meaning
healthy   All checks passed. Nothing to do.
info      Non-blocking advisory, for example a model field that has no source mapping. Safe to ignore in alerting.
warning   Degraded but still functional: stale data, retrying connection, partial parse errors, channel above 85% full. Worth investigating.
critical  Collection is broken or data is unreliable: disconnected protocol, authentication failure, channels saturated, all entries failing to parse. Page someone.

A practical rule of thumb: alert on critical, watch warning on a dashboard, and let info flow into the UI without bothering anyone.

What the checks look at

The engine groups its checks into six families. You do not need to memorise the individual check names — the API returns them so your alert payloads can include the specific reason — but it helps to know the categories:

  • Connection — protocol disconnected, auth failure, retry in progress, connection timeout. Drives the collect stage.
  • Data quality — parse errors, empty fields, missing key fields, type coercion mismatches, stale data (no update for several collection intervals).
  • Output pipeline — message drops, channels approaching capacity, backend unavailable.
  • Performance — high end-to-end latency, low or zero message rate, mismatch between received and exported rates.
  • Configuration — unmapped model fields, missing transforms.
  • Cross-vendor consistency — for a model deployed on multiple device OS versions, flag fields whose values diverge across vendors (e.g. enum encoding differences) that may indicate a mapping or normalisation gap rather than a real-world difference.

Most of these have a warning threshold and a critical threshold. For example, a channel above 85% capacity is a warning; above 95% is critical. A subscription with no update for 5× its interval is a warning; 10× is critical.
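
The two example thresholds can be written as small classifiers. This is a sketch, assuming the thresholds behave as simple cut-offs. The function names are hypothetical, and whether the collector treats the boundary values as inclusive or exclusive is not stated here, so the comparisons below are a guess.

```python
def channel_severity(fill_ratio, warn=0.85, crit=0.95):
    """Classify a channel fill level given as a fraction between 0.0 and 1.0.

    "Above 85%" and "above 95%" are read as strict comparisons (assumption).
    """
    if fill_ratio > crit:
        return "critical"
    if fill_ratio > warn:
        return "warning"
    return "healthy"


def staleness_severity(seconds_since_update, interval_seconds,
                       warn_factor=5, crit_factor=10):
    """Classify data age relative to the subscription's collection interval.

    Reaching exactly 5x or 10x the interval counts as crossing (assumption).
    """
    if seconds_since_update >= crit_factor * interval_seconds:
        return "critical"
    if seconds_since_update >= warn_factor * interval_seconds:
        return "warning"
    return "healthy"


# A 30-second subscription that has been silent for 3 minutes.
print(staleness_severity(180, 30))  # prints "warning"
```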

Querying health from the API

All health endpoints live under the collector REST API on port 4030, return JSON, and require a Bearer token. See the API reference for the full endpoint list and request details. The endpoints break down into five shapes:

  • System snapshot — the overall state, with aggregate counts of healthy / info / warning / critical subscriptions and a devices array.
  • All devices — just the per-device array from the system snapshot, useful when you want to iterate without the system rollup.
  • Single device — one device's record, including every subscription health entry under it.
  • Single subscription — drill down to one subscription, with the individual check results.
  • History — recent snapshots for trend display. Snapshots are retained for 7 days on a rolling window.

A typical system response looks like this:

{
  "timestamp": "2026-04-28T10:30:00Z",
  "system-severity": "warning",
  "critical-count": 0,
  "warning-count": 2,
  "info-count": 1,
  "healthy-count": 14,
  "total-count": 17,
  "devices": [
    {
      "device-id": 1,
      "hostname": "router-core-01.example.com",
      "severity": "warning",
      "subscriptions": [ ]
    }
  ]
}

A subscription-level result inside that tree carries the failing check name, its category, the pipeline stage, the message that the UI displays, and (where applicable) the value and threshold that triggered it.

Curl examples

Replace the host and token with your own:

export COLLECTOR=https://collector.example.com:4030
export TOKEN="<your-api-token>"

# System-wide health
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health"

# One device
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health/devices/1"

# Recent history (system-level)
curl -s -H "Authorization: Bearer $TOKEN" \
  "$COLLECTOR/api/v1/health/history"

Bruno: 11 Health / System health, 11 Health / Device health, 11 Health / Health history

For local installs, point COLLECTOR at https://collector.example.com instead.
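
The same system-wide call can be made from a script. The following is a minimal Python sketch using only the standard library. The `/api/v1/health` path and the Bearer header come from the curl examples, while the function names, the `Accept` header, and the 10-second timeout are assumptions made for illustration.

```python
import json
import urllib.request


def build_health_request(base_url, token, path="/api/v1/health"):
    """Build an authenticated GET request for a health endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",  # assumption: harmless, may be optional
        },
    )


def fetch_health(base_url, token, path="/api/v1/health", timeout=10):
    """Fetch a health endpoint and return the decoded JSON body."""
    request = build_health_request(base_url, token, path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (needs a reachable collector and a valid token):
#   import os
#   snapshot = fetch_health(os.environ["COLLECTOR"], os.environ["TOKEN"])
#   print(snapshot["system-severity"])
```

`urlopen` raises `urllib.error.HTTPError` on a 4xx or 5xx response, so a monitoring script would normally treat that exception as an alert condition of its own rather than as a healthy result.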

Healthy vs degraded — how to read it

A single field is enough to drive a binary alert: system-severity on the system snapshot.

  • healthy or info — the collection pipeline is working. No alert.
  • warning — something is off but data is still flowing. Page if it persists across multiple evaluation cycles.
  • critical — at least one subscription has a broken pipeline. Page immediately.
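
As a sketch, the three cases above map onto a small decision function. The name `alert_action`, the return values, and the default of three consecutive cycles are assumptions. This page only says "multiple evaluation cycles", so choose a count that suits your on-call policy.

```python
def alert_action(recent_severities, persist_cycles=3):
    """Decide what to do from recent system-severity values.

    recent_severities: one value per evaluation cycle, oldest first.
    Returns "page", "watch", or "none".
    """
    if not recent_severities:
        return "none"
    current = recent_severities[-1]
    if current == "critical":
        return "page"
    if current == "warning":
        tail = recent_severities[-persist_cycles:]
        # Page only when every one of the last N cycles was degraded.
        persisted = len(tail) == persist_cycles and all(
            s in ("warning", "critical") for s in tail
        )
        return "page" if persisted else "watch"
    return "none"  # healthy or info


print(alert_action(["healthy", "warning"]))             # prints "watch"
print(alert_action(["warning", "warning", "warning"]))  # prints "page"
```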

If you want a more granular alert (per-device, per-subscription, per-check), iterate devices[*].subscriptions[*] and filter on severity == "critical" and issue-count > 0. The results array on each subscription gives you the check name, the category, and a human-readable message that you can paste into the alert payload so the on-call engineer knows what to look at.
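
A sketch of that iteration over an already-decoded system snapshot follows. The keys `devices`, `subscriptions`, `severity`, `issue-count`, `results`, and `hostname` come from this page. The subscription key `name` and the result keys `check`, `category`, and `message` are hypothetical stand-ins, as are the sample check names, so confirm the real field names in the API reference before relying on them.

```python
def critical_findings(snapshot):
    """Return one alert line per check result on each critical subscription."""
    findings = []
    for device in snapshot.get("devices", []):
        for sub in device.get("subscriptions", []):
            if sub.get("severity") != "critical" or sub.get("issue-count", 0) <= 0:
                continue
            for result in sub.get("results", []):
                findings.append(
                    f'{device.get("hostname", "?")} / {sub.get("name", "?")}: '
                    f'[{result.get("category", "?")}] {result.get("check", "?")}: '
                    f'{result.get("message", "")}'
                )
    return findings


# Made-up sample data shaped like the system response shown earlier.
snapshot = {
    "system-severity": "critical",
    "devices": [{
        "device-id": 1,
        "hostname": "router-core-01.example.com",
        "severity": "critical",
        "subscriptions": [
            {"name": "interfaces", "severity": "critical", "issue-count": 1,
             "results": [{"check": "protocol-disconnected",
                          "category": "connection",
                          "message": "Protocol session is down"}]},
            {"name": "bgp", "severity": "warning", "issue-count": 1,
             "results": [{"check": "stale-data",
                          "category": "data-quality",
                          "message": "No update for 5 intervals"}]},
        ],
    }],
}

for line in critical_findings(snapshot):
    print(line)
```

Only the critical subscription produces a line. The warning subscription is skipped, matching the rule of paging on critical and watching warning on a dashboard.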

See also

  • Metrics — the Prometheus endpoint for time-series monitoring.
  • Snapshots — inspect the live cached data when health is degraded.
  • API reference — full request and response details for the health endpoints.