Output selection

How to choose between NATS, Prometheus, InfluxDB, Kafka, webhook, file, and TimescaleDB outputs — with decision criteria and recommended defaults.

Recommendation

Start with NATS as the canonical bus and Prometheus for operator-facing dashboards. Add InfluxDB, TimescaleDB, or Kafka when you have a specific reason. Use webhook for integration glue and file for archive or replay, not for production queries.

Why this matters

The collector publishes every Snapshot to every enabled Output in parallel. There is no per-subscription routing and no per-Output queue, retry, or DLQ — the manager fans each batch out, waits for every Output to return, increments a per-backend success/failure counter, and moves on. A failed delivery is dropped.
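The delivery semantics described above can be sketched in a few lines. This is an illustration only, not the collector's implementation: the `fan_out` function, the callable-based Output type, and the counter layout are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Batch = List[dict]
Output = Callable[[Batch], None]  # hypothetical: raises on delivery failure


def fan_out(batch: Batch, outputs: Dict[str, Output], stats: Dict[str, Counter]) -> None:
    """Send one batch to every output in parallel, wait for all, count results."""
    if not outputs:
        return
    with ThreadPoolExecutor(max_workers=len(outputs)) as pool:
        futures = {name: pool.submit(send, batch) for name, send in outputs.items()}
    # Leaving the `with` block waits for every output to return,
    # so the slowest output sets the latency of the whole batch.
    for name, future in futures.items():
        counter = stats.setdefault(name, Counter())
        if future.exception() is None:
            counter["success"] += 1
        else:
            # No retry, no queue, no DLQ: the batch is dropped for this output.
            counter["failure"] += 1


def ok_output(batch: Batch) -> None:
    pass  # stands in for a healthy backend


def flaky_output(batch: Batch) -> None:
    raise ConnectionError("receiver unavailable")


stats: Dict[str, Counter] = {}
outputs = {"nats": ok_output, "webhook": flaky_output}
fan_out([{"device": "r1"}], outputs, stats)
fan_out([{"device": "r2"}], outputs, stats)
print({name: dict(c) for name, c in stats.items()})
```

In this sketch the failing output never blocks the healthy one, but nothing it missed is ever resent, which is the property that makes receiver reliability your problem rather than the collector's.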

That makes "pick the right one" more important than it sounds: each Output has its own query model, retention story, and operational cost, and they pull your downstream architecture in different directions. Pick badly and you end up paying twice — once to operate the Output you do not really need, and once to migrate off it later. Pick a flaky receiver and you lose data without a safety net.

Decision criteria

Score each Output against the four questions that actually matter:

Query model — How will downstream consumers read the data? PromQL, Flux/SQL, raw events, or "shipped to another team's system?"
Retention — How long do you need the data, and who pays for storing it?
Scale — How many Snapshots per second sustained? How bursty?
Ops cost — Who runs the backend, and how much do they want to run another one?

At a glance

Output	Best for	Query model	Retention story	Ops cost
NATS	Fan-out, integration with the rest of your platform	Subject subscribe	None (bus, not store)	Low
Prometheus	Operator dashboards, alerting on rates and percentiles	PromQL	Days to weeks (TSDB)	Medium
InfluxDB	Long-horizon time series, ad-hoc Flux/SQL	Flux / InfluxQL	Months to years	Medium
TimescaleDB	Long-horizon time series for SQL/PostgreSQL shops; SQL analytics, joins against inventory, Grafana	SQL (PostgreSQL)	Months to years (automatic compression and retention)	Medium
Kafka	High-throughput streaming pipelines, multi-team consumers	Consumer groups	Configurable, days+	High
Webhook	Glue: ticketing, paging, custom apps	None (push events)	Wherever the receiver puts it	Low
File	Archive, replay, offline analysis, audit	None (read on disk)	As long as you keep the volume	Very low

How to choose

NATS — the default bus

NATS is the recommended bus for streaming Snapshots: configure the NATS Output (PUT /api/v1/outputs/nats) and every record lands on collector.data.{model-name}.{device-id}. Treat it as the canonical fan-out point, so that every other consumer (your own services, dashboards, automation) hangs off NATS rather than off the collector directly. That keeps the collector decoupled from changes in your downstream stack.
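To see how consumers can slice that subject hierarchy, here is a minimal sketch of the subject layout and of standard NATS wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens). The helper names and the model name `interfaces` and device ID `router-01` are hypothetical, and the sketch assumes neither value contains a dot.

```python
def subject_for(model_name: str, device_id: str) -> str:
    """Build the subject a record is published on."""
    return f"collector.data.{model_name}.{device_id}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a NATS subscription pattern would receive the subject."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


subject = subject_for("interfaces", "router-01")
print(subject)
for pattern in (
    "collector.data.>",            # everything the collector publishes
    "collector.data.interfaces.*", # one model, all devices
    "collector.data.*.router-01",  # all models, one device
    "collector.data.*",            # too short: matches nothing at this depth
):
    print(pattern, subject_matches(pattern, subject))
```

A consumer that cares about one model subscribes to the per-model pattern and ignores the rest, which is what lets many consumers share the bus without any routing configuration on the collector.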

The NATS Output uses a single connection, which carries both data export and the collector's internal signaling (ICMP reachability, OneBoard device-sync). Configure it once in the web UI under Output Settings → NATS; it is not enabled out of the box.

Caveat: NATS is a bus, not a store. The collector publishes to core NATS subjects (no JetStream stream config); replay is whatever the NATS server itself retains. Without server-side JetStream retention, missed messages are gone. Do not treat NATS as your historical archive.

Prometheus — operator dashboards

The collector exposes metrics on port 9090 at /metrics/collector by default. Metric names follow collector_{model}_{field} (no prelude_ prefix). For interface counters, queue depths, and similar high-cardinality operational data, scraping into a Prometheus you already run is the shortest path to a useful Grafana panel. PromQL is the right tool for "what is the 95th percentile of egress error rate over the last hour?"
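As a rough illustration of the naming scheme and of the percentile question, the sketch below composes a metric name and a PromQL expression. The helper names, the model `interfaces`, and the field `out-errors` are hypothetical, and the character replacement is an assumption: the collector's actual sanitization rules are not documented here.

```python
import re


def metric_name(model: str, field: str) -> str:
    """Compose collector_{model}_{field}; the character replacement is an assumption."""
    raw = f"collector_{model}_{field}"
    # Prometheus metric names may contain only [a-zA-Z0-9_:]; colons are
    # conventionally reserved for recording rules, so they are replaced too.
    return re.sub(r"[^a-zA-Z0-9_]", "_", raw)


def p95_rate_query(metric: str, window: str = "1h", rate_range: str = "5m") -> str:
    """95th percentile of a counter's per-second rate over a window, via a subquery."""
    return f"quantile_over_time(0.95, rate({metric}[{rate_range}])[{window}:])"


name = metric_name("interfaces", "out-errors")
query = p95_rate_query(name)
print(name)
print(query)
```

The query treats the metric as a counter, takes its 5-minute rate, and then asks for the 95th percentile of that rate across the last hour.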

Caveat: Prometheus retention is intentionally short and TSDB cardinality is the thing you will hit first. Plan retention and label hygiene up front, not when the alerts about Prometheus itself start firing.

InfluxDB — long-horizon time series

Reach for InfluxDB when you need months or years of history that Prometheus is not built to keep, or when your team prefers Flux/SQL. It is also the easier path when you want to mix high-frequency counter data with annotation events in the same query.

The InfluxDB Output uses the v2 client's non-blocking write API and batches points internally. Tune batch-size (records per flush) and flush-interval (max wait in ms) on the backend config — leave both unset to use the client defaults. Async write errors are logged but don't increment the backend's per-batch failures counter, so treat the collector log channel as the primary signal for InfluxDB trouble.
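To make the interaction between batch-size and flush-interval concrete, here is a simplified model of a size-or-age batcher. It is not the InfluxDB client's code: the `PointBatcher` class is hypothetical, and a real client flushes from a background timer rather than from an explicit `tick()` call.

```python
import time
from typing import Callable, List, Optional


class PointBatcher:
    """Flush when batch_size points are buffered or the oldest point is too old."""

    def __init__(self, flush: Callable[[List[dict]], None], batch_size: int,
                 flush_interval_ms: int, clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._batch_size = batch_size
        self._interval_s = flush_interval_ms / 1000.0
        self._clock = clock
        self._buf: List[dict] = []
        self._oldest: Optional[float] = None

    def add(self, point: dict) -> None:
        if not self._buf:
            self._oldest = self._clock()  # age is measured from the first buffered point
        self._buf.append(point)
        self.tick()

    def tick(self) -> None:
        """Check both flush conditions; a real client would call this from a timer."""
        if not self._buf:
            return
        full = len(self._buf) >= self._batch_size
        stale = self._clock() - self._oldest >= self._interval_s
        if full or stale:
            batch, self._buf = self._buf, []
            self._flush(batch)


now = [0.0]          # fake clock so the example is deterministic
flushed: List[List[dict]] = []
batcher = PointBatcher(flushed.append, batch_size=3, flush_interval_ms=500,
                       clock=lambda: now[0])
for v in (1, 2, 3):
    batcher.add({"v": v})   # third point fills the batch and triggers a flush
batcher.add({"v": 4})       # sits in the buffer
now[0] = 0.6                # 600 ms later the interval forces it out
batcher.tick()
print(flushed)
```

A larger batch-size means fewer writes but more points at risk if the process dies; a shorter flush-interval bounds how stale the newest stored point can be.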

Caveat: it is another database to run. If you already run Prometheus and "long horizon" means 90 days, scaling Prometheus storage is usually cheaper than adding InfluxDB.

TimescaleDB — long-horizon time series in PostgreSQL

Reach for TimescaleDB when your team already runs PostgreSQL and prefers SQL over Flux or PromQL. The backend writes each numeric field as a narrow hypertable row and provisions compression, retention, and an hourly rollup automatically, so long-horizon storage stays manageable without a separate TSDB to learn. It also runs against plain PostgreSQL, minus the automatic policies.

Caveat: it stores numeric metrics only (non-numeric fields are dropped), and like InfluxDB it is another database to operate. If you already run Prometheus and only need 90 days, scaling Prometheus is usually cheaper.

Kafka — multi-team streaming

Kafka makes sense when more than one downstream team consumes the data, when you need durable replay, or when you are feeding a stream processor (Flink, Spark, your own consumers) that already speaks Kafka. Treat it as platform infrastructure, not as a collector detail.

Caveat: Kafka is the most expensive Output to operate. Do not add it just because you might need it later — add it when the second consumer team shows up.

Webhook — integration glue

Webhook is the right tool for "when this Snapshot looks like X, post to that ticketing system." It is push-based, fire-and-forget, and trivially testable. Pair it with upstream filters so you do not POST every Snapshot to a bug tracker.

Two delivery modes are available via the batch-mode config flag: individual mode (one HTTP request per record — easy mental model, easy receivers) and batch mode (one request per collection cycle, body is a JSON array — fewer round-trips, but a single failed request counts every record in the batch as failed). Pick batch mode for high-throughput receivers; individual for ticket-style "one event per request" integrations.
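The difference in failure accounting between the two modes is easiest to see side by side. The sketch below is an assumption-laden model, not the collector's code: `deliver` and the `post` callable (which stands in for one HTTP request and returns True on success) are hypothetical.

```python
import json
from typing import Callable, List, Tuple

Post = Callable[[str], bool]  # hypothetical: sends one HTTP body, True on success


def deliver(records: List[dict], post: Post, batch_mode: bool) -> Tuple[int, int]:
    """Return (succeeded, failed) record counts for one collection cycle."""
    if not records:
        return (0, 0)
    if batch_mode:
        # One request whose body is a JSON array; it succeeds or fails as a unit.
        ok = post(json.dumps(records))
        return (len(records), 0) if ok else (0, len(records))
    # Individual mode: one request per record, each counted on its own.
    succeeded = sum(1 for record in records if post(json.dumps(record)))
    return (succeeded, len(records) - succeeded)


requests: List[str] = []


def fake_post(body: str) -> bool:
    """Test double: records the request and rejects any body containing 'bad'."""
    requests.append(body)
    return "bad" not in body


records = [{"id": 1}, {"id": "bad"}, {"id": 3}]
individual = deliver(records, fake_post, batch_mode=False)
individual_requests = len(requests)
batch = deliver(records, fake_post, batch_mode=True)
batch_requests = len(requests) - individual_requests
print(individual, individual_requests)
print(batch, batch_requests)
```

With one bad record out of three, individual mode loses one record across three requests, while batch mode loses all three in a single request.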

Caveat: webhook receivers vary wildly in throughput. The collector fans out to every Output in parallel per batch, but there is no per-Output queue, retry, or DLQ — a slow or failing webhook doesn't back up an internal queue, it simply drops records (and adds latency to that batch's wait, since the manager waits for all Outputs to return). Track failures on /api/v1/outputs/metrics and put a buffer in front of the receiver itself when the receiver can't keep up.

File — archive and replay

File output is the cheapest possible long-term archive: write Snapshots to a mounted volume and let your existing backup tooling handle the rest. It is also the easiest way to replay a known good sequence into a test pipeline.

Caveat: there is no query language. Files are inputs to other systems, not a system of record by themselves.

TimescaleDB — SQL analytics

Reach for TimescaleDB when you want collected telemetry in SQL with storage you control: ad-hoc queries, joins against your own inventory tables on device_id, and Grafana via the PostgreSQL datasource. It overlaps with InfluxDB — pick TimescaleDB if your team already runs PostgreSQL or prefers SQL over Flux. The backend self-provisions the hypertable, an analytics index, compression and retention policies, and an hourly rollup, and falls back to a plain PostgreSQL table when the timescaledb extension is absent.

Like InfluxDB, writes are async and batched, so the per-batch failures counter stays zero — watch the collector log channel for TimescaleDB warnings. Only numeric fields are stored (one narrow row per metric); string and boolean values are dropped.
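The "one narrow row per metric" shape can be sketched as follows. The function name, the tuple layout, and the sample field names are assumptions for illustration; only the rule itself (numeric fields kept, string and boolean values dropped) comes from the description above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple

Row = Tuple[datetime, str, str, float]  # (time, device_id, metric, value)


def to_narrow_rows(ts: datetime, device_id: str, fields: Dict[str, Any]) -> List[Row]:
    """Expand one record into one row per numeric field."""
    rows: List[Row] = []
    for name, value in fields.items():
        # bool is a subclass of int in Python, so it must be excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        rows.append((ts, device_id, name, float(value)))
    return rows


ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = to_narrow_rows(ts, "router-01", {
    "in_octets": 10,        # kept
    "oper_status": "up",    # dropped: string
    "admin_up": True,       # dropped: boolean
    "utilization": 0.5,     # kept
})
for row in rows:
    print(row)
```

If a dashboard needs a state such as operational status, encode it as a number upstream or send it through an Output that keeps non-numeric fields.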

Caveat: another database to run, and there is no on-disk buffer — a wedged database eventually drops the oldest buffered rows. Keep NATS or Kafka as the durable path if you need replay.
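The drop-oldest behavior of an in-memory buffer looks roughly like the sketch below. The `DropOldestBuffer` class and its capacity are hypothetical; the backend's real buffer size and structure are not documented here.

```python
from collections import deque
from typing import Any, Deque, List


class DropOldestBuffer:
    """Bounded in-memory buffer that discards the oldest row when full."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently evicts from the opposite end on append.
        self._rows: Deque[Any] = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, row: Any) -> None:
        if len(self._rows) == self._rows.maxlen:
            self.dropped += 1  # the oldest row is about to be evicted
        self._rows.append(row)

    def drain(self) -> List[Any]:
        """Hand all buffered rows to the writer and empty the buffer."""
        rows = list(self._rows)
        self._rows.clear()
        return rows


buffer = DropOldestBuffer(capacity=3)
for i in range(5):       # the database is wedged, so nothing drains
    buffer.push(i)
remaining = buffer.drain()
print(remaining, buffer.dropped)
```

Once the database recovers, only the newest rows are written; the gap left by the evicted rows is permanent unless another Output kept them.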

Recommended defaults

For a new deployment that has not yet decided what it wants:

Enable NATS. It is already plumbed and gives every future consumer a place to attach.
Enable Prometheus on its default port (9090) and path (/metrics/collector), and point an existing Prometheus at it. You will want a dashboard within a week.
Enable file to a retained volume. Cheap insurance, useful for replay during outages.
Leave InfluxDB, TimescaleDB, Kafka, and webhook off until a specific need shows up.

Trade-offs

What you give up by following the defaults:

One source of truth. With multiple Outputs enabled, two teams can disagree about a number because they queried different backends. Standardize on which Output answers which kind of question.
Effort spent on retention you may not need. Prometheus and InfluxDB both want retention policies; without them you eventually fill the disk. Treat retention as a Day 1 decision.
Webhook reliability. Webhook is the easiest Output to enable and the easiest one to overload a receiver with. There is no collector-side retry or buffer, so a flaky receiver loses records. Expect to add a queue or proxy in front of the receiver, or stand up a small intermediary that absorbs bursts, before calling it "production."

When to deviate

You already have a streaming platform. If Kafka is the organization's standard, send to Kafka first and let other teams consume from there. Skip the NATS-as-bus pattern.
You do not run Prometheus and do not want to. Use the collector's metrics endpoint for self-monitoring and send operational data to InfluxDB or to a vendor's hosted TSDB through webhook. Do not stand up Prometheus just to follow the recommendation.
You are in an air-gapped or compliance-bound environment. File output may be the only safe target. Lean on it; size the volume generously; back it up.
You are doing event-shaped work, not metrics. SNMP traps, syslog-derived events, alarm state changes — these belong on a bus (NATS or Kafka), not in a TSDB. Time-series tools handle event sparsity badly.