Check-Engine · one command · whole-BNG health · Symptom → Cause → Action

Operations Brief · Check-Engine

One command tells you what's wrong, where it is, and how to fix it

When a broadband network has a bad night — subscribers dropping to a fallback speed, "no internet," gaming broken — the slow part isn't fixing it, it's finding it: is it the BNG, the upstream link, DNS, the RADIUS server, the access network, or the customer's own router? BNGSOFT's Check-Engine answers that in one command. bngxdpctl check runs a correlated, whole-box health scan where every finding is Symptom → Cause → Action — and an active fault-isolation layer pinpoints which side of the network the fault is actually on.

1 command · whole-BNG scan: license, RADIUS, upstream, data-path, CGNAT, QoS, security, NIC, QoE

Cause + fix · not raw numbers: every finding is Symptom → Cause → suggested Action

Where · fault isolation: BNG · upstream · DNS · RADIUS · access · subscriber side

Seconds · not a war-room: human report · JSON for dashboards · one-line brief · watch mode

A monitoring dashboard tells you something is wrong. The Check-Engine tells you what's wrong, where the fault lives, and the command to fix it — so the first minute of an incident is the last one.

The 2 a.m. problem

Every operator knows the drill: tickets spike, and the clock starts on a scavenger hunt across a dozen tools — RADIUS logs, the NAT table, interface counters, routing, the AQM, the license server. The failure is usually simple; locating it is what burns the hour. The Check-Engine collapses that hunt into one screen.

Symptom seen: "All my users dropped to 10 Mbps"
Could be: license, QoS group, RADIUS rate, a cap somewhere.

Symptom seen: "Connected but no internet"
Could be: upstream down, DNS, CGNAT, forwarding, the access switch, the CPE.

Symptom seen: "Half the town is offline"
Could be: RADIUS server, a NIC/cable, a transit link, a routing change.

What you see: bngxdpctl check

================ bngxdpd check =================
host: bng-node-a   time: 2026-06-29 11:00
OVERALL: WARN (fail=0 warn=1 ok=9)
------------------------------------------------
OK    License      valid
OK    RADIUS/Subs  1303 active · 0 inactive · events flowing
OK    Upstream     uplink up · default route ok · 710↑/690↓ Mbps
OK    DataPath     forwarding ok · 75 sess/user
OK    CGNAT        translating ~1150 pkt/s · pool healthy
OK    QoS/AQM      dualq adaptive · fleet RPM 53k (~1.1ms)
OK    EdgeSec      antispoof enforce · 0 abusers
WARN  NIC/Links    eth1 rx_dropped +220/s
      cause : RX ring overflow under micro-bursts
      action: ethtool -G eth1 rx 8160 ; verify IRQ spread
OK    QoE/AEC      SES 99/100 · 0 anomalous
================================================

A traffic-light line per domain. Green lines stay terse; a problem flips to amber or red with the cause and a suggested fix command right there.

Every finding is Symptom → Cause → Action

That's the difference between a metric and a diagnosis. A graph shows "inactive subscribers: 812." The Check-Engine says what it means and what to do.

An example drawn from a real incident: RADIUS went silent and every new session landed unrated. The Check-Engine names the cause and hands you the emergency fallback command.
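To make the Symptom → Cause → Action shape concrete, here is a minimal Python sketch of how a finding and the overall status could be modelled. The `Finding` class, the `overall` and `render` helpers, and the column widths are illustrative assumptions, not the product's internals; only the status levels and the sample NIC finding come from this brief.

```python
from dataclasses import dataclass
from typing import Optional

# Severity order, worst last. Levels mirror the OK / WARN / FAIL statuses in the brief.
SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}


@dataclass
class Finding:
    """Hypothetical model of one per-domain result."""
    domain: str
    status: str                    # "OK", "WARN" or "FAIL"
    symptom: str                   # what was observed
    cause: Optional[str] = None    # only set when something is wrong
    action: Optional[str] = None   # suggested fix, only set when something is wrong


def overall(findings):
    """Worst status wins; an empty scan is treated as OK (an assumption)."""
    return max((f.status for f in findings), key=SEVERITY.__getitem__, default="OK")


def render(f):
    """One line for a healthy domain; cause and action lines only for problems."""
    lines = [f"{f.status:<6}{f.domain:<13}{f.symptom}"]
    if f.status != "OK":
        lines.append(f"      cause : {f.cause}")
        lines.append(f"      action: {f.action}")
    return "\n".join(lines)


findings = [
    Finding("License", "OK", "valid"),
    Finding("NIC/Links", "WARN", "eth1 rx_dropped +220/s",
            cause="RX ring overflow under micro-bursts",
            action="ethtool -G eth1 rx 8160 ; verify IRQ spread"),
]

print(f"OVERALL: {overall(findings)}")
for f in findings:
    print(render(f))
```

The design point is that a healthy domain costs one line, while any non-OK finding always carries its cause and action with it, so the report never shows a bare number without an interpretation.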

What it checks — ten domains, one pass

License

Invalid license → global fallback cap on every subscriber. Names the systemid/hostname cause + the re-register + restart fix.

RADIUS / Subscribers

Mass-inactive spike or unrated subscribers → "RADIUS down" + the emergency rate-fallback command.

Upstream / Internet

Uplink state, default route, and a TX-but-no-RX black-hole test.

Data path

Subscribers connected but no traffic → forwarding/cap/rate fault (the "connected, no internet" case).

CGNAT

Liveness by translation rate (not a stale counter) + port/block exhaustion pressure.

QoS / AQM

Bufferbloat under load (responsiveness RPM) and adaptive-AQM health.

Edge security

Anti-spoof, scanner, quarantine and DDoS activity at a glance.

NIC / Links

Per-card errors (cable/SFP/duplex) and drops (ring overflow) — by rate, across every ethernet port.

Platform & QoE

Daemon/XDP/maps, CPU/RAM, plus per-subscriber QoE anomalies and blast-radius grouping.

Built on signals it already has. The bngxdpd data plane already produces these signals; the Check-Engine is the correlation layer that turns them into one human verdict. Pure read-only diagnostics, safe to run any time, including --watch during live triage.
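Several of the checks above judge by rate rather than by an absolute counter (CGNAT liveness, NIC drops), and the upstream check looks for traffic that is sent but never answered. The Python sketch below illustrates both ideas under stated assumptions: the function names and the `min_tx_pps` threshold are hypothetical, and the real engine's sampling interval and thresholds are not given in this brief.

```python
def counter_rate(prev, curr, interval_s):
    """Per-second rate between two samples of an increasing counter.

    Returns None when no rate can be derived: a non-positive interval, or a
    counter that went backwards (for example after a reload reset it).
    """
    if interval_s <= 0 or curr < prev:
        return None
    return (curr - prev) / interval_s


def is_black_hole(tx_rate, rx_rate, min_tx_pps=100.0):
    """TX-but-no-RX: we are sending a meaningful amount and nothing comes back.

    min_tx_pps is an assumed floor so an idle link is not reported as a black hole.
    """
    if tx_rate is None or rx_rate is None:
        return False  # not enough data to decide
    return tx_rate >= min_tx_pps and rx_rate == 0


# Two samples taken 10 s apart: 11500 new translations -> 1150.0 per second.
print(counter_rate(1000, 12500, 10))

# The counter dropped to zero after a reload: no rate, rather than a false "dead" reading.
print(counter_rate(5000, 0, 10))

# Uplink sending 500 pkt/s while the RX counter has not moved at all.
print(is_black_hole(counter_rate(0, 5000, 10), counter_rate(700, 700, 10)))
```

Returning `None` on a counter reset is the key choice: a gauge that reads zero after a reload says nothing about whether translation is actually happening, so the sketch waits for the next sample instead of raising an alarm.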

The headline: fault isolation (new)

During an incident the real question is "whose problem is it?" The Check-Engine's second stage runs active probes from the BNG — and a unique subscriber-path probe that sources through the live CGNAT/forwarding path — to localize the fault to one side of the network.

Probes radiate from the BNG; the verdict synthesizes passive + active results into one line — e.g. "BNG forwarding healthy, gateway reachable, but no path to 8.8.8.8 → problem is UPSTREAM, not us."

The verdict line — one answer, in or out of your network

# healthy — collapses to one word
OK    Connectivity  all probes OK
FAULT ISOLATION: NONE (healthy)
  reason: all probes + passive checks OK

# a real fault — named, with the evidence and who to call
FAIL  Connectivity  internet unreachable (gateway OK)
FAULT ISOLATION: UPSTREAM
  reason: gateway reachable; no path to 8.8.8.8 / 1.1.1.1; subscriber-path also fails
  probes: GW:OK Inet:FAIL DNS:FAIL HTTP:FAIL RADIUS:OK NOC2:FAIL SubPath:FAIL
  → not the BNG, not RADIUS — escalate to transit / check default route + BGP

The fault verdict in one line: healthy collapses to NONE; a real fault names the domain (here UPSTREAM) with the evidence and the next move. Output format from bngxdpctl check; fault case illustrative.
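For scripting against the human-readable report, the `probes:` line in the illustrative fault case is easy to split into name/state pairs. The Python sketch below works only from that sample line; `parse_probes` is a hypothetical helper, and since the brief does not show the JSON schema, the JSON mode is likely the better integration point for real dashboards.

```python
def parse_probes(line):
    """Turn 'probes: GW:OK Inet:FAIL ...' into {'GW': True, 'Inet': False, ...}.

    Tokens that are not NAME:OK or NAME:FAIL are ignored, and a line without
    a 'probes:' label yields an empty dict.
    """
    _, _, rest = line.partition("probes:")
    results = {}
    for token in rest.split():
        name, sep, state = token.partition(":")
        if sep and state in ("OK", "FAIL"):
            results[name] = (state == "OK")
    return results


# Sample line copied from the illustrative fault case in this brief.
sample = "probes: GW:OK Inet:FAIL DNS:FAIL HTTP:FAIL RADIUS:OK NOC2:FAIL SubPath:FAIL"
probes = parse_probes(sample)
failed = [name for name, ok in probes.items() if not ok]
print(failed)
```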

Result A · Not us: BNG & subscriber-path to internet both pass → look at the access switch / customer CPE.

Result B · Upstream: Gateway pings, internet doesn't → transit / routing / provider.

Result C · Our data path: BNG's own internet works, subscriber-path fails → CGNAT / forwarding / cap on our side.
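The three results above amount to a small decision table over the probe outcomes. A minimal Python sketch of that table follows; the function name, the return strings, and the `UNDETERMINED` fallback for the case where even the gateway is unreachable are assumptions, since the brief does not describe that case or how the real verdict weighs passive checks.

```python
def isolate(gateway_ok, bng_internet_ok, subscriber_path_ok):
    """Map three probe outcomes onto the Result A / B / C verdicts."""
    if bng_internet_ok and subscriber_path_ok:
        # Result A: everything through the BNG works.
        return "NOT US: check access switch / customer CPE"
    if bng_internet_ok and not subscriber_path_ok:
        # Result C: the box reaches the internet, the subscriber path does not.
        return "OUR DATA PATH: check CGNAT / forwarding / cap"
    if gateway_ok:
        # Result B: the next hop answers, but nothing beyond it does.
        return "UPSTREAM: escalate to transit / routing / provider"
    # Not covered by the brief: gateway and internet both unreachable.
    return "UNDETERMINED: gateway unreachable"


# The illustrative fault case: GW:OK, Inet:FAIL, SubPath:FAIL.
print(isolate(gateway_ok=True, bng_internet_ok=False, subscriber_path_ok=False))
```

Checking the BNG's own internet reachability first keeps the rules mutually exclusive: the gateway probe only matters once the internet probe has failed.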

Forged in real incidents

Every check exists because the failure behind it actually happened in production — so the Check-Engine recognizes the patterns operators really hit:

Domain	The real incident it learned from
License	An upgrade changed the hardware-ID derivation; the license went invalid and a 10/10 fallback cap hit every subscriber.
RADIUS	RADIUS stopped sending rates; new sessions landed unrated/inactive en masse.
Data path	After a config change, subscribers connected but had no internet until the forwarding state was re-applied.
CGNAT	A counter read zero after a reload while CGNAT was actually fine — taught the engine to judge by translation rate, not a stale gauge.
NIC / Links	Micro-burst ring overflow showed up as silent NIC drops before the packets ever reached XDP.

Find it in the first minute, not the first hour

The Check-Engine turns "something's wrong" into "this is wrong, here's where, here's the fix" — one command, every subsystem, correlated to a cause and an action, with fault-isolation that tells you whether to call your transit provider, your RADIUS admin, or nobody at all because it's the customer's router.

It runs read-only on the same XDP data plane that already measures everything, and it ships a JSON mode for your NOC dashboards and a one-line brief for a status light.

Want a walkthrough? We'll run it live on a node, trigger a fault, and watch the right line turn red with the fix attached.

Honest framing: This is an operations brief; no throughput or price figures are claimed. The Check-Engine (bngxdpctl check) is a read-only diagnostic that correlates signals the bngxdpd data plane already produces (license, subscriber active/inactive, upstream/route, traffic, CGNAT translation, AQM/responsiveness, edge-security, NIC statistics, QoE) into per-domain findings (OK / WARN / FAIL), each carrying a cause and a suggested action, with human, JSON, one-line brief and watch output modes. The passive multi-domain scan (including CGNAT-by-translation-rate and per-NIC error/drop-by-rate) is implemented and in validation; the active fault-isolation layer (upstream/internet/DNS/HTTP/RADIUS Status-Server probes plus the subscriber-path-through-CGNAT probe and the consolidated fault verdict) is implemented and in active validation. The sample outputs and diagrams illustrate the design and output format, not a benchmark. Related per-topic briefs (Edge DDoS Protection, Subscriber Experience, CGNAT Arena, OrionOS) are available alongside this guide.