Check-Engine · one command · whole-BNG health · Symptom → Cause → Action
Operations Brief · Check-Engine
One command tells you what's wrong, where it is, and how to fix it
When a broadband network has a bad night — subscribers dropping to a fallback speed, "no internet," gaming broken — the slow part isn't fixing it, it's finding it: is it the BNG, the upstream link, DNS, the RADIUS server, the access network, or the customer's own router? BNGSOFT's Check-Engine answers that in one command. bngxdpctl check runs a correlated, whole-box health scan where every finding is Symptom → Cause → Action — and an active fault-isolation layer pinpoints which side of the network the fault is actually on.
every finding is Symptom → Cause → suggested Action
Where
fault isolation
BNG · upstream · DNS · RADIUS · access · subscriber side
Seconds
not a war-room
human report · JSON for dashboards · one-line brief · watch mode
A monitoring dashboard tells you something is wrong. The Check-Engine tells you what's wrong, where the fault lives, and the command to fix it — so the first minute of an incident is the last one.
The 2 a.m. problem
Every operator knows the drill: tickets spike, and the clock starts on a scavenger hunt across a dozen tools — RADIUS logs, the NAT table, interface counters, routing, the AQM, the license server. The failure is usually simple; locating it is what burns the hour. The Check-Engine collapses that hunt into one screen.
Symptom seen
"All my users dropped to 10 Mbps"
Could be: license, QoS group, RADIUS rate, a cap somewhere.
Symptom seen
"Connected but no internet"
Could be: upstream down, DNS, CGNAT, forwarding, the access switch, the CPE.
Symptom seen
"Half the town is offline"
Could be: RADIUS server, a NIC/cable, a transit link, a routing change.
What you see: bngxdpctl check
================ bngxdpd check =================
host: bng-node-a time: 2026-06-29 11:00
OVERALL: WARN (fail=0 warn=1 ok=9)
------------------------------------------------
OK License valid
OK RADIUS/Subs 1303 active · 0 inactive · events flowing
OK Upstream uplink up · default route ok · 710↑/690↓ Mbps
OK DataPath forwarding ok · 75 sess/user
OK CGNAT translating ~1150 pkt/s · pool healthy
OK QoS/AQM dualq adaptive · fleet RPM 53k (~1.1ms)
OK EdgeSec antispoof enforce · 0 abusers
WARN NIC/Links eth1 rx_dropped +220/s
cause : RX ring overflow under micro-bursts
action: ethtool -G eth1 rx 8160 ; verify IRQ spread
OK QoE/AEC SES 99/100 · 0 anomalous
================================================
A traffic-light line per domain. Green is silent; a problem flips red/amber with the cause and the exact fix command right there.
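For teams that want to post-process the human report, the layout above can be read mechanically. The following is a minimal sketch, assuming the line shapes shown in the sample (a status word, a domain token, then free text, with optional "cause :" and "action:" lines underneath). The names parse_report and overall are hypothetical helpers for illustration, not part of bngxdpctl, and the real tool's JSON mode would normally be the better integration point.

```python
import re
from collections import Counter

# A few lines copied from the sample report above.
SAMPLE = """\
OK License valid
OK RADIUS/Subs 1303 active · 0 inactive · events flowing
WARN NIC/Links eth1 rx_dropped +220/s
cause : RX ring overflow under micro-bursts
action: ethtool -G eth1 rx 8160 ; verify IRQ spread
OK QoE/AEC SES 99/100 · 0 anomalous
"""

# Assumed shapes: "<STATUS> <Domain> <symptom text>" and "cause|action : <text>".
STATUS_LINE = re.compile(r"^(OK|WARN|FAIL)\s+(\S+)\s+(.*)$")
DETAIL_LINE = re.compile(r"^(cause|action)\s*:\s*(.*)$")


def parse_report(text):
    """Turn report lines into a list of finding dicts (hypothetical helper)."""
    findings = []
    for raw in text.splitlines():
        line = raw.strip()
        m = STATUS_LINE.match(line)
        if m:
            findings.append({
                "status": m.group(1),
                "domain": m.group(2),
                "symptom": m.group(3),
                "cause": None,
                "action": None,
            })
            continue
        m = DETAIL_LINE.match(line)
        if m and findings:
            # cause/action lines attach to the most recent finding
            findings[-1][m.group(1)] = m.group(2)
    return findings


def overall(findings):
    """Worst status wins: FAIL over WARN over OK."""
    counts = Counter(f["status"] for f in findings)
    if counts["FAIL"]:
        return "FAIL"
    if counts["WARN"]:
        return "WARN"
    return "OK"


findings = parse_report(SAMPLE)
print(overall(findings), findings[2]["action"])
```

The "worst status wins" rollup mirrors the OVERALL line in the sample, where a single WARN makes the whole box WARN.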
Every finding is Symptom → Cause → Action
That's the difference between a metric and a diagnosis. A graph shows "inactive subscribers: 812." The Check-Engine says what it means and what to do.
An example drawn from a real incident: RADIUS went silent and every new session landed unrated. The Check-Engine names the cause and hands you the emergency fallback command.
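To make the metric-versus-diagnosis distinction concrete, here is a minimal sketch of how a raw subscriber count could be turned into a Symptom → Cause → Action finding. Everything in it is an assumption for illustration: the Finding type, the check_radius function, the 20% inactive threshold and the wording of the action are hypothetical, and the actual rules and fallback command inside the Check-Engine are not shown in this brief.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    status: str                   # "OK", "WARN" or "FAIL"
    domain: str
    symptom: str
    cause: Optional[str] = None
    action: Optional[str] = None


def check_radius(active, inactive, unrated, max_inactive_ratio=0.2):
    """Hypothetical rule: a mass-inactive spike or any unrated session
    is read as RADIUS not delivering rates."""
    total = active + inactive
    ratio = inactive / total if total else 0.0
    if ratio > max_inactive_ratio or unrated > 0:
        return Finding(
            status="FAIL",
            domain="RADIUS/Subs",
            symptom=f"{inactive} inactive · {unrated} unrated",
            cause="RADIUS not delivering rates (server down or silent)",
            # Placeholder text: the real fallback command is not given here.
            action="apply the emergency rate fallback, then check the RADIUS server",
        )
    return Finding("OK", "RADIUS/Subs", f"{active} active · {inactive} inactive")


healthy = check_radius(active=1303, inactive=0, unrated=0)
incident = check_radius(active=491, inactive=812, unrated=0)
print(healthy.status, incident.status, incident.cause)
```

The point of the shape is that a FAIL never travels alone: the same object that carries the number also carries the interpretation and the next step.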
What it checks — ten domains, one pass
License
Invalid license → global fallback cap on every subscriber. Names the systemid/hostname cause and the fix: re-register, then restart.
RADIUS / Subscribers
Mass-inactive spike or unrated subscribers → "RADIUS down" + the emergency rate-fallback command.
Upstream / Internet
Uplink state, default route, and a TX-but-no-RX black-hole test.
Data path
Subscribers connected but no traffic → forwarding/cap/rate fault (the "connected, no internet" case).
CGNAT
Liveness by translation rate (not a stale counter) + port/block exhaustion pressure.
QoS / AQM
Bufferbloat under load (responsiveness RPM) and adaptive-AQM health.
Edge security
Anti-spoof, scanner, quarantine and DDoS activity at a glance.
NIC / Links
Per-card errors (cable/SFP/duplex) and drops (ring overflow) — by rate, across every ethernet port.
Platform & QoE
Daemon/XDP/maps, CPU/RAM, plus per-subscriber QoE anomalies and blast-radius grouping.
Built on signals it already has. The data plane already measures all of this in the XDP fast path; the Check-Engine is the correlation layer that turns those signals into one human verdict. Pure read-only diagnostics — safe to run any time, including --watch during live triage.
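Two of the checks listed above, NIC drops "by rate" and the TX-but-no-RX black-hole test, come down to comparing counter samples over time rather than reading absolute values. The sketch below shows that arithmetic under stated assumptions: the function names, the sampling interval and the min_tx_pps threshold are hypothetical and chosen for illustration only.

```python
def rate(prev, curr, seconds):
    """Per-second rate between two samples of an increasing counter.
    A negative delta (counter went backwards) is clamped to zero."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return max(curr - prev, 0) / seconds


def is_black_hole(tx_pps, rx_pps, min_tx_pps=100.0):
    """Hypothetical black-hole test: we are clearly sending on the uplink
    but nothing at all is coming back."""
    return tx_pps >= min_tx_pps and rx_pps == 0.0


# rx_dropped sampled 10 s apart: 2200 new drops, so 220 drops per second.
drop_rate = rate(prev=1000, curr=3200, seconds=10)

# Uplink sampled 10 s apart: TX keeps counting, RX is frozen.
tx_pps = rate(prev=0, curr=5000, seconds=10)
rx_pps = rate(prev=700, curr=700, seconds=10)
print(drop_rate, is_black_hole(tx_pps, rx_pps))
```

Judging by rate means an old error total from last month does not raise an alarm today, while a small counter that is climbing right now does.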
The headline: fault isolation (new)
During an incident the real question is "whose problem is it?" The Check-Engine's second stage runs active probes from the BNG — and a unique subscriber-path probe that sources through the live CGNAT/forwarding path — to localize the fault to one side of the network.
Probes radiate from the BNG; the verdict synthesizes passive + active results into one line — e.g. "BNG forwarding healthy, gateway reachable, but no path to 8.8.8.8 → problem is UPSTREAM, not us."
The verdict line — one answer, in or out of your network
# healthy: collapses to one word
OK Connectivity all probes OK
FAULT ISOLATION: NONE (healthy)
reason: all probes + passive checks OK
# a real fault: named, with the evidence and who to call
FAIL Connectivity internet unreachable (gateway OK)
FAULT ISOLATION: UPSTREAM
reason: gateway reachable; no path to 8.8.8.8 / 1.1.1.1; subscriber-path also fails
probes: GW:OK Inet:FAIL DNS:FAIL HTTP:FAIL RADIUS:OK NOC2:FAIL SubPath:FAIL
→ not the BNG, not RADIUS: escalate to transit / check default route + BGP
The fault verdict in one line: healthy collapses to NONE; a real fault names the domain (here UPSTREAM) with the evidence and the next move. Output format from bngxdpctl check; fault case illustrative.
Result A
Not us
BNG & subscriber-path to internet both pass → look at the access switch / customer CPE.
Result B
Upstream
Gateway pings, internet doesn't → transit / routing / provider.
Result C
Our data path
BNG's own internet works, subscriber-path fails → CGNAT / forwarding / cap on our side.
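Results A, B and C amount to a small decision table over three probe outcomes. The sketch below encodes that table as described in this brief; it is an illustration, not the engine's actual logic. The function name isolate_fault and the labels DATA_PATH and UNDETERMINED are hypothetical (only NONE and UPSTREAM appear in the sample output), and the real verdict also weighs DNS, HTTP, RADIUS and passive checks that are left out here.

```python
def isolate_fault(gw_ok, inet_ok, subpath_ok):
    """Map three probe results to a fault domain (illustrative only).

    gw_ok      : the BNG can reach its upstream gateway
    inet_ok    : the BNG itself can reach the internet
    subpath_ok : a probe sourced through the CGNAT/forwarding path succeeds
    """
    if gw_ok and inet_ok and subpath_ok:
        # Result A: nothing wrong on our side. If subscribers still
        # complain, look at the access switch or the customer CPE.
        return "NONE"
    if gw_ok and not inet_ok:
        # Result B: gateway answers, internet does not.
        return "UPSTREAM"
    if inet_ok and not subpath_ok:
        # Result C: the BNG is fine, the subscriber path is not.
        return "DATA_PATH"
    # Combinations this brief does not describe (hypothetical label).
    return "UNDETERMINED"


# The fault case from the sample: GW:OK Inet:FAIL SubPath:FAIL
print(isolate_fault(gw_ok=True, inet_ok=False, subpath_ok=False))
```

Checking the upstream case before the data-path case matters: when the internet is unreachable from the BNG itself, a failing subscriber-path probe is expected and says nothing about CGNAT or forwarding.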
Forged in real incidents
Every check exists because the failure behind it actually happened in production — so the Check-Engine recognizes the patterns operators really hit:
Domain | The real incident it learned from
License | An upgrade changed the hardware-ID derivation; the license went invalid and a 10/10 fallback cap hit every subscriber.
RADIUS | RADIUS stopped sending rates; new sessions landed unrated/inactive en masse.
Data path | After a config change, subscribers connected but had no internet until the forwarding state was re-applied.
CGNAT | A counter read zero after a reload while CGNAT was actually fine, which taught the engine to judge by translation rate, not a stale gauge.
NIC / Links | Micro-burst ring overflow showed up as silent NIC drops before the packets ever reached XDP.
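The CGNAT lesson above, a counter that reads zero after a reload while translation is healthy, is a reason to treat a counter that moves backwards as "no verdict yet" rather than as an outage. A minimal sketch of that idea follows; the TranslationLiveness class and its behaviour are assumptions for illustration, not the engine's implementation.

```python
class TranslationLiveness:
    """Tracks a translated-packets counter and reports a rate only when
    two consecutive samples are comparable (hypothetical helper)."""

    def __init__(self):
        self._last = None  # (timestamp_seconds, counter_value)

    def update(self, ts, translated_pkts):
        """Return packets/second since the previous sample, or None when
        there is no usable baseline (first sample or counter reset)."""
        prev, self._last = self._last, (ts, translated_pkts)
        if prev is None or translated_pkts < prev[1] or ts <= prev[0]:
            return None
        return (translated_pkts - prev[1]) / (ts - prev[0])


live = TranslationLiveness()
first = live.update(0, 50_000)         # no baseline yet
steady = live.update(10, 61_500)       # 11,500 packets in 10 s
after_reload = live.update(20, 0)      # counter reset by a reload
recovered = live.update(30, 11_500)    # rate is back on the next sample
print(first, steady, after_reload, recovered)
```

A None result asks the caller to wait one more sample, which avoids a false alarm in the first interval after a reload.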
Find it in the first minute, not the first hour
The Check-Engine turns "something's wrong" into "this is wrong, here's where, here's the fix" — one command, every subsystem, correlated to a cause and an action, with fault-isolation that tells you whether to call your transit provider, your RADIUS admin, or nobody at all because it's the customer's router.
It runs read-only on the same XDP data plane that already measures everything, ships a JSON mode for your NOC dashboards, and a one-line brief for a status light.
Want a walkthrough? We'll run it live on a node, trigger a fault, and watch the right line turn red with the fix attached.
Honest framing: This is an operations brief; no throughput or price figures are claimed. The Check-Engine (bngxdpctl check) is a read-only diagnostic that correlates signals the bngxdpd data plane already produces (license, subscriber active/inactive, upstream/route, traffic, CGNAT translation, AQM/responsiveness, edge-security, NIC statistics, QoE) into per-domain findings (OK / WARN / FAIL), each carrying a cause and a suggested action. Output modes are human report, JSON, one-line brief, and watch. The passive multi-domain scan (including CGNAT-by-translation-rate and per-NIC error/drop-by-rate) is implemented and in validation; the active fault-isolation layer (upstream/internet/DNS/HTTP/RADIUS Status-Server probes plus the subscriber-path-through-CGNAT probe and the consolidated fault verdict) is implemented and in active validation. The sample outputs and diagrams illustrate the design and output format, not a benchmark. Related per-topic briefs (Edge DDoS Protection, Subscriber Experience, CGNAT Arena, OrionOS) are available alongside this guide.