Check-Engine · one command · whole-BNG health · Symptom → Cause → Action
Operations Brief · Check-Engine

One command tells you what's wrong, where it is, and how to fix it

When a broadband network has a bad night — subscribers dropping to a fallback speed, "no internet," gaming broken — the slow part isn't fixing it, it's finding it: is it the BNG, the upstream link, DNS, the RADIUS server, the access network, or the customer's own router? BNGSOFT's Check-Engine answers that in one command. bngxdpctl check runs a correlated, whole-box health scan where every finding is Symptom → Cause → Action — and an active fault-isolation layer pinpoints which side of the network the fault is actually on.
  • 1 command: whole-BNG scan (license, RADIUS, upstream, data-path, CGNAT, QoS, security, NIC, QoE)
  • Cause + fix, not raw numbers: every finding is Symptom → Cause → suggested Action
  • Where: fault isolation (BNG · upstream · DNS · RADIUS · access · subscriber side)
  • Seconds, not a war-room: human report · JSON for dashboards · one-line brief · watch mode
A monitoring dashboard tells you something is wrong. The Check-Engine tells you what's wrong, where the fault lives, and the command to fix it — so the first minute of an incident is the last one.

The 2 a.m. problem

Every operator knows the drill: tickets spike, and the clock starts on a scavenger hunt across a dozen tools — RADIUS logs, the NAT table, interface counters, routing, the AQM, the license server. The failure is usually simple; locating it is what burns the hour. The Check-Engine collapses that hunt into one screen.

Symptom seen

"All my users dropped to 10 Mbps"

  • Could be: license, QoS group, RADIUS rate, a cap somewhere.
Symptom seen

"Connected but no internet"

  • Could be: upstream down, DNS, CGNAT, forwarding, the access switch, the CPE.
Symptom seen

"Half the town is offline"

  • Could be: RADIUS server, a NIC/cable, a transit link, a routing change.
What you see: bngxdpctl check
    ================ bngxdpd check =================
    host: bng-node-a   time: 2026-06-29 11:00
    OVERALL: WARN (fail=0 warn=1 ok=9)
    ------------------------------------------------
    OK    License      valid
    OK    RADIUS/Subs  1303 active · 0 inactive · events flowing
    OK    Upstream     uplink up · default route ok · 710↑/690↓ Mbps
    OK    DataPath     forwarding ok · 75 sess/user
    OK    CGNAT        translating ~1150 pkt/s · pool healthy
    OK    QoS/AQM      dualq adaptive · fleet RPM 53k (~1.1ms)
    OK    EdgeSec      antispoof enforce · 0 abusers
    WARN  NIC/Links    eth1 rx_dropped +220/s
            cause : RX ring overflow under micro-bursts
            action: ethtool -G eth1 rx 8160 ; verify IRQ spread
    OK    QoE/AEC      SES 99/100 · 0 anomalous
    ================================================
A traffic-light line per domain. Green is silent; a problem flips red/amber with the cause and the exact fix command right there.
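
The OVERALL header in the sample report can be read as a worst-status roll-up over the per-domain lines. The sketch below is a minimal illustration of that idea, not the tool's actual code; the severity ordering and the function names `overall` and `header` are assumptions made for this example.

```python
from collections import Counter

# Assumed severity order: FAIL outranks WARN, which outranks OK.
SEVERITY = {"OK": 0, "WARN": 1, "FAIL": 2}


def overall(statuses):
    """Return (worst status, per-status counts) for a list of domain statuses."""
    counts = Counter(statuses)
    worst = max(statuses, key=SEVERITY.__getitem__, default="OK")
    return worst, counts


def header(statuses):
    """Format a header line in the style of the sample report."""
    worst, c = overall(statuses)
    return f"OVERALL: {worst} (fail={c['FAIL']} warn={c['WARN']} ok={c['OK']})"


# Ten domains with one in WARN, matching the counts in the sample header.
sample = ["OK"] * 9 + ["WARN"]
print(header(sample))  # OVERALL: WARN (fail=0 warn=1 ok=9)
```

A single FAIL anywhere would lift the header to FAIL, which is what lets a status light follow one line instead of ten.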

Every finding is Symptom → Cause → Action

That's the difference between a metric and a diagnosis. A graph shows "inactive subscribers: 812." The Check-Engine says what it means and what to do.

    SYMPTOM (measured)    812/1303 subs inactive, 0 RADIUS events 15m
    CAUSE   (correlated)  RADIUS server down / not sending rates
    ACTION  (fix)         check RADIUS reachability; sub set-rate --inactive --state active
A real example, drawn from a real incident: RADIUS went silent and every new session landed unrated. The Check-Engine names the cause and hands you the emergency fallback command.
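
As a rough illustration of how a measured symptom could be correlated into a cause and an action, the sketch below models one finding and one rule for the RADIUS case above. It is a sketch under stated assumptions: the `Finding` type, the `check_radius` function, and the 50% inactive threshold are hypothetical and chosen for the example; only the symptom, cause, and action strings come from the sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    status: str                   # "OK", "WARN" or "FAIL"
    domain: str
    symptom: str                  # what was measured
    cause: Optional[str] = None   # what it most likely means
    action: Optional[str] = None  # suggested next step


def check_radius(active, inactive, events_15m, inactive_ratio_fail=0.5):
    """Flag a mass-inactive spike that coincides with silent RADIUS.

    The threshold is an assumed value for illustration only.
    """
    total = active + inactive
    symptom = f"{inactive}/{total} subs inactive, {events_15m} RADIUS events 15m"
    ratio = inactive / total if total else 0.0
    if ratio >= inactive_ratio_fail and events_15m == 0:
        return Finding(
            "FAIL", "RADIUS/Subs", symptom,
            cause="RADIUS server down / not sending rates",
            action="check RADIUS reachability; "
                   "sub set-rate --inactive --state active",
        )
    return Finding("OK", "RADIUS/Subs", symptom)


f = check_radius(active=491, inactive=812, events_15m=0)
print(f.status, "|", f.symptom, "|", f.cause)
```

Neither signal alone is conclusive: inactive subscribers are normal in small numbers, and a quiet 15 minutes can happen off-peak. The rule fires only when both line up.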

What it checks — ten domains, one pass

License

Invalid license → global fallback cap on every subscriber. Names the systemid/hostname cause and gives the re-register and restart fix.

RADIUS / Subscribers

Mass-inactive spike or unrated subscribers → "RADIUS down" + the emergency rate-fallback command.

Upstream / Internet

Uplink state, default route, and a TX-but-no-RX black-hole test.

Data path

Subscribers connected but no traffic → forwarding/cap/rate fault (the "connected, no internet" case).

CGNAT

Liveness by translation rate (not a stale counter) + port/block exhaustion pressure.

QoS / AQM

Bufferbloat under load (responsiveness RPM) and adaptive-AQM health.

Edge security

Anti-spoof, scanner, quarantine and DDoS activity at a glance.

NIC / Links

Per-card errors (cable/SFP/duplex) and drops (ring overflow), judged by rate, across every Ethernet port.

Platform & QoE

Daemon/XDP/maps, CPU/RAM, plus per-subscriber QoE anomalies and blast-radius grouping.

Built on signals it already has. The data plane already measures all of this in the XDP fast path; the Check-Engine is the correlation layer that turns those signals into one human verdict. Pure read-only diagnostics — safe to run any time, including --watch during live triage.
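
Two of the checks above, NIC drops "by rate" and the TX-but-no-RX black-hole test, come down to comparing two counter samples instead of reading one absolute value. The sketch below shows that pattern using the standard Linux sysfs interface statistics; whether the Check-Engine reads these files or its own data-plane counters is not stated here, so treat the data source, the function names, and both thresholds as assumptions.

```python
from pathlib import Path

COUNTERS = ("rx_dropped", "rx_errors", "rx_packets", "tx_packets")


def read_counters(iface, root="/sys/class/net"):
    """Read cumulative interface counters from Linux sysfs."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def rates(prev, curr, seconds):
    """Convert two cumulative samples into per-second rates."""
    return {k: (curr[k] - prev[k]) / seconds for k in curr}


def assess(r, drop_warn=100.0, tx_min=1000.0):
    """Classify a rate sample. Thresholds are illustrative assumptions."""
    if r["tx_packets"] >= tx_min and r["rx_packets"] == 0:
        return "FAIL", "TX but no RX: possible black hole"
    if r["rx_dropped"] >= drop_warn:
        return "WARN", f"rx_dropped +{r['rx_dropped']:.0f}/s"
    return "OK", "no errors or drops"


# Two samples taken 10 seconds apart (made-up numbers).
prev = {"rx_dropped": 1000, "rx_errors": 0,
        "rx_packets": 50_000, "tx_packets": 40_000}
curr = {"rx_dropped": 3200, "rx_errors": 0,
        "rx_packets": 950_000, "tx_packets": 840_000}
print(assess(rates(prev, curr, 10.0)))  # ('WARN', 'rx_dropped +220/s')
```

On a live Linux host, `read_counters("eth1")` called twice with a sleep in between would supply `prev` and `curr`. A large absolute `rx_dropped` value that is no longer growing produces a rate of zero and stays quiet, which is the point of judging by rate.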

The headline: fault isolation (new)

During an incident the real question is "whose problem is it?" The Check-Engine's second stage runs active probes from the BNG — and a unique subscriber-path probe that sources through the live CGNAT/forwarding path — to localize the fault to one side of the network.

[Diagram: the BNGSOFT BNG check-engine at the center, with probes to RADIUS (Status-Server), DNS resolve, subscriber CPE / access, upstream gateway, internet (8.8.8.8 / HTTP), and the subscriber-path (through CGNAT). VERDICT: BNG | upstream | DNS | RADIUS | access | subscriber]
Probes radiate from the BNG; the verdict synthesizes passive + active results into one line — e.g. "BNG forwarding healthy, gateway reachable, but no path to 8.8.8.8 → problem is UPSTREAM, not us."
The verdict line — one answer, in or out of your network
    # healthy: collapses to one word
    OK    Connectivity  all probes OK
    FAULT ISOLATION: NONE (healthy)
      reason: all probes + passive checks OK

    # a real fault: named, with the evidence and who to call
    FAIL  Connectivity  internet unreachable (gateway OK)
    FAULT ISOLATION: UPSTREAM
      reason: gateway reachable; no path to 8.8.8.8 / 1.1.1.1; subscriber-path also fails
      probes: GW:OK Inet:FAIL DNS:FAIL HTTP:FAIL RADIUS:OK NOC2:FAIL SubPath:FAIL
      → not the BNG, not RADIUS; escalate to transit / check default route + BGP
The fault verdict in one line: healthy collapses to NONE; a real fault names the domain (here UPSTREAM) with the evidence and the next move. Output format from bngxdpctl check; fault case illustrative.
Result A

Not us

  • BNG & subscriber-path to internet both pass → look at the access switch / customer CPE.
Result B

Upstream

  • Gateway pings, internet doesn't → transit / routing / provider.
Result C

Our data path

  • BNG's own internet works, subscriber-path fails → CGNAT / forwarding / cap on our side.
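
Results A, B and C can be read as a small decision table over the probe outcomes. The sketch below is one possible encoding, offered as an illustration and not as the engine's actual logic. The probe keys mirror the sample `probes:` line; the `isolate` function, the rule order, the DNS and RADIUS rules, and the `UNDETERMINED` fallback are assumptions.

```python
def isolate(probes, symptoms_reported=False):
    """Map probe results (True = pass) to a fault-isolation verdict."""
    gw, inet = probes["GW"], probes["Inet"]
    dns, radius, subpath = probes["DNS"], probes["RADIUS"], probes["SubPath"]

    if gw and not inet:
        return "UPSTREAM"          # Result B: gateway pings, internet does not
    if inet and not subpath:
        return "BNG"               # Result C: our CGNAT / forwarding / cap
    if inet and not dns:
        return "DNS"               # assumed rule: path is fine, resolution is not
    if not radius:
        return "RADIUS"            # assumed rule: Status-Server probe failed
    if all((gw, inet, dns, radius, subpath)):
        # Result A: everything we can test passes, yet users complain.
        return "ACCESS/SUBSCRIBER" if symptoms_reported else "NONE"
    return "UNDETERMINED"          # combinations this sketch does not cover


# The illustrative fault case from the sample output.
fault = {"GW": True, "Inet": False, "DNS": False,
         "RADIUS": True, "SubPath": False}
print(isolate(fault))  # UPSTREAM
```

The subscriber-path probe is what separates Result C from Result A: without it, a BNG whose own internet access works would look healthy even while subscriber traffic fails inside CGNAT or forwarding.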

Forged in real incidents

Every check exists because the failure behind it actually happened in production — so the Check-Engine recognizes the patterns operators really hit:

Domain | The real incident it learned from
License | An upgrade changed the hardware-ID derivation; the license went invalid and a 10/10 fallback cap hit every subscriber.
RADIUS | RADIUS stopped sending rates; new sessions landed unrated/inactive en masse.
Data path | After a config change, subscribers connected but had no internet until the forwarding state was re-applied.
CGNAT | A counter read zero after a reload while CGNAT was actually fine; this taught the engine to judge by translation rate, not a stale gauge.
NIC / Links | Micro-burst ring overflow showed up as silent NIC drops before the packets ever reached XDP.
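
The CGNAT incident is a good example of why a rate is a safer liveness signal than a single counter read. The sketch below shows one way to compute a translation rate that survives a counter restarting from zero; the function names, the reset handling, and the `min_pps` threshold are assumptions for illustration, not the engine's implementation.

```python
def translation_rate(prev, curr, seconds):
    """Packets translated per second between two cumulative samples.

    If the counter went backwards, assume it restarted from zero
    (for example after a reload) and count only what accrued since.
    """
    delta = curr - prev
    if delta < 0:
        delta = curr
    return delta / seconds


def cgnat_alive(prev, curr, seconds, min_pps=1.0):
    """Judge liveness by rate. The threshold is an assumed value."""
    return translation_rate(prev, curr, seconds) >= min_pps


# The counter was reset by a reload between the two samples (made-up numbers).
print(translation_rate(9_000_000, 11_500, 10.0))  # 1150.0
print(cgnat_alive(9_000_000, 11_500, 10.0))       # True
```

A check that only compared the second sample against the first would see a huge drop and raise a false alarm; a check that read the counter right after the reload would see zero. The rate over a window avoids both.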

Find it in the first minute, not the first hour

The Check-Engine turns "something's wrong" into "this is wrong, here's where, here's the fix" — one command, every subsystem, correlated to a cause and an action, with fault-isolation that tells you whether to call your transit provider, your RADIUS admin, or nobody at all because it's the customer's router.

It runs read-only on the same XDP data plane that already measures everything, ships a JSON mode for your NOC dashboards, and a one-line brief for a status light.
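
To give a feel for how a dashboard might consume the JSON mode, the sketch below reduces a report to a one-line status. The JSON field names (`host`, `overall`, `findings`, `domain`, `status`) are a hypothetical schema invented for this example, as is the `brief` function; the real output format may differ.

```python
import json

# Hypothetical report shape, loosely modelled on the human-readable sample.
SAMPLE = """
{
  "host": "bng-node-a",
  "overall": "WARN",
  "findings": [
    {"domain": "License",   "status": "OK"},
    {"domain": "CGNAT",     "status": "OK"},
    {"domain": "NIC/Links", "status": "WARN",
     "cause": "RX ring overflow under micro-bursts"}
  ]
}
"""


def brief(report):
    """Collapse a report into one line that lists only non-OK domains."""
    bad = [f for f in report["findings"] if f["status"] != "OK"]
    if not bad:
        return f"{report['host']} OK"
    parts = ", ".join(f"{f['domain']}={f['status']}" for f in bad)
    return f"{report['host']} {report['overall']}: {parts}"


print(brief(json.loads(SAMPLE)))  # bng-node-a WARN: NIC/Links=WARN
```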

Want a walkthrough? We'll run it live on a node, trigger a fault, and watch the right line turn red with the fix attached.

Honest framing: This is an operations brief; no throughput or price figures are claimed. The Check-Engine (bngxdpctl check) is a read-only diagnostic that correlates signals the bngxdpd data plane already produces (license, subscriber active/inactive, upstream/route, traffic, CGNAT translation, AQM/responsiveness, edge-security, NIC statistics, QoE) into per-domain findings (OK / WARN / FAIL) each carrying a cause and a suggested action, with human, JSON and one-line/brief and watch output modes. The passive multi-domain scan (including CGNAT-by-translation-rate and per-NIC error/drop-by-rate) is implemented and in validation; the active fault-isolation layer (upstream/internet/DNS/HTTP/RADIUS Status-Server probes plus the subscriber-path-through-CGNAT probe and the consolidated fault verdict) is implemented and in active validation. The sample outputs and diagrams illustrate the design and output format, not a benchmark. Related per-topic briefs — Edge DDoS Protection, Subscriber Experience, CGNAT Arena, OrionOS — are available alongside this guide.