High-Performance XDP BNG · CGNAT · QoS · Low-Latency · Operations Intelligence

Answer "My Internet Doesn't Work" in One Command — and See the Capacity Wall Weeks Before It Hits

Two operator-facing tools built into the BNGSOFT XDP BNG: one-command subscriber diagnostics that run every check a senior network engineer would and print a single verdict, and predictive capacity planning that trends each box's own daily peaks and forecasts the saturation date — so you upgrade on a schedule instead of firefighting at peak. Read-only, no new hardware, no per-subscriber licences.

Most BNG troubleshooting time is spent finding the fault, not fixing it. And most capacity upgrades happen after subscribers have already felt the slowdown. These two tools close both gaps.

1 cmd

replaces a multi-step manual investigation per complaint

6 checks

session · firewall-set · captive · MSS · NAT-reputation · QoS

Saturation DATE

per-resource forecast from the box's own daily-peak history

No new hardware, daemons, agents or per-subscriber licences

Both features ship inside the existing bngxdpd daemon and the bngxdpctl control tool. The diagnostics command is read-only and safe to run against a live production node. The capacity collector is a lightweight task already inside the daemon's periodic loop; it adds no measurable CPU load and writes a tiny history file that the planner reads. Nothing new to deploy, monitor, or license.

1 · The two operational money-leaks

LEAK 1 · TIME-TO-DIAGNOSE

A subscriber says "nothing loads." Now what?

An engineer SSHes in, checks the session, the firewall sets, the MSS clamp, conntrack, the QoS state — by hand, one box at a time.
It needs a senior person who knows where to look. L1 staff escalate; the ticket sits.
Every operator does this differently, so triage quality is inconsistent and undocumented.
The cost is engineer-hours and churn — the customer is offline while you hunt.

LEAK 2 · CAPACITY SURPRISE

The box hits its ceiling at peak — and you find out from complaints.

Bus, CPU, conntrack, subscriber tables, CGNAT port-blocks — each has a ceiling, and growth is gradual.
Without a trend you only see "it's busy tonight," never "this box runs out on the 14th."
Upgrades become emergency capex — rushed NIC swaps or box splits under pressure, at the worst time.
Or worse: you over-provision everything "to be safe" and burn capital you didn't need to spend.

One product, both leaks. bngxdpctl diagnose <ip> collapses the hunt into one command with a verdict. bngxdpctl capacity --project turns each box's own peak history into a dated forecast and a concrete recommendation. Both already live inside the daemon you are running.

2 · One-command subscriber diagnostics — `bngxdpctl diagnose`

Give it a subscriber IP (or interface). It is mode-aware — it knows whether the box runs QoS-only, full-CGNAT, or CGNAT-only and only runs the checks that apply. It resolves the IP to the live session, runs the full battery of checks a senior engineer would, then prints a single most-likely root cause and the suggested fix — followed by the full evidence so the engineer can confirm.

SESSION

Is the line even up?

Resolves IP → live PPP/IPoE interface
operstate + MTU
Catches "session gone / never authenticated"

FIREWALL SETS

Is the subscriber treated as active?

Membership in @user_interfaces
@inactive → stuck in captive portal
Explains "redirected to portal, can't reach servers"

MSS / PMTU

The classic "big sites won't load."

Detects a clamp set to the uplink MTU
vs. the subscriber's real PPPoE link
Flags the server→user blackhole

NAT / REPUTATION

Is the shared CGNAT IP burned?

Conntrack signature: unanswered outbound vs established
Flags shared-pool-IP reputation blocks
The "phones work, smart-TV doesn't" case

QoS / EXPERIENCE

Is the BNG itself the bottleneck?

Per-subscriber QoS + latency view
Plan rate, queuing, experience score
Separates "our shaping" from "their CPE"

VERDICT

One answer, not a data dump.

Single most-likely cause, priority-ranked
Concrete suggested fix
Full evidence printed below for confirmation
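
To make the "priority-ranked" verdict concrete, here is a minimal Python sketch of how a set of check results could be collapsed into one answer. The check names and verdict wording are taken from this page; the priority order, the `CheckResult` type and the `pick_verdict` function are hypothetical illustrations and are not taken from bngxdpctl itself. The QoS/experience check is left out because this page does not quote a verdict string for it.

```python
from dataclasses import dataclass

# Hypothetical priority order (an assumption, not the tool's documented
# ranking): when several checks fail, the earliest entry wins.
PRIORITY = [
    ("Session", "Session gone / never authenticated"),
    ("@user_interfaces", "Missing from @user_interfaces"),
    ("@inactive", "Stuck in captive portal"),
    ("MSS clamp", "MSS / PMTU blackhole"),
    ("NAT/connectivity", "Shared-IP reputation block"),
]


@dataclass
class CheckResult:
    name: str          # check label as printed in the evidence list
    ok: bool           # True for [ OK ], False for [FAIL]
    detail: str = ""   # free-text evidence shown next to the check


def pick_verdict(results):
    """Return the verdict for the highest-priority failing check."""
    failed = {r.name for r in results if not r.ok}
    for name, verdict in PRIORITY:
        if name in failed:
            return verdict
    # No check failed: report the box as healthy.
    return "Healthy at the BNG"
```

With this ordering, a subscriber who fails both the `@inactive` and the MSS clamp checks is reported as "Stuck in captive portal", because the earlier fault hides the later one. The full result list is still available as evidence.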

Real output — a healthy subscriber on a live production node

root@NodeA:~# bngxdpctl diagnose 10.16.5.130

=== bng-diagnose: 10.16.5.130 ===
  box mode : qos-only    iface: ppp0

  [ OK ] Session           ppp0 up, mtu=1492
  [ OK ] @user_interfaces   ppp0 present
  [ OK ] @inactive         not captive-redirected
  [ OK ] MSS clamp         fixed clamp present (1432, 1452)
  [ OK ] NAT/connectivity   33 established vs 0 pending (healthy ratio)

  VERDICT: Healthy at the BNG
  All BNG-side checks pass. If the customer still has issues, the cause is most
  likely upstream, the CPE/router, or content/service-side.

Captured live from a production access node carrying ~8,700 subscribers (QoS-only mode, i40e, native XDP). The same command accepts --format json for ticketing and automation.
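
For teams that want to attach the result to a ticket, the sketch below parses the human-readable capture shown above into a dictionary. It relies only on the line layout visible in that capture; the `parse_diagnose` function and the dictionary keys are hypothetical, and the layout of a `[FAIL]` line is assumed to mirror the `[ OK ]` lines. Where the tool's own JSON output is available, that is the better source, since its schema is not shown on this page and is not guessed at here.

```python
import re

# Sample text copied from the healthy-subscriber capture on this page.
SAMPLE = """\
=== bng-diagnose: 10.16.5.130 ===
  box mode : qos-only    iface: ppp0

  [ OK ] Session           ppp0 up, mtu=1492
  [ OK ] @user_interfaces   ppp0 present
  [ OK ] @inactive         not captive-redirected
  [ OK ] MSS clamp         fixed clamp present (1432, 1452)
  [ OK ] NAT/connectivity   33 established vs 0 pending (healthy ratio)

  VERDICT: Healthy at the BNG
"""

# "[ OK ] <name><two or more spaces><detail>"; [FAIL] layout is an assumption.
CHECK_RE = re.compile(r"^\s*\[\s*(OK|FAIL)\s*\]\s+(.+?)\s{2,}(.*)$")
VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(.+?)\s*$")


def parse_diagnose(text):
    """Extract the verdict and per-check status from diagnose text output."""
    checks, verdict = [], None
    for line in text.splitlines():
        m = CHECK_RE.match(line)
        if m:
            status, name, detail = m.groups()
            checks.append({"name": name, "status": status, "detail": detail})
            continue
        m = VERDICT_RE.match(line)
        if m:
            verdict = m.group(1)
    return {"verdict": verdict, "checks": checks}
```

Calling `parse_diagnose(SAMPLE)` yields the verdict "Healthy at the BNG" and five checks, each with status `OK`.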

Why a verdict, not just data: the value isn't the individual checks — a skilled engineer can run those. The value is collapsing them into one command any L1 tech can run, with a ranked single answer. The full evidence still prints below the verdict, so a senior engineer loses nothing and a junior one gains a guided diagnosis.

Mode-aware: same command, the box's own data plane. The checks above are what diagnose runs on a QoS-only node, where firewall, captive-redirect and MSS clamping live in the kernel's nftables. A full-XDP-CGNAT node runs all of that inside the XDP program, with no nftables at all, so on those boxes diagnose reads the BPF data plane directly: the live session, the per-subscriber CGNAT/NAT state, port-block headroom and QoS/experience, surfaced straight from the same pinned maps the XDP program itself uses. The verdict experience is identical; the underlying source adapts to how the box forwards.

3 · The faults it catches: taken from real field cases [REAL DATA]

These are not hypotheticals. Each row below is a real subscriber-complaint pattern resolved on production BNGSOFT nodes — the kind of case that used to take an engineer anywhere from twenty minutes to several hours of manual work. diagnose now flags each one automatically.

Real symptom reported	Actual root cause	What `diagnose` now says
"LG & Android smart-TV apps load nothing; a phone on the same line works fine."	MSS clamp resolved to the uplink MTU, not the subscriber's 1492 PPPoE link → large server→user packets silently blackholed (PMTU). Big-TLS app screens stay blank.	[FAIL] MSS clamp — "uses 'set rt mtu' (resolves to uplink, not the 1492 link) → server→user blackhole on big packets." Verdict: MSS / PMTU blackhole → fix to fixed 1452/1432.
"Smart-TV streaming apps fail on CGNAT; assigning a static public IP fixes it instantly."	The shared CGNAT pool IP is reputation-blocked by the streaming/app servers; general browsing (phones) still works. Not a daemon bug — an IP-reputation problem.	[FAIL] NAT/connectivity — "outbound conns get no reply vs established → likely SHARED-IP REPUTATION block." Verdict points to a clean/less-burned pool IP.
"Customer authenticated but every site redirects to the captive portal."	Session is in the `@inactive` set → all traffic DNAT'd to the portal. An auth/CoA state mismatch.	[FAIL] @inactive → Verdict: Stuck in captive portal — check auth/CoA, force a reactivate.
"Subscriber online but gets no shaping / no firewall treatment."	The interface never made it into `@user_interfaces` (a set-sync gap) → no MSS clamp, no QoS-set treatment.	[FAIL] @user_interfaces → Verdict: Missing from @user_interfaces — reconnect the session / check set-sync.

The point: the two hardest cases above — the MSS/PMTU blackhole and the CGNAT-reputation block — each took a senior engineer hours of packet capture and conntrack inspection to pin down the first time. They are now a one-line verdict. Every future occurrence is caught in seconds by whoever picks up the ticket.
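
The two hard cases can be illustrated with a short Python sketch. The first part is plain header arithmetic: a TCP MSS is the link MTU minus the IP and TCP headers, which is where the fixed 1452 (IPv4) and 1432 (IPv6) values for a 1492-byte PPPoE link come from. The 1500-byte uplink MTU is an assumption chosen for illustration. The second part is a hypothetical heuristic for the conntrack signature; the `nat_signature` function and its `min_pending` threshold are invented here and do not describe the tool's actual rule.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header
TCP_HEADER = 20   # bytes, without options


def mss_for(link_mtu, ipv6=False):
    """Largest TCP payload that fits in one packet on a link of this MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return link_mtu - ip_header - TCP_HEADER


def segment_fits(mss, link_mtu, ipv6=False):
    """True if a full-size segment with this MSS fits the link unfragmented."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mss + ip_header + TCP_HEADER <= link_mtu


def nat_signature(established, pending, min_pending=10):
    """Hypothetical rule: many unanswered outbound flows and few established."""
    if pending >= min_pending and pending > established:
        return "suspect-shared-ip-reputation"
    return "healthy"


PPPOE_MTU = 1492
UPLINK_MTU = 1500  # assumption: a typical Ethernet uplink

print(mss_for(PPPOE_MTU))             # 1452
print(mss_for(PPPOE_MTU, ipv6=True))  # 1432
print(mss_for(UPLINK_MTU))            # 1460
print(segment_fits(mss_for(UPLINK_MTU), PPPOE_MTU))  # False
print(nat_signature(33, 0))           # healthy
```

A clamp derived from the 1500-byte uplink advertises an MSS of 1460, so full-size server segments become 1500-byte packets that cannot cross the 1492-byte subscriber link. That is the blackhole the MSS check flags. The conntrack rule is deliberately crude: a high count of unanswered flows can have other causes, which is why the tool's own wording is "likely".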

4 · Predictive capacity planning — `bngxdpctl capacity --project`

A lightweight task inside the daemon's periodic loop records each day's peak of every resource that can limit the box — bus/PCIe throughput, CPU and softirq, subscriber-table fill, conntrack/NAT, and CGNAT port-blocks — to a tiny per-box history file. It is deliberately cheap: scalar counter reads, no walking the big per-subscriber maps, written to persistent storage a few times an hour. After a week of history, the planner fits a trend to each resource and reports which one saturates first, on what date, and what to do about it.
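
The "daily peak, not average" idea described above can be sketched in a few lines of Python. This is an illustration only: the `DailyPeaks` class, the resource key names and the JSON layout are hypothetical, and the real collector lives inside bngxdpd with a history-file format that is not documented on this page.

```python
import json
from collections import defaultdict


class DailyPeaks:
    """Keep only the highest sample seen per resource per day."""

    def __init__(self):
        # day (ISO date string) -> resource name -> peak value
        self.peaks = defaultdict(dict)

    def observe(self, day, resource, value):
        """Record one scalar sample; keep it only if it is a new daily peak."""
        current = self.peaks[day].get(resource)
        if current is None or value > current:
            self.peaks[day][resource] = value

    def to_json(self):
        """Serialize the history; the layout here is an assumed format."""
        return json.dumps(self.peaks, sort_keys=True)


history = DailyPeaks()
for sample in (12.0, 16.0, 9.0):  # three samples taken during one day
    history.observe("2026-05-28", "cpu_pct", sample)
print(history.to_json())
```

Because only one number per resource per day survives, the history stays small no matter how often the loop samples, which fits the "tiny history file" description.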

COLLECT — inside the daemon

The box watches itself.

Tracks the daily peak, not an average — the peak is what hurts subscribers.
Bus throughput, CPU/softirq, subscribers, conntrack, CGNAT port-blocks.
No agent, no cron, no external collector — it is part of bngxdpd.
Negligible cost; survives reboots on persistent disk.

PROJECT — one command

A date and a recommendation.

Per-resource trend → days until 90% (the "limiting" line).
First-to-saturate is the headline; the rest are listed too.
Concrete action: add an E810/PCIe-gen4 NIC, split the box, raise a limit, add pool IPs.
--json for dashboards and fleet roll-ups.

Real "how full is this box" snapshot — live production node

root@NodeA:~# bngxdpctl capacity
Capacity — NodeA  (headroom = how much is left before this resource limits the box)
  resource                    used        max   use%  headroom  status
  ------------------------------------------------------------------------------
  CPU (busy, all cores)     16%     100%    16%      84%  OK
    └ softirq (XDP/NAPI)   13%          data-plane cost lives here
  Subscribers (IPv4)     8538    131072    6%     94%  OK
  Subscribers (IPv6)     8318    131072    6%     94%  OK
  Kernel conntrack     464196   8388608    5%     95%  OK

  Tightest resource: CPU at 16% used (84% headroom)  OK

Live snapshot from the same ~8,700-subscriber node. This box is comfortable on CPU, subscriber tables and conntrack — so the planner's job is to watch the one resource that is climbing (bus throughput at peak) and tell the operator when it runs out.
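
The use% and headroom columns in the snapshot follow from simple arithmetic on the used and max columns, as this Python sketch shows. The figures are copied from the capture above; the `use_pct` and `headroom_report` functions are hypothetical. Integer truncation is an assumption that happens to reproduce the displayed percentages; the tool's actual rounding is not documented here.

```python
def use_pct(used, maximum):
    """Whole-number percentage used, truncated (assumed rounding)."""
    return used * 100 // maximum


def headroom_report(rows):
    """Return {name: (use%, headroom%)} and the name of the tightest resource."""
    report = {}
    for name, (used, maximum) in rows.items():
        pct = use_pct(used, maximum)
        report[name] = (pct, 100 - pct)
    tightest = max(report, key=lambda name: report[name][0])
    return report, tightest


# Values taken from the NodeA snapshot on this page.
ROWS = {
    "CPU (busy, all cores)": (16, 100),
    "Subscribers (IPv4)": (8538, 131072),
    "Subscribers (IPv6)": (8318, 131072),
    "Kernel conntrack": (464196, 8388608),
}

report, tightest = headroom_report(ROWS)
print(report["Subscribers (IPv4)"])  # (6, 94)
print(report["Kernel conntrack"])    # (5, 95)
print(tightest)                      # CPU (busy, all cores)
```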

The forecast it produces [ILLUSTRATIVE FORMAT]

Once a box has accumulated a week of daily peaks, capacity --project produces output in exactly this form. The example below shows a box whose peak bus throughput is climbing by about 3 percentage points per day:

root@bng:~# bngxdpctl capacity --project
Capacity Projection — bng-edge-07  (daily peaks → earliest 90% saturation; 10 days analyzed)
  resource            current%  slope %/day  projected saturation
  --------------------------------------------------------------------
  CPU                 20%        0.02  stable
  subscribers          49%        1.00  ~2026-07-09 (41 days)
  bus/PCIe            87%        3.00  ~2026-05-30 (1 day)
  conntrack/NAT       15%        0.00  stable

  First to saturate: bus/PCIe ~2026-05-30 (1 day)
  Bus/PCIe saturating: add an E810/PCIe-gen4 NIC or split subscribers to a
  second box before 2026-05-30.

Honest note: the projection above is an illustrative example of the tool's real output format on a fast-climbing box; a live node needs ≥7 days of its own history before it produces a dated forecast. The two live snapshots on this page (diagnose and capacity) are captured verbatim from production.
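
The arithmetic behind a projection like the one above can be sketched in Python: fit a least-squares line to the daily peaks, then count the days until the line reaches 90%. The 90% threshold and the seven-day minimum come from this page. Everything else is an assumption made for illustration: the function names, projecting from the fitted value on the last day, the 365-day cut-off used to call a resource "stable", and the synthetic ten-day series, which were chosen only to reproduce the current% and slope columns of the example.

```python
import math
from datetime import date, timedelta

SATURATION_PCT = 90.0      # "saturation" threshold stated on this page
MIN_DAYS = 7               # minimum history before a dated forecast
STABLE_HORIZON_DAYS = 365  # hypothetical cut-off for reporting "stable"


def fit_line(peaks):
    """Ordinary least-squares fit of peak% against day index 0..n-1."""
    n = len(peaks)
    mean_x = (n - 1) / 2
    mean_y = sum(peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(peaks))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_saturation(peaks):
    """Days until the fitted line reaches 90%, or None if stable/too short."""
    if len(peaks) < MIN_DAYS:
        return None
    slope, intercept = fit_line(peaks)
    current = intercept + slope * (len(peaks) - 1)
    if current >= SATURATION_PCT:
        return 0
    if slope <= 0:
        return None
    days = math.ceil(round((SATURATION_PCT - current) / slope, 6))
    return days if days <= STABLE_HORIZON_DAYS else None


def saturation_date(peaks, today):
    days = days_to_saturation(peaks)
    return None if days is None else today + timedelta(days=days)


bus = [60 + 3 * d for d in range(10)]   # ends at 87%, +3 points/day
subs = [40 + d for d in range(10)]      # ends at 49%, +1 point/day
today = date(2026, 5, 29)               # inferred: one day before 2026-05-30
print(saturation_date(bus, today))      # 2026-05-30
print(saturation_date(subs, today))     # 2026-07-09
```

A flat series, such as ten days of CPU at 20%, returns `None`, which corresponds to the "stable" rows in the table. Since real growth is rarely linear, a sketch like this is only useful when it is refitted as each new daily peak arrives.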

From "it's busy tonight" to "this box runs out on the 30th — buy the NIC now." That single shift turns capacity from a reactive scramble into a line item on a planning calendar — and tells you which box and what to buy, not just that something somewhere is full.

5 · What it means for the business

Operations Intelligence · operator value

Lower time-to-resolution

One command replaces a manual hunt across session, firewall, MSS, NAT and QoS. Tickets close faster; the subscriber is back online sooner.

L1 staff handle more

A guided verdict lets first-line techs resolve cases that previously had to escalate to a senior engineer. Fewer escalations, lower cost per ticket.

Planned capex, not emergency capex

A dated saturation forecast per box means NICs and box-splits are budgeted and scheduled, not rushed through at peak after complaints.

Spend where it's actually needed

Per-resource headroom shows which boxes have room and which don't, so you upgrade the one that's filling, not the whole fleet "to be safe."

Consistent, documented triage

Every complaint is checked the same way, every time. The verdict and evidence are JSON-exportable straight into your ticketing system.

No new cost surface

Built into the daemon and CLI you already run. No new server, agent, database, dashboard licence or per-subscriber fee.

6 · Zero-risk to adopt

READ-ONLY DIAGNOSTICS

Safe on production.

diagnose only reads state — maps, nft sets, conntrack, /sys.
It changes nothing; run it against a live subscriber during an active complaint.

NEGLIGIBLE COLLECTOR

No measurable load.

Scalar counter reads on a slow cadence; no per-subscriber map walks.
Designed around the daemon's already-optimized periodic loop.

DROP-IN

Nothing new to run.

Part of the existing bngxdpd / bngxdpctl release.
One zero-downtime restart enables the collector; the data plane keeps forwarding.

Upgrade the box to the release carrying Operations Intelligence — a zero-downtime restart; forwarding never stops.

Use diagnose immediately on the next subscriber complaint — no waiting, no data accumulation needed.

Let the collector bank a week of peaks, then run capacity --project for the first dated forecast per box.

Feed --json into your NOC — ticketing for diagnostics, dashboards/fleet roll-up for capacity.

The bottom line

Your BNG already knows everything needed to diagnose a subscriber and to forecast its own ceiling. Operations Intelligence simply surfaces it — as a one-command verdict for the engineer on the ticket, and a dated saturation forecast for the person planning capex.

Less time hunting faults. Fewer escalations. Upgrades scheduled instead of scrambled. No new hardware, no new licences, no new systems to run.

Methodology and honest framing: Two output blocks on this page are captured verbatim from a single live production access node (hostname NodeA), carrying approximately 8,700 subscribers in QoS-only mode on an i40e NIC in native XDP: (1) the bngxdpctl diagnose 10.16.5.130 verdict for a healthy subscriber, and (2) the bngxdpctl capacity headroom snapshot (CPU 16% / softirq 13%, IPv4 subscribers 8,538 of 131,072, IPv6 8,318, kernel conntrack 464,196 of 8,388,608).

The capacity --project projection shown is an illustrative example of the tool's real output format, not a live measurement: a production node requires at least seven days of its own recorded daily peaks before it emits a dated forecast, and the collector was newly enabled on this node.

The four field cases in Section 3 (MSS/PMTU clamp blackhole, CGNAT shared-pool-IP reputation, captive-portal/@inactive, missing @user_interfaces) are real subscriber-complaint patterns previously diagnosed by hand on BNGSOFT production nodes; the diagnose tool now detects each automatically.

The "6 checks" headline counts session, @user_interfaces, @inactive (captive), MSS clamp, NAT/conntrack-reputation, and per-subscriber QoS/experience; the exact checks run depend on the box's operating mode (QoS-only, full-CGNAT, or CGNAT-only). A full-XDP-CGNAT node has no nftables, because firewall, captive-redirect, MSS clamping and NAT all run inside the XDP program. On such a node diagnose reads the BPF data plane (the live session, per-subscriber CGNAT/NAT and port-block state, and QoS/experience from the pinned maps) rather than nftables or kernel conntrack; the nft-set and kernel-conntrack checks shown in this document are the QoS-only realization and do not apply to full-XDP nodes.

The capacity collector tracks the daily peak of bus/PCIe throughput, CPU and softirq, subscriber-table fill, conntrack/NAT, kernel conntrack, and CGNAT port-block usage; the bus/PCIe metric is measured per-direction against an operator-configurable ceiling. "Saturation" is defined as a resource reaching 90% of its capacity. The projection uses a linear least-squares fit over the recorded daily peaks and is a planning aid, not a guarantee: real growth is rarely perfectly linear, and the forecast updates as new peaks are recorded. The diagnostics are read-only with respect to subscriber traffic, and the in-daemon collector adds no measurable data-plane cost.

Prepared as a management and operations overview for large-scale operators. Operations Intelligence is a feature set of the BNGSOFT XDP BNG product.