High-Performance XDP BNG · CGNAT · QoS · Low-Latency · Commodity-Hardware Economics

Broadband Gateway · Performance · Scale · TCO

One Commodity x86 Server Carrying ~12,000 Subscribers — Instead of a Chassis BNG

The entire BNG data plane — subscriber forwarding, CGNAT, firewall, per-subscriber QoS and AQM/L4S — runs in XDP at the NIC driver, before the kernel network stack. That removes the per-packet kernel cost that limits traditional Linux and router-based BNGs, and lets a single dual-socket commodity server do the job of a chassis: no line cards, no redundant supervisors, no per-subscriber licences. This is a CFO-and-architect view of the performance, the real scaling ceiling, and the total cost of ownership.

A chassis BNG sells you slots, supervisors and per-subscriber licences. This sells you a server. The data plane is fast enough that the first wall you hit isn't the CPU — it's the PCIe bus, and we ship a planner that watches exactly that.

~64k

concurrent subscribers per
commodity dual-socket box

23% → 2.5%

measured QoS-path CPU drop after
consolidating into monolithic XDP

PCIe
not CPU

the real ceiling is the bus
hairpin, with CPU to spare

No

per-subscriber licence fees,
line cards or chassis slots

The pitch is simple: line-rate, full-feature BNG on hardware you can buy from any server vendor. You grow capacity by adding or splitting commodity boxes rather than buying chassis slots, and you plan that growth against a measured bus-throughput ceiling rather than guesswork. The sections below cover the data plane that makes it fast, the scaling ceiling (the bus, not the CPU), the headroom in the tables, the measured tuning, the built-in operations, and the resulting capex, opex and power story versus a chassis.

1 · Why it is fast — the data plane runs in XDP, before the kernel

A traditional Linux or router-based BNG pays a per-packet cost to push every subscriber packet up through the kernel network stack — routing, netfilter, qdisc — before it can forward. That cost is what caps subscriber density and drives operators onto expensive purpose-built chassis silicon. The BNGSOFT data plane sidesteps it entirely: the whole pipeline runs in XDP (eXpress Data Path) at the NIC driver, before the packet ever enters the kernel stack.

TRADITIONAL BNG

Every packet pays the kernel-stack tax.

Packet traverses driver → kernel network stack → routing → netfilter → qdisc before forwarding.
Per-packet cost scales with subscriber count and packet rate — the density ceiling is reached early.
So operators buy chassis silicon (line cards, NPUs) to escape the host CPU.

BNGSOFT XDP DATA PLANE

Forwarding decided at the driver.

Subscriber forwarding, CGNAT, firewall, per-subscriber QoS and AQM/L4S all run in XDP, before the stack.
Native (driver-mode) XDP on Intel i40e / ice / E810 NICs; automatic fallback to SKB mode on NICs and hypervisors that need it.
The per-packet kernel cost that caps a traditional BNG simply isn't on the path.

This is the whole economic argument in one sentence: because the data plane never pays the kernel-stack per-packet cost, a commodity server reaches chassis-class subscriber density — so you can replace purpose-built silicon with a server you already know how to buy, rack and refresh.

2 · The monolithic-XDP optimization — a measured 23% → 2.5% CPU drop

Running in XDP is necessary but not sufficient; how the pipeline is built inside XDP matters enormously. Early versions chained the pipeline as a series of XDP programs connected by tail-calls — each stage re-parsed the packet and paid a tail-call cost. Consolidating that whole pipeline into a single monolithic XDP program removed both the per-stage tail-call overhead and the repeated re-parsing.
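The structural difference between the two designs can be sketched in a few lines. This is a toy Python model, not eBPF and not the product's code: the stage functions (`firewall`, `cgnat`, `qos`) and the header fields parsed are hypothetical stand-ins, and the tail-call cost itself is not modelled. It only counts how often the same packet is parsed in each layout.

```python
import struct

def parse_headers(frame: bytes) -> dict:
    """Parse the few Ethernet/IPv4 fields the toy stages need (no VLANs, no IP options)."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)   # EtherType at byte 12
    src, dst = struct.unpack_from("!4s4s", frame, 26)    # IPv4 src/dst at bytes 26..33
    return {"ethertype": ethertype, "src": src, "dst": dst}

# Hypothetical stage logic: each stage only inspects already-parsed headers.
def firewall(hdr: dict) -> bool:
    return hdr["ethertype"] == 0x0800

def cgnat(hdr: dict) -> bool:
    return hdr["src"] != hdr["dst"]

def qos(hdr: dict) -> bool:
    return True

STAGES = (firewall, cgnat, qos)

def run_chained(frame: bytes):
    """Chained layout: every stage is its own program and re-parses the packet."""
    parses, verdict = 0, True
    for stage in STAGES:
        hdr = parse_headers(frame)
        parses += 1
        verdict = verdict and stage(hdr)
    return verdict, parses

def run_monolithic(frame: bytes):
    """Monolithic layout: parse once, then hand the same headers to every stage."""
    hdr = parse_headers(frame)
    verdict = all(stage(hdr) for stage in STAGES)
    return verdict, 1

# 14-byte Ethernet header (IPv4 EtherType) + first 20 bytes of an IPv4 header.
frame = bytes(12) + b"\x08\x00" + bytes(12) + bytes([10, 0, 0, 1]) + bytes([192, 0, 2, 1])

print(run_chained(frame))     # same verdict, three parses
print(run_monolithic(frame))  # same verdict, one parse
```

Both layouts reach the same forwarding verdict; the chained one simply repeats the parse once per stage, which is the repeated work the consolidation removes.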

BEFORE · CHAINED TAIL-CALLS

Per-stage overhead on every packet

~23%

CPU on the QoS path (measured)

Each pipeline stage is a separate XDP program.
Tail-call between stages + re-parse of the packet at each hop.
Overhead multiplies with packet rate.

AFTER · MONOLITHIC XDP

One program, parse once, forward

~2.5%

CPU on the same QoS path (measured)

Whole pipeline consolidated into one XDP program.
No inter-stage tail-calls; the packet is parsed once.
The freed CPU becomes subscriber-density and latency headroom.

QoS-path CPU: chained tail-calls vs monolithic XDP

Measured CPU on the QoS data-plane path before and after consolidating into a single XDP program. Lower is better.

Chained tail-call pipeline: ~23% CPU
Monolithic XDP program: ~2.5% CPU

Why this is real, not marketing: the 23% → 2.5% figure is a directly measured before/after on the QoS path, not a synthetic benchmark. An order-of-magnitude reduction in data-plane CPU is what converts "a server can do some subscribers" into "a server can do chassis-class subscriber counts with headroom to spare."

3 · The honest scaling ceiling — it's the bus, not the CPU

This is the part most vendors hide and we lead with. On a dual-socket commodity x86 server with 2×100G NICs (PCIe gen4 ×16), the bus allows roughly 64,000 concurrent subscribers at a ~3 Mbps busy-hour rate, with CPU headroom to spare. The practical ceiling is therefore not CPU and not the subscriber tables: it is the PCIe / bus throughput of the in↔out "hairpin." Every subscriber packet crosses the bus twice (in on the access side, out on the uplink, or vice-versa), so the usable per-direction bus bandwidth is what runs out first. As a planning rule, capacity ≈ NIC usable line rate ÷ busy-hour per-subscriber rate.
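A minimal sketch of that planning rule in Python. The function name is hypothetical, and the ~192 Gbps usable rate for the 2×100G case is an assumption back-computed from the page's own 64,000 × 3 Mbps figure; the page does not state a usable line rate for that configuration. The ~50 Gbps input is the usable per-direction figure quoted for a PCIe x8 gen3 link.

```python
GBPS = 10**9
MBPS = 10**6

def subscribers_at_ceiling(usable_bps: int, busy_hour_bps_per_sub: int) -> int:
    """capacity ~= usable per-direction rate / busy-hour per-subscriber rate."""
    if usable_bps <= 0 or busy_hour_bps_per_sub <= 0:
        raise ValueError("rates must be positive")
    return usable_bps // busy_hour_bps_per_sub

# Assumption: ~192 Gbps usable, inferred from 64,000 subscribers x 3 Mbps.
gen4_subs = subscribers_at_ceiling(192 * GBPS, 3 * MBPS)

# Usable per-direction ceiling quoted for a PCIe x8 gen3 link.
gen3_subs = subscribers_at_ceiling(50 * GBPS, 3 * MBPS)

print(gen4_subs)  # 64000
print(gen3_subs)  # 16666
```

The busy-hour rate is the sensitive input: the same box carries fewer subscribers if the per-subscriber average is higher, which is why the figure depends on the operator's own traffic mix.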

THE HAIRPIN

Every packet crosses the bus twice.

In on one interface, out on another → two bus traversals per packet.
So the planning unit is usable bus bandwidth per direction, not raw CPU.

PCIe x8 gen3 REALITY

~50 Gbps per direction usable.

On a PCIe x8 gen3 link the usable ceiling is roughly 50 Gbps each way.
Production peak observed at ~28 Gbps per direction, so real headroom remained on the link.

CPU & TABLES SPARE

The bus hits first.

At that peak, CPU and subscriber tables still had large headroom.
So the first constraint to manage is the bus — which is why we ship a planner for it.
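For context on the ~50 Gbps figure, the raw per-direction link rate can be worked out from the PCIe signalling rate. The sketch below uses the published per-lane rates (8 GT/s for gen3, 16 GT/s for gen4) and 128b/130b line encoding. The "implied efficiency" is simply the page's ~50 Gbps usable figure divided by the raw rate; it is a derived ratio, not a measured one, and real efficiency varies with payload size and descriptor traffic.

```python
# Per-lane transfer rate in GT/s; gen3 and gen4 both use 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0}
ENCODING_EFFICIENCY = 128 / 130

def raw_gbps_per_direction(gen: int, lanes: int) -> float:
    """Raw link bandwidth per direction, before packet and protocol overhead."""
    return GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY

raw_gen3_x8 = raw_gbps_per_direction(3, 8)
raw_gen4_x16 = raw_gbps_per_direction(4, 16)

usable_gen3_x8 = 50.0   # Gbps per direction, as quoted on this page
observed_peak = 28.0    # Gbps per direction, as quoted on this page

implied_efficiency = usable_gen3_x8 / raw_gen3_x8
utilisation = observed_peak / usable_gen3_x8

print(f"gen3 x8 raw:  {raw_gen3_x8:.1f} Gbps per direction")
print(f"gen4 x16 raw: {raw_gen4_x16:.1f} Gbps per direction")
print(f"implied usable/raw ratio on gen3 x8: {implied_efficiency:.0%}")
print(f"observed peak vs usable ceiling:     {utilisation:.0%}")
```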

Where a production node actually sits: per-direction bus utilisation at peak

Observed production peak against the usable per-direction ceiling of a PCIe x8 gen3 link. The point is that the bus is the binding constraint while CPU and tables are not.

Usable bus ceiling (x8 gen3): ~50 Gbps / direction
Observed production peak: ~28 Gbps / direction
CPU at that peak: low, with spare headroom

Subscriber tables at that peak: ~10% full, with spare headroom

We sell on the bus and we don't hide it. Because the ceiling is PCIe throughput, the real lever for more capacity per box is the NIC / PCIe generation (E810-class, PCIe gen4) — not more CPU cores. And because the constraint is predictable, the product ships a capacity planner that watches the bus, not just CPU, so you scale on a schedule instead of by surprise.

How full is this box — live capacity snapshot

root@NodeA:~# bngxdpctl capacity
Capacity — NodeA  (headroom = how much is left before this resource limits the box)
  resource                    used       max   use%  headroom  status
  ------------------------------------------------------------------------------
  Bus/PCIe (per dir)         ~28Gb     ~50Gb    56%       44%  WATCH
  CPU (busy, all cores)        16%      100%    16%       84%  OK
    └ softirq (XDP/NAPI)       13%          data-plane cost lives here
  Subscribers (IPv4)         12480    131072    10%       90%  OK
  Subscribers (IPv6)         12455    131072    10%       90%  OK
  Kernel conntrack          612300   8388608     7%       93%  OK

  Tightest resource: Bus/PCIe at 56% used (44% headroom)  WATCH — bus is the ceiling

ILLUSTRATIVE FORMAT: Representative of the bngxdpctl capacity output on a ~12,000-subscriber dual-socket node; the bus/PCIe row is the operator's first-to-watch line. The numbers are production-derived and deployment-dependent (see the closing note). The bus is the binding constraint while CPU, subscriber tables and conntrack all sit comfortably.
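The use% and headroom columns follow directly from the used and max values. A minimal Python sketch that recomputes them from the figures in the illustrative block and picks the tightest resource; the `Resource` class and the 50% WATCH threshold are assumptions for illustration, not the tool's actual logic.

```python
from dataclasses import dataclass

WATCH_AT_PCT = 50  # hypothetical threshold; the tool's real rule is not documented here

@dataclass(frozen=True)
class Resource:
    name: str
    used: float
    maximum: float

    @property
    def use_pct(self) -> int:
        return round(100 * self.used / self.maximum)

    @property
    def headroom_pct(self) -> int:
        return 100 - self.use_pct

    @property
    def status(self) -> str:
        return "WATCH" if self.use_pct >= WATCH_AT_PCT else "OK"

# Values copied from the illustrative capacity block.
resources = [
    Resource("Bus/PCIe (per dir, Gb)", 28, 50),
    Resource("CPU (busy, all cores)", 16, 100),
    Resource("Subscribers (IPv4)", 12480, 131072),
    Resource("Subscribers (IPv6)", 12455, 131072),
    Resource("Kernel conntrack", 612300, 8388608),
]

tightest = max(resources, key=lambda r: r.use_pct)

for r in resources:
    print(f"{r.name:<24} {r.use_pct:>3}% used  {r.headroom_pct:>3}% headroom  {r.status}")
print(f"Tightest resource: {tightest.name} at {tightest.use_pct}% used")
```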

4 · Tuning is measured, not guesswork — spread, don't NUMA-pin

A common assumption on a 2-socket box is that you should NUMA-pin NIC IRQs to the socket nearest each NIC. On this data plane, measurement shows that assumption is wrong, and we ship the box tuned for what actually wins.

MEASURED-OPTIMAL · SPREAD

Spread NIC IRQs across all cores.

Distributing receive queues / IRQs across every core is the measured-best configuration.
The box ships tuned this way; RPS/XPS drift (e.g. after a link flap) is auto-corrected.
No per-site hand-tuning needed to get the published density.

MEASURED-WORSE · NUMA-PIN

NUMA-pinning made it worse.

The BPF maps span both NUMA nodes, so pinning a queue to one socket increases cross-NUMA traffic.
Pinning was measured to raise cross-NUMA cost — the opposite of the intent.
So the real lever for more throughput is the PCIe / NIC generation, not more cores or clever pinning.

Why this matters to the buyer: the box is tuned by measurement, not folklore. That means predictable density out of the box, and a clear answer when someone asks "how do we get more out of one node?" — upgrade the bus/NIC, because CPU and core-pinning are not the constraint.
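The two layouts compared above can be written down as queue-to-core maps. The sketch below is illustrative Python only: the core numbering (cores 0-7 on socket 0, 8-15 on socket 1), the queue count and the function name are assumptions, and it does not model cross-NUMA cost or show how the box applies its tuning. It only shows how many cores each layout puts to work. The path printed is the standard Linux per-IRQ affinity file, which accepts a CPU list.

```python
def assign_queues(num_queues: int, cpus: list) -> dict:
    """Round-robin receive queues over the given CPUs: queue index -> CPU id."""
    return {q: cpus[q % len(cpus)] for q in range(num_queues)}

# Hypothetical topology: 2 sockets x 8 cores, numbered contiguously per socket.
SOCKET0 = list(range(0, 8))
SOCKET1 = list(range(8, 16))
ALL_CORES = SOCKET0 + SOCKET1

NUM_QUEUES = 16  # hypothetical queue count for one NIC

spread_map = assign_queues(NUM_QUEUES, ALL_CORES)  # spread across every core
pinned_map = assign_queues(NUM_QUEUES, SOCKET0)    # NUMA-pinned to the NIC's socket

def affinity_list_path(irq: int) -> str:
    """Standard Linux procfs file that takes a CPU list for one IRQ."""
    return f"/proc/irq/{irq}/smp_affinity_list"

print("spread uses", len(set(spread_map.values())), "cores")
print("pinned uses", len(set(pinned_map.values())), "cores")
print("example target file:", affinity_list_path(120))  # IRQ number is made up
```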

5 · Headroom — the tables are sized well under the bus ceiling

Because the bus is the first thing to fill, the in-memory tables are deliberately sized with generous headroom so they are never the limiting factor at the densities the bus allows. You are not rationing table slots; you are managing bus bandwidth.

SUBSCRIBER TABLES

131,072 per address family.

IPv4 and IPv6 subscriber tables each sized to 131,072 entries.
At ~12k subscribers that's roughly 10% full — vast headroom under the bus ceiling.

CONNTRACK / NAT

Large connection tables.

Large conntrack / NAT tables to hold the flow state of a full subscriber base.
Sits comfortably below capacity at production peak.

CGNAT PORT-BLOCKS

Scalable port-block pools.

Scalable CGNAT port-block pools for the shared-address-pool footprint.
Pool growth is a config/IP lever, not a hardware wall.

Net effect: every dimension the operator might worry about — subscribers, flows, NAT, ports — has room to spare at the densities a single box's bus will actually carry. The planning conversation collapses to one honest metric: per-direction bus throughput.

6 · Operations that lower TCO further — built in, no extra boxes

The cost story isn't only the absent chassis and licences. Three operational capabilities ship inside the same bngxdpd daemon and bngxdpctl tool and each removes cost a chassis design would otherwise demand.

ZERO-DOWNTIME UPGRADE

Hitless, no redundant chassis.

In-service upgrades that keep forwarding through the restart.
You do not need a second redundant chassis purely to upgrade without an outage.
The classic "buy two of everything for maintenance windows" cost goes away.

ONE-COMMAND DIAGNOSTICS

Faster tickets, fewer escalations.

bngxdpctl diagnose <ip> collapses a manual subscriber hunt into one verdict.
L1 staff resolve cases that used to escalate — lower cost per ticket.
Read-only and safe on a live production node.

PREDICTIVE CAPACITY

Planned capex, not emergency.

bngxdpctl capacity --project trends each box's own peaks to a dated saturation forecast.
Watches the bus first — upgrade the NIC or split the box on a schedule.
Spend where it's needed; don't over-provision the whole fleet "to be safe."

Cost category	Traditional chassis BNG	BNGSOFT commodity XDP server
Hitless upgrade	Often a second redundant chassis / supervisor purely for maintenance windows.	Built-in zero-downtime in-service upgrade — no redundant unit needed for hitless.
Diagnostics	Vendor tooling and senior-engineer time per complaint.	One-command verdict any L1 tech can run; JSON for ticketing.
Capacity planning	Reactive — slot exhaustion discovered late, emergency procurement.	Dated, per-box, bus-aware forecast — buy the NIC or split before it bites.
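The dated forecast is described on this page as a linear fit over a node's own daily peaks, emitted only once at least seven days are recorded. A minimal sketch of that idea in Python, assuming an ordinary least-squares line; the function name, the sample peaks and the start date are hypothetical, and the tool's actual fitting method is not specified beyond "linear-fit".

```python
import math
from datetime import date, timedelta

MIN_DAYS = 7  # the page states at least seven days of daily peaks are required

def forecast_saturation(daily_peaks, ceiling, first_day):
    """Fit a least-squares line to daily peaks and return the date it reaches
    the ceiling, or None if there is too little data or no upward trend."""
    n = len(daily_peaks)
    if n < MIN_DAYS:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_peaks))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling peaks never reach the ceiling
    intercept = mean_y - slope * mean_x
    days_to_ceiling = (ceiling - intercept) / slope
    return first_day + timedelta(days=math.ceil(days_to_ceiling))

# Hypothetical week of per-direction bus peaks in Gbps, growing 0.5 Gbps per day.
peaks = [28.0, 28.5, 29.0, 29.5, 30.0, 30.5, 31.0]

print(forecast_saturation(peaks, ceiling=50.0, first_day=date(2025, 1, 1)))
print(forecast_saturation(peaks[:6], ceiling=50.0, first_day=date(2025, 1, 1)))
```

A straight line is a planning aid rather than a prediction: seasonal peaks or a step change in subscriber count will move the date, so the forecast is best re-read as new daily peaks arrive.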

7 · The TCO case — a server instead of a chassis

Put together, the hardware substitution is the headline and the operations are the multiplier. A commodity 1–2U x86 server with standard NICs replaces a chassis BNG and everything that chassis implies.

Dimension	Traditional chassis BNG	BNGSOFT commodity XDP server
Hardware	Chassis + multiple line cards + redundant route processors / supervisors.	One 1–2U dual-socket x86 server with standard 100G-class NICs.
Per-subscriber licensing	Per-subscriber feature / scale licences; capacity gated by entitlement.	No per-subscriber licence fees. Density is bounded by the bus, not a licence.
How you grow	Buy chassis slots / line cards; eventually fork-lift to a bigger chassis.	Add or split commodity boxes; the unit of growth is a cheap server.
Power & rack	High power draw and rack footprint per chassis.	Lower power and a 1–2U footprint per node.
Refresh & sparing	Vendor-specific spares, support contracts and refresh cadence.	Standard server refresh cycle and commodity sparing you already run.
Scaling lever	More slots / bigger chassis.	Newer NIC / PCIe generation (E810, gen4) — the measured real lever.

Performance · Scale · TCO — buyer value

Capex collapses to a server: Replace chassis + line cards + redundant supervisors with one commodity 1–2U dual-socket box and standard NICs.

No per-subscriber licences: Subscriber density is bounded by the bus, not by an entitlement you keep re-buying as you grow.

Lower opex & power: Less power, less rack, standard server sparing and refresh, with no vendor-specific chassis support cadence.

Grow by adding cheap boxes: Capacity grows by adding or splitting commodity servers, not by buying chassis slots or fork-lifting.

Predictable, bus-based scaling: An honest per-direction bus ceiling and a dated forecast mean planned procurement, not peak-time scrambles.

No redundancy tax for upgrades: Zero-downtime in-service upgrade removes the "buy a second chassis to patch without an outage" line item.

The bottom line

The BNGSOFT data plane runs the whole BNG — forwarding, CGNAT, firewall, per-subscriber QoS, AQM/L4S — in XDP before the kernel stack, and consolidates it into a single monolithic program that measured a 23% → 2.5% CPU drop on the QoS path. The result is chassis-class subscriber density — ~64,000 per 2×100G box — on a commodity server, where the first ceiling is the PCIe bus, not the CPU.

So the buy is a server, not a chassis: no line cards, no redundant supervisors, no per-subscriber licences. You grow by adding cheap boxes, you upgrade hitlessly without a redundant unit, and you plan against an honest bus-throughput ceiling with a built-in dated forecast. Lower capex, lower opex, lower power — and a scaling model you can actually predict.

Methodology and honest framing: The figures on this page are derived from BNGSOFT production nodes and are representative, not contractual.

Subscriber density: per-node subscriber capacity is throughput-driven. A dual-socket commodity x86 server with 2×100G NICs (PCIe gen4 ×16) can carry roughly 64,000 concurrent subscribers at a ~3 Mbps busy-hour average (capacity ≈ NIC usable line rate ÷ busy-hour per-subscriber rate), capped by the 131,072-entry per-address-family subscriber tables, with CPU headroom remaining; exact capacity depends on traffic mix, average packet size, NIC model and PCIe generation. Observed production peaks to date are on smaller PCIe gen3 ×8 nodes (see below), which the bus, not CPU or table size, limits well before that figure.

The scaling ceiling is the bus, not the CPU: every subscriber packet crosses the PCIe bus twice (the in↔out "hairpin"), so usable per-direction bus bandwidth is the binding constraint. On a PCIe x8 gen3 link (~50 Gbps per direction usable) a production peak of ~28 Gbps per direction was observed while CPU and subscriber tables still had large headroom. The real lever for more per-box throughput is therefore the NIC / PCIe generation (E810-class, PCIe gen4), not additional CPU cores.

The 23% → 2.5% figure is the measured CPU reduction on the QoS data-plane path from consolidating a chained tail-call XDP pipeline into a single monolithic XDP program; it is a directly observed before/after, not a synthetic benchmark.

XDP mode: the data plane uses native (driver-mode) XDP on Intel i40e / ice / E810 NICs and automatically falls back to SKB mode on NICs and hypervisors that require it.

Tuning: spreading NIC IRQs / receive queues across all cores was measured to be optimal; NUMA-pinning was measured to be worse, because the BPF maps span both NUMA nodes and pinning increases cross-NUMA traffic. The box ships tuned for spread.

Table headroom: subscriber tables are sized to 131,072 entries per address family (IPv4 and IPv6), with large conntrack / NAT tables and scalable CGNAT port-block pools; these sit well below capacity at the densities the bus allows.

Illustrative output: the bngxdpctl capacity terminal block is marked ILLUSTRATIVE FORMAT. It represents the real shape of the tool's output and uses production-derived, deployment-dependent values rather than a single verbatim capture; node names (NodeA, etc.) are generic. The capacity --project dated forecast referenced in Section 6 requires at least seven days of a node's own recorded daily peaks before it emits a projection, and any projection is a linear-fit planning aid, not a guarantee.

TCO comparison: any cost, power or currency comparison against a traditional chassis BNG is illustrative and qualitative; the actual saving depends entirely on the operator's incumbent platform, scale and commercial terms. The substitution described (a commodity 1–2U x86 server with standard NICs replacing a chassis with multiple line cards, redundant route processors / supervisors and per-subscriber feature / scale licensing, with no per-subscriber licence fees) reflects the product architecture; specific savings should be validated against the operator's own incumbent quote and refresh economics.

Prepared as a management and architecture overview for large-scale operators. Performance, Scale & TCO are characteristics of the BNGSOFT XDP BNG product.