High-Performance XDP BNG · CGNAT · QoS · Low-Latency · Carrier-Grade Reliability

Broadband Gateway · Reliability · Churn & Storm Resilience

Surviving Power Cuts, Fibre Cuts and Reconnect Storms — Without Dropping the Network

A neighbourhood power cut ends with every subscriber reconnecting at once. That reconnect storm — not steady-state traffic — is what breaks a BNG. This release hardened the BNGSOFT BNG service (the subscriber/session control plane — PPPoE, IPoE, RADIUS, address pools, routing) for exactly those moments: a crash class eliminated, per-subscriber IPv6 made cheap, faster time-to-online, and storage-failure tolerance — all validated under a 10,000-subscriber load test and an 8,000-subscriber, 4-hour power-cut / fibre-cut soak.

Steady-state is easy. The bad day is the test. We built for the power cut, the fibre cut and the reconnect storm — then spent hours simulating them on a live 8,000-subscriber bench before any of it shipped.

~10,000 subscribers carried through validation on one box

8k × 4h power-cut / fibre-cut soak, with churn every ~15 min

0 crashes, memory, file-descriptor or route leaks through churn

Crash class eliminated: no daemon crash under reconnect storms

This cycle was about the control plane under stress: the connect/disconnect storms after an outage, and the slow drift that builds over weeks of uptime. We moved to a newer kernel base (7.1.1), removed a use-after-free crash class, and cut the per-subscriber control-plane work that piles up during churn, then proved each change under load that mirrors a real bad day. The sections below walk through the threat, each fix, and the validation that backs it.

1 · The threat — the reconnect storm, not steady traffic

When the power comes back to an area, thousands of CPEs power on within seconds and all try to reconnect at the same instant. The BNG must tear down the dead sessions, then re-establish thousands of new ones — re-authenticating, re-allocating IPv4 and IPv6 addresses, re-installing routes — in a burst. This is the single most demanding thing a BNG does, and it is where weak implementations crash, leak, or stall.

THE BAD DAY · RECONNECT STORM

What an outage actually does to the BNG

all at once

thousands of CPEs reconnect within seconds

Mass teardown of dead sessions and a flood of new authentications collide.
Address pools, routing engine and session tables all hammered simultaneously.
A crash or stall here turns one outage into a longer second outage.

THE DESIGN GOAL

Absorb the storm, stay up, recover fast

bounded

orderly recovery, no crash, no leak

Never crash on a churn race; never leak memory, descriptors or routes.
Keep accepting new sessions while old ones drain in the background.
Transient blips (a brief fibre flap) must not falsely drop subscribers.

Why this is the right thing to harden: subscribers forgive a short outage; they do not forgive an outage that the BNG itself prolongs by crashing while everyone is trying to get back online. Every change in this release is aimed at that moment.

2 · Crash-class eliminated — pool use-after-free under churn

The most important fix in this cycle removes an entire category of crash. Under heavy connect/disconnect churn — the exact reconnect-storm condition — the IPv4/IPv6 address-pool lease handling could touch memory that had just been freed (a use-after-free), and the service could crash. Three related defects were found and fixed across the lease-claim, pending-lease-wakeup and lease-create paths. This was the root cause of crashes seen on a production node.

BEFORE

Use-after-free on the lease path

crash

possible during a reconnect storm

Many subscribers race for addresses at the same instant.
A freed lease could be touched on the wakeup / claim path.
Worst case: the control plane restarts mid-storm.

AFTER

Lease lifecycle made race-safe

stays up

soak-tested through repeated storms

Lease memory is cleanly owned across claim, wakeup and create.
No freed-memory access under maximal churn.
Validated through a 4-hour soak of repeated mass reconnects.

Operator impact: the BNGSOFT BNG service stays up through reconnect storms instead of restarting. One avoided crash during a peak-hour outage can spare thousands of subscribers a second outage — and spare the NOC a 2 a.m. escalation.
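The source does not say how the lease lifecycle was made race-safe. As a hedged illustration only, one common way to make a late wakeup harmless is to hand out generation-checked handles instead of direct references, so a stale handle is rejected rather than dereferenced. Python cannot express a real use-after-free, so this models the logic only; `LeasePool`, `LeaseHandle` and the addresses are hypothetical names, not BNGSOFT code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    address: str
    generation: int = 0           # bumped on every release
    owner: Optional[str] = None   # session id currently holding the lease


@dataclass(frozen=True)
class LeaseHandle:
    index: int
    generation: int


class LeasePool:
    """Hypothetical pool: callers hold handles, never direct slot references."""

    def __init__(self, addresses):
        self._slots = [Slot(a) for a in addresses]

    def claim(self, session_id: str) -> Optional[LeaseHandle]:
        for i, slot in enumerate(self._slots):
            if slot.owner is None:
                slot.owner = session_id
                return LeaseHandle(i, slot.generation)
        return None  # pool exhausted

    def resolve(self, handle: LeaseHandle) -> Optional[Slot]:
        """Return the slot only if the handle still refers to a live lease."""
        slot = self._slots[handle.index]
        if slot.generation != handle.generation or slot.owner is None:
            return None  # stale: the lease was released (and maybe reused)
        return slot

    def release(self, handle: LeaseHandle) -> bool:
        slot = self.resolve(handle)
        if slot is None:
            return False  # double release or stale handle: ignored safely
        slot.owner = None
        slot.generation += 1
        return True


pool = LeasePool(["100.64.0.1", "100.64.0.2"])
h1 = pool.claim("sess-A")
pool.release(h1)              # subscriber drops mid-storm
h2 = pool.claim("sess-B")     # the same slot is reused by another subscriber
stale = pool.resolve(h1)      # a late wakeup arrives for sess-A
current = pool.resolve(h2)
```

In this sketch the late wakeup for `sess-A` resolves to nothing instead of touching the lease now owned by `sess-B`, which is the property the claim, wakeup and create paths need under churn.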

3 · IPv6 made cheap — no routing-engine churn per subscriber

Standard practice assigns an address to every subscriber's virtual interface, which makes the kernel create a connected route and arm address-configuration timers — and floods the routing engine with add/remove events every time a subscriber connects or disconnects. The new IPv6-unnumbered mode installs the subscriber's /64 as a lightweight device route instead. The subscriber still gets standard IPv6 auto-configuration and prefix delegation — nothing changes on their side — but the per-subscriber routing churn disappears.
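To make the difference concrete, the sketch below builds the iproute2 command lines that correspond to each mode, without running them. It is an assumption that the product behaves like these commands; a BNG would more likely program the kernel over netlink directly. The interface name, the documentation prefix and the choice of `::1` as the gateway address are illustrative.

```python
import ipaddress


def standard_mode_cmds(ifname, prefix):
    """Address per interface: the kernel then adds a connected route itself."""
    net = ipaddress.IPv6Network(prefix)
    gateway = net.network_address + 1   # assumption: BNG takes ::1 in the /64
    return [["ip", "-6", "addr", "add", f"{gateway}/{net.prefixlen}", "dev", ifname]]


def unnumbered_mode_cmds(ifname, prefix):
    """Unnumbered: no interface address, just one device route for the /64."""
    net = ipaddress.IPv6Network(prefix)
    return [["ip", "-6", "route", "add", str(net), "dev", ifname]]


standard = standard_mode_cmds("ppp0", "2001:db8:0:1::/64")
unnumbered = unnumbered_mode_cmds("ppp0", "2001:db8:0:1::/64")
```

Both variants make the /64 reachable through the subscriber interface. The unnumbered form skips the address object, so there is no kernel-generated connected route or address-configuration timer to create and tear down on every connect and disconnect.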

Figure: Per-subscriber IPv6 control-plane cost, standard vs unnumbered. Relative routing-engine work generated per subscriber connect/disconnect; lower is better.

Standard (address per interface): address + connected route + addrconf churn. Relative cost: high.
IPv6-unnumbered (device route): one device route. Relative cost: low.

Unnumbered removes the interface address, the connected route and the address-configuration timer.

SUBSCRIBER

No change at all.

Same dual-stack service, same SLAAC, same delegated prefix.
Fewer micro-interruptions during network-wide events.

ROUTING ENGINE

Stays calm during churn.

No connected-route add/remove storm on connect/disconnect.
Cleaner table; the flap-time CPU spikes go away.

DENSITY

More subs per box.

Control plane saturates later; existing hardware lasts longer.
Runs cooler at the same subscriber count.

4 · Faster online — IPv4 the instant the session starts

Dual-stack bring-up used to gate the IPv4 service behind the slower IPv6 prefix-delegation handshake. Early IPv4 binding decouples them: the subscriber's IPv4 forwarding and accounting are armed at session start, not after IPv6 finishes negotiating.

BEFORE

IPv4 waited on IPv6.

IPv4 path armed only after the DHCPv6 prefix-delegation exchange completed.
A short "session active but no IPv4 yet" window on every connect.
Worse after a storm, when everything negotiates at once.

AFTER

IPv4 armed at session start.

IPv4 forwarding + accounting active immediately, independent of IPv6.
Faster time-to-first-byte — the v4-first device is online a beat sooner.
Clean accounting start; no "active but offline" gap.
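The ordering change can be sketched as a tiny session state machine. This is a simplified model under assumed names (`Session`, `on_session_start`, `on_pd_complete`), not the product's code; it only shows which event arms the IPv4 path in each mode.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    sid: str
    ipv4_armed: bool = False
    ipv6_armed: bool = False
    log: list = field(default_factory=list)

    def _arm_ipv4(self):
        self.ipv4_armed = True
        self.log.append("ipv4-armed")

    def on_session_start(self, early_ipv4: bool):
        self.log.append("session-start")
        if early_ipv4:
            self._arm_ipv4()          # new behaviour: armed immediately

    def on_pd_complete(self):
        self.ipv6_armed = True
        self.log.append("ipv6-armed")
        if not self.ipv4_armed:       # old behaviour: IPv4 gated on DHCPv6-PD
            self._arm_ipv4()


before = Session("old-ordering")
before.on_session_start(early_ipv4=False)
gap_before = not before.ipv4_armed    # "session active but no IPv4 yet"

after = Session("early-binding")
after.on_session_start(early_ipv4=True)
gap_after = not after.ipv4_armed

before.on_pd_complete()
after.on_pd_complete()
```

With early binding the window between session start and usable IPv4 disappears, and the IPv6 side completes on its own schedule.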

5 · Robustness — survives a failed disk, smooth under storms

Two more changes harden the service against the messy realities of field hardware and storm load.

DISK-FAILURE-SAFE LOGGING

A bad log disk can't stop service.

If the log directory is missing or the disk fails, the service warns and keeps retrying in the background — it does not fail.
Logging is fully decoupled from the forwarding / session path.
Self-rotating, self-pruning logs that never fill the partition.
The BNG boots and serves subscribers even if /log is unmounted after a disk fault overnight.
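A minimal sketch of the idea, assuming a bounded in-memory backlog and a separate flush step that a background loop would call. `ResilientLog`, the backlog size and the file name are hypothetical; the source does not describe the actual mechanism beyond "warns and keeps retrying in the background".

```python
import collections
import os
import tempfile


class ResilientLog:
    def __init__(self, path, max_pending=1000):
        self.path = path
        # Bounded backlog: when full, the oldest lines are dropped.
        self.pending = collections.deque(maxlen=max_pending)
        self.healthy = True

    def log(self, line):
        """Called on the session path: only enqueues, never touches the disk."""
        self.pending.append(line)

    def flush_once(self):
        """Called from a background loop; returns True if nothing is left to write."""
        if not self.pending:
            return True
        try:
            with open(self.path, "a", encoding="utf-8") as fh:
                fh.write("\n".join(self.pending) + "\n")
        except OSError:
            self.healthy = False      # warn and retry later; service continues
            return False
        self.pending.clear()
        self.healthy = True
        return True


root = tempfile.mkdtemp()
log_dir = os.path.join(root, "log")               # does not exist yet
logger = ResilientLog(os.path.join(log_dir, "bng.log"))
logger.log("session up")
first = logger.flush_once()                       # directory missing: fails quietly
os.mkdir(log_dir)                                 # the log target comes back
second = logger.flush_once()                      # backlog is written
```

The design choice that matters is that `log()` cannot raise or block on storage, so a missing directory only affects the flush step. A partially failed write could duplicate lines on retry in this simplified version.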

ROUTE-OPERATION FUNNEL

Routes drain off the critical path.

Route add/remove (a globally-serialised kernel operation) is asynchronously decoupled from session processing.
Breaks the cascade where a burst of route changes stalls everything behind it.
Worker threads keep accepting new sessions while routes drain in the background.
Lower latency spikes and more even CPU during a reconnect storm.
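The funnel pattern described above can be sketched as a single drain thread fed by a queue. This is an illustration under assumptions: `RouteFunnel` and the `apply_route` callback are hypothetical, and the real service programs kernel routes rather than appending to a list.

```python
import queue
import threading


class RouteFunnel:
    """Session workers enqueue route ops; one thread applies them in order."""

    def __init__(self, apply_route):
        self._ops = queue.Queue()
        self._apply = apply_route
        self.errors = []
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, action, prefix, ifname):
        """Returns immediately; the caller never waits on the kernel."""
        self._ops.put((action, prefix, ifname))

    def _drain(self):
        while True:
            op = self._ops.get()
            if op is None:                # sentinel from close()
                return
            try:
                self._apply(*op)
            except Exception as exc:      # one bad route must not stop the drain
                self.errors.append((op, exc))

    def close(self):
        self._ops.put(None)
        self._worker.join()


applied = []
funnel = RouteFunnel(lambda action, prefix, ifname: applied.append((action, prefix, ifname)))
for n in range(3):
    funnel.submit("add", f"2001:db8:0:{n}::/64", f"ppp{n}")
funnel.close()                            # waits for the backlog to drain
```

Because route programming is serialised anyway, funnelling it through one consumer costs no throughput, while session workers stop queueing behind each other for the same lock.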

6 · Newer kernel base — 7.1.1, validated at 10k subscribers

The platform moved from the 7.0.x kernel series to 7.1.1, carrying forward staggered batch teardown of subscriber interfaces so that thousands of PPP/IPoE interfaces are cleaned up in groups instead of one at a time. The new base was validated end-to-end (establish, hold, pass traffic, tear down) at 10,000 subscribers on both virtual (vmxnet3) and bare-metal (Intel ixgbe) NICs, with zero driver or module errors.

Why it matters: a modern, supported kernel base means a longer support runway and fewer driver surprises — and the batched interface teardown means the BNG clears dead sessions and accepts reconnections faster after a mass outage.
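The batching idea can be shown in a few lines. The batch size of 256 and the optional pause are assumptions picked for illustration; the source does not state the group size or the stagger interval, and the real teardown happens in the kernel base rather than in a user-space loop like this.

```python
import time
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def staggered_teardown(ifnames, delete_batch, batch_size=256, pause_s=0.0):
    """Delete interfaces group by group; returns the number of batches issued."""
    count = 0
    for chunk in batches(ifnames, batch_size):
        delete_batch(chunk)           # one grouped operation per batch
        count += 1
        if pause_s:
            time.sleep(pause_s)       # stagger so other work can interleave
    return count


sizes = []
interfaces = [f"ppp{n}" for n in range(10_000)]
n_batches = staggered_teardown(interfaces, lambda chunk: sizes.append(len(chunk)))
```

With these assumed numbers, 10,000 interfaces are removed in 40 grouped operations instead of 10,000 individual ones.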

7 · Proven, not just shipped — the validation

None of the above relies on "it compiled". Every change was exercised under load that mirrors a real bad day, monitoring every counter across the whole stack — session service, XDP data plane, routing engine, and system memory / CPU.

10k LOAD TEST

Full lifecycle at scale.

10,000 subscribers established, held, passed traffic, torn down.
Virtual + bare-metal NICs; zero module/driver errors on the new kernel.

TEARDOWN PROFILING

Measured, leak-checked.

Profiled exactly where time goes when 10k drop at once.
Verified no leaks — memory, file descriptors or routes — after the dust settled.

8k × 4h CHURN SOAK

Repeated simulated outages.

Power cuts, fibre cuts and brief flaps every ~15 minutes for hours.
Watched for crashes, memory drift, stuck sessions and route leaks.

Inside the soak — the counters we watch through every simulated outage

root@bench:~# soak-monitor --since power-cut
8,000-subscriber soak — counters across one power-cut -> recovery cycle

  phase         active  mem   fd    routes     crash  status
  -----------  -------  ----- ----- --------- ------  -------
  steady          8000  flat  flat  ~48k           0  OK
  power-cut    8000->0  flat  flat  48k->~0        0  DRAIN
  re-ramp      0->8000  flat  flat  ~0->48k        0  RECOVER
  steady          8000  flat  flat  ~48k           0  OK

Across the full cycle: service memory & file descriptors flat, routes return to baseline (no leak), process never restarted

ILLUSTRATIVE FORMAT: representative of the soak-monitor view across one simulated power-cut and recovery on an 8,000-subscriber bench. Subscriber count drains to zero and re-ramps to full, while service memory, file-descriptor count and the route table return to baseline with the process never crashing. Values are bench-derived and deployment-dependent; see the closing note.

The point of the soak: a feature that works once in a demo is not the same as one you can put in front of paying subscribers. Repeating the worst moment — the mass reconnect — for hours, while watching for the smallest drift in memory, descriptors or routes, is how we know it holds.
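A leak check of this kind reduces to comparing counters before an outage and after recovery. The sketch below is a hypothetical helper, not the soak-monitor tool itself: the 2% tolerance and the file-descriptor figure are made-up illustration values, while the 8,000 subscribers and roughly 48k routes come from the bench view above.

```python
def leak_report(baseline, after, tolerance=0.02):
    """Return the counters that did not come back to within tolerance of baseline."""
    drifted = {}
    for name, base in baseline.items():
        now = after[name]
        limit = max(1, base * tolerance)
        if abs(now - base) > limit:
            drifted[name] = (base, now)
    return drifted


# A 4-hour soak with an outage roughly every 15 minutes is about 16 cycles.
cycles = (4 * 60) // 15

baseline = {"active": 8000, "routes": 48000, "fds": 8200}   # fds is hypothetical
clean = leak_report(baseline, {"active": 8000, "routes": 48100, "fds": 8200})
leaky = leak_report(baseline, {"active": 8000, "routes": 49500, "fds": 8200})
```

Running the comparison after every cycle, rather than once at the end, is what exposes slow drift: a small per-cycle route or descriptor leak shows up as a counter that never quite returns to baseline.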

Reliability & Resilience — buyer value

✓ No self-inflicted second outage. A crash class under reconnect storms is eliminated, so the BNG stays up while everyone gets back online.

↻ Faster outage recovery. Batched teardown and off-critical-path route draining mean dead sessions clear and reconnections complete sooner.

⬇ More subscribers per box. IPv6-unnumbered removes per-subscriber routing churn, so the control plane saturates later on the same hardware.

⚡ Faster time-to-online. Early IPv4 binding gets the common case onto the internet a beat sooner, which means fewer "slow to come up" tickets.

🛡 Tolerates failed field hardware. A failed or missing log disk can no longer stop or slow subscriber service, which means one less truck roll.

★ Proven under a 4-hour churn soak. Every change was validated at 10k load and through hours of simulated power/fibre cuts, with zero crashes or leaks.

The bottom line

This release hardened the BNGSOFT BNG service for the moment that actually hurts subscribers: the reconnect storm after a power cut or fibre cut. We eliminated a use-after-free crash class, made per-subscriber IPv6 cost almost nothing on the routing engine, got IPv4 online at session start, made the service survive a failed log disk, and moved to a 7.1.1 kernel base — then validated all of it at 10,000 subscribers and through an 8,000-subscriber, 4-hour power-cut / fibre-cut soak with zero crashes and zero leaks.

For the operator: more subscribers per box, fewer emergency call-outs, faster outage recovery, cooler hardware. For the subscriber: faster to get online, fewer drops, outages that end sooner. For the business: lower opex per subscriber, fewer truck rolls, and a reliability story you can sell.

Methodology and honest framing: The figures on this page are derived from BNGSOFT bench validation and production experience and are representative, not contractual.

Validation load: approximately 10,000 subscribers were established, held, passed traffic and torn down on a single box on the 7.1.1 kernel across virtual (vmxnet3) and bare-metal (Intel ixgbe) NICs; exact behaviour depends on traffic mix, NIC model and hardware.

Churn soak: an 8,000-subscriber bench was subjected to repeated simulated power cuts, fibre cuts and brief flaps over a multi-hour run, monitoring subscriber counts, control-plane memory and file-descriptor counts, routing-engine and data-plane memory, route-table size and process liveness; "no crashes or leaks" refers to that monitored bench run.

Crash-class fix: three related use-after-free defects in the IPv4/IPv6 address-pool lease handling (claim, pending-lease wakeup, lease-create) were identified and fixed; the "before" crash was observed under churn on a production node.

IPv6-unnumbered: installs the subscriber /64 as a device route rather than an interface address, removing the kernel connected route and address-configuration timer while preserving standard SLAAC and prefix delegation; the chart depicts the qualitative reduction in per-subscriber routing-engine work, not a single measured ratio.

Early IPv4 binding: the IPv4 binding is pushed at session start (accounting-start) rather than after the DHCPv6 prefix-delegation exchange.

Disk-failure-safe logging: the high-speed log path retries in the background and never fails session/forwarding service if the log target is missing or fails.

Route-operation funnel: kernel route programming is serialised and asynchronously decoupled from session processing.

Illustrative output: the soak-monitor terminal block is marked ILLUSTRATIVE FORMAT; it represents the shape of the monitored counters across a power-cut/recovery cycle using bench-derived, deployment-dependent values rather than a single verbatim capture.

The "BNGSOFT BNG service" refers to the subscriber/session control plane (PPPoE/IPoE, RADIUS/AAA, address pools and routing); the XDP data plane is a separate component. Production rollout proceeds box-by-box during approved maintenance windows. Prepared as a management and operations overview for large-scale operators. Reliability and resilience characteristics described are of the BNGSOFT XDP BNG product.