The Loop That Wasn't on the Switches

#networking #cisco #productivity

How a brand-new firewall took down international company for 37 minutes — and why LACP told us everything was fine.
It started the way these things always start: monitoring lighting up with BGP neighbor flaps on a pair of edge routers. Not one neighbor — several. iBGP between the routers, eBGP towards external peers, all of it bouncing on hold-time expiry. And then, just to make the morning more interesting, HSRP decided both routers should be Active at the same time.
When everything on a router starts failing at once, my first instinct is: the router is probably fine. Something is starving it.
Reading the router logs
I pulled syslog from both routers and sorted the noise. Two things jumped out immediately.
First, both routers were logging the same message every 30 seconds, for over half an hour:

%PUNT_INJECT-5-DROP_PUNT_CAUSE: punt policer drops packets,
cause: arp (0x7) from TenGigabitEthernetX/X/X.NN src mac xxxx.xxxx.xxxx

Same subinterface, same source MAC, on both routers. The punt policer was drowning in ARP. (Side note: that log is rate-limited to one entry per 30 seconds — the actual drop rate was continuous. Don't let throttled logs fool you into thinking something is intermittent.)
Second, the QFP load alarm fired on both boxes, and the traffic snapshot next to it was the real smoking gun:

%IOSXE_QFP-2-LOAD_EXCEED: Slot: 0, QFP:0, Load 83% exceeds threshold 80%.
5 secs traffic rate on QFP:
Total Input: ~2.7 Mpps (~1.4 Gbps)
Total Output: ~3 kpps (~2.5 Mbps)

Millions of packets per second in, basically nothing out.
That asymmetry is a signature. Legitimate traffic gets forwarded — input and output roughly track each other. Traffic that arrives but never leaves is traffic the router has to consume: broadcasts, punted control-plane packets. Combine that with the ARP drops and you have a classic broadcast storm.
And a broadcast storm of one MAC's ARP frames, hitting two routers at once, for 37 straight minutes? That smells like a Layer 2 loop.
Everything else suddenly made sense as collateral damage. The BGP keepalives were getting dropped by the same punt policer that was fighting the ARP flood — hence the hold-time expirations. The HSRP hellos drowned too — hence both routers going Active (split-brain). The routers weren't broken. They were deaf.

Confirming it on the switch

If you suspect a loop, the switches will usually tell you. Sure enough:

%SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xxxx.xxxx in vlan NN
is flapping between port Te2/0/2 and port Po1

A MAC flapping between two ports means the same frame is arriving from two directions. Here, Te2/0/2 was the back-to-back trunk to the second switch, and Po1 was a new LACP port-channel to a freshly deployed firewall. The firewall's own MAC was among the flapping ones — its frames were coming in both directly and via the other switch.
Draw that out and you get a triangle:

switch-1 ──Po1── firewall ──Po1── switch-2
└───────── inter-switch trunk ─────────┘

Three legs, all carrying the same VLAN. A closed ring.

The twist: LACP said everything was fine

Here's where it got interesting. My first theory was the classic mistake: a single LACP bundle from the firewall split across two independent switches (these were standalone chassis — no StackWise Virtual, so no legitimate cross-chassis LAG).
So I checked:

switch-1# show lacp neighbor
Port Dev ID
Te2/0/9 xxxx.xxxx.xx5c
Te2/0/10 xxxx.xxxx.xx5c

switch-2# show lacp neighbor
Port Dev ID
Te1/0/9 xxxx.xxxx.xx5d
Te1/0/10 xxxx.xxxx.xx5d

Two different partner system IDs — consecutive MACs, one chassis, two separate aggregate interfaces. Each switch had a perfectly healthy, perfectly legitimate LAG. LACP had nothing to complain about, because nothing about either bundle was wrong.
Which left exactly one place the loop could close: inside the firewall.
And that's what it was. The firewall was supposed to be a routed device. But during deployment, both aggregates had been placed into a software-switch construct — effectively bridging them together at Layer 2.
The "router" was quietly behaving as a transparent cable between the two switches.
Why didn't spanning tree save us? Because this firewall vendor (like most) drops BPDUs by default. Each switch looked at its port-channel, saw no bridge on the other side, and kept it forwarding. STP can't block a ring it can't see. The loop ran until a human broke it.
The timeline lined up perfectly, too: the storm started at the exact minute the second aggregate came up during the new firewall's deployment.

The fix

Containment first: pick one path for that VLAN between the switches. We pruned the VLAN from the inter-switch trunk, which opened the ring instantly. (Validate before you prune — know what's actually using that VLAN across the trunk, or you'll trade a loop for a black hole.)
Then the real fix, on the firewall:

Dissolve the software switch. A routed firewall has no business bridging two upstream aggregates. Remove the references, delete the bridge, get two standalone interfaces back.
One active L2 attachment. Since you can't span a LAG across two independent switches, the firewall keeps its full LACP bundle to one switch. The unused bundle on the other switch gets shut and deconfigured so nobody re-cables the triangle six months from now. (Active/standby redundant-interface designs or a proper cross-chassis LAG with stacked switches are the longer-term options — that's a design decision, not an incident-night improvisation.)
Put the VLAN back on the inter-switch trunk — it's now the only path between the switches again, so no ring is possible.
Storm control on every leg. storm-control broadcast with a sane pps threshold on the port-channels and the trunk. Baseline your normal broadcast rate first, then set the limit. The next loop should throttle itself in seconds, not eat the control plane for half an hour.

Post-change validation: no MAC flaps for 30 minutes, punt drop counters flat, QFP load back to single digits, BGP and HSRP stable and where they're supposed to be.

What I'm taking away from this

"Input huge, output tiny" is a one-line diagnosis. When a router's ingress is megapackets and egress is kilopackets, stop looking at routing protocols. The protocols are victims. Look for what's flooding.
A healthy LACP state proves nothing about loops. LACP validates each bundle in isolation. Two flawless bundles can still form a flawless ring. The protocol did its job perfectly — the topology was the bug.
Firewalls eat BPDUs. Any time a firewall, hypervisor vswitch, or "transparent" appliance bridges between two of your switches, your spanning tree is blind there. Either make sure there's exactly one path by design, or explicitly enable BPDU forwarding on the middlebox — and treat that as the fallback, not the plan.
Alert on causes, not just symptoms. We had alarms for BGP flaps within a minute. We had nothing watching MACFLAP_NOTIF or punt-policer drops — the two messages that named the actual problem (and the VLAN!) from the very first minute. That's fixed now.
Rate-limited logs lie about intensity. One log line per 30 seconds looked almost calm. It represented a multi-million-pps flood.
The routers flapped. The protocols screamed. But the issue was a quiet little checkbox on a brand-new firewall, bridging two interfaces that were never meant to meet.

Loops rarely live where the symptoms are.

DEV Community

The Loop That Wasn't on the Switches

Top comments (0)