
ARP and the L2/L3 Boundary: Why Your VIP Failover Sometimes Takes Four Minutes

March 11, 2026

ARP exists because IP addresses and Ethernet MAC addresses live in different namespaces. IP tells you which host you want to reach. The MAC tells the local network where on the wire to actually drop the frame. Neither one is sufficient on its own, and neither one knows the other.

When Host A wants to talk to Host B on the same subnet, it knows B's IP from DNS or config. It still does not know B's MAC. So it shouts. The ARP request is an L2 broadcast that asks every device on the LAN "who owns this IP?" Only B answers, and it answers with a unicast ARP reply containing its MAC. A caches the mapping and starts sending frames directly. How long the entry lives depends on the operating system or device: it ranges from under a minute on some hosts to hours on some network gear.
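The request described above is small enough to build by hand. The following is a minimal sketch, not a production tool: the function name, MAC, and IP addresses are made up for illustration, and it only constructs the bytes rather than sending them (putting a raw frame on the wire needs a privileged socket).

```python
import socket
import struct

BROADCAST_MAC = bytes.fromhex("ffffffffffff")
ETHERTYPE_ARP = 0x0806


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Return a raw Ethernet frame carrying an ARP who-has request."""
    # Ethernet header: destination MAC (broadcast), source MAC, EtherType.
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, sender_mac, ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode 1 = request
        sender_mac,
        socket.inet_aton(sender_ip),
        bytes(6),                     # target MAC is unknown, so all zeros
        socket.inet_aton(target_ip),
    )
    return eth_header + arp_payload


# Hypothetical addresses for Host A and Host B.
frame = build_arp_request(bytes.fromhex("020000000001"), "10.0.0.10", "10.0.0.20")
print(len(frame), frame[:6].hex())  # 42 ffffffffffff
```

The destination MAC in the Ethernet header is what makes this a broadcast. The target MAC field inside the ARP payload is left as zeros because that is the value A is asking for.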

If B sits on a different subnet, A never ARPs for B. It ARPs for the default gateway's MAC and hands the frame to the router. The router strips the L2 header, picks the next hop, ARPs there, and forwards. The IP destination stays constant across the path. The MAC header is rewritten at every hop. That is the rule worth memorizing: MAC is hop-by-hop, IP is end-to-end.
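The on-link versus off-link decision can be sketched in a few lines. This is an illustration under simplifying assumptions: a single interface, a single default gateway, and a hypothetical helper name. A real host consults its full routing table.

```python
import ipaddress


def arp_target(local_iface: str, dst_ip: str, gateway_ip: str) -> str:
    """Return the IP whose MAC the sender must resolve before transmitting."""
    iface = ipaddress.ip_interface(local_iface)
    dst = ipaddress.ip_address(dst_ip)
    # On-link destination: resolve it directly. Otherwise resolve the gateway.
    return dst_ip if dst in iface.network else gateway_ip


print(arp_target("10.0.0.10/24", "10.0.0.20", "10.0.0.1"))  # 10.0.0.20
print(arp_target("10.0.0.10/24", "192.0.2.7", "10.0.0.1"))  # 10.0.0.1
```

In both cases the IP header still carries the real destination. Only the address being resolved to a MAC changes.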

This still matters in virtualized infrastructure. An EC2 instance ARPs for the VPC router. A Kubernetes pod ARPs for its CNI bridge or veth gateway. The names change, the protocol does not.

The production failure that catches teams: a VIP-based HA pair fails over. The new primary sends a Gratuitous ARP, an unsolicited broadcast announcing "this IP now lives at this MAC." Upstream switches and routers are supposed to update their CAM and ARP caches immediately. In one such failover, an upstream switch had a security policy that quietly dropped unsolicited ARP replies to prevent ARP poisoning. It kept the dead node's MAC pinned for the full ARP cache timeout of four minutes. Traffic kept flowing to the corpse. Health checks said the new primary was up. Customers said the site was down.
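What distinguishes a Gratuitous ARP on the wire is that the sender IP and target IP are the same address. The sketch below builds one common form (a broadcast ARP reply) and includes the kind of check a filtering device could apply. The function names, MAC, and VIP are hypothetical, and real failover tools may use the request form or a different target MAC.

```python
import socket
import struct

BROADCAST_MAC = bytes.fromhex("ffffffffffff")


def build_gratuitous_arp(new_mac: bytes, vip: str) -> bytes:
    """Return a broadcast ARP reply announcing that `vip` lives at `new_mac`."""
    vip_bytes = socket.inet_aton(vip)
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, new_mac, 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1, 0x0800, 6, 4,           # Ethernet, IPv4, address lengths
        2,                         # opcode 2 = reply, sent unsolicited
        new_mac, vip_bytes,        # sender: the new primary claims the VIP
        BROADCAST_MAC, vip_bytes,  # target IP equals sender IP
    )
    return eth_header + arp_payload


def is_gratuitous(frame: bytes) -> bool:
    """True if the frame is ARP and its sender IP equals its target IP."""
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return False
    sender_ip, target_ip = frame[28:32], frame[38:42]
    return sender_ip == target_ip


garp = build_gratuitous_arp(bytes.fromhex("020000000002"), "10.0.0.100")
print(is_gratuitous(garp))  # True
```

A device that drops every frame matching a check like `is_gratuitous` cannot tell a legitimate failover from a poisoning attempt, which is why the failure above is silent.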

The fix has two parts. First, audit upstream gear for "Dynamic ARP Inspection" or "ARP guard" settings that suppress GARP, and either trust the failover source or shorten the ARP cache TTL aggressively. Second, do not assume L2 announcements alone are enough: force a signal that clients will notice, such as sending TCP RSTs or briefly toggling the VIP interface, so existing connections tear down and clients reconnect instead of hanging on a dead path.
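The two options in the first part can be compared with a toy model. This is a deliberately simplified sketch with a hypothetical function name: it assumes a single upstream cache entry that expires a fixed TTL after it was last learned, and it ignores behaviors such as entries being refreshed by traffic.

```python
def blackhole_seconds(cache_ttl: float, entry_age_at_failover: float,
                      accepts_garp: bool) -> float:
    """Seconds an upstream device keeps forwarding to the dead node's MAC."""
    if accepts_garp:
        # The GARP overwrites the stale entry at failover time.
        return 0.0
    # Otherwise the stale entry survives until its TTL runs out.
    return max(cache_ttl - entry_age_at_failover, 0.0)


# Worst case: the entry was refreshed just before the failover (age 0).
print(blackhole_seconds(240, 0, accepts_garp=False))  # 240.0
print(blackhole_seconds(240, 0, accepts_garp=True))   # 0.0
print(blackhole_seconds(30, 0, accepts_garp=False))   # 30.0
```

Trusting the failover source removes the window entirely. Shortening the TTL only caps it, so the worst case is still one full TTL of lost traffic.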

ARP is invisible until it is the only thing that matters.

Key takeaway

ARP is the glue between L3 routing and L2 delivery. Failovers depend on Gratuitous ARP to refresh upstream caches, and any device that quietly ignores unsolicited ARP turns your sub-second cutover into a four-minute outage.

Originally posted on LinkedIn.