Overlay Network

Bridge and Host networks work seamlessly on a single machine. However, modern production architectures (like Docker Swarm or Kubernetes) operate across clusters where containers are distributed over multiple physical servers or virtual machines.

The Problem: How does a container on Server A communicate with a container on Server B as if they were physically plugged into the same local network switch, without exposing all container traffic to the public internet?

The answer is the Overlay Network.

1. Intuition: The “FedEx Envelope” Analogy

Imagine two corporate branch offices, one in New York and one in London. Employees in both offices use internal 4-digit extension numbers (e.g., x1234) to talk to each other.

If an employee in NY writes a physical memo to an employee in London, the postal service doesn’t understand “Deliver to x1234”.

  1. The NY mailroom takes the internal memo.
  2. They encapsulate it inside a standard FedEx envelope addressed to the London office’s public address.
  3. FedEx transports the envelope across the ocean.
  4. The London mailroom receives it, decapsulates (opens) the FedEx envelope, reads the internal extension x1234, and delivers the memo to the correct desk.

An Overlay Network works exactly like this mailroom. It creates a virtual, private network (the internal extensions) on top of the physical network (FedEx) using a technique called VXLAN (Virtual Extensible LAN).

2. Deep Dive: The Magic of VXLAN

Docker’s overlay networking is powered by VXLAN, an industry-standard tunneling protocol. VXLAN encapsulates Layer 2 Ethernet frames inside Layer 4 UDP packets.

The Anatomy of a VXLAN Packet

To understand the technical reality, we must look at the exact anatomy of what travels across the wire:

+-------------+------------+-------------+--------------+------------------+
|  Outer MAC  |  Outer IP  |  UDP (4789) | VXLAN Header |   INNER FRAME    |
| (Physical)  | (Physical) |             |    (VNI)     | Inner MAC & IP,  |
|             |            |             |              |     Payload      |
+-------------+------------+-------------+--------------+------------------+
  • Inner Frame: The original packet Container A wanted to send to Container B (e.g., 10.0.0.3 to 10.0.0.5).
  • VXLAN Header: Contains a 24-bit VXLAN Network Identifier (VNI). This acts like a VLAN tag, identifying which overlay a frame belongs to and allowing up to roughly 16.7 million (2^24) distinct isolated overlay networks, versus the classic 12-bit VLAN limit of 4,096.
  • Outer Headers: The physical IP addresses of Node A and Node B.
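The 8-byte VXLAN header layout described above can be sketched in a few lines of Python. This is an illustrative encoding exercise (the function name is ours, not a real library API), but the byte layout follows RFC 7348: a flags byte (0x08 means "VNI present"), three reserved bytes, the 24-bit VNI, and one final reserved byte.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header: flags, reserved, 24-bit VNI, reserved."""
    assert 0 <= vni < 2**24, "the VNI field is only 24 bits wide"
    flags = 0x08  # "I" bit set: a valid VNI is present
    return struct.pack("!B3s3sB", flags, b"\x00" * 3, vni.to_bytes(3, "big"), 0)

hdr = vxlan_header(5001)
assert len(hdr) == 8                             # exactly 8 bytes of overhead
assert int.from_bytes(hdr[4:7], "big") == 5001   # the VNI round-trips
print(f"Distinct overlay networks possible: {2**24:,}")  # 16,777,216
```

The 24-bit width of the VNI field is precisely where the "16 million networks" figure comes from.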

Control Plane vs. Data Plane

For Node A to know that Container B (10.0.0.5) lives on Node B (192.168.1.20), there must be a discovery mechanism.

  • Data Plane: The actual UDP VXLAN traffic carrying the encapsulated packets.
  • Control Plane: Docker Swarm manager nodes keep cluster state in an embedded Raft consensus store (similar in role to etcd). When a container starts, its overlay IP and MAC address mappings are distributed to the other nodes via an encrypted gossip protocol. Node A consults its local copy of this mapping to determine the Outer IP address for the UDP packet.
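The forwarding decision Node A makes can be sketched as a simple table lookup. This is a hypothetical model of the state the control plane distributes, not a real Docker data structure; all names here are illustrative.

```python
# Stand-in for the mapping table the control plane gossips to every node:
# container overlay IP -> (physical node IP, container MAC)
forwarding_table = {
    "10.0.0.3": ("192.168.1.10", "02:42:0a:00:00:03"),
    "10.0.0.5": ("192.168.1.20", "02:42:0a:00:00:05"),
}

def outer_destination(inner_dst_ip: str) -> str:
    """Resolve which physical node hosts a given overlay IP."""
    node_ip, _mac = forwarding_table[inner_dst_ip]
    return node_ip

# Container A (10.0.0.3) sends to Container B (10.0.0.5):
# the outer UDP packet must be addressed to Node B's physical IP.
assert outer_destination("10.0.0.5") == "192.168.1.20"
```

The key point: the data plane never floods the physical network to find Container B; the answer is already sitting in a locally cached table.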

3. Walkthrough: VXLAN Encapsulation

Tracing a single packet from Node A to Node B across the physical network:

  1. Container A (10.0.0.3) on Node A (physical IP 192.168.1.10) sends a packet to Container B (10.0.0.5).
  2. Node A's VTEP (VXLAN Tunnel Endpoint) encapsulates the frame inside a UDP packet addressed to Node B's physical IP, 192.168.1.20.
  3. The packet crosses the physical network like any ordinary UDP traffic.
  4. Node B's VTEP decapsulates it and delivers the original, untouched frame to Container B (10.0.0.5).
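The encapsulate/decapsulate round trip can be simulated with a toy model. This is a deliberately simplified sketch (the outer packet is a dict, not real wire bytes) meant only to show that the inner frame travels through the tunnel unmodified.

```python
VXLAN_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def encapsulate(inner_frame: bytes, vni: int, outer_src: str, outer_dst: str) -> dict:
    """Node A's VTEP: wrap the container's original frame for transit."""
    return {
        "outer_ip": (outer_src, outer_dst),  # physical node addresses
        "udp_dport": VXLAN_PORT,
        "vni": vni,
        "payload": inner_frame,              # inner frame carried untouched
    }

def decapsulate(packet: dict, expected_vni: int) -> bytes:
    """Node B's VTEP: strip the outer headers and recover the inner frame."""
    assert packet["udp_dport"] == VXLAN_PORT
    assert packet["vni"] == expected_vni     # wrong VNI -> wrong overlay, drop
    return packet["payload"]

inner = b"IP 10.0.0.3 -> 10.0.0.5 | GET /api"  # stand-in for Container A's frame
wire = encapsulate(inner, vni=5001, outer_src="192.168.1.10", outer_dst="192.168.1.20")
assert decapsulate(wire, expected_vni=5001) == inner  # frame survives intact
```

The VNI check in decapsulation is what keeps two overlays on the same physical hosts isolated from one another.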

4. Edge Cases & Hardware Realities: The MTU Trap

A critical architectural constraint of Overlay networks is MTU (Maximum Transmission Unit). The standard Ethernet MTU is 1500 bytes.

Because VXLAN encapsulation adds 50 bytes of headers (14 Ethernet + 20 IP + 8 UDP + 8 VXLAN), the inner packet must be smaller. Docker automatically sets the overlay network MTU to 1450 bytes to accommodate this overhead without exceeding the physical network’s 1500 byte limit.
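The 50-byte budget is worth checking term by term; the following sketch only restates the arithmetic from the paragraph above.

```python
# VXLAN encapsulation overhead, in bytes, header by header:
overhead = {
    "outer Ethernet": 14,
    "outer IP": 20,
    "outer UDP": 8,
    "VXLAN header": 8,
}
total = sum(overhead.values())
assert total == 50

PHYSICAL_MTU = 1500
overlay_mtu = PHYSICAL_MTU - total
assert overlay_mtu == 1450  # Docker's default MTU for overlay networks
```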

War Story: The Silent Drops

Scenario: A microservice that works on a local Bridge network fails intermittently when deployed to a Docker Swarm Overlay network. Small API requests succeed, but large JSON payloads time out.

The Root Cause: The physical network infrastructure (e.g., a restrictive cloud provider VPC or a physical router) was misconfigured to a lower MTU (e.g., 1400). The overlay network, assuming a 1500 byte physical MTU, sent 1450 byte payload packets. With the 50 byte overhead, the packets were 1500 bytes. The router silently dropped the packets because they exceeded its 1400 byte limit, and ICMP “Fragmentation Needed” messages were blocked by a firewall, breaking Path MTU Discovery.

The Fix: Explicitly configure the overlay network to use a smaller MTU (e.g., --opt com.docker.network.driver.mtu=1350).
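The diagnosis and the fix both reduce to one inequality: an inner packet survives only if its size plus the 50 bytes of outer headers (14 Ethernet + 20 IP + 8 UDP + 8 VXLAN) fits the physical path MTU. A small sketch (helper names are ours, purely illustrative):

```python
VXLAN_OVERHEAD = 50  # bytes: 14 Ethernet + 20 IP + 8 UDP + 8 VXLAN

def is_silently_dropped(inner_size: int, path_mtu: int) -> bool:
    """True when the encapsulated packet exceeds the path MTU
    (and ICMP 'Fragmentation Needed' is blocked, defeating PMTUD)."""
    return inner_size + VXLAN_OVERHEAD > path_mtu

def safe_overlay_mtu(path_mtu: int) -> int:
    """Largest inner MTU that still fits the physical path after encapsulation."""
    return path_mtu - VXLAN_OVERHEAD

# The war story: a 1400-byte path with Docker's default 1450-byte overlay MTU.
assert is_silently_dropped(1450, path_mtu=1400)      # large payloads vanish
assert not is_silently_dropped(1300, path_mtu=1400)  # small requests still work
assert safe_overlay_mtu(1400) == 1350                # hence mtu=1350 in the fix
```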

5. Creating an Overlay Network

To use an overlay network, Docker Swarm must first be initialized; this enables the control plane required for node discovery and state management.

# 1. Initialize Swarm on Node A (Becomes a Manager Node)
docker swarm init

# 2. Create the overlay network with an explicit subnet and MTU (optional but good practice)
docker network create -d overlay \
  --subnet=10.10.0.0/16 \
  --opt com.docker.network.driver.mtu=1450 \
  my-multi-host-net

# 3. Deploy a service on this network
docker service create \
  --name web \
  --network my-multi-host-net \
  --replicas 2 \
  nginx:alpine

Once running, containers on different nodes can transparently communicate using the service name (web); Docker's internal DNS resolves it to the service's virtual IP on the overlay network.

6. Summary

  • Overlay Networks: Bridge the physical gap between distinct servers, allowing containers to communicate seamlessly.
  • VXLAN Encapsulation: The underlying mechanism that wraps inner container traffic inside outer UDP packets for transit over the physical network.
  • Control & Data Planes: Swarm managers distribute the routing table (Control Plane), while the VTEP handles the real-time encapsulation (Data Plane).
  • MTU Overhead: Always account for the 50-byte VXLAN overhead when diagnosing network timeouts or packet loss in distributed systems.