Network Sockets and the TCP/IP Stack: How Data Travels Across Networks

A comprehensive exploration of network programming internals, from socket system calls through the TCP/IP protocol stack to the network interface. Understand connection establishment, flow control, and the kernel's role in networking.
Every web request, database query, and API call travels through the network stack. The simple act of opening a connection hides layers of protocol machinery handling packet routing, reliable delivery, congestion control, and flow management. Understanding how sockets work and how data traverses the TCP/IP stack illuminates why networks behave as they do and how to build efficient networked applications.
1. The Network Stack Overview
Before diving into details, let’s see the complete picture.
1.1 The OSI and TCP/IP Models
OSI Model TCP/IP Model Examples
┌─────────────────┐ ┌─────────────────┐
│ Application │ │ │ HTTP, DNS, SSH
├─────────────────┤ │ Application │
│ Presentation │ │ │ TLS, JSON, XML
├─────────────────┤ │ │
│ Session │ │ │ Sockets API
├─────────────────┤ ├─────────────────┤
│ Transport │ │ Transport │ TCP, UDP
├─────────────────┤ ├─────────────────┤
│ Network │ │ Internet │ IP, ICMP
├─────────────────┤ ├─────────────────┤
│ Data Link │ │ │ Ethernet, WiFi
├─────────────────┤ │ Link │
│ Physical │ │ │ Cables, Radio
└─────────────────┘ └─────────────────┘
Data flow (sending):
Application → Transport → Network → Link → Physical → Wire
Data flow (receiving):
Wire → Physical → Link → Network → Transport → Application
1.2 Encapsulation
Each layer wraps data in its own header:
Application data:
┌─────────────────────────────────────────────────────────┐
│ HTTP Request │
│ "GET /index.html HTTP/1.1\r\nHost: example.com..." │
└─────────────────────────────────────────────────────────┘
Transport layer (TCP) adds header:
┌──────────────┬─────────────────────────────────────────┐
│ TCP Header │ HTTP Request │
│ (20 bytes) │ │
└──────────────┴─────────────────────────────────────────┘
Network layer (IP) adds header:
┌──────────────┬──────────────┬──────────────────────────┐
│ IP Header │ TCP Header │ HTTP Request │
│ (20 bytes) │ (20 bytes) │ │
└──────────────┴──────────────┴──────────────────────────┘
Link layer (Ethernet) adds header and trailer:
┌──────────────┬──────────────┬──────────────┬────────────────────┬─────────┐
│ Eth Header │ IP Header │ TCP Header │ HTTP Request │Eth Trail│
│ (14 bytes) │ (20 bytes) │ (20 bytes) │ │(4 bytes)│
└──────────────┴──────────────┴──────────────┴────────────────────┴─────────┘
This complete unit is a "frame" ready for transmission.
1.3 Kernel Networking Components
Linux network stack architecture:
┌─────────────────────────────────────────────────────────────┐
│ User Space │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Application │ │ Application │ │ Application │ │
│ │ (browser) │ │ (server) │ │ (database) │ │
│ └──────┬───────┘ └──────┬───────┘ └────────┬─────────┘ │
└─────────┼─────────────────┼───────────────────┼─────────────┘
│ Socket API │ │
──────────┼─────────────────┼───────────────────┼──────────────
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ Kernel Space │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Socket Layer │ │
│ │ (struct socket, file descriptor) │ │
│ └──────────────────────────┬───────────────────────────┘ │
│ ┌──────────────────────────┴───────────────────────────┐ │
│ │ Protocol Layer (TCP/UDP) │ │
│ │ (struct sock, connection state, buffers) │ │
│ └──────────────────────────┬───────────────────────────┘ │
│ ┌──────────────────────────┴───────────────────────────┐ │
│ │ IP Layer │ │
│ │ (routing, fragmentation) │ │
│ └──────────────────────────┬───────────────────────────┘ │
│ ┌──────────────────────────┴───────────────────────────┐ │
│ │ Network Device Layer │ │
│ │ (driver interface, queues) │ │
│ └──────────────────────────┬───────────────────────────┘ │
└─────────────────────────────┼───────────────────────────────┘
▼
┌─────────────────┐
│ NIC Hardware │
└─────────────────┘
2. The Socket API
Sockets provide the programming interface to the network.
2.1 Socket System Calls
// Server side
int server_fd = socket(AF_INET, SOCK_STREAM, 0); // Create socket
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(8080),
.sin_addr.s_addr = INADDR_ANY
};
bind(server_fd, (struct sockaddr*)&addr, sizeof(addr)); // Bind to port
listen(server_fd, 128); // Mark as listening, backlog = 128
int client_fd = accept(server_fd, NULL, NULL); // Accept connection
// client_fd is a new socket for this connection
char buffer[1024];
ssize_t n = read(client_fd, buffer, sizeof(buffer)); // Receive data
write(client_fd, "Hello", 5); // Send data
close(client_fd); // Close connection
close(server_fd); // Close listening socket
// Client side
int sock = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in server = {
.sin_family = AF_INET,
.sin_port = htons(8080)
};
inet_pton(AF_INET, "192.168.1.1", &server.sin_addr);
connect(sock, (struct sockaddr*)&server, sizeof(server)); // Connect
write(sock, "Hello", 5); // Send
read(sock, buffer, sizeof(buffer)); // Receive
close(sock);
2.2 Socket Data Structures
Kernel socket structures:
struct socket (VFS interface):
┌────────────────────────────────────────────────────────┐
│ state: SS_CONNECTED │
│ type: SOCK_STREAM │
│ flags: various options │
│ ops: pointer to protocol operations │
│ sk: pointer to struct sock (protocol layer) │
│ file: pointer to struct file (for fd) │
└────────────────────────────────────────────────────────┘
│
▼
struct sock (protocol layer):
┌────────────────────────────────────────────────────────┐
│ Protocol state (TCP: ESTABLISHED, etc.) │
│ Source/destination addresses and ports │
│ Send buffer (sk_write_queue) │
│ Receive buffer (sk_receive_queue) │
│ Timers (retransmit, keepalive, etc.) │
│ Congestion control state │
│ Window sizes │
│ Sequence numbers │
└────────────────────────────────────────────────────────┘
struct sk_buff (packet buffer):
┌────────────────────────────────────────────────────────┐
│ Pointers: head, data, tail, end │
│ Protocol headers at various offsets │
│ Reference count │
│ Device reference │
│ Timestamp │
│ Actual packet data follows │
└────────────────────────────────────────────────────────┘
2.3 File Descriptor Integration
Sockets are file descriptors:
Process file descriptor table:
┌─────┬───────────────────────────────────────────┐
│ fd │ struct file* │
├─────┼───────────────────────────────────────────┤
│ 0 │ → stdin (terminal) │
│ 1 │ → stdout (terminal) │
│ 2 │ → stderr (terminal) │
│ 3 │ → socket (struct socket → TCP connection) │
│ 4 │ → regular file (/tmp/data.txt) │
│ 5 │ → socket (listening server socket) │
└─────┴───────────────────────────────────────────┘
Because sockets are fds:
- read()/write() work on sockets
- select()/poll()/epoll work on sockets
- Can pass sockets between processes (via Unix sockets)
- close() closes the connection
- dup()/dup2() create socket aliases
Socket-specific operations:
- send()/recv() with flags
- sendto()/recvfrom() for UDP
- sendmsg()/recvmsg() for advanced use
- getsockopt()/setsockopt() for options
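As a small illustration of the option interface listed above, the following sketch (assuming an already-connected TCP socket) enables keepalive probes and reads back the actual receive buffer size:
#include <sys/socket.h>
#include <stdio.h>

// Sketch: toggle one socket option, then query another back.
// 'sock' is assumed to be a connected TCP socket created elsewhere.
void show_socket_options(int sock) {
    int keepalive = 1;
    // Enable periodic keepalive probes on an idle connection
    setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive));

    int rcvbuf = 0;
    socklen_t len = sizeof(rcvbuf);
    // Ask the kernel how large the receive buffer actually is
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("SO_RCVBUF = %d bytes\n", rcvbuf);
}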
2.4 Blocking vs Non-Blocking
// Blocking (default): calls wait until complete
read(sock, buf, len); // Blocks until data available
// Non-blocking: calls return immediately
int flags = fcntl(sock, F_GETFL, 0);
fcntl(sock, F_SETFL, flags | O_NONBLOCK);
ssize_t n = read(sock, buf, len);
if (n == -1 && errno == EAGAIN) {
// No data available, try again later
}
// Or set at socket creation
int sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
// Common pattern with epoll:
int epfd = epoll_create1(0);
struct epoll_event ev = {
.events = EPOLLIN | EPOLLET, // Edge-triggered
.data.fd = sock
};
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
#define MAX_EVENTS 64 // example capacity
struct epoll_event events[MAX_EVENTS];
int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nfds; i++) {
handle_event(events[i]);
}
3. TCP Connection Lifecycle
Understanding TCP’s reliable connection protocol.
3.1 Three-Way Handshake
Connection establishment:
Client Server
│ │
│ ──────── SYN (seq=x) ──────────► │
│ │
│ ◄─── SYN-ACK (seq=y, ack=x+1) ── │
│ │
│ ──────── ACK (ack=y+1) ────────► │
│ │
ESTABLISHED ESTABLISHED
Sequence numbers:
- Client picks random initial sequence number (ISN): x
- Server picks its own ISN: y
- Each side acknowledges the other's ISN + 1
- Prevents old duplicate packets from being accepted
SYN queue and accept queue:
┌─────────────────────────────────────────────────────────┐
│ Server Kernel │
│ │
│ Incoming SYN: │
│ ┌─────────────────────────────────┐ │
│ │ SYN Queue (half-open) │ ← SYN received │
│ │ Connection 1 (SYN_RECV) │ SYN-ACK sent │
│ │ Connection 2 (SYN_RECV) │ │
│ │ Connection 3 (SYN_RECV) │ │
│ └─────────────────────────────────┘ │
│ │ ACK received │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Accept Queue (established) │ │
│ │ Connection A (ESTABLISHED) │ ← Ready for │
│ │ Connection B (ESTABLISHED) │ accept() │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ accept() returns │
└─────────────────────────────────────────────────────────┘
3.2 Data Transfer
Reliable data transfer:
Sender Receiver
│ │
│ ─── Data (seq=1000, len=1000) ──────► │
│ │
│ ◄─────── ACK (ack=2000) ──────────── │
│ │
│ ─── Data (seq=2000, len=1000) ──────► │
│ ─── Data (seq=3000, len=1000) ──────► │
│ ─── Data (seq=4000, len=1000) ──────► │
│ │
│ ◄─────── ACK (ack=5000) ──────────── │ (cumulative)
│ │
Cumulative acknowledgment:
- ACK number = next expected byte
- ACK 5000 means "received all bytes up to 4999"
- Multiple segments can be ACKed with single ACK
Retransmission:
- Sender sets timer for each segment
- If ACK not received in time, retransmit
- Exponential backoff on repeated failures
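Retransmission happens entirely inside the kernel, but an application can observe it. A hedged sketch using the Linux-specific TCP_INFO socket option to read the smoothed RTT and retransmit count for a connection:
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdio.h>

// Sketch: query per-connection TCP statistics from the kernel (Linux).
// 'sock' is assumed to be an established TCP socket.
void print_tcp_stats(int sock) {
    struct tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
        printf("rtt: %u us, retransmits so far: %u\n",
               info.tcpi_rtt, info.tcpi_total_retrans);
    }
}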
3.3 Connection Termination
Four-way handshake for graceful close:
Client Server
│ │
│ ───────── FIN (seq=x) ─────────► │
│ │
FIN_WAIT_1 CLOSE_WAIT
│ │
│ ◄──────── ACK (ack=x+1) ──────── │
│ │
FIN_WAIT_2 (Server may send more data)
│ │
│ ◄──────── FIN (seq=y) ────────── │
│ │
TIME_WAIT LAST_ACK
│ │
│ ───────── ACK (ack=y+1) ────────► │
│ │
(wait 2×MSL) CLOSED
│ │
CLOSED │
TIME_WAIT:
- Lasts 2 × Maximum Segment Lifetime (fixed at 60 seconds total on Linux)
- Ensures final ACK reaches server
- Prevents old duplicate packets from being accepted
- Can cause "address already in use" on quick restart
- SO_REUSEADDR socket option helps
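A minimal sketch of the SO_REUSEADDR workaround, for a server that needs to rebind its port immediately after a restart while old connections linger in TIME_WAIT:
int server_fd = socket(AF_INET, SOCK_STREAM, 0);
int one = 1;
// Allow bind() to succeed even if old connections on this port
// are still lingering in TIME_WAIT
setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
// ... then bind() and listen() as in section 2.1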
3.4 TCP State Machine
Complete TCP state diagram:
┌──────────────┐
│ CLOSED │
└──────┬───────┘
passive open │ active open
(listen) │ (connect)
┌───────────┴───────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ LISTEN │ │ SYN_SENT │
└──────┬───────┘ └──────┬───────┘
rcv SYN │ │ rcv SYN-ACK
send SYN-ACK │ send ACK
│ ┌─────────────┘
▼ ▼
┌──────────────────────┐
│ SYN_RECEIVED │
└──────────┬───────────┘
rcv ACK │
▼
┌──────────────────────┐
│ ESTABLISHED │ ← Normal data transfer
└──────────┬───────────┘
close │ rcv FIN
send FIN │ send ACK
┌─────────────────┴─────────────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ FIN_WAIT_1 │ │ CLOSE_WAIT │
└──────┬───────┘ └──────┬───────┘
│ │ close
▼ │ send FIN
┌──────────────┐ ┌──────┴───────┐
│ FIN_WAIT_2 │ │ LAST_ACK │
└──────┬───────┘ └──────┬───────┘
rcv FIN │ rcv ACK │
send ACK│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ TIME_WAIT │ │ CLOSED │
└──────┬───────┘ └──────────────┘
timeout │
▼
┌──────────────┐
│ CLOSED │
└──────────────┘
4. TCP Flow Control and Congestion Control
Preventing overwhelming receivers and networks.
4.1 Flow Control: Receive Window
Receive window prevents sender from overwhelming receiver:
Receiver advertises available buffer space:
┌─────────────────────────────────────────────────────────┐
│ Receiver Buffer (64KB) │
│ ┌──────────────────────────┬───────────────────────┐ │
│ │ Data waiting for │ Available space │ │
│ │ application read │ (receive window) │ │
│ │ 32KB │ 32KB │ │
│ └──────────────────────────┴───────────────────────┘ │
│ │
│ ACK packet includes: Window = 32768 │
└─────────────────────────────────────────────────────────┘
Sender side:
┌─────────────────────────────────────────────────────────┐
│ Must keep: unacked_data + in_flight ≤ receiver_window │
│ │
│ If window = 0: Stop sending, probe periodically │
│ As receiver reads data: Window opens, ACK sent │
└─────────────────────────────────────────────────────────┘
Window scaling (for high bandwidth-delay networks):
- Original window: 16 bits = max 64KB
- Window scale option: Multiply by 2^scale
- Scale up to 14 = 64KB × 16384 = 1GB window
4.2 Congestion Control
Preventing network congestion collapse:
Congestion window (cwnd): Sender-side limit
- Independent of receive window
- Grows when network seems uncongested
- Shrinks on packet loss (sign of congestion)
Effective window = min(cwnd, receive_window)
Classic algorithms:
Slow Start:
- Initial cwnd = 1-10 MSS (Maximum Segment Size)
- Double cwnd each RTT (exponential growth)
- Until: loss occurs OR cwnd > ssthresh
Congestion Avoidance:
- Linear growth: cwnd += 1 per RTT
- After loss: ssthresh = cwnd/2, restart
┌─────────────────────────────────────────────────────────┐
│ cwnd │
│ │ ×(loss) │
│ │ ╱ │ │
│ │ ╱ │ │
│ │ ╱ │ Linear │
│ │ ╱ │ ╱ │
│ │ ╱ │ ╱ │
│ │ ╱ Slow start ╱ │
│ └──────────────────────────────────────────── time │
│ Exponential growth ssthresh │
└─────────────────────────────────────────────────────────┘
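To make the two growth phases concrete, here is a toy model (in MSS units, not the kernel's real implementation) of slow start, congestion avoidance, and the multiplicative decrease after a simulated loss:
#include <stdio.h>

// Toy model of classic TCP window growth, in units of MSS.
// Real kernels track this in bytes with many refinements (SACK,
// fast recovery, pacing); this only illustrates the two phases.
int main(void) {
    double cwnd = 10;       // initial window (RFC 6928 allows up to 10 MSS)
    double ssthresh = 64;   // slow-start threshold
    for (int rtt = 1; rtt <= 20; rtt++) {
        if (cwnd < ssthresh)
            cwnd *= 2;      // slow start: exponential growth per RTT
        else
            cwnd += 1;      // congestion avoidance: linear growth per RTT
        if (rtt == 12) {    // pretend a loss is detected here
            ssthresh = cwnd / 2;
            cwnd = ssthresh; // multiplicative decrease
        }
        printf("RTT %2d: cwnd = %.0f MSS (ssthresh = %.0f)\n", rtt, cwnd, ssthresh);
    }
    return 0;
}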
4.3 Modern Congestion Control
CUBIC (Linux default):
- Designed for high bandwidth-delay product networks
- Cubic function of time since last loss
- More aggressive window growth than classic TCP
- Used by most Linux servers
BBR (Bottleneck Bandwidth and RTT):
- Developed by Google
- Model-based rather than loss-based
- Estimates bandwidth and RTT
- Maintains low queue occupancy
- Better performance on lossy links
Comparison:
┌───────────┬────────────┬───────┬─────────────────────┐
│ Algorithm │ Loss-based │ Queue │ Fairness            │
├───────────┼────────────┼───────┼─────────────────────┤
│ Reno      │ Yes        │ High  │ Poor with others    │
│ CUBIC     │ Yes        │ High  │ Good with CUBIC     │
│ BBR       │ No         │ Low   │ Issues with others  │
└───────────┴────────────┴───────┴─────────────────────┘
Setting congestion control:
sysctl net.ipv4.tcp_congestion_control=bbr
# Or per-socket:
setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
4.4 Nagle’s Algorithm and Delayed ACKs
Nagle's algorithm (reduce small packets):
- If there's unacknowledged data AND new data is small:
Buffer it until ACK arrives or buffer fills
- Prevents sending many tiny packets
Delayed ACKs:
- Don't ACK every packet immediately
- Wait up to 40ms for more data to piggyback ACK
- Or ACK after every 2 full-size segments
The interaction problem:
┌─────────────────────────────────────────────────────────┐
│ Client (Nagle) Server (Delayed ACK) │
│ │
│ send(100 bytes) ──────► (received, waiting for more) │
│ send(100 bytes) │ │
│ │ waiting for ACK │ waiting 40ms to ACK │
│ │ │ │
│ ▼ ▼ │
│ 40ms delay before next data sent! │
└─────────────────────────────────────────────────────────┘
Solutions:
- TCP_NODELAY: Disable Nagle (send immediately)
- TCP_QUICKACK: Disable delayed ACK
- Batch writes: Larger writes avoid the issue
int one = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
5. The IP Layer
Network layer handles addressing and routing.
5.1 IP Packet Structure
IPv4 Header (20-60 bytes):
┌─────────────────────────────────────────────────────────┐
│ 0 1 2 │
│ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 ... │
│ ├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤ │
│ │Version│ IHL │ DSCP │ECN│ Total Length │ │
│ ├───────────────┼───────────┴───┴─────────────────────┤ │
│ │ Identification │Flags│ Frag Offset │ │
│ ├───────────────┼─────────────────┴───────────────────┤ │
│ │ TTL │ Protocol │ Header Checksum │ │
│ ├───────────────┴─────────────────┴───────────────────┤ │
│ │ Source Address │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ Destination Address │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ Options (if any) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Key fields:
- Version: 4 for IPv4, 6 for IPv6
- TTL: Decremented at each hop, prevents loops
- Protocol: 6 = TCP, 17 = UDP, 1 = ICMP
- Addresses: 32-bit source and destination
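For illustration, a sketch that reads these key fields from a raw packet buffer using Linux's struct iphdr; it assumes the buffer begins at the IP header (e.g. captured from a raw socket) and does only a minimal length check:
#include <netinet/ip.h>   // struct iphdr (Linux)
#include <arpa/inet.h>    // ntohs()
#include <stdio.h>
#include <stddef.h>

// Sketch: print key IPv4 header fields from a packet buffer.
void dump_ip_header(const unsigned char *pkt, size_t len) {
    if (len < sizeof(struct iphdr))
        return;                          // too short to be an IPv4 header
    const struct iphdr *ip = (const struct iphdr *)pkt;
    printf("version=%u header_len=%u ttl=%u proto=%u total_len=%u\n",
           (unsigned)ip->version,
           (unsigned)(ip->ihl * 4),      // IHL is counted in 32-bit words
           (unsigned)ip->ttl,
           (unsigned)ip->protocol,
           ntohs(ip->tot_len));
}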
5.2 IP Fragmentation
Large packets may need fragmentation:
Original packet (4000 bytes data, MTU = 1500):
┌────────────────────────────────────────────────────────┐
│ IP Header │ 4000 bytes of data │
└────────────────────────────────────────────────────────┘
Fragmented into 3 packets:
┌─────────────────────────────┐
│ IP │ Frag 0: 1480 bytes │ Offset = 0, MF = 1
└─────────────────────────────┘
┌─────────────────────────────┐
│ IP │ Frag 1: 1480 bytes │ Offset = 1480, MF = 1
└─────────────────────────────┘
┌─────────────────────────────┐
│ IP │ Frag 2: 1040 bytes │ Offset = 2960, MF = 0
└─────────────────────────────┘
Problems with fragmentation:
- Receiver must reassemble (buffer, timeout)
- One lost fragment = entire packet lost
- Security issues (fragment attacks)
- Performance overhead
Modern approach: Path MTU Discovery
- Set "Don't Fragment" (DF) bit
- If a router can't forward, it sends ICMP "Fragmentation Needed" (ICMPv6: "Packet Too Big")
- Sender reduces packet size
- Avoids fragmentation in network
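On Linux an application can opt into strict Path MTU Discovery and read back the discovered path MTU on a connected socket. A sketch using the Linux-specific IP_MTU_DISCOVER and IP_MTU options:
#include <netinet/in.h>   // IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU (Linux)
#include <sys/socket.h>
#include <stdio.h>

// Sketch: enable "Don't Fragment" PMTU discovery and read the path MTU.
// 'sock' is assumed to be a connected socket.
void check_path_mtu(int sock) {
    int pmtu_mode = IP_PMTUDISC_DO;   // set DF, never fragment locally
    setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu_mode, sizeof(pmtu_mode));

    int mtu = 0;
    socklen_t len = sizeof(mtu);
    if (getsockopt(sock, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
        printf("current path MTU: %d bytes\n", mtu);
}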
5.3 Routing
Routing table lookup:
$ ip route show
default via 192.168.1.1 dev eth0
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.100
10.0.0.0/8 via 192.168.1.254 dev eth0
For destination 10.5.3.2:
1. Check most specific matching prefix
2. 10.0.0.0/8 matches (8 bits)
3. Next hop: 192.168.1.254 via eth0
Routing decision in kernel:
┌─────────────────────────────────────────────────────────┐
│ Packet arrives │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Is dest local? │ │
│ └─────────┬─────────┘ │
│ Yes │ No │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ Deliver locally Route lookup │
│ (pass up stack) (forward out interface) │
└─────────────────────────────────────────────────────────┘
5.4 ARP: Address Resolution
Mapping IP addresses to MAC addresses:
Host wants to send to 192.168.1.1:
1. Check ARP cache: ip neigh show
2. If not found, broadcast ARP request:
"Who has 192.168.1.1? Tell 192.168.1.100"
3. Owner responds (unicast):
"192.168.1.1 is at aa:bb:cc:dd:ee:ff"
4. Cache the mapping, send packet
ARP cache:
$ ip neigh show
192.168.1.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
192.168.1.50 dev eth0 lladdr 11:22:33:44:55:66 STALE
States:
- REACHABLE: Recently confirmed
- STALE: May need refresh
- INCOMPLETE: ARP request sent, waiting
- FAILED: No response received
6. UDP: The Simple Alternative
When you don’t need TCP’s guarantees.
6.1 UDP Characteristics
UDP Header (8 bytes only!):
┌─────────────────────────────────────────────────────────┐
│ Source Port │ Destination Port │
├──────────────────────────┼──────────────────────────────┤
│ Length │ Checksum │
└─────────────────────────────────────────────────────────┘
UDP provides:
✓ Multiplexing (ports)
✓ Optional checksum
✗ No connection setup
✗ No reliability (no retransmission)
✗ No ordering guarantee
✗ No flow control
✗ No congestion control
UDP code:
int sock = socket(AF_INET, SOCK_DGRAM, 0); // DGRAM = UDP
// No connect needed (but can use connect for default dest)
sendto(sock, data, len, 0, (struct sockaddr*)&dest_addr, addr_len);
recvfrom(sock, buf, size, 0, (struct sockaddr*)&src_addr, &addr_len);
6.2 When to Use UDP
Good use cases for UDP:
DNS queries:
- Small request/response
- Timeout and retry at application level
- Connection setup overhead > actual data
Video/audio streaming:
- Real-time, can't wait for retransmissions
- Missing frame less bad than delayed frame
- Application handles packet loss gracefully
Gaming:
- Low latency critical
- Old state updates can be skipped
- Game protocol handles reliability if needed
QUIC (HTTP/3):
- UDP-based transport
- Implements reliability on top
- Avoids TCP head-of-line blocking
- Faster connection establishment
6.3 UDP Reliability When Needed
Building reliability on UDP:
Application-level acknowledgment:
┌─────────────────────────────────────────────────────────┐
│ struct packet { │
│ uint32_t sequence; // Packet number │
│ uint32_t ack; // What we've received │
│ uint16_t flags; // Control flags │
│ uint8_t data[]; // Payload │
│ }; │
│ │
│ Sender: │
│ - Track sent packets and timestamps │
│ - Retransmit if no ACK within timeout │
│ - Implement own congestion control │
│ │
│ Receiver: │
│ - Track received sequence numbers │
│ - Reorder if needed │
│ - Send ACKs (possibly with SACK) │
└─────────────────────────────────────────────────────────┘
Libraries: QUIC, KCP, ENet, Reliable UDP implementations
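A minimal stop-and-wait sketch of the sender side, assuming a packet layout like the one above and a peer that echoes back the sequence number it received; real protocols pipeline many packets and add their own congestion control:
#include <sys/select.h>
#include <sys/socket.h>
#include <string.h>
#include <stdint.h>

// Sketch: send one datagram reliably over UDP (stop-and-wait).
// Assumes 'sock' is a UDP socket already connect()ed to the peer.
int send_reliable(int sock, uint32_t seq, const void *data, size_t len) {
    unsigned char pkt[1500];
    memcpy(pkt, &seq, sizeof(seq));              // 4-byte sequence header
    memcpy(pkt + sizeof(seq), data, len);

    for (int attempt = 0; attempt < 5; attempt++) {
        send(sock, pkt, sizeof(seq) + len, 0);   // (re)transmit

        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        struct timeval timeout = { .tv_sec = 0, .tv_usec = 200000 }; // 200 ms RTO
        if (select(sock + 1, &rfds, NULL, NULL, &timeout) > 0) {
            uint32_t ack;
            if (recv(sock, &ack, sizeof(ack), 0) == sizeof(ack) && ack == seq)
                return 0;                        // acknowledged
        }
        // timeout or wrong ACK: loop and retransmit
    }
    return -1;                                   // gave up after 5 attempts
}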
6.4 Multicast and Broadcast
UDP supports one-to-many communication:
Broadcast (same subnet only):
struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
addr.sin_addr.s_addr = INADDR_BROADCAST; // 255.255.255.255
int broadcast = 1;
setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &broadcast, sizeof(broadcast));
sendto(sock, data, len, 0, (struct sockaddr*)&addr, sizeof(addr));
Multicast (across networks):
// Join a multicast group
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
mreq.imr_interface.s_addr = INADDR_ANY;
setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
// Send to multicast group
struct sockaddr_in mcast_addr = { .sin_family = AF_INET, .sin_port = htons(port) };
mcast_addr.sin_addr.s_addr = inet_addr("239.0.0.1");
sendto(sock, data, len, 0, (struct sockaddr*)&mcast_addr, sizeof(mcast_addr));
Use cases:
- Service discovery (mDNS, SSDP)
- Live video/audio streaming
- Financial data feeds
- Cluster heartbeats
IGMP (Internet Group Management Protocol):
- Hosts tell routers which groups they want
- Routers only forward multicast to interested subnets
- Reduces unnecessary network traffic
7. Network Performance
Understanding and optimizing network throughput.
7.1 Bandwidth-Delay Product
BDP = Bandwidth × Round-Trip Time
Example:
100 Mbps link, 50ms RTT
BDP = 100,000,000 bits/s × 0.050 s = 5,000,000 bits = 625 KB
What this means:
- 625 KB of data can be "in flight" at any time
- Need send/receive buffers at least this large
- TCP window must be at least BDP for full utilization
If buffer < BDP:
┌─────────────────────────────────────────────────────────┐
│ Sender Receiver │
│ ┌────────┐ ┌────────┐ │
│ │ Buffer │─── data ───────────►│ Buffer │ │
│ │ 64 KB │ │ 64 KB │ │
│ └────────┘◄── ACK ─────────────└────────┘ │
│ │
│ With 625KB BDP but 64KB buffer: │
│ Can only send 64KB before waiting for ACK │
│ Utilization: 64/625 ≈ 10% of available bandwidth! │
└─────────────────────────────────────────────────────────┘
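A small sketch of this arithmetic: compute the BDP for a path and compare it to the socket's actual send buffer (the bandwidth and RTT arguments are placeholders):
#include <sys/socket.h>
#include <stdio.h>

// Sketch: is the send buffer large enough for this path's BDP?
void check_bdp(int sock, double bandwidth_bps, double rtt_seconds) {
    double bdp_bytes = bandwidth_bps * rtt_seconds / 8.0;

    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

    printf("BDP = %.0f bytes, SO_SNDBUF = %d bytes\n", bdp_bytes, sndbuf);
    if (sndbuf < bdp_bytes)
        printf("buffer smaller than BDP: the link cannot be kept full\n");
}
// Example from above: check_bdp(sock, 100e6, 0.050) -> BDP = 625000 bytes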
7.2 Buffer Sizing
Socket buffer tuning:
# View current settings
sysctl net.core.rmem_max
sysctl net.core.wmem_max
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
# tcp_rmem and tcp_wmem: min, default, max
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
Per-socket setting:
int size = 4 * 1024 * 1024; // 4 MB
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
Auto-tuning:
- Linux automatically adjusts buffer sizes
- Based on observed RTT and bandwidth
- Generally leave auto-tuning enabled
- Note: setting SO_RCVBUF/SO_SNDBUF explicitly disables auto-tuning for that socket
- Manual tuning only for special cases (high BDP, etc.)
7.3 Connection Scalability
Handling many connections:
File descriptor limits:
$ ulimit -n # Soft limit
$ ulimit -Hn # Hard limit
Raise in /etc/security/limits.conf:
* soft nofile 65535
* hard nofile 65535
Memory per connection:
- Socket structure: ~2 KB
- Send buffer: 16 KB - 4 MB
- Receive buffer: 4 KB - 6 MB
10,000 connections × 200 KB average = 2 GB memory
Event handling efficiency:
select(): O(n) per call, limited to 1024 fds
poll(): O(n) per call, no fd limit
epoll(): O(1) per event, O(n) setup, scalable
For high connection counts: use epoll (Linux), kqueue (BSD)
7.4 Latency Optimization
Sources of network latency:
1. Serialization delay: packet_size / bandwidth
1500 bytes / 1 Gbps = 12 μs
2. Propagation delay: distance / signal speed in the medium
Coast-to-coast US round trip ≈ 40 ms over fiber
3. Queuing delay: Waiting in router/switch buffers
Variable, depends on congestion
4. Processing delay: Router/host packet processing
Usually negligible (~1 μs per hop)
Optimization strategies:
Reduce round trips:
- Connection pooling (reuse connections)
- TCP Fast Open (data in SYN)
- HTTP/2 multiplexing (one connection, many requests)
Reduce processing:
- Zero-copy where possible
- Kernel bypass (DPDK) or batched async syscalls (io_uring)
- Jumbo frames (reduce packet count)
Reduce queuing:
- ECN (Explicit Congestion Notification)
- BBR congestion control
- QoS prioritization
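As an example of carrying data in the SYN, a hedged sketch of TCP Fast Open on Linux (requires kernel support and net.ipv4.tcp_fastopen enabled; the client falls back to a normal handshake if it has no TFO cookie yet):
#include <netinet/in.h>
#include <netinet/tcp.h>   // TCP_FASTOPEN (Linux)
#include <sys/socket.h>
#include <sys/types.h>

// Server side: allow Fast Open on the listening socket.
void enable_tfo_server(int listen_fd) {
    int qlen = 16;   // max pending TFO requests
    setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
}

// Client side: carry the first payload in the SYN instead of calling
// connect() followed by write().
ssize_t tfo_send(int sock, const struct sockaddr_in *server,
                 const void *data, size_t len) {
    return sendto(sock, data, len, MSG_FASTOPEN,
                  (const struct sockaddr *)server, sizeof(*server));
}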
7.5 Zero-Copy Techniques
Avoiding memory copies for high throughput:
Traditional send path:
User buffer → Kernel socket buffer → NIC DMA buffer → Wire
(2 copies minimum)
sendfile() - File to socket without user space:
┌─────────────────────────────────────────────────────────┐
│ // Send file directly to socket │
│ sendfile(socket_fd, file_fd, &offset, count); │
│ │
│ File → Page cache → NIC DMA (with scatter-gather) │
│ Avoids: user buffer entirely! │
└─────────────────────────────────────────────────────────┘
splice() - Move data between file descriptors:
┌─────────────────────────────────────────────────────────┐
│ // Create pipe for zero-copy transfer │
│ int pipefd[2]; │
│ pipe(pipefd); │
│ │
│ // Move from socket to pipe │
│ splice(socket_in, NULL, pipefd[1], NULL, len, 0); │
│ │
│ // Move from pipe to file │
│ splice(pipefd[0], NULL, file_fd, NULL, len, 0); │
└─────────────────────────────────────────────────────────┘
MSG_ZEROCOPY for sends:
int one = 1;
setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
send(sock, buf, len, MSG_ZEROCOPY);
// Kernel uses page pinning, notifies via error queue when done
When zero-copy helps:
- Large data transfers (MB+)
- High bandwidth links
- CPU-bound scenarios
When it doesn't help:
- Small messages (overhead > savings)
- Already memory-bound
- Latency-critical (may add delay)
8. Kernel Bypass and High Performance
When the kernel isn’t fast enough.
8.1 The Kernel Overhead Problem
Traditional network path:
NIC → DMA to kernel buffer → Interrupt → softirq processing →
Protocol processing → Copy to user buffer → Wake application
Overhead sources:
1. Interrupts: 5-15 μs each
2. Context switches: 1-10 μs
3. Memory copies: Bandwidth limited
4. System calls: ~100 ns each
5. Protocol processing: CPU time
For 10-100 Gbps: Kernel can't keep up!
8.2 DPDK: Data Plane Development Kit
DPDK approach:
User space Kernel space
┌──────────────────────┐ ┌──────────────────┐
│ DPDK Application │ │ (bypassed) │
│ ┌──────────────┐ │ │ │
│ │ Poll Mode │ │ │ │
│ │ Driver │ │ │ │
│ └──────┬───────┘ │ │ │
└──────────┼───────────┘ └──────────────────┘
│ Direct access via huge pages
▼
┌──────────┐
│ NIC │
└──────────┘
Benefits:
- Zero kernel involvement
- No interrupts (busy polling)
- No memory copies (zero-copy)
- Millions of packets per second per core
Costs:
- Dedicated cores (100% CPU polling)
- Must implement protocols
- Lose kernel features (iptables, etc.)
- Complex programming model
8.3 io_uring for Networking
io_uring: Async I/O with kernel involvement
Submission Queue Completion Queue
┌─────────────┐ ┌─────────────┐
│ Entry 0 │ │ Result 0 │
│ Entry 1 │ ───────► │ Result 1 │
│ Entry 2 │ Kernel │ Result 2 │
└─────────────┘ process └─────────────┘
Advantages for networking:
- Batched system calls (fewer context switches)
- Zero-copy send/receive
- Registered buffers (avoid allocation)
- Linked operations (chains of I/O)
// Submit multiple operations at once (each SQE comes from the ring)
struct io_uring_sqe *sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe1, sock1, buf1, len1, 0);
struct io_uring_sqe *sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe2, sock2, buf2, len2, 0);
struct io_uring_sqe *sqe3 = io_uring_get_sqe(&ring);
io_uring_prep_send(sqe3, sock3, buf3, len3, 0);
io_uring_submit(&ring); // One syscall for three operations
8.4 XDP: eXpress Data Path
XDP: eBPF programs at NIC driver level
Packet arrives at NIC:
│
▼
┌───────────────────────┐
│ XDP Program (eBPF) │
│ │
│ Decision: │
│ - XDP_DROP │ ← Drop packet (fastest)
│ - XDP_PASS │ ← Normal kernel processing
│ - XDP_TX │ ← Bounce back out same NIC
│ - XDP_REDIRECT │ ← Send to different NIC/CPU
└───────────────────────┘
│
(if PASS)
▼
Normal stack
Use cases:
- DDoS mitigation (drop bad packets early)
- Load balancing
- Packet filtering
- Fast forwarding
Performance: Millions of packets per second
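A minimal XDP program sketch in eBPF C, compiled with clang and attached to an interface with a libbpf-based loader or ip link; it drops ICMP and passes everything else, including the bounds checks the eBPF verifier requires:
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Sketch: drop ICMP at the driver, pass everything else up the stack.
SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)          // bounds check for the verifier
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))   // only handle IPv4 here
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;                       // dropped before the stack sees it
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";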
9. Observability and Debugging
Tools for understanding network behavior.
9.1 Socket Statistics
# Connection state summary
ss -s
# Total: 1024 (kernel 0)
# TCP: 500 (estab 400, closed 50, orphaned 10, timewait 30)
# All TCP connections with details
ss -tan
# State Recv-Q Send-Q Local:Port Peer:Port
# ESTAB 0 0 10.0.0.1:443 10.0.0.2:54321
# Show socket memory usage
ss -tm
# Includes: skmem:(r,rb,t,tb,f,w,o,bl,d)
# r: receive queue, t: send queue, etc.
# Filter by state
ss -tan state established '( dport = :443 )'
9.2 Network Statistics
# Interface statistics
ip -s link show eth0
# RX: bytes packets errors dropped
# TX: bytes packets errors dropped
# Protocol statistics
netstat -s
# IP, TCP, UDP, ICMP statistics
# Detailed TCP stats
cat /proc/net/tcp
# Or parsed: ss --info
# Real-time monitoring
sar -n DEV 1 # Per-interface
sar -n TCP 1 # TCP stats
9.3 Packet Capture
# tcpdump: Command-line capture
tcpdump -i eth0 port 80
tcpdump -i any tcp and host 192.168.1.1
tcpdump -w capture.pcap # Save to file
# Wireshark: GUI analysis
wireshark capture.pcap
# tshark: Wireshark CLI
tshark -i eth0 -f "port 443"
# BPF-based tracing
tcpdump 'tcp[tcpflags] & (tcp-syn) != 0' # SYN packets
9.4 Connection Tracing
# strace for socket operations
strace -e network ./app
# Shows: socket, bind, connect, send, recv, etc.
# bpftrace for kernel events
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
printf("Retransmit: %s:%d\n",
ntop(args->saddr), args->dport);
}'
# ss for real-time connection info
watch -n1 'ss -tan | grep ESTAB'
# conntrack for NAT tracking
conntrack -L
# Shows all tracked connections through NAT
9.5 Performance Profiling
Identifying network bottlenecks:
Bandwidth testing:
iperf3 -s # Server
iperf3 -c server_ip -t 30 # Client, 30 second test
Results show:
- Throughput (Mbps or Gbps)
- Retransmits (TCP reliability issues)
- CPU usage (processing bottleneck)
Latency measurement:
ping -c 100 host # Basic RTT
mtr host # Traceroute with statistics
hping3 -S -p 80 host # TCP-based latency
Connection timing breakdown:
curl -w "@timing.txt" https://example.com -o /dev/null
timing.txt:
time_namelookup: %{time_namelookup}s
time_connect: %{time_connect}s
time_appconnect: %{time_appconnect}s (TLS)
time_pretransfer: %{time_pretransfer}s
time_starttransfer: %{time_starttransfer}s (first byte)
time_total: %{time_total}s
Flame graphs for network code:
perf record -g ./network_app
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# Shows where CPU time goes in network processing
10. Summary and Best Practices
Key concepts and practical guidance.
10.1 Core Concepts Review
Socket fundamentals:
✓ Sockets are file descriptors with network operations
✓ TCP provides reliable, ordered, connection-oriented streams
✓ UDP provides unreliable, unordered, connectionless datagrams
TCP mechanics:
✓ Three-way handshake establishes connection
✓ Sequence numbers enable ordering and reliability
✓ Flow control prevents receiver overload
✓ Congestion control prevents network overload
Performance factors:
✓ Bandwidth-delay product determines buffer needs
✓ RTT affects throughput (must wait for ACKs)
✓ Connection reuse avoids handshake overhead
✓ Kernel bypass for extreme performance
10.2 Practical Guidelines
For application developers:
1. Use connection pooling
- Reuse connections when possible
- Avoid setup/teardown overhead
2. Set TCP_NODELAY for interactive protocols
- Avoids Nagle delay for small messages
- Essential for request-response patterns
3. Handle partial reads and writes
- read() may return less than requested
- Loop until complete or EOF (see the sketch after this list)
4. Use non-blocking I/O with event loops
- Scales to many connections
- epoll on Linux, kqueue on BSD
5. Set appropriate timeouts
- Connection timeout
- Read/write timeouts
- Keepalive for idle connections
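The partial-write case from item 3, as a minimal sketch: write() on a socket may accept fewer bytes than requested, so the loop continues until everything is sent or a real error occurs.
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

// Sketch: send an entire buffer even if write() returns short counts.
ssize_t write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    size_t remaining = len;
    while (remaining > 0) {
        ssize_t n = write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;        // interrupted by a signal: retry
            return -1;           // real error (or EAGAIN on non-blocking sockets)
        }
        p += n;
        remaining -= n;
    }
    return (ssize_t)len;
}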
For system administrators:
1. Monitor connection states
- Many TIME_WAIT = rapid connection churn
- Many SYN_RECV = possible SYN flood
2. Tune buffer sizes for high BDP
- Increase tcp_rmem/tcp_wmem max
- Let auto-tuning work
3. Choose appropriate congestion control
- BBR for high latency links
- CUBIC for general use
10.3 Debugging Checklist
When investigating network issues:
□ Check connection state (ss -tan)
□ Verify routing (ip route get <dest>)
□ Test connectivity (ping, traceroute)
□ Check for packet loss (netstat -s | grep retrans)
□ Verify DNS resolution (dig, nslookup)
□ Capture packets (tcpdump) if needed
□ Check firewall rules (iptables -L)
□ Monitor bandwidth utilization
□ Check for buffer overflows (netstat -s)
□ Review application logs for timeouts
□ Examine socket options and buffer sizes
□ Profile CPU usage during network operations
□ Check for connection leaks (increasing fd count)
□ Verify MTU settings match across path
□ Test with different congestion control algorithms
10.4 Security Considerations
Network security fundamentals:
Socket-level protection:
┌─────────────────────────────────────────────────────────┐
│ Bind to specific interfaces: │
│ - INADDR_LOOPBACK for local-only services │
│ - Specific IP for multi-homed hosts │
│ │
│ Set socket options: │
│ - SO_REUSEADDR carefully (can mask issues) │
│ - TCP_DEFER_ACCEPT (delay accept until data) │
│ │
│ Validate client addresses: │
│ - Check source IP/port in accept() │
│ - Implement connection rate limiting │
└─────────────────────────────────────────────────────────┘
Common attack vectors:
SYN Flood:
- Attacker sends many SYNs, never completes handshake
- Fills SYN queue, prevents legitimate connections
- Mitigation: SYN cookies, increase backlog, rate limiting
TCP Reset Attacks:
- Inject RST packets to close connections
- Requires guessing sequence numbers
- Mitigation: Randomized ISN, TCP-AO authentication
Connection Hijacking:
- Predict sequence numbers to inject data
- Modern TCP uses random ISN to prevent
- Mitigation: TLS for application-layer security
Slow Read/Slowloris:
- Attacker reads data very slowly
- Ties up server resources
- Mitigation: Timeouts, connection limits
Kernel hardening:
sysctl net.ipv4.tcp_syncookies=1 # SYN flood protection
sysctl net.ipv4.tcp_max_syn_backlog=4096 # Larger SYN queue
sysctl net.core.somaxconn=4096 # Larger accept queue
Network programming bridges the gap between application logic and the physical reality of data transmission across wires and through the air. From the elegant abstraction of sockets through the intricate state machines of TCP to the raw speed of kernel bypass techniques, the network stack represents decades of engineering refinement. Understanding these internals empowers you to build efficient networked applications, diagnose mysterious connectivity problems, and optimize for the specific characteristics of your network environment. Whether you’re building real-time systems demanding microsecond latencies or web services handling millions of connections, the principles of the TCP/IP stack inform every packet that crosses the wire.