Troubleshooting Tools
[!NOTE] This module explores the core principles of Troubleshooting Tools, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
The “Murder Mystery” of Networking
Imagine you’re the lead engineer at an e-commerce company during Black Friday. Suddenly, checkout requests start timing out. The servers are up, the database is healthy, but customers are seeing 504 Gateway Timeout errors. Where do you even begin?
Network troubleshooting is rarely a straight line. It’s a murder mystery where the culprit could be a severed fiber optic cable, a misconfigured BGP route, or an expired TLS certificate. To solve these mysteries, you need a structured methodology and a deep understanding of your investigative tools.
1. Troubleshooting Methodology
When a network goes down, panic and guessing are your worst enemies. Structured troubleshooting prevents you from checking the engine when the car just needs gas. The three standard approaches all use the OSI model as a map and differ only in where they start:
- Bottom-Up: Start at the Physical layer (Is it plugged in? Are there interface errors?). Best when you suspect hardware or local link failures.
- Top-Down: Start at the Application layer (Does the website load? Is DNS resolving?). Best for user-reported issues like “I can’t load the app.”
- Divide and Conquer: Start in the middle, typically at the Network layer (Layer 3 - Can I ping the gateway?). If the ping succeeds, layers 1-3 are fine, so look up. If it fails, look down.
War Story: The 500-Mile Email Limit
In a famous anecdote, a sysadmin was told that users couldn’t send email to destinations more than 500 miles away. It sounded like magic, but structured troubleshooting revealed the truth: an OS upgrade on the mail server had left the SMTP daemon with a connection timeout of effectively zero. In practice, each connection attempt was abandoned after a few milliseconds. On a fast network with little queuing delay, the speed of light then decided how far away a server could be and still answer in time: roughly 500 miles. Testing connections to mail servers at known distances confirmed the pattern and uncovered the physics of the problem.
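The arithmetic behind the anecdote can be checked in a few lines. This is a rough sketch that assumes an effective timeout of about 3 milliseconds (the figure usually quoted when the story is retold) and the speed of light in vacuum; the function name and the `velocity_factor` parameter are illustrative. Signals in fiber travel at roughly two thirds of that speed, so the real reach would be shorter.

```python
# Back-of-the-envelope check: how far can a signal travel before a timeout fires?
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
KM_PER_MILE = 1.609344


def reach_miles(timeout_ms: float, velocity_factor: float = 1.0) -> float:
    """Distance covered in timeout_ms at velocity_factor * c, in miles."""
    km = SPEED_OF_LIGHT_KM_PER_S * velocity_factor * (timeout_ms / 1000.0)
    return km / KM_PER_MILE


# Assumption: ~3 ms effective timeout, vacuum speed of light.
vacuum_reach = reach_miles(3.0)
print(f"{vacuum_reach:.0f} miles")  # a little under 560 miles
```

Passing `velocity_factor=0.67` gives a feel for how much shorter the reach becomes over real fiber.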
2. The Network Engineer’s Toolkit
Let’s break down the essential CLI tools, categorizing them by their primary function.
A. Connectivity & Pathing (ICMP)
- ping: Tests basic reachability and Round-Trip Time (RTT). It uses ICMP Echo Request and Echo Reply.
  - Analogy: Like a submarine’s sonar ping. You bounce a signal off a target and measure how long the echo takes.
  - Deep Dive: Modern firewalls often block ICMP. A failed ping doesn’t prove a host is down, only that it isn’t answering ICMP.
- traceroute (Linux/macOS) / tracert (Windows): Reveals the path packets take to a destination, hop by hop.
  - How it works: It exploits the TTL (Time To Live) field in the IP header. It sends probes with TTL=1, then TTL=2, and so on. Each router along the path decrements the TTL; the router that brings it to 0 drops the packet and sends back an “ICMP Time Exceeded” message, revealing its IP.
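The TTL trick is easier to see in a small simulation. The sketch below sends no real packets: `PATH` is a hypothetical list of three routers plus a destination, and `send_probe` and `traceroute` are illustrative names, not part of any library.

```python
from typing import List, Tuple

# Hypothetical path: three routers, then the destination host.
PATH = ["10.0.0.1", "172.16.5.1", "203.0.113.9", "198.51.100.20"]


def send_probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe: each router decrements TTL; the one that hits 0 replies."""
    for router_ip in path[:-1]:  # every entry except the last is a router
        ttl -= 1
        if ttl == 0:
            return router_ip, "time-exceeded"
    return path[-1], "reached"  # TTL survived all routers


def traceroute(path: List[str], max_hops: int = 30) -> List[str]:
    """Raise the TTL one step at a time and record who answers each probe."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, kind = send_probe(path, ttl)
        hops.append(responder)
        if kind == "reached":
            break
    return hops


for hop_number, ip in enumerate(traceroute(PATH), start=1):
    print(hop_number, ip)
```

Each loop iteration corresponds to one line of real traceroute output. A real tool also times each probe and has to cope with routers that never reply.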
B. Addressing & Routing (Layer 2/3)
- ip addr / ifconfig: Shows your local IP address, subnet mask, and interface state (UP/DOWN).
- ip route / netstat -rn: Displays the routing table, which tells the OS where to send packets destined for different networks.
  - Key Concept: The default route (0.0.0.0/0) points to the “Default Gateway”, the router that handles traffic not matched by a more specific route, i.e. anything not destined for the local subnet.
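How the OS picks a route can be sketched with longest-prefix matching, where the most specific matching prefix wins and 0.0.0.0/0 matches everything as a last resort. The routing table below is hypothetical and `lookup` is an illustrative helper; a real kernel also weighs metrics and policy rules.

```python
import ipaddress

# Hypothetical routing table: (destination prefix, next hop or "on-link").
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1"),      # default route via the default gateway
    ("192.168.1.0/24", "on-link"),     # the local subnet, delivered directly
    ("10.20.0.0/16", "192.168.1.254"), # a more specific route via another router
]


def lookup(dest: str, routes=ROUTES) -> str:
    """Return the next hop of the longest (most specific) matching prefix."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, next_hop))
    # The default route always matches, so matches is never empty here.
    return max(matches)[1]


print(lookup("192.168.1.50"))  # on-link
print(lookup("10.20.3.4"))     # 192.168.1.254
print(lookup("8.8.8.8"))       # 192.168.1.1
```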
C. DNS Resolution (Layer 7)
- nslookup: Simple name-to-IP lookup tool.
- dig (Domain Information Groper): The Swiss Army knife for DNS. Returns detailed DNS information including A, AAAA, MX, CNAME, and TXT records.
  - Pro-Tip: dig +short google.com gives just the IPs. dig @8.8.8.8 example.com forces a query to a specific DNS server (useful for checking whether your local DNS is poisoned).
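To see what a tool like dig actually puts on the wire, the sketch below builds a minimal DNS query for an A record by hand. It only constructs the bytes and sends nothing. The function name and the fixed query ID are illustrative choices, and real resolvers randomize the ID.

```python
import struct


def build_dns_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message (qtype 1 = A record)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, then three zero counts.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends the name.
    qname = b""
    for label in name.rstrip(".").split("."):
        qname += bytes([len(label)]) + label.encode("ascii")
    qname += b"\x00"
    # QTYPE, then QCLASS 1 (IN, the Internet class).
    return header + qname + struct.pack("!HH", qtype, 1)


query = build_dns_query("example.com")
print(len(query), query.hex())
```

Sending these bytes over UDP to port 53 of a resolver and parsing the reply is, in essence, what a DNS lookup tool does before it formats the answer.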
D. Connections & Sockets (Layer 4)
- ss (Socket Statistics) / netstat: Shows all active TCP/UDP connections, listening ports, and socket states. ss is the modern, faster replacement for netstat.
  - Command: ss -tulpn shows all listening TCP/UDP ports and the processes owning them.
- curl -v: Performs an HTTP request and displays the headers. Essential for debugging web servers, API endpoints, and TLS handshakes.
E. Deep Packet Inspection (Sniffing)
- tcpdump: A powerful CLI packet analyzer. It captures raw packets on the wire that match a boolean filter expression.
  - Example: tcpdump -i eth0 port 80 captures all traffic to or from port 80 (typically HTTP) on the eth0 interface.
- Wireshark: A GUI tool for analyzing .pcap files generated by tcpdump (with -w). You can right-click and “Follow TCP Stream” to read the raw bytes sent between client and server.
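Packet analyzers work by decoding raw bytes into header fields. As a rough illustration, the sketch below parses the fixed 20-byte part of an IPv4 header. `SAMPLE` is a hand-built header with its checksum left at zero, and `parse_ipv4_header` is an illustrative helper, not how tcpdump is actually implemented.

```python
import socket
import struct

PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header."""
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,             # high nibble of the first byte
        "header_len": (ver_ihl & 0x0F) * 4,  # low nibble counts 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "protocol": PROTOCOLS.get(proto, str(proto)),
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Hand-built sample: IPv4, 40 bytes total, DF set, TTL 64, TCP,
# 192.168.0.1 -> 192.168.0.199 (checksum left as zero for brevity).
SAMPLE = bytes.fromhex("45000028" "00004000" "40060000" "c0a80001" "c0a800c7")

print(parse_ipv4_header(SAMPLE))
```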
4. Anatomy of a Troubleshooting Session
When diagnosing a typical web service failure, follow this step-by-step anatomy:
- Verify DNS (dig): Did the hostname resolve to the correct IP? If the answer is NXDOMAIN, the DNS record is missing or the name is mistyped.
- Verify Reachability (ping): Can you reach the IP? If there is 100% packet loss, there might be a routing issue, or the server is down or firewalled.
- Trace the Path (traceroute): Where is the packet dying? Does it leave your ISP? Does it reach the target’s cloud provider?
- Verify the Port (nc -zv or telnet): Is the specific application port (e.g., 443 for HTTPS) open and listening? A firewall might allow ping but block port 443.
- Verify the Protocol (curl -v): If the port is open, does the web server return a 200 OK, or a 502 Bad Gateway?
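The port check in the fourth step can be approximated in code. The sketch below attempts a TCP handshake, which is roughly what nc -zv does, and separates a refused connection (nothing is listening) from a timeout (packets are likely being dropped on the way). The function name and the return labels are illustrative.

```python
import socket


def check_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP handshake and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # handshake completed
    except ConnectionRefusedError:
        return "closed"          # host answered with a reset: nothing is listening
    except socket.timeout:
        return "filtered"        # no answer at all: often a firewall dropping packets
    except OSError as exc:
        return f"error: {exc}"   # e.g. name resolution failure or no route to host


if __name__ == "__main__":
    # Hypothetical target; substitute the host and port you are investigating.
    print(check_port("127.0.0.1", 443))
```

The difference between "closed" and "filtered" matters: a refusal means the packet reached the host, while a timeout points at something in between.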
5. Case Study: The Midnight Database Outage
Let’s apply the PEDALS framework to a real-world troubleshooting scenario:
- Process Requirements: The web API suddenly cannot connect to the primary PostgreSQL database after a deployment.
- Estimate: Not applicable for sudden failure, but blast radius is 100% of API traffic.
- Data Model: The API uses connection pooling (PgBouncer) to talk to the database.
- Architecture: Web APIs (Subnet A) -> PgBouncer (Subnet B) -> Postgres Primary (Subnet C).
- Localized Details:
  - Step 1: Application logs show Connection timeout.
  - Step 2: We run ping from the Web API to PgBouncer. It succeeds. (Network Layer 3 is up).
  - Step 3: We run curl -v telnet://pgbouncer-ip:6432 (or nc -zv pgbouncer-ip 6432). It times out. (Transport Layer 4 is blocked).
  - Step 4: We SSH into the PgBouncer server and run ss -tulpn. It shows it’s listening correctly on port 6432.
  - Conclusion: The service is up, but packets aren’t reaching it on the specific port. We check Security Groups/Firewalls between Subnet A and B. A recent terraform deployment accidentally removed the ingress rule for port 6432.
- Scale: The fix takes 2 minutes once identified, but finding the exact broken link between Layer 3 (ping) and Layer 4 (TCP connection) is the key.
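The reasoning in this case study boils down to three observations and a small decision table. The sketch below encodes that logic. The function name, its parameters, and the wording of the diagnoses are illustrative, and a real incident usually needs more signals than these three.

```python
def localize_fault(ping_ok: bool, tcp_connect_ok: bool, service_listening: bool) -> str:
    """Map three observations onto the most likely place to look next."""
    if not ping_ok:
        return "layer 3 or below: check routing, links, or ICMP filtering"
    if tcp_connect_ok:
        return "layers 3-4 are fine: look at the application or protocol"
    if not service_listening:
        return "service is not listening: check the process and its bind address"
    return "port is filtered in transit: check firewalls and security groups"


# The observations from the case study: ping works, the TCP connection
# times out, and ss shows the service listening on its port.
diagnosis = localize_fault(ping_ok=True, tcp_connect_ok=False, service_listening=True)
print(diagnosis)
```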
6. Common Edge Cases & Hardware Realities
- Asymmetric Routing: Packets take one path to the destination but a completely different path back. A stateful firewall that never saw the outbound half of a connection often drops the returning packets, causing silent failures.
- MTU Mismatch (Black Hole Connections): The sender’s Maximum Transmission Unit (usually 1500 bytes) is too high for a link in the path. A router facing that link drops oversized packets that have Don’t Fragment set and replies with an ICMP “Fragmentation Needed” message. When firewalls block those messages, the sender never learns to shrink its packets. The connection establishes (small packets pass), but hangs indefinitely when data transfers start.
- Duplex Mismatch: One side of an Ethernet link is full-duplex, the other is half-duplex. The half-duplex side sees frequent collisions and the full-duplex side sees corrupted frames, resulting in terrible performance (e.g., 1-2 Mbps on a 1 Gbps link).
- DNS Poisoning / Spoofing: An attacker corrupts the local DNS cache, directing users to a malicious IP instead of the legitimate service. Verify with dig @1.1.1.1 to compare against a trusted resolver.
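The MTU black hole described above comes down to simple arithmetic. The sketch below assumes IPv4 and TCP headers without options (20 bytes each) and a hypothetical path where one tunnel link lowers the MTU to 1400 bytes. All names are illustrative.

```python
from typing import List

IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options


def max_segment_size(mtu: int) -> int:
    """Largest TCP payload that fits in one packet on a link with this MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def passes(packet_size: int, path_mtus: List[int]) -> bool:
    """A packet with Don't Fragment set gets through only if it fits every link."""
    return packet_size <= min(path_mtus)


# Hypothetical path: a tunnel in the middle lowers the MTU to 1400 bytes.
path = [1500, 1400, 1500]

print(max_segment_size(1500))  # 1460
print(passes(60, path))        # True: a handshake-sized packet fits
print(passes(1500, path))      # False: a full-size data packet does not
```

This is why the handshake succeeds while bulk transfers hang: only the full-size packets exceed the smallest link on the path.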