Network Monitoring and Troubleshooting
Tools and techniques for diagnosing network issues, monitoring performance, and maintaining network health.
Network problems are often invisible until they break something important. Good monitoring catches issues before users notice them, while systematic troubleshooting helps you fix problems quickly when they do occur.
Prerequisites
- Understanding of networking fundamentals from previous sections
- Access to Linux/Unix systems for running diagnostic tools
- Basic familiarity with log files and system administration
Essential Network Monitoring Tools
Let's start with the fundamental tools every developer should know.
ping: Basic Connectivity Testing
ping sends ICMP packets to test if a host is reachable:
# Basic connectivity test
ping google.com
# Send specific number of packets
ping -c 4 google.com
# Measure packet loss and latency patterns
ping -c 100 -i 0.1 your-server.com
Look for patterns in the output:
64 bytes from google.com (172.217.12.14): icmp_seq=1 ttl=55 time=12.3 ms
64 bytes from google.com (172.217.12.14): icmp_seq=2 ttl=55 time=11.8 ms
64 bytes from google.com (172.217.12.14): icmp_seq=3 ttl=55 time=45.2 ms
Sudden spikes in response time (like the third packet) often indicate network congestion or processing delays.
traceroute: Path Discovery
traceroute shows the path packets take to reach a destination:
# See the route to a destination
traceroute google.com
# Use UDP instead of ICMP (some firewalls block ICMP)
traceroute -U google.com
# Show IP addresses without DNS resolution for faster results
traceroute -n google.com
Each line represents a router hop:
1 192.168.1.1 (192.168.1.1) 1.234 ms 1.123 ms 1.456 ms
2 10.0.0.1 (10.0.0.1) 12.345 ms 11.234 ms 13.456 ms
3 * * *
4 203.0.113.1 (203.0.113.1) 23.456 ms 22.345 ms 24.567 ms
The * * *
in line 3 means that router doesn't respond to traceroute packets (common for security reasons), but traffic still flows through it.
netstat: Connection Analysis
netstat shows active network connections and listening services:
# Show all TCP and UDP connections
netstat -tuln
# Show connections with process information
netstat -tulnp
# Show only established connections
netstat -tln | grep ESTABLISHED
# Monitor connections in real-time
watch 'netstat -tuln | grep :80'
Understanding the output helps diagnose connection issues:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1234/sshd
tcp 0 0 127.0.0.1:5432 0.0.0.0:* LISTEN 5678/postgres
tcp 0 52 192.168.1.100:22 203.0.113.50:45678 ESTABLISHED 1234/sshd
ss: Modern Socket Statistics
ss is the modern replacement for netstat with better performance:
# Show all sockets
ss -tuln
# Show established TCP connections
ss -t state established
# Show connections to specific port
ss -tn dport = :80
# Show process information
ss -tlnp
ss provides more detailed information and runs faster on busy systems.
Application-Level Monitoring
Network monitoring extends beyond basic connectivity to application performance.
HTTP Response Time Monitoring
Monitor web application performance:
# Measure HTTP response times
curl -w "@curl-format.txt" -o /dev/null -s http://your-api.com/health
# Create curl-format.txt with timing information
echo 'time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n' > curl-format.txt
This breaks down where time is spent in HTTP requests:
time_namelookup: 0.003
time_connect: 0.012
time_appconnect: 0.045
time_pretransfer: 0.046
time_redirect: 0.000
time_starttransfer: 0.123
time_total: 0.125
Database Connection Monitoring
Monitor database connectivity and performance:
# PostgreSQL connection monitoring
#!/bin/bash
DB_HOST="db.example.com"
DB_PORT="5432"
DB_NAME="app_production"
# Test basic connectivity
nc -zv $DB_HOST $DB_PORT
# Test application-level connectivity
psql "postgresql://user:pass@$DB_HOST:$DB_PORT/$DB_NAME" -c "SELECT 1;"
# Monitor active connections
psql "postgresql://user:pass@$DB_HOST:$DB_PORT/$DB_NAME" -c "
SELECT count(*) as active_connections,
state,
client_addr
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY state, client_addr;"
TCP Connection Health
Monitor TCP connection states and errors:
# Count connections by state
ss -tan | awk 'NR>1 {++S[$1]} END {for (a in S) print a,S[a]}'
# Monitor TCP retransmissions (sign of network problems)
netstat -s | grep -i retrans
# Watch for connection errors
dmesg | grep -i "tcp\|network"
Log-Based Network Monitoring
System and application logs contain valuable networking information.
System Log Analysis
Look for network-related messages in system logs:
# Check for network interface errors
dmesg | grep -i "network\|eth\|link"
# Monitor authentication failures (potential attacks)
grep "Failed password" /var/log/auth.log | tail -20
# Check for firewall blocks
grep "BLOCK" /var/log/ufw.log | tail -20
# Monitor DNS resolution issues
grep "resolve" /var/log/syslog | tail -20
Application Log Patterns
Monitor application logs for network issues:
# Web server logs - look for error patterns
tail -f /var/log/nginx/error.log | grep -E "(timeout|refused|unreachable)"
# Application logs - database connection issues
tail -f /var/log/app/application.log | grep -E "(connection.*failed|timeout|refused)"
# API response time monitoring
tail -f /var/log/nginx/access.log | awk '$NF > 1000 {print $0}' # Slow requests > 1 second
Structured Log Analysis
For applications that produce structured logs:
# Parse JSON logs for network metrics
tail -f app.log | jq 'select(.response_time_ms > 1000) | {timestamp, endpoint, response_time_ms}'
# Monitor error rates
tail -f app.log | jq -r 'select(.level == "ERROR" and .message | contains("network")) | .timestamp + " " + .message'
Network Performance Monitoring
Understanding network performance helps optimize applications and identify bottlenecks.
Bandwidth Testing
Measure available bandwidth:
# iperf3 - requires server on remote end
iperf3 -c iperf.example.com
# Test download speed
wget -O /dev/null http://speedtest.example.com/100MB.bin
# Test with curl and measure time
time curl -o /dev/null http://example.com/largefile.zip
Latency Monitoring
Track latency patterns over time:
# Continuous latency monitoring
#!/bin/bash
TARGET="api.example.com"
LOG_FILE="/var/log/latency-monitor.log"
while true; do
LATENCY=$(ping -c 1 $TARGET | grep 'time=' | awk -F'time=' '{print $2}' | awk '{print $1}')
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP $TARGET $LATENCY" >> $LOG_FILE
sleep 60
done
Network Interface Statistics
Monitor network interface performance:
# Interface statistics
cat /proc/net/dev
# Detailed interface information
ip -s link show
# Monitor interface errors
watch 'cat /proc/net/dev | grep -E "(eth0|wlan0)"'
# Network interface utilization
vnstat -i eth0 -l # Live monitoring
vnstat -i eth0 -d # Daily statistics
Automated Network Monitoring
Set up automated monitoring to catch issues proactively.
Network Monitoring Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Monitoring Infrastructure │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Prometheus │ │ Grafana │ │ AlertManager │ │
│ │ (Metrics) │ │ (Dashboards) │ │ (Alerts) │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────────┐│ │ ┌─────────────┐│ │ ┌─────────────┐│ │
│ │ │Network ││ │ │Latency ││ │ │Email ││ │
│ │ │Metrics DB ││ │ │Dashboard ││ │ │Slack ││ │
│ │ └─────────────┘│ │ └─────────────┘│ │ │PagerDuty ││ │
│ └─────────────────┘ └─────────────────┘ │ └─────────────┘│ │
│ ▲ ▲ └─────────────────┘ │
│ │ │ ▲ │
│ │ └─────────────────────┘ │
│ │ │
└───────────┼─────────────────────────────────────────────────────────────┘
│
┌───────────▼─────────────────────────────────────────────────────────────┐
│ Target Infrastructure │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Web-1 │ │ API-1 │ │Database-1 │ │Load Balancer│ │
│ │ │ │ │ │ │ │ │ │
│ │┌───────────┐│ │┌───────────┐│ │┌───────────┐│ │┌───────────┐│ │
│ ││node_export││ ││node_export││ ││node_export││ ││blackbox ││ │
│ ││:9100 ││ ││:9100 ││ ││:9100 ││ ││exporter ││ │
│ │└───────────┘│ │└───────────┘│ │└───────────┘│ ││:9115 ││ │
│ │ │ │ │ │ │ │└───────────┘│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Network Metrics │ │
│ │ │ │
│ │ • Interface utilization (rx/tx bytes) │ │
│ │ • Connection counts (established, time_wait) │ │
│ │ • Packet loss and error rates │ │
│ │ • Latency measurements (ping, HTTP response time) │ │
│ │ • DNS resolution time │ │
│ │ • SSL certificate expiration │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Monitoring Flow:
1. Exporters collect metrics from infrastructure
2. Prometheus scrapes and stores metrics
3. Grafana visualizes metrics in dashboards
4. AlertManager sends notifications when thresholds exceeded
5. Engineers respond to alerts and fix issues
Simple Health Check Script
#!/bin/bash
# network-health-check.sh
SERVICES=(
"google.com:80"
"api.example.com:443"
"db.example.com:5432"
)
ALERT_EMAIL="[email protected]"
LOG_FILE="/var/log/network-health.log"
check_service() {
local host_port=$1
local host=${host_port%:*}
local port=${host_port#*:}
if nc -zv $host $port 2>/dev/null; then
echo "$(date): $host_port OK" >> $LOG_FILE
return 0
else
echo "$(date): $host_port FAILED" >> $LOG_FILE
echo "Network check failed for $host_port" | mail -s "Network Alert" $ALERT_EMAIL
return 1
fi
}
for service in "${SERVICES[@]}"; do
check_service $service
done
Run this script every 5 minutes via cron:
# Add to crontab
*/5 * * * * /usr/local/bin/network-health-check.sh
Prometheus Network Monitoring
For more sophisticated monitoring, use Prometheus with node_exporter:
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'node'
static_configs:
- targets: ['localhost:9100']
metrics_path: /metrics
scrape_interval: 5s
Query network metrics:
# Network bytes received
rate(node_network_receive_bytes_total[5m])
# Network errors
rate(node_network_receive_errs_total[5m])
# TCP connection states
node_netstat_Tcp_CurrEstab
Troubleshooting Network Issues
When network problems occur, systematic troubleshooting saves time.
Network Troubleshooting Flow
Problem: "I can't connect to the application!"
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Step 1: Physical/Link Layer │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Check cables │ │ Check WiFi │ │
│ │ ip link show │ │ iwconfig │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────┬──────────────────────────────────────┘
│ ✓ Links are UP
▼
┌────────────────────────────────────────────────────────────────┐
│ Step 2: Network Layer (IP) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ ping gateway │ │ ping 8.8.8.8 │ │
│ │ ping 192.168.1.1│ │ (test internet) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────┬──────────────────────────────────────┘
│ ✓ IP routing works
▼
┌────────────────────────────────────────────────────────────────┐
│ Step 3: DNS Resolution │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ nslookup │ │ dig │ │
│ │ app.example.com │ │ app.example.com │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────┬──────────────────────────────────────┘
│ ✓ DNS resolves correctly
▼
┌────────────────────────────────────────────────────────────────┐
│ Step 4: Transport Layer (Ports) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ telnet host 80 │ │ nc -zv host 443 │ │
│ │ (test HTTP) │ │ (test HTTPS) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────┬──────────────────────────────────────┘
│ ✓ Ports are open
▼
┌────────────────────────────────────────────────────────────────┐
│ Step 5: Application Layer │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ curl -I │ │ Check app logs │ │
│ │ http://app.com │ │ /var/log/app │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────┬──────────────────────────────────────┘
│
▼
Problem Found!
Common Issues by Layer:
┌─────────────┬─────────────────────────────────────────────────────┐
│ Layer │ Common Problems │
├─────────────┼─────────────────────────────────────────────────────┤
│ Physical │ Cable unplugged, WiFi disconnected │
│ Network │ Wrong IP, gateway down, routing misconfigured │
│ DNS │ Wrong nameserver, domain expired, propagation delay │
│ Transport │ Firewall blocking, service not listening │
│ Application │ App crashed, misconfigured, database connection │
└─────────────┴─────────────────────────────────────────────────────┘
The OSI Model Troubleshooting Approach
Work through the network layers systematically:
Layer 1 (Physical): Are cables connected? Is WiFi working?
# Check interface status
ip link show
# Check for hardware errors
dmesg | grep -i "link\|cable\|phy"
Layer 2 (Data Link): Can you reach other machines on the same network?
# Check ARP table
arp -a
# Test local network connectivity
ping 192.168.1.1 # Gateway
Layer 3 (Network): Can you reach remote networks?
# Check routing
ip route show
# Test internet connectivity
ping 8.8.8.8
Layer 4 (Transport): Are ports open and services responding?
# Test specific ports
telnet example.com 80
nc -zv example.com 443
Layer 7 (Application): Is the application working correctly?
# Test application endpoints
curl -I http://api.example.com/health
Common Network Problems and Solutions
DNS Resolution Issues:
# Symptoms: Can ping IP but not hostname
ping 8.8.8.8 # Works
ping google.com # Fails
# Troubleshooting
nslookup google.com
dig google.com
# Check DNS configuration
cat /etc/resolv.conf
# Try different DNS servers
nslookup google.com 8.8.8.8
Firewall Blocking Traffic:
# Symptoms: Connection timeouts or immediate refusal
telnet example.com 80 # Hangs or "Connection refused"
# Check local firewall
sudo ufw status
sudo iptables -L
# Check remote firewall (if you have access)
# Test from different source IPs
Network Congestion:
# Symptoms: High latency, packet loss
ping -c 100 example.com | grep "packet loss"
# Check interface utilization
vnstat -i eth0 -l
# Monitor for retransmissions
netstat -s | grep -i retrans
SSL/TLS Issues:
# Test SSL connectivity
openssl s_client -connect example.com:443
# Check certificate validity
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates
Network Security Monitoring
Monitor for security issues and attacks.
Failed Authentication Monitoring
# SSH brute force attempts
grep "Failed password" /var/log/auth.log | awk '{print $11}' | sort | uniq -c | sort -nr
# Web authentication failures
grep "401\|403" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr
Port Scan Detection
# Monitor for port scanning attempts
netstat -an | grep SYN_RECV | wc -l
# Detect connection attempts to closed ports
grep "Connection attempt" /var/log/messages
DDoS Attack Monitoring
# Monitor connection counts by IP
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -20
# Monitor request rates in web logs
tail -f /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr
Performance Baselines and Alerting
Establish performance baselines to detect anomalies.
Creating Performance Baselines
# Collect baseline network metrics
#!/bin/bash
BASELINE_FILE="/var/log/network-baseline.log"
# Collect metrics for a week during normal operations
echo "$(date): Latency to google.com: $(ping -c 5 google.com | tail -1 | awk -F'/' '{print $5}')" >> $BASELINE_FILE
echo "$(date): DNS resolution time: $(time nslookup google.com 2>&1 | grep real | awk '{print $2}')" >> $BASELINE_FILE
echo "$(date): HTTP response time: $(curl -w '%{time_total}' -o /dev/null -s http://api.example.com)" >> $BASELINE_FILE
Alerting Thresholds
Set up alerts based on deviations from baselines:
# Alert if latency exceeds baseline by 50%
BASELINE_LATENCY=20 # milliseconds
CURRENT_LATENCY=$(ping -c 1 google.com | grep 'time=' | awk -F'time=' '{print $2}' | awk '{print $1}' | cut -d'.' -f1)
if [ $CURRENT_LATENCY -gt $((BASELINE_LATENCY * 150 / 100)) ]; then
echo "High latency detected: ${CURRENT_LATENCY}ms" | mail -s "Network Alert" [email protected]
fi
Advanced Troubleshooting Tools
For complex network issues, specialized tools provide deeper insights.
tcpdump: Packet Capture
Capture and analyze network packets:
# Capture packets on specific interface
sudo tcpdump -i eth0
# Capture HTTP traffic
sudo tcpdump -i eth0 port 80
# Capture and save to file
sudo tcpdump -i eth0 -w network-capture.pcap
# Filter by host
sudo tcpdump -i eth0 host api.example.com
# Show packet contents
sudo tcpdump -i eth0 -A port 80
Wireshark: Packet Analysis
For detailed packet analysis, use Wireshark (GUI tool) or tshark (command line):
# Analyze saved packet capture
tshark -r network-capture.pcap
# Filter HTTP requests
tshark -r network-capture.pcap -Y http.request
# Extract timing information
tshark -r network-capture.pcap -T fields -e frame.time -e tcp.analysis.ack_rtt
strace: System Call Tracing
Debug application network behavior:
# Trace network system calls for a process
strace -e trace=network -p <pid>
# Trace a command's network activity
strace -e trace=network curl http://example.com
# Save trace to file for analysis
strace -e trace=network -o network-trace.log -p <pid>
In the next section, we'll explore network automation - how to manage network configurations, deployments, and monitoring using infrastructure as code principles.
Network monitoring and troubleshooting are skills that improve with practice. Start with the basic tools, build your understanding of normal network behavior, and gradually add more sophisticated monitoring as your infrastructure grows.
Happy troubleshooting!
Found an issue?