Linux Debugging Essentials for Infrastructure

Debugging Workflow#

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, network, kernel logs.

Services: systemctl and journalctl#

When a service is misbehaving, start with its status:

systemctl status nginx

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.

Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.