Linux Debugging Essentials for Infrastructure

Debugging Workflow#

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, network, kernel logs.

Services: systemctl and journalctl#

When a service is misbehaving, start with its status:

systemctl status nginx

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.

Linux Performance Tuning: sysctl, ulimits, I/O Schedulers, and Kernel Parameters

sysctl: Kernel Parameter Tuning#

The sysctl interface exposes kernel parameters that control how Linux manages memory, networking, file systems, and processes. Changes take effect immediately but are lost on reboot unless persisted.

Memory Parameters#

# Reduce swap aggressiveness (default is 60, range 0-100)
# Lower values make the kernel prefer reclaiming page cache over swapping
# Set to 10 for database servers -- swapping destroys database performance
sysctl -w vm.swappiness=10

# Overcommit behavior
# 0 = heuristic overcommit (default, kernel estimates if there is enough memory)
# 1 = always overcommit (never refuse malloc -- dangerous but used by Redis)
# 2 = strict overcommit (never allocate more than swap + ratio*physical)
sysctl -w vm.overcommit_memory=0

The vm.swappiness parameter is one of the most impactful settings for database servers. The default of 60 means the kernel will fairly aggressively swap application memory to disk in favor of filesystem cache. For databases that manage their own caching (PostgreSQL shared_buffers, MySQL innodb_buffer_pool), this is counterproductive – the database’s carefully managed cache gets swapped out to make room for OS-level cache the database does not use.

Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.