Linux Debugging Essentials for Infrastructure

Debugging Workflow

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, disk, network, kernel logs.
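
As a rough illustration of one step in that sequence, the full-disk check can be sketched as a small shell function. The name full_filesystems and the 90% default threshold are assumptions made for this example, not part of any standard tool. The function only filters the POSIX-format output of df -P and assumes mount points contain no spaces.

```shell
# Hypothetical helper: print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; the threshold (percent) defaults to 90.
full_filesystems() {
  threshold=${1:-90}
  # Skip the header line; "$5 + 0" turns a field like "95%" into the number 95.
  awk -v t="$threshold" 'NR > 1 && $5 + 0 >= t { print $6 }'
}
```

On a live system this would be called as: df -P | full_filesystems 90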

Services: systemctl and journalctl

When a service is misbehaving, start with its status:

systemctl status nginx

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.
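
To confirm a restart loop rather than infer it from uptime, one option is to read the unit's properties directly. The sketch below assumes a systemd version that exposes the NRestarts property (235 or later, as far as I know). The helper name unit_prop is made up for this example. It only parses the Key=Value lines that systemctl show prints.

```shell
# Hypothetical helper: extract one property from `systemctl show` output on stdin.
# The input format is one Key=Value pair per line.
unit_prop() {
  # Match on the key, then print everything after the first "=" so that
  # values which themselves contain "=" are preserved.
  awk -F= -v k="$1" '$1 == k { print substr($0, length(k) + 2) }'
}
```

Typical use would be: systemctl show nginx -p NRestarts | unit_prop NRestarts, followed by journalctl -u nginx --since "1 hour ago" to read the log lines around each restart.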

Linux Performance Tuning: sysctl, ulimits, I/O Schedulers, and Kernel Parameters

sysctl: Kernel Parameter Tuning

The sysctl interface exposes kernel parameters that control how Linux manages memory, networking, file systems, and processes. Changes take effect immediately but are lost on reboot unless persisted.
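
Persisting a setting usually means writing it to a file under /etc/sysctl.d/ and reloading. The following is a minimal sketch of that step. The function name persist_sysctl and the 99-<parameter>.conf file naming are assumptions chosen for this example. The optional third argument exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: write one sysctl setting to its own drop-in file.
# $1 = parameter, $2 = value, $3 = target directory (defaults to /etc/sysctl.d).
persist_sysctl() {
  key=$1
  value=$2
  dir=${3:-/etc/sysctl.d}
  # One file per parameter keeps the example idempotent: rerunning overwrites it.
  printf '%s = %s\n' "$key" "$value" > "$dir/99-$key.conf"
}
```

Run as root, persist_sysctl vm.swappiness 10 followed by sysctl --system would write the file and reload all configured settings.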

Memory Parameters

# Reduce swap aggressiveness (default is 60; range 0-100, or 0-200 on kernel 5.8 and later)
# Lower values make the kernel prefer reclaiming page cache over swapping
# Set to 10 for database servers -- swapping destroys database performance
sysctl -w vm.swappiness=10

# Overcommit behavior
# 0 = heuristic overcommit (default, kernel estimates if there is enough memory)
# 1 = always overcommit (never refuse malloc -- dangerous but used by Redis)
# 2 = strict overcommit (never allocate more than swap + ratio*physical)
sysctl -w vm.overcommit_memory=0

The vm.swappiness parameter is one of the most impactful settings for database servers. The default of 60 means the kernel will fairly aggressively swap application memory to disk in favor of filesystem cache. For databases that manage their own caching (PostgreSQL shared_buffers, MySQL innodb_buffer_pool_size), this is counterproductive – the database’s carefully managed cache gets swapped out to make room for OS-level cache the database does not use.
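
Before changing swappiness, it helps to check whether the host is actually using swap. The sketch below computes swap in use from the SwapTotal and SwapFree fields of /proc/meminfo. The helper name swap_used_kb is an assumption for this example. It reads from stdin so it can also be run against a saved copy of the file.

```shell
# Hypothetical helper: print swap in use (kB) from /proc/meminfo-style input on stdin.
swap_used_kb() {
  awk '$1 == "SwapTotal:" { total = $2 }
       $1 == "SwapFree:"  { free = $2 }
       END { print total - free }'
}
```

On a live system: swap_used_kb < /proc/meminfo, alongside cat /proc/sys/vm/swappiness to see the value currently in effect.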

Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found a likely bottleneck to investigate first.
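
As one concrete instance, the saturation check for the CPU can be approximated by comparing the 1-minute load average with the number of CPUs. This is a simplification: on Linux the load average also counts tasks in uninterruptible I/O wait, so a high value is a prompt to look closer, not proof of CPU saturation. The function name cpu_saturation is an assumption for this sketch.

```shell
# Hypothetical helper: compare a load average against a CPU count.
# $1 = 1-minute load average, $2 = number of CPUs.
# Prints "saturated" when load exceeds the CPU count, otherwise "ok".
cpu_saturation() {
  awk -v l="$1" -v c="$2" 'BEGIN { print ((l > c) ? "saturated" : "ok") }'
}
```

On a live system: cpu_saturation "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"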

systemd Service Management: Units, Timers, Journal, and Socket Activation

Unit Types

systemd manages the entire system through “units,” each representing a resource or service. The most common types:

  • service: Daemons and long-running processes (nginx, postgresql, your application).
  • timer: Scheduled execution, replacing cron. More flexible and better integrated with logging.
  • socket: Network or Unix sockets that trigger service activation on connection. Enables lazy startup and zero-downtime restarts.
  • target: Groups of units that represent system states (multi-user.target, graphical.target). Analogous to SysV runlevels.
  • mount: Filesystem mount points managed by systemd.
  • path: Watches filesystem paths and activates units when changes occur.

Unit files live in three locations, in order of precedence: /etc/systemd/system/ (local admin overrides), /run/systemd/system/ (runtime, non-persistent), and /usr/lib/systemd/system/ (package-installed defaults). Always put custom units in /etc/systemd/system/.
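
The precedence order can be sketched as a lookup that returns the first match. This is a simplified model covering only the three directories named above. Real systemd also merges drop-in directories and searches additional paths, so systemctl cat <unit> remains the authoritative way to see what is loaded. The function name resolve_unit and its optional root-prefix argument are assumptions for this example.

```shell
# Hypothetical helper: print the unit file that wins under the three-directory
# precedence order. $1 = unit name, $2 = optional root prefix (for testing).
resolve_unit() {
  unit=$1
  root=${2:-}
  for dir in /etc/systemd/system /run/systemd/system /usr/lib/systemd/system; do
    if [ -e "$root$dir/$unit" ]; then
      printf '%s\n' "$root$dir/$unit"
      return 0
    fi
  done
  return 1   # not found in any of the three locations
}
```

For example, resolve_unit nginx.service prints the /etc/systemd/system/ copy if an admin override exists, and the package-installed file otherwise.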