Linux Debugging Essentials for Infrastructure

Debugging Workflow#

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, network, kernel logs.

Services: systemctl and journalctl#

When a service is misbehaving, start with its status:

systemctl status nginx

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.

PostgreSQL Debugging

PostgreSQL Debugging#

When PostgreSQL breaks, it usually falls into a handful of patterns. This is a reference for diagnosing each one with specific queries and commands.

Connection Refused#

Work through these in order:

1. Is PostgreSQL running?

sudo systemctl status postgresql-16

2. Is it listening on the right address?

ss -tlnp | grep 5432

If it shows 127.0.0.1:5432 but you need remote access, set listen_addresses = '*' in postgresql.conf.

3. Does pg_hba.conf allow the connection? Check logs for no pg_hba.conf entry for host:

Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues

The USE Method: A Framework for Systematic Diagnosis#

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system – CPU, memory, disk, network – you check three things:

  • Utilization: How busy is the resource? (e.g., CPU at 90%)
  • Saturation: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
  • Errors: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.

systemd Service Management: Units, Timers, Journal, and Socket Activation

Unit Types#

systemd manages the entire system through “units,” each representing a resource or service. The most common types:

  • service: Daemons and long-running processes (nginx, postgresql, your application).
  • timer: Scheduled execution, replacing cron. More flexible and better integrated with logging.
  • socket: Network sockets that trigger service activation on connection. Enables lazy startup and zero-downtime restarts.
  • target: Groups of units that represent system states (multi-user.target, graphical.target). Analogous to SysV runlevels.
  • mount: Filesystem mount points managed by systemd.
  • path: Watches filesystem paths and activates units when changes occur.

Unit files live in three locations, in order of precedence: /etc/systemd/system/ (local admin overrides), /run/systemd/system/ (runtime, non-persistent), and /usr/lib/systemd/system/ (package-installed defaults). Always put custom units in /etc/systemd/system/.