---
title: "Linux Debugging Essentials for Infrastructure"
description: "Systematic approach to debugging Linux systems using systemctl, journalctl, dmesg, process tools, disk and memory analysis, network inspection, and strace."
url: https://agent-zone.ai/knowledge/infrastructure/linux-debugging-essentials/
section: knowledge
date: 2026-02-22
categories: ["infrastructure"]
tags: ["linux","debugging","systemd","networking","performance"]
skills: ["linux-troubleshooting","system-administration"]
tools: ["systemctl","journalctl","dmesg","strace","ss","lsof"]
levels: ["intermediate"]
word_count: 798
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/linux-debugging-essentials/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/linux-debugging-essentials/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Linux+Debugging+Essentials+for+Infrastructure
---


## Debugging Workflow

Start broad, narrow down. Most problems fall into five categories: service not running, resource exhaustion, full disk, network failure, or kernel issue. Work through them in order: service, resources, disk, network, kernel logs.
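That ordering can be sketched as a first-pass triage script. The 90% disk threshold and 256 MB memory floor below are arbitrary example values, not recommendations:

```shell
#!/usr/bin/env bash
# Quick first-pass triage: disk, memory, load. Thresholds are examples only.
set -u

# Disk: flag any filesystem above 90% usage
df -hP | awk 'NR > 1 && $5+0 > 90 {print "DISK:", $6, "at", $5}'

# Memory: flag less than ~256 MB available (/proc/meminfo values are in kB)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt 262144 ]; then
  echo "MEMORY: only ${avail_kb} kB available"
fi

# Load: compare the 1-minute load average against the CPU count
read -r load _ < /proc/loadavg
echo "LOAD: ${load} across $(nproc) CPUs"
```

Anything this flags tells you which section below to jump to.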

## Services: systemctl and journalctl

When a service is misbehaving, start with its status:

```bash
systemctl status nginx
```

This shows whether the service is active, its PID, its last few log lines, and how long it has been running. If the service keeps restarting, the uptime will be suspiciously short.

View full logs for a service:

```bash
journalctl -u nginx -b              # logs since last boot
journalctl -u nginx -f              # follow in real time
journalctl -u nginx -p err          # only errors and above
journalctl -u nginx --since "1 hour ago"  # time-scoped
```

If a service fails to start, check the exit code in `systemctl status`. Common patterns: exit code 1 is a generic failure, most often a configuration error (confirm in the logs); 137 means the process was killed with SIGKILL (128 + 9), usually by the OOM killer; 203/EXEC means systemd could not execute the binary (wrong path or missing execute permission). Restart with `systemctl restart nginx`, enable on boot with `systemctl enable nginx`, and run `systemctl daemon-reload` after editing a unit file.

## Kernel Messages: dmesg

When things go wrong at the system level, `dmesg` shows kernel ring buffer messages. OOM kills, hardware errors, filesystem issues, and driver problems all appear here.

```bash
# Recent kernel messages
dmesg --time-format iso | tail -50

# Follow new messages
dmesg -w

# Filter for OOM events
dmesg | grep -i "oom\|out of memory\|killed process"

# Disk/filesystem errors
dmesg | grep -i "error\|fail\|ext4\|xfs"
```

If a process was OOM-killed, `dmesg` shows which process was chosen. The kernel picks the process with the highest `oom_score`.
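The score is exposed per process, so you can rank the kernel's kill candidates yourself. A minimal sketch reading `/proc` directly:

```shell
# Rank the top OOM-kill candidates by /proc/PID/oom_score.
# Higher score = killed first; oom_score_adj (-1000..1000) biases the choice.
for dir in /proc/[0-9]*; do
  score=$(cat "$dir/oom_score" 2>/dev/null) || continue
  printf '%s %s %s\n' "$score" "${dir##*/}" "$(cat "$dir/comm" 2>/dev/null)"
done | sort -rn | head -5
```

Writing to `/proc/PID/oom_score_adj` (e.g. `-1000` to exempt a critical daemon) changes the ranking.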

## Processes: top, htop, ps

Identify what is consuming CPU and memory:

```bash
# Snapshot of top processes by CPU
ps aux --sort=-%cpu | head -20

# Snapshot by memory
ps aux --sort=-%mem | head -20

# Find a specific process
ps aux | grep '[n]ginx'

# Process tree (parent-child relationships)
ps auxf
```

`htop` provides an interactive view with per-core CPU graphs and sortable columns. Use `top -b -n 1` for non-interactive output suitable for scripts. For deeper per-process inspection, look at `/proc/PID/status` for memory details and `/proc/PID/fd` for open file descriptors.
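A quick sketch of that `/proc` inspection, using the current shell (`$$`) as a stand-in target:

```shell
# Per-process deep dive via /proc (substitute a real PID for $$).
pid=$$

# Memory: resident set size, swap usage, and thread count
grep -E '^(VmRSS|VmSwap|Threads)' /proc/$pid/status

# Open file descriptor count
ls /proc/$pid/fd | wc -l
```

`ls -l /proc/$pid/fd` additionally shows what each descriptor points to (files, sockets, pipes).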

## Disk: df and du

A full filesystem causes cascading failures -- services cannot write logs, databases cannot write data, package managers refuse to install updates.

```bash
df -h                                   # filesystem usage
df -i                                   # inode usage (can fill even with free space)
du -sh /var/* | sort -rh | head -10     # largest directories
find / -type f -size +100M 2>/dev/null  # large files
```

Common culprits: unrotated logs in `/var/log`, old Docker images (`docker system prune`), package manager cache (`apt clean`), and core dumps.
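Before deleting anything, a read-only survey narrows down which culprit you have. The paths below are the usual suspects; adjust for your layout:

```shell
# Read-only disk survey: identify what to clean before cleaning it.
du -sh /var/log/* 2>/dev/null | sort -rh | head -5      # biggest log trees
find /var/log -name '*.gz' -size +10M 2>/dev/null       # oversized rotated logs
find /var/crash /var/lib/systemd/coredump /tmp -type f 2>/dev/null | head  # core dumps, temp files
```

Only after confirming the target should you reach for `docker system prune`, `apt clean`, or log truncation.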

## Memory: free and vmstat

```bash
free -h      # memory overview
vmstat 2     # continuous monitoring (every 2 seconds)
```

In `free` output, the "available" column matters, not "free." Linux uses unused memory for disk cache, which is reclaimed on demand. A system showing 50MB "free" but 4GB "available" is healthy. In `vmstat`, watch `si`/`so` (swap in/out -- constant activity means memory starvation) and `wa` (I/O wait).
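A one-liner makes the distinction concrete by computing the percentage of memory actually available (field names are as they appear in `/proc/meminfo`):

```shell
# MemAvailable counts reclaimable cache; MemFree does not.
awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^MemAvailable:/ {a=$2}
     END {printf "free: %d kB, available: %d kB (%.0f%% of %d kB)\n",
          f, a, 100 * a / t, t}' /proc/meminfo
```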

## Networking: ss and netstat

`ss` is the modern replacement for `netstat`.

```bash
ss -tlnp                       # listening TCP ports with process names
ss -tnp                        # established connections
ss -tnp dst :443               # connections to a specific port
ss -tn state time-wait | wc -l # TIME_WAIT count
ss -s                          # socket statistics summary
```

The `-p` flag shows the owning process (requires root). For connectivity testing: `dig example.com +short` for DNS, `nc -zv -w 5 host 443` for port reachability (flags before the host, which works across nc variants), `traceroute -n host` for routing.
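When `nc` is not installed, bash's built-in `/dev/tcp` pseudo-device gives a dependency-free port check. The host and port below are examples:

```shell
# Port reachability without nc, via bash's /dev/tcp (bash-only, not POSIX sh).
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} unreachable"
  fi
}
check_port 127.0.0.1 443
```

This distinguishes "connection refused" and timeouts from DNS failures only crudely; prefer `nc`/`dig` when available.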

## System Calls: strace

When logs tell you nothing, `strace` shows exactly what system calls a process is making -- file access errors, network connection attempts, and permission denials that never appear in application logs.

```bash
# Trace a running process
strace -p PID -f

# Trace a command from start
strace -f -e trace=network,file curl https://example.com

# Only file operations
strace -e trace=open,openat,read,write -p PID

# Trace with timestamps
strace -t -p PID

# Summary of syscalls (count and time spent)
strace -c -p PID
```

The `-f` flag follows child processes (critical for forking servers). The `-e trace=` flag filters to specific syscall categories to reduce noise.

## Open Files: lsof

`lsof` connects processes to the files and sockets they hold open.

```bash
# Files opened by a process
lsof -p PID

# What process has a file open
lsof /var/log/syslog

# All network connections for a process
lsof -i -a -p PID

# What is using a specific port
lsof -i :8080

# Files opened by a user
lsof -u www-data

# Count open files per process (find file descriptor leaks)
lsof -n | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -rn | head -10
```

A process that continuously opens files without closing them hits the file descriptor limit (`ulimit -n`), causing "too many open files" errors. The count command above identifies the offending process.
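You can check any single process against its limit directly from `/proc`; this sketch uses the current shell (`$$`) as the target:

```shell
# Compare a process's open FDs to its soft limit (substitute a real PID for $$).
pid=$$
open=$(ls /proc/$pid/fd | wc -l)
soft=$(awk '/^Max open files/ {print $4}' /proc/$pid/limits)
echo "PID $pid: $open of $soft file descriptors in use"
```

A count creeping toward the soft limit over time is the signature of a descriptor leak.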

