---
title: "Linux Troubleshooting: A Systematic Approach to Diagnosing System Issues"
description: "Systematic methodology for diagnosing Linux system issues using the USE method, covering CPU, memory, disk, network, process, and log investigation with practical commands and common patterns."
url: https://agent-zone.ai/knowledge/infrastructure/linux-troubleshooting/
section: knowledge
date: 2026-02-21
categories: ["infrastructure"]
tags: ["linux","troubleshooting","performance","diagnostics","USE-method","monitoring"]
skills: ["linux-troubleshooting","system-administration","performance-analysis"]
tools: ["top","htop","mpstat","pidstat","vmstat","iostat","iotop","ss","tcpdump","strace","lsof","journalctl","dmesg"]
levels: ["intermediate"]
word_count: 1425
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/linux-troubleshooting/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/linux-troubleshooting/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Linux+Troubleshooting%3A+A+Systematic+Approach+to+Diagnosing+System+Issues
---


## The USE Method: A Framework for Systematic Diagnosis

The USE method, developed by Brendan Gregg, provides a structured approach to system performance analysis. For every resource on the system -- CPU, memory, disk, network -- you check three things:

- **Utilization**: How busy is the resource? (e.g., CPU at 90%)
- **Saturation**: Is work queuing because the resource is overloaded? (e.g., CPU run queue length)
- **Errors**: Are there error events? (e.g., disk I/O errors, network packet drops)

This method prevents the common trap of randomly checking things. Instead, you systematically walk through each resource and check all three dimensions. If you find high utilization, saturation, or errors on a resource, you have found your bottleneck.

The recommended investigation order is: CPU, Memory, Disk, Network, Processes, Logs. This order works because CPU and memory issues are the most common, and each step builds context for the next.

## CPU Investigation

Start with the big picture using `top` or `htop`:

```bash
top -bn1 | head -20          # snapshot view, non-interactive
htop                          # interactive, color-coded, tree view
```

Key things to look at in `top`: the load average (1-, 5-, and 15-minute averages), overall CPU percentages (us = user, sy = system, wa = I/O wait, id = idle), and per-process CPU usage. On Linux, the load average counts tasks that are runnable plus tasks in uninterruptible sleep (usually blocked on disk I/O), so it is not a pure CPU metric. As a rule of thumb, on a 4-core system a sustained load average of 4.0 means the CPUs are fully occupied; values above that mean work is queuing.

A critical pattern to recognize: **high load average but low CPU usage**. This means processes are waiting but not for CPU -- they are in I/O wait or uninterruptible sleep. Check the `wa` (I/O wait) value in `top`. If `wa` is high, the bottleneck is disk I/O, not CPU.
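The load-versus-cores rule of thumb can be computed in one line; this is an illustrative snippet, with awk doing the floating-point division that POSIX shell arithmetic cannot:

```bash
# Rough per-core load: raw load numbers only make sense relative to
# core count, so normalize the 1-minute average by the CPU count.
read load1 load5 load15 rest < /proc/loadavg
cores=$(nproc)
awk -v l="$load1" -v c="$cores" \
    'BEGIN { printf "load per core: %.2f (%s across %d cores)\n", l / c, l, c }'
```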

For per-CPU breakdown:

```bash
mpstat -P ALL 1 5             # per-CPU stats, 1-second interval, 5 samples
```

This reveals whether one CPU is pegged at 100% while the others sit idle -- the signature of a single-threaded bottleneck. It also shows whether the system is spending excessive time in system calls (`%sys`) versus user code (`%usr`).

To identify which process is consuming CPU:

```bash
pidstat 1 5                   # per-process CPU, 1-second interval
pidstat -t -p <pid> 1         # per-thread breakdown for a specific process
```

## Memory Investigation

The most important command for memory is `free`:

```bash
free -h
```

This produces output like:

```
              total        used        free      shared  buff/cache   available
Mem:           31Gi        12Gi       1.2Gi       256Mi        18Gi        18Gi
```

The critical column is **available**, not **free**. Linux uses otherwise-idle memory for disk caching (buff/cache) and reclaims it when applications need it. A system showing 1.2Gi "free" but 18Gi "available" is healthy -- the kernel is using spare memory productively. Only worry when "available" is low.
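The same "available, not free" figure can be read straight from `/proc/meminfo`, which is where `free` gets its numbers; a sketch:

```bash
# Percent of RAM actually available to applications, using the kernel's
# own MemAvailable estimate (the source of free's "available" column).
awk '/^MemTotal:/     { total = $2 }
     /^MemAvailable:/ { avail = $2 }
     END { printf "available: %.1f%% of %.1f GiB total\n",
                  100 * avail / total, total / 1048576 }' /proc/meminfo
```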

Check for swap activity with `vmstat`:

```bash
vmstat 1 10                   # 1-second interval, 10 samples
```

Watch the `si` (swap in) and `so` (swap out) columns. Any non-zero `so` value means the system is actively pushing memory to disk, which devastates performance. Consistent swap activity is a strong signal that the system needs more RAM or a process has a memory leak.
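The `si`/`so` columns come from cumulative kernel counters in `/proc/vmstat` (`pswpin`/`pswpout`, in pages); sampling them twice yields the same rate `vmstat` reports. A sketch:

```bash
# Swap traffic over one second, read from the cumulative page counters
# that vmstat itself uses (/proc/vmstat pswpin/pswpout, in pages).
swap_pages() {
    awk '/^pswpin|^pswpout/ { sum += $2 } END { print sum + 0 }' /proc/vmstat
}
before=$(swap_pages)
sleep 1
after=$(swap_pages)
echo "pages swapped in+out during the last second: $((after - before))"
```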

For detailed memory breakdown:

```bash
cat /proc/meminfo             # full kernel memory statistics
smem -tk                      # per-process memory (USS = unique, PSS = proportional)
```

`smem` is particularly useful because it shows actual per-process memory consumption, accounting for shared libraries. The PSS (Proportional Set Size) column divides shared memory proportionally among the processes sharing it, giving a realistic picture.

## Disk Investigation

Disk problems come in two forms: running out of space and I/O performance issues.

For space:

```bash
df -h                         # filesystem space usage
df -i                         # inode usage -- critical and often overlooked
```

**Inodes can run out before disk space.** A filesystem with millions of tiny files (common with mail servers, container layers, or build caches) can exhaust inodes while gigabytes of space remain. The symptom is "No space left on device" errors despite `df -h` showing available space. Always check `df -i` when you see space errors.
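Both checks can be combined into one pass that flags any filesystem nearing exhaustion on either axis; the 90% threshold here is an arbitrary illustration, not a standard:

```bash
# Flag filesystems above 90% usage for space OR inodes. POSIX output
# format (-P) keeps each filesystem on a single line for awk; the
# threshold is an illustrative choice.
df -P  | awk 'NR > 1 && $5 + 0 >= 90 { print "space pressure:", $6, $5 }'
df -Pi | awk 'NR > 1 && $5 + 0 >= 90 { print "inode pressure:", $6, $5 }'
```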

For I/O performance:

```bash
iostat -x 1 5                 # extended I/O stats, 1-second interval
```

Key columns: `await` (average I/O request wait time in ms -- roughly under 10ms for SSDs, under 20ms for HDDs), `%util` (device utilization -- 100% means saturated), and `aqu-sz` (average queue size, printed as `avgqu-sz` by older sysstat releases -- high values mean I/O is queuing).

To identify which process is causing I/O:

```bash
iotop -oP                     # show only processes doing I/O, per-process
```

## Network Investigation

Start with what is listening and connected:

```bash
ss -tlnp                      # TCP listening ports with process names
ss -s                         # connection state summary (established, TIME_WAIT, etc.)
ss -tnp state established     # all established connections
```

A high number of `TIME_WAIT` connections can indicate connection churn. Thousands of `CLOSE_WAIT` connections indicate a process that is not properly closing sockets -- typically an application bug.
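`ss` reads these states from the kernel's connection tables, so the same counts can be sketched directly from `/proc/net/tcp` (state codes are two-digit hex values from the kernel's `tcp_states.h`):

```bash
# Count TCP states straight from the kernel tables that ss reads.
# Field 4 is a hex state code (tcp_states.h): 01=ESTABLISHED,
# 06=TIME_WAIT, 08=CLOSE_WAIT, 0A=LISTEN.
cat /proc/net/tcp /proc/net/tcp6 2>/dev/null |
awk '$4 ~ /^[0-9A-F][0-9A-F]$/ { states[$4]++ }
     END {
         names["01"] = "ESTABLISHED"; names["06"] = "TIME_WAIT"
         names["08"] = "CLOSE_WAIT";  names["0A"] = "LISTEN"
         for (s in states)
             printf "%-12s %d\n", (s in names ? names[s] : s), states[s]
     }'
```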

For bandwidth investigation:

```bash
iftop -i eth0                 # real-time bandwidth per connection
nethogs eth0                  # bandwidth per process (more useful)
```

When you need to see actual packet content:

```bash
tcpdump -i eth0 port 80 -nn -c 100    # capture 100 packets on port 80
tcpdump -i any host 10.0.0.5 -w /tmp/capture.pcap   # write to file for Wireshark
```

## Process Investigation

When you have identified a suspect process:

```bash
ps aux --sort=-%mem | head -20          # top 20 processes by memory
ps aux --sort=-%cpu | head -20          # top 20 processes by CPU
ps -eo pid,ppid,stat,cmd --forest      # process tree showing parent-child relationships
```

To see what a process is doing at the system call level:

```bash
strace -p <pid> -c                     # syscall summary (count and time per call)
strace -p <pid> -e trace=network       # only network-related syscalls
strace -p <pid> -e trace=file          # only file-related syscalls
```

To see what files and sockets a process has open:

```bash
lsof -p <pid>                          # all open files, sockets, pipes
lsof -i :8080                          # which process is using port 8080
lsof +D /var/log                       # which processes have files open in /var/log
```

## Log Investigation

Logs are where you confirm what the metrics are telling you:

```bash
journalctl -u <service> --since "30 min ago"   # recent service logs
journalctl -u <service> -p err                  # errors only
journalctl -f                                    # follow all system logs
dmesg --time-format iso | tail -100              # recent kernel messages
dmesg -T | grep -i error                         # kernel errors with human timestamps
```

`dmesg` is especially important for OOM kills, disk errors, hardware failures, and filesystem issues -- kernel-level events that never appear in application logs.

Check standard log locations when journalctl does not have what you need:

```bash
/var/log/syslog          # general system log (Debian/Ubuntu)
/var/log/messages        # general system log (RHEL/CentOS)
/var/log/auth.log        # authentication events
/var/log/kern.log        # kernel messages
```

## Common Patterns and Their Diagnosis

**OOM Killer**: The kernel kills processes when memory is exhausted. Detect with:

```bash
dmesg | grep -i "oom\|out of memory"
journalctl -k | grep -i oom
```

The kernel log shows which process was killed and how much memory it was using. The OOM killer selects victims based on an `oom_score` -- processes using more memory get higher scores.
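The per-process score is exposed in `/proc/<pid>/oom_score`, so the current ranking can be inspected before anything is killed; a sketch:

```bash
# List the five processes the OOM killer would target first, by reading
# the same per-process score it compares (/proc/<pid>/oom_score).
for dir in /proc/[0-9]*; do
    score=$(cat "$dir/oom_score" 2>/dev/null) || continue
    printf '%s %s %s\n' "$score" "${dir#/proc/}" "$(cat "$dir/comm" 2>/dev/null)"
done | sort -rn | head -5
```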

**Disk Full (Including Inodes)**: When `df -h` shows space but operations fail, check inodes with `df -i`. Also check if a deleted file is still held open by a process:

```bash
lsof +L1                # files that have been deleted but are still open
```

A common scenario: you delete a large log file, but the process still holds it open. The space is not freed until the process closes the file or is restarted.
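When `lsof` is unavailable, the same check works by walking the fd symlinks in `/proc`: the kernel appends a ` (deleted)` suffix to the link target of an unlinked-but-open file. A sketch:

```bash
# Find deleted-but-still-open files without lsof. Other processes' fd
# directories need matching permissions; unreadable ones are skipped.
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
        *' (deleted)') echo "$fd -> $target" ;;
    esac
done
```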

**Zombie Processes**: Processes that have exited but whose parent has not read their exit status. They show as `Z` in `ps`. They consume no resources but do consume a PID slot. Many zombies indicate a buggy parent process.

```bash
ps aux | awk '$8 ~ /Z/ {print}'        # find zombie processes
```

**File Descriptor Exhaustion**: Processes or the system running out of file descriptors:

```bash
ulimit -n                              # current per-process limit
cat /proc/sys/fs/file-nr               # system-wide: allocated, unused, max
ls /proc/<pid>/fd | wc -l              # how many FDs a process has open
```
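These pieces combine into a quick per-process usage check; a sketch that assumes a numeric "Max open files" soft limit (column 4 of `/proc/<pid>/limits`), which is the normal case:

```bash
# FD usage for one process as a fraction of its soft limit, from /proc.
# Assumes the "Max open files" soft limit is numeric (the usual case).
fd_usage() {
    pid=${1:-$$}                               # default: current shell
    open=$(ls "/proc/$pid/fd" | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid $pid: $open of $limit fds open ($((100 * open / limit))%)"
}
fd_usage
```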

## The "It's Slow" Investigation

When the symptom is simply "it's slow," determine which resource is the bottleneck:

1. **CPU bound**: `top` shows high CPU usage, low `wa`. The application is compute-limited. Solutions: optimize code, add CPU cores, distribute load.
2. **I/O bound**: `top` shows high `wa` (I/O wait), `iostat` shows high `await` and `%util`. The application is waiting on disk. Solutions: faster disks (SSD/NVMe), reduce I/O (caching, better queries), spread I/O across disks.
3. **Memory bound**: high swap activity (`vmstat` si/so), low "available" in `free -h`. The system is thrashing. Solutions: add RAM, reduce memory usage, fix memory leaks.
4. **Network bound**: `nethogs` or `iftop` shows high bandwidth, or `ss` shows many connections in unusual states. Solutions: increase bandwidth, optimize payload sizes, add connection pooling.

Run through this checklist in order. The first resource showing saturation or high utilization is usually your primary bottleneck. Fix that first, then re-evaluate -- fixing one bottleneck often reveals the next.
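The first two steps of the checklist can be sketched as a single triage script. The thresholds (load per core above 1.0, under 10% memory available) are illustrative assumptions, and disk and network still need `iostat` and `ss` as described earlier:

```bash
# First-pass "it's slow" triage following the checklist above. The
# thresholds are illustrative, not universal constants.
triage() {
    read load1 rest < /proc/loadavg
    cores=$(nproc)
    awk -v l="$load1" -v c="$cores" \
        'BEGIN { if (l / c > 1.0) printf "check CPU: load per core = %.2f\n", l / c }'

    avail_pct=$(awk '/^MemTotal:/ { t = $2 }
                     /^MemAvailable:/ { a = $2 }
                     END { printf "%d", 100 * a / t }' /proc/meminfo)
    [ "$avail_pct" -lt 10 ] && echo "check memory: only ${avail_pct}% available"
    echo "snapshot done; follow up with iostat -x and ss -s"
}
triage
```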

