---
title: "Container Runtime Security Hardening"
description: "Securing container runtimes with seccomp profiles, AppArmor and SELinux policies, read-only root filesystems, capability dropping, Falco for runtime threat detection, and gVisor and Kata Containers for workload isolation."
url: https://agent-zone.ai/knowledge/security/container-runtime-security/
section: knowledge
date: 2026-02-22
categories: ["security"]
tags: ["container-security","seccomp","apparmor","selinux","falco","gvisor","kata-containers","runtime-security","capabilities","sandbox"]
skills: ["seccomp-profile-creation","apparmor-policy-writing","capability-management","runtime-threat-detection","sandbox-configuration"]
tools: ["docker","containerd","kubectl","falco","strace","oci-seccomp-bpf-hook","gvisor","kata-containers"]
levels: ["intermediate"]
word_count: 1684
formats:
  json: https://agent-zone.ai/knowledge/security/container-runtime-security/index.json
  html: https://agent-zone.ai/knowledge/security/container-runtime-security/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Container+Runtime+Security+Hardening
---


## Why Runtime Security Matters

Container images get scanned for vulnerabilities before deployment. Admission controllers enforce pod security standards at creation time. But neither addresses what happens after the container starts running. Runtime security fills this gap: it detects and prevents malicious behavior inside running containers.

A compromised container with a properly hardened runtime is limited in what damage it can cause. Without runtime hardening, a single container escape can compromise the entire node.

## Seccomp Profiles

Seccomp (Secure Computing Mode) restricts which Linux system calls a container process can make. What happens on a blocked syscall depends on the action configured in the profile: with `SCMP_ACT_ERRNO` the call fails with an error, and with `SCMP_ACT_KILL` the kernel kills the process. Seccomp is one of the most effective single hardening measures because it directly limits what the kernel will do on behalf of the container.

### The RuntimeDefault Profile

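By default, Kubernetes runs containers without a seccomp profile unless the kubelet is started with `--seccomp-default`. The `RuntimeDefault` profile is the container runtime's built-in profile (containerd or CRI-O), which blocks approximately 44 dangerous syscalls including `mount`, `reboot`, `kexec_load`, `unshare`, and `bpf`. The exact list varies by runtime and version, and some of these calls are permitted again when the container holds the matching capability.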

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0.0
      securityContext:
        seccompProfile:
          type: RuntimeDefault
```

Apply `RuntimeDefault` to every workload as a starting point. It breaks very few applications because it blocks only syscalls that ordinary applications rarely need.
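To confirm that a profile is actually in effect, the kernel exposes each process's seccomp state in the `Seccomp:` field of `/proc/<pid>/status` (0 = disabled, 1 = strict, 2 = filter). The helpers below are a minimal sketch; the function names `seccomp_mode_name` and `seccomp_mode_of` are illustrative and not part of any tool.

```bash
# Translate the numeric Seccomp field from /proc/<pid>/status into a name.
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Print the seccomp mode of a process (defaults to /proc/self).
seccomp_mode_of() {
  local pid=${1:-self} mode
  # Match "Seccomp:" exactly so the newer "Seccomp_filters:" line is ignored.
  mode=$(awk '/^Seccomp:/ { print $2 }' "/proc/${pid}/status")
  seccomp_mode_name "$mode"
}
```

Inside a container running with `RuntimeDefault`, `seccomp_mode_of 1` should report `filter`. From outside the pod, `kubectl exec app -- grep '^Seccomp:' /proc/1/status` shows the same field, provided the image ships `grep`.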

### Custom Seccomp Profiles

For higher security, create a custom profile that only allows the specific syscalls your application needs. This follows the principle of least privilege at the kernel level.

**Step 1: Record which syscalls your application uses.**

```bash
# Use strace to record syscalls made by the application
strace -f -o /tmp/syscalls.log -e trace=all /path/to/application

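# Extract unique syscall names (with -f and -o, each line starts with the PID,
# followed by the syscall name and its opening parenthesis)
grep -oP '^(?:\d+\s+)?\K\w+(?=\()' /tmp/syscalls.log | sort -u > /tmp/used-syscalls.txt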

# Alternatively, use the OCI seccomp BPF hook to generate a profile automatically
# Install oci-seccomp-bpf-hook, then run the container with:
sudo podman run --annotation io.containers.trace-syscall=of:/tmp/seccomp-profile.json myapp:1.0.0
```

**Step 2: Create the seccomp profile.**

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept4", "access", "arch_prctl", "bind", "brk", "clone", "close",
        "connect", "epoll_create1", "epoll_ctl", "epoll_pwait", "execve",
        "exit_group", "fcntl", "fstat", "futex", "getdents64", "getpid",
        "getsockname", "getsockopt", "ioctl", "listen", "lseek", "madvise",
        "mmap", "mprotect", "munmap", "nanosleep", "newfstatat", "openat",
        "pipe2", "pread64", "read", "recvfrom", "rt_sigaction", "rt_sigprocmask",
        "rt_sigreturn", "sched_getaffinity", "sched_yield", "sendto", "set_robust_list",
        "set_tid_address", "setsockopt", "sigaltstack", "socket", "tgkill",
        "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```
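Rather than typing the allowlist by hand, the profile can be generated from a newline-separated list of syscall names such as the one recorded in Step 1. The following is a minimal sketch using only standard text tools; the function name `build_seccomp_profile` is hypothetical, and the architectures are copied from the example profile above.

```bash
# Read syscall names (one per line) on stdin and print an allowlist profile.
build_seccomp_profile() {
  local names
  # Keep only well-formed names, de-duplicate, quote each, join with commas.
  names=$(grep -E '^[a-z0-9_]+$' | sort -u | sed 's/.*/"&"/' | paste -sd, -)
  printf '{\n'
  printf '  "defaultAction": "SCMP_ACT_ERRNO",\n'
  printf '  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],\n'
  printf '  "syscalls": [\n'
  printf '    { "names": [%s], "action": "SCMP_ACT_ALLOW" }\n' "$names"
  printf '  ]\n'
  printf '}\n'
}
```

For example, `build_seccomp_profile < /tmp/used-syscalls.txt > myapp.json`. A recording only covers the code paths exercised during the trace, so review the result and test it under realistic load before enforcing it.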

**Step 3: Deploy the profile via a Kubernetes SeccompProfile resource or by placing it on nodes.**

```yaml
# Using the Kubernetes seccomp profile directory on nodes
# Place the profile at: /var/lib/kubelet/seccomp/profiles/myapp.json

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/myapp.json
  containers:
    - name: app
      image: myapp:1.0.0
```

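For managed Kubernetes where you cannot place files on nodes, use the Security Profiles Operator to manage seccomp profiles as Kubernetes resources. The operator copies each profile to the nodes, and pods then reference it through the `Localhost` path reported in the resource's `status.localhostProfile` field: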

```yaml
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: myapp-seccomp
  namespace: production
spec:
  defaultAction: SCMP_ACT_ERRNO
  architectures:
    - SCMP_ARCH_X86_64
    - SCMP_ARCH_AARCH64
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:
        - accept4
        - bind
        - clone
        - close
        - connect
        # ... remaining syscalls
```

## AppArmor Profiles

AppArmor provides mandatory access control on distributions that ship it, chiefly Debian, Ubuntu, and SUSE. It restricts file access, network access, and capability usage per program.

### Default Docker/Containerd Profile

The default container runtime profile (`docker-default` or `cri-containerd.apparmor.d`) restricts mounting filesystems, accessing `/proc` and `/sys` files, and loading kernel modules. Like seccomp's RuntimeDefault, this is a reasonable baseline.

### Custom AppArmor Profile

```
# /etc/apparmor.d/myapp
#include <tunables/global>

profile myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Allow reading application files
  /app/** r,
  /app/bin/myapp ix,

  # Allow writing to specific directories only
  /tmp/** rw,
  /var/log/myapp/** rw,

  # Network access: allow TCP only
  network inet stream,
  network inet6 stream,

  # Deny raw sockets (prevents packet sniffing)
  deny network raw,
  deny network packet,

  # Deny mount operations
  deny mount,

  # Deny access to sensitive host paths
  deny /proc/*/mem rw,
  deny /sys/firmware/** rw,
  deny /etc/shadow r,
  deny /etc/passwd w,

  # Deny ptrace (prevents debugging/inspection of other processes)
  deny ptrace,
}
```

Load and apply the profile:

```bash
# Load the profile
sudo apparmor_parser -r /etc/apparmor.d/myapp

# Verify it loaded
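sudo aa-status | grep myapp
```

Apply the profile to a Kubernetes pod. The profile must be loaded on every node where the pod can be scheduled. The annotation below is the pre-v1.30 form; since Kubernetes v1.30 it is deprecated in favor of the `securityContext.appArmorProfile` field (`type: Localhost`, `localhostProfile: myapp`).
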
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/myapp
spec:
  containers:
    - name: app
      image: myapp:1.0.0
```

### SELinux for RHEL/CentOS Nodes

On RHEL-based systems, SELinux provides equivalent mandatory access control. The `container_t` SELinux type is applied to containers by default and restricts host filesystem access, network operations, and inter-process communication.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seLinuxOptions:
      type: container_t
      level: "s0:c123,c456"
  containers:
    - name: app
      image: myapp:1.0.0
```

The `level` field assigns MCS (Multi-Category Security) labels. Containers with different MCS labels cannot access each other's files even if they run on the same node.

## Capability Dropping

Linux capabilities split root's powers into discrete units. With Docker and containerd, containers start with a default set of 14 capabilities; other runtimes may grant a different set. Most applications need none of them.

### Drop All, Add Back Selectively

```yaml
securityContext:
  capabilities:
    drop:
      - ALL
    add: []
```

This is the most secure default. Only add capabilities back when the application fails without them, and add only the specific capability needed:

| Capability | What It Allows | When Needed |
|---|---|---|
| `NET_BIND_SERVICE` | Bind to ports below 1024 | Web servers on port 80/443 |
| `CHOWN` | Change file ownership | Init containers setting up volumes |
| `SETUID` / `SETGID` | Change user/group ID | Applications that drop privileges at startup |
| `DAC_OVERRIDE` | Bypass file permission checks | Rarely legitimate in containers |
| `SYS_PTRACE` | Trace/debug other processes | Debugging sidecars, security tools |
| `NET_RAW` | Use raw sockets | Ping, network diagnostics |

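Avoid these in production workloads: `SYS_ADMIN` (near-equivalent of full root), `SYS_PTRACE` (allows process injection, which can lead to container escape; grant it only to short-lived debugging or security tooling, never to the application itself), and `NET_ADMIN` (allows reconfiguring interfaces, routes, and firewall rules in the pod's network namespace).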
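To see which capabilities a running container actually holds, read the `CapEff` hex mask from `/proc/<pid>/status`; each set bit is one capability, numbered as in `linux/capability.h`. The helpers below are a sketch with hypothetical names (`count_caps`, `has_cap`). The mask `00000000a80425fb` used in the usage note is the value commonly reported for Docker's default capability set.

```bash
# Count the capabilities set in a hex mask (for example, the CapEff value).
count_caps() {
  local mask=$((16#$1)) count=0
  while (( mask != 0 )); do
    count=$(( count + (mask & 1) ))
    mask=$(( mask >> 1 ))
  done
  echo "$count"
}

# Succeed if the capability with the given bit number is set in the mask.
# Bit numbers: 10 = NET_BIND_SERVICE, 12 = NET_ADMIN, 13 = NET_RAW,
#              19 = SYS_PTRACE, 21 = SYS_ADMIN.
has_cap() {
  local mask=$((16#$1)) bit=$2
  [ $(( (mask >> bit) & 1 )) -eq 1 ]
}
```

`count_caps 00000000a80425fb` prints `14`, matching the default set, and `has_cap 00000000a80425fb 21` fails because `SYS_ADMIN` is not part of it. After `drop: [ALL]`, `grep CapEff /proc/1/status` inside the container should show a mask of all zeros.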

## Read-Only Root Filesystem

A read-only root filesystem prevents attackers from modifying binaries, installing tools, or writing scripts in the container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:1.0.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-run
          mountPath: /var/run
        - name: var-cache
          mountPath: /var/cache
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 100Mi
    - name: var-run
      emptyDir:
        sizeLimit: 10Mi
    - name: var-cache
      emptyDir:
        sizeLimit: 50Mi
```

Mount `emptyDir` volumes for every path where the application needs to write. Set `sizeLimit` to prevent a compromised container from filling the node's disk.
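One way to verify the setting took effect is to inspect how `/` is mounted inside the container. The sketch below parses `/proc/mounts`-format lines from stdin; the function name `root_mount_mode` is illustrative.

```bash
# Read /proc/mounts-format lines on stdin and print "ro" or "rw"
# for the filesystem mounted at "/".
root_mount_mode() {
  awk '$2 == "/" {
    n = split($4, opts, ",")
    for (i = 1; i <= n; i++)
      if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
  }
  END { print mode }'
}
```

For example, `kubectl exec app -- cat /proc/mounts | root_mount_mode` should print `ro` for the pod above, assuming the image includes `cat`. The `emptyDir` mounts at `/tmp`, `/var/run`, and `/var/cache` remain writable and appear as separate entries.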

## Falco: Runtime Threat Detection

Falco monitors system calls made by containers in real time and alerts on suspicious behavior. It is the runtime equivalent of an intrusion detection system for containers.

### Installation

```bash
# Install via Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set tty=true \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl=https://hooks.slack.com/services/XXX
```

### Key Detection Rules

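Falco ships with rules that detect common attack patterns. Depending on the Falco version, some of these are enabled in the default ruleset and others must be enabled from the additional rulesets the project maintains: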

- **Terminal shell in container**: Detects interactive shell access (bash, sh, zsh) inside a container. Almost always suspicious in production.
- **Read sensitive file untrusted**: Detects a process that is not on the trusted list reading `/etc/shadow`, `/etc/sudoers`, or private keys.
- **Write below /etc or /bin**: Detects modification of system files, a common persistence technique.
- **Contact K8s API server**: Detects containers making Kubernetes API calls, which is unexpected unless the workload intentionally uses the API.
- **Outbound connection to C2**: Detects connections to known command-and-control infrastructure.

### Custom Rules

Write rules for your specific environment:

```yaml
- rule: Unexpected process in production container
  desc: Detect processes that are not part of the normal application
  condition: >
    spawned_process and
    container and
    container.image.repository = "registry.example.com/myapp" and
    not proc.name in (myapp, node, python, gunicorn)
  output: >
    Unexpected process in production container
    (user=%user.name command=%proc.cmdline container=%container.name
     image=%container.image.repository:%container.image.tag)
  priority: WARNING
  tags: [container, process]

- rule: Sensitive host path accessed in container
  desc: Detect containers accessing files under sensitive host paths
  condition: >
    container and
    (fd.name startswith /etc/kubernetes or
     fd.name startswith /var/lib/kubelet or
     fd.name startswith /var/run/docker.sock)
  output: >
    Sensitive path accessed in container
    (user=%user.name path=%fd.name container=%container.name)
  priority: CRITICAL
  tags: [container, filesystem]
```

### Alert Routing with Falcosidekick

Falcosidekick forwards Falco alerts to external systems:

```yaml
# Values for Falcosidekick Helm installation
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/XXX"
    minimumpriority: "warning"
  pagerduty:
    routingkey: "ROUTING_KEY"
    minimumpriority: "critical"
  elasticsearch:
    hostport: "https://elasticsearch:9200"
    index: "falco-alerts"
    minimumpriority: "notice"
```

## gVisor: Application Kernel Isolation

gVisor interposes a user-space kernel (called Sentry) between the container and the host kernel. System calls from the container are handled by Sentry rather than passed directly to the host kernel, which provides defense-in-depth against kernel vulnerabilities.

### Setup with containerd

```bash
# Install gVisor runsc binary
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | \
  sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt update && sudo apt install -y runsc

# Add gVisor as a containerd runtime
# Add to /etc/containerd/config.toml:
```

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
```

```bash
sudo systemctl restart containerd
```

### Create a RuntimeClass in Kubernetes

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
```

### Use gVisor for Specific Workloads

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor
  containers:
    - name: app
      image: untrusted-app:1.0.0
```

gVisor adds latency to system calls (roughly 2-10x for syscall-heavy workloads; measure with your own workload, since the cost varies widely). Use it for untrusted workloads, multi-tenant environments, or workloads processing untrusted input. Avoid it for latency-sensitive, I/O-heavy workloads such as databases.

## Kata Containers: VM-Level Isolation

Kata Containers runs each pod inside a lightweight virtual machine with its own guest kernel. This provides hardware-level isolation via the hypervisor. A container escape reaches the guest VM, not the host; an attacker would still have to break out of the hypervisor to reach the node.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: sensitive-app:1.0.0
```

Kata Containers has higher memory and startup overhead than gVisor (each pod gets a VM with its own kernel) but provides stronger isolation because it uses hardware virtualization. Use Kata for workloads that require the strongest possible isolation, such as running customer-supplied code or processing classified data.

### Runtime Isolation Comparison

| Feature | runc (default) | gVisor | Kata Containers |
|---|---|---|---|
| Isolation boundary | Linux namespaces + cgroups | User-space kernel | Hardware VM |
| Syscall overhead | None | 2-10x | 1.5-3x |
| Memory overhead | Minimal | ~50MB per sandbox | ~160MB per pod |
| Startup time | <1 second | ~1 second | 2-5 seconds |
| Kernel vulnerability protection | None | Strong | Strongest |
| Compatibility | Full | Most workloads | Most workloads |
| Best for | Trusted workloads | Multi-tenant, untrusted input | Highest-security, multi-tenant |

## Layered Defense

No single mechanism provides complete runtime security. Layer them:

1. **Seccomp**: Restrict which syscalls are available. The kernel-level filter.
2. **AppArmor/SELinux**: Restrict file and network access. The OS-level policy.
3. **Capabilities**: Drop unnecessary root powers. The privilege-level control.
4. **Read-only filesystem**: Prevent runtime modification. The immutability guarantee.
5. **Falco**: Detect when something bypasses the above controls. The detection layer.
6. **gVisor/Kata**: Isolate the workload from the host kernel entirely. The containment layer.

Apply layers 1 through 5 to every workload. Add layer 6 for untrusted or highest-risk workloads. Each layer reduces the attack surface that the next layer must defend.
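As a quick pre-deployment check for layers 1, 3, and 4, a manifest can be scanned for the relevant settings. The sketch below is a plain text match, not a YAML parser, so it cannot tell which container a setting belongs to; the function name `audit_manifest` is hypothetical, and a policy engine or admission controller is the proper enforcement point.

```bash
# Report hardening settings that do not appear anywhere in a pod manifest.
# Returns 0 when all are present, 1 otherwise.
audit_manifest() {
  local file=$1 missing=0 entry label pattern
  # Each entry is "label|extended regex"; only the first "|" is the separator.
  local -a checks=(
    'seccomp profile|type:[[:space:]]*(RuntimeDefault|Localhost)'
    'drop ALL capabilities|^[[:space:]]*-[[:space:]]*"?ALL"?[[:space:]]*$'
    'read-only root filesystem|readOnlyRootFilesystem:[[:space:]]*true'
  )
  for entry in "${checks[@]}"; do
    label=${entry%%|*}
    pattern=${entry#*|}
    if ! grep -Eq -- "$pattern" "$file"; then
      echo "missing: $label"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `audit_manifest pod.yaml` prints one `missing:` line per absent setting and exits non-zero if any are missing, which makes it usable as a simple CI gate.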

