Network Policies: Namespace Isolation and Pod-to-Pod Rules

Network Policies: Namespace Isolation and Pod-to-Pod Rules#

By default, every pod in a Kubernetes cluster can talk to every other pod. Network policies let you restrict that. They are namespace-scoped resources that select pods by label and define allowed ingress and egress rules.

Critical Prerequisite: CNI Support#

Network policies are only enforced if your CNI plugin supports them. Calico, Cilium, and Weave all support network policies. Flannel does not. If you are running Flannel, you can create NetworkPolicy resources without errors, but they will have absolutely no effect. This is a silent failure that wastes hours of debugging.

Network Security Layers

Defense in Depth#

No single network control stops every attack. Layer controls so that a failure in one does not compromise the system: host firewalls, Kubernetes network policies, service mesh encryption, API gateway authentication, and DNS security, each operating independently.

Host Firewall: iptables and nftables#

Every node should run a host firewall regardless of the orchestrator. Block everything by default:

# iptables: default deny with essential allows
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT

# Allow established connections
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow SSH from management network only
iptables -A INPUT -p tcp --dport 22 -s 10.0.100.0/24 -j ACCEPT

# Allow kubelet API (for k8s nodes)
iptables -A INPUT -p tcp --dport 10250 -s 10.0.0.0/16 -j ACCEPT

# Allow loopback
iptables -A INPUT -i lo -j ACCEPT

The nftables equivalent is more readable for complex rulesets:

Node Drain and Cordon: Safe Node Maintenance

Node Drain and Cordon#

Node maintenance is a routine part of cluster operations: kernel patches, instance type changes, Kubernetes upgrades, hardware replacement. The tools are kubectl cordon (stop scheduling new pods) and kubectl drain (evict existing pods). Getting the flags and sequence right is the difference between a seamless operation and a production incident.

Cordon: Mark Unschedulable#

Cordon sets the spec.unschedulable field on a node to true. The scheduler will not place new pods on it, but existing pods continue running undisturbed.

Observability Stack Troubleshooting: Diagnosing Prometheus, Alertmanager, Grafana, and Pipeline Failures

“I’m Not Seeing Metrics” – Systematic Diagnosis#

This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.

Step 1: Is the Target Being Scraped?#

Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.

Status: UP    Last Scrape: 3s ago    Duration: 12ms    Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms     Error: connection refused

If the target does not appear at all, Prometheus does not know about it. This means the scrape configuration (or ServiceMonitor) is not matching the target. Jump to the ServiceMonitor checklist at the end of this guide.

OpenTelemetry for Kubernetes

What OpenTelemetry Is#

OpenTelemetry (OTel) is a vendor-neutral framework for generating, collecting, and exporting telemetry data: traces, metrics, and logs. It provides APIs, SDKs, and the Collector – a standalone binary that receives, processes, and exports telemetry. OTel replaces the fragmented landscape of Jaeger client libraries, Zipkin instrumentation, Prometheus client libraries, and proprietary agents with a single standard.

The three signal types:

  • Traces: Record the path of a request through distributed services as a tree of spans. Each span has a name, duration, attributes, and parent reference.
  • Metrics: Numeric measurements (counters, gauges, histograms) emitted by applications and infrastructure. OTel metrics can be exported to Prometheus.
  • Logs: Structured log records correlated with trace context. OTel log support bridges existing logging libraries with trace correlation.

The OTel Collector Pipeline#

The Collector is the central hub. It has three pipeline stages:

Pod Lifecycle and Probes: Init Containers, Hooks, and Health Checks

Pod Lifecycle and Probes#

Understanding how Kubernetes starts, monitors, and stops pods is essential for running reliable services. Misconfigurations here cause cascading failures, dropped requests, and restart loops that are difficult to diagnose.

Pod Startup Sequence#

When a pod is scheduled, this is the exact order of operations:

  1. Init containers run sequentially. Each must exit 0 before the next starts.
  2. All regular containers start simultaneously.
  3. postStart hooks fire (in parallel with the container’s main process).
  4. Startup probe begins checking (if defined).
  5. Once the startup probe passes, liveness and readiness probes begin.

Init Containers#

Init containers run before your application containers and are used for setup tasks: waiting for a dependency, running database migrations, cloning config from a remote source.

Pod Security Standards: Admission Control and Secure Pod Configuration

Pod Security Standards#

Kubernetes Pod Security Standards define three security profiles that control what pods are allowed to do. Pod Security Admission (PSA) enforces these standards at the namespace level. This is the replacement for PodSecurityPolicy, which was removed in Kubernetes 1.25.

The Three Levels#

Privileged – Unrestricted. No security controls applied. Used for system-level workloads like CNI plugins, storage drivers, and logging agents that genuinely need host access.

Baseline – Prevents known privilege escalations. Blocks hostNetwork, hostPID, hostIPC, privileged containers, and most host path mounts. Allows most workloads to run without modification.

PodDisruptionBudgets Deep Dive

PodDisruptionBudgets Deep Dive#

A PodDisruptionBudget (PDB) limits how many pods from a set can be simultaneously down during voluntary disruptions – node drains, cluster upgrades, autoscaler scale-down. PDBs do not protect against involuntary disruptions like node crashes or OOM kills. They are the mechanism by which you tell Kubernetes “this service needs at least N healthy pods at all times during maintenance.”

minAvailable vs maxUnavailable#

PDBs support two fields. Use one or the other, not both.

Prometheus and Grafana Monitoring Stack

Prometheus Architecture#

Prometheus pulls metrics from targets at regular intervals (scraping). Each target exposes an HTTP endpoint (typically /metrics) that returns metrics in a text format. Prometheus stores the scraped data in a local time-series database and evaluates alerting rules against it. Grafana connects to Prometheus as a data source and renders dashboards.

Scrape Configuration#

The core of Prometheus configuration is the scrape config. Each scrape_config block defines a set of targets and how to scrape them.

RBAC Patterns: Practical Access Control for Kubernetes

RBAC Patterns#

Kubernetes RBAC controls who can do what to which resources. It is built on four objects: Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings. Getting RBAC right means understanding how these four pieces compose and knowing the common patterns that cover 90% of real-world needs.

The Four RBAC Objects#

Role – Defines permissions within a single namespace. Lists API groups, resources, and allowed verbs.

ClusterRole – Defines permissions cluster-wide or for non-namespaced resources (nodes, persistent volumes, namespaces themselves).