---
title: "Prometheus Architecture Deep Dive"
description: "How Prometheus scraping, storage, service discovery, relabeling, federation, and remote storage work together in production monitoring systems."
url: https://agent-zone.ai/knowledge/observability/prometheus-architecture/
section: knowledge
date: 2026-02-22
categories: ["observability"]
tags: ["prometheus","tsdb","service-discovery","relabeling","federation","remote-write","thanos","mimir"]
skills: ["prometheus-architecture","scrape-configuration","service-discovery-setup","long-term-storage-design"]
tools: ["prometheus","thanos","mimir","cortex","consul"]
levels: ["intermediate"]
word_count: 817
formats:
  json: https://agent-zone.ai/knowledge/observability/prometheus-architecture/index.json
  html: https://agent-zone.ai/knowledge/observability/prometheus-architecture/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Prometheus+Architecture+Deep+Dive
---


## Pull-Based Scraping Model

Prometheus pulls metrics from targets rather than having targets push metrics to it. At every scrape interval (the default `scrape_interval` is 1m; 15s, as in the global config below, is a common production setting), Prometheus sends an HTTP GET to each target's metrics endpoint. The target responds with all its current metric values in the Prometheus exposition format.

This pull model has concrete advantages. Prometheus controls the scrape rate, so a misbehaving target cannot flood the system. You can fetch a target's endpoint from your laptop with `curl http://target:8080/metrics` to see exactly what Prometheus sees. Targets that go down are detected within one scrape interval, because the failed scrape sets `up` to 0.
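A scrape response is plain text in the exposition format: optional `# HELP` and `# TYPE` comments, then one sample per line. The metric below is a hypothetical illustration:

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
```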

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
```

The `scrape_timeout` must not exceed `scrape_interval`; Prometheus refuses to load a configuration where it does. If a target takes longer than the timeout to respond, the scrape is marked as failed and `up{job="..."}` drops to 0.
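Both settings can also be overridden per job, which is useful for expensive endpoints. A sketch with a hypothetical job name:

```yaml
scrape_configs:
  - job_name: "slow-exporter"   # hypothetical job with an expensive /metrics handler
    scrape_interval: 60s        # overrides the global interval for this job only
    scrape_timeout: 30s         # still must not exceed the interval
```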

## TSDB Storage

Prometheus stores data in a custom time-series database on local disk. The TSDB organizes data into two-hour blocks. Incoming samples first go to an in-memory head block, which is write-ahead logged (WAL) for crash recovery. Every two hours, the head block is compacted into an immutable on-disk block.

Compaction merges smaller blocks into larger ones over time, improving query performance for longer time ranges. Each block is a directory containing chunks (the actual sample data), an index (label-to-series mappings), and metadata.
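On disk, the layout looks like this (block ULIDs are generated at compaction time; the one shown is a placeholder):

```
data/
├── 01JXXXXXXXXXXXXXXXXXXXXXXX/   # one two-hour (or compacted) block
│   ├── chunks/                   # compressed sample data
│   ├── index                     # inverted index mapping labels to series
│   ├── meta.json                 # time range and compaction level
│   └── tombstones                # pending deletion markers
├── chunks_head/                  # memory-mapped head chunks
└── wal/                          # write-ahead log for the in-memory head
```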

Key storage flags:

```bash
prometheus \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=36h
```

When both `retention.time` and `retention.size` are set, whichever limit is hit first triggers block deletion. Monitor `prometheus_tsdb_storage_blocks_bytes` to track actual disk usage.

## Service Discovery

Static configs work for a handful of targets but break down in dynamic environments. Prometheus has native service discovery for Kubernetes, Consul, EC2, GCE, Azure, DNS, and file-based discovery.

**Kubernetes SD** discovers targets from the Kubernetes API server:

```yaml
scrape_configs:
  - job_name: "k8s-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["production", "staging"]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Join the host from __address__ with the annotated port; source label
      # values are concatenated with ";" before the regex is applied.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        target_label: __address__
        replacement: ${1}:${2}
```

The `role` field determines what is discovered: `pod`, `service`, `endpoints`, `endpointslice`, `node`, or `ingress`. Each role exposes different `__meta_kubernetes_*` labels for relabeling.
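A common companion pattern is `labelmap`, which copies every discovered pod label onto the target's series. A sketch with a hypothetical job name:

```yaml
scrape_configs:
  - job_name: "k8s-pods-labeled"   # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy each pod label, e.g. app="checkout" from the pod spec
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```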

**Consul SD** discovers services registered in Consul:

```yaml
scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.service.consul:8500"
        services: []  # empty = all services
        tags: ["prometheus"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
```

**File SD** watches JSON or YAML files for target lists, useful for configuration management systems:

```yaml
scrape_configs:
  - job_name: "file-targets"
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m
```

## Relabeling

Relabeling is the most powerful and most confusing part of Prometheus configuration. There are two places it applies, and they serve different purposes.

**`relabel_configs`** runs before the scrape. It manipulates target labels and metadata to control which targets are scraped and how. It operates on labels prefixed with `__meta_` (from service discovery) and on special labels like `__address__`, `__metrics_path__`, and `__scheme__`.

**`metric_relabel_configs`** runs after the scrape, on every individual metric sample. Use it to drop expensive metrics, rename them, or modify their labels.

```yaml
scrape_configs:
  - job_name: "kubelet"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Before scrape: rewrite the port to kubelet's metrics port
      - action: replace
        source_labels: [__address__]
        regex: (.+):(.+)
        replacement: ${1}:10250
        target_label: __address__
    metric_relabel_configs:
      # After scrape: drop high-cardinality metrics we don't need
      - source_labels: [__name__]
        regex: "kubelet_runtime_operations_duration_seconds_bucket"
        action: drop
```

Common relabeling actions: `keep` (only keep targets matching regex), `drop` (discard matching targets), `replace` (regex-replace label values), `labelmap` (copy labels matching a regex pattern), `labeldrop` (remove labels matching regex), `hashmod` (for sharding across multiple Prometheus instances).
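For example, `hashmod` plus `keep` shards targets across multiple Prometheus instances; each instance keeps only the targets whose hash matches its shard number (shard count and number here are illustrative):

```yaml
scrape_configs:
  - job_name: "sharded"   # hypothetical job name
    relabel_configs:
      - source_labels: [__address__]
        modulus: 3               # total number of Prometheus shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"               # this instance keeps shard 0 only
        action: keep
```

Labels beginning with `__` are dropped after relabeling, so `__tmp_hash` never reaches stored series.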

## Federation and Remote Storage

A single Prometheus instance has limits. Federation lets a higher-level Prometheus scrape selected metrics from downstream instances:

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="api-server"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1:9090", "prometheus-dc2:9090"]
```

For true long-term storage, use `remote_write` to send samples to a dedicated backend. The three major options are Thanos, Cortex/Mimir (Grafana Mimir is Cortex's successor), and VictoriaMetrics.

```yaml
remote_write:
  - url: "http://mimir-distributor:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 30
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
```

**Thanos** adds a sidecar to each Prometheus that uploads blocks to object storage (S3, GCS). A querier component provides a unified query interface across all Prometheus instances and historical data. Best for: multi-cluster setups where you want to keep Prometheus instances independent.
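A minimal sidecar invocation, assuming an object-storage config file already exists at a path of your choosing (the one below is hypothetical), looks roughly like:

```bash
# objstore.yaml defines the S3/GCS bucket and credentials
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yaml
```

The `--tsdb.path` must point at the same directory Prometheus writes its blocks to.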

**Grafana Mimir** is a horizontally scalable, multi-tenant TSDB. Prometheus pushes via `remote_write` and Mimir handles storage, compaction, and querying. Best for: large-scale centralized monitoring with strict multi-tenancy requirements.

## When Prometheus Is the Right Choice

Prometheus excels at infrastructure and application monitoring with dimensional data. It handles millions of active time series on a single instance with modest hardware. The ecosystem (exporters, alerting, Grafana integration) is unmatched.

Prometheus is not the right choice for: event logging (use Loki or Elasticsearch), distributed tracing (use Jaeger or Tempo), business analytics requiring SQL joins across metrics and other data, or situations requiring 100% accuracy of every data point (Prometheus may lose samples during scrape failures or restarts). For high-cardinality use cases like per-user metrics, consider a system built for that workload like Mimir or VictoriaMetrics from the start.

