AWS Fundamentals for Agents

IAM: Identity and Access Management#

IAM controls who can do what in your AWS account. Everything in AWS is an API call, and IAM decides which API calls are allowed. There are three concepts an agent must understand: users, roles, and policies.

Users are long-lived identities for humans or service accounts. Roles are temporary identities that can be assumed by users, services, or other AWS accounts. Policies are JSON documents that define permissions. Roles are always preferred over users for programmatic access because they issue short-lived credentials through STS (Security Token Service).
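As a minimal sketch of that flow, the boto3 snippet below assumes a role through STS and uses the returned short-lived credentials for a follow-up call. The role ARN, session name, and the S3 call are illustrative placeholders, not values from this article.

```python
import boto3

# boto3 picks up base credentials from the environment
# (e.g. an instance profile or configured keys).
sts = boto3.client("sts")

response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-read-only",  # hypothetical role
    RoleSessionName="agent-session",
    DurationSeconds=3600,  # temporary credentials expire after an hour
)

creds = response["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration

# Use the short-lived credentials for subsequent API calls.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_buckets()["Buckets"])
```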

AWS Lambda and Serverless Function Patterns#

Lambda runs your code without you provisioning or managing servers. You upload a function, configure a trigger, and AWS handles scaling, patching, and availability. The execution model is simple: an event arrives, Lambda invokes your handler, your handler returns a response. Everything in between – concurrency, retries, scaling from zero to thousands of instances – is managed for you.
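A minimal handler sketch, assuming an API Gateway proxy-style event; the field names and response shape follow that integration and are illustrative rather than prescribed by this article:

```python
import json

def handler(event, context):
    # 'event' is the trigger payload; its shape depends on the event source
    # (API Gateway, S3, SQS, ...). This sketch assumes an API Gateway proxy event.
    # 'context' exposes invocation metadata, e.g. context.get_remaining_time_in_millis().
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")

    # Returning a dict in this shape lets API Gateway map it onto an HTTP response.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```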

That simplicity hides real complexity. Cold starts, timeout limits, memory-to-CPU coupling, VPC attachment latency, and event source mapping behavior all require deliberate design. This article covers the patterns that matter in practice.

Choosing a Monitoring Stack: Prometheus vs Datadog vs Cloud-Native vs VictoriaMetrics#

Monitoring is not optional. Without metrics, you are guessing. The question is not whether to monitor but which stack to use. The right choice depends on your cost tolerance, operational capacity, retention requirements, and how much you value control versus convenience.

Decision Criteria#

Before comparing tools, clarify what matters to your organization:

  • Cost model: Are you optimizing for infrastructure spend or engineering time? Self-managed tools cost less in licensing but more in operational hours. SaaS tools cost more in subscription fees but less in engineering effort.
  • Operational burden: Who manages the monitoring system? Do you have an infrastructure team, or are developers responsible for everything?
  • Data retention: Do you need metrics for 15 days, 90 days, or years? Long retention changes the equation significantly.
  • Query capability: Does your team know PromQL? Do they need ad-hoc analysis or mostly pre-built dashboards?
  • Alerting requirements: Simple threshold alerts, or complex multi-signal alerts with routing and escalation?
  • Team expertise: An organization fluent in Prometheus and PromQL throws away that investment by switching to Datadog; an organization with no Prometheus experience faces a real learning curve before a self-managed stack pays off.

Options at a Glance#

| Capability | Prometheus + Grafana | Prometheus + Thanos/Mimir | VictoriaMetrics | Datadog | Cloud-Native | Grafana Cloud |
|---|---|---|---|---|---|---|
| Cost model | Infrastructure only | Infrastructure only | Infrastructure only | Per host ($15-23/mo) | Per metric/API call | Per series/GB |
| Operational burden | High | Very high | Medium | None | Low | Low |
| Query language | PromQL | PromQL | MetricsQL (PromQL-compatible) | Datadog query language | Vendor-specific | PromQL, LogQL |
| Default retention | 15 days (local disk) | Unlimited (object storage) | Unlimited (configurable) | 15 months | Varies (15 days - 15 months) | Plan-dependent |
| HA built-in | No (requires federation) | Yes | Yes (cluster mode) | Yes | Yes | Yes |
| Multi-cluster | Federation (limited) | Yes (global view) | Yes (cluster mode) | Yes | Per-account | Yes |
| APM/Tracing | No (separate tools) | No (separate tools) | No (separate tools) | Yes (integrated) | Varies | Yes (Tempo) |
| Vendor lock-in | None | None | Low | High | High | Low-Medium |

Prometheus + Grafana (Self-Managed)#

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model, scraping metrics from endpoints at configurable intervals, and stores time series data on local disk. Grafana provides visualization. Alertmanager handles alert routing.
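
As a sketch of what the pull model looks like from the application side, the snippet below uses the prometheus_client library to expose a /metrics endpoint for a Prometheus server to scrape. The metric names, port, and simulated work are illustrative choices, not part of the stack comparison above.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names here are illustrative, not a convention from this article.
REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():          # observe how long the simulated "request" takes
        time.sleep(random.random() / 10)
    REQUESTS.inc()                # count it

if __name__ == "__main__":
    start_http_server(8000)       # exposes /metrics on port 8000 for Prometheus to scrape
    while True:
        handle_request()
```

A scrape job pointed at port 8000 would then pull both series at each configured scrape interval, which is the inversion of control that distinguishes Prometheus from push-based agents.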