# Monitoring

> Set up the OpenSearch + OpenSearch Dashboards + Prometheus monitoring stack for agent observability

Clawker includes an optional monitoring stack that collects telemetry from Claude Code agents running in containers. It provides dashboards for logs and metrics — giving you visibility into what every agent is doing, how much it costs, what tools it's calling, and whether it's making progress.

## What You Get

The stack ingests logs into OpenSearch and exposes them through **OpenSearch Dashboards** for ad-hoc querying and dashboarding. **Prometheus** stores metrics and serves its own UI. You can see per-agent and per-project activity for:

* **Cost and token usage** — API call costs, input/output tokens, rate limiting
* **Coding activity** — file edits, tool invocations, command executions
* **Tool usage** — which tools the agent calls, how often, and how long they take
* **Session detail** — full event timeline for individual agent sessions
* **Egress traffic** — Envoy access logs and CoreDNS query logs from the firewall stack (when enabled)

## Architecture

The monitoring stack runs as four long-lived Docker Compose services on the `clawker-net` network, plus one short-lived bootstrap container that runs once per `monitor up` to preconfigure the cluster:

```
Claude Code (in container)                Control plane (host)
    |                                          |
    | OTLP (HTTP/gRPC) — agent lane            | OTLP/gRPC + mTLS — trusted infra lane
    | logs, metrics, traces                    | zerolog bridge, Envoy access logs,
    |                                          | CoreDNS query logs, netlogger ebpf-egress
    v                                          v
OTEL Collector ──┬──> OpenSearch (logs + traces) ──> OpenSearch Dashboards
                 └──> Prometheus (metrics + spanmetrics) ──> Prometheus UI
                                                       \\
                                                        \\── (also read by OpenSearch Dashboards
                                                              via a preconfigured datasource)
```

Claude Code emits logs, metrics, and traces. Span export is gated on the `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1` flag (baked into the image alongside `CLAUDE_CODE_ENABLE_TELEMETRY=1`). The OTEL Collector's `spanmetrics` connector also produces RED (rate/error/duration) metrics from incoming spans and feeds them into the same Prometheus pipeline.

### Services

| Service                          | Image                                     |       Default Port       | Purpose                                                                                                                                                                                                                                                                                                                                          |
| -------------------------------- | ----------------------------------------- | :----------------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **OTEL Collector**               | `otel/opentelemetry-collector-contrib`    | 4318 (HTTP), 4317 (gRPC) | Receives telemetry from agents and routes it to backends                                                                                                                                                                                                                                                                                         |
| **OpenSearch**                   | `opensearchproject/opensearch`            |           9200           | Stores telemetry across the `claude-code` (Claude Code OTLP), `clawker-cli` (host CLI), `clawkercp` (clawkercp mTLS push), `clawker-envoy` (firewall HTTP/TLS/TCP access logs), `clawker-coredns` (firewall DNS query logs), `clawker-ebpf-egress` (eBPF per-decision egress events) log indices, plus the `traces`/`clawker` (SS4O) trace index |
| **OpenSearch Dashboards**        | `opensearchproject/opensearch-dashboards` |           5601           | UI for querying logs, dashboards, and (via the preconfigured `clawker_prometheus` direct-query datasource) Prometheus metrics                                                                                                                                                                                                                    |
| **Prometheus**                   | `prom/prometheus`                         |           9090           | Time-series metrics storage and UI                                                                                                                                                                                                                                                                                                               |
| **clawker-opensearch-bootstrap** | `curlimages/curl`                         |             —            | One-shot init container. Runs after OpenSearch + Prometheus are reachable, before the collector starts; applies templates, ISM, datasource, workspace, saved objects, then exits                                                                                                                                                                 |

All containers are pre-configured with labels (`dev.clawker.purpose=monitoring`) and attached to the `clawker-net` network. The OpenSearch security plugin is disabled by default for local development, so Dashboards is reachable at `http://localhost:5601` with no login required. If OpenSearch needs more heap, raise the `opensearch_heap_mb` setting, which controls the JVM heap passed through `OPENSEARCH_JAVA_OPTS`.

### Service Hostnames Are Constants

Individual service hostname constants (`MonitoringServiceOtelCollector`, `MonitoringServicePrometheus`, `MonitoringServiceOpenSearchNode`, `MonitoringServiceOpenSearchDashboards`) are defined in `internal/consts/monitoring.go`. The `MonitoringServiceHostnames` slice — consumed by CoreDNS `internalHosts` forward zones — intentionally contains only `otel-collector` and `prometheus`: agents push telemetry through the collector and never query OpenSearch directly, so `opensearch-node` and `opensearch-dashboards` are excluded from the forwarded set (containers on `clawker-net` that do need those reach them via Docker's embedded resolver).

### How Agents Connect

OTEL endpoints are baked into every Clawker image at build time. At container creation, Clawker checks whether the monitoring stack is running and conditionally enables telemetry:

* `OTEL_EXPORTER_OTLP_ENDPOINT` — single base URL pointing at the collector's OTLP/HTTP listener (baked into image). The Claude Code SDK appends `/v1/metrics`, `/v1/logs`, `/v1/traces` per signal
* `OTEL_METRICS_EXPORTER` / `OTEL_LOGS_EXPORTER` / `OTEL_TRACES_EXPORTER` — all set to `otlp`
* `CLAUDE_CODE_ENABLE_TELEMETRY` — set to `1` when the `otel-collector` container is detected as running on `clawker-net`, set to `0` otherwise
* `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA` — set to `1` (gates Claude Code's span export; both this and `CLAUDE_CODE_ENABLE_TELEMETRY` must be `1` for traces to flow)
* `OTEL_RESOURCE_ATTRIBUTES` — tags with `project=` and `agent=` for filtering

**Key behavior:** Telemetry is **only active when the monitoring stack is running** at the time the container is created. If the stack isn't up, Clawker explicitly disables telemetry so agents don't waste time attempting exports to an unreachable collector. You can force-enable telemetry via `agent.env` in your `.clawker.yaml` even when the stack is down, but Claude Code won't retry failed collector connections.

**Start the monitoring stack before starting your agents** to ensure telemetry is captured.
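
To confirm what a given container ended up with, the variables listed above can be inspected from a shell inside the agent container. The following is a minimal sketch; `telemetry_state` and `resource_attr` are hypothetical helper names, not part of Clawker, and the state labels they print are illustrative.

```bash
# Hypothetical helpers (not part of Clawker); run inside an agent container.

# Report whether telemetry is off, on without spans, or on with spans,
# based on the two Claude Code flags described above.
telemetry_state() {
  if [ "${CLAUDE_CODE_ENABLE_TELEMETRY:-0}" != "1" ]; then
    echo "off"
  elif [ "${CLAUDE_CODE_ENHANCED_TELEMETRY_BETA:-0}" = "1" ]; then
    echo "on-with-traces"
  else
    echo "on-without-traces"
  fi
}

# Print the value of one key from OTEL_RESOURCE_ATTRIBUTES,
# which holds comma-separated key=value pairs.
resource_attr() {
  printf '%s\n' "${OTEL_RESOURCE_ATTRIBUTES:-}" | tr ',' '\n' | sed -n "s/^$1=//p"
}

# Example:
#   telemetry_state
#   resource_attr project
#   resource_attr agent
```

If `telemetry_state` prints `off` for a container that was created before `clawker monitor up`, recreating the container after the stack is running is the expected fix, since the flag is decided at container creation.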

## Egress Traffic Visibility

When the firewall is enabled, Envoy access logs and CoreDNS query logs flow into OpenSearch alongside agent telemetry. A per-decision-point eBPF event stream is ingested as well, which closes the bypass-mode forensic gap.

* **Envoy access logs** (`clawker-envoy`) — `server.address` (host the client asked for), `network.peer.address`/`network.peer.port` (post-resolution upstream), `network.transport`/`network.protocol.name`/`network.protocol.version`, `tls.established`/`tls.protocol.version`/`tls.cipher`, `action` (clawker firewall verdict: `allowed`/`denied`), `response_code`/`response_code_details` (distinguishes Envoy `direct_response` from upstream `via_upstream`), duration breakdown (`req_duration_ms`/`resp_duration_ms`/`resp_tx_duration_ms`/`duration_ms`), and byte counters. Query `action:denied` for clawker firewall blocks; `response_code:>=400 AND action:allowed` for upstream errors.
* **CoreDNS query logs** (`clawker-coredns`) — queried `domain`, `qtype` (A/AAAA), `rcode` (NOERROR / NXDOMAIN), resolution `duration`. `rcode:NXDOMAIN` is **ambiguous**: it covers both DNS-layer firewall blocks (non-allowlisted host) and legitimate misses inside allowed zones (typo, missing record). CoreDNS makes no per-query allow/deny decision — correlate `domain` against the allowlist, or use the eBPF egress stream below (explicit `action:denied`) for the authoritative block signal.
* **eBPF egress events** (`clawker-ebpf-egress`) — one record per connect/sendmsg/sock\_create decision carrying `action` (`allowed` / `denied` / `bypassed`), attribution (`agent`, `project`, `container_id`), the 4-tuple, and the resolved domain. The bypass case is the headline: bypassed traffic skips Envoy and CoreDNS entirely but still produces a netlogger record, so bypass windows leave a complete audit trail. See [Egress Observability](/observability) for the record shape and per-attribute reference.
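
The queries mentioned above can also be run from the command line against the OpenSearch API. This is a sketch under assumptions: it uses the default port 9200, assumes each index is addressable by the same name as its index pattern, and the helper names `envoy_query` and `os_search` are hypothetical.

```bash
# Hypothetical helpers (not part of Clawker).

# Map a question to the query string given in the list above.
envoy_query() {
  case "$1" in
    blocks)          echo 'action:denied' ;;
    upstream-errors) echo 'response_code:>=400 AND action:allowed' ;;
    *) echo "unknown query: $1" >&2; return 1 ;;
  esac
}

# Run a query-string search against one index.
# Assumes OpenSearch listens on the default port 9200.
os_search() {
  curl -fsS -G "http://localhost:9200/$1/_search" \
    --data-urlencode "q=$2" \
    --data-urlencode "size=${3:-10}"
}

# Example:
#   os_search clawker-envoy "$(envoy_query blocks)"
#   os_search clawker-envoy "$(envoy_query upstream-errors)" 50
#   os_search clawker-ebpf-egress 'action:bypassed'
```

`--data-urlencode` with `-G` keeps the `>=` and spaces in the query string intact without manual escaping.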

## Setup

### 1. Initialize Configuration

```bash theme={"dark"}
clawker monitor init
```

This scaffolds configuration files in your data directory (`~/.local/share/clawker/monitor/`):

* `compose.yaml` — Docker Compose definition for all services (four long-lived + one-shot bootstrap)
* `otel-config.yaml` — OpenTelemetry Collector pipeline configuration
* `prometheus.yaml` — Prometheus scrape targets

Use `--force` to regenerate files if they already exist.
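
A quick way to confirm the scaffold is complete is to check for the three files listed above. This is a minimal sketch; `missing_monitor_files` is a hypothetical helper, and the default path assumes the data directory shown above.

```bash
# Hypothetical helper (not part of Clawker): print each scaffolded file
# that is missing from the monitor directory. Prints nothing if all exist.
missing_monitor_files() {
  local dir="${1:-$HOME/.local/share/clawker/monitor}" f
  for f in compose.yaml otel-config.yaml prometheus.yaml; do
    [ -f "$dir/$f" ] || echo "$f"
  done
}

# Example:
#   missing_monitor_files
#   missing_monitor_files /path/to/other/data/dir/monitor
```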

### 2. Start the Stack

```bash theme={"dark"}
clawker monitor up
```

The stack runs in the background. Service URLs are printed on startup:

```
OpenSearch Dashboards: http://localhost:5601
OpenSearch API:        http://localhost:9200
Prometheus:            http://localhost:9090
```
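
Because the stack starts in the background, scripts that launch agents right afterwards may want to wait until the services answer. The sketch below polls each service over HTTP; `wait_for_url` is a hypothetical helper, and the health paths are the standard ones for OpenSearch, OpenSearch Dashboards, and Prometheus on the default ports rather than anything Clawker-specific.

```bash
# Hypothetical helper (not part of Clawker): poll a URL until it answers
# without an HTTP error (status below 400) or the attempts run out.
wait_for_url() {
  local url="$1" tries="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$tries" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (default ports assumed):
#   wait_for_url http://localhost:9200/_cluster/health
#   wait_for_url http://localhost:9090/-/ready
#   wait_for_url http://localhost:5601/api/status
```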

### 3. Run Agents

Start agents as usual — telemetry flows automatically:

```bash theme={"dark"}
clawker run -it --agent dev @
```

### 4. Explore in OpenSearch Dashboards

Open OpenSearch Dashboards at `http://localhost:5601`. From the splash / welcome screen, under the **Analytics** panel on the far right, click **Clawker** to enter the auto-created workspace.

Inside the workspace, the left navbar has an **Explore** section — click **Logs** or **Metrics** to browse. The preconfigured index patterns (`claude-code`, `clawker-cli`, `clawkercp`, `clawker-envoy`, `clawker-coredns`, `clawker-ebpf-egress`) and the `clawker_prometheus` direct-query datasource are already wired, so logs and Prometheus metrics are both reachable from inside OSD. Raw Prometheus is also still available at `http://localhost:9090` if you prefer it.

**Dashboards.** Three are preinstalled under the workspace's **Dashboards** view:

* **Claude Code Cost & Usage** — KPI strip over Claude Code sessions, cost, input/output/cache tokens. Sourced from Prometheus counters.
* **Claude Code Activity** — Claude Code security activity audit: prompts, tooling, code editing, permissions, hooks, MCP, plugins — totals, distributions, and full event tables.
* **Clawker Networking** — per-source firewall telemetry: three event-stream panels (Envoy access logs, CoreDNS query log, eBPF egress decisions) plus three verdict pies (Envoy `action`, CoreDNS `rcode`, eBPF `action`). Each verdict pie uses its source's own field, and values are not normalized, so each component's truth stays visible (e.g. CoreDNS denies show as `NXDOMAIN`, not collapsed into `denied`). Red slices mark denials. Sourced from OpenSearch log indices; there are no Prometheus counters for network telemetry today.

Build your own off the index patterns + Prometheus datasource for anything not covered.

<Warning>
  **Prometheus label naming: use `kind` instead of `type`**

  Claude Code's metrics docs reference a `type` label on a few counters (`claude_code_token_usage_tokens_total`, `claude_code_active_time_seconds_total`, `claude_code_lines_of_code_count_total`). The monitoring stack stores these as `kind` instead — same values (`input`/`output`/`cacheRead`/`cacheCreation`, `cli`/`user`, `added`/`removed`), only the key is renamed.

  Why: a known bug in the current OpenSearch SQL plugin's direct-query Prometheus connector causes the OSD Explore "Metrics" UI to error out (`missing type id property 'type'`) on any Prometheus series that carries a label literally named `type`. The OTEL Collector renames the label at ingest so the UI works. The native Prometheus UI at `http://localhost:9090` is unaffected either way; `type` would work there if it existed.

  When writing your own PromQL or saved searches against these metrics, query with `kind=` rather than `type=`.
</Warning>
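
When adapting PromQL copied from Claude Code's metrics docs, the label rename can be applied mechanically. The following is a rough sketch, not a full PromQL parser: `type_to_kind` is a hypothetical helper that only rewrites label matchers inside `{...}`, so clauses such as `by (type)` still need a manual edit.

```bash
# Hypothetical helper (not part of Clawker): rewrite `type` label matchers
# to `kind` in a PromQL expression read from stdin.
# Only matchers that follow "{" or "," are touched.
type_to_kind() {
  sed -E 's/([{,][[:space:]]*)type([[:space:]]*(=~|!=|!~|=))/\1kind\2/g'
}

# Example (assumes the default Prometheus port):
#   q=$(echo 'sum(claude_code_token_usage_tokens_total{type="input"})' | type_to_kind)
#   curl -fsS -G http://localhost:9090/api/v1/query --data-urlencode "query=$q"
```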

## Bootstrap

The stack is preconfigured on every `monitor up`. A one-shot `clawker-opensearch-bootstrap` compose service runs after OpenSearch reports healthy and Prometheus is up, then applies:

* **Component templates** with shared mappings (`@timestamp` as date, `service.name` / `ingest_source` / `project` / `agent` as keyword)
* **Index templates** with per-source field mappings for each log index plus the SS4O `traces` mapping (so dynamic mapping never locks the wrong type at first ingest). The `claude-code` template carries the full Claude Code log + span field set.
* **ISM retention policy** (7-day rollover-to-delete, auto-attached to all indices via `ism_template.index_patterns`)
* **Empty index pre-creation** for the log indices so OSD Discover and dashboards don't error on initial load before the first record arrives
* **`clawker_prometheus` direct-query datasource** registered via the OpenSearch SQL plugin so OSD can read Prometheus metrics without a federated proxy
* **OSD `Clawker` workspace** with `features: ["use-case-all"]` so the explore-flavor `metrics` / `logs` / `traces` nav groups all mount
* **Saved objects** imported INTO the Clawker workspace — index patterns for every log index, the `Claude Code Cost & Usage` KPI dashboard (sessions, cost, input/output/cache tokens), the `Claude Code Activity` dashboard (prompts, tooling, code editing, permissions, hooks), and the `Clawker Networking` dashboard (Envoy / CoreDNS / eBPF event streams + verdict pies). Build additional dashboards off the index patterns + Prometheus datasource as needed.

`otel-collector` gates on bootstrap completing successfully, so it never starts until the cluster is preconfigured. Prometheus starts in parallel; bootstrap depends on Prometheus being up so the `clawker_prometheus` datasource registration can validate the configured URI. Bootstrap failure surfaces in `docker logs clawker-opensearch-bootstrap` and leaves the stack half-up by design (so wrong-mapped indices can't be silently created). Templates only apply at index creation, so under the throwaway-stack model, picking up template or policy edits requires `clawker monitor down --volumes && clawker monitor up`.
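
One way to sanity-check a bootstrap run is to confirm that the pre-created log indices exist. This is a sketch under assumptions: `missing_log_indices` is a hypothetical helper, it reads index names from OpenSearch's `_cat/indices` API, and it accepts either the bare index name or the name followed by a `-` suffix in case the indices carry a rollover suffix.

```bash
# Hypothetical helper (not part of Clawker): read index names on stdin and
# print each expected log index that is absent. Prints nothing if all exist.
# A name matches either exactly or when followed by a "-" suffix (assumption).
missing_log_indices() {
  local listing name
  listing=$(cat)
  for name in claude-code clawker-cli clawkercp clawker-envoy clawker-coredns clawker-ebpf-egress; do
    printf '%s\n' "$listing" | grep -Eq "^${name}(-|[[:space:]]|\$)" || echo "$name"
  done
}

# Example (assumes the default OpenSearch port):
#   curl -fsS 'http://localhost:9200/_cat/indices?h=index' | missing_log_indices
```

If any name is printed, `docker logs clawker-opensearch-bootstrap` is the first place to look.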

## Telemetry Controls

Fine-tune what telemetry is collected via `settings.yaml`:

```yaml theme={"dark"}
monitoring:
  telemetry:
    log_tool_details: true       # Include tool invocation details
    log_user_prompts: true       # Include user prompts in logs
    include_account_uuid: true   # Include account identifier
    include_session_id: true     # Include session identifier
```

## Port Configuration

Override default ports in `settings.yaml` if they conflict with other services:

```yaml theme={"dark"}
monitoring:
  otel_collector_port: 4318
  otel_grpc_port: 4317
  opensearch_port: 9200
  opensearch_dashboards_port: 5601
  opensearch_heap_mb: 512        # JVM -Xms/-Xmx for the OpenSearch node
  prometheus_port: 9090
  prometheus_metrics_port: 8889
```

After changing ports, regenerate config files and restart:

```bash theme={"dark"}
clawker monitor init --force
clawker monitor down
clawker monitor up
```

## Checking Status

```bash theme={"dark"}
clawker monitor status
```

Shows container status (running/stopped) and service URLs for running services.

## Teardown

```bash theme={"dark"}
# Stop the stack (preserves data)
clawker monitor down

# Stop and remove all data volumes
clawker monitor down --volumes
```

Without `--volumes`, monitoring data persists across restarts (named volume `clawker-opensearch` for indices, `clawker-prometheus` for TSDB). The `clawker-net` network is preserved for other Clawker services (firewall, agents).
