observability — piclaw-addons

Open Settings → Add-Ons and pick observability

OpenTelemetry observability for piclaw — trace errors and agent turns across multiple instances to Azure Application Insights (with Live Metrics Stream) and local Graphite.

Requires Piclaw >=2.0.0.

The add-on builds on the runtime's structured log-sink contract. The runtime never imports OTel; it only emits structured log records. This addon subscribes to those records and derives OTel spans, exceptions, and Graphite metrics from them.

The add-on keeps one telemetry/exporter runtime per Piclaw process and multiplexes all chat/session activity through shared tracer state keyed by chatJid, turnId, and sessionLeafId. A single session shutting down does not tear down telemetry for other active sessions.

Setup

1. Install

Open Settings → Add-Ons and install observability from the catalog.

2. Configure via Settings → Observability

The pane loads/saves non-secret settings through the direct backend add-on config API (/agent/addons/api/observability/config). The connection string can be pasted directly into the settings pane — it is saved to the keychain automatically as azure/appinsights-connection-string. Changes are applied live to the process-wide telemetry runtime.

Observability settings pane on the microVM test instance

Field	Type	Default	Description
Enabled	checkbox	off	Master switch
Instance name	text	`hostname()`	Identifies this instance in App Insights (`cloud_RoleInstance`). Set to e.g. `smith`, `relay`, `orangepi`.
App Insights enabled	checkbox	on	Sub-toggle for the Azure backend
Connection string	password	—	Paste the App Insights connection string directly. Saved to keychain as `azure/appinsights-connection-string`.
Live Metrics Stream	checkbox	on	Real-time telemetry in the Azure portal (QuickPulse)
Standard metrics	checkbox	on	OTel standard metrics collection (CPU, memory, request rate)
Sampling ratio	number	1	0–1. 1 = send all traces. 0.5 = sample 50%.
Graphite enabled	checkbox	off	Sub-toggle for Carbon plaintext push
Host	text	—	Graphite/Carbon receiver host, e.g. `192.168.1.250`
Port	number	2003	Carbon plaintext port
Metric prefix	text	`piclaw`	Root prefix for all Graphite metric paths
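
The defaults in the table can be summarised as a small normalisation step. The following is a minimal sketch, not the add-on's actual code: apart from `instance_name`, every key name here is hypothetical, and the clamping and port-validation rules are assumptions about what "0–1" and "Carbon plaintext port" imply. The connection string is deliberately absent because it lives in the keychain rather than in the stored settings.

```typescript
// Hypothetical settings shape. Only `instance_name` is a key named in this
// README; the remaining key names are illustrative.
interface ObservabilityConfig {
  enabled: boolean;
  instance_name: string;
  appInsightsEnabled: boolean;
  liveMetrics: boolean;
  standardMetrics: boolean;
  samplingRatio: number;
  graphiteEnabled: boolean;
  graphiteHost: string;
  graphitePort: number;
  metricPrefix: string;
}

// Fill in the documented defaults. `osHostname` is passed in so the sketch
// stays free of runtime-specific imports.
function withDefaults(
  partial: Partial<ObservabilityConfig>,
  osHostname: string,
): ObservabilityConfig {
  const ratio = partial.samplingRatio ?? 1;
  const port = partial.graphitePort ?? 2003;
  return {
    enabled: partial.enabled ?? false,
    instance_name: partial.instance_name?.trim() || osHostname,
    appInsightsEnabled: partial.appInsightsEnabled ?? true,
    liveMetrics: partial.liveMetrics ?? true,
    standardMetrics: partial.standardMetrics ?? true,
    // Assumption: out-of-range ratios are clamped to the documented 0..1 range.
    samplingRatio: Number.isFinite(ratio) ? Math.min(1, Math.max(0, ratio)) : 1,
    graphiteEnabled: partial.graphiteEnabled ?? false,
    graphiteHost: partial.graphiteHost ?? "",
    // Assumption: an invalid port falls back to the default Carbon port.
    graphitePort: Number.isInteger(port) && port > 0 && port < 65536 ? port : 2003,
    metricPrefix: partial.metricPrefix?.trim() || "piclaw",
  };
}
```

Calling `withDefaults({}, "orangepi")` yields a disabled configuration whose instance name is the hostname, matching the "off" and `hostname()` defaults in the table.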

Storage model

What	Where
App Insights connection string	Keychain — entry `azure/appinsights-connection-string`. Entered directly in the settings pane.
All other settings	Runtime database — extension KV store (SQLite, global scope, extension ID `observability`)
App Insights actor/session identity	Derived on the backend from Piclaw log records (`chatJid`, `sessionLeafId`, `turnId`)

No config files are written to disk.

3. Deploy to other instances

Each piclaw instance needs:

The addon installed
The same keychain entry with the App Insights connection string
instance_name set to a unique value in Settings → Observability

Architecture

How it works

The addon uses piclaw's log-sink contract — a generic API that any addon can use. Server-side spans are derived from runtime records. The add-on does not install browser telemetry, wrap fetch, wrap EventSource, or load the browser Application Insights SDK.

Server side:

text

runtime                              addon
───────                              ─────
log.info("Prompting session", {
  operation: "run_agent.prompt",     ──►  sink receives record
  chatJid: "web:default",                 creates Span "agent.turn"
  model: "azure-openai/gpt-5-4",         stores in inflightTurns map
})

  ... model runs, tools fire ...

log.info("Tool execution ended", {
  operation: "tool.call.end",        ──►  sink receives record
  chatJid: "web:default",                 creates child Span "tool.call"
  toolName: "bash",                       pushes Graphite metric
  durationMs: 320,
})

log.info("Agent run completed", {
  operation: "run_agent.complete",   ──►  sink receives record
  chatJid: "web:default",                 finds inflight span
  durationMs: 4523,                       ends span → App Insights
})                                        pushes Graphite metrics

If the addon isn't installed, no sink is registered, so the runtime does no telemetry work beyond its normal logging.

See the runtime observability docs for the full log-sink API and operation reference.
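
The pairing of `run_agent.prompt` with a later completion record, as drawn in the diagram above, can be sketched as follows. This is an illustration under stated assumptions, not the add-on's implementation: `LogRecord`, `FinishedTurn`, and `TurnTracker` are hypothetical names, the real sink signature is defined by the runtime, and a plain array stands in for the OTel span that would actually be ended and exported.

```typescript
// Hypothetical record shape: field names follow the diagram above, but the
// real log-sink record type lives in the runtime and may differ.
interface LogRecord {
  operation: string;
  chatJid: string;
  turnId?: string;
  durationMs?: number;
}

interface FinishedTurn {
  name: "agent.turn";
  key: string;
  durationMs: number;
  status: "OK" | "ERROR";
}

class TurnTracker {
  // Start timestamps (ms) of in-flight turns, keyed by turnId, else chatJid.
  private readonly inflight = new Map<string, number>();
  // Stand-in for spans that would be ended and exported.
  readonly finished: FinishedTurn[] = [];

  handle(record: LogRecord, nowMs: number): void {
    const key = record.turnId ?? record.chatJid;
    switch (record.operation) {
      case "run_agent.prompt":
        this.inflight.set(key, nowMs);
        break;
      case "run_agent.complete":
      case "run_agent.no_terminal_reply": {
        const startedAt = this.inflight.get(key);
        if (startedAt === undefined) break; // no matching prompt record
        this.inflight.delete(key);
        this.finished.push({
          name: "agent.turn",
          key,
          // Prefer the runtime-reported duration; otherwise measure it here.
          durationMs: record.durationMs ?? nowMs - startedAt,
          status: record.operation === "run_agent.complete" ? "OK" : "ERROR",
        });
        break;
      }
    }
  }

  get inflightCount(): number {
    return this.inflight.size;
  }
}
```

Because the map is keyed per turn, finishing one chat's turn leaves the entries for other chats untouched, which is the property the shared process-wide tracer state relies on.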

Instance identity

OTel Resource attribute	App Insights field	Value
`service.name`	`cloud_RoleName`	`piclaw`
`service.instance.id`	`cloud_RoleInstance`	config `instance_name` (or hostname)
`host.name`	—	always OS `hostname()`
`deployment.environment`	custom dimension	auto-detected: `docker` / `lxc` / `host-native`
`service.version`	—	piclaw package version
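
As a rough illustration of the table above, the resource attributes could be assembled like this. The function name and its option names are hypothetical, and environment detection is left to the caller because this README does not describe how `docker` / `lxc` / `host-native` is detected.

```typescript
type DeploymentEnvironment = "docker" | "lxc" | "host-native";

// Build the OTel Resource attributes listed in the table. `instanceName`
// corresponds to the `instance_name` setting; when it is missing or blank,
// the OS hostname is used for `service.instance.id`.
function buildResourceAttributes(opts: {
  instanceName?: string;
  osHostname: string;
  environment: DeploymentEnvironment;
  version: string;
}): Record<string, string> {
  const instance = opts.instanceName?.trim() || opts.osHostname;
  return {
    "service.name": "piclaw",
    "service.instance.id": instance,
    "host.name": opts.osHostname, // always the OS hostname
    "deployment.environment": opts.environment,
    "service.version": opts.version,
  };
}
```

With `instanceName: "smith"` on a host called `pi-01`, `service.instance.id` becomes `smith` while `host.name` stays `pi-01`, so both the logical and physical identity remain queryable.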

Application Insights user/session model

The goal is to make the standard Application Insights UX behave as if Piclaw were a normal web application, while still deriving all telemetry from backend runtime events.

App Insights concept	Piclaw source	OTel/App Insights fields emitted
User	Chat/agent actor	`enduser.id = chatJid`, `enduser.pseudo.id = chatJid`, `piclaw.chat_jid`, `piclaw.actor.id`
Authenticated user	Same stable actor identity	Azure Monitor maps `enduser.id` to `ai.user.authUserId`; `ai.user.authUserId` is also kept as a custom dimension
User ID	Same stable actor identity	Azure Monitor maps `enduser.pseudo.id` to `ai.user.id`; `ai.user.id` is also kept as a custom dimension
Session	Piclaw runtime session/fork	`session.id`, `ai.session.id`, `piclaw.session.id`; value is `sessionLeafId` when available, otherwise `chatJid`
Operation / transaction	One agent turn	`piclaw.turn_id`; child model/tool spans share the same trace/operation
Request	User-visible agent turn	`agent.turn` SERVER span, request-style attributes (`http.route=/agent/turn`)
Dependency	Work performed by the turn	`model.call` and `tool.call` CLIENT/dependency spans; `provider.error` is an error span
Metrics	Spend and performance	token dimensions on `model.call`, duration/count metrics in Graphite, standard Azure Monitor metrics when enabled

Why these fields

Azure Monitor's OpenTelemetry exporter maps:

OTel attribute	App Insights field
`enduser.id`	`ai.user.authUserId`
`enduser.pseudo.id`	`ai.user.id`

The exporter does not currently map session.id into the App Insights session tag for spans, so the add-on emits both standard (session.id) and App Insights-style (ai.session.id) attributes as queryable dimensions. This keeps the data available in Transaction Search/KQL and gives us a single place to add a custom exporter/processor later if needed.
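
The identity fields described in this section can be derived from a log record in one place. The sketch below is an assumption-laden illustration: `identityAttributes` is a hypothetical helper, and the attribute set is taken from the tables and span samples in this README rather than from the add-on's source.

```typescript
// Derive user and session attributes from a runtime log record.
// The session value is sessionLeafId when available, otherwise chatJid.
function identityAttributes(record: {
  chatJid: string;
  sessionLeafId?: string;
  turnId?: string;
}): Record<string, string> {
  const sessionId = record.sessionLeafId || record.chatJid;
  const attrs: Record<string, string> = {
    // Mapped by the Azure Monitor exporter to the App Insights user fields.
    "enduser.id": record.chatJid,
    "enduser.pseudo.id": record.chatJid,
    // Kept as queryable custom dimensions.
    "ai.user.authUserId": record.chatJid,
    "ai.user.id": record.chatJid,
    "piclaw.chat_jid": record.chatJid,
    "piclaw.actor.kind": "chat_jid",
    "piclaw.actor.id": record.chatJid,
    // Emitted in both standard and App Insights-style forms.
    "session.id": sessionId,
    "ai.session.id": sessionId,
    "piclaw.session.id": sessionId,
  };
  if (record.turnId) attrs["piclaw.turn_id"] = record.turnId;
  return attrs;
}
```

Keeping the derivation in a single function means a later custom exporter or processor would only need to change this one mapping.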

Backend-only interaction principle

Browser telemetry is intentionally absent. The web entry only registers the Settings pane. Front-end actions should be represented by backend log records and then mapped by this add-on into synthetic App Insights requests/events/spans. This keeps telemetry consistent across web, mobile, WhatsApp, scheduled tasks, and other channels.

Data sent

Log operation → Span / Metric mapping

Log operation	OTel Span	Graphite metric
`run_agent.prompt` → `run_agent.complete`	`agent.turn` (request-style span; paired by `turnId`, fallback `chatJid`)	`agent.turn.count`, `agent.turn.duration_ms`, `agent.turn.success`
`run_agent.prompt` → `run_agent` (error)	`agent.turn` (request-style span; ERROR + exception)	`agent.turn.count`, `agent.turn.error`
`run_agent.no_terminal_reply`	`agent.turn` (request-style span; ERROR)	`agent.turn.error`
`model.response.start/end`	`model.call` (dependency-style child span of `agent.turn`)	`model.call.count`, `model.call.duration_ms`
`run_agent.attempt_failed`	`provider.error` (exception)	`recovery.attempts`, `provider.error.<classifier>`
`tool.call.start/end`	`tool.call` (dependency-style child span of `agent.turn`)	`tool.<name>.count`, `tool.<name>.duration_ms`
`dream.complete`	`dream`	`dream.duration_ms`
`get_or_create.create_main_session`	—	`session.created`
`evict_idle.*`	—	`session.evicted`
Any warn/error with `operation`	`log.warn` / `log.error`	—
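
The Graphite column of the table above can be read as a lookup from log operation to metric paths. The following sketch encodes only the rows whose trigger is unambiguous; `metricsFor` and `MetricPoint` are hypothetical names, the `"unknown"` fallbacks are assumptions, and the error path of `run_agent` is omitted because this README does not say how an error record is distinguished.

```typescript
interface MetricPoint {
  path: string; // metric path below the "<prefix>.<instance>." root
  value: number;
}

// Map one log record to the Graphite metrics listed in the table.
function metricsFor(record: {
  operation: string;
  toolName?: string;
  durationMs?: number;
  classifier?: string;
}): MetricPoint[] {
  const points: MetricPoint[] = [];
  const count = (path: string): void => {
    points.push({ path, value: 1 });
  };
  const duration = (path: string): void => {
    const ms = record.durationMs;
    if (ms !== undefined) points.push({ path, value: ms });
  };

  switch (record.operation) {
    case "run_agent.complete":
      count("agent.turn.count");
      duration("agent.turn.duration_ms");
      count("agent.turn.success");
      break;
    case "run_agent.no_terminal_reply":
      count("agent.turn.error");
      break;
    case "model.response.end":
      count("model.call.count");
      duration("model.call.duration_ms");
      break;
    case "tool.call.end": {
      const tool = record.toolName ?? "unknown"; // assumption: fallback name
      count(`tool.${tool}.count`);
      duration(`tool.${tool}.duration_ms`);
      break;
    }
    case "run_agent.attempt_failed":
      count("recovery.attempts");
      count(`provider.error.${record.classifier ?? "unknown"}`);
      break;
    case "dream.complete":
      duration("dream.duration_ms");
      break;
    case "get_or_create.create_main_session":
      count("session.created");
      break;
    default:
      if (record.operation.startsWith("evict_idle.")) count("session.evicted");
  }
  return points;
}
```

Operations that have no Graphite column entry, such as plain warn/error records, return an empty list.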

Backend-synthesized interaction events planned next

These interactions should be emitted by the backend as structured log records and then mapped here into App Insights request/event-style spans:

Interaction	Backend source	Suggested App Insights item	Identity/session
User sends a message	`handle_agent_message` accepted payload	`agent.message.sent`	`chatJid`, `sessionLeafId` when known
Message queued as follow-up	queue/follow-up backend path	`agent.followup.queued`	`chatJid`, active `turnId` when known
Queued follow-up consumed	follow-up materialization path	`agent.followup.consumed`	`chatJid`, next `turnId`
Queued follow-up removed	queue remove backend handler	`agent.followup.removed`	`chatJid`
Steering message queued	steer backend path	`agent.steer.queued`	`chatJid`, active `turnId` when known
Model changed	backend model command path	`agent.model.changed`	`chatJid`
UI command handled	backend command handlers	`agent.ui.command`	`chatJid`

Span schemas

agent.turn (successful)

json

{
  "name": "agent.turn",
  "kind": "SERVER",
  "status": { "code": "OK" },
  "duration": "4523ms",
  "attributes": {
    "piclaw.chat_jid": "web:default:branch:0f3858079ad7",
    "piclaw.actor.kind": "chat_jid",
    "piclaw.actor.id": "web:default:branch:0f3858079ad7",
    "enduser.id": "web:default:branch:0f3858079ad7",
    "enduser.pseudo.id": "web:default:branch:0f3858079ad7",
    "session.id": "session-leaf-123",
    "ai.session.id": "session-leaf-123",
    "piclaw.instance": "smith",
    "piclaw.model": "azure-openai/gpt-5-4",
    "piclaw.turn.status": "success",
    "piclaw.turn.duration_ms": 4523,
    "piclaw.turn.output_chars": 1280
  }
}

agent.turn (error)

json

{
  "name": "agent.turn",
  "status": { "code": "ERROR", "message": "Prompt completed without emitting an assistant reply..." },
  "duration": "8912ms",
  "attributes": {
    "piclaw.chat_jid": "web:default:branch:0f3858079ad7",
    "enduser.id": "web:default:branch:0f3858079ad7",
    "enduser.pseudo.id": "web:default:branch:0f3858079ad7",
    "session.id": "session-leaf-123",
    "piclaw.instance": "smith",
    "piclaw.model": "azure-openai/gpt-5-4",
    "piclaw.turn.status": "error"
  },
  "events": [
    {
      "name": "exception",
      "attributes": {
        "exception.type": "Error",
        "exception.message": "Prompt completed without emitting an assistant reply before finalization..."
      }
    }
  ]
}

model.call

json

{
  "name": "model.call",
  "kind": "CLIENT",
  "status": { "code": "OK" },
  "duration": "1280ms",
  "attributes": {
    "piclaw.chat_jid": "web:default",
    "piclaw.turn_id": "turn_abcd1234",
    "piclaw.model": "azure-openai/gpt-5-4",
    "piclaw.model.sequence": 2,
    "piclaw.model.stop_reason": "toolUse",
    "piclaw.model.duration_ms": 1280
  }
}

tool.call

json

{
  "name": "tool.call",
  "kind": "CLIENT",
  "status": { "code": "OK" },
  "duration": "320ms",
  "attributes": {
    "piclaw.chat_jid": "web:default",
    "piclaw.instance": "smith",
    "piclaw.tool.name": "bash",
    "piclaw.tool.duration_ms": 320
  }
}

provider.error

json

{
  "name": "provider.error",
  "status": { "code": "ERROR", "message": "429 Too Many Requests" },
  "attributes": {
    "piclaw.chat_jid": "web:default",
    "piclaw.instance": "relay",
    "piclaw.error.classifier": "rate_limit"
  },
  "events": [
    { "name": "exception", "attributes": { "exception.message": "429 Too Many Requests" } }
  ]
}

Graphite metric paths

text

# Agent turns
piclaw.smith.agent.turn.count 1 1745828400
piclaw.smith.agent.turn.duration_ms 4523 1745828400
piclaw.smith.agent.turn.success 1 1745828400

# Tool calls
piclaw.smith.tool.bash.count 1 1745828400
piclaw.smith.tool.bash.duration_ms 320 1745828400

# Recovery
piclaw.smith.recovery.attempts 2 1745828400
piclaw.smith.provider.error.rate_limit 1 1745828400

# Session lifecycle
piclaw.smith.session.created 1 1745828400
piclaw.smith.session.evicted 1 1745828400

# Dream
piclaw.smith.dream.duration_ms 45000 1745828400

Queryable as:

text

piclaw.*.agent.turn.error          # errors across all instances
piclaw.smith.tool.*.duration_ms    # all tool durations on smith
piclaw.relay.provider.error.*      # all provider errors on relay
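
Each metric line shown earlier follows the Carbon plaintext format: metric path, value, and Unix timestamp in seconds, separated by spaces and terminated by a newline. A minimal sketch of building such a line is given below. `carbonLine` is a hypothetical helper, and the sanitisation rule (replacing anything outside `A-Za-z0-9_-` inside a path segment with `_`) is an assumption made so that names such as dotted hostnames cannot introduce extra path levels.

```typescript
// Format one Carbon plaintext line: "<path> <value> <unix-seconds>\n".
function carbonLine(
  prefix: string,
  instance: string,
  path: string,
  value: number,
  unixSeconds: number,
): string {
  if (!Number.isFinite(value)) {
    throw new RangeError("metric value must be a finite number");
  }
  // Assumption: characters outside this set are replaced with "_".
  const clean = (segment: string): string =>
    segment.replace(/[^A-Za-z0-9_-]/g, "_");
  const segments = [
    ...prefix.split(".").map(clean), // the prefix may itself be dotted
    clean(instance), // a dotted instance name stays a single segment
    ...path.split(".").map(clean),
  ];
  return `${segments.join(".")} ${value} ${Math.floor(unixSeconds)}\n`;
}
```

For example, `carbonLine("piclaw", "smith", "agent.turn.duration_ms", 4523, 1745828400)` produces the `piclaw.smith.agent.turn.duration_ms 4523 1745828400` line listed above, followed by a newline.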

Azure Application Insights views

Feature	What it shows
Application Map	All piclaw instances with health and dependency links
Failures blade	Errors grouped by `cloud_RoleInstance`: smith 2, relay 5, orangepi 1
Transaction Search	Individual turn traces with `model.call` and `tool.call` child spans
Live Metrics Stream	`agent.turn` spans surface as Incoming Requests; `model.call` and `tool.call` spans surface as outgoing dependency metrics
Users / Sessions	Backend-derived actor/session fields: `chatJid` maps to App Insights user fields; `sessionLeafId` maps to queryable session dimensions

Important: the addon deliberately assigns each span to an App Insights telemetry class:

agent.turn → request-style span (for Incoming Requests / request rate / request duration)

model.call and tool.call → dependency-style spans (for outgoing dependency metrics)

provider.error, log.error, and failed spans → exceptions / failures

Piclaw also stamps a synthetic result code onto spans so resultCode is no longer NaN in App Insights for custom telemetry: 200=info/success, 300=warn, 400=error.
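
The classification and result-code rules above are small enough to express as two lookups. This is a sketch only: `telemetryClass`, `syntheticResultCode`, and the `Severity` type are hypothetical names, and span names not listed in this section return `undefined` rather than guessing a class.

```typescript
type Severity = "info" | "warn" | "error";
type TelemetryClass = "request" | "dependency" | "exception";

// Span name to App Insights telemetry class, as listed above.
function telemetryClass(spanName: string): TelemetryClass | undefined {
  switch (spanName) {
    case "agent.turn":
      return "request";
    case "model.call":
    case "tool.call":
      return "dependency";
    case "provider.error":
    case "log.error":
      return "exception";
    default:
      return undefined; // not covered by this section
  }
}

// Synthetic result code: 200 = info/success, 300 = warn, 400 = error.
function syntheticResultCode(severity: Severity): 200 | 300 | 400 {
  switch (severity) {
    case "info":
      return 200;
    case "warn":
      return 300;
    case "error":
      return 400;
  }
}
```

A numeric code on every span lets the `resultCode` column be filtered and aggregated in the same way as for ordinary HTTP requests.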

Kusto queries

Use these in Azure Application Insights → Logs.

The piclaw repo also includes companion artifacts:

docs/azure/app-insights-agent-kusto-queries.md
docs/azure/app-insights-agent-observability-workbook-template.json

1) Everything recent for piclaw instances

kusto

union withsource=table requests, dependencies, traces, exceptions
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(30m)
| where cloud_RoleName == "piclaw" or isnotempty(piclaw_instance)
| extend item_name = coalesce(name, operation_Name, message, outerMessage)
| project timestamp, table, piclaw_instance, item_name, success, resultCode, severityLevel, operation_Id
| order by timestamp desc

2) Piclaw custom spans (`agent.turn`, `model.call`, `tool.call`, `provider.error`, `dream`, `log.*`)

kusto

union withsource=table requests, dependencies, traces, exceptions
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| extend span_name = coalesce(name, operation_Name, message, outerMessage)
| where timestamp > ago(6h)
| where span_name in ("agent.turn", "model.call", "tool.call", "provider.error", "dream", "log.error", "log.warn")
| project timestamp,
          table,
          piclaw_instance,
          span_name,
          success,
          duration,
          operation_Id,
          chat_jid = tostring(customDimensions["piclaw.chat_jid"]),
          model = tostring(customDimensions["piclaw.model"]),
          tool_name = tostring(customDimensions["piclaw.tool.name"]),
          turn_status = tostring(customDimensions["piclaw.turn.status"]),
          classifier = tostring(customDimensions["piclaw.error.classifier"])
| order by timestamp desc

3) Backend-derived users and sessions by chat JID

kusto

union withsource=table requests, dependencies, traces, exceptions
| where timestamp > ago(24h)
| extend chat_jid = coalesce(user_AuthenticatedId, tostring(customDimensions["piclaw.chat_jid"]))
| extend session_id = coalesce(session_Id, tostring(customDimensions["ai.session.id"]), tostring(customDimensions["session.id"]), tostring(customDimensions["piclaw.session.id"]))
| where isnotempty(chat_jid)
| summarize items = count(), sessions = dcount(session_id), failures = countif(success == false or severityLevel >= 3) by chat_jid
| order by items desc

4) Agent-turn throughput and latency by instance

kusto

requests
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(24h)
| where name == "agent.turn"
| extend duration_ms = todouble(duration)  // requests.duration is already a number of milliseconds
| summarize turns = count(),
            errors = countif(success == false or tostring(customDimensions["piclaw.turn.status"]) == "error"),
            p50_ms = percentile(duration_ms, 50),
            p95_ms = percentile(duration_ms, 95),
            p99_ms = percentile(duration_ms, 99)
  by piclaw_instance
| order by turns desc

5) Tool-call latency by tool name

kusto

dependencies
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(24h)
| where name == "tool.call"
| extend duration_ms = todouble(duration)  // dependencies.duration is already a number of milliseconds
| summarize calls = count(),
            errors = countif(success == false),
            p50_ms = percentile(duration_ms, 50),
            p95_ms = percentile(duration_ms, 95)
  by piclaw_instance, tool_name = tostring(customDimensions["piclaw.tool.name"])
| order by calls desc

6) Models by instance

kusto

requests
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(24h)
| where name == "agent.turn"
| extend model = tostring(customDimensions["piclaw.model"])
| where isnotempty(model)
| extend duration_ms = todouble(duration)  // requests.duration is already a number of milliseconds
| summarize turns = count(),
            errors = countif(success == false or tostring(customDimensions["piclaw.turn.status"]) == "error"),
            total_duration_ms = sum(duration_ms),
            p50_ms = percentile(duration_ms, 50),
            p95_ms = percentile(duration_ms, 95)
  by piclaw_instance, model
| order by turns desc

7) Providers / provider-error classifiers

kusto

union withsource=table dependencies, traces, exceptions
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| extend span_name = coalesce(name, operation_Name, message, outerMessage)
| extend provider = tostring(customDimensions["piclaw.provider"])
| extend classifier = tostring(customDimensions["piclaw.error.classifier"])
| where timestamp > ago(24h)
| where span_name == "provider.error" or isnotempty(provider) or isnotempty(classifier)
| summarize events = count(),
            failures = countif(success == false or severityLevel >= 3)
  by piclaw_instance, provider, classifier, span_name, table
| order by events desc

8) Provider/runtime failures

kusto

union withsource=table dependencies, traces, exceptions
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| extend span_name = coalesce(name, operation_Name, message, outerMessage)
| where timestamp > ago(24h)
| where span_name in ("provider.error", "log.error", "log.warn")
   or success == false
   or severityLevel >= 3
| project timestamp,
          table,
          piclaw_instance,
          span_name,
          severityLevel,
          success,
          operation_Id,
          classifier = tostring(customDimensions["piclaw.error.classifier"]),
          provider = tostring(customDimensions["piclaw.provider"]),
          model = tostring(customDimensions["piclaw.model"]),
          message,
          outerMessage,
          problemId,
          type
| order by timestamp desc

9) Token counters on `model.call` dependency spans

kusto

dependencies
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(24h)
| where name == "model.call"
| extend model = tostring(customDimensions["piclaw.model"])
| extend input_tokens = todouble(customDimensions["piclaw.model.input_tokens"])
| extend output_tokens = todouble(customDimensions["piclaw.model.output_tokens"])
| extend cache_read_tokens = todouble(customDimensions["piclaw.model.cache_read_tokens"])
| extend cache_write_tokens = todouble(customDimensions["piclaw.model.cache_write_tokens"])
| extend total_tokens = todouble(customDimensions["piclaw.model.total_tokens"])
| where isnotnull(input_tokens)
   or isnotnull(output_tokens)
   or isnotnull(cache_read_tokens)
   or isnotnull(cache_write_tokens)
   or isnotnull(total_tokens)
| summarize model_calls = count(),
            input_tokens = sum(input_tokens),
            output_tokens = sum(output_tokens),
            cache_read_tokens = sum(cache_read_tokens),
            cache_write_tokens = sum(cache_write_tokens),
            total_tokens = sum(total_tokens)
  by piclaw_instance, model
| order by total_tokens desc

10) One-instance drill-down (`smith`)

kusto

union withsource=table requests, dependencies, traces, exceptions
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| where timestamp > ago(2h)
| where piclaw_instance == "smith"
| extend item_name = coalesce(name, operation_Name, message, outerMessage)
| project timestamp, table, item_name, success, duration, severityLevel, operation_Id
| order by timestamp desc

11) If Live Metrics only shows requests, confirm the exporter is still sending custom telemetry

kusto

union withsource=table requests, dependencies, traces
| extend piclaw_instance = coalesce(tostring(customDimensions["piclaw.instance"]), cloud_RoleInstance)
| extend item_name = coalesce(name, operation_Name, message)
| where timestamp > ago(15m)
| where item_name in ("agent.turn", "model.call", "tool.call", "provider.error", "dream", "log.error", "log.warn")
| summarize count() by table, item_name, piclaw_instance
| order by count_ desc

Dependencies

@azure/monitor-opentelemetry ^1.16 — official Azure Monitor OTel distro (includes Live Metrics)
@opentelemetry/api ^1.9 — OTel trace + context API
