# DSL Debug API — MAL
> Status: **shipped**. Operator reference for the MAL slice of the DSL Debug
> API. Design: [SWIP-13](../../../swip/SWIP-13.md). Index of related pages:
> [DSL Debug API overview](dsl-debugging.md).
## What it captures
A MAL session attaches to one metric rule. Every scrape window that survives
the rule's file-level filter produces one **record** in the response;
within a record, each probe stage the expression executes appends one
**sample**. The wire shape is:
```text
nodes[]
  records[]
    startedAtMs   — record boundary timestamp (ms)
    dsl           — verbatim per-rule DSL text
    rule          — rule envelope:
      metricPrefix
      name        — per-rule name (no prefix)
      filter      — file-level filter closure body, if any
      exp         — `exp:` body verbatim
      expSuffix   — file-level expSuffix verbatim, if any
    samples[]
      type        — input | filter | function | output
      sourceText  — verbatim DSL fragment for this probe
      continueOn  — true (MAL captures kept-only; see overview)
      payload     — SampleFamily.toJson() at this probe stage
      sourceLine  — omitted for MAL (no per-line mapping)
```
Sample types and the probes that emit them:
| `type` | Probe | Fired when |
|------------|----------------------|----------------------------------------------------------------------------|
| `filter` | `captureFilter` | The file-level `filter:` closure runs over the input samples (kept-only). |
| `input` | `captureInput` | The metric reference at the head of the expression resolves a SampleFamily. |
| `function` | `captureStage` | An in-expression chain op runs (`sum`, `tagEqual`, `service`, etc.). |
| `function` | `captureDownsample` | A downsampling op runs (e.g. `rate("PT1M")`). |
| `output` | `captureMeterEmit` | The metric is emitted to the persistence pipeline (terminal). |
`sample.sourceText` is the **verbatim ANTLR slice** of the chain segment
from the original `exp:` body — operators can grep the captured text
against the source byte-for-byte. There is no leading `.` (the dot is
part of the chain context, not the `MethodCallContext` slice).
`sample.payload` is the structured `SampleFamily.toJson()` at that
probe stage — every sample's name, label set, value, and timestamp is
present; truncated at `maxSamplesPerCapture` (default 64) with a
`+N more` summary.
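Because `sourceText` is a verbatim slice, a plain substring search is enough to locate a captured fragment in the rule text. The following is a minimal Python sketch under that assumption. `locate_fragment` is a hypothetical helper name, not part of the API, and it searches only the `exp` and `expSuffix` fields of the record's `rule` envelope.

```python
from typing import Optional, Tuple


def locate_fragment(sample: dict, rule: dict) -> Optional[Tuple[str, int]]:
    """Return (field, offset) of the sample's sourceText inside the rule envelope.

    Hypothetical helper: searches `exp` first, then `expSuffix`, and returns
    None when the fragment appears in neither field.
    """
    fragment = sample.get("sourceText", "")
    if not fragment:
        return None
    for field in ("exp", "expSuffix"):
        body = rule.get(field) or ""
        offset = body.find(fragment)
        if offset >= 0:
            return field, offset
    return None


# Rule envelope values taken from the end-to-end example on this page.
rule = {
    "exp": "e2e_demo_request_count_total.sum(['service_name'])",
    "expSuffix": "service(['service_name'], Layer.GENERAL)",
}
print(locate_fragment({"type": "function", "sourceText": "sum(['service_name'])"}, rule))
```

A `None` result is not necessarily an error: samples whose `sourceText` is not an expression fragment (a metric name on an `output` sample, for instance) will not be found in either field.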
When no session is bound, each codegen-emitted probe call site is a single
volatile-boolean read that the JIT eliminates after warm-up, so the idle cost
is negligible.
## Enabling
The shared admin HTTP host (`admin-server`) is enabled by default; turn on the
DSL-debug feature on top of it:
```bash
SW_DSL_DEBUGGING=default
```
`injectionEnabled` is a **boot-time codegen switch**. It defaults to `true`
once the `dsl-debugging` module is enabled: the MAL generator then emits
per-rule `GateHolder` fields and probe call sites, so debug sessions actually
capture samples. Set it to `false` only if you want the REST surface but cannot
accept any codegen-side probe overhead. With `false`, the MAL bytecode is
byte-identical to a build without SWIP-13, and `POST /dsl-debugging/session`
returns `503 injection_disabled`. Flipping the flag requires an OAP restart:
```bash
SW_DSL_DEBUGGING_INJECTION_ENABLED=false # default is true; set false to disable probes
```
> SECURITY: capture payloads include MAL builder state and sample-family
> contents. Treat the admin port as authenticated infrastructure — see
> [Admin API readme — Security Notice](readme.md#-security-notice).
## Picking the rule key
A session targets one MAL metric rule. The key tuple is
`(catalog, name, ruleName)`:
| Field | Source |
|------------|----------------------------------------------------------------------------------------------|
| `catalog` | One of `otel-rules`, `log-mal-rules`, `telegraf-rules` — the directory the rule file lives in |
| `name` | The rule **file** name, without `.yaml` |
| `ruleName` | The full metric name (`metricPrefix` + `_` + per-rule `name`) |
Example — the shipped `otel-rules/vm.yaml` declares a metric prefix
`vm` and per-rule name `cpu_total_percentage`. The full metric name is
`vm_cpu_total_percentage`. The session install call:
```text
POST /dsl-debugging/session?catalog=otel-rules&name=vm&ruleName=vm_cpu_total_percentage
```
To list the metrics a runtime-rule MAL file exposes, query
`GET /runtime/rule/list` and pull the `ruleName`s associated with the
catalog/name pair (the runtime-rule receiver records every rule's metric
catalog).
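The key tuple can also be assembled programmatically. The sketch below is an illustration only: `full_rule_name` and `session_url` are hypothetical helper names, the base URL is whatever host and port the admin server listens on, and the optional `clientId` parameter is taken from the end-to-end example on this page.

```python
from typing import Optional
from urllib.parse import urlencode

# Catalog names as listed in the table above.
MAL_CATALOGS = {"otel-rules", "log-mal-rules", "telegraf-rules"}


def full_rule_name(metric_prefix: str, per_rule_name: str) -> str:
    """Join metricPrefix and the per-rule name with an underscore."""
    return f"{metric_prefix}_{per_rule_name}"


def session_url(base: str, catalog: str, name: str, rule_name: str,
                client_id: Optional[str] = None) -> str:
    """Build the session-install URL for one MAL rule (hypothetical helper)."""
    if catalog not in MAL_CATALOGS:
        raise ValueError(f"not a MAL catalog: {catalog}")
    params = {"catalog": catalog, "name": name, "ruleName": rule_name}
    if client_id:
        params["clientId"] = client_id
    return f"{base}/dsl-debugging/session?{urlencode(params)}"


print(session_url("http://OAP:17128", "otel-rules", "vm",
                  full_rule_name("vm", "cpu_total_percentage")))
```

Rejecting an unknown catalog on the client side mirrors the `400 invalid_catalog` response and saves a round trip.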
## End-to-end example
The example uses a runtime-rule-applied MAL rule with a top-level filter
clause so all probe stages (`filter` → `input` → `function` → `output`)
appear in the captures.
### 1. Apply the rule
```yaml
# /tmp/mal-with-filter.yaml
filter: "{ tags -> tags.service_name == 'my-svc' }"
metricPrefix: e2e_demo
expSuffix: service(['service_name'], Layer.GENERAL)
metricsRules:
  - name: filtered_requests
    exp: e2e_demo_request_count_total.sum(['service_name'])
```
```bash
curl -s -X POST -H 'Content-Type: text/plain' \
--data-binary '@/tmp/mal-with-filter.yaml' \
'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=mal-with-filter'
```
### 2. Open a debug session
```bash
curl -s -X POST \
'http://OAP:17128/dsl-debugging/session?catalog=otel-rules&name=mal-with-filter&ruleName=e2e_demo_filtered_requests&clientId=alice'
```
### 3. Drive ingest, then poll
```bash
curl -s 'http://OAP:17128/dsl-debugging/session/SESSION_ID'
```
A trimmed slice (one record = one scrape window):
```json
{
  "sessionId": "76b3266a-...",
  "capturedAt": 1777967923700,
  "ruleKey": { "catalog": "otel-rules", "name": "mal-with-filter",
               "ruleName": "e2e_demo_filtered_requests" },
  "nodes": [{
    "nodeId": "0.0.0.0_11800",
    "status": "ok",
    "records": [{
      "startedAtMs": 1777967921000,
      "dsl": "(e2e_demo_request_count_total.sum(['service_name'])).service(['service_name'], Layer.GENERAL)",
      "rule": {
        "metricPrefix": "e2e_demo",
        "name": "filtered_requests",
        "filter": "{ tags -> tags.service_name == 'my-svc' }",
        "exp": "e2e_demo_request_count_total.sum(['service_name'])",
        "expSuffix": "service(['service_name'], Layer.GENERAL)"
      },
      "samples": [
        { "type": "filter",
          "sourceText": "{ tags -> tags.service_name == 'my-svc' }",
          "continueOn": true,
          "payload": {
            "families": 1,
            "items": [ /* one entry per surviving SampleFamily — name, samples count, items[] */ ]
          } },
        { "type": "input",
          "sourceText": "e2e_demo_request_count_total",
          "continueOn": true,
          "payload": { /* head SampleFamily — name, samples, items[] */ } },
        { "type": "function",
          "sourceText": "sum(['service_name'])",
          "continueOn": true,
          "payload": { /* SampleFamily after sum */ } },
        { "type": "output",
          "sourceText": "e2e_demo_filtered_requests",
          "continueOn": true,
          "payload": {
            "metric": "e2e_demo_filtered_requests",
            "entity": "MeterEntity(scopeType=SERVICE, serviceName=my-svc, …)",
            "valueType": "sum",
            "timeBucket": 202605091036,
            "value": 42 /* shape depends on valueType:
                           number for Sum/Avg/Max/Min/CPM/Latest…,
                           object {bucket: count} for histograms /
                           *Labeled functions, omitted for non-scalar
                           holders. NaN/±Infinity render as strings. */
          } }
      ]
    }]
  }]
}
```
`sample.sourceText` is the verbatim ANTLR slice — match it against the
`exp:` body byte-for-byte. The record-level `rule` envelope echoes the
structured rule config so operators don't have to re-resolve the file.
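For a quick look at which probe stages fired in each scrape window, the response can be flattened into one line per record. This is a minimal sketch that assumes the response shape shown above. `summarize` is a hypothetical helper, and the sample response is a hand-trimmed copy of the example with payloads omitted.

```python
from typing import List


def summarize(response: dict) -> List[str]:
    """Return one line per record: node id, window start, and the probe chain."""
    lines = []
    for node in response.get("nodes", []):
        for record in node.get("records", []):
            stages = " -> ".join(
                f"{s['type']}:{s['sourceText']}" for s in record.get("samples", [])
            )
            lines.append(f"{node.get('nodeId')} @{record.get('startedAtMs')}: {stages}")
    return lines


# Hand-trimmed copy of the example response; payloads are left out.
response = {
    "nodes": [{
        "nodeId": "0.0.0.0_11800",
        "status": "ok",
        "records": [{
            "startedAtMs": 1777967921000,
            "samples": [
                {"type": "input", "sourceText": "e2e_demo_request_count_total"},
                {"type": "function", "sourceText": "sum(['service_name'])"},
                {"type": "output", "sourceText": "e2e_demo_filtered_requests"},
            ],
        }],
    }],
}
for line in summarize(response):
    print(line)
```

A record whose chain stops before `output` is a hint that a later stage did not run for that window.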
### 4. Stop
```bash
curl -s -X POST 'http://OAP:17128/dsl-debugging/session/SESSION_ID/stop'
```
## Cluster behaviour
- **Install** broadcasts to every reachable peer; each peer binds its own
recorder on its own holder so the slice reflects local L1 parsing.
- **Collect** broadcasts and concatenates per-node slices into `nodes[]`;
unreachable peers appear as `status: "unreachable"` rather than being
omitted.
- **Stop** broadcasts; a peer that misses the stop request still drops its
  session when the per-node retention timeout (default 5 minutes) expires.
No cross-node merge — each slice is self-contained.
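Since unreachable peers stay in `nodes[]`, a client can check coverage before trusting a capture. The sketch below groups node ids by `status`. `nodes_by_status` is a hypothetical helper, the second node id is made up for illustration, and only the `ok` and `unreachable` values are taken from this page.

```python
from collections import defaultdict
from typing import Dict, List


def nodes_by_status(response: dict) -> Dict[str, List[str]]:
    """Group node ids by their reported status (hypothetical helper)."""
    grouped = defaultdict(list)
    for node in response.get("nodes", []):
        # Nodes without a status field are bucketed as "unknown" (an assumption).
        grouped[node.get("status", "unknown")].append(node.get("nodeId"))
    return dict(grouped)


response = {
    "nodes": [
        {"nodeId": "0.0.0.0_11800", "status": "ok", "records": []},
        {"nodeId": "10.0.0.2_11800", "status": "unreachable"},  # made-up peer
    ]
}
status = nodes_by_status(response)
if status.get("unreachable"):
    print("partial capture, missing:", ", ".join(status["unreachable"]))
```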
## Failure modes
| Response | Meaning |
|---------------------------|------------------------------------------------------------------------------------|
| `400 invalid_catalog` | The wire `catalog` is not one of the MAL catalogs. |
| `400 missing_param` | `name` or `ruleName` is missing. |
| `404 rule_not_found` | No live MAL artifact for the tuple on this node — rule never loaded, was inactivated, or this node hasn't compiled it yet. |
| `503 injection_disabled` | `injectionEnabled=false`. Restart with the flag on to debug. |
| `500 registry_misconfigured` | A recorder factory wiring bug — file an issue. |
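A client script can turn these codes into a next step. The mapping below is a sketch: the error codes come from this page, while the action labels are this example's own reading of the "Meaning" column. How the code is carried in the HTTP response is not specified here, so the function takes the code string directly.

```python
# Error codes from this page; the action labels are illustrative only.
ACTIONS = {
    "invalid_catalog": "fix_request",
    "missing_param": "fix_request",
    "invalid_limits": "fix_request",
    "rule_not_found": "check_rule_then_retry",
    "injection_disabled": "restart_with_injection_enabled",
    "registry_misconfigured": "file_issue",
}


def next_step(error_code: str) -> str:
    """Map an error code to a suggested operator action (hypothetical helper)."""
    return ACTIONS.get(error_code, "unknown")


print(next_step("rule_not_found"))
```

`rule_not_found` is the only case treated as possibly transient, because the table notes that a node may not have compiled the rule yet.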
## Limits
| Field | Default | Hard cap | Purpose |
|------------------------|-------------|--------------|-------------------------------------------------------------------------------|
| `recordCap` | `100` | `100` | Max records before the recorder marks itself `captured` and refuses appends. |
| `retentionMillis` | `300000` (5m) | `3600000` (1h) | Wall-clock retention; the session is reaped after the deadline whether or not it was explicitly stopped. |
Out-of-range values return `400 invalid_limits` from `POST /dsl-debugging/session`.
Override per-session (within the caps above) in the request body:
```json
{ "recordCap": 50, "retentionMillis": 600000 }
```
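Checking overrides against the hard caps before sending avoids a `400 invalid_limits` round trip. The following is a minimal sketch. `limits_body` is a hypothetical helper, and the lower bound (values must be positive) is an assumption, since this page documents only the upper caps.

```python
import json

# Hard caps from the Limits table.
HARD_CAPS = {"recordCap": 100, "retentionMillis": 3_600_000}


def limits_body(**overrides: int) -> str:
    """Validate per-session overrides and return the JSON request body."""
    for field, value in overrides.items():
        if field not in HARD_CAPS:
            raise ValueError(f"unknown limit: {field}")
        # Assumption: values must be positive; only the upper cap is documented.
        if not 0 < value <= HARD_CAPS[field]:
            raise ValueError(f"{field}={value} exceeds range 1..{HARD_CAPS[field]}")
    return json.dumps(overrides)


print(limits_body(recordCap=50, retentionMillis=600000))
```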
## See also
- [DSL Debug API — OAL](dsl-debugging-oal.md)
- [DSL Debug API — LAL](dsl-debugging-lal.md)
- [Runtime Rule Hot-Update API](runtime-rule.md) — apply / inactivate / delete MAL rules.
- [SWIP-13](../../../swip/SWIP-13.md) — full design.