Status: shipped. Sample-based runtime debugger for the three DSLs that drive SkyWalking analysis: MAL (meter analysis language), LAL (log analysis language), and OAL (observability analysis language). Design: SWIP-13.
Each DSL has its own probe surface, payload shape, and rule-key conventions — pick the page that matches the rule you're debugging:
otel-rules, log-mal-rules, telegraf-rules. Each captured record is one SampleFamily walking through the rule end-to-end (filter → chain ops → meterEmit). Sample payloads carry the complete SampleFamily (every sample's name + labels + value + timestamp).catalog=oal. Each record is one ISource walking through (source entry → filter clauses → aggregation function → emit). Source samples carry the rich ServiceRelation-style payload (sourceServiceName, destServiceName, layers, latency, status, detectPoint, ...).catalog=lal. Each record is one log walking through (text → parser → extractor statements → sink). granularity=statement emits one sample per extractor statement; granularity=block (default) collapses extractor into one sample.MAL is the only DSL where rejected executions are dropped from records[]. The reason is tag-cardinality noise: MAL filters discriminate on tag values (service_name == 'foo', kind == 'http_server_request_duration') and in a multi-tenant or multi-component flow, a single rule may reject 99% of the traffic routed to it. If every rejection landed as a record, the default recordCap (32) would fill with noise within seconds and the capture would never reach a row demonstrating the rule's actual processing.
So for MAL the contract is: every record represents one SampleFamily that passed the rule's filter clause and walked through to meterEmit. The first sample is the input that the filter accepted; subsequent samples follow the chain ops; the last is the emit.
OAL captures both kept and rejected executions. OAL filters are deterministic discriminators (CLIENT-vs-SERVER, layer matchers, status / latency predicates), not high-cardinality tag noise — seeing the rejected-source samples (continueOn=false) is the filter doing its job in plain view, useful for verifying partition logic.
LAL has no analogue (abort() is a per-statement short-circuit, not a filter); aborted logs publish their accumulated samples with continueOn=false on the abort point so operators can see where the rule gave up.
For MAL operators debugging “why isn't my data being aggregated?”: the runtime-rule list endpoint surfaces the rule‘s filter source; combine with knowledge of the input tags to identify the mismatch. The debug API’s job is to show what the rule did with traffic it accepted.
What the three DSLs share:
/dsl-debugging/* on the admin-server HTTP host (default port 17128). The shape of session lifecycle (install, poll, stop) is identical across DSLs — only the rule-key tuple and payload schema differ.SW_DSL_DEBUGGING=default (admin-server itself is enabled by default). Probe insertion is gated by the boot-time switch SW_DSL_DEBUGGING_INJECTION_ENABLED (default true — once the module is enabled, probes fire and sessions record samples). Set this to false only if the REST surface is wanted but probe overhead is not; flipping requires an OAP restart.17129, owned by admin-server) — NOT the public agent / cluster gRPC port (core.gRPCPort, default 11800). Privileged admin RPCs stay off the agent network; operators bind 17129 to a private peer-to-peer interface only. Each peer's slice is self-contained — no L2 join.recordCap (default 1000, hard cap 10000) and retentionMillis (default 5 minutes, hard cap 1 hour). Requests outside these bounds return 400 invalid_limits. A missed stop self-cleans on retention timeout.sessionId on the receiving node, broadcasts install to all peers, and reports per-peer install state in peers[]. GET/stop on a different node aggregate the cluster slices for the same id. So an L7 LB in front of the admin port (default port 17128) can route any request to any OAP, and the cluster fan-out hides which node actually owns the rule.capturedAt (millis) stamped at GET time. Per-record stamps were redundant — probes fire at clock-tick speed and intra-slice ordering is preserved by array order.sourceText is the verbatim ANTLR slice: pulled via ctx.getStart().getInputStream().getText(Interval.of(start, stop)) at parse time, so whitespace and identifier spelling are byte-identical to the original .yaml / .oal. This applies to MAL chain segments, OAL filter clauses, OAL aggregation calls, and LAL extractor statements. Operators can grep the captured sourceText against the source file directly.Each DSL‘s page has its own walkthrough — start there for a copy-paste example targeting that DSL’s rules.
| Verb | Path | Purpose |
|---|---|---|
| POST | /dsl-debugging/session?catalog=&name=&ruleName=&clientId=&granularity= | Start a debug capture session. |
| GET | /dsl-debugging/session/{id} | Snapshot the captured records. |
| POST | /dsl-debugging/session/{id}/stop | Stop a session (idempotent). |
| GET | /dsl-debugging/sessions | List active sessions on this node. |
| GET | /dsl-debugging/status | Module posture (phase, injectionEnabled). |
granularity is LAL-only; MAL/OAL recorders ignore it. Body fields (recordCap, retentionMillis, granularity) are optional; defaults from the table above apply.
The DSL debug API also exposes a read-only listing of the OAL rules loaded by an OAP node, used by the debugger UI's rule picker:
| Verb | Path | Purpose |
|---|---|---|
| GET | /runtime/oal/files | List the OAL files this OAP loaded at boot, with their sources and registered rules. |
| GET | /runtime/oal/files/{name} | Raw OAL content (UTF-8 text, classpath-bound). |
| GET | /runtime/oal/rules/{source} | Rules registered against a single source (debug-key resolver). |
These do not require injectionEnabled=true — they are pure metadata.
Common across DSLs. Per-DSL pages list any DSL-specific failures.
| Response | Meaning |
|---|---|
400 invalid_catalog | Wire catalog doesn't match any known DSL. |
400 missing_param | name, ruleName, or clientId query param missing. clientId is required so the receiving node can clean up the prior session bound to it before installing a new one. |
400 invalid_limits | recordCap / retentionMillis outside the per-session bounds (see above). |
429 too_many_sessions | Active-session ceiling on the receiving node (200) is full. Stop a session or wait for retention. |
404 rule_not_found | No live DSL artifact for (catalog, name, ruleName) on any reachable OAP node — every node and peer reported NOT_LOCAL. The 404 body carries peers[] so operators can see which OAP rejected what. |
404 session_not_found | Session id unknown to every reachable OAP node — never created, retention timed out, or already stopped. |
503 injection_disabled | injectionEnabled=false; capture is permanently off until OAP restarts with the flag on. |
500 registry_misconfigured | A recorder factory wiring bug — file an issue. |
See Configuration vocabulary for the full table. The DSL-debug-relevant keys:
| Key | Default | Purpose |
|---|---|---|
SW_ADMIN_SERVER | default | Shared admin HTTP host. On by default; set empty to disable. |
SW_DSL_DEBUGGING | (empty) | Enables the DSL debug API. |
SW_DSL_DEBUGGING_INJECTION_ENABLED | true | Boot-time codegen switch. Set false to disable probes. |
SECURITY: capture payloads include sensitive operational state (raw log bodies, parsed maps, MAL builder state, OAL source events). Treat the admin port as authenticated infrastructure — see Admin API readme — Security Notice.