10.4.0

Project

  • Introduce OAL V2 engine:

    • Immutable AST models for thread safety and predictable behavior
    • Type-safe enums replacing string-based filter operators
    • Precise error location reporting with file, line, and column numbers
    • Clean separation between parsing and code generation phases
    • Enhanced testability with models that can be constructed without parsing
  • Introduce MAL/LAL/Hierarchy V2 engine — replace Groovy-based DSL runtime with ANTLR4 parser + Javassist bytecode generation:

    • Remove Groovy runtime dependency from OAP backend
    • Fail-fast compilation at startup — syntax and type errors are caught immediately instead of at first execution
    • Thread-safe generated classes with no ThreadLocal or shared mutable state
    • Immutable AST models for all three DSLs (MAL, LAL, Hierarchy rules)
    • Explicit context passing replaces Groovy binding/closure capture
    • v1 (Groovy) and v2 (ANTLR4+Javassist) cross-version checker validates behavioral equivalence across 1,290+ expressions
    • JMH benchmarks confirm v2 runtime speedups: MAL execute ~6.8x, LAL compile ~39x / execute ~2.8x, Hierarchy execute ~2.6x faster than Groovy v1
    • Generated class names follow {yamlFileName}_L{lineNo}_{ruleName} pattern for all DSLs (MAL/LAL/Hierarchy) for stack trace traceability
  • Breaking Change — LAL: remove slowSql {} and sampledTrace {} sub-DSLs from the grammar. These are replaced by the configurable outputType mechanism:

    • Set outputType at the rule level in YAML config to specify the output entity class. Use the short name registered by LALOutputBuilder SPI (e.g., outputType: SlowSQL, outputType: SampledTrace), or a fully qualified class name as fallback.
    • LALOutputBuilder implementations are discovered via ServiceLoader and expose a name() method for short name resolution. Built-in types: SlowSQL (DatabaseSlowStatementBuilder), SampledTrace (SampledTraceBuilder).
    • Output fields (e.g., id, statement, latency) are now regular field assignments in the extractor block, no longer wrapped in sub-DSL blocks.
    • Custom output fields are validated against the output type's setters at compile time.
    • An explicit sink {} block is now required for data to be persisted. Without sink {}, no data is saved — this applies to all LAL rules including those using outputType. In v1, slowSql {} and sampledTrace {} dispatched data as a side-effect inside the extractor; in v2, persistence is always handled by the sink pipeline.
    • Output type resolution order: per-rule YAML outputType (short name via SPI or FQCN) > LALSourceTypeProvider SPI default > Log.class.
    • All bundled LAL scripts (mysql-slowsql.yaml, pgsql-slowsql.yaml, redis-slowsql.yaml, envoy-als.yaml, k8s-service.yaml, mesh-dp.yaml) have been updated.
    • Users with custom LAL scripts using slowSql {} or sampledTrace {} must migrate to the new syntax. See LAL documentation.
    • Rename ExtractorSpec to MetricExtractor — now only handles LAL metrics {} blocks. Standard field setters (service, layer, timestamp, etc.) are compiled as direct setter calls on the output builder.
    • Add def local variable support in LAL extractor (and filter level). Supports toJson() and toJsonArray() built-in functions for converting strings, Maps, and protobuf Struct to Gson JSON objects. Variables support null-safe navigation (?.), method chaining with compile-time type inference, and explicit type cast via as (built-in types or fully qualified class names, e.g., def resp = parsed?.response as io.envoyproxy.envoy.data.accesslog.v3.HTTPResponseProperties).
    • Breaking Change: LALOutputBuilder.init() signature changed from init(LogData, NamingControl) to init(LogData, Optional<Object> extraLog, NamingControl). The extraLog parameter carries the typed input object (e.g., HTTPAccessLogEntry for envoy access logs) so that output builders can access protocol-specific fields. Custom LALOutputBuilder implementations must update their init() method signature.
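
The output type resolution order described above (per-rule YAML outputType as short name or FQCN, then the source-type default, then the plain log type) can be sketched as follows. This is a minimal illustration, not the OAP implementation: OutputBuilder, SlowSqlBuilder, DefaultLog, and the method names are hypothetical stand-ins, and rejecting an unknown outputType with an exception is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the LAL SPI types; names are for illustration only.
class OutputTypeResolutionSketch {
    interface OutputBuilder {
        String name(); // short name used in the YAML outputType field
    }

    static class SlowSqlBuilder implements OutputBuilder {
        @Override
        public String name() {
            return "SlowSQL";
        }
    }

    // Placeholder for the final fallback type.
    static class DefaultLog {
    }

    // Index builders by short name. In a real setup the Iterable could be
    // ServiceLoader.load(OutputBuilder.class), which also implements Iterable.
    static Map<String, Class<?>> registry(Iterable<? extends OutputBuilder> builders) {
        Map<String, Class<?>> byShortName = new HashMap<>();
        for (OutputBuilder builder : builders) {
            byShortName.put(builder.name(), builder.getClass());
        }
        return byShortName;
    }

    // Resolution order: YAML outputType (short name, then FQCN) > source-type default > DefaultLog.
    static Class<?> resolve(String yamlOutputType,
                            Map<String, Class<?>> byShortName,
                            Class<?> sourceTypeDefault) {
        if (yamlOutputType != null && !yamlOutputType.isEmpty()) {
            Class<?> registered = byShortName.get(yamlOutputType);
            if (registered != null) {
                return registered;
            }
            try {
                return Class.forName(yamlOutputType); // fully qualified class name fallback
            } catch (ClassNotFoundException e) {
                // Assumption: an unresolvable outputType is a configuration error.
                throw new IllegalArgumentException("Unknown outputType: " + yamlOutputType, e);
            }
        }
        return sourceTypeDefault != null ? sourceTypeDefault : DefaultLog.class;
    }
}
```
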
  • Fix E2E test metrics verification: fail the test if all metric values are null.

  • Support building, testing, and publishing with Java 25.

  • Add CLAUDE.md as AI assistant guide for the project.

  • Upgrade Byte Buddy to 1.18.7 and configure explicit -javaagent for Mockito/Byte Buddy in Surefire to avoid JDK 25 dynamic agent loading warnings.

  • Upgrade Groovy to 5.0.3 in OAP backend.

  • Bump up Node.js to v24.13.0 for compiling the latest UI (booster-ui).

  • Drop Elasticsearch 7.x (EOL) and OpenSearch 1.x from E2E tests, upgrade all ES tests to 8.18.8, and update skywalking-helm to use ECK 8.18.8.

  • Add library-batch-queue module — a partitioned, self-draining queue with type-based dispatch, adaptive partitioning, idle backoff, and throughput-weighted drain rebalancing (DrainBalancer). Designed to replace DataCarrier in high-fan-out scenarios.

  • Replace DataCarrier with BatchQueue for L1 metrics aggregation, L2 metrics persistence, TopN persistence, all three exporters (gRPC metrics, Kafka trace, Kafka log), and gRPC remote client. All metric types (OAL + MAL) now share unified queues instead of separate OAL/MAL pools. Each exporter keeps its own dedicated queue with 1 thread, preserving original buffer strategies. Thread count comparison on an 8-core machine (gRPC remote client excluded — unchanged 1 thread per peer):

    | Queue | Old threads | Old channels | Old buffer slots | New threads | New partitions | New buffer slots | New policy |
    |---|---|---|---|---|---|---|---|
    | L1 Aggregation (OAL) | 24 | ~1,240 | ~12.4M | 8 (unified) | ~330 adaptive | ~6.6M | cpuCores(1.0) |
    | L1 Aggregation (MAL) | 2 | ~100 | ~100K | (unified above) | | | |
    | L2 Persistence (OAL) | 2 | ~620 | ~1.24M | 3 (unified) | ~330 adaptive | ~660K | cpuCoresWithBase(1, 0.25) |
    | L2 Persistence (MAL) | 1 | ~100 | ~100K | (unified above) | | | |
    | TopN Persistence | 4 | 4 | 4K | 1 | 4 adaptive | 4K | fixed(1) |
    | Exporters (gRPC/Kafka) | 3 | 6 | 120K | 3 (1 per exporter) | | 60K | fixed(1) each |
    | Total | 36 | ~2,070 | ~13.9M | 15 | ~664 | ~7.3M | |
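
The policy names in the last column read as thread-sizing formulas. A minimal sketch of that arithmetic, assuming cpuCores(m) means m × cores and cpuCoresWithBase(b, m) means b + m × cores, rounded to the nearest integer. The helper names and the rounding rule are assumptions, not taken from library-batch-queue. Under these assumptions the sketch reproduces the 8 and 3 threads listed for an 8-core machine.

```java
// Hypothetical sizing helpers; the real policy classes in library-batch-queue may differ.
class ThreadPolicySketch {
    // cpuCores(multiplier): scale the thread count with the core count, at least 1.
    static int cpuCores(int cores, double multiplier) {
        return Math.max(1, (int) Math.round(cores * multiplier));
    }

    // cpuCoresWithBase(base, multiplier): a fixed base plus a fraction of the cores.
    static int cpuCoresWithBase(int cores, int base, double multiplier) {
        return Math.max(1, (int) Math.round(base + cores * multiplier));
    }

    public static void main(String[] args) {
        int cores = 8;
        System.out.println("L1 threads: " + cpuCores(cores, 1.0));             // 8
        System.out.println("L2 threads: " + cpuCoresWithBase(cores, 1, 0.25)); // 3
    }
}
```
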
  • Remove library-datacarrier-queue module. All usages have been replaced by library-batch-queue.

  • Enable throughput-weighted drain rebalancing for L1 aggregation and L2 persistence queues (10s interval). Periodically reassigns partitions across drain threads to equalize load when metric types have skewed throughput.
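
The entry above does not say how DrainBalancer picks the new assignment. One plausible approach is a greedy "heaviest first" assignment: sort partitions by observed throughput and hand each one to the currently least-loaded drain thread. The sketch below illustrates only that idea; the class and method names are hypothetical and the actual algorithm may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative greedy rebalancer; not the DrainBalancer implementation.
class DrainRebalanceSketch {
    /**
     * @param throughput items drained in the last interval, indexed by partition id
     * @param threads    number of drain threads
     * @return one list of partition ids per drain thread
     */
    static List<List<Integer>> rebalance(long[] throughput, int threads) {
        Integer[] order = new Integer[throughput.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Visit the heaviest partitions first.
        Arrays.sort(order, Comparator.comparingLong((Integer p) -> throughput[p]).reversed());

        List<List<Integer>> assignment = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            assignment.add(new ArrayList<>());
        }
        long[] load = new long[threads];

        for (int partition : order) {
            // Pick the thread with the smallest accumulated load so far.
            int target = 0;
            for (int t = 1; t < threads; t++) {
                if (load[t] < load[target]) {
                    target = t;
                }
            }
            assignment.get(target).add(partition);
            load[target] += throughput[partition];
        }
        return assignment;
    }
}
```

With skewed input such as {900, 100, 100, 100, 500, 200} and two threads, the hot partition ends up sharing a thread with only one light partition, while the remaining partitions are grouped on the other thread.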

  • Add benchmark framework under benchmarks/ with Kind-based Kubernetes environments, automated thread dump collection and analysis. First case: thread-analysis on istio-cluster_oap-banyandb environment.

  • Add virtual thread support (JDK 25+) for gRPC and Armeria HTTP server handler threads. Set SW_VIRTUAL_THREADS_ENABLED=false to disable.

    | Pool | Threads (JDK < 25) | Threads (JDK 25+) |
    |---|---|---|
    | gRPC server handler (core-grpc, receiver-grpc, als-grpc, ebpf-grpc) | Cached platform (unbounded) | Virtual threads |
    | HTTP blocking (core-http, receiver-http, promql-http, logql-http, zipkin-query-http, zipkin-http, firehose-http) | Cached platform (max 200) | Virtual threads |
    | VT carrier threads (ForkJoinPool) | N/A | ~9 shared |

    On JDK 25+, all 11 thread pools above share ~9 carrier threads instead of up to 1,400+ platform threads.
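
A minimal sketch of how a handler executor could be selected under the rule above: virtual threads on JDK 25+ unless SW_VIRTUAL_THREADS_ENABLED=false, otherwise a cached platform pool. The class and method names are hypothetical, and the way the switch value is parsed is an assumption. Executors.newVirtualThreadPerTaskExecutor() is looked up reflectively only so the sketch also compiles on older JDKs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; not the OAP server's actual wiring.
class HandlerExecutorSketch {
    // Virtual threads are used only on JDK 25+ and only if the switch is not "false".
    static boolean useVirtualThreads(int jdkFeatureVersion, String envValue) {
        boolean enabled = envValue == null || !"false".equalsIgnoreCase(envValue.trim());
        return enabled && jdkFeatureVersion >= 25;
    }

    static ExecutorService newHandlerExecutor(int jdkFeatureVersion, String envValue) {
        if (useVirtualThreads(jdkFeatureVersion, envValue)) {
            try {
                // Reflective lookup keeps this sketch compilable on JDKs without virtual threads.
                return (ExecutorService) Executors.class
                    .getMethod("newVirtualThreadPerTaskExecutor")
                    .invoke(null);
            } catch (ReflectiveOperationException e) {
                // Fall through to platform threads if the method is unavailable.
            }
        }
        return Executors.newCachedThreadPool();
    }

    public static void main(String[] args) throws Exception {
        int feature = Runtime.version().feature(); // requires JDK 10+
        ExecutorService executor =
            newHandlerExecutor(feature, System.getenv("SW_VIRTUAL_THREADS_ENABLED"));
        executor.submit(() -> System.out.println("handled on " + Thread.currentThread())).get();
        executor.shutdown();
    }
}
```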

  • Change default Docker base image to JDK 25 (eclipse-temurin:25-jre). JDK 11 kept as -java11 variant.

  • Thread count benchmark comparison — 2-node OAP cluster on JDK 25 with BanyanDB, Istio bookinfo traffic (10-core machine, JVM-internal threads excluded):

    | Pool | v10.3.0 threads | v10.4.0 threads | Notes |
    |---|---|---|---|
    | L1 Aggregation (OAL + MAL) | 26 (DataCarrier) | 10 (BatchQueue) | Unified OAL + MAL |
    | L2 Persistence (OAL + MAL) | 3 (DataCarrier) | 4 (BatchQueue) | Unified OAL + MAL |
    | TopN Persistence | 4 (DataCarrier) | 1 (BatchQueue) | |
    | gRPC Remote Client | 1 (DataCarrier) | 1 (BatchQueue) | Per peer |
    | Armeria HTTP event loop | 20 | 5 | min(5, cores) shared group |
    | Armeria HTTP handler | on-demand platform (increasing with payload) | - | Virtual threads on JDK 25+ |
    | gRPC event loop | 10 | 10 | Unchanged |
    | gRPC handler | on-demand platform (increasing with payload) | - | Virtual threads on JDK 25+ |
    | ForkJoinPool (Virtual Thread carrier) | 0 | ~10 | JDK 25+ virtual thread scheduler |
    | HttpClient-SelectorManager | 4 | 2 | SharedKubernetesClient |
    | Schedulers + others | ~24 | ~24 | Mostly unchanged |
    | Total (OAP threads) | 150+ | ~72 | ~50% reduction, stable in high payload. |
  • Replace PowerMock Whitebox with standard Java Reflection in server-library, server-core, and server-configuration to support JDK 25+.

  • Fix: /debugging/config/dump could leak sensitive information when the configuration contains second-level properties.

OAP Server

  • KubernetesCoordinator: make self instance return real pod IP address instead of 127.0.0.1.

  • Fix KubernetesCoordinator self-endpoint race condition: include self in the endpoint list so DynamicEndpointGroup re-fires the listener when the self pod appears in the informer after initial sync.

  • Enhance the alarm kernel with recovered status notification capability.

  • Fix BrowserWebVitalsPerfData: rename clsTime to cls and change its type to double.

  • Init log-mal-rules at module provider start stage to avoid re-init for every LAL.

  • Fail fast if SampleFamily is empty after MAL filter expression.

  • Fix range matrix and scalar binary operation in PromQL.

  • Add LatestLabeledFunction for meter.

  • MAL Labeled metrics support additional attributes.

  • Bump up netty to 4.2.9.Final.

  • Add support for OpenSearch/ElasticSearch client certificate authentication.

  • Fix BanyanDB logs paging query.

  • Replace BanyanDB Java client with native implementation.

  • Remove bydb.dependencies.properties and set the compatible BanyanDB API version number in ${SW_STORAGE_BANYANDB_COMPATIBLE_SERVER_API_VERSIONS}.

  • Fix trace profiling query time range condition.

  • Add named ThreadFactory to all Executors.newXxx() calls to replace anonymous pool-N-thread-M thread names with meaningful names for easier thread dump analysis. Complete OAP server thread inventory (counts on an 8-core machine, exporters and JDBC are optional):

    | Catalog | Thread Name | Count | Policy | Partitions |
    |---|---|---|---|---|
    | Data Pipeline | BatchQueue-METRICS_L1_AGGREGATION-N | 8 | cpuCores(1.0) | ~330 adaptive |
    | Data Pipeline | BatchQueue-METRICS_L2_PERSISTENCE-N | 3 | cpuCoresWithBase(1, 0.25) | ~330 adaptive |
    | Data Pipeline | BatchQueue-TOPN_PERSISTENCE-N | 1 | fixed(1) | ~4 adaptive |
    | Data Pipeline | BatchQueue-GRPC_REMOTE_{host}_{port}-N | 1 per peer | fixed(1) | fixed(1) |
    | Data Pipeline | BatchQueue-EXPORTER_GRPC_METRICS-N | 1 | fixed(1) | fixed(1) |
    | Data Pipeline | BatchQueue-EXPORTER_KAFKA_TRACE-N | 1 | fixed(1) | fixed(1) |
    | Data Pipeline | BatchQueue-EXPORTER_KAFKA_LOG-N | 1 | fixed(1) | fixed(1) |
    | Data Pipeline | BatchQueue-JDBC_ASYNC_BATCH_PERSISTENT-N | 4 (configurable) | fixed(N) | fixed(N) |
    | Scheduler | RemoteClientManager | 1 | scheduled | |
    | Scheduler | PersistenceTimer | 1 | scheduled | |
    | Scheduler | PersistenceTimer-prepare-N | 2 (configurable) | fixed pool | |
    | Scheduler | DataTTLKeeper | 1 | scheduled | |
    | Scheduler | CacheUpdateTimer | 1 | scheduled | |
    | Scheduler | HierarchyAutoMatching | 1 | scheduled | |
    | Scheduler | WatermarkWatcher | 1 | scheduled | |
    | Scheduler | AlarmCore | 1 | scheduled | |
    | Scheduler | HealthChecker | 1 | scheduled | |
    | Scheduler | EndpointUriRecognition | 1 (conditional) | scheduled | |
    | Scheduler | FileChangeMonitor | 1 | scheduled | |
    | Scheduler | BanyanDB-ChannelManager | 1 | scheduled | |
    | Scheduler | GRPCClient-HealthCheck-{host}:{port} | 1 per client | scheduled | |
    | Scheduler | EBPFProfiling-N | configurable | fixed pool | |
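
For readers unfamiliar with the technique, a named ThreadFactory passed to an Executors.newXxx() call might look like the sketch below. The class name, the daemon flag, and numbering from 0 are assumptions for illustration; only the resulting thread-name style (for example PersistenceTimer or PersistenceTimer-prepare-N) comes from the inventory above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory; the OAP server's own helper may be shaped differently.
class NamedThreadFactorySketch implements ThreadFactory {
    private final String prefix;
    private final boolean numbered;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactorySketch(String prefix, boolean numbered) {
        this.prefix = prefix;
        this.numbered = numbered;
    }

    @Override
    public Thread newThread(Runnable task) {
        // Single-thread pools get the bare prefix, multi-thread pools get a "-N" suffix.
        String name = numbered ? prefix + "-" + counter.getAndIncrement() : prefix;
        Thread thread = new Thread(task, name);
        thread.setDaemon(true); // assumption: background threads should not block JVM exit
        return thread;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(
            new NamedThreadFactorySketch("PersistenceTimer", false));
        // Prints the worker's thread name instead of an anonymous pool-N-thread-M name.
        timer.execute(() -> System.out.println(Thread.currentThread().getName()));
        timer.shutdown();
        timer.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```
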
  • Fix BanyanDB time range overflow in profile thread snapshot query.

  • BrowserErrorLog: the OAP server now generates a UUID to replace the original client-side ID, because browser scripts can't guarantee that generated IDs are globally unique.

  • MQE: fix multiple labeled metric query and ensure no results are returned if no label value combinations match.

  • Fix BrowserErrorLog BanyanDB storage query order.

  • BanyanDB Client: Property query supports Order By.

  • MQE: trim the label values condition for the labeled metrics query to enhance the readability.

  • PromQL service: fix time parse issue when using RFC3339 time format for querying.

  • Envoy metrics service receiver: support adapter listener metrics.

  • Envoy metrics service receiver: support configuring MAL rule files.

  • Fix HttpAlarmCallback creating a new HttpClient on every alarm post() call, leaking NIO selector threads. Replace with a shared static singleton.
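
The shared-client pattern can be sketched as below, assuming the client in question is the JDK's java.net.http.HttpClient (each instance of which starts its own selector thread). The holder class name and the timeout value are hypothetical.

```java
import java.net.http.HttpClient;
import java.time.Duration;

// Hypothetical holder class illustrating a shared static client.
class SharedHttpClientSketch {
    // Built once per JVM: every caller reuses the same selector thread and connection pool.
    private static final HttpClient SHARED = HttpClient.newBuilder()
        .connectTimeout(Duration.ofSeconds(5)) // illustrative value, not from the source
        .build();

    static HttpClient client() {
        return SHARED;
    }
}
```

Callers then send requests through SharedHttpClientSketch.client() instead of building a new client for each post, so the number of selector threads stays constant regardless of how many alarms fire.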

  • Add SharedKubernetesClient singleton in library-kubernetes-support to replace 9 separate KubernetesClientBuilder().build() calls across 7 files. Fixes KubernetesCoordinator client leak (never closed, NIO selector thread persisted). Uses KubernetesHttpClientFactory with virtual threads on JDK 25+ or a single fixed executor thread on JDK <25.

  • Reduce Armeria HTTP server event loop threads. All 7 HTTP servers now share one event loop group instead of each creating their own (Armeria default: cores * 2 per server = 140 on 10-core). Event loop: min(5, cores) shared — non-blocking I/O multiplexing needs few threads. Blocking executor: JDK 25+ uses virtual threads; JDK <25 keeps Armeria's default cached pool (up to 200 on-demand threads) because HTTP handlers block on long storage/DB queries.

  • Add the spring-ai components and the GenAI layer.

  • Bump up netty to 4.2.10.Final.

  • Bump up log4j to 2.25.3 and jackson to 2.18.5.

  • Remove PowerMock dependency. Replace Whitebox with ReflectUtil (standard Java reflection + sun.misc.Unsafe for final fields) across all modules to support JDK 25+.
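
A minimal sketch of reading and writing a private field with standard Java reflection, the kind of operation Whitebox was typically used for in tests. The class and method names here are hypothetical and are not the ReflectUtil API. Final fields are deliberately left out, since the entry above notes they need separate handling.

```java
import java.lang.reflect.Field;

// Hypothetical test helper; not the project's ReflectUtil.
class ReflectSketch {
    static void setField(Object target, String name, Object value) {
        try {
            Field field = findField(target.getClass(), name);
            field.setAccessible(true);
            field.set(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot set field " + name, e);
        }
    }

    static Object getField(Object target, String name) {
        try {
            Field field = findField(target.getClass(), name);
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + name, e);
        }
    }

    // Walk up the class hierarchy so inherited private fields are found too.
    private static Field findField(Class<?> type, String name) throws NoSuchFieldException {
        for (Class<?> current = type; current != null; current = current.getSuperclass()) {
            try {
                return current.getDeclaredField(name);
            } catch (NoSuchFieldException ignored) {
                // Keep searching in the superclass.
            }
        }
        throw new NoSuchFieldException(name);
    }

    // Sample class with a private, non-final field.
    static class Sample {
        private int retries = 3;
    }
}
```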

  • Support TraceQL and Tempo API for Zipkin and SkyWalking native trace query.

  • Remove initExp from MAL configuration. It was an internal Groovy startup validation mechanism, not an end-user feature. The v2 ANTLR4 compiler performs fail-fast validation at startup natively.

  • Update hierarchy rule documentation: auto-matching-rules in hierarchy-definition.yml no longer use Groovy scripts. Rules now use a dedicated expression grammar supporting property access, String methods, if/else, comparisons, and logical operators. All shipped rules are fully compatible.

  • Activate otlp-traces handler in receiver-otel by default.

  • Update Istio E2E test versions: remove EOL 1.20.0, add 1.25.0–1.29.0 for ALS/Metrics/Ambient tests. Update Rover with Istio Process test from 1.15.0 to 1.28.0 with Kubernetes 1.28.

  • Support Virtual-GenAI monitoring.

  • Fix on-demand pod log parsing failure by replacing invalid DateTimeFormatter pattern with ISO_OFFSET_DATE_TIME.
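
For illustration, parsing an offset timestamp with DateTimeFormatter.ISO_OFFSET_DATE_TIME looks like the sketch below. It assumes the pod log lines carry RFC 3339 style timestamps with an optional fractional second and a "Z" or numeric offset; the class and method names are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing the built-in formatter in use.
class PodLogTimestampSketch {
    static long toEpochMillis(String timestamp) {
        // ISO_OFFSET_DATE_TIME accepts 0 to 9 fractional digits and "Z" or "+hh:mm" offsets.
        return OffsetDateTime.parse(timestamp, DateTimeFormatter.ISO_OFFSET_DATE_TIME)
            .toInstant()
            .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1970-01-01T00:00:01.500Z")); // 1500
    }
}
```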

  • Fix Zipkin receiver compatibility with application/x-protobuf Content-Type.

  • Support Envoy AI Gateway observability (SWIP-10): new ENVOY_AI_GATEWAY layer with MAL/LAL rules for GenAI metrics (token usage, latency, TTFT, TPOT) and access log sampling via OTLP.

  • OTel metric receiver: convert data point attribute dots to underscores (consistent with resource attributes and metric names). Label mappings are now fallback-only — explicit job_name in resource attributes takes precedence over the service.name fallback.
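
A rough sketch of the two behaviors described above, under one reading of the entry: attribute keys have dots replaced by underscores, and a job name derived from service.name is used only when no explicit job_name is present. The class and method names are hypothetical and the receiver's real logic may be more involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the OTel receiver's code.
class OtelLabelSketch {
    // Convert attribute keys such as "http.request.method" to "http_request_method".
    static Map<String, String> normalize(Map<String, String> attributes) {
        Map<String, String> labels = new LinkedHashMap<>();
        attributes.forEach((key, value) -> labels.put(key.replace('.', '_'), value));
        return labels;
    }

    // An explicit job_name wins; service.name is only a fallback.
    static String jobName(Map<String, String> resourceAttributes) {
        String explicit = resourceAttributes.get("job_name");
        return explicit != null ? explicit : resourceAttributes.get("service.name");
    }
}
```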

  • OTel log handler: prefer service.instance.id (OTel spec) over service.instance with fallback.

  • Add SampleFamily.debugDump() for MAL debugging.

  • Support virtual GenAI analysis for otlp and zipkin traces.

UI

  • Fix the missing icon in new native trace view.
  • Enhance the alert page to show the recovery time of resolved alerts.
  • Implement a common pagination component.
  • Fix validation guard for router.
  • Add the coldStage to the Duration for queries.
  • Optimize the pages theme.
  • Fix incorrect virtual service names.
  • Add the GenAI icon to Topology.
  • Bump up dependencies.
  • Correct active/inactive text for the cold stage.
  • Add the gen-ai menu.
  • Fix: set the step to SECOND in the duration for Log/Trace/Alarm/Tag.

Documentation

  • Add benchmark selection to the BanyanDB storage documentation.
  • Fix progressive TTL doc for BanyanDB.
  • Restructure docs/README.md for better navigation with high-level documentation overview.
  • Move Marketplace as a top-level menu section with Overview introduction in menu.yml.
  • Polish marketplace.md as the overview page for all out-of-box monitoring features.
  • Add “What's Next” section to Quick Start docs guiding users to Marketplace.
  • Restructure agent compatibility page with OAP 10.x focus and clearer format for legacy versions.
  • Remove outdated FAQ docs (v3, v6 upgrade guides and 7.x metrics issue).
  • Remove “since 7/8/9.x” version statements from documentation as features are standard in 10.x.

All issues and pull requests are here