PIP-320 OpenTelemetry Scaffolding

Background knowledge

PIP-264 - parent PIP titled “Enhanced OTel-based metric system”

PIP-264, which can also be viewed here, describes at a high level a plan to greatly enhance the Pulsar metric system by replacing it with OpenTelemetry. The PIP details the numerous existing problems PIP-264 solves. Among them are:

  • Control which metrics to export per topic/group/namespace via the introduction of a metric filter configuration. This configuration is planned to be dynamic, as outlined in PIP-264.
  • Reduce the immense metrics cardinality due to high topic count (one of Pulsar's great features) by introducing the concept of a Metric Group - a group of topics for metric purposes. Metric reporting will also be done at group granularity; 100k topics can be downsized to 1k groups. The dynamic metric filter configuration would allow the user to control which metric groups to un-filter.
  • Proper histogram exporting
  • Clean up codebase clutter by relying on a single industry-standard API, SDK and metrics protocol (OTLP), instead of the existing mix of home-brew libraries and a hard-coded Prometheus exporter.
  • and many more

You can read here why OpenTelemetry was chosen.

OpenTelemetry

Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos and documentation about it. In this very short paragraph I'll describe what you need to know about OTel from this PIP's perspective.

OpenTelemetry is a project aimed to standardize the way we instrument, collect and ship metrics from applications to telemetry backends, be it databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io). It is divided into API, SDK and Collector:

  • API: interfaces to use to instrument: define a counter, record values to a histogram, etc.
  • SDK: a library, available in many languages, implementing the API, along with other important features such as reading the metrics and exporting them to a telemetry backend or an OTel Collector.
  • Collector: a lightweight process (application) which can receive or retrieve telemetry, transform it (e.g. filter, drop, aggregate) and export it (e.g. send it to various backends). The SDK supports out-of-the-box exporting metrics as a Prometheus HTTP endpoint or sending them out using the OTLP protocol. Companies often choose to ship to the Collector and from there to their preferred vendors, since each vendor has already published its exporter plugin for the OTel Collector. This keeps the SDK exporters very lightweight, as they don't need to support any vendor. It's also easier for the DevOps team, as they can make the OTel Collector their responsibility and have application developers focus only on shipping metrics to that collector.

Just for context: the Pulsar codebase will use the OTel API to create counters / histograms and record values to them, as will Pulsar plugins and Pulsar Function authors. Pulsar itself will create the SDK and use it to hand over an implementation of the API wherever needed in Pulsar. The Collector is up to the user: OTel provides a way to expose the metrics as a /metrics endpoint on a configured port, so Prometheus-compatible scrapers can grab them directly, and the metrics can also be sent via OTLP to an OTel Collector.
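To make this concrete, here is a minimal sketch of what instrumenting broker code with the OTel API could look like. The instrument and attribute names follow this PIP's naming conventions, but the class, instrument description, and method names are illustrative assumptions, not the actual Pulsar implementation:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical example class: records bytes received per topic.
public class TopicMetricsExample {
    private static final AttributeKey<String> TOPIC = AttributeKey.stringKey("pulsar.topic");
    private final LongCounter receivedSize;

    public TopicMetricsExample(OpenTelemetry otel) {
        // Meter name follows the reverse-domain convention described later in this PIP.
        Meter meter = otel.getMeter("org.apache.pulsar.broker");
        this.receivedSize = meter.counterBuilder("pulsar.broker.messaging.topic.received.size")
                .setUnit("By")
                .setDescription("Size of messages received on a topic")
                .build();
    }

    public void recordReceived(String topic, long bytes) {
        // If the SDK is disabled, this is a no-op instrument and the call costs almost nothing.
        receivedSize.add(bytes, Attributes.of(TOPIC, topic));
    }
}
```

Note that the code only depends on the OTel API; whether a real SDK or a no-op implementation sits behind the `OpenTelemetry` instance is decided by whoever constructs it.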

Telemetry layers

PIP-264 clearly outlined that there will be two layers of metrics, collected and exported side by side: OpenTelemetry and the existing metric system, which currently exports in Prometheus format. This PIP will explain in detail how that will work. The basic premise is that you will be able to enable or disable OTel metrics alongside the existing Prometheus metric exporting.

Why OTel in Pulsar will be marked experimental and not GA

As specified in PIP-264, the OpenTelemetry Java SDK has several issues the Pulsar community must fix before it can be used in production. They are documented in PIP-264. The most important one is reducing memory allocations to be negligible: the OTel SDK is built upon immutability, hence it allocates memory in O(#topics), which is a performance killer for a low-latency application like Pulsar.

You can track the proposal and progress the Pulsar and OTel communities are making in this issue.

Metrics endpoint authentication

Today the Pulsar metrics endpoint /metrics has an option to be protected by the configured AuthenticationProvider. The configuration option is named authenticateMetricsEndpoint in both the broker and the proxy.

Motivation

Implementing PIP-264 consists of a long list of steps, which are detailed in this issue. The first step is to add all the bare-bones infrastructure needed to use OpenTelemetry in Pulsar, so that subsequent PRs can use it to start translating existing metrics to their OTel form. This means the same metrics will co-exist in the codebase, and also at runtime if OTel is enabled.

Goals

In Scope

  • Ability to add metrics using OpenTelemetry to Pulsar components: Broker, Function Worker and Proxy.
  • User can disable or enable OpenTelemetry metrics, which by default will be disabled
  • OpenTelemetry metrics will be configured via its native OTel Java SDK configuration options
  • All the necessary information to use OTel with Pulsar will be documented in Pulsar documentation site
  • OpenTelemetry metrics layer defined as experimental, and not GA

Out of Scope

  • Ability to add metrics using OpenTelemetry as Pulsar Function author.
  • Restricting access to the OTel Prometheus endpoint to authenticated sessions, using Pulsar authentication
  • Metrics in Pulsar clients (as defined in PIP-264)

High Level Design

Configuration

OpenTelemetry, as any good telemetry library (e.g. log4j, logback), has its own configuration mechanisms:

  • System properties
  • Environment variables
  • Experimental file-based configuration

Pulsar doesn't need to introduce any additional configuration. The user can decide, using OTel configuration, things like:

  • How do I want to export the metrics? Prometheus? Which port will Prometheus be exposed at?
  • Change histogram buckets using Views
  • and more

Pulsar will use AutoConfiguredOpenTelemetrySdk, which uses all the above configuration mechanisms (documented here). This class builds an OpenTelemetrySdk based on those configurations. It is the entry point to the OpenTelemetry API, as OpenTelemetrySdk implements the OpenTelemetry API class.
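For illustration, this is what typical OTel configuration could look like using environment variables (the port and collector endpoint are example values, not Pulsar defaults):

```shell
# Export metrics via the SDK's built-in Prometheus endpoint
# (9464 is the OTel Prometheus exporter's default port, shown here explicitly)
export OTEL_METRICS_EXPORTER=prometheus
export OTEL_EXPORTER_PROMETHEUS_PORT=9464

# Alternatively, ship metrics over OTLP to an OTel Collector:
# export OTEL_METRICS_EXPORTER=otlp
# export OTEL_EXPORTER_OTLP_ENDPOINT=http://my-collector:4317
```

The same keys can be supplied as system properties in lowercase dotted form, e.g. `-Dotel.metrics.exporter=prometheus`.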

Setting sensible defaults for Pulsar

There are some configuration options whose defaults we wish to change, while still allowing users to override them. We think these default values will make for a much easier user experience.

  • otel.experimental.metrics.cardinality.limit - value: 10,000. This property sets an upper bound on the number of unique Attributes an instrument can have. Taking Pulsar as an example: for an instrument like pulsar.broker.messaging.topic.received.size, the number of unique Attributes equals the number of active topics in the broker. Since Pulsar can handle up to 1M topics, it makes sense to set the default value to 10k, which translates to 10k topics.

AutoConfiguredOpenTelemetrySdkBuilder allows adding properties using the method addPropertiesSupplier. System properties and environment variables override them. The file-based configuration doesn't yet take these supplied properties into account, but it will.
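A sketch of how Pulsar could supply these defaults through addPropertiesSupplier. The class name is hypothetical; the two property keys and their default values are the ones proposed in this PIP:

```java
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import java.util.Map;

// Hypothetical bootstrap helper applying Pulsar's proposed OTel defaults.
public final class PulsarOtelDefaults {
    public static OpenTelemetrySdk create() {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // Defaults only: system properties and environment variables
                // set by the user still take precedence over these values.
                .addPropertiesSupplier(() -> Map.of(
                        "otel.experimental.metrics.cardinality.limit", "10000",
                        "otel.sdk.disabled", "true"))
                .build()
                .getOpenTelemetrySdk();
    }
}
```

Because the supplier only provides defaults, a user exporting `OTEL_SDK_DISABLED=false` in the environment would still win and enable the SDK.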

Opting in

We would like to have the ability to toggle OpenTelemetry-based metrics, as they are still new. We won't need any special Pulsar configuration, as OpenTelemetry SDK comes with a configuration key to do that. Since OTel is still experimental, it will have to be opt-in, hence we will add the following property to be the default using the mechanism described above:

  • otel.sdk.disabled - value: true This property value disables OpenTelemetry.

With OTel disabled, the user keeps the existing metrics system. In a disabled state, OTel operates in no-op mode. This means instruments do get built, but the instrument builders return the same instance of a no-op instrument, which does nothing in its record-value methods (e.g. add(number), record(number)). The no-op MeterProvider has no registered MetricReader, hence no metric collection will be made. The memory impact is almost zero, and the same goes for the CPU impact.
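Opting in is therefore a single override of the Pulsar-supplied default, for example via an environment variable:

```shell
# Pulsar defaults otel.sdk.disabled to true; override it to opt in
# to the experimental OTel metrics layer.
export OTEL_SDK_DISABLED=false
```

The equivalent system property would be `-Dotel.sdk.disabled=false`.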

The current metric system doesn't have a single toggle that causes all existing data structures to stop collecting data. Introducing one would require changes in many places, since there is no single point through which all metric instruments are created (one of the motivations for PIP-264). The current system does have the toggle exposeTopicLevelMetricsInPrometheus. It toggles off topic-level metrics, which means the highest-cardinality metrics become namespace level. Once that toggle is false, the number of data structures accounting for memory would be in the range of a few thousand, which shouldn't pose a memory burden. If the user refrains from calling /metrics, this also reduces the CPU and memory cost associated with collecting metrics.

When the user enables OTel, there will be a memory increase; but if the user disabled topic-level metrics in the existing system, as specified above, the majority of the memory increase will come from topic-level metrics in OTel, at the expense of not having them in the existing metric system.

Cluster attribute name

A broker is part of a cluster, configured via the Pulsar configuration key clusterName. When a broker is part of a cluster, it shares the topics defined in that cluster (persisted in the metadata service, e.g. ZooKeeper) with the other brokers of that cluster.

Today, almost every unique time series emitted in Prometheus metrics contains the cluster label (almost all, as it is added manually). We wish for the same with OTel: to have that attribute in each exported unique time series.

OTel has the perfect location for attributes shared across all time series: the Resource. An application has a Resource containing one or more attributes, defined once at OTel initialization or via configuration. It can contain attributes like the hostname, AWS region, etc. The default contains the service name and some info on the SDK version.

Attributes can be added dynamically through addResourceCustomizer() in AutoConfiguredOpenTelemetrySdkBuilder. We will use that to inject the cluster attribute, taken from the configuration.
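A possible shape of that injection, assuming the cluster name has already been read from the broker configuration (the helper class name is illustrative):

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;

// Hypothetical helper merging the pulsar.cluster attribute into the SDK Resource.
public final class ClusterResourceExample {
    private static final AttributeKey<String> PULSAR_CLUSTER =
            AttributeKey.stringKey("pulsar.cluster");

    public static OpenTelemetrySdk create(String clusterName) {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // The customizer receives the auto-configured Resource and may
                // return an enriched one; merge() keeps the existing attributes.
                .addResourceCustomizer((resource, config) -> resource.merge(
                        Resource.create(Attributes.of(PULSAR_CLUSTER, clusterName))))
                .build()
                .getOpenTelemetrySdk();
    }
}
```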

For Prometheus, we submitted a proposal to the OpenTelemetry specification, which was merged, to allow copying resource attributes into each exported unique time series in the Prometheus exporter. We plan to contribute its implementation to the OTel Java SDK.

In the Prometheus exporter, the Resource is exported as target_info{} 1, with the resource attributes added to that time series. Retrieving them requires joins, making them extremely difficult to use. The other alternative was to introduce our own PulsarAttributesBuilder class on top of OTel's AttributesBuilder, but getting every contributor to know and use this class is hard, and getting it adopted by Pulsar Functions or plugin authors would be immensely hard. Also, when exporting via OTLP, it is very inefficient to repeat the attribute across all unique time series instead of stating it once in the Resource. Hence, this needed to be solved in the Prometheus exporter, as we did in the proposal.

The attribute will be named pulsar.cluster, as both the proxy and the broker are part of this cluster.

Naming and using OpenTelemetry

Attributes

  • We shall prefix each attribute with pulsar.. Example: pulsar.topic, pulsar.cluster.

Instruments

We should have a clear hierarchy, hence we will use the following prefixes:

  • pulsar.broker
  • pulsar.proxy
  • pulsar.function_worker

Meter

It's customary to use a reverse domain name for meter names. Hence, we'll use:

  • org.apache.pulsar.broker
  • org.apache.pulsar.proxy
  • org.apache.pulsar.function_worker

The OTel meter name is converted to the attribute otel_scope_name and added to the attributes of each unique time series by the Prometheus exporter.

We won't specify a meter version; it is used solely to signify the version of the instrumentation, and since this is the first version, we won't use it.

Detailed Design

Design & Implementation Details

  • OpenTelemetryService class

    • Parameters:
      • Cluster name
    • What it will do:
      • Override default max cardinality to 10k
      • Register a resource with cluster name
      • Place defaults setting to instruct Prometheus Exporter to copy resource attributes
      • In the future: place defaults for Memory Mode to be REUSABLE_DATA
  • PulsarBrokerOpenTelemetry class

    • Initialization
      • Construct an OpenTelemetryService using the cluster name taken from the broker configuration
      • Constructs a Meter for the broker metrics
    • Methods
      • getMeter() returns the Meter for the broker
    • Notes
      • This is the class that will be passed along to other Pulsar service classes that need to define telemetry such as metrics (and, in the future, traces).
  • PulsarProxyOpenTelemetry class

    • Same as PulsarBrokerOpenTelemetry but for Pulsar Proxy
  • PulsarWorkerOpenTelemetry class

    • Same as PulsarBrokerOpenTelemetry but for Pulsar function worker
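The design bullets above can be sketched as follows. This is a hypothetical outline under the assumptions in this PIP, with the OpenTelemetryService responsibilities inlined for self-containment; the real classes would wire in the broker configuration and the Prometheus-exporter resource-copy default as well:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import java.util.Map;

// Illustrative sketch of the PulsarBrokerOpenTelemetry class described above.
public class PulsarBrokerOpenTelemetry implements AutoCloseable {
    private final OpenTelemetrySdk sdk;
    private final Meter meter;

    public PulsarBrokerOpenTelemetry(String clusterName) {
        this.sdk = AutoConfiguredOpenTelemetrySdk.builder()
                // Pulsar defaults: 10k cardinality cap, opt-in (disabled) SDK.
                .addPropertiesSupplier(() -> Map.of(
                        "otel.experimental.metrics.cardinality.limit", "10000",
                        "otel.sdk.disabled", "true"))
                // Register the cluster name as a Resource attribute.
                .addResourceCustomizer((resource, config) -> resource.merge(
                        Resource.create(Attributes.of(
                                AttributeKey.stringKey("pulsar.cluster"), clusterName))))
                .build()
                .getOpenTelemetrySdk();
        this.meter = sdk.getMeter("org.apache.pulsar.broker");
    }

    // Handed to broker service classes that define metrics.
    public Meter getMeter() {
        return meter;
    }

    @Override
    public void close() {
        sdk.close();
    }
}
```

PulsarProxyOpenTelemetry and PulsarWorkerOpenTelemetry would differ only in their meter name (org.apache.pulsar.proxy and org.apache.pulsar.function_worker).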

Public-facing Changes

Public API

  • The OTel Prometheus Exporter adds a /metrics endpoint on a user-defined port, if the user chooses to use it

Configuration

  • OTel configurations are used

Security Considerations

  • OTel currently does not support setting a custom Authenticator for the Prometheus exporter.
    An issue has been raised here.
    • Once it does, we can secure the Prometheus exporter metrics endpoint using AuthenticationProvider
  • Any user can access metrics, and they are not protected per tenant, like today's implementation

Links