# PIP-320: OpenTelemetry Scaffolding

# Background knowledge

## PIP-264 - parent PIP titled "Enhanced OTel-based metric system"
[PIP-264](https://github.com/apache/pulsar/pull/21080), which can also be viewed [here](pip-264.md), describes at a high
level a plan to greatly enhance the Pulsar metric system by replacing it with [OpenTelemetry](https://opentelemetry.io/).
The PIP details the numerous existing problems it solves. Among them are:
- Control which metrics to export per topic/group/namespace via the introduction of a metric filter configuration.
This configuration is planned to be dynamic, as outlined in [PIP-264](pip-264.md).
- Reduce the immense metric cardinality caused by high topic counts (one of Pulsar's great features) by introducing
the concept of a Metric Group - a group of topics for metric purposes. Metric reporting will also be done at
group granularity: 100k topics can be downsized to 1k groups. The dynamic metric filter configuration would allow
the user to control which metric groups to un-filter.
- Proper histogram exporting
- Clean up codebase clutter by relying on a single industry-standard API, SDK and metrics protocol (OTLP) instead of
the existing mix of home-brewed libraries and a hard-coded Prometheus exporter.
- and many more

You can read [here](pip-264.md#why-opentelemetry) why OpenTelemetry was chosen.

## OpenTelemetry
Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos and
documentation about it. In this very short paragraph I'll describe what you need to know about OTel from this PIP's
perspective.

OpenTelemetry is a project aimed at standardizing the way we instrument, collect and ship metrics from applications
to telemetry backends, be it databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io).
It is divided into API, SDK and Collector:
- API: the interfaces used to instrument code: define a counter, record values to a histogram, etc.
- SDK: a library, available in many languages, implementing the API along with other important features such as
reading the metrics and exporting them to a telemetry backend or an OTel Collector.
- Collector: a lightweight process (application) which can receive or retrieve telemetry, transform it (e.g.
filter, drop, aggregate) and export it (e.g. send it to various backends). The SDK supports, out of the box,
exporting metrics as a Prometheus HTTP endpoint or sending them out using the OTLP protocol. Companies often choose to
ship to the Collector and from there to their preferred vendors, since each vendor has already published an exporter
plugin for the OTel Collector. This keeps the SDK exporters very lightweight, as they don't need to support any
vendor. It's also easier for the DevOps team, as they can make the OTel Collector their responsibility and have
application developers focus only on shipping metrics to that collector.

Just to have some context: the Pulsar codebase will use the OTel API to create counters / histograms and record values
to them, and so will Pulsar plugins and Pulsar Function authors. Pulsar itself will be the one creating the SDK
and using it to hand over an implementation of the API wherever needed in Pulsar. The Collector is up to the
user, as OTel provides a way to expose the metrics as a `/metrics` endpoint on a configured port, so Prometheus-compatible
scrapers can scrape it directly. The metrics can also be sent via OTLP to an OTel Collector.

## Telemetry layers
PIP-264 clearly outlined that there will be two layers of metrics, collected and exported side by side: OpenTelemetry
and the existing metric system - currently exported in Prometheus format. This PIP explains in detail how that will work.
The basic premise is that you will be able to enable or disable OTel metrics alongside the existing Prometheus
metric exporting.

## Why OTel in Pulsar will be marked experimental and not GA
As specified in [PIP-264](pip-264.md), the OpenTelemetry Java SDK has several issues the Pulsar community must
fix before it can be used in production. They are [documented](pip-264.md#what-we-need-to-fix-in-opentelemetry)
in PIP-264. The most important one is reducing memory allocations to a negligible level. The OTel SDK is built upon
immutability, hence it allocates memory in O(`#topics`), which is a performance killer for a low-latency application
like Pulsar.

You can track the proposal and the progress the Pulsar and OTel communities are making in
[this issue](https://github.com/open-telemetry/opentelemetry-java/issues/5105).


## Metrics endpoint authentication
Today the Pulsar metrics endpoint `/metrics` can optionally be protected by the configured `AuthenticationProvider`.
The configuration option is named `authenticateMetricsEndpoint`, in both the broker and the proxy.


# Motivation

Implementing PIP-264 consists of a long list of steps, which are detailed in
[this issue](https://github.com/apache/pulsar/issues/21121). The first step is adding all the bare-bones infrastructure
needed to use OpenTelemetry in Pulsar, so that subsequent PRs can use it to start translating existing metrics to their
OTel form. This means the same metrics will co-exist in the codebase, and also at runtime if OTel is enabled.

# Goals

## In Scope
- Ability to add metrics using OpenTelemetry to Pulsar components: Broker, Function Worker and Proxy.
- The user can enable or disable OpenTelemetry metrics; they will be disabled by default
- OpenTelemetry metrics will be configured via the native OTel Java SDK configuration options
- All the necessary information to use OTel with Pulsar will be documented on the Pulsar documentation site
- The OpenTelemetry metrics layer is defined as experimental, and *not* GA


## Out of Scope
- Ability to add metrics using OpenTelemetry as a Pulsar Function author.
- Restricting access to the OTel Prometheus endpoint to authenticated sessions, using Pulsar authentication
- Metrics in Pulsar clients (as defined in [PIP-264](pip-264.md#out-of-scope))

# High Level Design

## Configuration
OpenTelemetry, like any mature instrumentation library (think Log4j or Logback for logging), has its own
configuration mechanisms:
- System properties
- Environment variables
- Experimental file-based configuration

Pulsar doesn't need to introduce any additional configuration. Using OTel configuration, the user can decide
things like:
* How do I want to export the metrics? Prometheus? On which port will the Prometheus endpoint be exposed?
* Change histogram buckets using Views
* and more

Pulsar will use `AutoConfiguredOpenTelemetrySdk`, which uses all the above configuration mechanisms
(documented [here](https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure)).
This class builds an `OpenTelemetrySdk` based on those configurations. The resulting `OpenTelemetrySdk` is the entry
point to the OpenTelemetry API, as it implements the `OpenTelemetry` API interface.
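As a rough sketch (not the final Pulsar code), the autoconfigured SDK can be built and handed over as the
`OpenTelemetry` entry point like this; the environment variables in the comments are standard OTel autoconfigure
options, shown only for illustration.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

class OtelBootstrapExample {

    static OpenTelemetry bootstrap() {
        // Picks up system properties, environment variables and (experimentally)
        // file-based configuration, e.g.:
        //   OTEL_METRICS_EXPORTER=prometheus
        //   OTEL_EXPORTER_PROMETHEUS_PORT=9464
        OpenTelemetrySdk sdk = AutoConfiguredOpenTelemetrySdk.builder()
                .build()
                .getOpenTelemetrySdk();
        // OpenTelemetrySdk implements the OpenTelemetry API interface,
        // so it can be handed over wherever the API is needed in Pulsar.
        return sdk;
    }
}
```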

### Setting sensible defaults for Pulsar
There are some configuration options whose defaults we wish to change, while still allowing users to override them
if they wish. We think these default values will make for a much easier user experience.

* `otel.experimental.metrics.cardinality.limit` - value: 10,000
This property sets an upper bound on the number of unique `Attributes` an instrument can have. Take Pulsar for
example: for an instrument like `pulsar.broker.messaging.topic.received.size`, the number of unique `Attributes` would
equal the number of active topics in the broker. Since Pulsar can handle up to 1M topics, it makes more sense to
raise the default value to 10k, which translates to 10k topics.

`AutoConfiguredOpenTelemetrySdkBuilder` allows adding properties using the `addPropertiesSupplier` method. System
properties and environment variables override them. The file-based configuration doesn't yet take
those supplied properties into account, but it will.
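A minimal sketch of how such a default could be supplied, assuming the property value proposed above; user-provided
system properties and environment variables still take precedence over this supplier:

```java
import java.util.Map;

import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdkBuilder;

class PulsarOtelDefaultsExample {

    static AutoConfiguredOpenTelemetrySdkBuilder withPulsarDefaults() {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // Defaults only; system properties and environment variables override them.
                .addPropertiesSupplier(() -> Map.of(
                        "otel.experimental.metrics.cardinality.limit", "10000"));
    }
}
```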


## Opting in
We would like to have the ability to toggle OpenTelemetry-based metrics, as they are still new.
We won't need any special Pulsar configuration, as the OpenTelemetry SDK comes with a configuration key for that.
Since OTel support is still experimental, it will have to be opt-in, hence we will add the following property as a
default, using the mechanism described [above](#setting-sensible-defaults-for-pulsar):

* `otel.sdk.disabled` - value: true
This property value disables OpenTelemetry.

With OTel disabled, the user remains with the existing metric system. OTel in a disabled state operates in
no-op mode. This means instruments do get built, but the instrument builders return the same instance of a
no-op instrument, which does nothing in its value-recording methods (e.g. `add(number)`, `record(number)`). The no-op
`MeterProvider` has no registered `MetricReader`, hence no metric collection will be made. The memory impact
is almost 0, and the same goes for the CPU impact.
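The no-op behavior can be illustrated with the API's built-in no-op implementation; an SDK disabled via
`otel.sdk.disabled=true` behaves essentially the same way, and a user re-enables OTel with `OTEL_SDK_DISABLED=false`
or `-Dotel.sdk.disabled=false`. This is an illustrative sketch, not Pulsar code:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;

class NoopBehaviorExample {

    public static void main(String[] args) {
        // Equivalent in spirit to what a disabled (otel.sdk.disabled=true) setup provides.
        OpenTelemetry disabled = OpenTelemetry.noop();

        LongCounter counter = disabled.getMeter("org.apache.pulsar.broker")
                .counterBuilder("pulsar.broker.example.counter")
                .build();

        // The call succeeds but records nothing: no time series is kept,
        // and there is no MetricReader to collect from.
        counter.add(1);
    }
}
```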

The current metric system doesn't have a toggle that causes all existing data structures to stop collecting
data. Introducing one would require changes in many places, since there is no single place through which
all metric instruments are created (one of the motivations for PIP-264).
The current system does have one toggle: `exposeTopicLevelMetricsInPrometheus`. It enables toggling off
topic-level metrics, which means the highest-cardinality metrics will be at the namespace level.
Once that toggle is `false`, the number of data structures accounting for memory would be in the range of
a few thousand, which shouldn't pose a memory burden. If the user refrains from calling
`/metrics`, that also reduces the CPU and memory cost associated with collecting metrics.

When the user enables OTel there will be a memory increase, but if the user disabled topic-level
metrics in the existing system, as specified above, the majority of the memory increase will be due to topic-level
metrics in OTel, at the expense of not having them in the existing metric system.


## Cluster attribute name
A broker is part of a cluster, which is configured in the Pulsar configuration key `clusterName`. When the broker is
part of a cluster, it shares the topics defined in that cluster (persisted in the metadata service, e.g. ZooKeeper)
with the other brokers of that cluster.

Today, each unique time series emitted in Prometheus metrics contains the `cluster` label (almost all of them, as it
is done manually). We wish for the same with OTel - to have that attribute on each exported unique time series.

OTel has the perfect place for attributes that are shared across all time series: the Resource. An application
can have multiple Resources, each having one or more attributes. You define it once, in OTel initialization or
configuration. It can contain attributes like the hostname, the AWS region, etc. The default contains the service name
and some information on the SDK version.

Attributes can be added dynamically through `addResourceCustomizer()` in `AutoConfiguredOpenTelemetrySdkBuilder`.
We will use that to inject the `cluster` attribute, taken from the configuration.
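A minimal sketch of injecting the cluster attribute through `addResourceCustomizer()`, assuming the `pulsar.cluster`
attribute name chosen below and a cluster name taken from the broker configuration:

```java
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdkBuilder;

class ClusterResourceExample {

    static AutoConfiguredOpenTelemetrySdkBuilder withClusterAttribute(String clusterName) {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // Merge the cluster name into the Resource shared by all exported time series.
                .addResourceCustomizer((resource, configProperties) ->
                        resource.toBuilder()
                                .put("pulsar.cluster", clusterName)
                                .build());
    }
}
```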

We submitted a [proposal](https://github.com/open-telemetry/opentelemetry-specification/pull/3761)
to the OpenTelemetry specification, which was merged, to allow copying resource attributes into each unique
time series exported by the Prometheus exporter.
We plan to contribute its implementation to the OTel Java SDK.

In the Prometheus exporter, the Resource is exported as `target_info{} 1`, with the resource attributes added to this
time series only. Retrieving them would require joins, making it extremely difficult to use.
The other alternative was to introduce our own `PulsarAttributesBuilder` class on top of
OTel's `AttributesBuilder`. Getting every contributor to know about this class and use it is hard; getting it
across to Pulsar Functions or plugin authors would be immensely hard. Also, when exporting as
OTLP, it is very inefficient to repeat the attribute across all unique time series instead of stating it once in the
Resource. Hence, this needed to be solved in the Prometheus exporter, as we did in the proposal.

The attribute will be named `pulsar.cluster`, as both the proxy and the broker are part of this cluster.

## Naming and using OpenTelemetry

### Attributes
* We shall prefix each attribute name with `pulsar.`. Example: `pulsar.topic`, `pulsar.cluster`.

### Instruments
We should have a clear hierarchy, hence we'll use the following prefixes:
* `pulsar.broker`
* `pulsar.proxy`
* `pulsar.function_worker`

### Meter
It's customary to use a reverse domain name for meter names. Hence, we'll use:
* `org.apache.pulsar.broker`
* `org.apache.pulsar.proxy`
* `org.apache.pulsar.function_worker`

The OTel meter name is converted to the attribute `otel_scope_name` and added to each unique time series'
attributes by the Prometheus exporter.

We won't specify a meter version, as it is used solely to signify the version of the instrumentation, and since this
is the first version, we won't use it.
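Putting the naming conventions together, a hypothetical broker instrument would be created as follows (the instrument
name and unit are illustrative, not names fixed by this PIP):

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

class NamingConventionsExample {

    static LongCounter exampleInstrument(OpenTelemetry openTelemetry) {
        // Meter named after the component, using the reverse domain name convention.
        Meter brokerMeter = openTelemetry.getMeter("org.apache.pulsar.broker");
        // Instrument name prefixed with pulsar.broker; attributes would use the pulsar. prefix.
        return brokerMeter.counterBuilder("pulsar.broker.messaging.topic.received.size")
                .setUnit("By") // bytes, in UCUM notation; illustrative only
                .build();
    }
}
```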


# Detailed Design

## Design & Implementation Details

* `OpenTelemetryService` class
  * Parameters:
    * Cluster name
  * What it will do:
    - Override the default max cardinality to 10k
    - Register a resource with the cluster name
    - Set defaults instructing the Prometheus exporter to copy resource attributes
    - In the future: set the default Memory Mode to REUSABLE_DATA

* `PulsarBrokerOpenTelemetry` class
  * Initialization
    * Constructs an `OpenTelemetryService` using the cluster name taken from the broker configuration
    * Constructs a Meter for the broker metrics
  * Methods
    * `getMeter()` returns the `Meter` for the broker
  * Notes
    * This is the class that will be passed along to other Pulsar service classes that need to define
      telemetry such as metrics (and, in the future, traces).

* `PulsarProxyOpenTelemetry` class
  * Same as `PulsarBrokerOpenTelemetry` but for Pulsar Proxy
* `PulsarWorkerOpenTelemetry` class
  * Same as `PulsarBrokerOpenTelemetry` but for the Pulsar Function Worker
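A minimal sketch of how the classes listed above could fit together; the class and method shapes follow that list, but
this is a sketch under those assumptions, not the final implementation:

```java
import java.io.Closeable;
import java.util.Map;

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

class OpenTelemetryService implements Closeable {

    private final OpenTelemetrySdk sdk;

    OpenTelemetryService(String clusterName) {
        this.sdk = AutoConfiguredOpenTelemetrySdk.builder()
                // Pulsar defaults; overridable via system properties / environment variables.
                // (A default instructing the Prometheus exporter to copy resource attributes
                // would be added here once available.)
                .addPropertiesSupplier(() -> Map.of(
                        "otel.sdk.disabled", "true",
                        "otel.experimental.metrics.cardinality.limit", "10000"))
                // Register the cluster name on the Resource shared by all time series.
                .addResourceCustomizer((resource, config) -> resource.toBuilder()
                        .put("pulsar.cluster", clusterName)
                        .build())
                .build()
                .getOpenTelemetrySdk();
    }

    OpenTelemetry getOpenTelemetry() {
        return sdk;
    }

    @Override
    public void close() {
        sdk.close();
    }
}

class PulsarBrokerOpenTelemetry implements Closeable {

    private final OpenTelemetryService service;
    private final Meter meter;

    PulsarBrokerOpenTelemetry(String clusterName) {
        this.service = new OpenTelemetryService(clusterName);
        this.meter = service.getOpenTelemetry().getMeter("org.apache.pulsar.broker");
    }

    Meter getMeter() {
        return meter;
    }

    @Override
    public void close() {
        service.close();
    }
}
```

`PulsarProxyOpenTelemetry` and `PulsarWorkerOpenTelemetry` would follow the same shape with their own meter names.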


## Public-facing Changes

### Public API
* The OTel Prometheus exporter adds a `/metrics` endpoint on a user-defined port, if the user chooses to use it

### Configuration
* OTel configurations are used

# Security Considerations
* OTel currently does not support setting a custom Authenticator for the Prometheus exporter.
  An issue has been raised [here](https://github.com/open-telemetry/opentelemetry-java/issues/6013).
* Once it does, we can secure the Prometheus exporter metrics endpoint using `AuthenticationProvider`
* Any user can access the metrics, and they are not protected per tenant, just like today's implementation

# Links

* Mailing List discussion thread: https://lists.apache.org/thread/xcn9rm551tyf4vxrpb0th0wj0kktnrr2
* Mailing List voting thread: https://lists.apache.org/thread/zp6vl9z9dhwbvwbplm60no13t8fvlqs2