| [[operator-monitoring]] |
| = Camel K Operator Monitoring |
| |
| NOTE: The Camel K monitoring architecture relies on https://prometheus.io[Prometheus] and the eponymous operator. Make sure you've checked the xref:observability/monitoring.adoc#prerequisites[Camel K monitoring prerequisites]. |
| |
| [[installation]] |
| == Installation |
| |
| The `kamel install` command provides the `--monitoring` flag, which automatically creates the default resources required to monitor the Camel K operator, e.g.: |
| |
| [source,sh] |
| ---- |
| $ kamel install --monitoring=true |
| ---- |
| |
| This creates: |
| |
| * a `PodMonitor` resource targeting the operator _metrics_ endpoint, so that the Prometheus server can scrape the <<metrics>> exposed by the operator; |
| * a `PrometheusRule` resource with default alerting rules based on the exposed metrics. The <<alerting>> section provides more details about these default rules. |
| |
| The `kamel install` command also provides the `--monitoring-port` option, which changes the port of the operator monitoring endpoint, e.g.: |
| |
| [source,sh] |
| ---- |
| $ kamel install --monitoring=true --monitoring-port=8888 |
| ---- |
| |
| You can refer to the <<discovery>> and <<alerting>> sections in case you don't want to rely on the default monitoring configuration. |
| |
| [[metrics]] |
| == Metrics |
| |
| The Camel K operator monitoring endpoint exposes the following metrics: |
| |
| .Camel K operator metrics |
| |=== |
| |Name |Type |Description |Buckets |Labels |
| |
| | `camel_k_reconciliation_duration_seconds` |
| | `HistogramVec` |
| | Reconciliation request duration |
| | 0.25s, 0.5s, 1s, 5s |
| | `namespace`, `group`, `version`, `kind`, `result`: `Reconciled`\|`Errored`\|`Requeued`, `tag`: `""`\|`PlatformError`\|`UserError` |
| |
| | `camel_k_build_duration_seconds` |
| | `HistogramVec` |
| | Build duration |
| | 30s, 1m, 1.5m, 2m, 5m, 10m |
| | `result`: `Succeeded`\|`Error` |
| |
| | `camel_k_build_recovery_attempts` |
| | `Histogram` |
| | Build recovery attempts |
| | 0, 1, 2, 3, 4, 5 |
| | `result`: `Succeeded`\|`Error` |
| |
| | `camel_k_build_queue_duration_seconds` |
| | `Histogram` |
| | Build queue duration |
| 5s, 15s, 30s, 1m, 5m |
| | N/A |
| |
| | `camel_k_integration_first_readiness_seconds` |
| | `Histogram` |
| | Time to first integration readiness |
| | 5s, 10s, 30s, 1m, 2m |
| | N/A |
| |
| |=== |
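| |
| As a sketch of how these histograms can be queried, the following `PrometheusRule` precomputes the 90th percentile of the reconciliation duration and of the successful build duration with `histogram_quantile`. The resource name, the group name, and the recorded series names are hypothetical, and the `5m` rate window is an assumption to adapt to your scrape interval: |
| |
| .operator-recording-rules.yaml (illustrative) |
| [source,yaml] |
| ---- |
| apiVersion: monitoring.coreos.com/v1 |
| kind: PrometheusRule |
| metadata: |
|   name: camel-k-operator-recording-rules # hypothetical name |
| spec: |
|   groups: |
|     - name: camel-k-operator.recording # hypothetical group name |
|       rules: |
|         # 90th percentile of the reconciliation duration, per resource kind |
|         - record: camel_k:reconciliation_duration_seconds:p90 |
|           expr: | |
|             histogram_quantile(0.9, |
|               sum(rate(camel_k_reconciliation_duration_seconds_bucket[5m])) by (le, kind) |
|             ) |
|         # 90th percentile of the duration of successful builds |
|         - record: camel_k:build_duration_seconds:p90 |
|           expr: | |
|             histogram_quantile(0.9, |
|               sum(rate(camel_k_build_duration_seconds_bucket{result="Succeeded"}[5m])) by (le) |
|             ) |
| ---- |
| |
| Because quantiles are interpolated between bucket boundaries, the results are only as precise as the buckets listed in the table above. |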
| |
| [[discovery]] |
| == Discovery |
| |
| A `PodMonitor` resource must be created for the Prometheus Operator to reconcile, so that the managed Prometheus instance can scrape the Camel K operator _metrics_ endpoint. |
| |
| As an example, here is the `PodMonitor` resource that is created when executing the `kamel install --monitoring=true` command: |
| |
| .operator-pod-monitor.yaml |
| [source,yaml] |
| ---- |
| apiVersion: monitoring.coreos.com/v1 |
| kind: PodMonitor |
| metadata: |
|   name: camel-k-operator |
|   labels: # <1> |
|     ... |
| spec: |
|   selector: |
|     matchLabels: # <2> |
|       app: "camel-k" |
|       camel.apache.org/component: operator |
|   podMetricsEndpoints: |
|     - port: metrics |
| ---- |
| <1> The labels must match the `podMonitorSelector` field from the `Prometheus` resource |
| <2> This label selector matches the Camel K operator Deployment labels |
| |
| The Prometheus Operator https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/user-guides/getting-started.md#related-resources[getting started] guide documents the discovery mechanism, as well as the relationship between the operator resources. |
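| |
| For illustration, a `Prometheus` resource that selects the `PodMonitor` above could look like the following minimal sketch. The resource name and the `app: camel-k` selector label are assumptions; use whatever labels your `Prometheus` resource actually selects on, and set the same labels on the `PodMonitor` metadata: |
| |
| [source,yaml] |
| ---- |
| apiVersion: monitoring.coreos.com/v1 |
| kind: Prometheus |
| metadata: |
|   name: prometheus # hypothetical name |
| spec: |
|   # Selects the PodMonitor resources carrying this label (assumed label) |
|   podMonitorSelector: |
|     matchLabels: |
|       app: camel-k |
|   # An empty selector matches PodMonitor resources in all namespaces |
|   podMonitorNamespaceSelector: {} |
| ---- |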
| |
| If your operator metrics are not discovered, refer to https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/troubleshooting.md#troubleshooting-servicemonitor-changes[Troubleshooting `ServiceMonitor` changes], which also applies to `PodMonitor` resources. |
| |
| [[alerting]] |
| == Alerting |
| |
| NOTE: The Prometheus Operator declares the `Alertmanager` resource that can be used to configure _Alertmanager_ instances, along with `Prometheus` instances. The following section assumes an `Alertmanager` resource already exists in your cluster. |
| |
| A `PrometheusRule` resource can be created for the Prometheus Operator to reconcile, so that the managed Alertmanager instance can trigger alerts, based on the metrics exposed by the Camel K operator. |
| |
| As an example, here are the alerting rules defined in the `PrometheusRule` resource that is created when executing the `kamel install --monitoring=true` command: |
| |
| .Camel K operator alerts |
| |=== |
| |Name |Severity |Description |
| |
| | `CamelKReconciliationDuration` |
| | warning |
| | More than 10% of the reconciliation requests have their duration above 0.5s over at least 1 min. |
| |
| | `CamelKReconciliationFailure` |
| | warning |
| | More than 1% of the reconciliation requests have failed over at least 10 min. |
| |
| | `CamelKSuccessBuildDuration2m` |
| | warning |
| | More than 10% of the successful builds have their duration above 2 min over at least 1 min. |
| |
| | `CamelKSuccessBuildDuration5m` |
| | critical |
| | More than 1% of the successful builds have their duration above 5 min over at least 1 min. |
| |
| | `CamelKBuildError` |
| | critical |
| | More than 1% of the builds have errored over at least 10 min. |
| |
| | `CamelKBuildQueueDuration1m` |
| | warning |
| | More than 1% of the builds have been queued for more than 1 min over at least 1 min. |
| |
| | `CamelKBuildQueueDuration5m` |
| | critical |
| | More than 1% of the builds have been queued for more than 5 min over at least 1 min. |
| |
| |=== |
| |
| You can also register your own `PrometheusRule` resources to define additional alerts based on these metrics, e.g.: |
| |
| [source,yaml] |
| ---- |
| apiVersion: monitoring.coreos.com/v1 |
| kind: PrometheusRule |
| metadata: |
|   name: camel-k-alerts |
| spec: |
|   groups: |
|     - name: camel-k-alerts |
|       rules: |
|         - alert: CamelKIntegrationTimeToReadiness |
|           expr: | |
|             ( |
|             1 - sum(rate(camel_k_integration_first_readiness_seconds_bucket{le="60"}[5m])) by (job) |
|             / |
|             sum(rate(camel_k_integration_first_readiness_seconds_count[5m])) by (job) |
|             ) |
|             * 100 |
|             > 10 |
|           for: 1m |
|           labels: |
|             severity: warning |
|           annotations: |
|             message: | |
|               {{ printf "%0.0f" $value }}% of the integrations |
|               for {{ $labels.job }} have their first time to readiness above 1m. |
| ---- |
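| |
| For a rule such as the one above to be evaluated, the `Prometheus` resource has to select it and know where to send the resulting alerts. The following is a minimal sketch, assuming the `PrometheusRule` carries a hypothetical `role: alert-rules` label in its metadata and that an Alertmanager service named `alertmanager` runs in a `monitoring` namespace (both names, as well as the `web` port name, are assumptions): |
| |
| [source,yaml] |
| ---- |
| apiVersion: monitoring.coreos.com/v1 |
| kind: Prometheus |
| metadata: |
|   name: prometheus # hypothetical name |
| spec: |
|   # Loads the PrometheusRule resources carrying this label (assumed label) |
|   ruleSelector: |
|     matchLabels: |
|       role: alert-rules |
|   # Sends firing alerts to the Alertmanager instances behind this service |
|   alerting: |
|     alertmanagers: |
|       - namespace: monitoring # hypothetical namespace |
|         name: alertmanager # hypothetical service name |
|         port: web # assumed port name |
| ---- |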
| |
| More information can be found in the Prometheus Operator https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/user-guides/alerting.md[Alerting] user guide. |
| You can also find more details in https://docs.openshift.com/container-platform/4.4/monitoring/monitoring-your-own-services.html#creating-alerting-rules_monitoring-your-own-services[Creating alerting rules] from the OpenShift documentation. |