[[operator-monitoring]]
= Camel K Operator Monitoring
NOTE: The Camel K monitoring architecture relies on https://prometheus.io[Prometheus] and the eponymous operator. Make sure you've checked the xref:observability/monitoring.adoc#prerequisites[Camel K monitoring prerequisites].
[[installation]]
== Installation
The `kamel install` command provides the `--monitoring` option flag, which automatically creates the default resources required to monitor the Camel K operator, e.g.:
[source,sh]
----
$ kamel install --monitoring=true
----
This creates:
* a `PodMonitor` resource targeting the operator _metrics_ endpoint, so that the Prometheus server can scrape the <<metrics>> exposed by the operator;
* a `PrometheusRule` resource with default alerting rules based on the exposed metrics. The <<alerting>> section provides more details about these default rules.
The `kamel install` command also provides the `--monitoring-port` option, which changes the port of the operator monitoring endpoint, e.g.:
[source,sh]
----
$ kamel install --monitoring=true --monitoring-port=8888
----
You can refer to the <<discovery>> and <<alerting>> sections if you don't want to rely on the default monitoring configuration.
[[metrics]]
== Metrics
The Camel K operator monitoring endpoint exposes the following metrics:
.Camel K operator metrics
|===
|Name |Type |Description |Buckets |Labels
| `camel_k_reconciliation_duration_seconds`
| `HistogramVec`
| Reconciliation request duration
| 0.25s, 0.5s, 1s, 5s
| `namespace`, `group`, `version`, `kind`, `result`: `Reconciled`\|`Errored`\|`Requeued`, `tag`: `""`\|`PlatformError`\|`UserError`
| `camel_k_build_duration_seconds`
| `HistogramVec`
| Build duration
| 30s, 1m, 1.5m, 2m, 5m, 10m
| `result`: `Succeeded`\|`Error`
| `camel_k_build_recovery_attempts`
| `Histogram`
| Build recovery attempts
| 0, 1, 2, 3, 4, 5
| `result`: `Succeeded`\|`Error`
| `camel_k_build_queue_duration_seconds`
| `Histogram`
| Build queue duration
| 5s, 15s, 30s, 1m, 5m
| N/A
| `camel_k_integration_first_readiness_seconds`
| `Histogram`
| Time to first integration readiness
| 5s, 10s, 30s, 1m, 2m
| N/A
|===
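
Prometheus histograms are exposed as cumulative buckets: each `_bucket` series counts the observations less than or equal to its `le` bound, so the buckets listed above translate into `le` labels expressed in seconds. The following sketch uses made-up sample values, not real operator output, to derive the number of observations that fall into each individual bucket:

```shell
# Made-up scrape sample for illustration only (not actual operator output).
# Bucket bounds follow the 5s, 15s, 30s, 1m, 5m buckets, expressed in seconds.
sample='camel_k_build_queue_duration_seconds_bucket{le="5"} 2
camel_k_build_queue_duration_seconds_bucket{le="15"} 5
camel_k_build_queue_duration_seconds_bucket{le="30"} 7
camel_k_build_queue_duration_seconds_bucket{le="60"} 9
camel_k_build_queue_duration_seconds_bucket{le="300"} 10
camel_k_build_queue_duration_seconds_bucket{le="+Inf"} 10
camel_k_build_queue_duration_seconds_count 10'

# Buckets are cumulative, so the count for a single bucket is the
# difference with the previous bucket.
per_bucket=$(printf '%s\n' "$sample" | awk '
  index($1, "_bucket{") > 0 {
    split($1, parts, "\"")      # parts[2] holds the le bound
    printf "le=%s %d\n", parts[2], $2 - previous
    previous = $2
  }')

printf '%s\n' "$per_bucket"
```

In this sample, three builds waited between 5 and 15 seconds, and none waited longer than 5 minutes.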
[[discovery]]
== Discovery
A `PodMonitor` resource must be created for the Prometheus Operator to reconcile, so that the managed Prometheus instance can scrape the Camel K operator _metrics_ endpoint.
As an example, here is the `PodMonitor` resource created by the `kamel install --monitoring=true` command:
.operator-pod-monitor.yaml
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: camel-k-operator
  labels: # <1>
    ...
spec:
  selector:
    matchLabels: # <2>
      app: "camel-k"
      camel.apache.org/component: operator
  podMetricsEndpoints:
    - port: metrics
----
<1> The labels must match the `podMonitorSelector` field from the `Prometheus` resource
<2> This label selector matches the Camel K operator Deployment labels
The Prometheus Operator https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/user-guides/getting-started.md#related-resources[getting started] guide documents the discovery mechanism, as well as the relationship between the operator resources.
If your operator metrics are not discovered, you may want to refer to https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/troubleshooting.md#troubleshooting-servicemonitor-changes[Troubleshooting `ServiceMonitor` changes], which also applies to troubleshooting `PodMonitor` resources.
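
When troubleshooting discovery, keep in mind that a `matchLabels` selector only selects a pod when every listed key/value pair is present in the pod labels; extra pod labels are ignored. The following sketch mimics that rule locally. The `matches_selector` helper and the pod label sets are hypothetical, written as space-separated `key=value` pairs for illustration, and are not part of any tool:

```shell
# Illustrative helper: succeeds when every key=value pair of the selector
# (first argument) is present in the pod labels (second argument).
matches_selector() {
  selector_pairs=$1
  pod_labels=$2
  for pair in $selector_pairs; do
    case " $pod_labels " in
      *" $pair "*) ;;          # pair found, check the next one
      *) return 1 ;;           # a required label is missing or differs
    esac
  done
  return 0
}

# Label selector from the PodMonitor resource above.
selector='app=camel-k camel.apache.org/component=operator'

# Hypothetical pod label sets, for illustration only.
operator_pod='app=camel-k camel.apache.org/component=operator name=camel-k-operator'
other_pod='app=camel-k'

matches_selector "$selector" "$operator_pod" && echo "operator pod: selected"
matches_selector "$selector" "$other_pod" || echo "other pod: not selected"
```

The second pod is not selected because it lacks the `camel.apache.org/component` label, even though its `app` label matches.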
[[alerting]]
== Alerting
NOTE: The Prometheus Operator declares the `Alertmanager` resource that can be used to configure _Alertmanager_ instances, along with `Prometheus` instances. The following section assumes an `Alertmanager` resource already exists in your cluster.
A `PrometheusRule` resource can be created for the Prometheus Operator to reconcile, so that the managed Alertmanager instance can trigger alerts based on the metrics exposed by the Camel K operator.
As an example, here are the alerting rules defined in the `PrometheusRule` resource created by the `kamel install --monitoring=true` command:
.Camel K operator alerts
|===
|Name |Severity |Description
| `CamelKReconciliationDuration`
| warning
| More than 10% of the reconciliation requests have their duration above 0.5s over at least 1 min.
| `CamelKReconciliationFailure`
| warning
| More than 1% of the reconciliation requests have failed over at least 10 min.
| `CamelKSuccessBuildDuration2m`
| warning
| More than 10% of the successful builds have their duration above 2 min over at least 1 min.
| `CamelKSuccessBuildDuration5m`
| critical
| More than 1% of the successful builds have their duration above 5 min over at least 1 min.
| `CamelKBuildError`
| critical
| More than 1% of the builds have errored over at least 10 min.
| `CamelKBuildQueueDuration1m`
| warning
| More than 1% of the builds have been queued for more than 1 min over at least 1 min.
| `CamelKBuildQueueDuration5m`
| critical
| More than 1% of the builds have been queued for more than 5 min over at least 1 min.
|===
You can also register your own `PrometheusRule` resources, so that Alertmanager instances can trigger alerts based on them, e.g.:
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: camel-k-alerts
spec:
  groups:
    - name: camel-k-alerts
      rules:
        - alert: CamelKIntegrationTimeToReadiness
          expr: |
            (
              1 - sum(rate(camel_k_integration_first_readiness_seconds_bucket{le="60"}[5m])) by (job)
              /
              sum(rate(camel_k_integration_first_readiness_seconds_count[5m])) by (job)
            )
            * 100
            > 10
          for: 1m
          labels:
            severity: warning
          annotations:
            message: |
              {{ printf "%0.0f" $value }}% of the integrations
              for {{ $labels.job }} have their first time to readiness above 1m.
----
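
The expression in this rule computes the percentage of integrations whose first readiness took longer than 60 seconds: the `le="60"` bucket rate divided by the total rate gives the share at or under one minute, and one minus that share is the remainder. The following sketch reproduces the arithmetic with made-up per-second rates, only to illustrate when the `> 10` threshold is crossed:

```shell
# Made-up 5m rates for illustration (observations per second).
bucket_le_60_rate=0.17   # integrations that became ready within 60s
count_rate=0.20          # all integrations that became ready

# Same arithmetic as the alert expression: (1 - bucket / count) * 100,
# rounded for display like the printf "%0.0f" in the rule annotation.
percent_above_1m=$(awk -v b="$bucket_le_60_rate" -v c="$count_rate" \
  'BEGIN { printf "%.0f", (1 - b / c) * 100 }')

# The alert fires once this condition has held for the `for` duration (1m).
if [ "$percent_above_1m" -gt 10 ]; then
  echo "${percent_above_1m}% above 1m: alert condition met"
else
  echo "${percent_above_1m}% above 1m: alert condition not met"
fi
```

With these sample rates, 15% of the integrations take more than one minute to become ready, which exceeds the 10% threshold.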
More information can be found in the Prometheus Operator https://github.com/coreos/prometheus-operator/blob/v0.38.0/Documentation/user-guides/alerting.md[Alerting] user guide.
You can also find more details in https://docs.openshift.com/container-platform/4.4/monitoring/monitoring-your-own-services.html#creating-alerting-rules_monitoring-your-own-services[Creating alerting rules] from the OpenShift documentation.