blob: 68ea05f3c7bd5d4a5ec9648eeb8e8331a4503e99 [file] [log] [blame]
[[operating]]
= Operating
NOTE: The following guide uses the terminology from the https://sre.google/sre-book/service-level-objectives/[Site Reliability Engineer] book.
The Camel K operator exposes a monitoring endpoint, that publishes xref:observability/operator.adoc#metrics[metrics] indicating the _level of service_ provided to its users.
These metrics materialize the Service Level Indicators (SLIs) for the Camel K operator.
Service Level Objectives (SLOs) can be defined based on these SLIs.
The xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator query the SLIs corresponding metrics, and match the SLOs for the Camel K operator, so that they fire up as soon as the _level of service_ is not met, and preemptive measures can be taken before beaching the Service Level Agreement (SLA) for the Camel K operator.
[[operator-sops]]
== Operator SOPs
The following section lists the Standard Operating Procedures (SOPs), corresponding to the xref:observability/operator.adoc#alerting[default alerts], created for the Camel K operator.
It assumes the operator has been installed according to the xref:observability/operator.adoc#installation[installation] section from the operator monitoring documentation.
It documents the recommended troubleshooting actions to be performed when a particular alert fires.
It is meant to be a living document, to be improved iteratively over time, as users face problematic situations, and actions to troubleshoot and solve them are perfected.
NOTE: The commands in the following section rely on the `jq` tool, to process the output of the `kubectl` commands. You can refer to the https://stedolan.github.io/jq/download/[download] instructions from the tool Website.
=== CamelKReconciliationDuration
==== Description
This alert has severity level of "warning".
It's firing when more than 10% of the reconciliation requests have their duration above 0.5s.
==== Troubleshooting
* Check the `rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])` SLI, and identify the resource kinds for which the duration is longer than 0.5s.
* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
=== CamelKReconciliationFailure
==== Description
This alert has severity level of "warning".
It's firing when some reconciliation requests have failed.
==== Troubleshooting
* Check the `camel_k_reconciliation_duration_seconds_count{result="Errored"}` SLI, and identify the `kind` label(s) for which the value is not zero.
* Search the operator logs for errors, e.g.:
+
[source,sh]
----
$ kubectl logs deployment/camel-k-operator --since=1h \
| jq -R 'fromjson?
| select(.level == "error")'
----
Check the `error`, `errorVerbose` and `stacktrace` fields.
* Inspect the resources corresponding to the errors, e.g.:
+
[source,sh]
----
$ kubectl logs deployment/camel-k-operator --since=1h \
| jq -rR 'fromjson?
| select(.level == "error")
| [{namespace, name, controller}]
| unique
| .[]
| "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' \
| xargs kubectl describe
----
Check the resource specification and events.
* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
=== CamelKSuccessBuildDuration2m
==== Description
This alert has severity level of "warning".
It's firing when more than 10% of the successful builds have their duration above 2 min.
==== Troubleshooting
* Inspect the successful Builds whose duration is longer than 2 minutes, e.g.:
+
[source,sh]
----
$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Succeeded")
| select(.status.duration
| "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss")
| mktime > 120)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs kubectl describe
----
Check the resource specification and events.
* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
=== CamelKSuccessBuildDuration5m
=== Description
This alert has severity level of "critical".
It's firing when more than 1% of the successful builds have their duration above 5 min.
==== Troubleshooting
* Inspect the successful Builds whose duration is longer than 5 minutes, e.g.:
+
[source,sh]
----
$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Succeeded")
| select(.status.duration
| "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss")
| mktime > 300)
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs kubectl describe
----
Check the resource specification and events.
* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.
=== CamelKBuildError
=== Description
This alert has severity level of "critical".
It's firing when more than 1% of the builds have errored over at least 10 min.
==== Troubleshooting
* Inspect the errored Builds, e.g.:
+
[source,sh]
----
$ kubectl get builds.camel.apache.org -o json \
| jq -r '.items[]
| select(.status.phase == "Error")
| "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \
| xargs kubectl get -o jsonpath='{.metadata.namespace}{"/"}{.metadata.name}{"\nError: "}{.status.error}{"\n"}'
----
Check the resource specification and events.
* Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future.