| [[operating]] |
| = Operating |
| |
| NOTE: The following guide uses the terminology from the https://sre.google/sre-book/service-level-objectives/[Site Reliability Engineering] book. |
| |
| The Camel K operator exposes a monitoring endpoint that publishes xref:observability/operator.adoc#metrics[metrics] indicating the _level of service_ provided to its users. |
| These metrics materialize the Service Level Indicators (SLIs) for the Camel K operator. |
| |
| Service Level Objectives (SLOs) can be defined based on these SLIs. |
| The xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator query the metrics corresponding to the SLIs, and match the SLOs for the Camel K operator. |
| They fire as soon as the _level of service_ is not met, so that preemptive measures can be taken before breaching the Service Level Agreement (SLA) for the Camel K operator. |
| |
| [[operator-sops]] |
| == Operator SOPs |
| |
| The following section lists the Standard Operating Procedures (SOPs) corresponding to the xref:observability/operator.adoc#alerting[default alerts] created for the Camel K operator. |
| It assumes the operator has been installed according to the xref:observability/operator.adoc#installation[installation] section from the operator monitoring documentation. |
| |
| It documents the recommended troubleshooting actions to be performed when a particular alert fires. |
| It is meant to be a living document, to be improved iteratively over time, as users face problematic situations, and actions to troubleshoot and solve them are perfected. |
| |
| NOTE: The commands in the following section rely on the `jq` tool to process the output of the `kubectl` commands. You can refer to the https://stedolan.github.io/jq/download/[download] instructions on the tool's website. |
| |
| === CamelKReconciliationDuration |
| |
| ==== Description |
| |
| This alert has a severity level of "warning". |
| It fires when more than 10% of the reconciliation requests take longer than 0.5s. |
| |
| ==== Troubleshooting |
| |
| * Check the `rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m])` SLI, and identify the resource kinds for which the duration is longer than 0.5s. |
| |
| * Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future. |
| |
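To identify the resource kinds mentioned in the first step above, one option is to compute, per `kind`, the fraction of reconciliation requests slower than 0.5s over the last 5 minutes, and to query it through the Prometheus HTTP API.
The following is a minimal sketch rather than part of the default setup. It assumes that the histogram carries the `kind` label, and that Prometheus is reachable at `http://localhost:9090`, for instance through a port-forward. The `PROMETHEUS_URL` variable and the `slow_reconciliation_ratio` function are hypothetical names.
Calling `slow_reconciliation_ratio` prints one line per kind; kinds with a ratio above 0.1 exceed the 10% threshold of the alert.

[source,sh]
----
# Assumption: Prometheus is reachable at this URL; adjust to your environment.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Fraction of reconciliation requests slower than 0.5s over the last 5 minutes,
# grouped by resource kind (assumes the histogram exposes a `kind` label).
QUERY='1 - sum by (kind) (rate(camel_k_reconciliation_duration_seconds_bucket{le="0.5"}[5m]))
  / sum by (kind) (rate(camel_k_reconciliation_duration_seconds_count[5m]))'

# Prints one "<kind><TAB><ratio>" line per resource kind.
slow_reconciliation_ratio() {
  curl -sG "${PROMETHEUS_URL}/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[] | "\(.metric.kind)\t\(.value[1])"'
}
----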
| === CamelKReconciliationFailure |
| |
| ==== Description |
| |
| This alert has a severity level of "warning". |
| It fires when some reconciliation requests have failed. |
| |
| ==== Troubleshooting |
| |
| * Check the `camel_k_reconciliation_duration_seconds_count{result="Errored"}` SLI, and identify the `kind` label(s) for which the value is not zero. |
| |
| * Search the operator logs for errors, e.g.: |
| + |
| [source,sh] |
| ---- |
| $ kubectl logs deployment/camel-k-operator --since=1h \ |
| | jq -R 'fromjson? |
| | select(.level == "error")' |
| ---- |
| Check the `error`, `errorVerbose` and `stacktrace` fields. |
| |
| * Inspect the resources corresponding to the errors, e.g.: |
| + |
| [source,sh] |
| ---- |
| $ kubectl logs deployment/camel-k-operator --since=1h \ |
| | jq -rR 'fromjson? |
| | select(.level == "error") |
| | [{namespace, name, controller}] |
| | unique |
| | .[] |
| | "-n \(.namespace) \(.controller | rtrimstr("-controller"))/\(.name)"' \ |
| | xargs kubectl describe |
| ---- |
| Check the resource specification and events. |
| |
| * Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future. |
| |
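Before inspecting individual resources, it can help to see which controller reports the most errors.
The following is a minimal sketch that reuses the `level` and `controller` log fields from the commands above. The `count_errors_by_controller` function name is hypothetical, and entries without a `controller` field are counted as `unknown`.
It reads log lines on standard input, for example from `kubectl logs deployment/camel-k-operator --since=1h | count_errors_by_controller`.

[source,sh]
----
# Reads operator log lines on stdin and prints the number of error entries
# per controller, most frequent first. Lines that are not JSON are skipped.
count_errors_by_controller() {
  jq -rR 'fromjson?
    | select(.level == "error")
    | .controller // "unknown"' \
    | sort | uniq -c | sort -rn
}
----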
| === CamelKSuccessBuildDuration2m |
| |
| ==== Description |
| |
| This alert has a severity level of "warning". |
| It fires when more than 10% of the successful builds take longer than 2 min. |
| |
| ==== Troubleshooting |
| |
| * Inspect the successful Builds whose duration is longer than 2 minutes, e.g.: |
| + |
| [source,sh] |
| ---- |
| $ kubectl get builds.camel.apache.org -o json \ |
| | jq -r '.items[] |
| | select(.status.phase == "Succeeded") |
| | select(.status.duration |
| | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") |
| | mktime > 120) |
| | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ |
| | xargs kubectl describe |
| ---- |
| Check the resource specification and events. |
| |
| * Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future. |
| |
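The `jq` filter above converts the `.status.duration` value into whole seconds before comparing it with the threshold.
To check a single value by hand, the same conversion can be sketched in plain shell. This sketch assumes durations of the form `2m30.5s` or `45.3s`, that is, minutes and seconds only, which is also what the `jq` filter handles. The `duration_to_seconds` function name is hypothetical. Like the `jq` filter, it drops the fractional part.

[source,sh]
----
# Converts a duration such as "2m30.5s" or "45.3s" into whole seconds.
# Assumption: the value only contains minutes and seconds.
duration_to_seconds() {
  d=${1%s}      # drop the trailing "s" unit
  d=${d%%.*}    # drop the fractional seconds, if any
  case $d in
    *m*) minutes=${d%%m*}; seconds=${d#*m} ;;
    *)   minutes=0;        seconds=$d ;;
  esac
  echo $(( minutes * 60 + ${seconds:-0} ))
}

duration_to_seconds 2m30.5s   # prints 150
duration_to_seconds 45.3s     # prints 45
----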
| === CamelKSuccessBuildDuration5m |
| |
| ==== Description |
| |
| This alert has a severity level of "critical". |
| It fires when more than 1% of the successful builds take longer than 5 min. |
| |
| ==== Troubleshooting |
| |
| * Inspect the successful Builds whose duration is longer than 5 minutes, e.g.: |
| + |
| [source,sh] |
| ---- |
| $ kubectl get builds.camel.apache.org -o json \ |
| | jq -r '.items[] |
| | select(.status.phase == "Succeeded") |
| | select(.status.duration |
| | "01-Jan-1970 \(sub("(?<time>.*)\\..*"; "\(.time)s"))" | strptime("%d-%b-%Y %Mm%Ss")? // strptime("%d-%b-%Y %Ss") |
| | mktime > 300) |
| | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ |
| | xargs kubectl describe |
| ---- |
| Check the resource specification and events. |
| |
| * Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future. |
| |
| === CamelKBuildError |
| |
| ==== Description |
| |
| This alert has a severity level of "critical". |
| It fires when more than 1% of the builds have errored for at least 10 min. |
| |
| ==== Troubleshooting |
| |
| * Inspect the errored Builds, e.g.: |
| + |
| [source,sh] |
| ---- |
| $ kubectl get builds.camel.apache.org -o json \ |
| | jq -r '.items[] |
| | select(.status.phase == "Error") |
| | "-n \(.metadata.namespace) builds.camel.apache.org/\(.metadata.name)"' \ |
| | xargs kubectl get -o jsonpath='{.metadata.namespace}{"/"}{.metadata.name}{"\nError: "}{.status.error}{"\n"}' |
| ---- |
| Check the resource specification and events. |
| |
| * Improve this SOP if there's anything missing, and contact engineering if there are any changes they could make to make this easier in the future. |