blob: b787aea6dfcfec796ace58e09ef1c8bcf4993e24 [file] [log] [blame] [view]
# AWS Cloud EKS monitoring
SkyWalking leverages OpenTelemetry Collector with [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to transfer the metrics to
[OpenTelemetry receiver](opentelemetry-receiver.md) and into the [Meter System](./../../concepts-and-designs/mal.md).
### Data flow
1. OpenTelemetry Collector fetches metrics from EKS via [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) and pushes metrics to SkyWalking OAP Server via OpenTelemetry gRPC exporter.
2. The SkyWalking OAP Server parses the expression with [MAL](../../concepts-and-designs/mal.md) to filter/calculate/aggregate and store the results.
### Set up
1. Deploy [amazon/aws-otel-collector](https://hub.docker.com/r/amazon/aws-otel-collector) with [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) to EKS
2. Config SkyWalking [OpenTelemetry receiver](opentelemetry-receiver.md).
Read [Monitoring AWS EKS and S3 with SkyWalking](https://skywalking.apache.org/blog/2023-03-12-skywalking-aws-s3-eks/) for more details
### EKS Monitoring
[AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) provides multiple dimensions metrics for EKS cluster, node, service, etc.
Accordingly, SkyWalking observes the status, and payload of the EKS cluster, which is cataloged as a `LAYER: AWS_EKS` `Service` in the OAP. Meanwhile, the k8s nodes would be recognized as `LAYER: AWS_EKS` `instance`s. The k8s service would be recognized as `endpoint`s.
#### Specify Job Name
SkyWalking distinguishes AWS Cloud EKS metrics by attributes `job_name`, which value is `aws-cloud-eks-monitoring`.
You could leverage OTEL Collector processor to add the attribute as follows:
```yaml
processors:
resource/job-name:
attributes:
- key: job_name
value: aws-cloud-eks-monitoring
action: insert
```
Notice, if you don't specify `job_name` attribute, SkyWalking OAP will ignore the metrics
#### Supported Metrics
| Monitoring Panel | Unit | Metric Name | Catalog | Description | Data Source |
|---------------------------------------|---------|--------------------------------------------|------------|--------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Node Count | | eks_cluster_node_count | Service | The node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Failed Node Count | | eks_cluster_failed_node_count | Service | The failed node count of the EKS cluster | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (namespace dimension) | | eks_cluster_namespace_count | Service | The count of pod in the EKS cluster(namespace dimension) | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count (service dimension) | | eks_cluster_service_count | Service | The count of pod in the EKS cluster(service dimension) | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Dropped Count (per second) | count/s | eks_cluster_net_rx_dropped | Service | Network RX dropped count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count (per second) | count/s | eks_cluster_net_rx_error | Service | Network RX error count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Dropped Count (per second) | count/s | eks_cluster_net_rx_dropped | Service | Network TX dropped count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count (per second) | count/s | eks_cluster_net_rx_error | Service | Network TX error count | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Pod Count | | eks_cluster_node_pod_number | Instance | The count of pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_node_cpu_utilization | Instance | The CPU Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_node_memory_utilization | Instance | The Memory Utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_node_net_rx_bytes | Instance | Network RX bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_node_net_rx_bytes | Instance | Network RX error count of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX | bytes/s | eks_cluster_node_net_rx_bytes | Instance | Network TX bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_node_net_rx_bytes | Instance | Network TX error count of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Disk IO Write | bytes/s | eks_cluster_node_net_rx_bytes | Instance | The IO write bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Disk IO Read | bytes/s | eks_cluster_node_net_rx_bytes | Instance | The IO read bytes of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| FS Utilization | percent | eks_cluster_node_net_rx_bytes | Instance | The filesystem utilization of the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_node_pod_cpu_utilization | Instance | The CPU Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_node_pod_memory_utilization | Instance | The Memory Utilization of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_node_pod_net_rx_bytes | Instance | Network RX bytes of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_node_pod_net_rx_error | Instance | Network RX error count of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX | bytes/s | eks_cluster_node_pod_net_tx_bytes | Instance | Network RX bytes of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_node_pod_net_tx_error | Instance | Network RX error count of the pod running on the node | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| CPU Utilization | percent | eks_cluster_service_pod_cpu_utilization | Endpoint | The CPU Utilization of pod that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Memory Utilization | percent | eks_cluster_service_pod_memory_utilization | Endpoint | The Memory Utilization of pod that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX | bytes/s | eks_cluster_service_pod_net_rx_bytes | Endpoint | Network RX bytes of the pod that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network RX Error Count | count/s | eks_cluster_service_pod_net_rx_error | Endpoint | Network TX error count of the pod that belongs to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX | bytes/s | eks_cluster_service_pod_net_tx_bytes | Endpoint | Network TX bytes of the pod that belong to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
| Network TX Error Count | count/s | eks_cluster_node_pod_net_tx_error | Endpoint | Network TX error count of the pod that belongs to the service | [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) |
### Customizations
You can customize your own metrics/expression/dashboard panel.
The metrics definition and expression rules are found in `/config/otel-rules/aws-eks/`.
The AWS Cloud EKS dashboard panel configurations are found in `/config/ui-initialized-templates/aws_eks`.
### OTEL Configuration Sample With AWS Container Insights Receiver
```yaml
extensions:
health_check:
receivers:
awscontainerinsightreceiver:
processors:
resource/job-name:
attributes:
- key: job_name
value: aws-cloud-eks-monitoring
action: insert
exporters:
otlp:
endpoint: oap-service:11800
tls:
insecure: true
logging:
loglevel: debug
service:
pipelines:
metrics:
receivers: [awscontainerinsightreceiver]
processors: [resource/job-name]
exporters: [otlp,logging]
extensions: [health_check]
```
Refer to [AWS Container Insights Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/awscontainerinsightreceiver/README.md) for more information