Profiling Module

The profiling module profiles the processes found by the Process Discovery Module and sends the snapshots to the backend server.

Configuration

| Name | Default | Environment Key | Description |
|------|---------|-----------------|-------------|
| profiling.active | true | ROVER_PROFILING_ACTIVE | Whether process profiling is active. |
| profiling.check_interval | 10s | ROVER_PROFILING_CHECK_INTERVAL | The interval for checking profiling tasks. |
| profiling.flush_interval | 5s | ROVER_PROFILING_FLUSH_INTERVAL | The interval for combining existing profiling data and reporting it to the backend. |
| profiling.task.on_cpu.dump_period | 9ms | ROVER_PROFILING_TASK_ON_CPU_DUMP_PERIOD | The profiling stack dump period. |
| profiling.task.network.report_interval | 2s | ROVER_PROFILING_TASK_NETWORK_TOPOLOGY_REPORT_INTERVAL | The interval for sending metrics to the backend. |
| profiling.task.network.meter_prefix | rover_net_p | ROVER_PROFILING_TASK_NETWORK_TOPOLOGY_METER_PREFIX | The prefix of network profiling metric names. |
| profiling.task.network.protocol_analyze.per_cpu_buffer | 400KB | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_PER_CPU_BUFFER | The size of the socket data buffer on each CPU. |
| profiling.task.network.protocol_analyze.parallels | 2 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_PARALLELS | The number of parallel protocol analyzers. |
| profiling.task.network.protocol_analyze.queue_size | 5000 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_QUEUE_SIZE | The queue size of each parallel analyzer. |
| profiling.task.network.protocol_analyze.sampling.http.default_request_encoding | UTF-8 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_SAMPLING_HTTP_DEFAULT_REQUEST_ENCODING | The default body encoding when sampling the request. |
| profiling.task.network.protocol_analyze.sampling.http.default_response_encoding | UTF-8 | ROVER_PROFILING_TASK_NETWORK_PROTOCOL_ANALYZE_SAMPLING_HTTP_DEFAULT_RESPONSE_ENCODING | The default body encoding when sampling the response. |
| profiling.continuous.meter_prefix | rover_con_p | ROVER_PROFILING_CONTINUOUS_METER_PREFIX | The name prefix of the continuous profiling meters. |
| profiling.continuous.fetch_interval | 1s | ROVER_PROFILING_CONTINUOUS_FETCH_INTERVAL | The interval for fetching metrics from the system, such as Process CPU and System Load. |
| profiling.continuous.check_interval | 5s | ROVER_PROFILING_CONTINUOUS_CHECK_INTERVAL | The interval for checking whether metrics have reached their thresholds. |
| profiling.continuous.trigger.execute_duration | 10m | ROVER_PROFILING_CONTINUOUS_TRIGGER_EXECUTE_DURATION | The duration of the profiling task. |
| profiling.continuous.trigger.silence_duration | 20m | ROVER_PROFILING_CONTINUOUS_TRIGGER_SILENCE_DURATION | The minimum duration between executions of the same profiling task. |
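
As an illustration of how the Default and Environment Key columns relate, the following Python sketch resolves a setting from the environment and falls back to the documented default. It assumes that a set environment key overrides the default and that durations use only the `ms`, `s`, `m`, and `h` suffixes seen in the table; `parse_duration` and `resolve` are hypothetical helper names, not part of Rover.

```python
import os
import re

# Seconds per unit for the suffixes that appear in the configuration table.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a string such as '9ms', '10s', or '20m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def resolve(env_key, default, env=None):
    """Return the environment value when set, otherwise the documented default."""
    env = os.environ if env is None else env
    return env.get(env_key, default)


# No override present: the documented default of 10s applies.
check_interval = parse_duration(
    resolve("ROVER_PROFILING_CHECK_INTERVAL", "10s", env={})
)

# Override present (assumed behavior): the environment value wins.
dump_period = parse_duration(
    resolve(
        "ROVER_PROFILING_TASK_ON_CPU_DUMP_PERIOD",
        "9ms",
        env={"ROVER_PROFILING_TASK_ON_CPU_DUMP_PERIOD": "20ms"},
    )
)
```

Passing an explicit `env` mapping keeps the example independent of the real process environment; omitting it reads `os.environ`.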

Profiling Type

All profiling tasks use the official Linux functions, together with kprobe or uprobe, to open perf events, and attach eBPF programs to dump stacks.

On CPU

The On CPU profiling task uses PERF_COUNT_SW_CPU_CLOCK to profile the process against the CPU clock.
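
To show how periodically dumped stacks can be turned into a profile, the sketch below counts identical stacks and estimates CPU time per stack. It assumes each sample stands for roughly one dump period (9ms by default); the frame names and the `aggregate` and `estimated_cpu_ms` helpers are hypothetical, and this is not Rover's actual aggregation code.

```python
from collections import Counter

DUMP_PERIOD_MS = 9  # default of profiling.task.on_cpu.dump_period


def aggregate(samples):
    """Count identical stacks; each sample is a tuple of frames, root first."""
    return Counter(samples)


def estimated_cpu_ms(count, period_ms=DUMP_PERIOD_MS):
    """Approximate CPU time: one sample represents about one dump period."""
    return count * period_ms


# Hypothetical stack samples as they might be dumped on each clock tick.
samples = [
    ("main", "handle", "parse"),
    ("main", "handle", "parse"),
    ("main", "handle", "write"),
]
stacks = aggregate(samples)
parse_ms = estimated_cpu_ms(stacks[("main", "handle", "parse")])
```

The estimate is statistical: short functions can be missed entirely, and accuracy improves as more samples accumulate.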

Off CPU

The Off CPU profiling task attaches a kprobe to finish_task_switch to profile the process.
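
A common way to derive off-CPU time from context-switch events is to remember when a thread leaves the CPU and measure the gap when it is scheduled again. The sketch below simulates that bookkeeping in plain Python; the event tuple layout and the `off_cpu_time` function are assumptions for illustration, and whether Rover's eBPF program follows exactly this scheme is not stated in this document.

```python
from collections import defaultdict


def off_cpu_time(switch_events):
    """Sum off-CPU nanoseconds per thread.

    switch_events: iterable of (timestamp_ns, prev_tid, next_tid) tuples,
    one per context switch, ordered by time.
    """
    switched_out_at = {}
    totals = defaultdict(int)
    for ts, prev_tid, next_tid in switch_events:
        # The previous thread leaves the CPU at this timestamp.
        switched_out_at[prev_tid] = ts
        # If the next thread was seen leaving earlier, it has been off CPU since then.
        start = switched_out_at.pop(next_tid, None)
        if start is not None:
            totals[next_tid] += ts - start
    return dict(totals)


# Thread 1 runs, is replaced by thread 2 at t=100, returns at t=250, leaves again at t=400.
events = [(100, 1, 2), (250, 2, 1), (400, 1, 2)]
result = off_cpu_time(events)
```

Threads that are switched in without a recorded switch-out (such as thread 2 at t=100) contribute nothing until their first full off-CPU interval is observed.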

Network

The Network profiling task intercepts IO-related syscalls and uprobes in the process to identify network traffic and generate metrics. The following protocols are supported for analysis over the OpenSSL library, the BoringSSL library, GoTLS, NodeTLS, or plaintext:

  1. HTTP/1.x
  2. HTTP/2
  3. MySQL
  4. CQL(The Cassandra Query Language)
  5. MongoDB
  6. Kafka
  7. DNS

Collecting data

Network profiling sends metrics and logs to the backend service.

Data Type

Network profiling defines the following three types of metrics to represent the network data:

  1. Counter: Records the total amount of data in a given period of time. Each counter contains the following data:
    1. Count: The number of executions.
    2. Bytes: The packet size of the execution.
    3. Exe Time: The time consumed by the execution, in nanoseconds.
  2. Histogram: Records the distribution of the data across buckets.
  3. TopN: Records the highest-latency data in a given period of time.
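
The three data types above can be pictured with the following minimal Python sketch. The class layouts, field names, and bucket semantics (ascending lower bounds) are assumptions made for illustration; they are not Rover's actual data structures or wire format.

```python
import bisect
import heapq
from dataclasses import dataclass, field


@dataclass
class Counter:
    count: int = 0
    total_bytes: int = 0
    exe_time_ns: int = 0

    def record(self, size, duration_ns):
        """Add one execution with its packet size and consumed time."""
        self.count += 1
        self.total_bytes += size
        self.exe_time_ns += duration_ns


@dataclass
class Histogram:
    bounds: list  # ascending lower bound of each bucket (assumed layout)
    buckets: list = field(default_factory=list)

    def __post_init__(self):
        self.buckets = [0] * len(self.bounds)

    def record(self, value):
        """Increment the last bucket whose lower bound is <= value."""
        index = max(bisect.bisect_right(self.bounds, value) - 1, 0)
        self.buckets[index] += 1


@dataclass
class TopN:
    n: int
    items: list = field(default_factory=list)  # min-heap of (latency, trace_id)

    def record(self, latency, trace_id):
        """Keep only the n highest-latency entries."""
        heapq.heappush(self.items, (latency, trace_id))
        if len(self.items) > self.n:
            heapq.heappop(self.items)  # discard the lowest latency

    def highest(self):
        """Return the kept entries, highest latency first."""
        return sorted(self.items, reverse=True)
```

For example, a `TopN(n=2)` that records latencies 30, 10, and 50 keeps only the entries for 50 and 30.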

Labels

Each metric contains the following labels to identify the process relationship:

| Name | Type | Description |
|------|------|-------------|
| client_process_id or server_process_id | string | The ID of the current process, determined by the role of the current process in the connection as server or client. |
| client_local or server_local | boolean | Whether the remote process is a local process. |
| client_address or server_address | string | The remote process address, for example IP:port. |
| side | enum | The current process is either "client" or "server" in this connection. |
| protocol | string | The protocol identified from the packet data content. |
| is_ssl | bool | Whether the current connection uses SSL. |

Layer-4 Data

Based on the Counter and Histogram data types above, the following metrics are provided.

| Name | Type | Unit | Description |
|------|------|------|-------------|
| write | Counter | nanosecond | The socket write counter |
| read | Counter | nanosecond | The socket read counter |
| write RTT | Counter | microsecond | The socket write RTT counter |
| connect | Counter | nanosecond | The counter of socket connect/accept with another server/client |
| close | Counter | nanosecond | The socket close counter |
| retransmit | Counter | nanosecond | The socket retransmitted packet counter |
| drop | Counter | nanosecond | The socket dropped packet counter |
| write RTT | Histogram | microsecond | The socket write RTT execution time histogram |
| write execute time | Histogram | nanosecond | The socket write data execution time histogram |
| read execute time | Histogram | nanosecond | The socket read data execution time histogram |
| connect execute time | Histogram | nanosecond | The execution time histogram of socket connect/accept with another server/client |
| close execute time | Histogram | nanosecond | The socket close execution time histogram |

HTTP/1.x Data

Metrics

| Name | Type | Unit | Description |
|------|------|------|-------------|
| http1_request_cpm | Counter | count | The HTTP request counter |
| http1_response_status_cpm | Counter | count | The count per HTTP response code |
| http1_request_package_size | Histogram | Byte size | The request package size |
| http1_response_package_size | Histogram | Byte size | The response package size |
| http1_client_duration | Histogram | millisecond | The duration of a single HTTP response on the client side |
| http1_server_duration | Histogram | millisecond | The duration of a single HTTP response on the server side |

Logs

| Name | Type | Unit | Description |
|------|------|------|-------------|
| slow_traces | TopN | millisecond | The Top N slow trace(id)s |
| status_4xx | TopN | millisecond | The Top N trace(id)s with response status in 400-499 |
| status_5xx | TopN | millisecond | The Top N trace(id)s with response status in 500-599 |

Span Attached Event

| Name | Description |
|------|-------------|
| HTTP Request Sampling | Complete information about the HTTP request. It is only reported when it matches slow/4xx/5xx traces. |
| HTTP Response Sampling | Complete information about the HTTP response. It is only reported when it matches slow/4xx/5xx traces. |
| Syscall xxx | The method used when the process invokes a network-related syscall. It is only reported when it matches slow/4xx/5xx traces. |
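
The slow/4xx/5xx selection referred to above can be sketched as three Top N rankings over the traces seen in one period. The sketch assumes that all three lists are ranked by duration (their unit is millisecond) and that a trace is a `(trace_id, status_code, duration_ms)` tuple; the `top_n_logs` function and the sample trace IDs are hypothetical.

```python
import heapq


def top_n_logs(traces, n):
    """Build the three TopN lists from (trace_id, status_code, duration_ms) tuples."""
    traces = list(traces)

    def slowest(items):
        # Highest duration first, at most n entries.
        return heapq.nlargest(n, items, key=lambda t: t[2])

    return {
        "slow_traces": slowest(traces),
        "status_4xx": slowest(t for t in traces if 400 <= t[1] <= 499),
        "status_5xx": slowest(t for t in traces if 500 <= t[1] <= 599),
    }


traces = [
    ("t1", 200, 120),
    ("t2", 404, 30),
    ("t3", 500, 900),
    ("t4", 503, 15),
    ("t5", 200, 5),
]
logs = top_n_logs(traces, n=2)
# Trace IDs that appear in any list are the ones whose sampling events would be reported.
sampled_ids = {t[0] for group in logs.values() for t in group}
```

With these sample traces, `t5` is fast and successful, so it appears in no list and would not produce a span attached event.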

Continuous Profiling

The continuous profiling feature monitors target process information, such as process CPU usage and network requests, with low overhead, based on configuration passed from the backend. When a threshold is met, it automatically starts a profiling task (On CPU, Off CPU, or Network) to provide more detailed analysis.
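
The interplay between a threshold and the `execute_duration` and `silence_duration` settings can be sketched as follows. This assumes that a threshold is "met" when the value is greater than or equal to it, and that the silence period is measured from the start of the previous task; both are assumptions, and the `Trigger` class is hypothetical rather than Rover's implementation.

```python
class Trigger:
    """Decide when a profiling task may start for one monitored item."""

    def __init__(self, execute_s=600, silence_s=1200):
        self.execute_s = execute_s  # profiling.continuous.trigger.execute_duration (10m)
        self.silence_s = silence_s  # profiling.continuous.trigger.silence_duration (20m)
        self.last_start = None

    def should_start(self, value, threshold, now):
        """Return True when the threshold is met and the silence period has passed."""
        if value < threshold:
            return False
        if self.last_start is not None and now - self.last_start < self.silence_s:
            return False  # the same task ran too recently
        self.last_start = now
        return True

    def ends_at(self):
        """Time at which the most recently started task finishes, or None."""
        if self.last_start is None:
            return None
        return self.last_start + self.execute_s


trigger = Trigger()
first = trigger.should_start(value=95, threshold=90, now=0)      # threshold met
second = trigger.should_start(value=95, threshold=90, now=600)   # still silenced
third = trigger.should_start(value=95, threshold=90, now=1200)   # silence elapsed
```

Times are plain seconds supplied by the caller, which keeps the logic easy to test without a real clock.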

Monitor Type

System Load

Monitor the average system load for the last minute, which is equivalent to using the first value of the load average in the uptime command.
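
For reference, the same one-minute value can be read in Python with the standard library call `os.getloadavg()`, which returns the 1, 5, and 15 minute load averages. This only illustrates the metric; it is not how Rover collects it, and `one_minute_load` is a hypothetical helper name.

```python
import os


def one_minute_load():
    """Return the 1-minute load average, or None if the platform cannot provide it."""
    try:
        return os.getloadavg()[0]
    except (OSError, AttributeError):
        # OSError: the value is unobtainable; AttributeError: unsupported platform.
        return None


load = one_minute_load()
```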

Process CPU

The percentage of the current host's CPU that the target process uses.

Process Thread Count

The real-time number of threads in the target process.

Network

Network monitoring uses eBPF technology to collect real-time performance data of the current process responding to requests. Requests sent upstream are not monitored by the system.

Currently, network monitoring supports parsing of the HTTP/1.x protocol and supports the following types of monitoring:

  1. Error Rate: The percentage of network requests that failed. For example, HTTP status codes in the range [500, 600) are considered errors.
  2. Avg Response Time: The average response time (ms) for the specified URI.
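
Both monitoring types reduce to simple arithmetic over the responses observed in one check period, as the sketch below shows. It assumes exact URI matching and returns 0.0 when there is no data; the function names and sample values are hypothetical.

```python
def error_rate(status_codes):
    """Percentage (0-100) of responses with a status code in [500, 600)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code < 600)
    return 100.0 * errors / len(status_codes)


def avg_response_time(requests, uri):
    """Average duration in ms of the requests whose URI equals the given one."""
    durations = [ms for request_uri, ms in requests if request_uri == uri]
    if not durations:
        return 0.0
    return sum(durations) / len(durations)


rate = error_rate([200, 500, 503, 404])
avg = avg_response_time([("/a", 10), ("/a", 30), ("/b", 100)], "/a")
```

Note that the 404 response counts toward the total but not toward the errors, since only the 5xx range is treated as erroneous here.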

Metrics

Rover periodically sends the collected monitoring data to the backend using the Native Meter Protocol.

| Name | Unit | Description |
|------|------|-------------|
| process_cpu | (0-100)% | The CPU usage percentage |
| process_thread_count | count | The thread count of the process |
| system_load | count | The average system load for the last minute; every process has the same value |
| http_error_rate | (0-100)% | The network request error rate percentage |
| http_avg_response_time | ms | The average network response duration |