Flink has a monitoring API that can be used to query status and statistics of running jobs, as well as recent completed jobs. This monitoring API is used by Flink's own dashboard, but is designed to be used also by custom monitoring tools.
The monitoring API is a REST-ful API that accepts HTTP requests and responds with JSON data.
The monitoring API is backed by a web server that runs as part of the Dispatcher. By default, this server listens at post 8081
, which can be configured in flink-conf.yaml
via rest.port
. Note that the monitoring API web server and the web dashboard web server are currently the same and thus run together at the same port. They respond to different HTTP URLs, though.
In the case of multiple Dispatchers (for high availability), each Dispatcher will run its own instance of the monitoring API, which offers information about completed and running job while that Dispatcher was elected the cluster leader.
The REST API backend is in the flink-runtime
project. The core class is org.apache.flink.runtime.webmonitor.WebMonitorEndpoint
, which sets up the server and the request routing.
We use Netty and the Netty Router library to handle REST requests and translate URLs. This choice was made because this combination has lightweight dependencies, and the performance of Netty HTTP is very good.
To add new requests, one needs to
MessageHeaders
class which serves as an interface for the new request,AbstractRestHandler
class which handles the request according to the added MessageHeaders
class,org.apache.flink.runtime.webmonitor.WebMonitorEndpoint#initializeHandlers()
.A good example is the org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler
that uses the org.apache.flink.runtime.rest.messages.JobExceptionsHeaders
.
The REST API is versioned, with specific versions being queryable by prefixing the url with the version prefix. Prefixes are always of the form v[version_number]
. For example, to access version 1 of /foo/bar
one would query /v1/foo/bar
.
If no version is specified Flink will default to the oldest version supporting the request.
Querying unsupported/non-existing versions will return a 404 error.
There exist several async operations among these APIs, e.g. trigger savepoint
, rescale a job
. They would return a triggerid
to identify the operation you just POST and then you need to use that triggerid
to query for the status of the operation.
{% include generated/rest_v1_dispatcher.html %}