

 <!DOCTYPE html>
 <html lang="en" dir="ltr">

 <head>


 <link rel="stylesheet" href="/bootstrap/css/bootstrap.min.css">
 <script src="/bootstrap/js/bootstrap.bundle.min.js"></script>
 <link rel="stylesheet" type="text/css" href="/font-awesome/css/font-awesome.min.css">
 <script src="/js/anchor.min.js"></script>
 <script src="/js/flink.js"></script>
 <link rel="canonical" href="https://flink.apache.org/2019/02/21/monitoring-apache-flink-applications-101/">

   <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <meta name="description" content="This blog post provides an introduction to Apache Flink’s built-in monitoring and metrics system, that allows developers to effectively monitor their Flink jobs. Oftentimes, the task of picking the relevant metrics to monitor a Flink application can be overwhelming for a DevOps team that is just starting with stream processing and Apache Flink. Having worked with many organizations that deploy Flink at scale, I would like to share my experience and some best practice with the community.">
 <meta name="theme-color" content="#FFFFFF"><meta property="og:title" content="Monitoring Apache Flink Applications 101" />
 <meta property="og:description" content="This blog post provides an introduction to Apache Flink’s built-in monitoring and metrics system, that allows developers to effectively monitor their Flink jobs. Oftentimes, the task of picking the relevant metrics to monitor a Flink application can be overwhelming for a DevOps team that is just starting with stream processing and Apache Flink. Having worked with many organizations that deploy Flink at scale, I would like to share my experience and some best practice with the community." />
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://flink.apache.org/2019/02/21/monitoring-apache-flink-applications-101/" /><meta property="article:section" content="posts" />
 <meta property="article:published_time" content="2019-02-21T12:00:00+00:00" />
 <meta property="article:modified_time" content="2019-02-21T12:00:00+00:00" />
 <title>Monitoring Apache Flink Applications 101 | Apache Flink</title>
 <link rel="manifest" href="/manifest.json">
 <link rel="icon" href="/favicon.png" type="image/x-icon">
 <link rel="stylesheet" href="/book.min.22eceb4d17baa9cdc0f57345edd6f215a40474022dfee39b63befb5fb3c596b5.css" integrity="sha256-IuzrTRe6qc3A9XNF7dbyFaQEdAIt/uObY777X7PFlrU=">
 <script defer src="/en.search.min.2698f0d1b683dae4d6cb071668b310a55ebcf1c48d11410a015a51d90105b53e.js" integrity="sha256-Jpjw0baD2uTWywcWaLMQpV688cSNEUEKAVpR2QEFtT4="></script>
 <!--
 Made with Book Theme
 https://github.com/alex-shpak/hugo-book
 -->

   <meta name="generator" content="Hugo 0.124.1">


     <script>
       var _paq = window._paq = window._paq || [];


       _paq.push(['disableCookies']);

       _paq.push(["setDomains", ["*.flink.apache.org","*.nightlies.apache.org/flink"]]);
       _paq.push(['trackPageView']);
       _paq.push(['enableLinkTracking']);
       (function() {
         var u="//analytics.apache.org/";
         _paq.push(['setTrackerUrl', u+'matomo.php']);
         _paq.push(['setSiteId', '1']);
         var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
         g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
       })();
     </script>

 </head>

 <body dir="ltr">


 <header>
   <nav class="navbar navbar-expand-xl">
     <div class="container-fluid">
       <a class="navbar-brand" href="/">
         <img src="/img/logo/png/100/flink_squirrel_100_color.png" alt="Apache Flink" height="47" width="47" class="d-inline-block align-text-middle">
         <span>Apache Flink</span>
       </a>
       <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
           <i class="fa fa-bars navbar-toggler-icon"></i>
       </button>
       <div class="collapse navbar-collapse" id="navbarSupportedContent">
         <div class="book-search">
           <div class="book-search-spinner hidden">
             <i class="fa fa-refresh fa-spin"></i>
           </div>
           <form class="search-bar d-flex" onsubmit="return false;">
             <input type="text" id="book-search-input" placeholder="Search" aria-label="Search" maxlength="64" data-hotkeys="s/">
             <i class="fa fa-search search"></i>
             <i class="fa fa-circle-o-notch fa-spin spinner"></i>
           </form>
           <div class="book-search-spinner hidden"></div>
           <ul id="book-search-results"></ul>
         </div>
       </div>
     </div>
   </nav>
   <div class="navbar-clearfix"></div>
 </header>


       <main class="flex">
         <section class="container book-page">

 <article class="markdown">
     <h1>
         <a href="/2019/02/21/monitoring-apache-flink-applications-101/">Monitoring Apache Flink Applications 101</a>
     </h1>


   February 21, 2019 -


   Konstantin Knauf

   <a href="https://twitter.com/snntrable">(@snntrable)</a>


     <p><!-- improve style of tables -->
 <style>
   table { border: 0px solid black; table-layout: auto; width: 800px; }
   th, td { border: 1px solid black; padding: 5px; padding-left: 10px; padding-right: 10px; }
   th { text-align: center }
   td { vertical-align: top }
 </style>
 <p>This blog post provides an introduction to Apache Flink’s built-in monitoring
 and metrics system, which allows developers to effectively monitor their Flink
 jobs. Picking the relevant metrics to monitor a Flink application can often be
 overwhelming for a DevOps team that is just starting with stream processing and
 Apache Flink. Having worked with many organizations that deploy Flink at scale,
 I would like to share my experience and some best practices with the
 community.</p>
 <p>With business-critical applications running on Apache Flink, performance monitoring
 becomes an increasingly important part of a successful production deployment. It
 helps ensure that any degradation or downtime is identified immediately and
 resolved as quickly as possible.</p>
 <p>Monitoring goes hand-in-hand with observability, which is a prerequisite for
 troubleshooting and performance tuning. Nowadays, with the complexity of modern
 enterprise applications and the speed of delivery increasing, an engineering
 team must understand and have a complete overview of its applications’ status at
 any given point in time.</p>
 <h2 id="flinks-metrics-system">
   Flink’s Metrics System
   <a class="anchor" href="#flinks-metrics-system">#</a>
 </h2>
 <p>The foundation for monitoring Flink jobs is its <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html">metrics
 system</a>,
 which consists of two components: <code>Metrics</code> and <code>MetricsReporters</code>.</p>
 <h3 id="metrics">
   Metrics
   <a class="anchor" href="#metrics">#</a>
 </h3>
 <p>Flink comes with a comprehensive set of built-in metrics such as:</p>
 <ul>
 <li>Used JVM Heap / NonHeap / Direct Memory (per Task-/JobManager)</li>
 <li>Number of Job Restarts (per Job)</li>
 <li>Number of Records Per Second (per Operator)</li>
 <li>&hellip;</li>
 </ul>
 <p>These metrics have different scopes and measure general aspects (e.g. the JVM
 or the operating system) as well as Flink-specific ones.</p>
 <p>As a user, you can and should add application-specific metrics to your
 functions. Typically these include counters for the number of invalid records or
 the number of records temporarily buffered in managed state. Besides counters,
 Flink offers additional metric types like gauges and histograms. For
 instructions on how to register your own metrics with Flink’s metrics system,
 please check out <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html#registering-metrics">Flink’s
 documentation</a>.
 In this blog post, we will focus on how to get the most out of Flink’s built-in
 metrics.</p>
 <h3 id="metricsreporters">
   MetricsReporters
   <a class="anchor" href="#metricsreporters">#</a>
 </h3>
 <p>All metrics can be queried via Flink’s REST API. In addition, users can configure
 MetricsReporters to send the metrics to external systems. Apache Flink provides
 reporters for the most common monitoring tools out of the box, including JMX,
 Prometheus, Datadog, Graphite and InfluxDB. For information about how to
 configure a reporter, check out Flink’s <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html#reporter">MetricsReporter
 documentation</a>.</p>
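 <p>To illustrate the REST route, here is a minimal Python sketch that builds a
 request for job-level metrics and parses the reply. It is a sketch under
 assumptions, not a reference client: the endpoint path
 <code>/jobs/:jobid/metrics</code>, the <code>get</code> query parameter and the
 list-of-objects response shape reflect my understanding of Flink’s REST API and
 should be checked against the documentation of your version. The base URL, the
 job ID and all function names are hypothetical.</p>

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def job_metrics_url(base_url, job_id, metric_names):
    """Build the URL of the (assumed) job-level metrics endpoint."""
    names = ",".join(quote(name) for name in metric_names)
    return f"{base_url.rstrip('/')}/jobs/{job_id}/metrics?get={names}"


def parse_metrics(payload):
    """Turn a JSON list of {"id": ..., "value": ...} entries into a dict."""
    return {entry["id"]: entry["value"] for entry in json.loads(payload)}


def fetch_job_metrics(base_url, job_id, metric_names):
    """Fetch the requested metrics from a running cluster (needs network access)."""
    with urlopen(job_metrics_url(base_url, job_id, metric_names)) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

 <p>A call such as <code>fetch_job_metrics("http://localhost:8081", job_id, ["uptime", "fullRestarts"])</code>
 would then return a dictionary keyed by metric name. For continuous monitoring,
 a MetricsReporter is usually the better fit, since it does not require polling.</p>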
 <p>In the remaining part of this blog post, we will go over some of the most
 important metrics to monitor your Apache Flink application.</p>
 <h2 id="monitoring-general-health">
   Monitoring General Health
   <a class="anchor" href="#monitoring-general-health">#</a>
 </h2>
 <p>The first thing you want to monitor is whether your job is actually in a <em>RUNNING</em>
 state. In addition, it pays off to monitor the number of restarts and the time
 since the last restart.</p>
 <p>Generally speaking, successful checkpointing is a strong indicator of the
 general health of your application. For each checkpoint, checkpoint barriers
 need to flow through the whole topology of your Flink job, and events and
 barriers cannot overtake each other. Therefore, a successful checkpoint shows
 that no channel is fully congested.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>uptime</code></td>
 <td>job</td>
 <td>The time that the job has been running without interruption.</td>
 </tr>
 <tr>
 <td><code>fullRestarts</code></td>
 <td>job</td>
 <td>The total number of full restarts since this job was submitted.</td>
 </tr>
 <tr>
 <td><code>numberOfCompletedCheckpoints</code></td>
 <td>job</td>
 <td>The number of successfully completed checkpoints.</td>
 </tr>
 <tr>
 <td><code>numberOfFailedCheckpoints</code></td>
 <td>job</td>
 <td>The number of failed checkpoints.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panels</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-1.png" width="800px" alt="Uptime (35 minutes), Restarting Time (3 milliseconds) and Number of Full Restarts (7)"/>
 <br/>
 <i><small>Uptime (35 minutes), Restarting Time (3 milliseconds) and Number of Full Restarts (7)</small></i>
 </center>
 <br/>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-2.png" width="800px" alt="Completed Checkpoints (18336), Failed (14)"/>
 <br/>
 <i><small>Completed Checkpoints (18336), Failed (14)</small></i>
 </center>
 <br/>
 <p><strong>Possible Alerts</strong></p>
 <ul>
 <li><code>ΔfullRestarts</code> &gt; <code>threshold</code></li>
 <li><code>ΔnumberOfFailedCheckpoints</code> &gt; <code>threshold</code></li>
 </ul>
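 <p>Both alerts compare the growth of a counter over some time window with a
 threshold. The following Python sketch shows one way to compute that growth
 from a list of periodic readings. The function names are hypothetical, and
 treating a drop in the counter as a reset (for example after the job was
 resubmitted) is my own assumption about how you may want to handle that case.</p>

```python
def counter_increase(samples):
    """Return how much a counter grew over chronologically ordered readings."""
    increase = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop is treated as a counter reset: count the new value from zero.
            increase += current
    return increase


def should_alert(samples, threshold):
    """True if the counter grew by more than the threshold within the window."""
    return counter_increase(samples) > threshold
```

 <p>For example, readings of <code>fullRestarts</code> of 7, 7, 8 and 10 within
 the window correspond to an increase of 3. Most monitoring systems offer an
 equivalent built-in function, which is preferable to hand-written logic in
 production.</p>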
 <h2 id="monitoring-progress--throughput">
   Monitoring Progress &amp; Throughput
   <a class="anchor" href="#monitoring-progress--throughput">#</a>
 </h2>
 <p>Knowing that your application is RUNNING and checkpointing is working fine is good,
 but it does not tell you whether the application is actually making progress and
 keeping up with the upstream systems.</p>
 <h3 id="throughput">
   Throughput
   <a class="anchor" href="#throughput">#</a>
 </h3>
 <p>Flink provides multiple metrics to measure the throughput of your application.
 For each operator or task (remember: a task can contain multiple <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/dev/stream/operators/#task-chaining-and-resource-groups">chained
 operators</a>),
 Flink counts the number of records and bytes going in and out. Out of those
 metrics, the rate of outgoing records per operator is often the most intuitive
 and easiest to reason about.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>numRecordsOutPerSecond</code></td>
 <td>task</td>
 <td>The number of records this operator/task sends per second.</td>
 </tr>
 <tr>
 <td><code>numRecordsOutPerSecond</code></td>
 <td>operator</td>
 <td>The number of records this operator sends per second.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panels</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-3.png" width="800px" alt="Mean Records Out per Second per Operator"/>
 <br/>
 <i><small>Mean Records Out per Second per Operator</small></i>
 </center>
 <br/>
 <p><strong>Possible Alerts</strong></p>
 <ul>
 <li><code>numRecordsOutPerSecond</code> = <code>0</code> (for a non-Sink operator)</li>
 </ul>
 <p><em>Note</em>: Source operators always have zero incoming records. Sink operators
 always have zero outgoing records because the metrics only count
 Flink-internal communication. There is a <a href="https://issues.apache.org/jira/browse/FLINK-7286">JIRA
 ticket</a> to change this
 behavior.</p>
 <h3 id="progress">
   Progress
   <a class="anchor" href="#progress">#</a>
 </h3>
 <p>For applications that use event time semantics, it is important that watermarks
 progress over time. A watermark of time <em>t</em> tells the framework that it
 should no longer expect to receive events with a timestamp earlier than <em>t</em>
 and, in turn, that it should trigger all operations that were scheduled for a timestamp &lt; <em>t</em>.
 For example, an event time window that ends at <em>t</em> = 30 will be closed and
 evaluated once the watermark passes 30.</p>
 <p>As a consequence, you should monitor the watermark at event time-sensitive
 operators in your application, such as process functions and windows. If the
 difference between the current processing time and the watermark, known as
 event-time skew, is unusually high, then it typically implies one of two issues.
 First, it could mean that you are simply processing old events, for example
 during catch-up after a downtime or when your job is not able to keep up
 and events are queuing up. Second, it could mean that a single upstream sub-task has
 not sent a watermark for a long time (for example because it did not receive any
 events to base the watermark on), which also prevents the watermark in
 downstream operators from progressing. This <a href="https://issues.apache.org/jira/browse/FLINK-5017">JIRA
 ticket</a> provides further
 information and a workaround for the latter.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>currentOutputWatermark</code></td>
 <td>operator</td>
 <td>The last watermark this operator has emitted.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panels</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-4.png" width="800px" alt="Event Time Lag per Subtask of a single operator in the topology. In this case, the watermark is lagging a few seconds behind for each subtask."/>
 <br/>
 <i><small>Event Time Lag per Subtask of a single operator in the topology. In this case, the watermark is lagging a few seconds behind for each subtask.</small></i>
 </center>
 <br/>
 <p><strong>Possible Alerts</strong></p>
 <ul>
 <li><code>currentProcessingTime - currentOutputWatermark</code> &gt; <code>threshold</code></li>
 </ul>
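 <p>The alert above boils down to a subtraction per subtask. The Python sketch
 below computes the lag and lists the subtasks that exceed a threshold. The
 function names are hypothetical, all times are assumed to be epoch
 milliseconds, and the sentinel value rests on my understanding that Flink
 reports <code>Long.MIN_VALUE</code> until an operator has emitted its first
 watermark.</p>

```python
# Assumption: the watermark gauge reads Long.MIN_VALUE before the first watermark.
NO_WATERMARK = -(2 ** 63)


def event_time_lag_ms(processing_time_ms, watermark_ms):
    """Return processing time minus watermark, or None if no watermark exists yet."""
    if watermark_ms == NO_WATERMARK:
        return None
    return processing_time_ms - watermark_ms


def lagging_subtasks(processing_time_ms, watermarks_by_subtask, threshold_ms):
    """Return the sorted indices of subtasks whose lag exceeds the threshold."""
    lagging = []
    for subtask, watermark_ms in sorted(watermarks_by_subtask.items()):
        lag = event_time_lag_ms(processing_time_ms, watermark_ms)
        if lag is not None and lag > threshold_ms:
            lagging.append(subtask)
    return lagging
```

 <p>Subtasks that have not emitted any watermark are skipped here. Since a
 silent subtask is exactly the second issue described above, you may want a
 separate alert for that case.</p>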
 <h3 id="keeping-up">
   &ldquo;Keeping Up&rdquo;
   <a class="anchor" href="#keeping-up">#</a>
 </h3>
 <p>When consuming from a message queue, there is often a direct way to monitor if
 your application is keeping up. By using connector-specific metrics you can
 monitor how far behind the head of the message queue your current consumer group
 is. Flink forwards the underlying metrics from most sources.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>records-lag-max</code></td>
 <td>user</td>
 <td>Applies to <code>FlinkKafkaConsumer</code>. The maximum lag in terms of the number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.</td>
 </tr>
 <tr>
 <td><code>millisBehindLatest</code></td>
 <td>user</td>
 <td>Applies to <code>FlinkKinesisConsumer</code>. The number of milliseconds a consumer is behind the head of the stream. For any consumer and Kinesis shard, this indicates how far it is behind the current time.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Possible Alerts</strong></p>
 <ul>
 <li><code>records-lag-max</code>  &gt; <code>threshold</code></li>
 <li><code>millisBehindLatest</code> &gt; <code>threshold</code></li>
 </ul>
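 <p>A fixed threshold catches a large backlog, but the table above points out
 that a steadily increasing lag is the clearer signal. As a hedged illustration,
 the following Python sketch checks whether a window of
 <code>records-lag-max</code> readings is strictly rising. The function name and
 the <code>min_growth</code> parameter are hypothetical.</p>

```python
def is_lag_growing(samples, min_growth=0):
    """Return True if each reading exceeds the previous one and the total
    growth over the window is larger than min_growth records."""
    if len(samples) in (0, 1):
        return False
    pairs = zip(samples, samples[1:])
    strictly_rising = all(current > previous for previous, current in pairs)
    return strictly_rising and samples[-1] - samples[0] > min_growth
```

 <p>Requiring every reading to rise is deliberately strict and keeps the sketch
 short. With noisy data, comparing averages of the first and second half of the
 window may be more robust.</p>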
 <h2 id="monitoring-latency">
   Monitoring Latency
   <a class="anchor" href="#monitoring-latency">#</a>
 </h2>
 <p>Generally speaking, latency is the delay between the creation of an event and
 the time at which results based on this event become visible. Once the event is
 created it is usually stored in a persistent message queue, before it is
 processed by Apache Flink, which then writes the results to a database or calls
 a downstream system. In such a pipeline, latency can be introduced at each stage
 and for various reasons including the following:</p>
 <ol>
 <li>It might take a varying amount of time until events are persisted in the
 message queue.</li>
 <li>During periods of high load or during recovery, events might spend some time
 in the message queue until they are processed by Flink (see previous section).</li>
 <li>Some operators in a streaming topology need to buffer events for some time
 (e.g. in a time window) for functional reasons.</li>
 <li>Each computation in your Flink topology (framework or user code), as well as
 each network shuffle, takes time and adds to latency.</li>
 <li>If the application emits through a transactional sink, the sink will only
 commit and publish transactions upon successful checkpoints of Flink, adding
 latency usually up to the checkpointing interval for each record.</li>
 </ol>
 <p>In practice, it has proven invaluable to add timestamps to your events at
 multiple stages: at least at creation, persistence, ingestion by Flink and
 publication by Flink. If bandwidth is a concern, you can attach them to a
 sample of the events only. The differences between these timestamps can be
 exposed as a user-defined metric in your Flink topology to derive the latency
 distribution of each stage.</p>
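 <p>To make the idea concrete, the following Python sketch derives the latency
 of each stage from the timestamps carried by a single event. The stage names,
 the dictionary layout and the function name are hypothetical, and the
 timestamps are assumed to be epoch milliseconds taken from reasonably
 synchronized clocks.</p>

```python
# Hypothetical stage names, in the order an event passes through them.
STAGES = ["created", "persisted", "ingested", "published"]


def stage_latencies_ms(timestamps_ms):
    """Return the time spent between each pair of consecutive stages."""
    latencies = {}
    for start, end in zip(STAGES, STAGES[1:]):
        latencies[f"{start}_to_{end}"] = timestamps_ms[end] - timestamps_ms[start]
    return latencies
```

 <p>Inside a Flink job, each of these differences could be fed into a
 user-defined histogram so that the monitoring system shows a distribution per
 stage rather than single values.</p>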
 <p>In the rest of this section, we will consider only the latency that is
 introduced inside the Flink topology and that cannot be attributed to
 transactional sinks or to events being buffered for functional reasons
 (item 4 in the list above).</p>
 <p>To this end, Flink comes with a feature called <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html#latency-tracking">Latency
 Tracking</a>.
 When enabled, Flink periodically inserts so-called latency markers at all
 sources. For each operator sub-task, it then reports a latency distribution
 from each source to that operator. The granularity of these histograms can be
 controlled by setting <em>metrics.latency.granularity</em>.</p>
 <p>Due to the potentially high number of histograms (in particular for
 <em>metrics.latency.granularity: subtask</em>), enabling latency tracking can
 significantly impact the performance of the cluster. It is recommended to only
 enable it to locate sources of latency during debugging.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>latency</code></td>
 <td>operator</td>
 <td>The latency from the source operator to this operator.</td>
 </tr>
 <tr>
 <td><code>restartingTime</code></td>
 <td>job</td>
 <td>The time it took to restart the job, or how long the current restart has been in progress.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panel</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-5.png" width="800px" alt="Latency distribution between a source and a single sink subtask."/>
 <br/>
 <i><small>Latency distribution between a source and a single sink subtask.</small></i>
 </center>
 <br/>
 <h2 id="jvm-metrics">
   JVM Metrics
   <a class="anchor" href="#jvm-metrics">#</a>
 </h2>
 <p>So far we have only looked at Flink-specific metrics. As long as the latency &amp;
 throughput of your application are in line with your expectations and it is
 checkpointing consistently, this is probably everything you need. On the other
 hand, if your job’s performance is starting to degrade, among the first metrics you
 want to look at are the memory consumption and CPU load of your Task- &amp; JobManager
 JVMs.</p>
 <h3 id="memory">
   Memory
   <a class="anchor" href="#memory">#</a>
 </h3>
 <p>Flink reports the usage of Heap, NonHeap, Direct &amp; Mapped memory for JobManagers
 and TaskManagers.</p>
 <ul>
 <li>
 <p>Heap memory - as with most JVM applications - is the most volatile and important
 metric to watch. This is especially true when using Flink’s filesystem
 state backend, as it keeps all state objects on the JVM Heap. If the size of
 long-lived objects on the Heap increases significantly, this can usually be
 attributed to the size of your application state (check the
 <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html#checkpointing">checkpointing metrics</a>
 for an estimated size of the on-heap state). The possible reasons for growing
 state are very application-specific. Typically, an increasing number of keys, a
 large event-time skew between different input streams or simply missing state
 cleanup may cause growing state.</p>
 </li>
 <li>
 <p>NonHeap memory is dominated by the metaspace, which holds class metadata as
 well as static content and whose size is unlimited by default. There is a
 <a href="https://issues.apache.org/jira/browse/FLINK-10317">JIRA Ticket</a> to limit the size
 to 250 megabytes by default.</p>
 </li>
 <li>
 <p>The biggest driver of Direct memory is by far the
 number of Flink’s network buffers, which can be
 <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/ops/config.html#configuring-the-network-buffers">configured</a>.</p>
 </li>
 <li>
 <p>Mapped memory is usually close to zero as Flink does not use memory-mapped files.</p>
 </li>
 </ul>
 <p>In a containerized environment you should additionally monitor the overall
 memory consumption of the Job- and TaskManager containers to ensure they don’t
 exceed their resource limits. This is particularly important when using the
 RocksDB state backend, since RocksDB allocates a considerable amount of
 memory off heap. To understand how much memory RocksDB might use, you can
 check out <a href="https://www.da-platform.com/blog/manage-rocksdb-memory-size-apache-flink">this blog
 post</a>
 by Stefan Richter.</p>
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>Status.JVM.Memory.NonHeap.Committed</code></td>
 <td>job-/taskmanager</td>
 <td>The amount of non-heap memory guaranteed to be available to the JVM (in bytes).</td>
 </tr>
 <tr>
 <td><code>Status.JVM.Memory.Heap.Used</code></td>
 <td>job-/taskmanager</td>
 <td>The amount of heap memory currently used (in bytes).</td>
 </tr>
 <tr>
 <td><code>Status.JVM.Memory.Heap.Committed</code></td>
 <td>job-/taskmanager</td>
 <td>The amount of heap memory guaranteed to be available to the JVM (in bytes).</td>
 </tr>
 <tr>
 <td><code>Status.JVM.Memory.Direct.MemoryUsed</code></td>
 <td>job-/taskmanager</td>
 <td>The amount of memory used by the JVM for the direct buffer pool (in bytes).</td>
 </tr>
 <tr>
 <td><code>Status.JVM.Memory.Mapped.MemoryUsed</code></td>
 <td>job-/taskmanager</td>
 <td>The amount of memory used by the JVM for the mapped buffer pool (in bytes).</td>
 </tr>
 <tr>
 <td><code>Status.JVM.GarbageCollector.G1 Young Generation.Time</code></td>
 <td>job-/taskmanager</td>
 <td>The total time spent performing G1 Young Generation garbage collection.</td>
 </tr>
 <tr>
 <td><code>Status.JVM.GarbageCollector.G1 Old Generation.Time</code></td>
 <td>job-/taskmanager</td>
 <td>The total time spent performing G1 Old Generation garbage collection.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panel</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-6.png" width="800px" alt="TaskManager memory consumption and garbage collection times."/>
 <br/>
 <i><small>TaskManager memory consumption and garbage collection times.</small></i>
 </center>
 <br/>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-7.png" width="800px" alt="JobManager memory consumption and garbage collection times."/>
 <br/>
 <i><small>JobManager memory consumption and garbage collection times.</small></i>
 </center>
 <br/>
 <p><strong>Possible Alerts</strong></p>
 <ul>
 <li><code>container memory limit</code> &lt; <code>container memory + safety margin</code></li>
 </ul>
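 <p>As a minimal sketch, the alert condition above could be evaluated like this in Python. The function name, the 4 GiB limit and the 512 MiB safety margin are hypothetical, and the container memory value is assumed to come from the monitoring of your container platform.</p>
<pre><code class="language-python">def memory_alert(container_memory_bytes, container_limit_bytes, safety_margin_bytes):
    """Return True when usage plus the safety margin exceeds the container limit."""
    return container_memory_bytes + safety_margin_bytes > container_limit_bytes


GIB = 1024 ** 3

# Hypothetical container: 4 GiB limit, 512 MiB safety margin.
limit = 4 * GIB
margin = GIB // 2

print(memory_alert(3 * GIB, limit, margin))          # 3.0 + 0.5 = 3.5 GiB, no alert
print(memory_alert(int(3.75 * GIB), limit, margin))  # 3.75 + 0.5 = 4.25 GiB, alert
</code></pre>
 <p>The safety margin gives you time to react before the container is actually terminated for exceeding its limit.</p>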
 <h3 id="cpu">
   CPU
   <a class="anchor" href="#cpu">#</a>
 </h3>
 <p>Besides memory, you should also monitor the CPU load of the TaskManagers. If
 your TaskManagers are constantly under very high load, you might be able to
 improve the overall performance by decreasing the number of task slots per
 TaskManager (in case of a Standalone setup), by providing more resources to the
 TaskManager (in case of a containerized setup), or by providing more
 TaskManagers. In general, a system already running under very high load during
 normal operations will need much more time to catch up after recovering from
 downtime. During this time, you will see a much higher latency (event-time skew) than
 usual.</p>
 <p>A sudden increase in the CPU load might also be attributed to high garbage
 collection pressure, which should be visible in the JVM memory metrics as well.</p>
 <p>If one or a few TaskManagers are constantly under very high load, this can slow
 down the whole topology due to long checkpoint alignment times and increasing
 event-time skew. A common reason is skew in the partition key of the data, which
 can be mitigated by pre-aggregating before the shuffle or keying on a more
 evenly distributed key.</p>
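 <p>To illustrate why pre-aggregation helps with a skewed key, here is a small simulation in plain Python. It does not use any Flink API, the input data is made up, and it only applies to aggregations such as sums that can be computed from partial results.</p>
<pre><code class="language-python">from collections import Counter


def shuffled_records_per_key(partitions):
    """Count how many records per key would be sent through the shuffle."""
    sent = Counter()
    for records in partitions:
        for key, _value in records:
            sent[key] += 1
    return sent


def pre_aggregate(records):
    """Sum the values per key inside one upstream partition."""
    partial_sums = Counter()
    for key, value in records:
        partial_sums[key] += value
    return list(partial_sums.items())


def final_sums(partitions):
    """Sum the values per key across all partitions (the downstream result)."""
    totals = Counter()
    for records in partitions:
        for key, value in records:
            totals[key] += value
    return totals


# Hypothetical input: two upstream partitions, key "a" is heavily over-represented.
partitions = [
    [("a", 1)] * 1000 + [("b", 1)] * 10,
    [("a", 1)] * 900 + [("c", 1)] * 20,
]
pre_aggregated = [pre_aggregate(records) for records in partitions]

plain = shuffled_records_per_key(partitions)
combined = shuffled_records_per_key(pre_aggregated)

# Records for the hot key "a" that reach a single downstream subtask.
print(plain["a"], combined["a"])
</code></pre>
 <p>In this example the subtask responsible for key <code>a</code> receives 2 records instead of 1900, while the final sums stay the same.</p>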
 <p><strong>Key Metrics</strong></p>
 <table>
 <thead>
 <tr>
 <th>Metric</th>
 <th>Scope</th>
 <th>Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td><code>Status.JVM.CPU.Load</code></td>
 <td>job-/taskmanager</td>
 <td>The recent CPU usage of the JVM.</td>
 </tr>
 </tbody>
 </table>
 <br/>
 <p><strong>Example Dashboard Panel</strong></p>
 <center>
 <img src="/img/blog/2019-02-21-monitoring-best-practices/fig-8.png" width="800px" alt="TaskManager &amp; JobManager CPU load."/>
 <br/>
 <i><small>TaskManager &amp; JobManager CPU load.</small></i>
 </center>
 <br/>
 <h2 id="system-resources">
   System Resources
   <a class="anchor" href="#system-resources">#</a>
 </h2>
 <p>In addition to the JVM metrics above, it is also possible to use Flink’s metrics
 system to gather insights about system resources, i.e. memory, CPU &amp;
 network-related metrics for the whole machine as opposed to the Flink processes
 alone. System resource monitoring is disabled by default and requires additional
 dependencies on the classpath. Please check out the
 <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html#system-resources">Flink system resource metrics documentation</a> for
 additional guidance and details. System resource monitoring in Flink can be very
 helpful in setups without existing host monitoring capabilities.</p>
 <h2 id="conclusion">
   Conclusion
   <a class="anchor" href="#conclusion">#</a>
 </h2>
 <p>This post tries to shed some light on Flink’s metrics and monitoring system. You
 can use it as a starting point when you first think about how to
 successfully monitor your Flink application. I highly recommend that you start
 monitoring your Flink application early on in the development phase. This way
 you will be able to improve your dashboards and alerts over time and, more
 importantly, observe the performance impact of the changes to your application
 throughout the development phase. By doing so, you can ask the right questions
 about the runtime behaviour of your application, and learn much more about
 Flink’s internals early on.</p>
 <p>Last but not least, this post only scratches the surface of the overall metrics
 and monitoring capabilities of Apache Flink. I highly recommend going over
 <a href="//nightlies.apache.org/flink/flink-docs-release-1.7/monitoring/metrics.html">Flink’s metrics documentation</a>
 for a full reference of Flink’s metrics system.</p>
 </p>
 </article>



         </section>

           <aside class="book-toc">


 <nav id="TableOfContents"><h3>On This Page <a href="javascript:void(0)" class="toc" onclick="collapseToc()"><i class="fa fa-times" aria-hidden="true"></i></a></h3>
   <ul>
     <li>
       <ul>
         <li><a href="#flinks-metrics-system">Flink’s Metrics System</a>
           <ul>
             <li><a href="#metrics">Metrics</a></li>
             <li><a href="#metricsreporters">MetricsReporters</a></li>
           </ul>
         </li>
         <li><a href="#monitoring-general-health">Monitoring General Health</a></li>
         <li><a href="#monitoring-progress--throughput">Monitoring Progress &amp; Throughput</a>
           <ul>
             <li><a href="#throughput">Throughput</a></li>
             <li><a href="#progress">Progress</a></li>
             <li><a href="#keeping-up">&ldquo;Keeping Up&rdquo;</a></li>
           </ul>
         </li>
         <li><a href="#monitoring-latency">Monitoring Latency</a></li>
         <li><a href="#jvm-metrics">JVM Metrics</a>
           <ul>
             <li><a href="#memory">Memory</a></li>
             <li><a href="#cpu">CPU</a></li>
           </ul>
         </li>
         <li><a href="#system-resources">System Resources</a></li>
         <li><a href="#conclusion">Conclusion</a></li>
       </ul>
     </li>
   </ul>
 </nav>


           </aside>
           <aside class="expand-toc hidden">
             <a class="toc" onclick="expandToc()" href="javascript:void(0)">
               <i class="fa fa-bars" aria-hidden="true"></i>
             </a>
           </aside>

       </main>

       <footer>


 <div class="separator"></div>
 <div class="panels">
   <div class="wrapper">
       <div class="panel">
         <ul>
           <li>
             <a href="https://flink-packages.org/">flink-packages.org</a>
           </li>
           <li>
             <a href="https://www.apache.org/">Apache Software Foundation</a>
           </li>
           <li>
             <a href="https://www.apache.org/licenses/">License</a>
           </li>


                 <li>
                   <a  href="/zh/">
                     <i class="fa fa-globe" aria-hidden="true"></i>&nbsp;中文版
                   </a>
                 </li>


        </ul>
       </div>
       <div class="panel">
         <ul>
           <li>
             <a href="/what-is-flink/security">Security</a>
           </li>
           <li>
             <a href="https://www.apache.org/foundation/sponsorship.html">Donate</a>
           </li>
           <li>
             <a href="https://www.apache.org/foundation/thanks.html">Thanks</a>
           </li>
        </ul>
       </div>
       <div class="panel icons">
         <div>
           <a href="/posts">
             <div class="icon flink-blog-icon"></div>
             <span>Flink blog</span>
           </a>
         </div>
         <div>
           <a href="https://github.com/apache/flink">
             <div class="icon flink-github-icon"></div>
             <span>Github</span>
           </a>
         </div>
         <div>
           <a href="https://twitter.com/apacheflink">
             <div class="icon flink-twitter-icon"></div>
             <span>Twitter</span>
           </a>
         </div>
       </div>
   </div>
 </div>

 <hr/>

 <div class="container disclaimer">
   <p>The contents of this website are © 2024 Apache Software Foundation under the terms of the Apache License v2. Apache Flink, Flink, and the Flink logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
 </div>


       </footer>

   </body>
 </html>