<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data" />
<meta name="author" content="Cloudera" />
<title>Apache Kudu - Apache Kudu Troubleshooting</title>
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="" />
<!-- Custom styles for this template -->
<link href="/css/kudu.css" rel="stylesheet"/>
<link href="/css/asciidoc.css" rel="stylesheet"/>
<link rel="shortcut icon" href="/img/logo-favicon.ico" />
<link rel="stylesheet" href="" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src=""></script>
<script src=""></script>
<![endif]-->
<div class="kudu-site container-fluid">
<!-- Static navbar -->
<nav class="navbar navbar-default">
<div class="container-fluid">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<a class="logo" href="/"><img
srcset="// 1x, // 2x"
alt="Apache Kudu"/></a>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav navbar-right">
<li >
<a href="/">Home</a>
<li >
<a href="/overview.html">Overview</a>
<li class="active">
<a href="/docs/">Documentation</a>
<li >
<a href="/releases/">Releases</a>
<li >
<a href="/blog/">Blog</a>
<!-- NOTE: this dropdown menu does not appear on Mobile, so don't add anything here
that doesn't also appear elsewhere on the site. -->
<li class="dropdown">
<a href="/community.html" role="button" aria-haspopup="true" aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li class="dropdown-header">GET IN TOUCH</li>
<li><a class="icon email" href="/community.html">Mailing Lists</a></li>
<li><a class="icon slack" href="">Slack Channel</a></li>
<li role="separator" class="divider"></li>
<li><a href="/community.html#meetups-user-groups-and-conference-presentations">Events and Meetups</a></li>
<li><a href="/committers.html">Project Committers</a></li>
<li><a href="/ecosystem.html">Ecosystem</a></li>
<!--<li><a href="/roadmap.html">Roadmap</a></li>-->
<li><a href="/community.html#contributions">How to Contribute</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">DEVELOPER RESOURCES</li>
<li><a class="icon github" href="">GitHub</a></li>
<li><a class="icon gerrit" href="">Gerrit Code Review</a></li>
<li><a class="icon jira" href="">JIRA Issue Tracker</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">SOCIAL MEDIA</li>
<li><a class="icon twitter" href="">Twitter</a></li>
<li><a href="">Reddit</a></li>
<li role="separator" class="divider"></li>
<li class="dropdown-header">APACHE SOFTWARE FOUNDATION</li>
<li><a href="" target="_blank">Security</a></li>
<li><a href="" target="_blank">Sponsorship</a></li>
<li><a href="" target="_blank">Thanks</a></li>
<li><a href="" target="_blank">License</a></li>
<li >
<a href="/faq.html">FAQ</a>
</ul><!-- /.nav -->
</div><!-- /#navbar -->
</div><!-- /.container-fluid -->
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="container">
<div class="row">
<div class="col-md-9">
<h1>Apache Kudu Troubleshooting</h1>
<div class="sect1">
<h2 id="_issues_starting_the_master_or_tablet_server"><a class="link" href="#_issues_starting_the_master_or_tablet_server">Issues starting the Master or Tablet Server</a></h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="req_hole_punching"><a class="link" href="#req_hole_punching">Errors During Hole Punching Test</a></h3>
<div class="paragraph">
<p>Kudu requires hole punching capabilities in order to be efficient. Hole punching support
depends upon your operating system kernel version and local filesystem implementation.</p>
<div class="ulist">
<p>RHEL or CentOS 6.4 or later, patched to kernel version of 2.6.32-358 or later.
Unpatched RHEL or CentOS 6.4 does not include a kernel with support for hole punching.</p>
<p>Ubuntu 14.04 includes version 3.13 of the Linux kernel, which supports hole punching.</p>
<p>Newer versions of the EXT4 or XFS file systems support hole punching, but EXT3 does
not. Older versions of XFS that do not support hole punching return an <code>EOPNOTSUPP</code>
(operation not supported) error. Older versions of either EXT4 or XFS that do
not support hole punching cause Kudu to emit an error message such as the following
and to fail to start:</p>
<div class="listingblock">
<div class="content">
<pre>Error during hole punch test. The log block manager requires a
filesystem with hole punching support such as ext4 or xfs. On el6,
kernel version 2.6.32-358 or newer is required. To run without hole
punching (at the cost of some efficiency and scalability), reconfigure
Kudu with --block_manager=file. Refer to the Kudu documentation for more
details. Raw error message follows.</pre>
<div class="paragraph">
<p>Without hole punching support, the log block manager is unsafe to use. It won&#8217;t
ever delete blocks, and will consume ever more space on disk.</p>
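<div class="paragraph">
<p>To check in advance whether a data directory sits on a filesystem that accepts hole punching,
a rough probe along the following lines may help. This is only a sketch, not the test Kudu itself
runs: it assumes a 64-bit Linux host with glibc, takes the flag values from <code>linux/falloc.h</code>,
and the function name <code>supports_hole_punching</code> is made up for illustration.</p>
</div>

```python
import ctypes
import ctypes.util
import tempfile

# Flag values as defined in <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def supports_hole_punching(directory):
    """Return True if a temporary file in `directory` accepts a hole-punch request.

    Hypothetical helper: returns False when fallocate() is unavailable or the
    filesystem rejects the request (for example with EOPNOTSUPP).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    libc = ctypes.CDLL(libc_name, use_errno=True)
    fallocate = getattr(libc, "fallocate", None)
    if fallocate is None:
        return False  # platform without fallocate(), e.g. not Linux
    # int fallocate(int fd, int mode, off_t offset, off_t len); off_t assumed 64-bit.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_longlong, ctypes.c_longlong]
    fallocate.restype = ctypes.c_int

    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        tmp.write(b"x" * 8192)  # allocate two 4 KiB blocks of real data
        tmp.flush()
        mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
        return fallocate(tmp.fileno(), mode, 0, 4096) == 0


if __name__ == "__main__":
    # Replace the temp directory with the Kudu data directory you want to probe.
    print(supports_hole_punching(tempfile.gettempdir()))
```

<div class="paragraph">
<p>A <code>False</code> result only suggests that the directory is unsuitable for the log block manager;
the error message Kudu prints at startup remains the authoritative signal.</p>
</div>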
<div class="paragraph">
<p>If you can&#8217;t use hole punching in your environment, you can still
try Kudu. Enable the file block manager instead of the log block manager by
adding the <code>--block_manager=file</code> flag to the commands you use to start the master
and tablet servers. The file block manager does not scale as well as the log block manager.</p>
<div class="admonitionblock warning">
<table>
<tr>
<td class="icon">
<i class="fa icon-warning" title="Warning"></i>
</td>
<td class="content">
The file block manager is known to scale and perform poorly, and should
only be used for small-scale evaluation and development.
</td>
</tr>
</table>
</div>
<div class="sect2">
<h3 id="ntp"><a class="link" href="#ntp">NTP clock synchronization</a></h3>
<div class="paragraph">
<p>For the master and tablet server daemons, the server&#8217;s clock must be synchronized using NTP.
In addition, the <strong>maximum clock error</strong> (not to be confused with the estimated error)
must be below a configurable threshold. The default value is 10 seconds, but it can be set with the flag
<code>--max_clock_sync_error_usec</code>.</p>
<div class="paragraph">
<p>If NTP is not installed, or if the clock is reported as unsynchronized, Kudu will not
start, and will emit a message such as:</p>
<div class="listingblock">
<div class="content">
<pre>F0924 20:24:36.336809 14550 Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized.</pre>
<div class="paragraph">
<p>If NTP is installed and synchronized, but the maximum clock error is too high,
the user will see a message such as:</p>
<div class="listingblock">
<div class="content">
<pre>Sep 17, 8:13:09.873 PM FATAL Couldn't get the current time: Clock synchronized, but error: 11130000, is past the maximum allowable error: 10000000</pre>
<div class="paragraph">
<div class="listingblock">
<div class="content">
<pre>Sep 17, 8:32:31.135 PM FATAL Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Cannot initialize HybridClock. Clock synchronized but error was too high (11711000 us).</pre>
<div class="admonitionblock tip">
<table>
<tr>
<td class="icon">
<i class="fa icon-tip" title="Tip"></i>
</td>
<td class="content">
If NTP is installed, the user can monitor the synchronization status by running
<code>ntptime</code>. The relevant value is the one reported for <code>maximum error</code>.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>To install NTP, use the appropriate command for your operating system:</p>
<table class="tableblock frame-all grid-all spread">
<colgroup>
<col style="width: 50%;">
<col style="width: 50%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">OS</th>
<th class="tableblock halign-left valign-top">Command</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">Debian/Ubuntu</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>sudo apt-get install ntp</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">RHEL/CentOS</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>sudo yum install ntp</code></p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>If NTP is installed but not running, start it using one of these commands:</p>
<table class="tableblock frame-all grid-all spread">
<colgroup>
<col style="width: 50%;">
<col style="width: 50%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">OS</th>
<th class="tableblock halign-left valign-top">Command</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">Debian/Ubuntu</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>sudo service ntp restart</code></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">RHEL/CentOS</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><code>sudo /etc/init.d/ntpd restart</code></p></td>
</tr>
</tbody>
</table>
<div class="admonitionblock tip">
<table>
<tr>
<td class="icon">
<i class="fa icon-tip" title="Tip"></i>
</td>
<td class="content">
NTP requires a network connection and may take a few minutes to synchronize the clock.
In some cases a spotty network connection may make NTP report the clock as unsynchronized.
A common, though temporary, workaround for this is to restart NTP with one of the commands above.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>If NTP reports the clock as synchronized but the maximum error is too high,
you can raise the threshold by setting the <code>--max_clock_sync_error_usec</code>
flag. For example, to raise the maximum allowed error to
20 seconds, set the flag as follows: <code>--max_clock_sync_error_usec=20000000</code></p>
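<div class="paragraph">
<p>The comparison between the reported maximum error and the threshold can be sketched as follows.
The sample line assumes the usual <code>ntptime</code> output layout (<code>maximum error N us</code>);
the function names are illustrative, and whether Kudu treats a value exactly equal to the threshold
as acceptable is an assumption here.</p>
</div>

```python
import re

# Default threshold stated in this section: 10 seconds, expressed in microseconds.
DEFAULT_MAX_ERROR_USEC = 10_000_000


def max_error_usec(ntptime_output):
    """Extract the 'maximum error' value (in microseconds) from ntptime output."""
    match = re.search(r"maximum error\s+(\d+)\s+us", ntptime_output)
    if match is None:
        raise ValueError("no 'maximum error' field found in ntptime output")
    return int(match.group(1))


def clock_error_ok(ntptime_output, threshold_usec=DEFAULT_MAX_ERROR_USEC):
    """Return True if the reported maximum error is within the threshold.

    Assumption: a value equal to the threshold counts as acceptable.
    """
    return max_error_usec(ntptime_output) <= threshold_usec


# Hypothetical ntptime line using the error value from the log message above.
sample = "  maximum error 11130000 us, estimated error 439 us, TAI offset 0"

if __name__ == "__main__":
    print(clock_error_ok(sample))                             # default 10 s threshold
    print(clock_error_ok(sample, threshold_usec=20_000_000))  # raised to 20 s
```

<div class="paragraph">
<p>With the sample value of 11130000 microseconds, the check fails against the default threshold and
passes once the threshold is raised to 20000000, which mirrors the effect of setting
<code>--max_clock_sync_error_usec=20000000</code>.</p>
</div>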
<div class="sect1">
<h2 id="_troubleshooting_performance_issues"><a class="link" href="#_troubleshooting_performance_issues">Troubleshooting Performance Issues</a></h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="kudu_tracing"><a class="link" href="#kudu_tracing">Kudu Tracing</a></h3>
<div class="paragraph">
<p>The <code>kudu-master</code> and <code>kudu-tserver</code> daemons include built-in tracing support
based on the open source
<a href="">Chromium Tracing</a>
framework. You can use tracing to help diagnose latency issues or other problems
on Kudu servers.</p>
<div class="sect3">
<h4 id="_accessing_the_tracing_interface"><a class="link" href="#_accessing_the_tracing_interface">Accessing the tracing interface</a></h4>
<div class="paragraph">
<p>The tracing interface is accessed via a web browser as part of the
embedded web server in each of the Kudu daemons.</p>
<table class="tableblock frame-all grid-all spread">
<caption class="title">Table 1. Tracing Interface URLs</caption>
<colgroup>
<col style="width: 50%;">
<col style="width: 50%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Daemon</th>
<th class="tableblock halign-left valign-top">URL</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">Tablet Server</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><a href="" class="bare"></a></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">Master</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><a href="" class="bare"></a></p></td>
</tr>
</tbody>
</table>
<div class="admonitionblock warning">
<table>
<tr>
<td class="icon">
<i class="fa icon-warning" title="Warning"></i>
</td>
<td class="content">
The tracing interface is known to work in recent versions of Google Chrome.
Other browsers may not work as expected.
</td>
</tr>
</table>
</div>
<div class="sect3">
<h4 id="_collecting_a_trace"><a class="link" href="#_collecting_a_trace">Collecting a trace</a></h4>
<div class="paragraph">
<p>After navigating to the tracing interface, click the <strong>Record</strong> button in the top left corner
of the screen. When beginning to diagnose a problem, start by selecting all categories.
Then click <strong>Record</strong> to begin recording a trace.</p>
<div class="paragraph">
<p>During trace collection, events are collected into an in-memory ring buffer.
This ring buffer is fixed in size, so it will eventually fill up to 100%. Once it is full,
new events are still collected, and the oldest events are removed to make room for them. While recording the trace,
trigger the behavior or workload you are interested in exploring.</p>
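<div class="paragraph">
<p>The ring buffer behavior can be illustrated with a small sketch. This is not Kudu&#8217;s
implementation; it uses Python&#8217;s <code>collections.deque</code> with a fixed <code>maxlen</code>
to show how a full buffer keeps only the most recent events. The function name and the capacity are arbitrary.</p>
</div>

```python
from collections import deque


def record_events(capacity, events):
    """Feed events into a fixed-size ring buffer and return what survives.

    Once the buffer holds `capacity` items, each new event evicts the oldest one.
    """
    buffer = deque(maxlen=capacity)
    for event in events:
        buffer.append(event)
    return list(buffer)


if __name__ == "__main__":
    # Seven events into a buffer of four: only the last four remain.
    print(record_events(4, range(1, 8)))  # [4, 5, 6, 7]
```

<div class="paragraph">
<p>The practical consequence is that a long recording retains only the most recent activity, so
trigger the workload of interest shortly before stopping the trace.</p>
</div>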
<div class="paragraph">
<p>After collecting for several seconds, click <strong>Stop</strong>. The collected trace will be
downloaded and displayed. Use the <strong>?</strong> key to display help text about using the tracing
interface to explore the trace.</p>
<div class="sect3">
<h4 id="_saving_a_trace"><a class="link" href="#_saving_a_trace">Saving a trace</a></h4>
<div class="paragraph">
<p>You can save collected traces as JSON files for later analysis by clicking <strong>Save</strong>
after collecting the trace. To load and analyze a saved JSON file, click <strong>Load</strong>
and choose the file.</p>
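<div class="paragraph">
<p>A saved trace can also be inspected outside the browser. The sketch below assumes the file follows
the Chromium Trace Event Format, that is, either a bare JSON array of events or an object with a
<code>traceEvents</code> array, and that some events are complete events (<code>"ph": "X"</code>)
carrying a <code>dur</code> field in microseconds. Which event types a Kudu trace actually contains
is not stated here, so treat those field names as assumptions; the function names are illustrative.</p>
</div>

```python
import json


def load_events(path):
    """Load trace events from a saved JSON trace file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: either {"traceEvents": [...]} or a bare list of events.
    if isinstance(data, dict):
        return data.get("traceEvents", [])
    return data


def slowest_complete_events(events, limit=5):
    """Return the `limit` longest complete ('X') events, longest first."""
    complete = [e for e in events if e.get("ph") == "X" and "dur" in e]
    return sorted(complete, key=lambda e: e["dur"], reverse=True)[:limit]


if __name__ == "__main__":
    # Hypothetical in-memory events standing in for a loaded trace file.
    demo = [
        {"name": "short", "ph": "X", "dur": 5},
        {"name": "long", "ph": "X", "dur": 50},
        {"name": "instant", "ph": "i"},
    ]
    for event in slowest_complete_events(demo, limit=2):
        print(event["name"], event["dur"])
```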
<div class="sect2">
<h3 id="_rpc_timeout_traces"><a class="link" href="#_rpc_timeout_traces">RPC Timeout Traces</a></h3>
<div class="paragraph">
<p>If client applications are experiencing RPC timeouts, the Kudu tablet server
<code>WARNING</code> level logs should contain a log entry that includes an RPC-level trace. For example:</p>
<div class="listingblock">
<div class="content">
<pre>W0922 00:56:52.313848 10858] Call kudu.consensus.ConsensusService.UpdateConsensus
from (request call id 3555909) took 1464ms (client timeout 1000).
W0922 00:56:52.314888 10858] Trace:
0922 00:56:50.849505 (+ 0us)] Inserting onto call queue
0922 00:56:50.849527 (+ 22us)] Handling call
0922 00:56:50.849574 (+ 47us)] Updating replica for 2 ops
0922 00:56:50.849628 (+ 54us)] Early marking committed up to term: 8 index: 880241
0922 00:56:50.849968 (+ 340us)] Triggering prepare for 2 ops
0922 00:56:50.850119 (+ 151us)] Serialized 1555 byte log entry
0922 00:56:50.850213 (+ 94us)] Marking committed up to term: 8 index: 880241
0922 00:56:50.850218 (+ 5us)] Updating last received op as term: 8 index: 880243
0922 00:56:50.850219 (+ 1us)] Filling consensus response to leader.
0922 00:56:50.850221 (+ 2us)] Waiting on the replicates to finish logging
0922 00:56:52.313763 (+1463542us)] finished
0922 00:56:52.313764 (+ 1us)] UpdateReplicas() finished
0922 00:56:52.313788 (+ 24us)] Queueing success response</pre>
<div class="paragraph">
<p>These traces can give an indication of which part of the request was slow. Please
include them in bug reports related to RPC latency outliers.</p>
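<div class="paragraph">
<p>Each trace line carries the time elapsed since the previous line, so the slow phase is the gap
before the line with the largest delta. A minimal sketch for locating that gap follows; it assumes the
<code>(+ Nus)]</code> layout shown in the example above, and the function name is illustrative.</p>
</div>

```python
import re

# Matches the "(+   340us)] message" part of a trace line.
STEP = re.compile(r"\(\+\s*(\d+)us\)\]\s*(.*)")


def slowest_gap(trace_text):
    """Return (delta_us, step_before, step_after) for the largest gap, or None."""
    steps = []
    for line in trace_text.splitlines():
        match = STEP.search(line)
        if match:
            steps.append((int(match.group(1)), match.group(2).strip()))
    best = None
    # The delta on line i is the time spent between step i-1 and step i.
    for i in range(1, len(steps)):
        delta, message = steps[i]
        if best is None or delta > best[0]:
            best = (delta, steps[i - 1][1], message)
    return best


# Three lines taken from the example trace above.
sample_trace = """\
0922 00:56:50.850219 (+ 1us)] Filling consensus response to leader.
0922 00:56:50.850221 (+ 2us)] Waiting on the replicates to finish logging
0922 00:56:52.313763 (+1463542us)] finished
"""

if __name__ == "__main__":
    print(slowest_gap(sample_trace))
```

<div class="paragraph">
<p>For the example trace, the largest gap is the roughly 1.46 seconds spent between
<code>Waiting on the replicates to finish logging</code> and <code>finished</code>.</p>
</div>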
<div class="sect2">
<h3 id="_kernel_stack_watchdog_traces"><a class="link" href="#_kernel_stack_watchdog_traces">Kernel Stack Watchdog Traces</a></h3>
<div class="paragraph">
<p>Each Kudu server process has a background thread called the Stack Watchdog, which
monitors the other threads in the server in case they have blocked for
longer-than-expected periods of time. These traces can indicate operating system issues
or bottlenecked storage.</p>
<div class="paragraph">
<p>When the watchdog thread identifies a case of thread blockage, it logs an entry
in the <code>WARNING</code> log like the following:</p>
<div class="listingblock">
<div class="content">
<pre>W0921 23:51:54.306350 10912] Thread 10937 stuck at /data/kudu/consensus/ for 537ms:
Kernel stack:
[&lt;ffffffffa00b209d&gt;] do_get_write_access+0x29d/0x520 [jbd2]
[&lt;ffffffffa00b2471&gt;] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[&lt;ffffffffa00fe6d8&gt;] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[&lt;ffffffffa00d9b23&gt;] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[&lt;ffffffffa00d9b9c&gt;] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[&lt;ffffffffa00d9e90&gt;] ext4_dirty_inode+0x40/0x60 [ext4]
[&lt;ffffffff811ac48b&gt;] __mark_inode_dirty+0x3b/0x160
[&lt;ffffffff8119c742&gt;] file_update_time+0xf2/0x170
[&lt;ffffffff8111c1e0&gt;] __generic_file_aio_write+0x230/0x490
[&lt;ffffffff8111c4c8&gt;] generic_file_aio_write+0x88/0x100
[&lt;ffffffffa00d3fb1&gt;] ext4_file_write+0x61/0x1e0 [ext4]
[&lt;ffffffff81180f5b&gt;] do_sync_readv_writev+0xfb/0x140
[&lt;ffffffff81181ee6&gt;] do_readv_writev+0xd6/0x1f0
[&lt;ffffffff81182046&gt;] vfs_writev+0x46/0x60
[&lt;ffffffff81182102&gt;] sys_pwritev+0xa2/0xc0
[&lt;ffffffff8100b072&gt;] system_call_fastpath+0x16/0x1b
[&lt;ffffffffffffffff&gt;] 0xffffffffffffffff
User stack:
@ 0x3a1ace10c4 (unknown)
@ 0x1262103 (unknown)
@ 0x12622d4 (unknown)
@ 0x12603df (unknown)
@ 0x8e7bfb (unknown)
@ 0x8f478b (unknown)
@ 0x8f55db (unknown)
@ 0x12a7b6f (unknown)
@ 0x3a1b007851 (unknown)
@ 0x3a1ace894d (unknown)
@ (nil) (unknown)</pre>
<div class="paragraph">
<p>These traces can be useful for diagnosing the root cause of latency issues when they are caused by systems
below Kudu, such as disk controllers or file systems.</p>
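<div class="paragraph">
<p>When many watchdog entries accumulate, it can help to summarize them. The sketch below scans log text
for the <code>Thread N stuck at ... for Nms</code> header shown above and lists the stalls at or above a
chosen duration. The pattern is inferred from that single example, and the function name and the
threshold are illustrative.</p>
</div>

```python
import re

# Header line of a watchdog entry, as shown in the example above.
STUCK = re.compile(r"Thread (\d+) stuck at (\S+) for (\d+)ms")


def stuck_threads(log_text, min_ms=0):
    """Return (thread_id, location, duration_ms) for each stall of at least min_ms."""
    results = []
    for match in STUCK.finditer(log_text):
        thread_id, location, duration_ms = match.groups()
        if int(duration_ms) >= min_ms:
            results.append((int(thread_id), location, int(duration_ms)))
    return results


# Header line copied from the example entry above.
sample_log = "W0921 23:51:54.306350 10912] Thread 10937 stuck at /data/kudu/consensus/ for 537ms:"

if __name__ == "__main__":
    for thread_id, location, duration_ms in stuck_threads(sample_log, min_ms=500):
        print(f"thread {thread_id} blocked {duration_ms} ms at {location}")
```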
<div class="col-md-3">
<div id="toc" data-spy="affix" data-offset-top="70">
<a href="index.html">Introducing Kudu</a>
<a href="release_notes.html">Kudu Release Notes</a>
<a href="quickstart.html">Getting Started with Kudu</a>
<a href="installation.html">Installation Guide</a>
<a href="configuration.html">Configuring Kudu</a>
<a href="kudu_impala_integration.html">Using Impala with Kudu</a>
<a href="administration.html">Administering Kudu</a>
<span class="active-toc">Troubleshooting Kudu</span>
<ul class="sectlevel1">
<li><a href="#_issues_starting_the_master_or_tablet_server">Issues starting the Master or Tablet Server</a>
<ul class="sectlevel2">
<li><a href="#req_hole_punching">Errors During Hole Punching Test</a></li>
<li><a href="#ntp">NTP clock synchronization</a></li>
<li><a href="#_troubleshooting_performance_issues">Troubleshooting Performance Issues</a>
<ul class="sectlevel2">
<li><a href="#kudu_tracing">Kudu Tracing</a></li>
<li><a href="#_rpc_timeout_traces">RPC Timeout Traces</a></li>
<li><a href="#_kernel_stack_watchdog_traces">Kernel Stack Watchdog Traces</a></li>
<a href="developing.html">Developing Applications with Kudu</a>
<a href="schema_design.html">Kudu Schema Design</a>
<a href="transaction_semantics.html">Kudu Transaction Semantics</a>
<a href="contributing.html">Contributing to Kudu</a>
<a href="style_guide.html">Kudu Documentation Style Guide</a>
<a href="configuration_reference.html">Kudu Configuration Reference</a>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p class="small">
Copyright &copy; 2020 The Apache Software Foundation. Last updated 2016-10-04 13:29:12 PDT
<p class="small">
Apache Kudu, Kudu, Apache, the Apache feather logo, and the Apache Kudu
project logo are either registered trademarks or trademarks of The
Apache Software Foundation in the United States and other countries.
<div class="col-md-3">
<a class="pull-right" href="">
<img src=""/>