| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| [[troubleshooting]] |
| = Apache Kudu Troubleshooting |
| |
| :author: Kudu Team |
| :imagesdir: ./images |
| :icons: font |
| :toc: left |
| :toclevels: 2 |
| :doctype: book |
| :backend: html5 |
| :sectlinks: |
| :experimental: |
| |
| include::top.adoc[tags=version] |
| |
| == Startup Errors |
| |
| [[req_hole_punching]] |
| === Errors During Hole Punching Test |
| |
Kudu requires hole punching capabilities in order to be efficient. Hole punching support
depends upon your operating system kernel version and local filesystem implementation.
| |
- RHEL or CentOS 6.4 or later, patched to kernel version 2.6.32-358 or later.
Unpatched RHEL or CentOS 6.4 does not include a kernel with support for hole punching.
| - Ubuntu 14.04 includes version 3.13 of the Linux kernel, which supports hole punching. |
| - Newer versions of the ext4 and xfs filesystems support hole punching. Older versions |
| that do not support hole punching will cause Kudu to emit an error message such as the |
| following and to fail to start: |
| + |
| ---- |
| Error during hole punch test. The log block manager requires a |
| filesystem with hole punching support such as ext4 or xfs. On el6, |
| kernel version 2.6.32-358 or newer is required. To run without hole |
| punching (at the cost of some efficiency and scalability), reconfigure |
| Kudu to use the file block manager. Refer to the Kudu documentation for |
| more details. WARNING: the file block manager is not suitable for |
| production use and should be used only for small-scale evaluation and |
| development on systems where hole-punching is not available. It's |
| impossible to switch between block managers after data is written to the |
| server. Raw error message follows |
| ---- |
| |
| [NOTE] |
| ext4 mountpoints may actually be backed by ext2 or ext3 formatted devices, which do not |
| support hole punching. The hole punching test will fail when run on such filesystems. There |
| are several different ways to determine whether an ext4 mountpoint is backed by an ext2, |
| ext3, or ext4 formatted device; see link:https://unix.stackexchange.com/q/60723[this Stack |
| Exchange post] for details. |
| |
Without hole punching support, the log block manager is unsafe to use: it will
never delete blocks and will consume ever more space on disk.
| |
| You can run the following sequence of commands on a mounted file system to |
| check whether it supports hole punching (the originally allocated 10MiB |
| turns into 5MiB after punching a 5MiB hole in the file at 1MiB offset): |
| |
| ---- |
| $ dd if=/dev/zero of=hole_punch bs=1M count=10 2>/dev/null |
| $ du -h hole_punch |
| 10M hole_punch |
| $ fallocate -p -o 1M -l 5M hole_punch |
| $ du -h hole_punch |
| 5.0M hole_punch |
| ---- |
| |
| If you can't use hole punching in your environment, you can still |
| try Kudu. Enable the file block manager instead of the log block manager by |
| adding the `--block_manager=file` flag to the commands you use to start the master |
| and tablet servers. The file block manager does not scale as well as the log block |
| manager. |
| |
| [WARNING] |
| ==== |
| The file block manager is known to scale and perform poorly, and should |
| only be used for small-scale evaluation and development, and only on systems |
| where hole punching is unavailable. |
| |
The file block manager uses one file per block. As multiple blocks are written
for each rowset, the number of blocks can be very high, especially for actively
written tablets. This can cause performance issues compared to the log block
manager even with a small amount of data, and it's *impossible to switch between
block managers* without wiping and reinitializing the tablet servers.
| ==== |
| |
| [[disk_issues]] |
| === Already present: FS layout already exists |
| |
When Kudu starts, it checks each configured data directory, expecting either
that all are initialized or that all are empty. If a server fails to start with a log message like
| |
| ---- |
| Check failed: _s.ok() Bad status: Already present: FS layout already exists; not overwriting existing layout: FSManager roots already exist: /data0/kudu/data |
| ---- |
| |
then this precondition has failed. This can happen if Kudu was configured with non-empty data
directories on first startup, or if a previously-running, healthy Kudu process was restarted
and at least one data directory was deleted or somehow corrupted, perhaps because of a disk
error. In the latter case, consult the
link:administration.html#change_dir_config[Changing Directory Configurations] documentation.
| |
| [[ntp]] |
| === NTP Clock Synchronization |
The local clock of the machine where a Kudu master or tablet server is running
must be synchronized using the Network Time Protocol (NTP) if using the `system`
time source. The time source is controlled by the `--time_source` flag, which
is set to `system` by default.
| |
Kudu requires the *maximum clock error* (not to be confused with the estimated
error) of the NTP-synchronized clock to be below a configurable threshold.
The default threshold is 10 seconds and can be customized using the
`--max_clock_sync_error_usec` flag.
| |
When running with the `system` time source, Kudu will not start and will emit
a message such as the following if the local clock is reported unsynchronized:
| |
| ---- |
| F0924 20:24:36.336809 14550 hybrid_clock.cc:191 Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error reading clock. Clock considered unsynchronized. |
| ---- |
| |
| If the machine's clock is synchronized, but the maximum clock error is too high, |
| the user will see a message such as: |
| |
| ---- |
| Sep 17, 8:13:09.873 PM FATAL hybrid_clock.cc:196 Couldn't get the current time: Clock synchronized, but error: 11130000, is past the maximum allowable error: 10000000 |
| ---- |
| |
| or |
| |
| ---- |
| Sep 17, 8:32:31.135 PM FATAL tablet_server_main.cc:38 Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Cannot initialize HybridClock. Clock synchronized but error was too high (11711000 us). |
| ---- |
| |
In this and the following NTP-related paragraphs, 'synchronization' with the
true time using NTP refers to two things:

- the synchronization status of the NTP server which drives the local clock
of the machine
- the synchronization status of the local machine's clock itself as reported
by the kernel's NTP discipline
| |
The former can be retrieved using the `ntpstat`, `ntpq`, and `ntpdc` utilities
if using `ntpd` (they are included in the `ntp` package), or the `chronyc`
utility if using `chronyd` (part of the `chrony` package). The latter can be
retrieved using either the `ntptime` utility (also part of the `ntp` package)
or the `chronyc` utility if using `chronyd`. For more information, see the
manual pages of these utilities and the paragraphs below.
| |
| ==== NTP-related Packages |
For a long time, `ntpd` has been the recommended NTP server to use on Kudu
nodes to synchronize local machines' clocks. Newer Linux distributions offer
`chronyd` as an alternative to `ntpd` for network time synchronization. Both
have been tested and proven to provide the necessary functionality for clock
synchronization in a Kudu cluster.
| |
| ===== Installing And Running `ntpd` |
| `ntpd` is the NTP server from the ubiquitous `ntp` suite. |
| |
| To install `ntpd` and other NTP-related utilities, use the appropriate command |
| for your operating system: |
| [cols="1,1", options="header"] |
| |=== |
| | OS | Command |
| | Debian/Ubuntu | `sudo apt-get install ntp` |
| | RHEL/CentOS | `sudo yum install ntp` |
| |=== |
| |
| If `ntpd` is installed but not running, start it using one of these commands |
| (don't forget to run `ntpdate` first): |
| [cols="1,1", options="header"] |
| |=== |
| | OS | Command |
| | Debian/Ubuntu | `sudo service ntp restart` |
| | RHEL/CentOS | `sudo service ntpd restart` |
| |=== |
| |
Make sure the `ntpdate` service is in the list of services run when the machine
starts: `ntpdate` should be run prior to starting `ntpd` to avoid a long delay
while the machine's local clock synchronizes with the true time. The smaller the
offset between the local machine's clock and the true time, the faster the NTP
server can synchronize the clock.
| |
When running `ntpdate`, make sure the tool reports success: check its exit
status and output. If it has trouble connecting to the NTP servers, make sure
NTP traffic is not being blocked by a firewall (NTP generates UDP traffic on
port 123 by default) and check for other network connectivity issues.
| |
Below are a few examples of configuration files for `ntpd`. By default, `ntpd`
uses the `/etc/ntp.conf` configuration file.
| |
| ---- |
| # Use my organization's internal NTP server (server in a local network). |
| server ntp1.myorg.internal iburst maxpoll 7 |
| # Add servers from the NTP public pool for redundancy and robustness. |
| server 0.pool.ntp.org iburst maxpoll 8 |
| server 1.pool.ntp.org iburst maxpoll 8 |
| server 2.pool.ntp.org iburst maxpoll 8 |
| server 3.pool.ntp.org iburst maxpoll 8 |
| ---- |
| |
| ---- |
| # AWS case: use dedicated NTP server available via link-local IP address. |
| server 169.254.169.123 iburst |
| ---- |
| |
| ---- |
| # GCE case: use dedicated NTP server available from within cloud instance. |
| server metadata.google.internal iburst |
| ---- |
| |
Sometimes it takes too long to synchronize the machine's local clock with the
true time even if the `ntpstat` utility reports that the NTP daemon is
synchronized with one of the reference NTP servers. This manifests as the
following: the utilities which report on the synchronization status of the NTP
daemon claim that all is well, but `ntptime` reports that the local clock is
unsynchronized, and Kudu tablet servers and masters refuse to start, outputting
an error like the one mentioned above. This often happens if `ntpd` is run with
the `-x` option. According to the `ntpd` manual page, the `-x` flag configures
the NTP server to only slew the clock. Without `-x`, the NTP server would do a
step adjustment instead:
| |
| ---- |
| -x Normally, the time is slewed if the offset is less than the |
| step threshold, which is 128 ms by default, and stepped if |
| above the threshold. This option sets the threshold to 600 s, |
| which is well within the accuracy window to set the clock manually. |
| Note: Since the slew rate of typical Unix kernels is limited to |
| 0.5 ms/s, each second of adjustment requires an amortization |
| interval of 2000 s. Thus, an adjustment as much as 600 s |
| will take almost 14 days to complete. |
| ---- |
| |
| In such cases, removing the `-x` option will help synchronize the local clock |
| faster. |
| |
More information on best practices and examples of practical resolution of
various NTP synchronization issues can be found in
link:https://www.redhat.com/en/blog/avoiding-clock-drift-vms[this Red Hat blog post on avoiding clock drift].
| |
| ===== Monitoring Clock Synchronization Status With The `ntp` Suite |
| When the `ntp` package is installed, you can monitor the synchronization status |
| of the machine's clock by running `ntptime`. For example, a system |
| with a local clock that is synchronized may report: |
| |
| ---- |
| ntp_gettime() returns code 0 (OK) |
| time de24c0cf.8d5da274 Tue, Feb 6 2018 16:03:27.552, (.552210980), |
| maximum error 224455 us, estimated error 383 us, TAI offset 0 |
| ntp_adjtime() returns code 0 (OK) |
| modes 0x0 (), |
| offset 1279.543 us, frequency 2.500 ppm, interval 1 s, |
| maximum error 224455 us, estimated error 383 us, |
| status 0x2001 (PLL,NANO), |
| time constant 10, precision 0.001 us, tolerance 500 ppm, |
| ---- |
| |
| Note the following most important pieces of output: |
| |
- `maximum error 224455 us`: this value is well under the 10-second maximum
error required by Kudu.
- `status 0x2001 (PLL,NANO)`: this indicates the local clock is synchronized
with the true time up to the maximum error above.
| |
In contrast, a system with an unsynchronized local clock would report something
like the following:
| |
| ---- |
| ntp_gettime() returns code 5 (ERROR) |
| time de24c240.0c006000 Tue, Feb 6 2018 16:09:36.046, (.046881), |
| maximum error 16000000 us, estimated error 16000000 us, TAI offset 0 |
| ntp_adjtime() returns code 5 (ERROR) |
| modes 0x0 (), |
| offset 0.000 us, frequency 2.500 ppm, interval 1 s, |
| maximum error 16000000 us, estimated error 16000000 us, |
| status 0x40 (UNSYNC), |
| time constant 10, precision 1.000 us, tolerance 500 ppm, |
| ---- |
| |
| The `UNSYNC` status means the local clock is not synchronized with the |
| true time. Because of that, the maximum reported error doesn't convey any |
| meaningful estimation of the actual error. |
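To script this check, look for the `UNSYNC` flag in the `ntptime` output. The
sketch below matches against a captured sample line rather than a live
`ntptime` call, so the pattern match itself is the only assumption:

```shell
# Check a captured `ntptime` status line for the UNSYNC flag.
sample='status 0x40 (UNSYNC),'
case "$sample" in
  *UNSYNC*) echo "local clock NOT synchronized" ;;
  *)        echo "local clock synchronized" ;;
esac
```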
| |
The `ntpstat` utility reports a summary of the synchronization status of
the NTP daemon itself. For example, a system which has `ntpd` running and
synchronized with one of its reference servers may report:
| |
| ---- |
| $ ntpstat |
| synchronised to NTP server (172.18.7.3) at stratum 4 |
| time correct to within 160 ms |
| polling server every 1024 s |
| ---- |
| |
Keep in mind that the synchronization status of the NTP daemon itself doesn't
reflect the synchronization status of the local clock. The way the NTP daemon
drives the local clock is subject to many constraints, and it may take the NTP
daemon some time to synchronize the local clock after it has itself latched on
to one of the reference servers.
| |
If more detailed information is needed on the synchronization status of the
NTP server (but not the synchronization status of the local clock), the `ntpq`
or `ntpdc` tools can be used to get detailed information about which NTP server
is currently acting as the source of the true time and which are considered
candidates (whether viable or not):
| |
| ---- |
| $ ntpq -nc lpeers |
| remote refid st t when poll reach delay offset jitter |
| ============================================================================== |
| -108.59.2.24 130.133.1.10 2 u 13 64 1 71.743 0.373 0.016 |
| +192.96.202.120 129.6.15.28 2 u 12 64 1 72.583 -0.426 0.028 |
| -69.10.161.7 204.26.59.157 3 u 11 64 1 15.741 2.641 0.021 |
| -173.255.206.154 45.56.123.24 3 u 10 64 1 43.502 0.199 0.029 |
| -69.195.159.158 128.138.140.44 2 u 9 64 1 53.885 -0.016 0.013 |
| *216.218.254.202 .CDMA. 1 u 6 64 1 1.475 -0.400 0.012 |
| +129.250.35.250 249.224.99.213 2 u 7 64 1 1.342 -0.640 0.018 |
| |
| $ ntpq -nc opeers |
| remote local st t when poll reach delay offset disp |
| ============================================================================== |
| -108.59.2.24 10.17.100.238 2 u 17 64 1 71.743 0.373 187.573 |
| +192.96.202.120 10.17.100.238 2 u 16 64 1 72.583 -0.426 187.594 |
| -69.10.161.7 10.17.100.238 3 u 15 64 1 15.741 2.641 187.569 |
| -173.255.206.154 10.17.100.238 3 u 14 64 1 43.502 0.199 187.580 |
| -69.195.159.158 10.17.100.238 2 u 13 64 1 53.885 -0.016 187.561 |
| *216.218.254.202 10.17.100.238 1 u 10 64 1 1.475 -0.400 187.543 |
| +129.250.35.250 10.17.100.238 2 u 11 64 1 1.342 -0.640 187.588 |
| ---- |
| |
TIP: Both `lpeers` and `opeers` may be helpful: `lpeers` lists the refid and
jitter, while `opeers` lists the clock dispersion.
| |
| [[chronyd]] |
| ===== Installing And Running `chronyd` |
| Kudu has been tested and is supported on machines whose local clock is |
| synchronized with NTP using `chronyd` version 3.2 and newer. |
| |
| The OS package is called `chrony` and contains both the NTP server `chronyd` |
| and the `chronyc` command line utility. To install the `chronyd` NTP server |
| and other utilities, use the appropriate command for your operating system: |
| |
| [cols="1,1", options="header"] |
| |=== |
| | OS | Command |
| | Debian/Ubuntu | `sudo apt-get install chrony` |
| | RHEL/CentOS | `sudo yum install chrony` |
| |=== |
| |
| If `chronyd` is installed but not yet running, start it using one of these |
| commands (don't forget to run `chronyd -q` first): |
| [cols="1,1", options="header"] |
| |=== |
| | OS | Command |
| | Debian/Ubuntu | `sudo service chrony restart` |
| | RHEL/CentOS | `sudo service chronyd restart` |
| |=== |
| |
By default, `chronyd` uses the `/etc/chrony.conf` configuration file. The
`rtcsync` option must be enabled in `chrony.conf`. Without `rtcsync`, the local
machine's clock will always be reported as unsynchronized, and Kudu masters and
tablet servers will not be able to start. The following
link:https://github.com/mlichvar/chrony/blob/994409a03697b8df68115342dc8d1e7ceeeb40bd/sys_timex.c#L162-L166[code]
explains the observed behavior of `chronyd` when setting the synchronization
status of the local clock on Linux.
| |
As verified on RHEL 7.5/CentOS 7.5 with `chronyd` 3.2 and newer, the default
configuration file is good enough to satisfy Kudu's requirements for the system
clock if running on a machine that has Internet access.
| |
| An link:https://chrony.tuxfamily.org/faq.html#_what_is_the_minimum_recommended_configuration_for_an_ntp_client[example of a minimum viable configuration] for `chronyd` is: |
| |
| ---- |
| pool pool.ntp.org iburst |
| driftfile /var/lib/chrony/drift |
| makestep 1 3 |
| rtcsync |
| ---- |
| |
| ===== Monitoring Clock Synchronization Status With The `chrony` Suite |
When the `chrony` package is installed, you can monitor the synchronization
status of the machine's clock by running `chronyc tracking` (add the `-n`
option if resolution of IP addresses back to FQDNs is not desired:
`chronyc -n tracking`).
| |
| For example, a system where `chronyd` hasn't synchronized the local clock yet |
| may report something like the following: |
| |
| ---- |
| Reference ID : 00000000 () |
| Stratum : 0 |
| Ref time (UTC) : Thu Jan 01 00:00:00 1970 |
| System time : 0.000000000 seconds fast of NTP time |
| Last offset : +0.000000000 seconds |
| RMS offset : 0.000000000 seconds |
| Frequency : 69.422 ppm slow |
| Residual freq : +0.000 ppm |
| Skew : 0.000 ppm |
| Root delay : 1.000000000 seconds |
| Root dispersion : 1.000000000 seconds |
| Update interval : 0.0 seconds |
| Leap status : Not synchronised |
| ---- |
| |
| A system with its local clock already synchronized may report: |
| |
| ---- |
| Reference ID : A9FEA9FE (169.254.169.254) |
| Stratum : 3 |
| Ref time (UTC) : Tue Mar 03 06:33:23 2020 |
| System time : 0.000011798 seconds fast of NTP time |
| Last offset : +0.000014285 seconds |
| RMS offset : 0.001493311 seconds |
| Frequency : 69.417 ppm slow |
| Residual freq : +0.000 ppm |
| Skew : 0.006 ppm |
| Root delay : 0.000786347 seconds |
| Root dispersion : 0.000138749 seconds |
| Update interval : 1036.7 seconds |
| Leap status : Normal |
| ---- |
| |
| Note the following important pieces of output: |
| |
| - `Root delay`: the total of the network path delays (round trips) |
| to the Stratum 1 server with which this `chronyd` instance is synchronized. |
| - `Root dispersion`: the total dispersion accumulated through all the paths up |
| to the Stratum 1 server with which this `chronyd` instance is synchronized. |
| - `Leap status`: whether the local clock is synchronized with the true time |
| up to the maximum error (see below). The `Normal` status means the clock is |
| synchronized, and `Not synchronised` naturally means otherwise. |
| |
| An absolute bound on the error of the clock maintained internally by `chronyd` |
| at the time of the last NTP update can be expressed as: |
| |
| ---- |
| clock_error <= abs(last_offset) + (root_delay / 2) + root_dispersion |
| ---- |
| |
| `chronyc sources` reports on the list of reference NTP servers: |
| |
| ---- |
| 210 Number of sources = 4 |
| MS Name/IP address Stratum Poll Reach LastRx Last sample |
| =============================================================================== |
| ^* 169.254.169.254 2 10 377 371 +240us[ +254us] +/- 501us |
| ^- 64.62.190.177 3 11 377 102 +1033us[+1033us] +/- 81ms |
| ^- 64.246.132.14 1 11 377 129 +323us[ +323us] +/- 16ms |
| ^- 184.105.182.16 2 10 377 130 -4719us[-4719us] +/- 55ms |
| ---- |
| |
To get more details on the measurement stats for the reference NTP servers,
use `chronyc sourcestats`:
| |
| ---- |
| 210 Number of sources = 4 |
| Name/IP Address NP NR Span Frequency Freq Skew Offset Std Dev |
| ============================================================================== |
| 169.254.169.254 46 27 323m +0.000 0.006 +72ns 68us |
| 64.62.190.177 12 10 224m +0.071 0.050 +1240us 154us |
| 64.246.132.14 21 13 326m +0.012 0.030 +434us 230us |
| 184.105.182.16 6 3 86m +0.252 0.559 -5097us 306us |
| ---- |
| |
| Use `chronyc ntpdata [server]` to get information on a particular reference |
| server (or all servers if the `server` parameter is omitted): |
| |
| ---- |
| Remote address : 169.254.169.254 (A9FEA9FE) |
| Remote port : 123 |
| Local address : 172.31.113.1 (AC1F7101) |
| Leap status : Normal |
| Version : 4 |
| Mode : Server |
| Stratum : 2 |
| Poll interval : 10 (1024 seconds) |
| Precision : -20 (0.000000954 seconds) |
| Root delay : 0.000229 seconds |
| Root dispersion : 0.000107 seconds |
| Reference ID : 474F4F47 () |
| Reference time : Tue Mar 03 06:33:24 2020 |
| Offset : -0.000253832 seconds |
| Peer delay : 0.000557465 seconds |
| Peer dispersion : 0.000000987 seconds |
| Response time : 0.000000001 seconds |
| Jitter asymmetry: +0.50 |
| NTP tests : 111 111 1111 |
| Interleaved : No |
| Authenticated : No |
| TX timestamping : Daemon |
| RX timestamping : Kernel |
| Total TX : 50 |
| Total RX : 50 |
| Total valid RX : 50 |
| ---- |
| |
For troubleshooting tips on clock synchronization with `chronyd`, see
link:https://chrony.tuxfamily.org/faq.html#_computer_is_not_synchronising[this
useful guide].
| |
===== Notes On Running `chronyd` In Local Reference Mode
When running a Kudu cluster in an isolated network, it's possible to synchronize
the clocks on all nodes with one another, but not necessarily with the true
time. The local clock of one dedicated server can be kept vaguely in line with
the true time by manual input, and the other nodes synchronize with that one.
In such a setup, `chronyd` on the dedicated server runs in
link:https://chrony.tuxfamily.org/doc/3.4/chrony.conf.html#local[local reference mode],
and `chronyd` on the other nodes synchronizes with it as with a regular true
time reference.
| |
If using such a setup, it's not recommended to run any Kudu servers on the
node where `chronyd` is running in local reference mode (at least for
`chronyd` version 3.4 and below). Otherwise, the following might happen:
| |
| - `chronyc tracking` reports that all is well, `Leap Status` is `Normal` |
| - `ntptime` reports that the local clock is out of sync |
| - `ntp_gettime() returns code 5 (ERROR)` |
| - `ntp_adjtime() returns code 5 (ERROR)` with `status 0x40 (UNSYNC)` |
| |
With that, `kudu-master` and `kudu-tserver` fail to start, reporting errors
like the following:
| |
| ---- |
| E0321 09:47:04.076095 2558 master_main.cc:40] Service unavailable: RunMasterServer() failed: Cannot initialize clock: timed out waiting for clock synchronisation: Error reading clock. Clock considered unsynchronized |
| ---- |
| |
| ---- |
| E0321 09:50:03.023491 2853 tablet_server_main.cc:40] Service unavailable: RunTabletServer() failed: Cannot initialize clock: timed out waiting for clock synchronisation: Error reading clock. Clock considered unsynchronized |
| ---- |
| |
| ==== NTP Configuration Best Practices |
In order to provide stable time synchronization with a low maximum error,
follow these NTP configuration best practices.
| |
*Run `ntpdate` (or its alternatives `ntpd -q` or `chronyd -q` in the case of
chrony) prior to running the NTP server.* If the offset of the local clock is
too far from the true time, it can take a long time before the NTP server
synchronizes the local clock, even if it's allowed to perform step adjustments.
So, after configuring `ntpd` or `chronyd`, first run the `ntpdate` tool with the
same set of NTP servers, or run `ntpd -q`/`chronyd -q`. It's assumed that the
NTP server is not running when `ntpdate`, `ntpd -q`, or `chronyd -q` is run. On
RHEL/CentOS, if using the `ntp` suite, enable the `ntpdate` service; if using
the `chrony` suite, enable the `chrony-wait` service.
| |
| *In certain public cloud environments, use the highly-available NTP server |
| accessible via link-local IP address or other dedicated NTP server provided |
| as a service.* If your cluster is running in a public cloud environment, |
| consult the cloud provider's documentation for the recommended NTP setup. |
| Both AWS and GCE clouds offer dedicated highly available NTP servers accessible |
| from within a cloud instance via link-local IP address. |
| |
*Unless using a highly available NTP reference server accessible via a
link-local address, always configure at least four time sources for the NTP
server on the local machine.* In addition to providing redundancy in case one
of the time sources becomes unavailable, this might make the configuration more
robust, since NTP is designed to increase its accuracy with a diversity of
sources in networks with higher round-trip times and jitter.
| |
| *Use the `iburst` option for faster synchronization at startup*. The `iburst` |
| option instructs the NTP server (both `ntpd` and `chronyd`) to send an initial |
| "burst" of time queries at startup. This results in a faster synchronization |
| of the `ntpd/chronyd` with their reference servers upon startup. |
| |
*If the maximum clock error goes beyond the default threshold set by Kudu
(10 seconds), consider setting a lower value for the `maxpoll` option for every
NTP server in `ntp.conf`/`chrony.conf`.* For example, consider setting
`maxpoll` to 7, which will cause the NTP daemon to make requests to the
corresponding NTP server at least every 128 seconds. The default maximum poll
interval is 10 (1024 seconds) for both `ntpd` and `chronyd`.
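The poll interval is 2^maxpoll seconds, so the values mentioned above translate
as follows:

```shell
# NTP poll intervals are powers of two: interval = 2^maxpoll seconds.
for p in 6 7 8 10; do
  printf 'maxpoll %2d -> %4d seconds\n' "$p" "$((1 << p))"
done
```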
| |
| [NOTE] |
| ==== |
If using a custom `maxpoll` interval, don't set `maxpoll` too low (e.g., lower
than 6) to avoid flooding NTP servers, especially public ones. Otherwise they
may blacklist the client (i.e., the NTP daemon at your machine) and stop
providing NTP service altogether. If in doubt, consult the `ntp.conf` or
`chrony.conf` manual page, respectively.
| ==== |
| |
| |
| ==== Troubleshooting NTP Stability Problems |
| |
As of Kudu 1.6.0, both `kudu-master` and `kudu-tserver` are able to continue to
operate during a brief loss of clock synchronization. If clock synchronization
is lost for several hours, they may crash. If a `kudu-master` or `kudu-tserver`
process crashes due to clock synchronization issues, consult the `ERROR` log
for a dump of related information which may help diagnose the issue.
| |
| TIP: Kudu 1.5.0 and earlier versions were less resilient to brief NTP outages. In |
| addition, they contained a link:https://issues.apache.org/jira/browse/KUDU-2209[bug] |
| which could cause Kudu to incorrectly measure the maximum error, resulting in |
| crashes. If you experience crashes related to clock synchronization on these |
| earlier versions of Kudu and it appears that the system's NTP configuration |
| is correct, consider upgrading to Kudu 1.6.0 or later. |
| |
TIP: If using NTP servers other than link-local ones, it may take some time
for the NTP server running on the local machine to synchronize with one of its
reference servers in case of network connectivity issues. With a spotty network
between the machine and the reference NTP servers, `ntpd`/`chronyd` may become
unsynchronized from its reference NTP servers. If that happens, consider
finding another set of reference NTP servers: the best bet is to use NTP
servers on the local network or the `*.pool.ntp.org` servers.
| |
| |
| [[disk_space_usage]] |
| == Disk Space Usage |
| |
| When using the log block manager (the default on Linux), Kudu uses |
| link:https://en.wikipedia.org/wiki/Sparse_file[sparse files] to store data. A |
| sparse file has a different apparent size than the actual amount of disk space |
| it uses. This means that some tools may inaccurately report the disk space |
| used by Kudu. For example, the size listed by `ls -l` does not accurately |
| reflect the disk space used by Kudu data files: |
| |
| ---- |
| $ ls -lh /data/kudu/tserver/data |
| total 117M |
| -rw------- 1 kudu kudu 160M Mar 26 19:37 0b9807b8b17d48a6a7d5b16bf4ac4e6d.data |
| -rw------- 1 kudu kudu 4.4K Mar 26 19:37 0b9807b8b17d48a6a7d5b16bf4ac4e6d.metadata |
| -rw------- 1 kudu kudu 32M Mar 26 19:37 2f26eeacc7e04b65a009e2c9a2a8bd20.data |
| -rw------- 1 kudu kudu 4.3K Mar 26 19:37 2f26eeacc7e04b65a009e2c9a2a8bd20.metadata |
| -rw------- 1 kudu kudu 672M Mar 26 19:37 30a2dd2cd3554d8a9613f588a8d136ff.data |
| -rw------- 1 kudu kudu 4.4K Mar 26 19:37 30a2dd2cd3554d8a9613f588a8d136ff.metadata |
| -rw------- 1 kudu kudu 32M Mar 26 19:37 7434c83c5ec74ae6af5974e4909cbf82.data |
| -rw------- 1 kudu kudu 4.3K Mar 26 19:37 7434c83c5ec74ae6af5974e4909cbf82.metadata |
| -rw------- 1 kudu kudu 672M Mar 26 19:37 772d070347a04f9f8ad2ad3241440090.data |
| -rw------- 1 kudu kudu 4.4K Mar 26 19:37 772d070347a04f9f8ad2ad3241440090.metadata |
| -rw------- 1 kudu kudu 160M Mar 26 19:37 86e50a95531f46b6a79e671e6f5f4151.data |
| -rw------- 1 kudu kudu 4.4K Mar 26 19:37 86e50a95531f46b6a79e671e6f5f4151.metadata |
| -rw------- 1 kudu kudu 687 Mar 26 19:26 block_manager_instance |
| ---- |
| |
Notice that the total size reported is 117MiB, while the first file's size is
listed as 160MiB. Adding the `-s` option to `ls` will cause `ls` to output each
file's actual disk space usage.
| |
| The `du` and `df` utilities report the actual disk space usage by default. |
| |
| ---- |
| $ du -h /data/kudu/tserver/data |
| 118M /data/kudu/tserver/data |
| ---- |
| |
| The apparent size can be shown with the `--apparent-size` flag to `du`. |
| |
| ---- |
| $ du -h --apparent-size /data/kudu/tserver/data |
| 1.7G /data/kudu/tserver/data |
| ---- |
| |
| [[crash_reporting]] |
| == Reporting Kudu Crashes |
| |
| Kudu uses the |
| link:https://chromium.googlesource.com/breakpad/breakpad/[Google Breakpad] |
| library to generate a minidump whenever Kudu experiences a crash. These |
| minidumps are typically only a few MB in size and are generated even if core |
dump generation is disabled. At this time, generating minidumps is only
possible on Linux builds of Kudu.
| |
| A minidump file contains important debugging information about the process that |
| crashed, including shared libraries loaded and their versions, a list of |
| threads running at the time of the crash, the state of the processor registers |
| and a copy of the stack memory for each thread, and CPU and operating system |
| version information. |
| |
| It is also possible to force Kudu to create a minidump without killing the |
| process by sending a `USR1` signal to the `kudu-tserver` or `kudu-master` |
| process. For example: |
| |
| ---- |
| sudo pkill -USR1 kudu-tserver |
| ---- |
| |
| By default, Kudu stores its minidumps in a subdirectory of its configured glog |
| directory called `minidumps`. This location can be customized by setting the |
| `--minidump_path` flag. Kudu will retain only a certain number of minidumps |
| before deleting the oldest ones, in an effort to avoid filling up the disk with |
| minidump files. The maximum number of minidumps that will be retained can be |
| controlled by setting the `--max_minidumps` gflag. |
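| For example, both flags could be set in the daemon's gflags flag file (the
| path and retention count below are illustrative, not defaults):
| 
| ----
| --minidump_path=/var/lib/kudu/minidumps
| --max_minidumps=9
| ----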
| |
| Minidumps contain information specific to the binary that created them and so |
| are not usable without access to the exact binary that crashed, or a very |
| similar binary. For more information on processing and using minidump files,
| see `scripts/dump_breakpad_symbols.py` in the Kudu source repository.
| |
| NOTE: A minidump can be emailed to a Kudu developer or attached to a JIRA in |
| order to help a Kudu developer debug a crash. In order for it to be useful, the |
| developer will need to know the exact version of Kudu and the operating system |
| where the crash was observed. Note that while a minidump does not contain a |
| heap memory dump, it does contain stack memory and therefore it is possible for |
| application data to appear in a minidump. If confidential or personal |
| information is stored on the cluster, do not share minidump files. |
| |
| == Performance Troubleshooting |
| |
| [[kudu_tracing]] |
| === Kudu Tracing |
| |
| The `kudu-master` and `kudu-tserver` daemons include built-in tracing support |
| based on the open source |
| link:https://www.chromium.org/developers/how-tos/trace-event-profiling-tool[Chromium Tracing] |
| framework. You can use tracing to help diagnose latency issues or other problems |
| on Kudu servers. |
| |
| ==== Accessing the tracing interface |
| |
| The tracing interface is accessed via a web browser as part of the |
| embedded web server in each of the Kudu daemons. |
| |
| .Tracing Interface URLs |
| [options="header"] |
| |=== |
| | Daemon | URL |
| | Tablet Server | http://tablet-server-1.example.com:8050/tracing.html |
| | Master | http://master-1.example.com:8051/tracing.html |
| |=== |
| |
| WARNING: The tracing interface is known to work in recent versions of Google Chrome. |
| Other browsers may not work as expected. |
| |
| ==== Collecting a trace |
| |
| After navigating to the tracing interface, click the *Record* button on the top left corner |
| of the screen. When beginning to diagnose a problem, start by selecting all categories. |
| Click *Record* to begin recording a trace. |
| |
| During trace collection, events are collected into an in-memory ring buffer.
| This ring buffer is fixed in size, so it will eventually fill up. Once full, new
| events continue to be collected while the oldest events are discarded. While recording the trace,
| trigger the behavior or workload you are interested in exploring. |
| |
| After collecting for several seconds, click *Stop*. The collected trace will be |
| downloaded and displayed. Use the *?* key to display help text about using the tracing |
| interface to explore the trace. |
| |
| ==== Saving a trace |
| |
| You can save collected traces as JSON files for later analysis by clicking *Save* |
| after collecting the trace. To load and analyze a saved JSON file, click *Load* |
| and choose the file. |
| |
| === RPC Timeout Traces |
| |
| If client applications are experiencing RPC timeouts, the Kudu tablet server |
| `WARNING` level logs should contain a log entry which includes an RPC-level trace. For example: |
| |
| ---- |
| W0922 00:56:52.313848 10858 inbound_call.cc:193] Call kudu.consensus.ConsensusService.UpdateConsensus |
| from 192.168.1.102:43499 (request call id 3555909) took 1464ms (client timeout 1000). |
| W0922 00:56:52.314888 10858 inbound_call.cc:197] Trace: |
| 0922 00:56:50.849505 (+ 0us) service_pool.cc:97] Inserting onto call queue |
| 0922 00:56:50.849527 (+ 22us) service_pool.cc:158] Handling call |
| 0922 00:56:50.849574 (+ 47us) raft_consensus.cc:1008] Updating replica for 2 ops |
| 0922 00:56:50.849628 (+ 54us) raft_consensus.cc:1050] Early marking committed up to term: 8 index: 880241 |
| 0922 00:56:50.849968 (+ 340us) raft_consensus.cc:1056] Triggering prepare for 2 ops |
| 0922 00:56:50.850119 (+ 151us) log.cc:420] Serialized 1555 byte log entry |
| 0922 00:56:50.850213 (+ 94us) raft_consensus.cc:1131] Marking committed up to term: 8 index: 880241 |
| 0922 00:56:50.850218 (+ 5us) raft_consensus.cc:1148] Updating last received op as term: 8 index: 880243 |
| 0922 00:56:50.850219 (+ 1us) raft_consensus.cc:1195] Filling consensus response to leader. |
| 0922 00:56:50.850221 (+ 2us) raft_consensus.cc:1169] Waiting on the replicates to finish logging |
| 0922 00:56:52.313763 (+1463542us) raft_consensus.cc:1182] finished |
| 0922 00:56:52.313764 (+ 1us) raft_consensus.cc:1190] UpdateReplicas() finished |
| 0922 00:56:52.313788 (+ 24us) inbound_call.cc:114] Queueing success response |
| ---- |
| |
| These traces can give an indication of which part of the request was slow. Please |
| include them in bug reports related to RPC latency outliers. |
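| When hunting for these entries across a large log, the `WARNING` log can be
| searched for calls that exceeded the client timeout. A sketch, assuming the
| default log location under `/var/log/kudu`:
| 
| ----
| $ grep -A 20 'client timeout' /var/log/kudu/kudu-tserver.WARNING
| ----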
| |
| === Kernel Stack Watchdog Traces |
| |
| Each Kudu server process has a background thread called the Stack Watchdog, which |
| monitors the other threads in the server in case they have blocked for |
| longer-than-expected periods of time. These traces can indicate operating system issues |
| or bottlenecked storage. |
| |
| When the watchdog thread identifies a case of thread blockage, it logs an entry |
| in the `WARNING` log like the following: |
| |
| ---- |
| W0921 23:51:54.306350 10912 kernel_stack_watchdog.cc:111] Thread 10937 stuck at /data/kudu/consensus/log.cc:505 for 537ms: |
| Kernel stack: |
| [<ffffffffa00b209d>] do_get_write_access+0x29d/0x520 [jbd2] |
| [<ffffffffa00b2471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2] |
| [<ffffffffa00fe6d8>] __ext4_journal_get_write_access+0x38/0x80 [ext4] |
| [<ffffffffa00d9b23>] ext4_reserve_inode_write+0x73/0xa0 [ext4] |
| [<ffffffffa00d9b9c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] |
| [<ffffffffa00d9e90>] ext4_dirty_inode+0x40/0x60 [ext4] |
| [<ffffffff811ac48b>] __mark_inode_dirty+0x3b/0x160 |
| [<ffffffff8119c742>] file_update_time+0xf2/0x170 |
| [<ffffffff8111c1e0>] __generic_file_aio_write+0x230/0x490 |
| [<ffffffff8111c4c8>] generic_file_aio_write+0x88/0x100 |
| [<ffffffffa00d3fb1>] ext4_file_write+0x61/0x1e0 [ext4] |
| [<ffffffff81180f5b>] do_sync_readv_writev+0xfb/0x140 |
| [<ffffffff81181ee6>] do_readv_writev+0xd6/0x1f0 |
| [<ffffffff81182046>] vfs_writev+0x46/0x60 |
| [<ffffffff81182102>] sys_pwritev+0xa2/0xc0 |
| [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
| [<ffffffffffffffff>] 0xffffffffffffffff |
| |
| User stack: |
| @ 0x3a1ace10c4 (unknown) |
| @ 0x1262103 (unknown) |
| @ 0x12622d4 (unknown) |
| @ 0x12603df (unknown) |
| @ 0x8e7bfb (unknown) |
| @ 0x8f478b (unknown) |
| @ 0x8f55db (unknown) |
| @ 0x12a7b6f (unknown) |
| @ 0x3a1b007851 (unknown) |
| @ 0x3a1ace894d (unknown) |
| @ (nil) (unknown) |
| ---- |
| |
| These traces can be useful for diagnosing root-cause latency issues when they are caused by systems |
| below Kudu, such as disk controllers or filesystems. |
| |
| [[memory_limits]] |
| === Memory Limits |
| |
| Kudu has a hard and a soft memory limit. The hard memory limit is the maximum
| amount of memory a Kudu process is allowed to use, and is controlled by the
| `--memory_limit_hard_bytes` flag. The soft memory limit is a percentage of the
| hard memory limit, controlled by the `--memory_limit_soft_percentage` flag
| (default 80%); once a process's memory usage exceeds the soft limit, it will
| start rejecting some write operations.
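| As a concrete illustration, with an example hard limit of 4GiB (an assumed
| value, not a default) and the default soft percentage, the soft limit works
| out to roughly 3.2GiB:
| 
| ----
| $ HARD_LIMIT_BYTES=$((4 * 1024 * 1024 * 1024))   # --memory_limit_hard_bytes (example)
| $ SOFT_PCT=80                                    # --memory_limit_soft_percentage default
| $ echo $((HARD_LIMIT_BYTES * SOFT_PCT / 100))    # soft limit, in bytes
| 3435973836
| ----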
| |
| If the logs or RPC traces contain messages like |
| |
| ---- |
| Service unavailable: Soft memory limit exceeded (at 96.35% of capacity) |
| ---- |
| |
| then Kudu is rejecting writes due to memory backpressure. This may result in write timeouts. There |
| are several ways to relieve the memory pressure on Kudu: |
| |
| - If the host has more memory available for Kudu, increase `--memory_limit_hard_bytes`. |
| - Increase the rate at which Kudu can flush writes from memory to disk by increasing the number of |
| disks or increasing the number of maintenance manager threads `--maintenance_manager_num_threads`. |
| Generally, the recommended ratio of maintenance manager threads to data directories is 1:3. |
| - Reduce the volume of writes flowing to Kudu on the application side. |
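| For example, following the 1:3 guideline, a tablet server configured with nine
| data directories (a hypothetical count) would use three maintenance manager
| threads:
| 
| ----
| $ DATA_DIRS=9                 # number of configured data directories (example)
| $ echo $((DATA_DIRS / 3))     # suggested --maintenance_manager_num_threads
| 3
| ----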
| |
| Finally, on versions of Kudu prior to 1.8, check the value of |
| `--block_cache_capacity_mb`. This setting determines the maximum size of Kudu's |
| block cache. While a higher value can help with read and write performance, |
| do not raise `--block_cache_capacity_mb` above the memory pressure threshold, |
| which is `--memory_pressure_percentage` (default 60%) of |
| `--memory_limit_hard_bytes`, as this will cause Kudu to flush aggressively even |
| if write throughput is low. Keeping `--block_cache_capacity_mb` below 50% of the |
| memory pressure threshold is recommended. With the defaults, this means |
| `--block_cache_capacity_mb` should not exceed 30% of |
| `--memory_limit_hard_bytes`. On Kudu 1.8 and higher, servers will refuse to |
| start if the block cache capacity exceeds the memory pressure threshold. |
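| With the default percentages and an example 4GiB hard limit (an assumed
| value), the recommended block cache ceiling can be computed directly:
| 
| ----
| $ HARD=$((4 * 1024 * 1024 * 1024))      # --memory_limit_hard_bytes (example)
| $ PRESSURE=$((HARD * 60 / 100))         # memory pressure threshold (60%)
| $ echo $((PRESSURE / 2 / 1024 / 1024))  # suggested --block_cache_capacity_mb ceiling
| 1228
| ----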
| |
| [[block_cache_size]] |
| === Block Cache Size |
| |
| Kudu uses an LRU cache for recently read data. On workloads that scan a subset |
| of the data repeatedly, raising the size of this cache can offer significant |
| performance benefits. To increase the amount of memory dedicated to the block |
| cache, increase the value of the flag `--block_cache_capacity_mb`. The default |
| is 512MiB. |
| |
| Kudu provides a set of useful metrics for evaluating the performance of the |
| block cache, which can be found on the `/metrics` endpoint of the web UI. An |
| example set: |
| |
| ---- |
| { |
| "name": "block_cache_inserts", |
| "value": 64 |
| }, |
| { |
| "name": "block_cache_lookups", |
| "value": 512 |
| }, |
| { |
| "name": "block_cache_evictions", |
| "value": 0 |
| }, |
| { |
| "name": "block_cache_misses", |
| "value": 96 |
| }, |
| { |
| "name": "block_cache_misses_caching", |
| "value": 64 |
| }, |
| { |
| "name": "block_cache_hits", |
| "value": 0 |
| }, |
| { |
| "name": "block_cache_hits_caching", |
| "value": 352 |
| }, |
| { |
| "name": "block_cache_usage", |
| "value": 6976 |
| } |
| ---- |
| |
| To judge the efficiency of the block cache on a tablet server, first wait until
| the server has been running and serving normal requests for some time, so the
| cache is not cold. Unless the server stores very little data or is idle,
| `block_cache_usage` (reported in bytes) should be at or near the configured
| block cache capacity (`--block_cache_capacity_mb`).
| Once the cache has reached steady state, compare `block_cache_lookups` to |
| `block_cache_misses_caching`. The latter metric counts the number of blocks that |
| Kudu expected to read from cache but which weren't found in the cache. If a |
| significant amount of lookups result in misses on expected cache hits, and the |
| `block_cache_evictions` metric is significant compared to `block_cache_inserts`, |
| then raising the size of the block cache may provide a performance boost. |
| However, the utility of the block cache is highly dependent on workload, so it's |
| necessary to test the benefits of a larger block cache. |
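| Using the sample metric values above, the comparison can be scripted (a
| sketch; in practice the values would be scraped from the `/metrics` endpoint):
| 
| ----
| $ LOOKUPS=512; MISSES_CACHING=64
| $ echo "$((MISSES_CACHING * 100 / LOOKUPS))% of lookups missed expected cache hits"
| 12% of lookups missed expected cache hits
| ----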
| |
| WARNING: Do not raise the block cache size `--block_cache_capacity_mb` higher |
| than the memory pressure threshold (defaults to 60% of `--memory_limit_hard_bytes`). |
| As this would cause poor flushing behavior, Kudu servers version 1.8 and higher |
| will refuse to start when misconfigured in this way. |
| |
| [[heap_sampling]] |
| === Heap Sampling |
| |
| For advanced debugging of memory usage, released builds of Kudu enable Heap Sampling by default. |
| This allows Kudu developers to associate memory usage with the specific lines of code and data |
| structures responsible. When reporting a bug related to memory usage or an apparent memory leak, |
| heap profiling can give quantitative data to pinpoint the issue. |
| |
| If heap sampling is enabled, the current sampled heap occupancy can be retrieved over HTTP |
| by visiting `http://tablet-server.example.com:8050/pprof/heap` or |
| `http://master.example.com:8051/pprof/heap`. The output is a machine-readable dump of the |
| stack traces with their associated heap usage. |
| |
| Rather than visiting the heap profile page directly in a web browser, it is typically |
| more useful to use the `pprof` tool that is distributed as part of the `gperftools` |
| open source project. For example, a developer with a local build tree can use the |
| following command to collect the sampled heap usage and output an SVG diagram: |
| |
| ---- |
| thirdparty/installed/uninstrumented/bin/pprof -svg 'http://localhost:8051/pprof/heap' > /tmp/heap.svg |
| ---- |
| |
| The resulting SVG may be visualized in a web browser or sent to the Kudu community to help |
| troubleshoot memory occupancy issues. |
| |
| TIP: Heap samples contain only summary information about allocations and do not contain any |
| _data_ from the heap. It is safe to share heap samples in public without fear of exposing |
| confidential or sensitive data. |
| |
| [[slow_dns_nscd]] |
| === Slow DNS Lookups and `nscd` |
| |
| For better scalability on nodes hosting many replicas, we recommend that you use |
| `nscd` (name service cache daemon) to cache both DNS name resolution and static name resolution (via `/etc/hosts`). |
| |
| When DNS lookups are slow, you will see a log message similar to the following: |
| |
| ---- |
| W0926 11:19:01.339553 27231 net_util.cc:193] Time spent resolve address for kudu-tserver.example.com: real 4.647s user 0.000s sys 0.000s |
| ---- |
| |
| `nscd` can alleviate slow name resolution by providing a cache for the most
| common name service requests, such as those for passwords, groups, and hosts.
| |
| Refer to your operating system documentation for how to install and enable `nscd`. |
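| A quick way to gauge resolution latency on a node is to time a lookup directly
| (the hostname below is a placeholder):
| 
| ----
| $ time getent hosts kudu-tserver.example.com
| ----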
| |
| == Issues using Kudu |
| |
| [[hive_handler]] |
| === ClassNotFoundException: com.cloudera.kudu.hive.KuduStorageHandler |
| |
| Users will encounter this exception when trying to use a Kudu table via Hive. This |
| is not a case of a missing jar, but simply that Impala stores Kudu metadata in |
| Hive in a format that's unreadable to other tools, including Hive itself and Spark. |
| There is no workaround for Hive users. Spark users need to create temporary tables. |
| |
| [[too_many_threads]] |
| === Runtime error: Could not create thread: Resource temporarily unavailable (error 11) |
| |
| Users will encounter this error when Kudu is unable to create more threads, |
| usually on versions of Kudu older than 1.7. It happens on tablet servers, and |
| is a sign that the tablet server hosts too many tablet replicas. To fix the |
| issue, users can raise the `nproc` ulimit as detailed in the documentation for |
| their operating system or distribution. However, the better solution is to |
| reduce the number of replicas on the tablet server. This may involve rethinking |
| the table's partitioning schema. For the recommended limits on number of |
| replicas per tablet server, see the known issues and scaling limitations |
| documentation for the appropriate Kudu release. The |
| link:http://kudu.apache.org/releases/[releases page] has links to documentation |
| for previous versions of Kudu; for the latest release, see the |
| link:known_issues.html[known issues page]. |
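| The current `nproc` limit can be inspected with `ulimit`; a persistent raise
| is typically configured via `/etc/security/limits.conf` or a drop-in file (the
| values below are illustrative):
| 
| ----
| $ ulimit -u                    # show the current limit on processes/threads
| # /etc/security/limits.d/kudu.conf (illustrative):
| #   kudu  soft  nproc  32768
| #   kudu  hard  nproc  32768
| ----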
| |
| [[tombstoned_or_stopped_tablets]] |
| === Tombstoned or STOPPED tablet replicas |
| |
| Users may notice some replicas on a tablet server are in a STOPPED state, and |
| remain on the server indefinitely. These replicas are tombstones. A tombstone |
| indicates that the tablet server once held a bona fide replica of its tablet. |
| For example, if a tablet server goes down and its replicas are re-replicated |
| elsewhere, if the tablet server rejoins the cluster its replicas will become |
| tombstones. A tombstone will remain until the table it belongs to is deleted, or |
| a new replica of the same tablet is placed on the tablet server. A count of |
| tombstoned replicas and details of each one are available on the /tablets page |
| of the tablet server web UI. |
| |
| The Raft consensus algorithm that Kudu uses for replication requires tombstones |
| for correctness in certain rare situations. They consume minimal resources and |
| hold no data. They must not be deleted. |
| |
| [[cfile_corruption]] |
| === Corruption: checksum error on CFile block |
| |
| In versions prior to Kudu 1.8.0, if the data on disk becomes corrupt, users |
| will encounter warnings containing "Corruption: checksum error on CFile block" |
| in the tablet server logs and client side errors when trying to scan tablets |
| with corrupt CFile blocks. Fixing this corruption is a manual process. |
| |
| To fix the issue, users can first identify all the affected tablets by |
| running a checksum scan on the affected tables or tablets using the |
| `link:command_line_tools_reference.html#cluster-ksck[ksck]` tool. |
| |
| ---- |
| sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tables=<tables> |
| sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tablets=<tablets> |
| ---- |
| |
| If there is at least one replica for each tablet that does not return a corruption
| error, you can repair the bad copies by deleting them and forcing them to be
| re-replicated from the leader using the
| `link:command_line_tools_reference.html#remote_replica-delete[remote_replica delete]` tool.
| |
| ---- |
| sudo -u kudu kudu remote_replica delete <tserver_address> <tablet_id> "Cfile Corruption" |
| ---- |
| |
| If all of the replicas are corrupt, then some data loss has occurred.
| Until link:https://issues.apache.org/jira/browse/KUDU-2526[KUDU-2526] is
| completed, this can happen if the corrupt replica became the leader and the
| existing follower replicas were replaced.
| |
| If data has been lost, you can repair the table by replacing the corrupt tablet |
| with an empty one using the |
| `link:command_line_tools_reference.html#tablet-unsafe_replace_tablet[unsafe_replace_tablet]` tool. |
| |
| ---- |
| sudo -u kudu kudu tablet unsafe_replace_tablet <master_addresses> <tablet_id> |
| ---- |
| |
| From version 1.8.0 onwards, Kudu marks the affected replicas as failed,
| leading to their automatic re-replication elsewhere.
| |
| [[symbolizing_stack_traces]] |
| === Symbolizing stack traces |
| Sometimes you might see the following in the logs: |
| |
| ---- |
| 0323 03:59:31.091198 (+607857us) spinlock_profiling.cc:243] Waited 492 ms on lock 0x4cb0960. stack: 0000000002398852 0000000000ad8c69 0000000000aa62ba 000000000221aaa8 000000000221b1a8 00000000023a8f83 00007fa8b818be24 00007fa8b646a34c |
| ---- |
| |
| That's usually a sign of high contention among threads to acquire a lock; in
| this case the reported time shows how long a thread spent waiting to acquire
| the lock. The call stack addresses listed help to restore the stack trace of
| the waiting thread and pinpoint the problem in the code.
| |
| Given the binary that produced the output (in this example, `kudu-master`),
| it's possible to translate the addresses into function names and line numbers
| in the code. If the binary is stripped of symbols and debug information, the
| translation is still possible as long as separate debug information for the
| binary is available.
| |
| Assuming both the stripped release binary and the debug information are |
| available as RPMs, unpack them into a directory (e.g., `sysroot`): |
| ---- |
| $ mkdir sysroot && cd sysroot |
| $ rpm2cpio ../kudu-1.10.0.el7.x86_64.rpm | cpio -idmv |
| $ rpm2cpio ../kudu-debuginfo-1.10.0.el7.x86_64.rpm | cpio -idmv |
| ---- |
| |
| Use `addr2line` to find the line in the code for each stack address. If the
| binary is not stripped of debug information, supply the actual binary with the
| `-e` option instead of the debug info file:
| |
| ---- |
| addr2line -C -f -e usr/lib/debug/usr/lib/kudu/sbin-release/kudu-master.debug 0x0000000000aa62ba |
| kudu::master::MasterServiceImpl::ConnectToMaster(kudu::master::ConnectToMasterRequestPB const*, kudu::master::ConnectToMasterResponsePB*, kudu::rpc::RpcContext*) |
| /usr/src/debug/kudu-1.10.0/src/kudu/master/master_service.cc:504 |
| ---- |
| |
| To achieve the same with `gdb`, first find the address of the `.text` section |
| in the symbol file (in the example, `0000000000a2cdb0`): |
| ---- |
| $ readelf -S usr/lib/debug/usr/lib/kudu/sbin-release/kudu-master.debug | grep .text |
| [13] .text NOBITS 0000000000a2cdb0 000002c0 |
| ---- |
| |
| Then start up `gdb`, pointing it to the `kudu-master` executable (that's the |
| executable that produced the output in the log file): |
| ---- |
| gdb usr/lib/kudu/sbin-release/kudu-master |
| ---- |
| |
| Now load the `.debug` symbols into `gdb` using the address found above, tell |
| `gdb` where to find source files, and set the sysroot: |
| ---- |
| (gdb) add-symbol-file usr/lib/debug/usr/lib/kudu/sbin-release/kudu-master.debug 0x0000000000a2cdb0 |
| (gdb) set substitute-path /usr/src/debug/kudu-1.10.0 usr/src/debug/kudu-1.10.0 |
| (gdb) set sysroot . |
| ---- |
| |
| To translate the address into line number and function information, use |
| `info line * <address>`: |
| ---- |
| (gdb) info line * 0x0000000000aa62ba |
| Line 504 of "/usr/src/debug/kudu-1.10.0/src/kudu/master/master_service.cc" |
| starts at address 0xaa62af <kudu::master::MasterServiceImpl::ConnectToMaster(kudu::master::ConnectToMasterRequestPB const*, kudu::master::ConnectToMasterResponsePB*, kudu::rpc::RpcContext*)+47> |
| and ends at 0xaa62bb <kudu::master::MasterServiceImpl::ConnectToMaster(kudu::master::ConnectToMasterRequestPB const*, kudu::master::ConnectToMasterResponsePB*, kudu::rpc::RpcContext*)+59>. |
| ---- |