| = Diving Deep, Use External Tools |
| |
Machine access allows operators to dive even deeper than logs and
`nodetool` allow. While every Cassandra operator may have their personal
favorite toolset for troubleshooting issues, this page contains some of
the most common operator techniques and examples of those tools. Many of
these commands work only on Linux, but if you are deploying on a
different operating system you may have access to substantially similar
tools that assess the same OS-level metrics and processes.
| |
| == JVM Tooling |
| |
| The JVM ships with a number of useful tools. Some of them are useful for |
| debugging Cassandra issues, especially related to heap and execution |
| stacks. |
| |
| *NOTE*: There are two common gotchas with JVM tooling and Cassandra: |
| |
| [arabic] |
. By default Cassandra ships with `-XX:+PerfDisableSharedMem` set to
prevent long pauses (see `CASSANDRA-9242` and `CASSANDRA-9483` for
details). If you want to use JVM tooling, remove that flag and instead
mount `/tmp` on an in-memory `tmpfs`, which also effectively works
around `CASSANDRA-9242`.
. Make sure you run the tools as the same user that Cassandra runs as,
e.g. if the database runs as `cassandra` the tool also has to be run as
`cassandra`, for example via `sudo -u cassandra <cmd>` (see the sketch
below).
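
For example, a minimal sketch combining both points, assuming the
Cassandra process was started from the standard `CassandraDaemon` main
class:

[source, bash]
----
# find the Cassandra PID (assumes the command line contains CassandraDaemon)
CASSANDRA_PID=$(pgrep -f CassandraDaemon)

# run JVM tooling as the cassandra user, e.g. ten jstat samples at 500ms
sudo -u cassandra jstat -gcutil "$CASSANDRA_PID" 500ms 10
----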
| |
| === Garbage Collection State (jstat) |
| |
If you suspect heap pressure you can use `jstat` to dive deep into the
garbage collection state of a Cassandra process. This command is always
safe to run and yields detailed heap information including eden heap
usage (E), old generation heap usage (O), count of eden collections
(YGC), time spent in eden collections (YGCT), old/mixed generation
collections (FGC) and time spent in old/mixed generation collections
(FGCT):
| |
| [source, bash] |
| ---- |
| jstat -gcutil <cassandra pid> 500ms |
| S0 S1 E O M CCS YGC YGCT FGC FGCT GCT |
| 0.00 0.00 81.53 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| 0.00 0.00 85.94 31.16 93.07 88.20 12 0.151 3 0.257 0.408 |
| ---- |
| |
In this case we see a relatively healthy heap profile, with 31.16% old
generation heap usage and 83% eden usage. If the old generation is
routinely above 75% then you probably need more heap (assuming CMS with
a 75% occupancy threshold). Persistently high old gen usually means
either that you have under-provisioned the old generation heap, or that
there is too much live data on heap for Cassandra to collect (e.g.
because of memtables). Another thing to watch for is the time between
young garbage collections (YGC), which indicates how frequently the
eden heap is collected. Each young gc pause is about 20-50ms, so if you
have a lot of them your clients will notice in their high percentile
latencies.
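
If you want a quick way to spot persistently high old generation usage,
here is a minimal shell sketch (assuming the `-gcutil` column layout
shown above, where `O` is the fourth column):

[source, bash]
----
# sample once per second for 60 samples and flag old gen above 75%
jstat -gcutil <cassandra pid> 1s 60 | \
    awk 'NR > 1 && $4 > 75 { print "old gen at " $4 "%" }'
----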
| |
| === Thread Information (jstack) |
| |
To get a point-in-time snapshot of exactly what Cassandra is doing, run
`jstack` against the Cassandra PID. *Note* that this does pause the JVM
for a very brief period (<20ms):
| |
| [source, bash] |
| ---- |
| $ jstack <cassandra pid> > threaddump |
| |
| # display the threaddump |
| $ cat threaddump |
| |
| # look at runnable threads |
$ grep -B 1 RUNNABLE threaddump
| "Attach Listener" #15 daemon prio=9 os_prio=0 tid=0x00007f829c001000 nid=0x3a74 waiting on condition [0x0000000000000000] |
| java.lang.Thread.State: RUNNABLE |
| -- |
| "DestroyJavaVM" #13 prio=5 os_prio=0 tid=0x00007f82e800e000 nid=0x2a19 waiting on condition [0x0000000000000000] |
| java.lang.Thread.State: RUNNABLE |
| -- |
| "JPS thread pool" #10 prio=5 os_prio=0 tid=0x00007f82e84d0800 nid=0x2a2c runnable [0x00007f82d0856000] |
| java.lang.Thread.State: RUNNABLE |
| -- |
| "Service Thread" #9 daemon prio=9 os_prio=0 tid=0x00007f82e80d7000 nid=0x2a2a runnable [0x0000000000000000] |
| java.lang.Thread.State: RUNNABLE |
| -- |
| "C1 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007f82e80cc000 nid=0x2a29 waiting on condition [0x0000000000000000] |
| java.lang.Thread.State: RUNNABLE |
| -- |
| |
# Note that the nid is the Linux thread id (in hexadecimal)
| ---- |
| |
Some of the most important information in a threaddump is the set of
waiting/blocked threads, including which locks or monitors each thread
is blocked or waiting on.
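
Since the `nid` is the Linux thread id in hexadecimal, you can
cross-reference a CPU-hungry thread from `top` with its stack in the
threaddump. A minimal sketch (the thread id `10793` is just an example):

[source, bash]
----
# show per-thread CPU usage for the Cassandra process
$ top -b -n 1 -H -p <cassandra pid> | head -20

# convert a busy thread's decimal id into the hex nid used in the dump
$ printf 'nid=0x%x\n' 10793
nid=0x2a29

# find that thread's stack in the dump
$ grep -A 1 'nid=0x2a29' threaddump
----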
| |
| == Basic OS Tooling |
| |
A great place to start when debugging a Cassandra issue is understanding
how Cassandra is interacting with system resources. The following are
all resources that Cassandra makes heavy use of:
| |
* CPU cores. For executing concurrent user queries.
* CPU processing time. For query activity (data decompression, row
merging, etc.).
* CPU processing time (low priority). For background tasks (compaction,
streaming, etc.).
* RAM for Java Heap. Used to hold internal data-structures and, by
default, the Cassandra memtables. Heap space is a crucial component of
write performance as well as of performance in general.
| * RAM for OS disk cache. Used to cache frequently accessed SSTable |
| blocks. OS disk cache is a crucial component of read performance. |
| * Disks. Cassandra cares a lot about disk read latency, disk write |
| throughput, and of course disk space. |
| * Network latency. Cassandra makes many internode requests, so network |
| latency between nodes can directly impact performance. |
* Network throughput. Cassandra (like other databases) frequently has
the so-called "incast" problem, where a small request (e.g.
`SELECT * from foo.bar`) returns a very large result set (e.g. the
entire dataset). In such situations outgoing bandwidth is crucial.
| |
Troubleshooting Cassandra often comes down to identifying which of
these resources the machine or cluster is running out of, and then
either provisioning more of that resource or changing the query pattern
to make less use of it.
| |
| === High Level Resource Usage (top/htop) |
| |
Cassandra makes significant use of system resources, and often the very
first useful action is to run `top` or `htop`
(https://hisham.hm/htop/[website]) to see the state of the machine.
| |
| Useful things to look at: |
| |
| * System load levels. While these numbers can be confusing, generally |
| speaking if the load average is greater than the number of CPU cores, |
| Cassandra probably won't have very good (sub 100 millisecond) latencies. |
| See |
| http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html[Linux |
| Load Averages] for more information. |
* CPU utilization. `htop` in particular can help break down CPU
utilization into `user` (low and normal priority), `system` (kernel),
and `io-wait`. Cassandra query threads execute as normal priority
`user` threads, while compaction threads execute as low priority `user`
threads. High `system` time could indicate problems like thread
contention, and high `io-wait` may indicate slow disk drives. This can
help you understand where Cassandra is spending its processing
resources.
* Memory usage. Look at which programs have the most resident memory;
it is probably Cassandra. The number reported for Cassandra is likely
inaccurately high due to how Linux (as of 2018) accounts for
memory-mapped file memory.
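
As a quick sketch of the load average check described above:

[source, bash]
----
# 1, 5 and 15 minute load averages
$ uptime

# number of CPU cores to compare them against
$ nproc
----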
| |
| [[os-iostat]] |
| === IO Usage (iostat) |
| |
Use `iostat` to determine how data drives are faring, including latency
distributions, throughput, and utilization:
| |
| [source, bash] |
| ---- |
| $ sudo iostat -xdm 2 |
| Linux 4.13.0-13-generic (hostname) 07/03/2018 _x86_64_ (8 CPU) |
| |
| Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
| sda 0.00 0.28 0.32 5.42 0.01 0.13 48.55 0.01 2.21 0.26 2.32 0.64 0.37 |
| sdb 0.00 0.00 0.00 0.00 0.00 0.00 79.34 0.00 0.20 0.20 0.00 0.16 0.00 |
| sdc 0.34 0.27 0.76 0.36 0.01 0.02 47.56 0.03 26.90 2.98 77.73 9.21 1.03 |
| |
| Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
| sda 0.00 0.00 2.00 32.00 0.01 4.04 244.24 0.54 16.00 0.00 17.00 1.06 3.60 |
| sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 |
| sdc 0.00 24.50 0.00 114.00 0.00 11.62 208.70 5.56 48.79 0.00 48.79 1.12 12.80 |
| ---- |
| |
In this case we can see that `sdc` is a very slow drive, having an
`await` close to 50 milliseconds and an `avgqu-sz` close to 5 ios. The
drive is not particularly saturated (utilization is only 12.8%), but we
should still be concerned about how this would affect our p99 latency
since 50ms is quite long for typical Cassandra operations. That being
said, in this case most of the latency is present in writes (typically
writes are more latent than reads), which due to the LSM nature of
Cassandra is often hidden from the user.
| |
| Important metrics to assess using iostat: |
| |
| * Reads and writes per second. These numbers will change with the |
| workload, but generally speaking the more reads Cassandra has to do from |
| disk the slower Cassandra read latencies are. Large numbers of reads per |
| second can be a dead giveaway that the cluster has insufficient memory |
| for OS page caching. |
| * Write throughput. Cassandra's LSM model defers user writes and batches |
| them together, which means that throughput to the underlying medium is |
| the most important write metric for Cassandra. |
* Read latency (`r_await`). When Cassandra misses the OS page cache and
reads from SSTables, the read latency directly determines how fast
Cassandra can respond with the data.
* Write latency. Cassandra is less sensitive to write latency except
when it syncs the commit log; slow syncs typically show up only in the
very high percentiles of write latency.
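
If you only care about specific drives (e.g. a dedicated data volume),
`iostat` accepts a device list; a sketch assuming the data drive is
`sdc`:

[source, bash]
----
# extended stats in megabytes for just the data drive, every 2 seconds
$ sudo iostat -xdm sdc 2
----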
| |
Note that to get detailed latency breakdowns you will need a more
advanced tool such as xref:use_tools.adoc#use-bcc-tools[`bcc-tools`].
| |
=== OS Page Cache Usage
| |
| As Cassandra makes heavy use of memory mapped files, the health of the |
| operating system's https://en.wikipedia.org/wiki/Page_cache[Page Cache] |
| is crucial to performance. Start by finding how much available cache is |
| in the system: |
| |
| [source, bash] |
| ---- |
| $ free -g |
| total used free shared buff/cache available |
| Mem: 15 9 2 0 3 5 |
| Swap: 0 0 0 |
| ---- |
| |
In this case 9GB of memory is used by user processes (mostly the
Cassandra heap) and roughly 5GB is available for the OS page cache, of
which 3GB is actually being used to cache files. If most memory is used
and unavailable to the page cache, Cassandra performance can suffer
significantly. This is why Cassandra starts with a reasonably small
amount of memory reserved for the heap.
| |
| If you suspect that you are missing the OS page cache frequently you can |
| use advanced tools like xref:use_tools.adoc#use-bcc-tools[cachestat] or |
| xref:use_tools.adoc#use-vmtouch[vmtouch] to dive deeper. |
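
A lighter-weight way to watch for page cache misses (assuming the
`sysstat` package is installed) is the major page fault rate, since a
major fault is a read that missed the cache and had to go to disk:

[source, bash]
----
# paging statistics every second; watch the majflt/s column
$ sar -B 1
----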
| |
| === Network Latency and Reliability |
| |
Whenever Cassandra does writes or reads that involve other replicas,
`LOCAL_QUORUM` reads for example, one of the dominant effects on latency
is network latency. When trying to debug issues with multi-machine
operations, the network can be an important resource to investigate. You
can determine internode latency using tools like `ping` and
`traceroute`, or most effectively `mtr`:
| |
| [source, bash] |
| ---- |
| $ mtr -nr www.google.com |
| Start: Sun Jul 22 13:10:28 2018 |
| HOST: hostname Loss% Snt Last Avg Best Wrst StDev |
| 1.|-- 192.168.1.1 0.0% 10 2.0 1.9 1.1 3.7 0.7 |
| 2.|-- 96.123.29.15 0.0% 10 11.4 11.0 9.0 16.4 1.9 |
| 3.|-- 68.86.249.21 0.0% 10 10.6 10.7 9.0 13.7 1.1 |
| 4.|-- 162.141.78.129 0.0% 10 11.5 10.6 9.6 12.4 0.7 |
| 5.|-- 162.151.78.253 0.0% 10 10.9 12.1 10.4 20.2 2.8 |
| 6.|-- 68.86.143.93 0.0% 10 12.4 12.6 9.9 23.1 3.8 |
| 7.|-- 96.112.146.18 0.0% 10 11.9 12.4 10.6 15.5 1.6 |
| 9.|-- 209.85.252.250 0.0% 10 13.7 13.2 12.5 13.9 0.0 |
| 10.|-- 108.170.242.238 0.0% 10 12.7 12.4 11.1 13.0 0.5 |
| 11.|-- 74.125.253.149 0.0% 10 13.4 13.7 11.8 19.2 2.1 |
| 12.|-- 216.239.62.40 0.0% 10 13.4 14.7 11.5 26.9 4.6 |
| 13.|-- 108.170.242.81 0.0% 10 14.4 13.2 10.9 16.0 1.7 |
| 14.|-- 72.14.239.43 0.0% 10 12.2 16.1 11.0 32.8 7.1 |
| 15.|-- 216.58.195.68 0.0% 10 25.1 15.3 11.1 25.1 4.8 |
| ---- |
| |
In this example of `mtr`, we can rapidly assess the path our packets
take, as well as their typical loss and latency. Packet loss typically
leads to between `200ms` and `3s` of additional latency, so it is a
common cause of latency issues.
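
Since that loss-induced latency comes from TCP retransmissions, a quick
sanity check on a node is the kernel's retransmission counters. A
sketch, assuming the default internode port `7000`:

[source, bash]
----
# cumulative TCP retransmission counters since boot
$ netstat -s | grep -i retrans

# per-connection TCP info (including retransmits) for internode links
$ ss -ti '( dport = :7000 or sport = :7000 )'
----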
| |
| === Network Throughput |
| |
| As Cassandra is sensitive to outgoing bandwidth limitations, sometimes |
| it is useful to determine if network throughput is limited. One handy |
| tool to do this is |
| https://www.systutorials.com/docs/linux/man/8-iftop/[iftop] which shows |
| both bandwidth usage as well as connection information at a glance. An |
| example showing traffic during a stress run against a local `ccm` |
| cluster: |
| |
| [source, bash] |
| ---- |
| $ # remove the -t for ncurses instead of pure text |
| $ sudo iftop -nNtP -i lo |
| interface: lo |
| IP address is: 127.0.0.1 |
| MAC address is: 00:00:00:00:00:00 |
| Listening on lo |
| # Host name (port/service if enabled) last 2s last 10s last 40s cumulative |
| -------------------------------------------------------------------------------------------- |
| 1 127.0.0.1:58946 => 869Kb 869Kb 869Kb 217KB |
| 127.0.0.3:9042 <= 0b 0b 0b 0B |
| 2 127.0.0.1:54654 => 736Kb 736Kb 736Kb 184KB |
| 127.0.0.1:9042 <= 0b 0b 0b 0B |
| 3 127.0.0.1:51186 => 669Kb 669Kb 669Kb 167KB |
| 127.0.0.2:9042 <= 0b 0b 0b 0B |
| 4 127.0.0.3:9042 => 3.30Kb 3.30Kb 3.30Kb 845B |
| 127.0.0.1:58946 <= 0b 0b 0b 0B |
| 5 127.0.0.1:9042 => 2.79Kb 2.79Kb 2.79Kb 715B |
| 127.0.0.1:54654 <= 0b 0b 0b 0B |
| 6 127.0.0.2:9042 => 2.54Kb 2.54Kb 2.54Kb 650B |
| 127.0.0.1:51186 <= 0b 0b 0b 0B |
| 7 127.0.0.1:36894 => 1.65Kb 1.65Kb 1.65Kb 423B |
| 127.0.0.5:7000 <= 0b 0b 0b 0B |
| 8 127.0.0.1:38034 => 1.50Kb 1.50Kb 1.50Kb 385B |
| 127.0.0.2:7000 <= 0b 0b 0b 0B |
| 9 127.0.0.1:56324 => 1.50Kb 1.50Kb 1.50Kb 383B |
| 127.0.0.1:7000 <= 0b 0b 0b 0B |
| 10 127.0.0.1:53044 => 1.43Kb 1.43Kb 1.43Kb 366B |
| 127.0.0.4:7000 <= 0b 0b 0b 0B |
| -------------------------------------------------------------------------------------------- |
| Total send rate: 2.25Mb 2.25Mb 2.25Mb |
| Total receive rate: 0b 0b 0b |
| Total send and receive rate: 2.25Mb 2.25Mb 2.25Mb |
| -------------------------------------------------------------------------------------------- |
| Peak rate (sent/received/total): 2.25Mb 0b 2.25Mb |
| Cumulative (sent/received/total): 576KB 0B 576KB |
| ============================================================================================ |
| ---- |
| |
In this case we can see that bandwidth is fairly evenly shared between
many peers, but if the total were getting close to the rated capacity
of the NIC, or were focused on a single client, that could be a clue as
to what issue is occurring.
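
To focus on just internode traffic you can hand `iftop` a pcap filter;
a sketch assuming the default internode port `7000` and interface
`eth0`:

[source, bash]
----
# only show traffic to or from the internode messaging port
$ sudo iftop -nNP -i eth0 -f "port 7000"
----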
| |
| == Advanced tools |
| |
| Sometimes as an operator you may need to really dive deep. This is where |
| advanced OS tooling can come in handy. |
| |
| [[use-bcc-tools]] |
| === bcc-tools |
| |
| Most modern Linux distributions (kernels newer than `4.1`) support |
| https://github.com/iovisor/bcc[bcc-tools] for diving deep into |
| performance problems. First install `bcc-tools`, e.g. via `apt` on |
| Debian: |
| |
| [source, bash] |
| ---- |
| $ apt install bcc-tools |
| ---- |
| |
| Then you can use all the tools that `bcc-tools` contains. One of the |
| most useful tools is `cachestat` |
| (https://github.com/iovisor/bcc/blob/master/tools/cachestat_example.txt[cachestat |
| examples]) which allows you to determine exactly how many OS page cache |
| hits and misses are happening: |
| |
| [source, bash] |
| ---- |
| $ sudo /usr/share/bcc/tools/cachestat -T 1 |
| TIME TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB |
| 18:44:08 66 66 0 64 88 4427 |
| 18:44:09 40 40 0 75 88 4427 |
| 18:44:10 4353 45 4308 203 88 4427 |
| 18:44:11 84 77 7 13 88 4428 |
| 18:44:12 2511 14 2497 14 88 4428 |
| 18:44:13 101 98 3 18 88 4428 |
| 18:44:14 16741 0 16741 58 88 4428 |
| 18:44:15 1935 36 1899 18 88 4428 |
| 18:44:16 89 34 55 18 88 4428 |
| ---- |
| |
| In this case there are not too many page cache `MISSES` which indicates |
| a reasonably sized cache. These metrics are the most direct measurement |
| of your Cassandra node's "hot" dataset. If you don't have enough cache, |
| `MISSES` will be high and performance will be slow. If you have enough |
| cache, `MISSES` will be low and performance will be fast (as almost all |
| reads are being served out of memory). |
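
If you prefer a single hit-rate number over raw counts, a small sketch
over the same output (assuming the column layout shown above, which can
vary between `bcc` versions):

[source, bash]
----
# print the page cache hit percentage per sample
$ sudo /usr/share/bcc/tools/cachestat -T 1 | \
    awk 'NR > 1 && $2 > 0 { printf "%s hit rate: %.1f%%\n", $1, 100 * $4 / $2 }'
----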
| |
You can also measure disk latency distributions using `biolatency`
(https://github.com/iovisor/bcc/blob/master/tools/biolatency_example.txt[biolatency
examples]) to get an idea of how slow Cassandra will be when reads miss
the OS page cache and have to hit disks:
| |
| [source, bash] |
| ---- |
| $ sudo /usr/share/bcc/tools/biolatency -D 10 |
| Tracing block device I/O... Hit Ctrl-C to end. |
| |
| |
| disk = 'sda' |
| usecs : count distribution |
| 0 -> 1 : 0 | | |
| 2 -> 3 : 0 | | |
| 4 -> 7 : 0 | | |
| 8 -> 15 : 0 | | |
| 16 -> 31 : 12 |****************************************| |
| 32 -> 63 : 9 |****************************** | |
| 64 -> 127 : 1 |*** | |
| 128 -> 255 : 3 |********** | |
| 256 -> 511 : 7 |*********************** | |
| 512 -> 1023 : 2 |****** | |
| |
| disk = 'sdc' |
| usecs : count distribution |
| 0 -> 1 : 0 | | |
| 2 -> 3 : 0 | | |
| 4 -> 7 : 0 | | |
| 8 -> 15 : 0 | | |
| 16 -> 31 : 0 | | |
| 32 -> 63 : 0 | | |
| 64 -> 127 : 41 |************ | |
| 128 -> 255 : 17 |***** | |
| 256 -> 511 : 13 |*** | |
| 512 -> 1023 : 2 | | |
| 1024 -> 2047 : 0 | | |
| 2048 -> 4095 : 0 | | |
| 4096 -> 8191 : 56 |***************** | |
| 8192 -> 16383 : 131 |****************************************| |
| 16384 -> 32767 : 9 |** | |
| ---- |
| |
| In this case most ios on the data drive (`sdc`) are fast, but many take |
| between 8 and 16 milliseconds. |
| |
| Finally `biosnoop` |
| (https://github.com/iovisor/bcc/blob/master/tools/biosnoop_example.txt[examples]) |
| can be used to dive even deeper and see per IO latencies: |
| |
| [source, bash] |
| ---- |
| $ sudo /usr/share/bcc/tools/biosnoop | grep java | head |
| 0.000000000 java 17427 sdc R 3972458600 4096 13.58 |
| 0.000818000 java 17427 sdc R 3972459408 4096 0.35 |
| 0.007098000 java 17416 sdc R 3972401824 4096 5.81 |
| 0.007896000 java 17416 sdc R 3972489960 4096 0.34 |
| 0.008920000 java 17416 sdc R 3972489896 4096 0.34 |
| 0.009487000 java 17427 sdc R 3972401880 4096 0.32 |
| 0.010238000 java 17416 sdc R 3972488368 4096 0.37 |
| 0.010596000 java 17427 sdc R 3972488376 4096 0.34 |
| 0.011236000 java 17410 sdc R 3972488424 4096 0.32 |
| 0.011825000 java 17427 sdc R 3972488576 16384 0.65 |
| ... time passes |
| 8.032687000 java 18279 sdc R 10899712 122880 3.01 |
| 8.033175000 java 18279 sdc R 10899952 8192 0.46 |
| 8.073295000 java 18279 sdc R 23384320 122880 3.01 |
| 8.073768000 java 18279 sdc R 23384560 8192 0.46 |
| ---- |
| |
With `biosnoop` you see every single IO and how long it takes. This
data can be used to construct the latency distributions of
`biolatency`, but can also be used to better understand how disk
latency affects performance. For example this particular drive takes
~3ms to service a memory mapped read due to the large default value
(`128kb`) of `read_ahead_kb`. To improve point read performance you may
want to decrease `read_ahead_kb` on fast data volumes such as SSDs,
while a higher value like `128kb` is probably right for HDs. There are
tradeoffs involved; see the
https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt[queue-sysfs]
docs for more information, but regardless `biosnoop` is useful for
understanding _how_ Cassandra uses drives.
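
A sketch of inspecting and temporarily lowering the setting (assuming
the data drive is `sdc`; `16` is just an illustrative value, and the
change does not persist across reboots):

[source, bash]
----
# check the current readahead of the data drive
$ cat /sys/block/sdc/queue/read_ahead_kb

# lower it for an SSD-backed data volume
$ echo 16 | sudo tee /sys/block/sdc/queue/read_ahead_kb
----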
| |
| [[use-vmtouch]] |
| === vmtouch |
| |
| Sometimes it's useful to know how much of the Cassandra data files are |
| being cached by the OS. A great tool for answering this question is |
| https://github.com/hoytech/vmtouch[vmtouch]. |
| |
| First install it: |
| |
| [source, bash] |
| ---- |
| $ git clone https://github.com/hoytech/vmtouch.git |
| $ cd vmtouch |
| $ make |
| ---- |
| |
| Then run it on the Cassandra data directory: |
| |
| [source, bash] |
| ---- |
| $ ./vmtouch /var/lib/cassandra/data/ |
| Files: 312 |
| Directories: 92 |
| Resident Pages: 62503/64308 244M/251M 97.2% |
| Elapsed: 0.005657 seconds |
| ---- |
| |
In this case almost the entire dataset is hot in the OS page cache.
Generally speaking the percentage doesn't really matter unless reads
are missing the cache (per e.g.
xref:use_tools.adoc#use-bcc-tools[cachestat]), in which case having
additional memory may help read performance.
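
To see which specific files are resident, `vmtouch` also has a verbose
mode; a quick sketch against a single (hypothetical) table directory:

[source, bash]
----
# per-file resident page breakdown
$ ./vmtouch -v /var/lib/cassandra/data/<keyspace>/<table>/
----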
| |
| === CPU Flamegraphs |
| |
Cassandra often uses a lot of CPU, but telling _what_ it is doing can
prove difficult. One of the best ways to analyze Cassandra's on-CPU
time is to use
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html[CPU
Flamegraphs], which usefully display which areas of Cassandra code are
using CPU. This may help narrow down a generic compaction problem to,
for example, a problem with dropping tombstones, or just generally help
you narrow down what Cassandra is doing while it is having an issue. To
get CPU flamegraphs follow the instructions for
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java[Java
Flamegraphs].
| |
| Generally: |
| |
| [arabic] |
. Enable the `-XX:+PreserveFramePointer` option in Cassandra's
`jvm.options` configuration file. This has a negligible performance
impact but allows you to actually see what Cassandra is doing.
. Run `perf` to get some data.
. Send that data through the relevant scripts in the FlameGraph toolset
and convert the data into a pretty flamegraph. View the resulting SVG
image in a browser or other image viewer.
| |
For example, cloning straight off GitHub, we first install
`perf-map-agent` in the location of our JVMs (assumed to be
`/usr/lib/jvm`):
| |
| [source, bash] |
| ---- |
| $ sudo bash |
| $ export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ |
| $ cd /usr/lib/jvm |
| $ git clone --depth=1 https://github.com/jvm-profiling-tools/perf-map-agent |
| $ cd perf-map-agent |
| $ cmake . |
| $ make |
| ---- |
| |
| Now to get a flamegraph: |
| |
| [source, bash] |
| ---- |
| $ git clone --depth=1 https://github.com/brendangregg/FlameGraph |
| $ sudo bash |
| $ cd FlameGraph |
| $ # Record traces of Cassandra and map symbols for all java processes |
| $ perf record -F 49 -a -g -p <CASSANDRA PID> -- sleep 30; ./jmaps |
| $ # Translate the data |
| $ perf script > cassandra_stacks |
| $ cat cassandra_stacks | ./stackcollapse-perf.pl | grep -v cpu_idle | \ |
| ./flamegraph.pl --color=java --hash > cassandra_flames.svg |
| ---- |
| |
| The resulting SVG is searchable, zoomable, and generally easy to |
| introspect using a browser. |
| |
| === Packet Capture |
| |
Sometimes you have to understand what queries a Cassandra node is
performing _right now_ to troubleshoot an issue. For these times,
trusty tools like `tcpdump` and
https://www.wireshark.org/[Wireshark] can be very helpful for taking
and dissecting packet captures. Wireshark even has native
https://www.wireshark.org/docs/dfref/c/cql.html[CQL support], although
it sometimes has compatibility issues with newer Cassandra protocol
releases.
| |
First, capture some packets:
| |
| [source, bash] |
| ---- |
| $ sudo tcpdump -U -s0 -i <INTERFACE> -w cassandra.pcap -n "tcp port 9042" |
| ---- |
| |
Now open it up with Wireshark:
| |
| [source, bash] |
| ---- |
| $ wireshark cassandra.pcap |
| ---- |
| |
If you don't see CQL-like statements, try telling Wireshark to decode
the traffic as CQL: right click on a packet going to port 9042, choose
`Decode as`, and select CQL from the dropdown for port 9042.
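
If you prefer the command line, `tshark` (Wireshark's terminal version)
can apply the same decode rule non-interactively, assuming your build
includes the CQL dissector:

[source, bash]
----
# dissect the capture, forcing port 9042 traffic to be decoded as CQL
$ tshark -r cassandra.pcap -d tcp.port==9042,cql
----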
| |
| If you don't want to do this manually or use a GUI, you can also use |
| something like https://github.com/jolynch/cqltrace[cqltrace] to ease |
| obtaining and parsing CQL packet captures. |