| <?xml version="1.0" encoding="UTF-8"?> |
| <chapter version="5.0" xml:id="performance" |
| xmlns="http://docbook.org/ns/docbook" |
| xmlns:xlink="http://www.w3.org/1999/xlink" |
| xmlns:xi="http://www.w3.org/2001/XInclude" |
| xmlns:svg="http://www.w3.org/2000/svg" |
| xmlns:m="http://www.w3.org/1998/Math/MathML" |
| xmlns:html="http://www.w3.org/1999/xhtml" |
| xmlns:db="http://docbook.org/ns/docbook"> |
| <!-- |
| /** |
| * Licensed to the Apache Software Foundation (ASF) under one |
| * or more contributor license agreements. See the NOTICE file |
| * distributed with this work for additional information |
| * regarding copyright ownership. The ASF licenses this file |
| * to you under the Apache License, Version 2.0 (the |
| * "License"); you may not use this file except in compliance |
| * with the License. You may obtain a copy of the License at |
| * |
| * http://www.apache.org/licenses/LICENSE-2.0 |
| * |
| * Unless required by applicable law or agreed to in writing, software |
| * distributed under the License is distributed on an "AS IS" BASIS, |
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| * See the License for the specific language governing permissions and |
| * limitations under the License. |
| */ |
| --> |
| <title>Apache HBase Performance Tuning</title> |
| |
| <section xml:id="perf.os"> |
| <title>Operating System</title> |
| <section xml:id="perf.os.ram"> |
| <title>Memory</title> |
| <para>RAM, RAM, RAM. Don't starve HBase.</para> |
| </section> |
| <section xml:id="perf.os.64"> |
| <title>64-bit</title> |
| <para>Use a 64-bit platform (and 64-bit JVM).</para> |
| </section> |
| <section xml:id="perf.os.swap"> |
| <title>Swapping</title> |
| <para>Watch out for swapping. Set swappiness to 0.</para> |
| </section> |
| </section> |
| <section xml:id="perf.network"> |
| <title>Network</title> |
| <para> |
          Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase performance is the switching hardware
          that is used.  Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more).
| </para> |
| <para> |
| Important items to consider: |
| <itemizedlist> |
| <listitem>Switching capacity of the device</listitem> |
| <listitem>Number of systems connected</listitem> |
| <listitem>Uplink capacity</listitem> |
| </itemizedlist> |
| </para> |
| <section xml:id="perf.network.1switch"> |
| <title>Single Switch</title> |
| <para>The single most important factor in this configuration is that the switching capacity of the hardware is capable of |
        handling the traffic which can be generated by all systems connected to the switch.  Some lower-priced commodity hardware
        can have a slower switching capacity than a fully utilized switch requires.
| </para> |
| </section> |
| <section xml:id="perf.network.2switch"> |
| <title>Multiple Switches</title> |
        <para>Multiple switches are a potential pitfall in the architecture.  The most common configuration of lower-priced hardware is a
          simple 1Gbps uplink from one switch to another.  This often-overlooked pinch point can easily become a bottleneck for cluster communication,
          especially with MapReduce jobs that both read and write a lot of data, which can saturate the uplink.
| </para> |
| <para>Mitigation of this issue is fairly simple and can be accomplished in multiple ways: |
| <itemizedlist> |
| <listitem>Use appropriate hardware for the scale of the cluster which you're attempting to build.</listitem> |
          <listitem>Use larger single-switch configurations, i.e., a single 48-port switch as opposed to two 24-port switches.</listitem>
| <listitem>Configure port trunking for uplinks to utilize multiple interfaces to increase cross switch bandwidth.</listitem> |
| </itemizedlist> |
| </para> |
| </section> |
| <section xml:id="perf.network.multirack"> |
| <title>Multiple Racks</title> |
| <para>Multiple rack configurations carry the same potential issues as multiple switches, and can suffer performance degradation from two main areas: |
| <itemizedlist> |
| <listitem>Poor switch capacity performance</listitem> |
| <listitem>Insufficient uplink to another rack</listitem> |
| </itemizedlist> |
          If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing
          more of your cluster across racks.  The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks.
          The downside of this method, however, is in the overhead of ports that could otherwise be used for hosts.  For example, creating an 8Gbps port channel from rack
          A to rack B consumes 8 of your 24 ports for inter-rack communication, which gives you a poor ROI; using too few ports, however, can mean you're not getting the most out of your cluster.
| </para> |
        <para>Using 10GbE links between racks will greatly increase performance.  Assuming your switches support a 10GbE uplink or allow for an expansion card, this lets you
        save your ports for machines as opposed to uplinks.
| </para> |
| </section> |
| <section xml:id="perf.network.ints"> |
| <title>Network Interfaces</title> |
| <para>Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in <xref linkend="casestudies.slownode"/>. |
| </para> |
| </section> |
| </section> <!-- network --> |
| |
| <section xml:id="jvm"> |
| <title>Java</title> |
| |
| <section xml:id="gc"> |
| <title>The Garbage Collector and Apache HBase</title> |
| |
| <section xml:id="gcpause"> |
| <title>Long GC pauses</title> |
| |
| <para xml:id="mslab">In his presentation, <link |
| xlink:href="http://www.slideshare.net/cloudera/hbase-hug-presentation">Avoiding |
| Full GCs with MemStore-Local Allocation Buffers</link>, Todd Lipcon |
| describes two cases of stop-the-world garbage collections common in |
| HBase, especially during loading; CMS failure modes and old generation |
| heap fragmentation brought. To address the first, start the CMS |
| earlier than default by adding |
| <code>-XX:CMSInitiatingOccupancyFraction</code> and setting it down |
| from defaults. Start at 60 or 70 percent (The lower you bring down the |
| threshold, the more GCing is done, the more CPU used). To address the |
| second fragmentation issue, Todd added an experimental facility, |
| <indexterm><primary>MSLAB</primary></indexterm>, that |
| must be explicitly enabled in Apache HBase 0.90.x (Its defaulted to be on in |
| Apache 0.92.x HBase). See <code>hbase.hregion.memstore.mslab.enabled</code> |
| to true in your <classname>Configuration</classname>. See the cited |
| slides for background and detail<footnote><para>The latest jvms do better |
| regards fragmentation so make sure you are running a recent release. |
| Read down in the message, |
| <link xlink:href="http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html">Identifying concurrent mode failures caused by fragmentation</link>.</para></footnote>. |
| Be aware that when enabled, each MemStore instance will occupy at least |
| an MSLAB instance of memory. If you have thousands of regions or lots |
| of regions each with many column families, this allocation of MSLAB |
| may be responsible for a good portion of your heap allocation and in |
| an extreme case cause you to OOME. Disable MSLAB in this case, or |
| lower the amount of memory it uses or float less regions per server. |
| </para> |
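        <para>For illustration, here is a minimal sketch of toggling MSLAB on a <classname>Configuration</classname>; ordinarily you would set these properties in
        <filename>hbase-site.xml</filename> rather than in code, and the chunk size shown is illustrative only.</para>
        <programlisting>
Configuration conf = HBaseConfiguration.create();
// Enable MemStore-Local Allocation Buffers (already on by default in 0.92.x and later).
conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);
// Optionally lower the per-MemStore chunk size if MSLAB overhead is a concern
// (illustrative value, not a recommendation).
conf.setInt("hbase.hregion.memstore.mslab.chunksize", 1024 * 1024);
</programlisting>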
        <para>If you have a write-heavy workload, check out
        <link xlink:href="https://issues.apache.org/jira/browse/HBASE-8163">HBASE-8163 MemStoreChunkPool: An improvement for JAVA GC when using MSLAB</link>.
        It describes configurations to lower the amount of young GC during write-heavy loads.  If you do not have HBASE-8163 installed, and you are
        trying to improve your young GC times, one trick to consider -- courtesy of our Liang Xie -- is to set the GC config <varname>-XX:PretenureSizeThreshold</varname> in <filename>hbase-env.sh</filename>
        to be just smaller than the size of <varname>hbase.hregion.memstore.mslab.chunksize</varname> so MSLAB allocations happen in the
        tenured space directly rather than first in the young gen. You'd do this because these MSLAB allocations are likely to make it
        to the old gen anyway, and rather than pay the price of copies between the survivor spaces followed by the promotion from
        young to old gen after the MSLABs have achieved sufficient tenure, you save a bit of YGC churn and allocate in the old gen directly.
        </para>
| <para>For more information about GC logs, see <xref linkend="trouble.log.gc" />. |
| </para> |
| </section> |
| </section> |
| </section> |
| |
| <section xml:id="perf.configurations"> |
| <title>HBase Configurations</title> |
| |
| <para>See <xref linkend="recommended_configurations" />.</para> |
| |
| <section xml:id="perf.compactions.and.splits"> |
| <title>Managing Compactions</title> |
| |
| <para>For larger systems, managing <link |
| linkend="disable.splitting">compactions and splits</link> may be |
| something you want to consider.</para> |
| </section> |
| |
| <section xml:id="perf.handlers"> |
| <title><varname>hbase.regionserver.handler.count</varname></title> |
| <para>See <xref linkend="hbase.regionserver.handler.count"/>. |
| </para> |
| </section> |
| <section xml:id="perf.hfile.block.cache.size"> |
| <title><varname>hfile.block.cache.size</varname></title> |
| <para>See <xref linkend="hfile.block.cache.size"/>. |
| A memory setting for the RegionServer process. |
| </para> |
| </section> |
| <section xml:id="perf.rs.memstore.upperlimit"> |
| <title><varname>hbase.regionserver.global.memstore.upperLimit</varname></title> |
| <para>See <xref linkend="hbase.regionserver.global.memstore.upperLimit"/>. |
| This memory setting is often adjusted for the RegionServer process depending on needs. |
| </para> |
| </section> |
| <section xml:id="perf.rs.memstore.lowerlimit"> |
| <title><varname>hbase.regionserver.global.memstore.lowerLimit</varname></title> |
| <para>See <xref linkend="hbase.regionserver.global.memstore.lowerLimit"/>. |
| This memory setting is often adjusted for the RegionServer process depending on needs. |
| </para> |
| </section> |
| <section xml:id="perf.hstore.blockingstorefiles"> |
| <title><varname>hbase.hstore.blockingStoreFiles</varname></title> |
| <para>See <xref linkend="hbase.hstore.blockingStoreFiles"/>. |
| If there is blocking in the RegionServer logs, increasing this can help. |
| </para> |
| </section> |
| <section xml:id="perf.hregion.memstore.block.multiplier"> |
| <title><varname>hbase.hregion.memstore.block.multiplier</varname></title> |
| <para>See <xref linkend="hbase.hregion.memstore.block.multiplier"/>. |
| If there is enough RAM, increasing this can help. |
| </para> |
| </section> |
| <section xml:id="hbase.regionserver.checksum.verify"> |
| <title><varname>hbase.regionserver.checksum.verify</varname></title> |
| <para>Have HBase write the checksum into the datablock and save |
| having to do the checksum seek whenever you read.</para> |
| |
| <para>See <xref linkend="hbase.regionserver.checksum.verify"/>, |
| <xref linkend="hbase.hstore.bytes.per.checksum"/> and <xref linkend="hbase.hstore.checksum.algorithm"/> |
| For more information see the |
| release note on <link xlink:href="https://issues.apache.org/jira/browse/HBASE-5074">HBASE-5074 support checksums in HBase block cache</link>. |
| </para> |
| </section> |
| |
| </section> |
| |
| |
| |
| |
| <section xml:id="perf.zookeeper"> |
| <title>ZooKeeper</title> |
| <para>See <xref linkend="zookeeper"/> for information on configuring ZooKeeper, and see the part |
| about having a dedicated disk. |
| </para> |
| </section> |
| <section xml:id="perf.schema"> |
| <title>Schema Design</title> |
| |
| <section xml:id="perf.number.of.cfs"> |
| <title>Number of Column Families</title> |
| <para>See <xref linkend="number.of.cfs" />.</para> |
| </section> |
| <section xml:id="perf.schema.keys"> |
| <title>Key and Attribute Lengths</title> |
| <para>See <xref linkend="keysize" />. See also <xref linkend="perf.compression.however" /> for |
| compression caveats.</para> |
| </section> |
| <section xml:id="schema.regionsize"><title>Table RegionSize</title> |
        <para>The regionsize can be set on a per-table basis via <code>setMaxFileSize</code> on
        <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link> in the
        event that certain tables require different regionsizes than the configured default regionsize.
| </para> |
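        <para>For example, a hedged sketch of setting a larger region size on a single table at creation time; the 10GB
        value is illustrative only, and <code>admin</code> is assumed to be an existing <classname>HBaseAdmin</classname>.</para>
        <programlisting>
HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
// Regions of this table split when a store grows beyond roughly 10GB,
// regardless of the cluster-wide default.
desc.setMaxFileSize(10L * 1024 * 1024 * 1024);
admin.createTable(desc);
</programlisting>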
| <para>See <xref linkend="ops.capacity.regions"/> for more information. |
| </para> |
| </section> |
| <section xml:id="schema.bloom"> |
| <title>Bloom Filters</title> |
| <para>Bloom Filters can be enabled per-ColumnFamily. |
| Use <code>HColumnDescriptor.setBloomFilterType(NONE | ROW | |
| ROWCOL)</code> to enable blooms per Column Family. Default = |
| <varname>NONE</varname> for no bloom filters. If |
| <varname>ROW</varname>, the hash of the row will be added to the bloom |
| on each insert. If <varname>ROWCOL</varname>, the hash of the row + |
| column family name + column family qualifier will be added to the bloom on |
| each key insert.</para> |
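      <para>A brief sketch of enabling a row-level bloom on a column family follows; note that the exact name and
      package of the bloom type enum varies across HBase versions.</para>
      <programlisting>
HColumnDescriptor cf = new HColumnDescriptor("cf");
// ROW blooms index row keys only; ROWCOL also covers column qualifiers.
cf.setBloomFilterType(StoreFile.BloomType.ROW);
</programlisting>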
| <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> and |
| <xref linkend="blooms"/> for more information or this answer up in quora, |
| <link xlink:href="http://www.quora.com/How-are-bloom-filters-used-in-HBase">How are bloom filters used in HBase?</link>. |
| </para> |
| </section> |
| <section xml:id="schema.cf.blocksize"><title>ColumnFamily BlockSize</title> |
| <para>The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. |
| There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting |
| indexes should be roughly halved). |
| </para> |
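      <para>For instance, a small sketch of raising the block size for a column family that stores large cells; the
      128KB figure is illustrative only.</para>
      <programlisting>
HColumnDescriptor cf = new HColumnDescriptor("cf");
// Default is 64KB; larger blocks mean smaller StoreFile indexes
// but coarser-grained random reads.
cf.setBlocksize(128 * 1024);
</programlisting>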
| <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> |
| and <xref linkend="store"/>for more information. |
| </para> |
| </section> |
| <section xml:id="cf.in.memory"> |
| <title>In-Memory ColumnFamilies</title> |
| <para>ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. |
| In-memory blocks have the highest priority in the <xref linkend="block.cache" />, but it is not a guarantee that the entire table |
| will be in memory. |
| </para> |
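      <para>Marking a column family as in-memory is a one-line change on its descriptor; a brief sketch:</para>
      <programlisting>
HColumnDescriptor cf = new HColumnDescriptor("cf");
// Blocks for this family get the highest block cache priority;
// this is a caching hint, not a guarantee of memory residency.
cf.setInMemory(true);
</programlisting>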
| <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information. |
| </para> |
| </section> |
| <section xml:id="perf.compression"> |
| <title>Compression</title> |
| <para>Production systems should use compression with their ColumnFamily definitions. See <xref linkend="compression" /> for more information. |
| </para> |
| <section xml:id="perf.compression.however"><title>However...</title> |
| <para>Compression deflates data <emphasis>on disk</emphasis>. When it's in-memory (e.g., in the |
| MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. |
        So while using ColumnFamily compression is a best practice, it's not going to completely eliminate
        the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.
| </para> |
| <para>See <xref linkend="keysize" /> on for schema design tips, and <xref linkend="keyvalue"/> for more information on HBase stores data internally. |
| </para> |
| </section> |
| </section> |
| </section> <!-- perf schema --> |
| |
| <section xml:id="perf.general"> |
| <title>HBase General Patterns</title> |
| <section xml:id="perf.general.constants"> |
| <title>Constants</title> |
| <para>When people get started with HBase they have a tendency to write code that looks like this: |
| <programlisting> |
| Get get = new Get(rowkey); |
| Result r = htable.get(get); |
| byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value |
| </programlisting> |
| But especially when inside loops (and MapReduce jobs), converting the columnFamily and column-names |
| to byte-arrays repeatedly is surprisingly expensive. |
| It's better to use constants for the byte-arrays, like this: |
| <programlisting> |
public static final byte[] CF = Bytes.toBytes("cf");
public static final byte[] ATTR = Bytes.toBytes("attr");
| ... |
| Get get = new Get(rowkey); |
| Result r = htable.get(get); |
| byte[] b = r.getValue(CF, ATTR); // returns current version of value |
| </programlisting> |
| </para> |
| </section> |
| |
| </section> |
| <section xml:id="perf.writing"> |
| <title>Writing to HBase</title> |
| |
| <section xml:id="perf.batch.loading"> |
| <title>Batch Loading</title> |
| <para>Use the bulk load tool if you can. See |
| <xref linkend="arch.bulk.load"/>. |
| Otherwise, pay attention to the below. |
| </para> |
| </section> <!-- batch loading --> |
| |
| <section xml:id="precreate.regions"> |
| <title> |
| Table Creation: Pre-Creating Regions |
| </title> |
| <para> |
| Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region |
| until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. |
| Be somewhat conservative in this, because too-many regions can actually degrade performance. |
| </para> |
| <para>There are two different approaches to pre-creating splits. The first approach is to rely on the default <code>HBaseAdmin</code> strategy |
| (which is implemented in <code>Bytes.split</code>)... |
| </para> |
| <programlisting> |
byte[] startKey = ...;   	// your lowest key
| byte[] endKey = ...; // your highest key |
| int numberOfRegions = ...; // # of regions to create |
| admin.createTable(table, startKey, endKey, numberOfRegions); |
| </programlisting> |
| <para>And the other approach is to define the splits yourself... |
| </para> |
| <programlisting> |
| byte[][] splits = ...; // create your own splits |
| admin.createTable(table, splits); |
| </programlisting> |
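    <para>As an illustration of the second approach, below is a hedged sketch of a helper (hypothetical, not part of the
    HBase API) that computes evenly spaced split points for row keys that are fixed-width hexadecimal strings; it assumes
    <classname>java.math.BigInteger</classname> and sixteen-character keys.</para>
    <programlisting>
// Hypothetical helper: evenly spaced splits over a hex keyspace.
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowest = new BigInteger(startKey, 16);
  BigInteger range = new BigInteger(endKey, 16).subtract(lowest);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  for (int i = 0; i &lt; numRegions - 1; i++) {
    BigInteger splitPoint = lowest.add(regionIncrement.multiply(BigInteger.valueOf(i + 1)));
    splits[i] = String.format("%016x", splitPoint).getBytes();
  }
  return splits;
}
</programlisting>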
| <para> |
| See <xref linkend="rowkey.regionsplits"/> for issues related to understanding your keyspace and pre-creating regions. |
| </para> |
| </section> |
| <section xml:id="def.log.flush"> |
| <title> |
| Table Creation: Deferred Log Flush |
| </title> |
| <para> |
| The default behavior for Puts using the Write Ahead Log (WAL) is that <classname>HLog</classname> edits will be written immediately. If deferred log flush is used, |
    WAL edits are kept in memory until the flush period.  The benefit is aggregated and asynchronous <classname>HLog</classname> writes, but the potential downside is that if
| the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts. |
| </para> |
| <para> |
| Deferred log flush can be configured on tables via <link |
| xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link>. The default value of <varname>hbase.regionserver.optionallogflushinterval</varname> is 1000ms. |
| </para> |
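    <para>A short sketch of enabling deferred log flush when creating a table; method names may differ slightly across
    versions, and <code>admin</code> is assumed to be an existing <classname>HBaseAdmin</classname>.</para>
    <programlisting>
HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
// Trade a small window of potential data loss on RegionServer failure
// for batched, asynchronous WAL writes.
desc.setDeferredLogFlush(true);
admin.createTable(desc);
</programlisting>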
| </section> |
| |
| <section xml:id="perf.hbase.client.autoflush"> |
| <title>HBase Client: AutoFlush</title> |
| |
| <para>When performing a lot of Puts, make sure that setAutoFlush is set |
| to false on your <link |
| xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link> |
| instance. Otherwise, the Puts will be sent one at a time to the |
         RegionServer.  Puts added via <code>htable.put(Put)</code> and <code>htable.put(List&lt;Put&gt;)</code>
| wind up in the same write buffer. If <code>autoFlush = false</code>, |
| these messages are not sent until the write-buffer is filled. To |
| explicitly flush the messages, call <methodname>flushCommits</methodname>. |
| Calling <methodname>close</methodname> on the <classname>HTable</classname> |
| instance will invoke <methodname>flushCommits</methodname>.</para> |
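       <para>A minimal sketch of buffered writes with an explicit flush; it assumes an existing
       <classname>Configuration</classname> <code>conf</code> and a <code>List&lt;Put&gt;</code> <code>puts</code>, and the
       buffer size shown is illustrative.</para>
       <programlisting>
HTable htable = new HTable(conf, "myTable");
htable.setAutoFlush(false);
htable.setWriteBufferSize(12 * 1024 * 1024);  // ~12MB client-side write buffer
for (Put put : puts) {
  htable.put(put);         // buffered locally, not yet sent
}
htable.flushCommits();     // push the buffered Puts to the RegionServers
htable.close();            // also flushes any remaining buffered edits
</programlisting>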
| </section> |
| <section xml:id="perf.hbase.client.putwal"> |
| <title>HBase Client: Turn off WAL on Puts</title> |
| <para>A frequently discussed option for increasing throughput on <classname>Put</classname>s is to call <code>writeToWAL(false)</code>. Turning this off means |
| that the RegionServer will <emphasis>not</emphasis> write the <classname>Put</classname> to the Write Ahead Log, |
| only into the memstore, HOWEVER the consequence is that if there |
| is a RegionServer failure <emphasis>there will be data loss</emphasis>. |
| If <code>writeToWAL(false)</code> is used, do so with extreme caution. You may find in actuality that |
| it makes little difference if your load is well distributed across the cluster. |
| </para> |
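    	<para>For completeness, a hedged sketch of what skipping the WAL for a single Put looks like, reusing the
    	<code>CF</code>/<code>ATTR</code> constants from <xref linkend="perf.general.constants"/>; <code>rowkey</code> and
    	<code>value</code> are assumed to exist.</para>
    	<programlisting>
Put put = new Put(rowkey);
put.add(CF, ATTR, Bytes.toBytes(value));
// Skip the Write Ahead Log for this edit: faster, but the edit is lost
// if the RegionServer dies before the MemStore is flushed.
put.setWriteToWAL(false);
htable.put(put);
</programlisting>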
| <para>In general, it is best to use WAL for Puts, and where loading throughput |
| is a concern to use <link linkend="perf.batch.loading">bulk loading</link> techniques instead. |
| </para> |
| </section> |
| <section xml:id="perf.hbase.client.regiongroup"> |
| <title>HBase Client: Group Puts by RegionServer</title> |
| <para>In addition to using the writeBuffer, grouping <classname>Put</classname>s by RegionServer can reduce the number of client RPC calls per writeBuffer flush. |
       There is a utility <classname>HTableUtil</classname> currently on TRUNK that does this, but you can either copy that or implement your own version for
       those still on 0.90.x or earlier.
| </para> |
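       <para>A rough, hypothetical sketch of the idea follows; it is not the <classname>HTableUtil</classname>
       implementation, and the <classname>HRegionLocation</classname> accessor names vary by version.</para>
       <programlisting>
// Group Puts by the server currently hosting each row's region.
Map&lt;String, List&lt;Put&gt;&gt; putsByServer = new HashMap&lt;String, List&lt;Put&gt;&gt;();
for (Put put : puts) {
  String server = htable.getRegionLocation(put.getRow()).getHostnamePort();
  List&lt;Put&gt; group = putsByServer.get(server);
  if (group == null) {
    group = new ArrayList&lt;Put&gt;();
    putsByServer.put(server, group);
  }
  group.add(put);
}
// Each group can then be written (and flushed) together.
</programlisting>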
| </section> |
| <section xml:id="perf.hbase.write.mr.reducer"> |
| <title>MapReduce: Skip The Reducer</title> |
| <para>When writing a lot of data to an HBase table from a MR job (e.g., with <link |
| xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>), and specifically where Puts are being emitted |
| from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other |
| Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase. |
| </para> |
      <para>For summary jobs where HBase is used as a source and a sink, the writes will come from the Reducer step (e.g., summarize values then write out result).
      This is a different processing problem than the above case.
| </para> |
| </section> |
| |
| <section xml:id="perf.one.region"> |
| <title>Anti-Pattern: One Hot Region</title> |
| <para>If all your data is being written to one region at a time, then re-read the |
| section on processing <link linkend="timeseries">timeseries</link> data.</para> |
| <para>Also, if you are pre-splitting regions and all your data is <emphasis>still</emphasis> winding up in a single region even though |
| your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a |
       variety of reasons that regions may appear "well split" but won't work with your data.   As
       the HBase client communicates directly with the RegionServers, the region that a given row key maps to can be obtained via
       <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29">HTable.getRegionLocation</link>.
| </para> |
| <para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para> |
| </section> |
| |
| </section> <!-- writing --> |
| |
| <section xml:id="perf.reading"> |
| <title>Reading from HBase</title> |
| <para>The mailing list can help if you are having performance issues. |
     For example, here is a good general thread on what to look at to address
| read-time issues: <link xlink:href="http://search-hadoop.com/m/qOo2yyHtCC1">HBase Random Read latency > 100ms</link></para> |
| <section xml:id="perf.hbase.client.caching"> |
| <title>Scan Caching</title> |
| |
| <para>If HBase is used as an input source for a MapReduce job, for |
| example, make sure that the input <link |
| xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> |
| instance to the MapReduce job has <methodname>setCaching</methodname> set to something greater |
        than the default (which is 1).  Using the default value means that the
        map-task will make a call back to the region-server for every record
        processed.  Setting this value to 500, for example, will transfer 500
| rows at a time to the client to be processed. There is a cost/benefit to |
| have the cache value be large because it costs more in memory for both |
| client and RegionServer, so bigger isn't always better.</para> |
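        <para>For example (the value of 500 is illustrative, not a recommendation):</para>
        <programlisting>
Scan scan = new Scan();
scan.setCaching(500);  // fetch 500 rows per trip to the RegionServer instead of the default 1
// ... then hand the Scan to TableMapReduceUtil.initTableMapperJob, or to
// htable.getScanner(scan) for a plain client-side scan.
</programlisting>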
| <section xml:id="perf.hbase.client.caching.mr"> |
| <title>Scan Caching in MapReduce Jobs</title> |
         <para>Scan settings in MapReduce jobs deserve special attention.  Timeouts can result (e.g., UnknownScannerException)
         in Map tasks if it takes longer to process a batch of records than the scanner lease timeout allows before the client goes back to the RegionServer for the
         next set of data.  This problem can occur because there is non-trivial processing occurring per row.  If you process
| rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), |
| then set caching lower. |
| </para> |
| <para>Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the |
| processing that is often performed in MapReduce jobs tends to exacerbate this issue. |
| </para> |
| </section> |
| </section> |
| <section xml:id="perf.hbase.client.selection"> |
| <title>Scan Attribute Selection</title> |
| |
| <para>Whenever a Scan is used to process large numbers of rows (and especially when used |
| as a MapReduce source), be aware of which attributes are selected. If <code>scan.addFamily</code> is called |
| then <emphasis>all</emphasis> of the attributes in the specified ColumnFamily will be returned to the client. |
| If only a small number of the available attributes are to be processed, then only those attributes should be specified |
| in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets. |
| </para> |
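     <para>For example, a brief sketch that selects a single column rather than the whole family, reusing the
     <code>CF</code>/<code>ATTR</code> constants from <xref linkend="perf.general.constants"/>:</para>
     <programlisting>
Scan scan = new Scan();
// Return only the one qualifier that will be processed...
scan.addColumn(CF, ATTR);
// ...whereas scan.addFamily(CF) would ship every column in the family to the client.
</programlisting>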
| </section> |
| <section xml:id="perf.hbase.mr.input"> |
| <title>MapReduce - Input Splits</title> |
       <para>For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to
| have the same Input Split (i.e., the RegionServer serving the data), see the |
| Troubleshooting Case Study in <xref linkend="casestudies.slownode"/>. |
| </para> |
| </section> |
| |
| <section xml:id="perf.hbase.client.scannerclose"> |
| <title>Close ResultScanners</title> |
| |
      <para>This isn't so much about improving performance but rather
      <emphasis>avoiding</emphasis> performance problems.  If you forget to
      close <link
      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html">ResultScanners</link>
      you can cause problems on the RegionServers.  Always have ResultScanner
      processing enclosed in try/finally blocks... <programlisting>
Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();</programlisting></para>
| </section> |
| |
| <section xml:id="perf.hbase.client.blockcache"> |
| <title>Block Cache</title> |
| |
| <para><link |
| xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> |
| instances can be set to use the block cache in the RegionServer via the |
| <methodname>setCacheBlocks</methodname> method. For input Scans to MapReduce jobs, this should be |
| <varname>false</varname>. For frequently accessed rows, it is advisable to use the block |
| cache.</para> |
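      <para>For instance, a small sketch for a MapReduce input Scan:</para>
      <programlisting>
Scan scan = new Scan();
// A full-table scan would churn the block cache, so skip it for this Scan.
scan.setCacheBlocks(false);
</programlisting>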
| </section> |
| <section xml:id="perf.hbase.client.rowkeyonly"> |
| <title>Optimal Loading of Row Keys</title> |
| <para>When performing a table <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">scan</link> |
| where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a |
| <varname>MUST_PASS_ALL</varname> operator to the scanner using <methodname>setFilter</methodname>. The filter list |
| should include both a <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link> |
| and a <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html">KeyOnlyFilter</link>. |
| Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk |
| and minimal network traffic to the client for a single row. |
| </para> |
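    <para>A short sketch of such a row-key-only scan:</para>
    <programlisting>
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // only the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());       // strip values, keep keys
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);
</programlisting>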
| </section> |
| <section xml:id="perf.hbase.read.dist"> |
| <title>Concurrency: Monitor Data Spread</title> |
| <para>When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have |
| too few regions then the reads could likely be served from too few nodes. </para> |
| <para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para> |
| </section> |
| <section xml:id="blooms"> |
| <title>Bloom Filters</title> |
          <para>Enabling Bloom Filters can save you from having to go to disk and
          can help improve read latencies.</para>
| <para><link xlink:href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</link> were developed over in <link |
| xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200 |
| Add bloomfilters</link>.<footnote> |
| <para>For description of the development process -- why static blooms |
| rather than dynamic -- and for an overview of the unique properties |
| that pertain to blooms in HBase, as well as possible future |
| directions, see the <emphasis>Development Process</emphasis> section |
| of the document <link |
| xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters |
| in HBase</link> attached to <link |
| xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200</link>.</para> |
| </footnote><footnote> |
| <para>The bloom filters described here are actually version two of |
| blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom |
| option based on work done by the <link |
| xlink:href="http://www.one-lab.org">European Commission One-Lab |
| Project 034819</link>. The core of the HBase bloom work was later |
| pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile. |
| Version 1 of HBase blooms never worked that well. Version 2 is a |
| rewrite from scratch though again it starts with the one-lab |
| work.</para> |
| </footnote></para> |
| <para>See also <xref linkend="schema.bloom" />. |
| </para> |
| |
| <section xml:id="bloom_footprint"> |
| <title>Bloom StoreFile footprint</title> |
| |
| <para>Bloom filters add an entry to the <classname>StoreFile</classname> |
| general <classname>FileInfo</classname> data structure and then two |
| extra entries to the <classname>StoreFile</classname> metadata |
| section.</para> |
| |
| <section> |
| <title>BloomFilter in the <classname>StoreFile</classname> |
| <classname>FileInfo</classname> data structure</title> |
| |
| <para><classname>FileInfo</classname> has a |
| <varname>BLOOM_FILTER_TYPE</varname> entry which is set to |
| <varname>NONE</varname>, <varname>ROW</varname> or |
| <varname>ROWCOL.</varname></para> |
| </section> |
| |
| <section> |
| <title>BloomFilter entries in <classname>StoreFile</classname> |
| metadata</title> |
| |
           <para><varname>BLOOM_FILTER_META</varname> holds Bloom Size, Hash
           Function used, etc. It's small in size and is cached on
           <classname>StoreFile.Reader</classname> load.</para>
           <para><varname>BLOOM_FILTER_DATA</varname> is the actual bloomfilter
           data. Obtained on-demand. Stored in the LRU cache, if it is enabled
           (it's enabled by default).</para>
| </section> |
| </section> |
| <section xml:id="config.bloom"> |
| <title>Bloom Filter Configuration</title> |
| <section> |
| <title><varname>io.hfile.bloom.enabled</varname> global kill |
| switch</title> |
| |
| <para><code>io.hfile.bloom.enabled</code> in |
| <classname>Configuration</classname> serves as the kill switch in case |
| something goes wrong. Default = <varname>true</varname>.</para> |
| </section> |
| |
| <section> |
| <title><varname>io.hfile.bloom.error.rate</varname></title> |
| |
          <para><varname>io.hfile.bloom.error.rate</varname> = average false
          positive rate. Default = 1%. Decreasing the rate by half (e.g., to .5%) costs
          roughly one additional bit per bloom entry.</para>
| </section> |
| |
| <section> |
| <title><varname>io.hfile.bloom.max.fold</varname></title> |
| |
| <para><varname>io.hfile.bloom.max.fold</varname> = guaranteed minimum |
| fold rate. Most people should leave this alone. Default = 7, or can |
| collapse to at least 1/128th of original size. See the |
| <emphasis>Development Process</emphasis> section of the document <link |
| xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters |
| in HBase</link> for more on what this option means.</para> |
| </section> |
| </section> |
| </section> <!-- bloom --> |
| |
| </section> <!-- reading --> |
| |
| <section xml:id="perf.deleting"> |
| <title>Deleting from HBase</title> |
| <section xml:id="perf.deleting.queue"> |
| <title>Using HBase Tables as Queues</title> |
| <para>HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in |
| this manner. As is documented in <xref linkend="datamodel" />, marking rows as deleted creates additional StoreFiles which then need to be processed |
| on reads. Tombstones only get cleaned up with major compactions. |
| </para> |
| <para>See also <xref linkend="compaction" /> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#majorCompact%28java.lang.String%29">HBaseAdmin.majorCompact</link>. |
| </para> |
| </section> |
| <section xml:id="perf.deleting.rpc"> |
| <title>Delete RPC Behavior</title> |
     <para>Be aware that <code>htable.delete(Delete)</code> doesn't use the writeBuffer.  It will execute a RegionServer RPC with each invocation.
     For a large number of deletes, consider <code>htable.delete(List&lt;Delete&gt;)</code>.
| </para> |
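     <para>A brief sketch of batching deletes into a single client call; <code>rowsToDelete</code> is an assumed,
     pre-built collection of row keys.</para>
     <programlisting>
List&lt;Delete&gt; deletes = new ArrayList&lt;Delete&gt;();
for (byte[] row : rowsToDelete) {
  deletes.add(new Delete(row));
}
// One call; the client batches the Deletes rather than issuing one RPC per Delete.
htable.delete(deletes);
</programlisting>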
| <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29"></link> |
| </para> |
| </section> |
| </section> <!-- deleting --> |
| |
| <section xml:id="perf.hdfs"><title>HDFS</title> |
| <para>Because HBase runs on <xref linkend="arch.hdfs" /> it is important to understand how it works and how it affects |
| HBase. |
| </para> |
| <section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title> |
       <para>The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority.
| With the increased adoption of Apache HBase this is changing, and several improvements are already in development. |
| See the |
| <link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>. |
| </para> |
| </section> |
| <section xml:id="perf.hdfs.configs.localread"> |
| <title>Leveraging local data</title> |
| <para>Since Hadoop 1.0.0 (also 0.22.1, 0.23.1, CDH3u3 and HDP 1.0) via |
| <link xlink:href="https://issues.apache.org/jira/browse/HDFS-2246">HDFS-2246</link>, |
| it is possible for the DFSClient to take a "short circuit" and |
| read directly from disk instead of going through the DataNode when the |
| data is local. What this means for HBase is that the RegionServers can |
| read directly off their machine's disks instead of having to open a |
| socket to talk to the DataNode, the former being generally much |
| faster<footnote><para>See JD's <link xlink:href="http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf">Performance Talk</link></para></footnote>. |
| Also see <link xlink:href="http://search-hadoop.com/m/zV6dKrLCVh1">HBase, mail # dev - read short circuit</link> thread for |
| more discussion around short circuit reads. |
| </para> |
| <para>To enable "short circuit" reads, it will depend on your version of Hadoop. |
| The original shortcircuit read patch was much improved upon in Hadoop 2 in |
| <link xlink:href="https://issues.apache.org/jira/browse/HDFS-347">HDFS-347</link>. |
| See <link xlink:href="http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/"></link> for details |
| on the difference between the old and new implementations. See |
| <link xlink:href="http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Hadoop shortcircuit reads configuration page</link> |
| for how to enable the later version of shortcircuit. |
| </para> |
| <para>If you are running on an old Hadoop, one that is without |
| <link xlink:href="https://issues.apache.org/jira/browse/HDFS-347">HDFS-347</link> but that |
| has |
| <link xlink:href="https://issues.apache.org/jira/browse/HDFS-2246">HDFS-2246</link>, |
| you must set two configurations. |
| First, the hdfs-site.xml needs to be amended. Set |
| the property <varname>dfs.block.local-path-access.user</varname> |
| to be the <emphasis>only</emphasis> user that can use the shortcut. |
| This has to be the user that started HBase. Then in hbase-site.xml, |
       set <varname>dfs.client.read.shortcircuit</varname> to be <varname>true</varname>.
| </para> |
| <para> |
| For optimal performance when short-circuit reads are enabled, it is recommended that HDFS checksums are disabled. |
| To maintain data integrity with HDFS checksums disabled, HBase can be configured to write its own checksums into |
| its datablocks and verify against these. See <xref linkend="hbase.regionserver.checksum.verify" />. When both |
| local short-circuit reads and hbase level checksums are enabled, you SHOULD NOT disable configuration parameter |
| "dfs.client.read.shortcircuit.skip.checksum", which will cause skipping checksum on non-hfile reads. HBase already |
| manages that setting under the covers. |
| </para> |
| <para> |
| The DataNodes need to be restarted in order to pick up the new |
| configuration. Be aware that if a process started under another |
| username than the one configured here also has the shortcircuit |
| enabled, it will get an Exception regarding an unauthorized access but |
| the data will still be read. |
| </para> |
| <note xml:id="dfs.client.read.shortcircuit.buffer.size"> |
| <title>dfs.client.read.shortcircuit.buffer.size</title> |
       <para>The default for this value is too high when running on a highly trafficked HBase.  Set it down from its
       1M default to 128k or so.  Put this configuration in the HBase configs (it's an HDFS client-side configuration).
| The Hadoop DFSClient in HBase will allocate a direct byte buffer of this size for <emphasis>each</emphasis> |
| block it has open; given HBase keeps its HDFS files open all the time, this can add up quickly.</para> |
| </note> |
| </section> |
| |
| <section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title> |
| <para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as |
| a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, |
| returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this |
| processing context. There is room for improvement and this gap will, over time, be reduced, but HDFS |
| will always be faster in this use-case. |
| </para> |
| </section> |
| </section> |
| |
| <section xml:id="perf.ec2"><title>Amazon EC2</title> |
| <para>Performance questions are common on Amazon EC2 environments because it is a shared environment. You will |
| not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same |
| reason (i.e., it's a shared environment and you don't know what else is happening on the server). |
| </para> |
   <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front,
     because EC2 issues are practically a separate class of performance issues.
| </para> |
| </section> |
| |
| <section xml:id="perf.hbase.mr.cluster"><title>Collocating HBase and MapReduce</title> |
| <para>It is often recommended to have different clusters for HBase and MapReduce. A better qualification of this is: |
| don't collocate a HBase that serves live requests with a heavy MR workload. OLTP and OLAP-optimized systems have |
| conflicting requirements and one will lose to the other, usually the former. For example, short latency-sensitive |
| disk reads will have to wait in line behind longer reads that are trying to squeeze out as much throughput as |
| possible. MR jobs that write to HBase will also generate flushes and compactions, which will in turn invalidate |
| blocks in the <xref linkend="block.cache"/>. |
| </para> |
| <para>If you need to process the data from your live HBase cluster in MR, you can ship the deltas with <xref linkend="copy.table"/> |
| or use replication to get the new data in real time on the OLAP cluster. In the worst case, if you really need to |
| collocate both, set MR to use less Map and Reduce slots than you'd normally configure, possibly just one. |
| </para> |
| <para>When HBase is used for OLAP operations, it's preferable to set it up in a hardened way like configuring the ZooKeeper session |
| timeout higher and giving more memory to the MemStores (the argument being that the Block Cache won't be used much |
| since the workloads are usually long scans). |
| </para> |
| </section> |
| |
| <section xml:id="perf.casestudy"><title>Case Studies</title> |
| <para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>. |
| </para> |
| </section> |
| </chapter> |