////
/**
*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
////
[[ops_mgt]]
= Apache HBase Operational Management
:doctype: book
:numbered:
:toc: left
:icons: font
:experimental:
This chapter will cover operational tools and practices required of a running Apache HBase cluster.
The subject of operations is related to the topics of <<trouble>>, <<performance>>, and <<configuration>> but is a distinct topic in itself.
[[tools]]
== HBase Tools and Utilities
HBase provides several tools for administration, analysis, and debugging of your cluster.
The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
To see usage instructions for the _bin/hbase_ command, run it with no arguments, or with the `-h` argument.
These are the usage instructions for HBase 0.98.x.
Some commands, such as `version`, `pe`, `ltt`, `clean`, are not available in previous versions.
----
$ bin/hbase
Usage: hbase [<options>] <command> [<args>]
Options:
--config DIR Configuration direction to use. Default: ./conf
--hosts HOSTS Override the list in 'regionservers' file
--auth-as-server Authenticate to ZooKeeper using servers configuration
Commands:
Some commands take arguments. Pass no args or -h for usage.
shell Run the HBase shell
hbck Run the HBase 'fsck' tool. Defaults read-only hbck1.
Pass '-j /path/to/HBCK2.jar' to run hbase-2.x HBCK2.
snapshot Tool for managing snapshots
wal Write-ahead-log analyzer
hfile Store file analyzer
zkcli Run the ZooKeeper shell
master Run an HBase HMaster node
regionserver Run an HBase HRegionServer node
zookeeper Run a ZooKeeper server
rest Run an HBase REST server
thrift Run the HBase Thrift server
thrift2 Run the HBase Thrift2 server
clean Run the HBase clean up script
classpath Dump hbase CLASSPATH
mapredcp Dump CLASSPATH entries required by mapreduce
pe Run PerformanceEvaluation
ltt Run LoadTestTool
canary Run the Canary tool
version Print the version
backup Backup tables for recovery
restore Restore tables from existing backup image
regionsplitter Run RegionSplitter tool
rowcounter Run RowCounter tool
cellcounter Run CellCounter tool
CLASSNAME Run the class named CLASSNAME
----
Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
Others, such as `hbase shell` (<<shell>>), `hbase upgrade` (<<upgrading>>), and `hbase thrift` (<<thrift>>), are documented elsewhere in this guide.
=== Canary
The Canary tool can help users "canary-test" the HBase cluster status.
The default "region mode" fetches a row from every column-family of every regions.
In "regionserver mode", the Canary tool will fetch a row from a random
region on each of the cluster's RegionServers. In "zookeeper mode", the
Canary will read the root znode on each member of the zookeeper ensemble.
To see usage, pass the `-help` parameter (if you pass no
parameters, the Canary tool starts executing in the default
"region mode", fetching a row from every region in the cluster).
----
2018-10-16 13:11:27,037 INFO [main] tool.Canary: Execution thread count=16
Usage: canary [OPTIONS] [<TABLE1> [<TABLE2]...] | [<REGIONSERVER1> [<REGIONSERVER2]..]
Where [OPTIONS] are:
-h,-help show this help and exit.
-regionserver set 'regionserver mode'; gets row from random region on server
-allRegions get from ALL regions when 'regionserver mode', not just random one.
-zookeeper set 'zookeeper mode'; grab zookeeper.znode.parent on each ensemble member
-daemon continuous check at defined intervals.
-interval <N> interval between checks in seconds
-e consider table/regionserver argument as regular expression
-f <B> exit on first error; default=true
-failureAsError treat read/write failure as error
-t <N> timeout for canary-test run; default=600000ms
-writeSniffing enable write sniffing
-writeTable the table used for write sniffing; default=hbase:canary
-writeTableTimeout <N> timeout for writeTable; default=600000ms
-readTableTimeouts <tableName>=<read timeout>,<tableName>=<read timeout>,...
comma-separated list of table read timeouts (no spaces);
logs 'ERROR' if takes longer. default=600000ms
-permittedZookeeperFailures <N> Ignore first N failures attempting to
connect to individual zookeeper nodes in ensemble
-D<configProperty>=<value> to assign or override configuration params
-Dhbase.canary.read.raw.enabled=<true/false> Set to enable/disable raw scan; default=false
Canary runs in one of three modes: region (default), regionserver, or zookeeper.
To sniff/probe all regions, pass no arguments.
To sniff/probe all regions of a table, pass tablename.
To sniff/probe regionservers, pass -regionserver, etc.
See http://hbase.apache.org/book.html#_canary for Canary documentation.
----
[NOTE]
The `Sink` class is instantiated using the `hbase.canary.sink.class` configuration property.
This tool returns non-zero error codes to the caller so that it can be integrated with other monitoring tools,
such as Nagios. The error code definitions are:
[source,java]
----
private static final int USAGE_EXIT_CODE = 1;
private static final int INIT_ERROR_EXIT_CODE = 2;
private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
private static final int ERROR_EXIT_CODE = 4;
private static final int FAILURE_EXIT_CODE = 5;
----
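For example, a minimal monitoring wrapper might map the exit code to an alert. This is only a sketch; the timeout value, alert text, and exit conventions are hypothetical and should be adapted to your monitoring system.
[source,bash]
----
#!/usr/bin/env bash
# Hypothetical wrapper: run the Canary with a 60 second timeout and turn
# any non-zero exit code into an alert line for an external monitor.
"${HBASE_HOME}/bin/hbase" canary -t 60000
rc=$?
if [ "${rc}" -ne 0 ]; then
  echo "CRITICAL: hbase canary failed with exit code ${rc}"
  exit 2
fi
echo "OK: hbase canary passed"
----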
Here are some examples based on the following case: two tables, test-01
and test-02, each with two column families, cf1 and cf2 respectively, deployed on 3 RegionServers.
See the following table.
[cols="1,1,1", options="header"]
|===
| RegionServer
| test-01
| test-02
| rs1 | r1 | r2
| rs2 | r2 |
| rs3 | r2 | r1
|===
The following are some example outputs for this scenario.
==== Canary test for every column family (store) of every region of every table
----
$ ${HBASE_HOME}/bin/hbase canary
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf1 in 2ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf2 in 2ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf1 in 4ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf2 in 1ms
...
13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf1 in 5ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf2 in 3ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf1 in 31ms
13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf2 in 8ms
----
As you can see, table test-01 has two regions and two column families, so the Canary tool in the
default "region mode" picks 4 small pieces of data from 4 (2 regions * 2 stores) different stores.
This is the default behavior.
==== Canary test for every column family (store) of every region of a specific table(s)
You can also test one or more specific tables by passing table names.
----
$ ${HBASE_HOME}/bin/hbase canary test-01 test-02
----
==== Canary test with RegionServer granularity
In "regionserver mode", the Canary tool will pick one small piece of data
from each RegionServer (You can also pass one or more RegionServer names as arguments
to the canary-test when in "regionserver mode").
----
$ ${HBASE_HOME}/bin/hbase canary -regionserver
13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs2 in 72ms
13/12/09 06:05:17 INFO tool.Canary: Read from table:test-02 on region server:rs3 in 34ms
13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs1 in 56ms
----
==== Canary test with regular expression pattern
You can pass regexes for table names when in "region mode" or for servernames when
in "regionserver mode". The below will test both table test-01 and test-02.
----
$ ${HBASE_HOME}/bin/hbase canary -e test-0[1-2]
----
==== Run canary test as a "daemon"
Run repeatedly with an interval defined via the `-interval` option (default value is 60 seconds).
This daemon will stop itself and return a non-zero error code if any error occurs. To have
the daemon keep running across errors, pass the `-f` flag with its value set to false
(see usage above).
----
$ ${HBASE_HOME}/bin/hbase canary -daemon
----
To run repeatedly with 5 second intervals and not stop on errors, do the following.
----
$ ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
----
==== Force timeout if canary test stuck
In some cases the request is stuck and no response is sent back to the client. This
can happen with dead RegionServers which the master has not yet noticed.
Because of this we provide a timeout option to kill the canary test and return a
non-zero error code. The below sets the timeout value to 60 seconds (the default value
is 600 seconds).
----
$ ${HBASE_HOME}/bin/hbase canary -t 60000
----
==== Enable write sniffing in canary
By default, the canary tool only checks read operations. To enable write sniffing,
run the canary with the `-writeSniffing` option set. When write sniffing is
enabled, the canary tool will create an HBase table and make sure the
regions of the table are distributed to all region servers. In each sniffing period,
the canary will try to put data to these regions to check the write availability of
each region server.
----
$ ${HBASE_HOME}/bin/hbase canary -writeSniffing
----
The default write table is `hbase:canary` and can be specified with the option `-writeTable`.
----
$ ${HBASE_HOME}/bin/hbase canary -writeSniffing -writeTable ns:canary
----
The default value size of each put is 10 bytes. You can change it via the config key
`hbase.canary.write.value.size`.
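For example, using the generic `-D<configProperty>=<value>` override shown in the usage above, the put value size could be raised like this (the 1024-byte value is just an illustration):
----
$ ${HBASE_HOME}/bin/hbase canary -writeSniffing -Dhbase.canary.write.value.size=1024
----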
==== Treat read / write failure as error
By default, the canary tool only logs read failures (due to, for example, RetriesExhaustedException)
and returns the 'normal' exit code. To treat read/write failures as errors, run the canary
with the `-treatFailureAsError` option. When enabled, read/write failures will result in an
error exit code.
----
$ ${HBASE_HOME}/bin/hbase canary -treatFailureAsError
----
==== Running Canary in a Kerberos-enabled Cluster
To run the Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
* `hbase.client.keytab.file`
* `hbase.client.kerberos.principal`
Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
* `hbase.client.dns.interface`
* `hbase.client.dns.nameserver`
.Canary in a Kerberos-Enabled Cluster
====
This example shows each of the properties with valid values.
[source,xml]
----
<property>
<name>hbase.client.kerberos.principal</name>
<value>hbase/_HOST@YOUR-REALM.COM</value>
</property>
<property>
<name>hbase.client.keytab.file</name>
<value>/etc/hbase/conf/keytab.krb5</value>
</property>
<!-- optional params -->
<property>
<name>hbase.client.dns.interface</name>
<value>default</value>
</property>
<property>
<name>hbase.client.dns.nameserver</name>
<value>default</value>
</property>
----
====
=== RegionSplitter
----
usage: bin/hbase regionsplitter <TABLE> <SPLITALGORITHM>
SPLITALGORITHM is the java class name of a class implementing
SplitAlgorithm, or one of the special strings
HexStringSplit or DecimalStringSplit or
UniformSplit, which are built-in split algorithms.
HexStringSplit treats keys as hexadecimal ASCII, and
DecimalStringSplit treats keys as decimal ASCII, and
UniformSplit treats keys as arbitrary bytes.
-c <region count> Create a new table with a pre-split number of
regions
-D <property=value> Override HBase Configuration Settings
-f <family:family:...> Column Families to create with new table.
Required with -c
--firstrow <arg> First Row in Table for Split Algorithm
-h Print this usage help
--lastrow <arg> Last Row in Table for Split Algorithm
-o <count> Max outstanding splits that have unfinished
major compactions
-r Perform a rolling split of an existing region
--risky Skip verification steps to complete
quickly. STRONGLY DISCOURAGED for production
systems.
----
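For example, a new table pre-split into 10 regions using the `HexStringSplit` algorithm might be created as follows. This is a sketch; the table and column family names are hypothetical, and <<manual_region_splitting_decisions>> has fuller examples.
----
$ bin/hbase regionsplitter test_table HexStringSplit -c 10 -f f1
----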
For additional detail, see <<manual_region_splitting_decisions>>.
[[health.check]]
=== Health Checker
You can configure HBase to run a script periodically and if it fails N times (configurable), have the server exit.
See _HBASE-7351 Periodic health check script_ for configurations and detail.
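A minimal health script might look like the sketch below. The threshold is hypothetical, and the exact configuration property names and the output format the checker expects are described in HBASE-7351, so verify them against your version before relying on this.
[source,bash]
----
#!/usr/bin/env bash
# Hypothetical node health script: report the node as unhealthy when the
# root filesystem is nearly full. Adapt the check and the output to what
# the health checker in your HBase version expects (see HBASE-7351).
used=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')
if [ "${used}" -gt 95 ]; then
  echo "ERROR: root filesystem is ${used}% full"
else
  echo "OK"
fi
----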
=== Driver
Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
These utilities represent MapReduce jobs which run on your cluster.
They are run in the following way, replacing _UtilityName_ with the utility you want to run.
This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
----
${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.UtilityName
----
The following utilities are available:
`LoadIncrementalHFiles`::
Complete a bulk data load.
`CopyTable`::
Export a table from the local cluster to a peer cluster.
`Export`::
Write table data to HDFS.
`Import`::
Import data written by a previous `Export` operation.
`ImportTsv`::
Import data in TSV format.
`RowCounter`::
Count rows in an HBase table.
`CellCounter`::
Count cells in an HBase table.
`replication.VerifyReplication`::
Compare the data from tables in two different clusters.
WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed.
Note that this command is in a different package than the others.
Each command except `RowCounter` and `CellCounter` accepts a single `--help` argument to print usage instructions.
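For example, the `RowCounter` utility can be launched through this mechanism as follows (`mytable` is a placeholder table name):
----
${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable
----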
[[hbck]]
=== HBase `hbck`
The `hbck` tool that shipped with hbase-1.x has been made read-only in hbase-2.x. It is not able to repair
hbase-2.x clusters as hbase internals have changed. Nor should its assessments in read-only mode be
trusted as it does not understand hbase-2.x operation.
A new tool, <<HBCK2>>, described in the next section, replaces `hbck`.
[[HBCK2]]
=== HBase `HBCK2`
`HBCK2` is the successor to <<hbck>>, the hbase-1.x fix tool (a.k.a `hbck1`). Use it in place of `hbck1`
when making repairs against hbase-2.x installs.
`HBCK2` does not ship as part of hbase. It can be found as a subproject of the companion
link:https://github.com/apache/hbase-operator-tools[hbase-operator-tools] repository at
link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[Apache HBase HBCK2 Tool].
`HBCK2` was moved out of hbase so it could evolve at a cadence apart from that of hbase core.
See the link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[HBCK2] Home Page
for how `HBCK2` differs from `hbck1`, and for how to build and use it.
Once built, you can run `HBCK2` as follows:
----
$ hbase hbck -j /path/to/HBCK2.jar
----
This will generate `HBCK2` usage describing commands and options.
[[hfile_tool2]]
=== HFile Tool
See <<hfile_tool>>.
=== WAL Tools
For bulk replaying WAL files or _recovered.edits_ files, see
<<walplayer>>. For reading/verifying individual files, read on.
[[hlog_tool.prettyprint]]
==== WALPrettyPrinter
The `WALPrettyPrinter` is a tool with configurable options to print the contents of a WAL
or a _recovered.edits_ file. You can invoke it via the HBase cli with the 'wal' command.
----
$ ./bin/hbase wal hdfs://example.org:8020/hbase/WALs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
----
.WAL Printing in older versions of HBase
[NOTE]
====
Prior to version 2.0, the `WALPrettyPrinter` was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
----
$ ./bin/hbase hlog hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
----
====
[[compression.tool]]
=== Compression Tool
See <<compression.test,compression.test>>.
[[copy.table]]
=== CopyTable
CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster.
The target table must first exist.
The usage is as follows:
----
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
rs.class hbase.regionserver.class of the peer cluster,
specify if different from current cluster
rs.impl hbase.regionserver.impl of the peer cluster,
startrow the start row
stoprow the stop row
starttime beginning of the time range (unixtime in millis)
without endtime means from starttime to forever
endtime end of the time range. Ignored if no starttime specified.
versions number of cell versions to copy
new.name new table's name
peer.adr Address of the peer cluster given in the format
hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
families comma-separated list of families to copy
To copy from cf1 to cf2, give sourceCfName:destCfName.
To keep the same name, just give "cfName"
all.cells also copy delete markers and deleted cells
Args:
tablename Name of the table to copy
Examples:
To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
For performance consider the following general options:
It is recommended that you set the following to >=100. A higher value uses more memory but
decreases the round trip time to the server and may increase performance.
-Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce
inaccurate results.
-Dmapred.map.tasks.speculative.execution=false
----
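For example, a simple same-cluster copy into a new table might look like the following. This is a sketch; it assumes a target table named `TestTableCopy` has already been created with the same column families as `TestTable`.
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=TestTableCopy TestTable
----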
.Scanner Caching
[NOTE]
====
Caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
====
.Versions
[NOTE]
====
By default, CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
====
.Data Load
[NOTE]
====
CopyTable does not perform a diff; it copies all Cells within the specified startrow/stoprow and starttime/endtime ranges.
This means that already-existing cells with the same values will still be copied.
====
See Jonathan Hsieh's link:https://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
HBase Backups with CopyTable] blog post for more on `CopyTable`.
[[hashtable.synctable]]
=== HashTable/SyncTable
HashTable/SyncTable is a two-step tool for synchronizing table data, where each step is implemented as a MapReduce job.
Similarly to CopyTable, it can be used for partial or entire table data syncing, within the same cluster or between remote clusters.
However, it performs the sync in a more efficient way than CopyTable. Instead of copying all cells
in the specified row key/time period range, HashTable (the first step) creates hashed indexes for batches of cells on the source table and outputs those as results.
In the next stage, SyncTable scans the source table and calculates hash indexes for table cells,
compares these hashes with the outputs of HashTable, then scans (and compares) cells only for diverging hashes, updating just the
mismatching cells. This results in less network traffic/data transfer, which matters when syncing large tables on remote clusters.
==== Step 1, HashTable
First, run HashTable on the source table cluster (this is the table whose state will be copied to its counterpart).
Usage:
----
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
Usage: HashTable [options] <tablename> <outputpath>
Options:
batchsize the target amount of bytes to hash in each batch
rows are added to the batch until this size is reached
(defaults to 8000 bytes)
numhashfiles the number of hash files to create
if set to fewer than number of regions then
the job will create this number of reducers
(defaults to 1/100 of regions -- at least 1)
startrow the start row
stoprow the stop row
starttime beginning of the time range (unixtime in millis)
without endtime means from starttime to forever
endtime end of the time range. Ignored if no starttime specified.
scanbatch scanner batch size to support intra row scans
versions number of cell versions to include
families comma-separated list of families to include
ignoreTimestamps if true, ignores cell timestamps
Args:
tablename Name of the table to hash
outputpath Filesystem path to put the output data
Examples:
To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
----
The *batchsize* property defines how much cell data for a given region will be hashed together in a single hash value.
Sizing this properly has a direct impact on sync efficiency, as it may lead to fewer scans executed by the mapper tasks
of SyncTable (the next step in the process). The rule of thumb is: the fewer cells expected to be out of sync
(the lower the probability of finding a diff), the larger the batch size value can be.
==== Step 2, SyncTable
Once HashTable has completed on the source cluster, SyncTable can be run on the target cluster.
Just like replication and other synchronization jobs, it requires that all RegionServers/DataNodes
on the source cluster be accessible by NodeManagers on the target cluster (where the SyncTable job tasks will be running).
Usage:
----
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
Options:
sourcezkcluster ZK cluster key of the source table
(defaults to cluster in classpath's config)
targetzkcluster ZK cluster key of the target table
(defaults to cluster in classpath's config)
dryrun if true, output counters but no writes
(defaults to false)
doDeletes if false, does not perform deletes
(defaults to true)
doPuts if false, does not perform puts
(defaults to true)
ignoreTimestamps if true, ignores cells timestamps while comparing
cell values. Any missing cell on target then gets
added with current time as timestamp
(defaults to false)
Args:
sourcehashdir path to HashTable output dir for source table
(see org.apache.hadoop.hbase.mapreduce.HashTable)
sourcetable Name of the source table to sync from
targettable Name of the target table to sync to
Examples:
For a dry run SyncTable of tableA from a remote source cluster
to a local target cluster:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
----
Cell comparison takes ROW/FAMILY/QUALIFIER/TIMESTAMP/VALUE into account for equality. When syncing at the target, missing cells will be
added with the original timestamp value from the source. That may cause unexpected results after SyncTable completes. For example, if missing
cells on the target have a delete marker with a timestamp T2 (say, a bulk delete performed by mistake), but the source cells' timestamps have an
older value T1, those cells will still be unavailable at the target because of the newer delete marker timestamp. Since cell timestamps
might not be relevant to all use cases, the _ignoreTimestamps_ option adds the flexibility of not using cell timestamps in the comparison.
When setting _ignoreTimestamps_ to true, the option must be specified for both the HashTable and SyncTable steps.
The *dryrun* option is useful when a read-only diff report is wanted, as it will only produce COUNTERS indicating the differences, but will not perform
any actual changes. It can be used as an alternative to the VerifyReplication tool.
By default, SyncTable will cause the target table to become an exact copy of the source table (at least for the specified startrow/stoprow and/or starttime/endtime).
Setting doDeletes to false modifies the default behaviour to not delete target cells that are missing on the source.
Similarly, setting doPuts to false modifies the default behaviour to not add missing cells on the target. Setting both doDeletes
and doPuts to false gives the same effect as setting dryrun to true.
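For example, a one-way sync that only adds or updates cells on the target, never deleting anything there, might look like the following sketch (the cluster addresses and paths are taken from the earlier dry run example):
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
----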
.Additional info on doDeletes/doPuts
[NOTE]
====
"doDeletes/doPuts" were only added by
link:https://jira.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on
all released versions.
For major 1.x versions, minimum minor release including it is *1.4.10*.
For major 2.x versions, minimum minor release including it is *2.1.5*.
====
.Additional info on ignoreTimestamps
[NOTE]
====
"ignoreTimestamps" was only added by
link:https://issues.apache.org/jira/browse/HBASE-24302[HBASE-24302], so it may not be available on
all released versions.
For major 1.x versions, minimum minor release including it is *1.4.14*.
For major 2.x versions, minimum minor release including it is *2.2.5*.
====
.Set doDeletes to false on Two-Way Replication scenarios
[NOTE]
====
On Two-Way Replication or other scenarios where both source and target clusters can have data ingested, it's advisable to always set the doDeletes option to false,
as any additional cell inserted on the SyncTable target cluster and not yet replicated to the source would be deleted, and potentially lost permanently.
====
.Set sourcezkcluster to the actual source cluster ZK quorum
[NOTE]
====
Although not required, if sourcezkcluster is not set, SyncTable will connect to the local HBase cluster for both source and target,
which does not give any meaningful result.
====
.Remote Clusters on different Kerberos Realms
[NOTE]
====
Often, remote clusters may be deployed on different Kerberos Realms.
link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for
cross realm authentication, allowing a SyncTable process running on target cluster to connect to
source cluster and read both HashTable output files and the given HBase table when performing the
required comparisons.
====
[[export]]
=== Export
Export is a utility that will dump the contents of a table to HDFS in a sequence file.
Export can be run via a Coprocessor Endpoint or MapReduce. Invoke via:
*mapreduce-based Export*
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
----
*endpoint-based Export*
NOTE: Make sure the Export coprocessor is enabled by adding `org.apache.hadoop.hbase.coprocessor.Export` to `hbase.coprocessor.region.classes`.
----
$ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
----
The outputdir is an HDFS directory that does not exist prior to the export. When done, the exported files will be owned by the user invoking the export command.
*The Comparison of Endpoint-based Export And Mapreduce-based Export*
|===
||Endpoint-based Export|Mapreduce-based Export
|HBase version requirement
|2.0+
|0.2.1+
|Maven dependency
|hbase-endpoint
|hbase-mapreduce (2.0+), hbase-server(prior to 2.0)
|Requirement before dump
|mount the endpoint.Export on the target table
|deploy the MapReduce framework
|Read latency
|low, directly read the data from region
|normal, traditional RPC scan
|Read Scalability
|depend on number of regions
|depend on number of mappers (see TableInputFormatBase#getSplits)
|Timeout
|operation timeout. configured by hbase.client.operation.timeout
|scan timeout. configured by hbase.client.scanner.timeout.period
|Permission requirement
|READ, EXECUTE
|READ
|Fault tolerance
|no
|depend on MapReduce
|===
NOTE: To see usage instructions, run the command with no options. Available options include
specifying column families and applying filters during the export.
By default, the `Export` tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace *_<versions>_* with the desired number of versions.
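For example, the following sketch exports up to 3 versions of each cell from a hypothetical table `MyTable` into a new HDFS directory:
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export MyTable /export/MyTable 3
----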
Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
[[import]]
=== Import
Import is a utility that will load data that has been exported back into HBase.
Invoke via:
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
----
NOTE: To see usage instructions, run the command with no options.
To import 0.94 exported files into a 0.96 cluster or later, you need to set the system property "hbase.import.version" when running the import command, as below:
----
$ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
----
[[importtsv]]
=== ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the `completebulkload` utility.
To load data via Puts (i.e., non-bulk loading):
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
----
To generate StoreFiles for bulk-loading:
[source,bourne]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
----
These generated StoreFiles can be loaded into HBase via <<completebulkload,completebulkload>>.
[[importtsv.options]]
==== ImportTsv Options
Running `ImportTsv` with no arguments prints brief usage information:
----
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
----
[[importtsv.example]]
==== ImportTsv Example
For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
Assume that an input file exists as follows:
----
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2
----
For ImportTsv to use this input file, the command line needs to look like this:
----
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
----
\... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used.
The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
[[importtsv.warning]]
==== ImportTsv Warning
If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
[[importtsv.also]]
==== See Also
For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
[[completebulkload]]
=== CompleteBulkLoad
The `completebulkload` utility will move generated StoreFiles into an HBase table.
This utility is often used in conjunction with output from <<importtsv,importtsv>>.
There are two ways to invoke this utility, with explicit classname and via the driver:
.Explicit Classname
----
$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
----
.Driver
----
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
----
[[completebulkload.warning]]
==== CompleteBulkLoad Warning
Data generated via MapReduce is often created with file permissions that are not compatible with the running HBase process.
Assuming you're running HDFS with permissions enabled, those permissions will need to be updated before you run CompleteBulkLoad.
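One common approach is to change ownership of the generated files to the user running the HBase process before loading. The sketch below assumes the HBase service user is `hbase`, the output directory is a hypothetical `/user/me/storefileoutput`, and the command is run by an HDFS user allowed to change ownership (typically the HDFS superuser):
----
$ hdfs dfs -chown -R hbase:hbase /user/me/storefileoutput
----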
For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
[[walplayer]]
=== WALPlayer
WALPlayer is a utility to replay WAL files into HBase.
The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables.
The output can optionally be mapped to another set of tables.
WALPlayer can also generate HFiles for later bulk importing; in that case only a single table can be specified, and no mapping is allowed.
Finally, you can use WALPlayer to replay the content of a Region's `recovered.edits` directory (the files under a
`recovered.edits` directory have the same format as WAL files).
.WALPrettyPrinter
[NOTE]
====
To read or verify single WAL files or _recovered.edits_ files, since they share the WAL format,
see <<_wal_tools>>.
====
Invoke via:
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]
----
For example:
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
----
WALPlayer, by default, runs as a mapreduce job.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run entirely in the local process by adding the flag `-Dmapreduce.jobtracker.address=local` on the command line.
[[walplayer.options]]
==== WALPlayer Options
Running `WALPlayer` with no arguments prints brief usage information:
----
Usage: WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]
<WAL inputdir> directory of WALs to replay.
<tables> comma separated list of tables. If no tables specified,
all are imported (even hbase:meta if present).
<tableMappings> WAL entries can be mapped to a new set of tables by passing
<tableMappings>, a comma separated list of target tables.
If specified, each table in <tables> must have a mapping.
To generate HFiles to bulk load instead of loading HBase directly, pass:
-Dwal.bulk.output=/path/for/output
Only one table can be specified, and no mapping allowed!
To specify a time range, pass:
-Dwal.start.time=[date|ms]
-Dwal.end.time=[date|ms]
The start and the end date of timerange (inclusive). The dates can be
expressed in milliseconds-since-epoch or yyyy-MM-dd'T'HH:mm:ss.SS format.
E.g. 1234567890120 or 2009-02-13T23:32:30.12
Other options:
-Dmapreduce.job.name=jobName
Use the specified mapreduce job name for the wal player
-Dwal.input.separator=' '
Change WAL filename separator (WAL dir names use default ','.)
For performance also consider the following options:
-Dmapreduce.map.speculative=false
-Dmapreduce.reduce.speculative=false
----
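For example, to generate HFiles for a single table instead of writing to HBase directly, a sketch (the WAL directory is from the earlier example; the output path is hypothetical):
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer -Dwal.bulk.output=/tmp/walplay /backuplogdir oldTable1
----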
[[rowcounter]]
=== RowCounter
link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table.
This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
It will run the mapreduce job all in a single process, but it will run faster if you have a MapReduce cluster in place for it to exploit.
It is possible to limit the time range of data to be scanned by using the `--starttime=[starttime]` and `--endtime=[endtime]` flags.
The scanned data can be limited based on keys using the `--range=[startKey],[endKey][;[startKey],[endKey]...]` option.
----
$ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]...]] [<column1> <column2>...]
----
RowCounter only counts one version per cell.
For performance, consider using the `-Dhbase.client.scanner.caching=100` and `-Dmapreduce.map.speculative=false` options.
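Putting those options together, a run might look like the following sketch (the table name is hypothetical, and whether `-D` overrides are accepted on the command line this way depends on how the job is launched in your HBase version):
----
$ bin/hbase rowcounter -Dhbase.client.scanner.caching=100 -Dmapreduce.map.speculative=false mytable
----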
[[cellcounter]]
=== CellCounter
HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
Like RowCounter, it gathers statistics about your table.
However, the statistics gathered by CellCounter are more fine-grained and include:
* Total number of rows in the table.
* Total number of CFs across all rows.
* Total qualifiers across all rows.
* Total occurrence of each CF.
* Total occurrence of each qualifier.
* Total number of versions of each qualifier.
The program allows you to limit the scope of the run.
Provide a row regex or prefix to limit the rows to analyze.
Specify a time range to scan the table by using the `--starttime=<starttime>` and `--endtime=<endtime>` flags.
Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
----
$ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]
----
Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
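For example, limiting the run to a single column family might look like the following sketch (the table name, output directory, and family name are hypothetical, and whether `-D` overrides are accepted this way depends on how the job is launched in your HBase version):
----
$ bin/hbase cellcounter -Dhbase.mapreduce.scan.column.family=cf1 mytable /tmp/cellcount
----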
=== mlockall
It is possible to optionally pin your servers in physical memory, making them less likely to be swapped out in oversubscribed environments, by having the servers call link:http://linux.die.net/man/2/mlockall[mlockall] on startup.
See link:https://issues.apache.org/jira/browse/HBASE-4391[HBASE-4391 Add ability to start RS as root and call mlockall] for how to build the optional library and have it run on startup.
[[compaction.tool]]
=== Offline Compaction Tool
*CompactionTool* provides a way of running compactions (either minor or major) as an independent
process from the RegionServer. It reuses the same internal implementation classes executed by the RegionServer
compaction feature. However, since it runs in a completely separate, independent Java process, it
frees RegionServers from the overhead involved in rewriting a set of HFiles, which can be critical
for latency-sensitive use cases.
Usage:
----
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool
Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \
[-compactOnce] [-major] [-mapred] [-D<property=value>]* files...
Options:
mapred Use MapReduce to run compaction.
compactOnce Execute just one compaction step. (default: while needed)
major Trigger major compaction.
Note: -D properties will be applied to the conf used.
For example:
To stop delete of compacted file, pass -Dhbase.compactiontool.delete=false
To set tmp dir, pass -Dhbase.tmp.dir=ALTERNATE_DIR
Examples:
To compact the full 'TestTable' using MapReduce:
$ hbase org.apache.hadoop.hbase.regionserver.CompactionTool -mapred hdfs://hbase/data/default/TestTable
To compact column family 'x' of the table 'TestTable' region 'abc':
$ hbase org.apache.hadoop.hbase.regionserver.CompactionTool hdfs://hbase/data/default/TestTable/abc/x
----
As shown by the usage options above, *CompactionTool* can run as a standalone client or as a mapreduce job.
When running as a mapreduce job, each family dir is handled as an input split, and is processed
by a separate map task.
The *compactOnce* parameter controls how many compaction cycles will be performed until the
*CompactionTool* program decides to finish its work. If omitted, it will keep
running compactions on each specified family as determined by the configured compaction policy.
For more info on compaction policy, see <<compaction,compaction>>.
If a major compaction is desired, the *-major* flag can be specified. If omitted, *CompactionTool* will
assume a minor compaction is wanted by default.
It also allows for configuration overrides with the `-D` flag. In the usage section above, for example,
the `-Dhbase.compactiontool.delete=false` option instructs the compaction engine not to delete the original
files from the temp folder.
Files targeted for compaction must be specified as parent hdfs dirs. Multiple dirs may be passed,
as long as each of these dirs is either a *family*, a *region*, or a *table* dir. If a
table or region dir is passed, the program will recursively iterate through the related sub-folders,
effectively running compaction for each family found below the table/region level.
Since these dirs are nested under the *hbase* hdfs directory tree, *CompactionTool* requires hbase super
user permissions in order to have access to the required hfiles.
.Running in MapReduce mode
[NOTE]
====
MapReduce mode offers the ability to process each family dir in parallel, as a separate map task.
Generally, it makes sense to run in this mode when specifying one or more table dirs as targets
for compactions. The caveat, though, is that if the number of families to be compacted becomes too large,
the related mapreduce job may have an indirect impact on *RegionServer* performance.
Since *NodeManagers* are normally co-located with RegionServers, such large jobs could
compete for IO/bandwidth resources with the *RegionServers*.
====
.Major compactions completely disabled on RegionServers due to performance impacts
[NOTE]
====
*Major compactions* can be a costly operation (see <<compaction,compaction>>), and can indeed
impact performance on RegionServers, leading operators to completely disable them for critical
low-latency applications. *CompactionTool* can be used as an alternative in such scenarios,
although additional custom application logic would need to be implemented, such as deciding on the
scheduling and selection of tables/regions/families targeted for a given compaction run.
====
For additional details about CompactionTool, see also
link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/CompactionTool.html[CompactionTool].
=== `hbase clean`
The `hbase clean` command cleans HBase data from ZooKeeper, HDFS, or both.
It is appropriate to use for testing.
Run it with no options for usage instructions.
The `hbase clean` command was introduced in HBase 0.98.
----
$ bin/hbase clean
Usage: hbase clean (--cleanZk|--cleanHdfs|--cleanAll)
Options:
--cleanZk cleans hbase related data from zookeeper.
--cleanHdfs cleans hbase related data from hdfs.
--cleanAll cleans hbase related data from both zookeeper and hdfs.
----
=== `hbase pe`
The `hbase pe` command runs the PerformanceEvaluation tool, which is used for testing.
The PerformanceEvaluation tool accepts many different options and commands.
For usage instructions, run the command with no options.
The PerformanceEvaluation tool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, multiget support for RPC calls, increased sampling sizes, an option to randomly sleep during testing, and ability to "warm up" the cluster before testing starts.
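For example, a small local (non-MapReduce) random-write run with a single client might look like the following sketch; the row count and command are illustrative, so check the tool's usage output for the exact options available in your release:
----
$ bin/hbase pe --nomapred --rows=10000 randomWrite 1
----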
=== `hbase ltt`
The `hbase ltt` command runs the LoadTestTool utility, which is used for testing.
You must specify either `-init_only` or at least one of `-write`, `-update`, or `-read`.
For general usage instructions, pass the `-h` option.
The LoadTestTool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, testing security-related features, ability to specify the number of regions per server, tests for multi-get RPC calls, and tests relating to replication.
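For example, a simple write-only load against a hypothetical table might look like the following sketch; the option formats are illustrative, so confirm them with the `-h` output for your release:
----
$ bin/hbase ltt -tn load_test -write 3:1024 -num_keys 100000
----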
[[ops.pre-upgrade]]
=== Pre-Upgrade validator
The Pre-Upgrade validator tool can be used to check the cluster for known incompatibilities before upgrading from HBase 1 to HBase 2.
[source, bash]
----
$ bin/hbase pre-upgrade command ...
----
==== Coprocessor validation
HBase has supported co-processors for a long time, but the co-processor API can change between major releases. The co-processor validator tries to determine
whether the old co-processors are still compatible with the actual HBase version.
[source, bash]
----
$ bin/hbase pre-upgrade validate-cp [-jar ...] [-class ... | -table ... | -config]
Options:
-e Treat warnings as errors.
-jar <arg> Jar file/directory of the coprocessor.
-table <arg> Table coprocessor(s) to check.
-class <arg> Coprocessor class(es) to check.
-config Scan jar for observers.
----
The co-processor classes can be explicitly declared with the `-class` option, or they can be obtained from the HBase configuration with the `-config` option.
Table-level co-processors can also be checked with the `-table` option. The tool searches for co-processors on its classpath, but the classpath can be extended
with the `-jar` option. It is possible to test multiple classes with multiple `-class` options, multiple tables with multiple `-table` options, as well as
adding multiple jars to the classpath with multiple `-jar` options.
The tool can report errors and warnings. Errors mean that HBase won't be able to load the coprocessor, because it is incompatible with the current version
of HBase. Warnings mean that the co-processors can be loaded, but they won't work as expected. If the `-e` option is given, then the tool will also fail
for warnings.
Please note that this tool cannot validate every aspect of jar files; it just does some static checks.
For example:
[source, bash]
----
$ bin/hbase pre-upgrade validate-cp -jar my-coprocessor.jar -class MyMasterObserver -class MyRegionObserver
----
It validates `MyMasterObserver` and `MyRegionObserver` classes which are located in `my-coprocessor.jar`.
[source, bash]
----
$ bin/hbase pre-upgrade validate-cp -table .*
----
It validates every table-level co-processor where the table name matches the `.*` regular expression.
==== DataBlockEncoding validation
HBase 2.0 removed `PREFIX_TREE` Data Block Encoding from column families. For further information
please check <<upgrade2.0.prefix-tree.removed,_prefix-tree_ encoding removed>>.
To verify that none of the column families are using incompatible Data Block Encodings in the cluster, run the following command.
[source, bash]
----
$ bin/hbase pre-upgrade validate-dbe
----
This check validates all column families and prints out any incompatibilities. For example:
----
2018-07-13 09:58:32,028 WARN [main] tool.DataBlockEncodingValidator: Incompatible DataBlockEncoding for table: t, cf: f, encoding: PREFIX_TREE
----
This means that the Data Block Encoding of table `t`, column family `f`, is incompatible. To fix it, use the `alter` command in the HBase shell:
----
alter 't', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
----
Please also validate HFiles, which is described in the next section.
==== HFile Content validation
Even after the Data Block Encoding is changed from `PREFIX_TREE`, it is still possible to have HFiles that contain data encoded that way.
To verify that HFiles are readable with HBase 2, please use the _HFile content validator_.
[source, bash]
----
$ bin/hbase pre-upgrade validate-hfile
----
The tool will log the corrupt HFiles and details about the root cause.
If the problem is related to PREFIX_TREE encoding, it is necessary to change the encodings before upgrading to HBase 2.
The following log messages show an example of incorrect HFiles.
----
2018-06-05 16:20:46,976 WARN [hfilevalidator-pool1-t3] hbck.HFileCorruptionChecker: Found corrupt HFile hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
...
Caused by: java.io.IOException: Invalid data block encoding type in file info: PREFIX_TREE
...
Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.PREFIX_TREE
...
2018-06-05 16:20:47,322 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
2018-06-05 16:20:47,383 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/archive/data/default/t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1
----
===== Fixing PREFIX_TREE errors
It's possible to get `PREFIX_TREE` errors after changing the Data Block Encoding to a supported one. This can happen
because some HFiles are still encoded with `PREFIX_TREE`, or because some snapshots still reference such files.
To fix the HFiles, run a major compaction on the table (it was `default:t` according to the log message):
----
major_compact 't'
----
HFiles can be referenced from snapshots, too. This is the case when the HFile is located under `archive/data`.
The first step is to determine which snapshot references that HFile (the name of the file was `29c641ae91c34fc3bee881f45436b6d1`
according to the logs):
[source, bash]
----
for snapshot in $(hbase snapshotinfo -list-snapshots 2> /dev/null | tail -n -1 | cut -f 1 -d \|);
do
echo "checking snapshot named '${snapshot}'";
hbase snapshotinfo -snapshot "${snapshot}" -files 2> /dev/null | grep 29c641ae91c34fc3bee881f45436b6d1;
done
----
The output of this shell script is:
----
checking snapshot named 't_snap'
1.0 K t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1 (archive)
----
This means the `t_snap` snapshot references the incompatible HFile. If the snapshot is still needed,
then it has to be recreated with the HBase shell:
----
# creating a new namespace for the cleanup process
create_namespace 'pre_upgrade_cleanup'
# cloning the snapshot to a temporary table
clone_snapshot 't_snap', 'pre_upgrade_cleanup:t'
alter 'pre_upgrade_cleanup:t', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
major_compact 'pre_upgrade_cleanup:t'
# removing the invalid snapshot
delete_snapshot 't_snap'
# creating a new snapshot
snapshot 'pre_upgrade_cleanup:t', 't_snap'
# removing temporary table
disable 'pre_upgrade_cleanup:t'
drop 'pre_upgrade_cleanup:t'
drop_namespace 'pre_upgrade_cleanup'
----
For further information, please refer to
link:https://issues.apache.org/jira/browse/HBASE-20649?focusedCommentId=16535476#comment-16535476[HBASE-20649].
=== Data Block Encoding Tool
The Data Block Encoding tool tests various compression algorithms with different data block encoders for key compression on an existing HFile.
It is useful for testing, debugging and benchmarking.
You must specify `-f`, which is the full path of the HFile.
The result shows both the performance (MB/s) of compression/decompression and encoding/decoding, and the data savings on the HFile.
----
$ bin/hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
Usages: hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
Options:
-f HFile to analyse (REQUIRED)
-n Maximum number of key/value pairs to process in a single benchmark run.
-b Whether to run a benchmark to measure read throughput.
-c If this is specified, no correctness testing will be done.
-a What kind of compression algorithm use for test. Default value: GZ.
-t Number of times to run each benchmark. Default value: 12.
-omit Number of first runs of every benchmark to omit from statistics. Default value: 2.
----
[[ops.regionmgt]]
== Region Management
[[ops.regionmgt.majorcompact]]
=== Major Compaction
Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact].
Note: major compactions do NOT do region merges.
See <<compaction,compaction>> for more information about compactions.
[[ops.regionmgt.merge]]
=== Merge
Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
[source,bourne]
----
$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
----
If you feel you have too many regions and want to consolidate them, Merge is the utility you need.
Merge must be run when the cluster is down.
See the link:https://web.archive.org/web/20111231002503/http://ofps.oreilly.com/titles/9781449396107/performance.html[O'Reilly HBase
Book] for an example of usage.
You will need to pass 3 parameters to this application.
The first one is the table name.
The second one is the fully qualified name of the first region to merge, like "table_name,\x0A,1342956111995.7cef47f192318ba7ccc75b1bbf27a82b.". The third one is the fully qualified name for the second region to merge.
Additionally, there is a Ruby script attached to link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621] for region merging.
[[node.management]]
== Node Management
[[decommission]]
=== Node Decommission
You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
----
$ ./bin/hbase-daemon.sh stop regionserver
----
The RegionServer will first close all regions and then shut itself down.
On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
.Disable the Load Balancer before Decommissioning a node
[NOTE]
====
If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer.
Avoid any problems by disabling the balancer first.
See <<lb,lb>> below.
====
.Kill Node Tool
[NOTE]
====
In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to kill a regionserver.
Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. _considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
It deletes all the znodes of the server, starting the recovery process.
Plug in the script into your monitoring/fault detection tools to initiate faster failover.
Be careful how you use this disruptive tool.
Copy the script if you need to make use of it in a version of hbase previous to hbase-2.0.
====
A downside to the above stop of a RegionServer is that regions could be offline for a good period of time.
Regions are closed in order.
If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices the RegionServer's znode is gone.
In Apache HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down:
the _graceful_stop.sh_ script.
Here is its usage:
----
$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [-e] [--restart [--reload]] [--thrift] [--rest] [-n |--noack] [--maxthreads <number of threads>] [--movetimeout <timeout in seconds>] [-nob |--nobalancer] [-d |--designatedfile <file path>] [-x |--excludefile <file path>] <hostname>
thrift If we should stop/start thrift before/after the hbase stop/start
rest If we should stop/start rest before/after the hbase stop/start
restart If we should restart after graceful stop
reload Move offloaded regions back on to the restarted server
n|noack Enable noAck mode in RegionMover. This is a best effort mode for moving regions
maxthreads xx Limit the number of threads used by the region mover. Default value is 1.
movetimeout xx Timeout for moving regions. If regions are not moved by the timeout value,exit with error. Default value is INT_MAX.
hostname Hostname of server we are to stop
e|failfast Set -e so exit immediately if any command exits with non-zero status
nob|nobalancer Do not manage balancer states. This is only used as optimization in rolling_restart.sh to avoid multiple calls to hbase shell
d|designatedfile xx Designated file with <hostname:port> per line as unload targets
x|excludefile xx Exclude file should have <hostname:port> per line. We do not unload regions to hostnames given in exclude file
----
To decommission a loaded RegionServer, run +$ ./bin/graceful_stop.sh HOSTNAME+ where `HOSTNAME` is the host carrying the RegionServer you want to decommission.
.On `HOSTNAME`
[NOTE]
====
The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
Check the list of RegionServers in the master UI for how HBase is referring to servers.
It's usually hostname but can also be FQDN.
Whatever HBase is using, this is what you should pass to the _graceful_stop.sh_ decommission script.
If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.
====
The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
It will verify that the region is deployed in the new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions.
At this point, the _graceful_stop.sh_ tells the RegionServer `stop`.
The master will at this point notice the RegionServer gone but all regions will have already been redeployed and because the RegionServer went down cleanly, there will be no WAL logs to split.
[[lb]]
.Load Balancer
[NOTE]
====
It is assumed that the Region Load Balancer is disabled while the `graceful_stop` script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
[source]
----
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds
----
This turns the balancer OFF.
To reenable, do:
[source]
----
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds
----
The `graceful_stop` will check the balancer and if enabled, will turn it off before it goes to work.
If it exits prematurely because of error, it will not have reset the balancer.
Hence, it is better to manage the balancer apart from `graceful_stop`, re-enabling it yourself after you are done with `graceful_stop`.
====
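Putting the steps together, a minimal decommission of a single loaded RegionServer might look like the following sketch; the hostname is a placeholder and the paths assume you run from the HBase installation directory.
[source,bourne]
----
# disable the balancer so it does not compete with the region mover
$ echo "balance_switch false" | ./bin/hbase shell
# gracefully unload regions from, then stop, the RegionServer (hostname is illustrative)
$ ./bin/graceful_stop.sh rs-node-1.example.com
# re-enable the balancer once you are done
$ echo "balance_switch true" | ./bin/hbase shell
----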
[[draining.servers]]
==== Decommissioning several Regions Servers concurrently
If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
Marking RegionServers to be in the draining state prevents this from happening.
See this link:http://inchoate-clatter.blogspot.com/2012/03/hbase-ops-automation.html[blog
post] for more details.
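As a sketch, the draining entry can be created with the bundled ZooKeeper CLI; the server name, port, and startcode below are placeholders, and the path assumes the default `zookeeper.znode.parent` of `/hbase`.
[source,bourne]
----
# mark a RegionServer as draining by creating its znode under the draining znode
$ ./bin/hbase zkcli create /hbase/draining/rs-node-1.example.com,16020,1596613123456 ""
----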
[[bad.disk]]
==== Bad or Failing Disk
It is good having <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine for the case where a disk plain dies.
But usually disks do the "John Wayne" -- i.e.
take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
In this case you want to decommission the disk.
You have two options.
You can link:https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html[decommission
the datanode] or, less disruptive in that only the bad disk's data will be rereplicated, you can stop the datanode, unmount the bad volume (you can't umount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general, apart from some latency spikes, it should keep on chugging.
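A rough sketch of the second option follows; the device and mount point are placeholders, and the commands assume a Hadoop 3 style CLI (older releases use _hadoop-daemon.sh_ instead).
[source,bourne]
----
# on the affected DataNode host: stop the datanode, unmount the bad volume, restart
$ hdfs --daemon stop datanode
$ umount /data/disk3
$ hdfs --daemon start datanode
----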
.Short Circuit Reads
[NOTE]
====
If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode; with short-circuit reads, even though the files are chmod'd so the regionserver cannot access them, because it already has the files open the regionserver will be able to keep reading the file blocks from the bad disk even though the datanode is down.
Move the regions back after you restart the datanode.
====
[[rolling]]
=== Rolling Restart
Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes.
In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible.
See the release notes for the release you want to upgrade to, to find out about any limitations on performing a rolling upgrade.
There are multiple ways to restart your cluster nodes, depending on your situation.
These methods are detailed below.
==== Using the `rolling-restart.sh` Script
HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
The script is provided as a template for your own script, and is not explicitly tested.
It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
The script requires you to set some environment variables before running it.
Examine the script and modify it to suit your needs.
._rolling-restart.sh_ General Usage
----
$ ./bin/rolling-restart.sh --help
Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
----
Rolling Restart on RegionServers Only::
To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
Rolling Restart on Masters Only::
To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
Graceful Restart::
If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
This is safer, but can delay the restart.
Limiting the Number of Threads::
To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
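For example, a graceful rolling restart of the RegionServers only, using at most four region-mover threads, might be invoked as follows; this is illustrative, so check `--help` for the options your version supports.
[source,bourne]
----
$ ./bin/rolling-restart.sh --rs-only --graceful --maxthreads 4
----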
[[rolling.restart.manual]]
==== Manual Rolling Restart
To retain more control over the process, you may wish to manually do a rolling restart across your cluster.
This method uses the _graceful_stop.sh_ script described in <<decommission,decommission>>.
In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality.
If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method.
The following is an example of such a command.
You may need to tailor it to your environment.
This script does a rolling restart of RegionServers only.
It disables the load balancer before moving the regions.
----
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
----
Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
==== Logic for Crafting Your Own Rolling Restart Script
Use the following guidelines if you want to create your own rolling restart script.
. Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using `rsync`, `scp`, or another secure synchronization mechanism.
. Restart the master first.
You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
+
----
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
----
. Gracefully restart each RegionServer, using a script such as the following, from the Master.
+
----
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
----
+
If you are running Thrift or REST servers, pass the `--thrift` or `--rest` options.
For other available options, run the `bin/graceful_stop.sh --help` command.
+
It is important to drain HBase regions slowly when restarting multiple RegionServers.
Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon.
This can negatively affect performance.
You can inject delays into the script above, for instance, by adding a Shell command such as `sleep`.
To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
+
----
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i & sleep 5m; done &> /tmp/log.txt &
----
. Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
[[adding.new.node]]
=== Adding a New Node
Adding a new regionserver in HBase is essentially free; you simply start it like this: `$ ./bin/hbase-daemon.sh start regionserver` and it will register itself with the master.
Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
At this point the region server isn't serving data because no regions have moved to it yet.
If the balancer is enabled, it will start moving regions to the new RS.
On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time.
It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests.
Apart from resulting in higher latency, it may also saturate your network card's capacity.
For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than _100MB/s_.
In this case, or if you are in an OLAP environment and require locality, it is recommended to major compact the moved regions.
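As a sketch, regions can be moved one at a time from the HBase shell and the table then major compacted to restore locality; the encoded region name, server name, and table name below are placeholders.
[source,bourne]
----
# move a single region to the new RegionServer, then major compact the table
$ echo "move '7cef47f192318ba7ccc75b1bbf27a82b', 'new-rs.example.com,16020,1596613123456'
major_compact 'my_table'" | ./bin/hbase shell
----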
[[hbase_metrics]]
== HBase Metrics
HBase emits metrics which adhere to the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html[Hadoop Metrics] API.
Starting with HBase 0.95footnote:[The Metrics system was redone in
HBase 0.96. See Migration
to the New Metrics Hotness – Metrics2 by Elliot Clark for detail], HBase is configured to emit a default set of metrics with a default sampling period of every 10 seconds.
You can use HBase metrics in conjunction with Ganglia.
You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.
=== Metric Setup
For HBase 0.95 and newer, HBase ships with a default metrics configuration, or [firstterm]_sink_.
This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
Restart the region server for the changes to take effect.
To change the sampling rate for the default sink, edit the line beginning with `*.period`.
To filter which metrics are emitted or to extend the metrics framework, see https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
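For example, to emit metrics every 30 seconds instead of every 10, the relevant line in _conf/hadoop-metrics2-hbase.properties_ might look like the following illustrative excerpt.
----
*.period=30
----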
.HBase Metrics and Ganglia
[NOTE]
====
By default, HBase emits a large number of metrics per region server.
Ganglia may have difficulty processing all these metrics.
Consider increasing the capacity of the Ganglia server or reducing the number of metrics emitted by HBase.
See link:https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering[Metrics Filtering].
====
=== Disabling Metrics
To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
Restart the region server for the changes to take effect.
[[discovering.available.metrics]]
=== Discovering Available Metrics
Rather than listing each metric which HBase emits by default, you can browse through the available metrics, either as a JSON output or via JMX.
Different metrics are exposed for the Master process and each region server process.
.Procedure: Access a JSON Output of Available Metrics
. After starting HBase, access the region server's web UI, at pass:[http://REGIONSERVER_HOSTNAME:60030] by default (or port 16030 in HBase 1.0+).
. Click the [label]#Metrics Dump# link near the top.
The metrics for the region server are presented as a dump of the JMX bean in JSON format.
This will dump out all metrics names and their values.
To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60030/jmx?description=true].
Not all beans and attributes have descriptions.
. To view metrics for the Master, connect to the Master's web UI instead (defaults to pass:[http://localhost:60010] or port 16010 in HBase 1.0+) and click its [label]#Metrics
Dump# link.
To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60010/jmx?description=true].
Not all beans and attributes have descriptions.
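The same JMX dump can also be fetched from the command line, which is convenient for scripting; the hostname is a placeholder and port 16030 is the default RegionServer UI port in HBase 1.0+.
[source,bourne]
----
$ curl -s "http://rs-node-1.example.com:16030/jmx?description=true" | less
----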
You can use many different tools to view JMX content by browsing MBeans.
This procedure uses `jvisualvm`, which is an application usually available in the JDK.
.Procedure: Browse the JMX Output of Available Metrics
. Start HBase, if it is not already running.
Run the `jvisualvm` command on a host with a GUI display.
You can launch it from the command line or another method appropriate for your operating system.
. Be sure the [label]#VisualVM-MBeans# plugin is installed. Browse to *Tools -> Plugins*. Click [label]#Installed# and check whether the plugin is listed.
If not, click [label]#Available Plugins#, select it, and click btn:[Install].
When finished, click btn:[Close].
. To view details for a given HBase process, double-click the process in the [label]#Local# sub-tree in the left-hand panel.
A detailed view opens in the right-hand panel.
Click the [label]#MBeans# tab which appears as a tab in the top of the right-hand panel.
. To access the HBase metrics, navigate to the appropriate sub-bean:
.* Master: `Hadoop -> HBase -> Master`
.* RegionServer: `Hadoop -> HBase -> RegionServer`
. The name of each metric and its current value is displayed in the [label]#Attributes# tab.
For a view which includes more details, including the description of each attribute, click the [label]#Metadata# tab.
=== Units of Measure for Metrics
Different metrics are expressed in different units, as appropriate.
Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
When in doubt, you may need to examine the source for a given metric.
* Metrics that refer to a point in time are usually expressed as a timestamp.
* Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
* Metrics that refer to memory sizes are in bytes.
* Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
Determine the size by multiplying by the block size (default is 64 MB in HDFS).
* Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
[[master_metrics]]
=== Most Important Master Metrics
Note: Counts are usually over the last metrics reporting interval.
hbase.master.numRegionServers::
Number of live regionservers
hbase.master.numDeadRegionServers::
Number of dead regionservers
hbase.master.ritCount ::
The number of regions in transition
hbase.master.ritCountOverThreshold::
The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
hbase.master.ritOldestAge::
The age of the longest region in transition, in milliseconds
[[rs_metrics]]
=== Most Important RegionServer Metrics
Note: Counts are usually over the last metrics reporting interval.
hbase.regionserver.regionCount::
The number of regions hosted by the regionserver
hbase.regionserver.storeFileCount::
The number of store files on disk currently managed by the regionserver
hbase.regionserver.storeFileSize::
Aggregate size of the store files on disk
hbase.regionserver.hlogFileCount::
The number of write ahead logs not yet archived
hbase.regionserver.totalRequestCount::
The total number of requests received
hbase.regionserver.readRequestCount::
The number of read requests received
hbase.regionserver.writeRequestCount::
The number of write requests received
hbase.regionserver.numOpenConnections::
The number of open connections at the RPC layer
hbase.regionserver.numActiveHandler::
The number of RPC handlers actively servicing requests
hbase.regionserver.numCallsInGeneralQueue::
The number of currently enqueued user requests
hbase.regionserver.numCallsInReplicationQueue::
The number of currently enqueued operations received from replication
hbase.regionserver.numCallsInPriorityQueue::
The number of currently enqueued priority (internal housekeeping) requests
hbase.regionserver.flushQueueLength::
Current depth of the memstore flush queue.
If increasing, we are falling behind with clearing memstores out to HDFS.
hbase.regionserver.updatesBlockedTime::
Number of milliseconds updates have been blocked so the memstore can be flushed
hbase.regionserver.compactionQueueLength::
Current depth of the compaction request queue.
If increasing, we are falling behind with storefile compaction.
hbase.regionserver.blockCacheHitCount::
The number of block cache hits
hbase.regionserver.blockCacheMissCount::
The number of block cache misses
hbase.regionserver.blockCacheExpressHitPercent ::
The percent of the time that requests with the cache turned on hit the cache
hbase.regionserver.percentFilesLocal::
Percent of store file data that can be read from the local DataNode, 0-100
hbase.regionserver.<op>_<measure>::
Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
hbase.regionserver.slow<op>Count ::
The number of operations we thought were slow, where <op> is one of the list above
hbase.regionserver.GcTimeMillis::
Time spent in garbage collection, in milliseconds
hbase.regionserver.GcTimeMillisParNew::
Time spent in garbage collection of the young generation, in milliseconds
hbase.regionserver.GcTimeMillisConcurrentMarkSweep::
Time spent in garbage collection of the old generation, in milliseconds
hbase.regionserver.authenticationSuccesses::
Number of client connections where authentication succeeded
hbase.regionserver.authenticationFailures::
Number of client connection authentication failures
hbase.regionserver.mutationsWithoutWALCount ::
Count of writes submitted with a flag indicating they should bypass the write ahead log
[[rs_meta_metrics]]
=== Meta Table Load Metrics
The HBase meta table metrics collection feature is available in HBase 1.4+, but it is disabled by default, as it can
affect the performance of the cluster. When enabled, it helps monitor client access patterns by collecting
the following statistics:
* number of get, put and delete operations on the `hbase:meta` table
* number of get, put and delete operations made by the top-N clients
* number of operations related to each table
* number of operations related to the top-N regions
When to use the feature::
This feature can help to identify hot spots in the meta table by showing the regions or tables where the meta info is
modified (e.g. by creating, dropping, splitting or moving tables) or retrieved most frequently. It can also help to find misbehaving
client applications by showing which clients are using the meta table most heavily, which can, for example, suggest a
lack of meta table caching or a failure to re-use open connections in the client application.
.Possible side-effects of enabling this feature
[WARNING]
====
Having a large number of clients and regions in the cluster causes the registration and tracking of a large number of
metrics, which can increase the memory and CPU footprint of the HBase region server handling the `hbase:meta` table.
It can also significantly increase the size of the JMX dump, which can affect the monitoring or log aggregation
system you use alongside HBase. It is recommended to turn on this feature only while debugging.
====
Where to find the metrics in JMX::
Each metric attribute name will start with the `MetaTable_` prefix. For all the metrics you will see five different
JMX attributes: count, mean rate, 1 minute rate, 5 minute rate and 15 minute rate. You will find these metrics in JMX
under the following MBean:
`Hadoop -> HBase -> RegionServer -> Coprocessor.Region.CP_org.apache.hadoop.hbase.coprocessor.MetaTableMetrics`.
.Examples: some Meta Table metrics you can see in your JMX dump
[source,json]
----
{
"MetaTable_get_request_count": 77309,
"MetaTable_put_request_mean_rate": 0.06339092997186495,
"MetaTable_table_MyTestTable_request_15min_rate": 1.1020599841623246,
"MetaTable_client_/172.30.65.42_lossy_request_count": 1786
"MetaTable_client_/172.30.65.45_put_request_5min_rate": 0.6189810954855728,
"MetaTable_region_1561131112259.c66e4308d492936179352c80432ccfe0._lossy_request_count": 38342,
"MetaTable_region_1561131043640.5bdffe4b9e7e334172065c853cf0caa6._lossy_request_1min_rate": 0.04925099917433935,
}
----
Configuration::
To turn on this feature, you have to enable a custom coprocessor by adding the following section to _hbase-site.xml_.
This coprocessor will run on all the HBase RegionServers, but will be active (i.e. consume memory / CPU) only on
the server where the `hbase:meta` table is located. It will produce JMX metrics which can be downloaded from the
web UI of the given RegionServer or by a simple REST call. These metrics will not be present in the JMX dump of the
other RegionServers.
.Enabling the Meta Table Metrics feature
[source,xml]
----
<property>
<name>hbase.coprocessor.region.classes</name>
<value>org.apache.hadoop.hbase.coprocessor.MetaTableMetrics</value>
</property>
----
.How are the top-N metrics calculated?
[NOTE]
====
The 'top-N' type of metrics will be counted using the Lossy Counting Algorithm (as defined in
link:http://www.vldb.org/conf/2002/S10P03.pdf[Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams"]),
which is designed to identify elements in a data stream whose frequency count exceeds a user-given threshold.
The frequency computed by this algorithm is not always accurate but has an error threshold that can be specified by the
user as a configuration parameter. The run time space required by the algorithm is inversely proportional to the
specified error threshold, hence the larger the error parameter, the smaller the footprint and the less accurate the
metrics.
You can specify the error rate of the algorithm as a floating-point value between 0 and 1 (exclusive); its default
value is 0.02. With the error rate set to `E` and `N` being the total number of meta table operations, then
(assuming a uniform distribution of the activity of low frequency elements) at most `7 / E` meters will be kept and
each kept element will have a frequency higher than `E * N`.
An example: let's assume we are interested in the HBase clients that are most active in accessing the meta table.
If there have been 1,000,000 operations on the meta table so far and the error rate parameter is set to 0.02, then we can
expect at most 350 client IP address related counters to be present in JMX, and each of these clients will have
accessed the meta table at least 20,000 times.
[source,xml]
----
<property>
<name>hbase.util.default.lossycounting.errorrate</name>
<value>0.02</value>
</property>
----
====
[[ops.monitoring]]
== HBase Monitoring
[[ops.monitoring.overview]]
=== Overview
The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like link:http://opentsdb.net/[OpenTSDB].
If your cluster is having performance issues it's likely that you'll see something unusual with this group.
HBase::
* See <<rs_metrics,rs metrics>>
OS::
* IO Wait
* User CPU
Java::
* GC
For more information on HBase metrics, see <<hbase_metrics,hbase metrics>>.
[[ops.slow.query]]
=== Slow Query Log
The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output.
The thresholds for "too long to run" and "too much output" are configurable, as described below.
The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events.
It is also prepended with identifying tags `(responseTooSlow)`, `(responseTooLarge)`, `(operationTooSlow)`, and `(operationTooLarge)` in order to enable easy filtering with grep, in case the user desires to see only slow queries.
==== Configuration
There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
* `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
Defaults to 10000, or 10 seconds.
Can be set to -1 to disable logging by time.
* `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
Defaults to 100 megabytes.
Can be set to -1 to disable logging by size.
==== Metrics
The slow query log exposes two metrics to JMX.
* `hadoop.regionserver_rpc_slowResponse` a global metric reflecting the durations of all responses that triggered logging.
* `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
==== Output
The output is tagged with operation e.g. `(operationTooSlow)` if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for.
If not, it is tagged `(responseTooSlow)` and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. `TooLarge` is substituted for `TooSlow` if the response size triggered the logging, with `TooLarge` appearing even in the case that both size and duration triggered logging.
==== Example
[source]
----
2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
----
Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port.
Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations.
In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
This particular example would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
[[slow_log_responses]]
==== Get Slow Response Log from shell
When an individual RPC exceeds a configurable time bound, we log a complaint
by way of the logging subsystem, e.g.:
----
2019-10-02 10:10:22,195 WARN [,queue=15,port=60020] ipc.RpcServer - (responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: \"t1,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}
----
Unfortunately, the request parameters are often truncated, as in the example above.
The truncation is unfortunate because it eliminates much of the utility of
the warnings. For example, the region name, the start and end keys, and the
filter hierarchy are all important clues for debugging performance problems
caused by moderate to low selectivity queries or queries made at a high rate.
HBASE-22978 introduces maintaining an in-memory ring buffer of requests that were judged to
be too slow in addition to the responseTooSlow logging. The in-memory representation can be
complete. There is some chance a high rate of requests will cause information on other
interesting requests to be overwritten before it can be read. This is an acceptable trade off.
In order to enable the in-memory ring buffer at RegionServers, we need to enable
this config:
----
hbase.regionserver.slowlog.buffer.enabled
----
One more config determines the size of the ring buffer:
----
hbase.regionserver.slowlog.ringbuffer.size
----
Check the configuration section for a detailed description.
This config is disabled by default. Turn it on and these shell commands will provide the expected results from the ring buffers.
Shell commands to retrieve slowlog responses from RegionServers:
----
Retrieve latest SlowLog Responses maintained by each or specific RegionServers.
Specify '*' to include all RS otherwise array of server names for specific
RS. A server name is the host, port plus startcode of a RegionServer.
e.g.: host187.example.com,60020,1289493121758 (find servername in
master ui or when you do detailed status in shell)
Provide optional filter parameters as Hash.
Default Limit of each server for providing no of slow log records is 10. User can specify
more limit by 'LIMIT' param in case more than 10 records should be retrieved.
Examples:
hbase> get_slowlog_responses '*' => get slowlog responses from all RS
hbase> get_slowlog_responses '*', {'LIMIT' => 50} => get slowlog responses from all RS
with 50 records limit (default limit: 10)
hbase> get_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get slowlog responses from SERVER_NAME1,
SERVER_NAME2
hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
=> get slowlog responses only related to meta
region
hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1'} => get slowlog responses only related to t1 table
hbase> get_slowlog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
=> get slowlog responses with given client
IP address and get 100 records limit
(default limit: 10)
hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
=> get slowlog responses with given region name
or table name
hbase> get_slowlog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
=> get slowlog responses that match either
provided client IP address or user name
----
All of the above queries with filters have the OR operator applied by default, i.e. all
records that match any of the provided filters will be returned. However,
we can also apply the AND operator, i.e. only records that match all (not any) of
the provided filters will be returned.
----
hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
=> get slowlog responses with given region name
and table name, both should match
hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
=> get slowlog responses with given region name
or table name, any one can match
hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
=> get slowlog responses with given table name
and client IP address, both should match
----
Since OR is the default filter operator, a query without 'FILTER_BY_OP' returns the
same result as one with 'FILTER_BY_OP' => 'OR'.
The output can be long, pretty-printed JSON that is awkward to scroll through on a
single screen, so you might prefer to
redirect the output of get_slowlog_responses to a file.
Example:
----
echo "get_slowlog_responses '*'" | hbase shell > xyz.out 2>&1
----
Similar to slow RPC logs, clients can also retrieve large RPC logs.
Sometimes, the logs that are important for debugging performance issues turn out to be
large in size.
----
hbase> get_largelog_responses '*' => get largelog responses from all RS
hbase> get_largelog_responses '*', {'LIMIT' => 50} => get largelog responses from all RS
with 50 records limit (default limit: 10)
hbase> get_largelog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get largelog responses from SERVER_NAME1,
SERVER_NAME2
hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
=> get largelog responses only related to meta
region
hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1'} => get largelog responses only related to t1 table
hbase> get_largelog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
=> get largelog responses with given client
IP address and get 100 records limit
(default limit: 10)
hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
=> get largelog responses with given region name
or table name
hbase> get_largelog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
=> get largelog responses that match either
provided client IP address or user name
hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
=> get largelog responses with given region name
and table name, both should match
hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
=> get largelog responses with given region name
or table name, any one can match
hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
=> get largelog responses with given table name
and client IP address, both should match
----
Shell command to clear slow/largelog responses from RegionServers:
----
Clears SlowLog Responses maintained by each or specific RegionServers.
Specify array of server names for specific RS. A server name is
the host, port plus startcode of a RegionServer.
e.g.: host187.example.com,60020,1289493121758 (find servername in
master ui or when you do detailed status in shell)
Examples:
hbase> clear_slowlog_responses => clears slowlog responses from all RS
hbase> clear_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => clears slowlog responses from SERVER_NAME1,
SERVER_NAME2
----
include::slow_log_responses_from_systable.adoc[]
=== Block Cache Monitoring
Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.
To view the block cache reports, see the Block Cache section of the region server UI.
Following are a few examples of the reporting capabilities.
.Basic Info shows the cache implementation.
image::bc_basic.png[]
.Config shows all cache configuration options.
image::bc_config.png[]
.Stats shows statistics about the performance of the cache.
image::bc_stats.png[]
.L1 and L2 show information about the L1 and L2 caches.
image::bc_l1.png[]
This is not an exhaustive list of all the screens and reports available.
Have a look in the Web UI.
=== Snapshot Space Usage Monitoring
Starting with HBase 0.95, Snapshot usage information on individual snapshots was shown in the HBase Master Web UI. This was further enhanced starting with HBase 1.3 to show the total Storefile size of the Snapshot Set. The following metrics are shown in the Master Web UI with HBase 1.3 and later.
* Shared Storefile Size is the Storefile size shared between snapshots and active tables.
* Mob Storefile Size is the Mob Storefile size shared between snapshots and active tables.
* Archived Storefile Size is the Storefile size in Archive.
The format of Archived Storefile Size is NNN(MMM). NNN is the total Storefile size in Archive, MMM is the total Storefile size in Archive that is specific to the snapshot (not shared with other snapshots and tables).
.Master Snapshot Overview
image::master-snapshot.png[]
.Snapshot Storefile Stats Example 1
image::1-snapshot.png[]
.Snapshot Storefile Stats Example 2
image::2-snapshots.png[]
.Empty Snapshot Storefile Stats Example
image::empty-snapshots.png[]
== Cluster Replication
NOTE: This information was previously available at
link:https://hbase.apache.org/0.94/replication.html[Cluster Replication].
HBase provides a cluster replication mechanism which allows you to keep one cluster's state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes.
Some use cases for cluster replication include:
* Backup and disaster recovery
* Data aggregation
* Geographic data distribution
* Online data ingestion combined with offline data analytics
NOTE: Replication is enabled at the granularity of the column family.
Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.
NOTE: Replication is asynchronous, as we ship the WAL to another cluster in the background, which means that when you rely on replication for recovery, you could lose some data. To address this problem, we have introduced a new feature called synchronous replication. As the mechanism is a bit different, we use a separate section to describe it. Please see
<<Synchronous Replication,Synchronous Replication>>.
=== Replication Overview
Cluster replication uses a source-push methodology.
An HBase cluster can be a source (also called master or active, meaning that it is the originator of new data), a destination (also called slave or passive, meaning that it receives data via replication), or can fulfill both roles at once.
Replication is asynchronous, and the goal of replication is eventual consistency.
When the source receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region.
When data is replicated from one cluster to another, the original source of the data is tracked via a cluster ID which is part of the metadata.
In HBase 0.96 and newer (link:https://issues.apache.org/jira/browse/HBASE-7709[HBASE-7709]), all clusters which have already consumed the data are also tracked.
This prevents replication loops.
The WALs for each region server must be kept in HDFS as long as they are needed to replicate data to any slave cluster.
Each region server reads from the oldest log it needs to replicate and keeps track of its progress processing WALs inside ZooKeeper to simplify failure recovery.
The position marker which indicates a slave cluster's progress, as well as the queue of WALs to process, may be different for every slave cluster.
The clusters participating in replication can be of different sizes.
The master cluster relies on randomization to attempt to balance the stream of replication on the slave clusters.
It is expected that the slave cluster has storage capacity to hold the replicated data, as well as any data it is responsible for ingesting.
If a slave cluster does run out of room, or is inaccessible for other reasons, it throws an error and the master retains the WAL and retries the replication at intervals.
.Consistency Across Replicated Clusters
[WARNING]
====
How your application builds on top of the HBase API matters when replication is in play. HBase's replication system provides at-least-once delivery of client edits for an enabled column family to each configured destination cluster. In the event of failure to reach a given destination, the replication system will retry sending edits in a way that might repeat a given message. HBase provides two ways of replication: one is the original replication, and the other is serial replication. In the original way of replication, there is no guaranteed order of delivery for client edits. In the event of a RegionServer failing, recovery of the replication queue happens independently of recovery of the individual regions that server was previously handling. This means that it is possible for the not-yet-replicated edits to be serviced by a RegionServer that is currently slower to replicate than the one that handles edits from after the failure.
The combination of these two properties (at-least-once delivery and the lack of message ordering) means that some destination clusters may end up in a different state if your application makes use of operations that are not idempotent, e.g. Increments.
To solve the problem, HBase now supports serial replication, which sends edits to destination cluster as the order of requests from client. See <<Serial Replication,Serial Replication>>.
====
.Terminology Changes
[NOTE]
====
Previously, terms such as [firstterm]_master-master_, [firstterm]_master-slave_, and [firstterm]_cyclical_ were used to describe replication relationships in HBase.
These terms added confusion, and have been abandoned in favor of discussions about cluster topologies appropriate for different scenarios.
====
.Cluster Topologies
* A central source cluster might propagate changes out to multiple destination clusters, for failover or due to geographic distribution.
* A source cluster might push changes to a destination cluster, which might also push its own changes back to the original cluster.
* Many different low-latency clusters might push changes to one centralized cluster for backup or resource-intensive data analytics jobs.
The processed data might then be replicated back to the low-latency clusters.
Multiple levels of replication may be chained together to suit your organization's needs.
The following diagram shows a hypothetical scenario.
Use the arrows to follow the data paths.
.Example of a Complex Cluster Replication Configuration
image::hbase_replication_diagram.jpg[]
HBase replication borrows many concepts from the [firstterm]_statement-based replication_ design used by MySQL.
Instead of SQL statements, entire WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the clients) are replicated in order to maintain atomicity.
[[hbase.replication.management]]
=== Managing and Configuring Cluster Replication
.Cluster Configuration Overview
. Configure and start the source and destination clusters.
Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
. All hosts in the source and destination clusters must be able to reach each other.
. If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write in the same folder.
. On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
. On the source cluster, in HBase Shell, enable the table replication, using the `enable_table_replication` command.
. Check the logs to see if replication is taking place. If so, you will see messages like the following, coming from the ReplicationSource.
----
LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
----
.Serial Replication Configuration
See <<Serial Replication,Serial Replication>>
.Cluster Management Commands
add_peer <ID> <CLUSTER_KEY>::
Adds a replication relationship between two clusters. +
* ID -- a unique string, which must not contain a hyphen.
* CLUSTER_KEY -- composed using the following template, with appropriate place-holders: `hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent`. This value can be found on the Master UI info page.
* STATE (optional) -- ENABLED or DISABLED, default value is ENABLED
list_peers:: list all replication relationships known by this cluster
enable_peer <ID>::
Enable a previously-disabled replication relationship
disable_peer <ID>::
Disable a replication relationship. HBase will no longer send edits to that
peer cluster, but it still keeps track of all the new WALs that it will need
to replicate if and when it is re-enabled. WALs are retained when enabling or disabling
replication as long as peers exist.
remove_peer <ID>::
Disable and remove a replication relationship. HBase will no longer send edits to that peer cluster or keep track of WALs.
enable_table_replication <TABLE_NAME>::
Enable the table replication switch for all its column families. If the table is not found in the destination cluster then it will create one with the same name and column families.
disable_table_replication <TABLE_NAME>::
Disable the table replication switch for all its column families.
=== Serial Replication
Note: this feature is introduced in HBase 2.1
.Function of serial replication
Serial replication pushes logs to the destination cluster in the same order in which they arrive at the source cluster.
.Why is serial replication needed?
In HBase replication, we push mutations to the destination cluster by reading the WAL in each region server. We keep a queue of WAL files, so we can read them in order of creation time. However, when a region move or RS failure occurs in the source cluster, the WAL entries that were not pushed before the region move or RS failure will be pushed by the original RS (for a region move) or by another RS which takes over the remaining WAL of the dead RS (for an RS failure), while the new entries for the same region(s) will be pushed by the RS which now serves the region(s). The two push the WAL entries of the same region concurrently, without coordination.
This treatment can possibly lead to data inconsistency between source and destination clusters:
1. A put and then a delete are written to the source cluster.
2. Due to a region move / RS failure, they are pushed to the peer cluster by different replication-source threads.
3. If the delete is pushed to the peer cluster before the put, and a flush and major compaction occur in the peer cluster before the put is pushed, the delete is collected and the put remains in the peer cluster, while in the source cluster the put is masked by the delete; hence there is data inconsistency between the source and destination clusters.
.Serial replication configuration
Set the serial flag to true for a replication peer. The default value of the serial flag is false.
* Add a new replication peer whose serial flag is true
[source,ruby]
----
hbase> add_peer '1', CLUSTER_KEY => "server1.cie.com:2181:/hbase", SERIAL => true
----
* Set a replication peer's serial flag to false
[source,ruby]
----
hbase> set_peer_serial '1', false
----
* Set a replication peer's serial flag to true
[source,ruby]
----
hbase> set_peer_serial '1', true
----
The serial replication feature had been done firstly in link:https://issues.apache.org/jira/browse/HBASE-9465[HBASE-9465] and then reverted and redone in link:https://issues.apache.org/jira/browse/HBASE-20046[HBASE-20046]. You can find more details in these issues.
=== Verifying Replicated Data
The `VerifyReplication` MapReduce job, which is included in HBase, performs a systematic comparison of replicated data between two different clusters. Run the VerifyReplication job on the master cluster, supplying it with the peer ID and table name to use for validation. You can limit the verification further by specifying a time range or specific families. The job's short name is `verifyrep`. To run the job, use a command like the following:
[source,bash]
----
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` "${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-mapreduce-VERSION.jar" verifyrep --starttime=<timestamp> --endtime=<timestamp> --families=<myFam> <ID> <tableName>
----
The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
=== Detailed Information About Cluster Replication
.Replication Architecture Overview
image::replication_overview.png[]
==== Life of a WAL Edit
A single WAL edit goes through several steps in order to be replicated to a slave cluster.
. An HBase client uses a Put or Delete operation to manipulate data in HBase.
. The region server writes the request to the WAL in a way that allows it to be replayed if it is not written successfully.
. If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
. In a separate thread, the edit is read from the log, as part of a batch process.
Only the KeyValues that are eligible for replication are kept.
Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
. The edit is tagged with the master's UUID and added to a buffer.
When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
. The region server reads the edits sequentially and separates them into buffers, one buffer per table.
After all edits are read, each buffer is flushed using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client.
The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits when they are applied, in order to prevent replication loops.
. In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper.
. When the slave cluster is unavailable, the first three steps, where the edit is inserted, are identical.
. Again in a separate thread, the region server reads, filters, and ships the log edits in the same way as above.
However, the slave region server does not answer the RPC call.
. The master sleeps and tries again a configurable number of times.
. If the slave region server is still not available, the master selects a new subset of region servers to replicate to, and tries again to send the buffer of edits.
. Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper.
Logs that are [firstterm]_archived_ by their region server, by moving them from the region server's log directory to a central log directory, will update their paths in the in-memory queue of the replicating thread.
. When the slave cluster is finally available, the buffer is applied in the same way as during normal processing.
The master region server will then replicate the backlog of logs that accumulated during the outage.
.Spreading Queue Failover Load
When replication is active, a subset of region servers in the source cluster is responsible for shipping edits to the sink.
This responsibility must be failed over like all other region server functions should a process or node crash.
The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
* Set `replication.source.maxretriesmultiplier` to `300`.
* Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
* Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
[[cluster.replication.preserving.tags]]
.Preserving Tags During Replication
By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
To prevent the tags from being stripped, you can use a different codec which does not strip them.
Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
==== Replication Internals
Replication State in ZooKeeper::
HBase replication maintains its state in ZooKeeper.
By default, the state is contained in the base node _/hbase/replication_.
This node contains two child nodes, the `Peers` znode and the `RS` znode.
The `Peers` Znode::
The `peers` znode is stored in _/hbase/replication/peers_ by default.
It consists of a list of all peer replication clusters, along with the status of each of them.
The value of each peer is its cluster key, which is provided in the HBase Shell.
The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
The `RS` Znode::
The `rs` znode contains a list of WAL logs which need to be replicated.
This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
The rs znode has one child znode for each region server in the cluster.
The child znode name is the region server's hostname, client port, and start code.
This list includes both live and dead region servers.
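For a quick look at this state, the znodes can be listed with the bundled ZooKeeper CLI; the paths below assume the default `zookeeper.znode.parent` of `/hbase`.
[source,bourne]
----
$ ./bin/hbase zkcli ls /hbase/replication/peers
$ ./bin/hbase zkcli ls /hbase/replication/rs
----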
==== Choosing Region Servers to Replicate To
When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for edits that this master cluster region server sends.
Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
This watch is used to monitor changes in the composition of the slave cluster.
When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
==== Keeping Track of Logs
Each master cluster region server has its own znode in the replication znodes hierarchy.
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contain a queue of WALs to process.
Each of these queues will track the WALs created by that region server, but they can differ in size.
For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
See <<rs.failover.details,rs.failover.details>> for an example.
When a source is instantiated, it contains the current WAL that the region server is writing to.
During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operation is now more expensive.
The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
When a log is archived, the source threads are notified that the path for that log changed.
If a particular source has already finished with an archived log, it will just ignore the message.
If the log is in the queue, the path will be updated in memory.
If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when it has already been moved.
Because moving a file is a NameNode operation, if the reader is currently reading the log, it won't generate any exception.
==== Reading, Filtering and Sending Edits
By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
Speed is limited by the filtering of log entries: only KeyValues that are scoped GLOBAL and that do not belong to catalog tables will be retained.
Speed is also limited by total size of the list of edits to replicate per slave, which is limited to 64 MB by default.
With this configuration, a master cluster region server with three slaves would use at most 192 MB to store data to replicate.
This does not account for the data which was filtered but not garbage collected.
Once the maximum size of edits has been buffered or the reader reaches the end of the WAL, the source thread stops reading and chooses at random a sink to replicate to (from the list that was generated by keeping only a subset of slave region servers). It directly issues an RPC to the chosen region server and waits for the method to return.
If the RPC was successful, the source determines whether the current file has been emptied or it contains more data which needs to be read.
If the file has been emptied, the source deletes the znode in the queue.
Otherwise, it registers the new offset in the log's znode.
If the RPC threw an exception, the source will retry 10 times before trying to find a different sink.
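The 64 MB per-peer buffer mentioned above is controlled by a source-side property; a hedged _hbase-site.xml_ sketch, assuming the `replication.source.size.capacity` property (value in bytes):

[source,xml]
----
<!-- Assumed property name: maximum size of the buffered list of edits per peer, in bytes (64 MB) -->
<property>
  <name>replication.source.size.capacity</name>
  <value>67108864</value>
</property>
----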
==== Cleaning Logs
If replication is not enabled, the master's log-cleaning thread deletes old logs using a configured TTL.
This TTL-based method does not work well with replication, because archived logs which have exceeded their TTL may still be in a queue.
The default behavior is augmented so that if a log is past its TTL, the cleaning thread looks up every queue until it finds the log, while caching queues it has found.
If the log is not found in any queues, the log will be deleted.
The next time the cleaning process needs to look for a log, it starts by using its cached list.
NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.
[[rs.failover.details]]
==== Region Server Failover
When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
After queues are all transferred, they are deleted from the old location.
The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
The main difference is that those queues will never receive new data, since they do not belong to their new region server.
When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
The region servers' znodes all contain a `peers` znode which contains a single queue.
The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
----
/hbase/replication/rs/
1.1.1.1,60020,123456780/
2/
1.1.1.1,60020.1234 (Contains a position)
1.1.1.1,60020.1265
1.1.1.2,60020,123456790/
2/
1.1.1.2,60020.1214 (Contains a position)
1.1.1.2,60020.1248
1.1.1.2,60020.1312
1.1.1.3,60020,123456630/
2/
1.1.1.3,60020.1280 (Contains a position)
----
Assume that 1.1.1.2 loses its ZooKeeper session.
The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
----
/hbase/replication/rs/
1.1.1.1,60020,123456780/
2/
1.1.1.1,60020.1234 (Contains a position)
1.1.1.1,60020.1265
1.1.1.2,60020,123456790/
lock
2/
1.1.1.2,60020.1214 (Contains a position)
1.1.1.2,60020.1248
1.1.1.2,60020.1312
1.1.1.3,60020,123456630/
2/
1.1.1.3,60020.1280 (Contains a position)
2-1.1.1.2,60020,123456790/
1.1.1.2,60020.1214 (Contains a position)
1.1.1.2,60020.1248
1.1.1.2,60020.1312
----
Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
Some new logs were also created in the normal queues.
The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
The new layout will be:
----
/hbase/replication/rs/
1.1.1.1,60020,123456780/
2/
1.1.1.1,60020.1378 (Contains a position)
2-1.1.1.3,60020,123456630/
1.1.1.3,60020.1325 (Contains a position)
1.1.1.3,60020.1401
2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
1.1.1.2,60020.1312 (Contains a position)
1.1.1.3,60020,123456630/
lock
2/
1.1.1.3,60020.1325 (Contains a position)
1.1.1.3,60020.1401
2-1.1.1.2,60020,123456790/
1.1.1.2,60020.1312 (Contains a position)
----
=== Replication Metrics
The following metrics are exposed at the global region server level and at the peer level:
`source.sizeOfLogQueue`::
number of WALs to process (excludes the one which is being processed) at the Replication source
`source.shippedOps`::
number of mutations shipped
`source.logEditsRead`::
number of mutations read from WALs at the replication source
`source.ageOfLastShippedOp`::
age of last batch that was shipped by the replication source
`source.completedLogs`::
The number of write-ahead-log files that have completed their acknowledged sending to the peer associated with this source. Increments to this metric are a part of normal operation of HBase replication.
`source.completedRecoverQueues`::
The number of recovery queues this source has completed sending to the associated peer. Increments to this metric are a part of normal recovery of HBase replication in the face of failed Region Servers.
`source.uncleanlyClosedLogs`::
The number of write-ahead-log files the replication system considered completed after reaching the end of readable entries in the face of an uncleanly closed file.
`source.ignoredUncleanlyClosedLogContentsInBytes`::
When a write-ahead-log file is not closed cleanly, there will likely be some entry that has been partially serialized. This metric contains the number of bytes of such entries the HBase replication system believes were remaining at the end of files skipped in the face of an uncleanly closed file. Those bytes should either be in a different file or represent a client write that was not acknowledged.
`source.restartedLogReading`::
The number of times the HBase replication system detected that it failed to correctly parse a cleanly closed write-ahead-log file. In this circumstance, the system replays the entire log from the beginning, ensuring that no edits fail to be acknowledged by the associated peer. Increments to this metric indicate that the HBase replication system is having difficulty correctly handling failures in the underlying distributed storage system. No data loss should occur, but you should check Region Server log files for details of the failures.
`source.repeatedLogFileBytes`::
When the HBase replication system determines that it needs to replay a given write-ahead-log file, this metric is incremented by the number of bytes the replication system believes had already been acknowledged by the associated peer prior to starting over.
`source.closedLogsWithUnknownFileLength`::
Incremented when the HBase replication system believes it is at the end of a write-ahead-log file but it cannot determine the length of that file in the underlying distributed storage system. Could indicate data loss since the replication system is unable to determine if the end of readable entries lines up with the expected end of the file. You should check Region Server log files for details of the failures.
=== Replication Configuration Options
[cols="1,1,1", options="header"]
|===
| Option
| Description
| Default
| zookeeper.znode.parent
| The name of the base ZooKeeper znode used for HBase
| /hbase
| zookeeper.znode.replication
| The name of the base znode used for replication
| replication
| zookeeper.znode.replication.peers
| The name of the peer znode
| peers
| zookeeper.znode.replication.peers.state
| The name of peer-state znode
| peer-state
| zookeeper.znode.replication.rs
| The name of the rs znode
| rs
| replication.sleep.before.failover
| How many milliseconds a worker should sleep before attempting to replicate
a dead region server's WAL queues.
|
| replication.executor.workers
| The number of region servers a given region server should attempt to
failover simultaneously.
| 1
|===
=== Monitoring Replication Status
You can use the HBase Shell command `status 'replication'` to monitor the replication status on your cluster. The command has three variations:
* `status 'replication'` -- prints the status of each source and its sinks, sorted by hostname.
* `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
* `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
==== Understanding the output
The command output will vary according to the state of replication. For example, right after a restart,
if the destination peer is not reachable, no replication source threads will be running,
so no metrics are displayed:
----
hbase01.home:
SOURCE: PeerID=1
Normal Queue: 1
No Reader/Shipper threads runnning yet.
SINK: TimeStampStarted=1591985197350, Waiting for OPs...
----
Under normal circumstances, a healthy, active-active replication deployment would
show the following:
----
hbase01.home:
SOURCE: PeerID=1
Normal Queue: 1
AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
----
The definition for each of these metrics is detailed below:
[cols="1,1,1", options="header"]
|===
| Type
| Metric Name
| Description
| Source
| AgeOfLastShippedOp
| How long the last successfully shipped edit took to be replicated on the target.
| Source
| TimeStampOfLastShippedOp
| The actual date of the last successful edit shipment.
| Source
| SizeOfLogQueue
| Number of WAL files in this given queue.
| Source
| EditsReadFromLogQueue
| How many edits have been read from this given queue since this source thread started.
| Source
| OpsShippedToTarget
| How many edits have been shipped to the target since this source thread started.
| Source
| TimeStampOfNextToReplicate
| Date of the edit currently being attempted for replication.
| Source
| Replication Lag
| The elapsed time (in millis) since the last edit to replicate was read by this source
thread and effectively replicated to the target.
| Sink
| TimeStampStarted
| Date (in millis) when this Sink thread started.
| Sink
| AgeOfLastAppliedOp
| How long it took to apply the last successfully shipped edit.
| Sink
| TimeStampsOfLastAppliedOp
| Date of the last successfully applied edit.
|===
Growing values for `Source.TimeStampsOfLastAppliedOp` and/or
`Source.Replication Lag` would indicate replication delays. If those numbers keep going
up, while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
`Source.OpsShippedToTarget` or `Source.TimeStampOfNextToReplicate` do not change at all,
then the replication flow is failing to progress, and there might be problems with
communication between the clusters. This could also happen if replication is manually paused
(via the hbase shell `disable_peer` command, for example), but data keeps getting ingested
into the source cluster tables.
== Running Multiple Workloads On a Single Cluster
HBase provides the following mechanisms for managing the performance of a cluster
handling multiple workloads:
. <<quota>>
. <<request_queues>>
. <<multiple-typed-queues>>
[[quota]]
=== Quotas
HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on
the following limits:
. <<request-quotas,The number or size of requests (read, write, or read+write) in a given timeframe>>
. <<namespace_quotas,The number of tables allowed in a namespace>>
These limits can be enforced for a specified user, table, or namespace.
.Enabling Quotas
Quotas are disabled by default. To enable the feature, set the `hbase.quota.enabled`
property to `true` in _hbase-site.xml_ file for all cluster nodes.
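For example:

[source,xml]
----
<!-- hbase-site.xml on every cluster node: enable RPC quotas -->
<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
----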
.General Quota Syntax
. THROTTLE_TYPE can be expressed as READ, WRITE, or the default type (read + write).
. Timeframes can be expressed in the following units: `sec`, `min`, `hour`, `day`
. Request sizes can be expressed in the following units: `B` (bytes), `K` (kilobytes),
`M` (megabytes), `G` (gigabytes), `T` (terabytes), `P` (petabytes)
. Numbers of requests are expressed as an integer followed by the string `req`
. Limits relating to time are expressed as req/time or size/time. For instance `10req/day`
or `100P/hour`.
. Numbers of tables or regions are expressed as integers.
[[request-quotas]]
.Setting Request Quotas
You can set quota rules ahead of time, or you can change the throttle at runtime. The change
will propagate after the quota refresh period has expired. This expiration period
defaults to 5 minutes. To change it, modify the `hbase.quota.refresh.period` property
in `hbase-site.xml`. This property is expressed in milliseconds and defaults to `300000`.
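For example, to propagate quota changes after one minute instead of five:

[source,xml]
----
<!-- hbase-site.xml: quota refresh period in milliseconds -->
<property>
  <name>hbase.quota.refresh.period</name>
  <value>60000</value>
</property>
----

The following shell examples show how to set, change, and list throttle quotas: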
----
# Limit user u1 to 10 requests per second
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
# Limit user u1 to 10 read requests per second
hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', LIMIT => '10req/sec'
# Limit user u1 to 10 M per day everywhere
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/day'
# Limit user u1 to 10 M write size per sec
hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => WRITE, USER => 'u1', LIMIT => '10M/sec'
# Limit user u1 to 5k per minute on table t2
hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
# Limit user u1 to 10 read requests per sec on table t2
hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', TABLE => 't2', LIMIT => '10req/sec'
# Remove an existing limit from user u1 on namespace ns2
hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
# Limit all users to 10 requests per hour on namespace ns1
hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/hour'
# Limit all users to 10 T per hour on table t1
hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10T/hour'
# Remove all existing limits from user u1
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE
# List all quotas for user u1 in namespace ns2
hbase> list_quotas USER => 'u1', NAMESPACE => 'ns2'
# List all quotas for namespace ns2
hbase> list_quotas NAMESPACE => 'ns2'
# List all quotas for table t1
hbase> list_quotas TABLE => 't1'
# list all quotas
hbase> list_quotas
----
You can also place a global limit and exclude a user or a table from the limit by applying the
`GLOBAL_BYPASS` property.
----
hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min' # a per-namespace request limit
hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true # user u1 is not affected by the limit
----
[[namespace_quotas]]
.Setting Namespace Quotas
You can specify the maximum number of tables or regions allowed in a given namespace, either
when you create the namespace or by altering an existing namespace, by setting the
`hbase.namespace.quota.maxtables` or `hbase.namespace.quota.maxregions` property on the namespace.
.Limiting Tables Per Namespace
----
# Create a namespace with a max of 5 tables
hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxtables'=>'5'}
# Alter an existing namespace to have a max of 8 tables
hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxtables'=>'8'}
# Show quota information for a namespace
hbase> describe_namespace 'ns2'
# Alter an existing namespace to remove a quota
hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=>'hbase.namespace.quota.maxtables'}
----
.Limiting Regions Per Namespace
----
# Create a namespace with a max of 10 regions
hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxregions'=>'10'}
# Show quota information for a namespace
hbase> describe_namespace 'ns1'
# Alter an existing namespace to have a max of 20 regions
hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxregions'=>'20'}
# Alter an existing namespace to remove a quota
hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=> 'hbase.namespace.quota.maxregions'}
----
[[request_queues]]
=== Request Queues
If no throttling policy is configured, when the RegionServer receives multiple requests,
they are now placed into a queue waiting for a free execution slot (HBASE-6721).
The simplest queue is a FIFO queue, where each request waits for all previous requests in the queue
to finish before running. Fast or interactive queries can get stuck behind large requests.
If you are able to guess how long a request will take, you can reorder requests by
pushing the long requests to the end of the queue and allowing short requests to preempt
them. Eventually, you must still execute the large requests and prioritize the new
requests behind them. The short requests will be newer, so the result is not terrible,
but still suboptimal compared to a mechanism which allows large requests to be split
into multiple smaller ones.
HBASE-10993 introduces such a system for deprioritizing long-running scanners. There
are two types of queues, `fifo` and `deadline`. To configure the type of queue used,
configure the `hbase.ipc.server.callqueue.type` property in `hbase-site.xml`. There
is no way to estimate how long each request may take, so de-prioritization only affects
scans, and is based on the number of “next” calls a scan request has made. An assumption
is made that when you are doing a full table scan, your job is not likely to be interactive,
so if there are concurrent requests, you can delay long-running scans up to a limit tunable by
setting the `hbase.ipc.server.queue.max.call.delay` property. The slope of the delay is calculated
by a simple square root of `(numNextCall * weight)` where the weight is
configurable by setting the `hbase.ipc.server.scan.vtime.weight` property.
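For example, a hedged _hbase-site.xml_ sketch selecting the `deadline` queue and capping the scan delay (the delay value is illustrative, not a recommended default):

[source,xml]
----
<!-- Use the deadline call queue and limit how long a long-running scan can be delayed (ms) -->
<property>
  <name>hbase.ipc.server.callqueue.type</name>
  <value>deadline</value>
</property>
<property>
  <name>hbase.ipc.server.queue.max.call.delay</name>
  <value>10000</value>
</property>
----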
[[multiple-typed-queues]]
=== Multiple-Typed Queues
You can also prioritize or deprioritize different kinds of requests by configuring
a specified number of dedicated handlers and queues. You can segregate the scan requests
in a single queue with a single handler, and all the other available queues can service
short `Get` requests.
You can adjust the IPC queues and handlers based on the type of workload, using static
tuning options. This approach is an interim first step that will eventually allow
you to change the settings at runtime, and to dynamically adjust values based on the load.
.Multiple Queues
To avoid contention and separate different kinds of requests, configure the
`hbase.ipc.server.callqueue.handler.factor` property, which allows you to increase the number of
queues and control how many handlers share the same queue.
Using more queues reduces contention when adding a task to a queue or selecting it
from a queue. You can even configure one queue per handler. The trade-off is that
if some queues contain long-running tasks, a handler may need to wait to execute from that queue
rather than stealing from another queue which has waiting tasks.
.Read and Write Queues
With multiple queues, you can now divide read and write requests, giving more priority
(more queues) to one or the other type. Use the `hbase.ipc.server.callqueue.read.ratio`
property to choose to serve more reads or more writes.
.Get and Scan Queues
Similar to the read/write split, you can split gets and scans by tuning the `hbase.ipc.server.callqueue.scan.ratio`
property to give more priority to gets or to scans. A scan ratio of `0.1` will give
more queue/handlers to the incoming gets, which means that more gets can be processed
at the same time and that fewer scans can be executed at the same time. A value of
`0.9` will give more queue/handlers to scans, so the number of scans executed will
increase and the number of gets will decrease.
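For example, a hedged _hbase-site.xml_ sketch that creates roughly one call queue per ten handlers, dedicates 60% of the queues to reads, and reserves 10% of those read queues for scans (the ratios are illustrative):

[source,xml]
----
<!-- Roughly one queue per 10 handlers -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.1</value>
</property>
<!-- 60% of the queues serve reads, the rest serve writes -->
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.6</value>
</property>
<!-- 10% of the read queues serve scans, the rest serve gets -->
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.1</value>
</property>
----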
[[space-quotas]]
=== Space Quotas
link:https://issues.apache.org/jira/browse/HBASE-16961[HBASE-16961] introduces a new type of
quotas for HBase to leverage: filesystem quotas. These "space" quotas limit the amount of space
on the filesystem that HBase namespaces and tables can consume. If a user, malicious or ignorant,
has the ability to write data into HBase, with enough time, that user can effectively crash HBase
(or worse, HDFS) by consuming all available space. When there is no filesystem space available,
HBase crashes because it can no longer create/sync data to the write-ahead log.
This feature allows a limit to be set on the size of a table or namespace. When a space quota is set
on a namespace, the quota's limit applies to the sum of usage of all tables in that namespace.
When a table with a quota exists in a namespace with a quota, the table quota takes priority
over the namespace quota. This allows for a scenario where a large limit can be placed on
a collection of tables, but a single table in that collection can have a fine-grained limit set.
The existing `set_quota` and `list_quota` HBase shell commands can be used to interact with
space quotas. Space quotas are quotas with a `TYPE` of `SPACE` and have `LIMIT` and `POLICY`
attributes. The `LIMIT` is a string that refers to the amount of space on the filesystem
that the quota subject (e.g. the table or namespace) may consume. For example, valid values
of `LIMIT` are `'10G'`, `'2T'`, or `'256M'`. The `POLICY` refers to the action that HBase will
take when the quota subject's usage exceeds the `LIMIT`. The following are valid `POLICY` values.
* `NO_INSERTS` - No new data may be written (e.g. `Put`, `Increment`, `Append`).
* `NO_WRITES` - Same as `NO_INSERTS` but `Deletes` are also disallowed.
* `NO_WRITES_COMPACTIONS` - Same as `NO_WRITES` but compactions are also disallowed.
** This policy only prevents user-submitted compactions. System can still run compactions.
* `DISABLE` - The table(s) are disabled, preventing all read/write access.
.Setting simple space quotas
----
# Sets a quota on the table 't1' with a limit of 1GB, disallowing Puts/Increments/Appends when the table exceeds 1GB
hbase> set_quota TYPE => SPACE, TABLE => 't1', LIMIT => '1G', POLICY => NO_INSERTS
# Sets a quota on the namespace 'ns1' with a limit of 50TB, disallowing Puts/Increments/Appends/Deletes
hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '50T', POLICY => NO_WRITES
# Sets a quota on the table 't3' with a limit of 2TB, disallowing any writes and compactions when the table exceeds 2TB.
hbase> set_quota TYPE => SPACE, TABLE => 't3', LIMIT => '2T', POLICY => NO_WRITES_COMPACTIONS
# Sets a quota on the table 't2' with a limit of 50GB, disabling the table when it exceeds 50GB
hbase> set_quota TYPE => SPACE, TABLE => 't2', LIMIT => '50G', POLICY => DISABLE
----
Consider the following scenario to set up quotas on a namespace, overriding the quota on tables in that namespace:
.Table and Namespace space quotas
----
hbase> create_namespace 'ns1'
hbase> create 'ns1:t1'
hbase> create 'ns1:t2'
hbase> create 'ns1:t3'
hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '100T', POLICY => NO_INSERTS
hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t2', LIMIT => '200G', POLICY => NO_WRITES
hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t3', LIMIT => '20T', POLICY => NO_WRITES
----
In the above scenario, the tables in the namespace `ns1` will not be allowed to consume more than
100TB of space on the filesystem among each other. The table 'ns1:t2' is only allowed to be 200GB in size, and will
disallow all writes when the usage exceeds this limit. The table 'ns1:t3' is allowed to grow to 20TB in size
and also will disallow all writes when the usage exceeds this limit. Because there is no table quota
on 'ns1:t1', this table can grow up to 100TB, but only if 'ns1:t2' and 'ns1:t3' have a usage of zero bytes.
Practically, its limit is 100TB less the current usage of 'ns1:t2' and 'ns1:t3'.
[[ops.space.quota.deletion]]
=== Disabling Automatic Space Quota Deletion
By default, if a table or namespace is deleted that has a space quota, the quota itself is
also deleted. In some cases, it may be desirable for the space quota to not be automatically deleted.
In these cases, the user may configure the system to not delete any space quota automatically via hbase-site.xml.
[source,xml]
----
<property>
<name>hbase.quota.remove.on.table.delete</name>
<value>false</value>
</property>
----
The value is set to `true` by default.
=== HBase Snapshots with Space Quotas
One common area of unintended-filesystem-use with HBase is via HBase snapshots. Because snapshots
exist outside of the management of HBase tables, it is not uncommon for administrators to suddenly
realize that hundreds of gigabytes or terabytes of space is being used by HBase snapshots which were
forgotten and never removed.
link:https://issues.apache.org/jira/browse/HBASE-17748[HBASE-17748] is the umbrella JIRA issue which
expands on the original space quota functionality to also include HBase snapshots. While this is a confusing
subject, the implementation attempts to present this support in as reasonable and simple of a manner as
possible for administrators. This feature does not make any changes to administrator interaction with
space quotas, only in the internal computation of table/namespace usage. Table and namespace usage will
automatically incorporate the size taken by a snapshot per the rules defined below.
As a review, let's cover a snapshot's lifecycle: a snapshot is metadata which points to
a list of HFiles on the filesystem. This is why creating a snapshot is a very cheap operation; no HBase
table data is actually copied to perform a snapshot. Cloning a snapshot into a new table or restoring
a table is a cheap operation for the same reason; the new table references the files which already exist
on the filesystem without a copy. To include snapshots in space quotas, we need to define which table
"owns" a file when a snapshot references the file ("owns" refers to encompassing the filesystem usage
of that file).
Consider a snapshot which was made against a table. When the snapshot refers to a file and the table no
longer refers to that file, the "originating" table "owns" that file. When multiple snapshots refer to
the same file and no table refers to that file, the snapshot with the lowest-sorting name (lexicographically)
is chosen and the table which that snapshot was created from "owns" that file. HFiles are not "double-counted"
when a table and one or more snapshots refer to that HFile.
When a table is "rematerialized" (via `clone_snapshot` or `restore_snapshot`), a similar problem of file
ownership arises. In this case, while the rematerialized table references a file which a snapshot also
references, the table does not "own" the file. The table from which the snapshot was created still "owns"
that file. When the rematerialized table is compacted or the snapshot is deleted, the rematerialized table
will uniquely refer to a new file and "own" the usage of that file. Similarly, when a table is duplicated via a snapshot
and `restore_snapshot`, the new table will not consume any quota size until the original table stops referring
to the files, either due to a compaction on the original table, a compaction on the new table, or the
original table being deleted.
One new HBase shell command was added to inspect the computed sizes of each snapshot in an HBase instance.
----
hbase> list_snapshot_sizes
SNAPSHOT SIZE
t1.s1 1159108
----
[[ops.backup]]
== HBase Backup
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster.
Each approach has pros and cons.
For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup
Options] over on the Sematext Blog.
[[ops.backup.fullshutdown]]
=== Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity and not serving front-end web-pages.
The benefits are that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata.
The obvious con is that the cluster is down.
The steps include:
[[ops.backup.fullshutdown.stop]]
==== Stop HBase
[[ops.backup.fullshutdown.distcp]]
==== Distcp
Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster.
Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files.
Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
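For example, with HBase fully stopped, a hedged sketch of copying the HBase root directory to a backup location (NameNode addresses and paths are placeholders):

[source,bourne]
----
# Cluster is down, so there are no in-flight edits to the files being copied
$ hadoop distcp hdfs://namenode:8020/hbase hdfs://backup-namenode:8020/hbase-backup
----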
[[ops.backup.fullshutdown.restore]]
==== Restore (if needed)
The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.
The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
[[ops.backup.live.replication]]
=== Live Cluster Backup - Replication
This approach assumes that there is a second cluster.
See the HBase page on link:https://hbase.apache.org/book.html#_cluster_replication[replication] for more information.
[[ops.backup.live.copytable]]
=== Live Cluster Backup - CopyTable
The <<copy.table,copytable>> utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
Since the cluster is up, there is a risk that edits could be missed in the copy process.
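For example, a hedged sketch of copying table `myTable` to a table of the same name on a second cluster (the peer ZooKeeper address is a placeholder):

[source,bourne]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase \
  myTable
----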
[[ops.backup.live.export]]
=== Live Cluster Backup - Export
The <<export,export>> approach dumps the content of a table to HDFS on the same cluster.
To restore the data, the <<import,import>> utility would be used.
Since the cluster is up, there is a risk that edits could be missed in the export process. If you want to know more about HBase back-up and restore see the page on link:http://hbase.apache.org/book.html#backuprestore[Backup and Restore].
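For example, a hedged sketch of exporting `myTable` to an HDFS directory and later importing it back (the output path is a placeholder):

[source,bourne]
----
# Dump the table contents to HDFS on the same cluster
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export myTable /backups/myTable
# Later, restore into an existing table with the same schema
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import myTable /backups/myTable
----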
[[ops.snapshots]]
== HBase Snapshots
HBase Snapshots allow you to take a copy of a table (both contents and metadata) with a very small performance impact. A Snapshot is an immutable
collection of table metadata and a list of HFiles that comprised the table at the time the Snapshot was taken. A "clone"
of a snapshot creates a new table from that snapshot, and a "restore" of a snapshot returns the contents of a table to
what it was when the snapshot was created. The "clone" and "restore" operations do not require any data to be copied,
as the underlying HFiles (the files which contain the data for an HBase table) are not modified with either action.
Similarly, exporting a snapshot to another cluster has little impact on the RegionServers of the local cluster.
Prior to version 0.94.6, the only way to back up or clone a table was to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table.
The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table) or you need to disable the table, which means no reads or writes; this is usually unacceptable.
[[ops.snapshots.configuration]]
=== Configuration
To turn on snapshot support, set the `hbase.snapshot.enabled` property to `true`.
(Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+.)
[source,xml]
----
<property>
<name>hbase.snapshot.enabled</name>
<value>true</value>
</property>
----
[[ops.snapshots.takeasnapshot]]
=== Take a Snapshot
You can take a snapshot of a table regardless of whether it is enabled or disabled.
The snapshot operation doesn't involve any data copying.
----
$ ./bin/hbase shell
hbase> snapshot 'myTable', 'myTableSnapshot-122112'
----
.Take a Snapshot Without Flushing
The default behavior is to perform a flush of data in memory before the snapshot is taken.
This means that data in memory is included in the snapshot.
In most cases, this is the desired behavior.
However, if your set-up can tolerate data in memory being excluded from the snapshot, you can use the `SKIP_FLUSH` option of the `snapshot` command to disable flushing while taking the snapshot.
----
hbase> snapshot 'mytable', 'snapshot123', {SKIP_FLUSH => true}
----
WARNING: There is no way to determine or predict whether a concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled.
A snapshot is only a representation of a table during a window of time.
The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors.
There is also no way to know whether a given insert or update is in memory or has been flushed.
.Take a Snapshot With TTL
Snapshots have a lifecycle that is independent from the table from which they are created.
Although data in a table may be stored with a TTL, the data files containing it become
frozen by the snapshot. Space consumed by expired cells will not be reclaimed by normal
table housekeeping like compaction. While this is expected, it can be inconvenient at scale.
When many snapshots are under management and the data in various tables is expired by
TTL, an optional TTL (and an optional default TTL) for snapshots is useful.
----
hbase> snapshot 'mytable', 'snapshot1234', {TTL => 86400}
----
The above command creates snapshot `snapshot1234` with a TTL of 86400 seconds (24 hours),
and hence the snapshot is expected to be cleaned up after 24 hours.
.Default Snapshot TTL:
- FOREVER by default
- User-specified default TTL via the config `hbase.master.snapshot.ttl`
When creating a snapshot, if a TTL in seconds is not specified, the snapshot is not
deleted automatically by default; that is, it is retained forever until it is
manually deleted. However, you can change this default behavior by
providing a default TTL in seconds for the key `hbase.master.snapshot.ttl`.
A value of 0 for this config indicates TTL: FOREVER.
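For example, to default new snapshots to a seven-day TTL when no TTL is passed to the `snapshot` command:

[source,xml]
----
<!-- hbase-site.xml: default snapshot TTL in seconds (604800 = 7 days); 0 means FOREVER -->
<property>
  <name>hbase.master.snapshot.ttl</name>
  <value>604800</value>
</property>
----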
.Enable/Disable Snapshot Auto Cleanup on a running cluster:
By default, snapshot auto cleanup based on TTL is enabled
for any new cluster.
If at any point in time snapshot cleanup needs to be stopped due to
snapshot restore activity or any other reason, it is advisable
to disable it using the shell command:
----
hbase> snapshot_cleanup_switch false
----
We can re-enable it using:
----
hbase> snapshot_cleanup_switch true
----
The shell command with switch false disables snapshot auto
cleanup based on TTL and returns the previous state of
the activity (true: was already running, false: was already disabled).
A sample output for the above commands:
----
Previous snapshot cleanup state : true
Took 0.0069 seconds
=> "true"
----
We can query whether snapshot auto cleanup is enabled for
the cluster using:
----
hbase> snapshot_cleanup_enabled
----
The command returns `true` or `false`.
[[ops.snapshots.list]]
=== Listing Snapshots
List all snapshots taken (printing their names and related information).
----
$ ./bin/hbase shell
hbase> list_snapshots
----
[[ops.snapshots.delete]]
=== Deleting Snapshots
You can remove a snapshot, and the files retained for that snapshot will be removed if no longer needed.
----
$ ./bin/hbase shell
hbase> delete_snapshot 'myTableSnapshot-122112'
----
[[ops.snapshots.clone]]
=== Clone a table from snapshot
From a snapshot you can create a new table (clone operation) with the same data that you had when the snapshot was taken.
The clone operation doesn't involve data copies, and a change to the cloned table doesn't impact the snapshot or the original table.
----
$ ./bin/hbase shell
hbase> clone_snapshot 'myTableSnapshot-122112', 'myNewTestTable'
----
[[ops.snapshots.restore]]
=== Restore a snapshot
The restore operation requires the table to be disabled, and the table will be restored to the state at the time when the snapshot was taken, changing both data and schema if required.
----
$ ./bin/hbase shell
hbase> disable 'myTable'
hbase> restore_snapshot 'myTableSnapshot-122112'
----
NOTE: Since Replication works at log level and snapshots at file-system level, after a restore, the replicas will be in a different state from the master.
If you want to use restore, you need to stop replication and redo the bootstrap.
In case of partial data-loss due to misbehaving client, instead of a full restore that requires the table to be disabled, you can clone the table from the snapshot and use a Map-Reduce job to copy the data that you need, from the clone to the main one.
[[ops.snapshots.acls]]
=== Snapshots operations and ACLs
If you are using security with the AccessController Coprocessor (See <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>), only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights.
This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
[[ops.snapshots.export]]
=== Export to another cluster
The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster.
The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at the file-system level the HBase cluster does not have to be online.
To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8082/hbase) using 16 mappers:
[source,bourne]
----
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16
----
.Limiting Bandwidth Consumption
You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
The following example limits the above example to 200 MB/sec.
[source,bourne]
----
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
----
[[snapshots_s3]]
=== Storing Snapshots in an Amazon S3 Bucket
You can store and retrieve snapshots from Amazon S3, using the following procedure.
NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.
.Prerequisites
- You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher, which is the first
configuration that uses the Amazon AWS SDK.
- You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
- The `s3a://` URI must be configured and available on the server where you run
the commands to export and restore the snapshot.
After you have fulfilled the prerequisites, take the snapshot like you normally would.
Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
command like the one below, substituting your own `s3a://` path in the `copy-from`
or `copy-to` directive and substituting or modifying other options as required:
----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
-snapshot MySnapshot \
-copy-from hdfs://srv2:8082/hbase \
-copy-to s3a://<bucket>/<namespace>/hbase \
-chuser MyUser \
-chgroup MyGroup \
-chmod 700 \
-mappers 16
----
----
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
-snapshot MySnapshot \
-copy-from s3a://<bucket>/<namespace>/hbase \
-copy-to hdfs://srv2:8082/hbase \
-chuser MyUser \
-chgroup MyGroup \
-chmod 700 \
-mappers 16
----
You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the `s3a://` path by including the
`-remote-dir` option.
----
$ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
-remote-dir s3a://<bucket>/<namespace>/hbase \
-list-snapshots
----
[[snapshots_azure]]
== Storing Snapshots in Microsoft Azure Blob Storage
You can store snapshots in Microsoft Azure Blob Storage using the same techniques
as in <<snapshots_s3>>.
.Prerequisites
- You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
higher. No version of HBase supports Hadoop 2.7.0.
- Your hosts must be configured to be aware of the Azure blob storage filesystem.
See https://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html.
After you meet the prerequisites, follow the instructions
in <<snapshots_s3>>, replacing the protocol specifier with `wasb://` or `wasbs://`.
[[ops.capacity]]
== Capacity Planning and Region Sizing
There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration.
Start with a solid understanding of how HBase handles data internally.
[[ops.capacity.nodes]]
=== Node count and hardware/VM configuration
[[ops.capacity.nodes.datasize]]
==== Physical data size
Physical data size on disk is distinct from logical size of your data and is affected by the following:
* Increased by HBase overhead
** See <<keyvalue,keyvalue>> and <<keysize,keysize>>.
At least 24 bytes per key-value (cell), can be more.
Small keys/values means more relative overhead.
** KeyValue instances are aggregated into blocks, which are indexed.
Indexes also have to be stored.
Blocksize is configurable on a per-ColumnFamily basis.
See <<regions.arch,regions.arch>>.
* Decreased by <<compression,compression>> and data block encoding, depending on data.
See also link:http://search-hadoop.com/m/lL12B1PFVhp1[this thread].
You might want to test what compression and encoding (if any) make sense for your data.
* Increased by size of region server <<wal,wal>> (usually fixed and negligible - less than half of RS memory size, per RS).
* Increased by HDFS replication - usually x3.
Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see <<ops.capacity.regions,ops.capacity.regions>>).
[[ops.capacity.nodes.throughput]]
==== Read/Write throughput
Number of nodes can also be driven by required throughput for reads and/or writes.
The throughput one can get per node depends a lot on data (esp.
key/value sizes) and request patterns, as well as node and system configuration.
Planning should be done for peak load if it is likely that the load would be the main driver of the increase of the node count.
PerformanceEvaluation and <<ycsb,ycsb>> tools can be used to test single node or a test cluster.
For write, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL.
There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. <<perf.casestudy,perf.casestudy>> might be helpful.
[[ops.capacity.nodes.gc]]
==== JVM GC limitations
RS cannot currently utilize a very large heap due to the cost of GC.
There's also no good way of running multiple RSes per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended.
GC tuning is required for large heap sizes.
See <<gcpause,gcpause>>, <<trouble.log.gc,trouble.log.gc>> and elsewhere (TODO: where?)
[[ops.capacity.regions]]
=== Determining region count and size
Generally, fewer regions makes for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range.
The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region count given the table size.
When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
These settings will override the ones in `hbase-site.xml`.
That is useful if your tables have different workloads/use cases.
Also note that in the discussion of region sizes here, _HDFS replication factor is not (and should not be) taken into account, whereas
other factors <<ops.capacity.nodes.datasize,ops.capacity.nodes.datasize>> should be._ So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data.
HDFS replication factor only affects your disk usage and is invisible to most HBase code.
==== Viewing the Current Number of Regions
You can view the current number of regions for a given table using the HMaster UI.
In the [label]#Tables# section, the number of online regions for each table is listed in the [label]#Online Regions# column.
This total only includes the in-memory state and does not include disabled or offline regions.
[[ops.capacity.regions.count]]
==== Number of regions per RS - upper bound
In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. <<too_many_regions,too many regions>> has technical discussion on the subject.
Basically, the maximum number of regions is mostly determined by memstore memory usage.
Each region has its own memstores; these grow up to a configurable size; usually in 128-256 MB range, see <<hbase.hregion.memstore.flush.size,hbase.hregion.memstore.flush.size>>.
One memstore exists per column family (so there's only one per region if there's one CF in the table). The RS dedicates some fraction of total memory to its memstores (see <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as unresponsive server or compaction storms.
A good starting point for the number of regions per RS (assuming one table) is:
[source]
----
((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))
----
This formula is pseudo-code.
Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.
HBase 0.98.x::
----
((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
----
HBase 0.94.x::
----
((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))
----
If a given RegionServer has 16 GB of RAM, with default settings, the formula works out to 16384*0.4/128 ~ 51 regions per RS as a starting point.
The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate.
If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count.
Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
[[ops.capacity.regions.mincount]]
==== Number of regions per RS - lower bound
HBase scales by having regions across many servers.
Thus if you have 2 regions for 16GB data, on a 20-node cluster your data will be concentrated on just a few machines - nearly the entire cluster will be idle.
This really can't be stressed enough, since a common problem is loading 200MB data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
[[ops.capacity.regions.size]]
==== Maximum region size
For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, esp.
major, can degrade cluster performance.
Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal.
For older 0.90.x codebase, the upper-bound of regionsize is about 4Gb, with a default of 256Mb.
The size at which the region is split into two is generally configured via <<hbase.hregion.max.filesize,hbase.hregion.max.filesize>>; for details, see <<arch.region.splits,arch.region.splits>>.
If you cannot estimate the size of your tables well, when starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up).
In HBase 0.98, experimental stripe compactions feature was added that would allow for larger regions, especially for log data.
See <<ops.stripe,ops.stripe>>.
[[ops.capacity.regions.total]]
==== Total data size per region server
According to above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS will give up to 1TB served per region server, which is in line with some of the reported multi-PB use cases.
However, it is important to think about the data vs cache size ratio at the RS level.
With 1TB of data per server and 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
[[ops.capacity.config]]
=== Initial configuration and tuning
First, see <<important_configurations,important configurations>>.
Note that some configurations, more than others, depend on specific scenarios.
Pay special attention to:
* <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>> - request handler thread count, vital for high-throughput workloads.
* <<config.wals,config.wals>> - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing high volume of writes.
Then, there are some considerations when setting up your cluster and tables.
[[ops.capacity.config.compactions]]
==== Compactions
Depending on read/write volume and latency requirements, optimal compaction settings may be different.
See <<compaction,compaction>> for some details.
When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per region.
Minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to higher value; <<hbase.hstore.blockingStoreFiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in such case.
You may also consider manually managing compactions: <<managed.compactions,managed.compactions>>
[[ops.capacity.config.presplit]]
==== Pre-splitting the table
Based on the target number of the regions per RS (see <<ops.capacity.regions.count,ops.capacity.regions.count>>) and number of RSes, one can pre-split the table at creation time.
This would both avoid some costly splitting as the table starts to fill up, and ensure that the table starts out already distributed across many servers.
If the table is expected to grow large enough to justify that, at least one region per RS should be created.
It is not recommended to split immediately into the full target number of regions (e.g.
50 * number of RSes), but a low intermediate value can be chosen.
For multiple tables, it is recommended to be conservative with presplitting (e.g.
pre-split 1 region per RS at most), especially if you don't know how much each table will grow.
If you split too much, you may end up with too many regions, with some tables having too many small regions.
For pre-splitting howto, see <<manual_region_splitting_decisions,manual region splitting decisions>> and <<precreate.regions,precreate.regions>>.
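For example, a hedged shell sketch of pre-splitting a new table into 20 regions at creation time (the table, column family, and split algorithm are illustrative):

----
hbase> create 'mytable', 'cf', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}
----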
[[table.rename]]
== Table Rename
In versions 0.90.x of hbase and earlier, we had a simple script that would rename the hdfs table directory and then do an edit of the hbase:meta table replacing all mentions of the old table name with the new.
The script was called `./bin/rename_table.rb`.
The script was deprecated and removed mostly because it was unmaintained and the operation performed by the script was brutal.
As of hbase 0.94.x, you can use the snapshot facility to rename a table.
Here is how you would do it using the hbase shell:
----
hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'
----
or in code it would be as follows:
[source,java]
----
void rename(Admin admin, TableName oldTableName, TableName newTableName) throws IOException {
  String snapshotName = randomName();
  admin.disableTable(oldTableName);
  admin.snapshot(snapshotName, oldTableName);
  admin.cloneSnapshot(snapshotName, newTableName);
  admin.deleteSnapshot(snapshotName);
  admin.deleteTable(oldTableName);
}
----
[[rsgroup]]
== RegionServer Grouping
RegionServer Grouping (A.K.A `rsgroup`) is an advanced feature for
partitioning regionservers into distinctive groups for strict isolation. It
should only be used by users who are sophisticated enough to understand the
full implications and have a sufficient background in managing HBase clusters.
It was developed by Yahoo! and they run it at scale on their large grid cluster.
See link:http://www.slideshare.net/HBaseCon/keynote-apache-hbase-at-yahoo-scale[HBase at Yahoo! Scale].
RSGroups can be defined and managed with both admin methods and shell commands.
A server can be added to a group with hostname and port pair and tables
can be moved to this group so that only regionservers in the same rsgroup can
host the regions of the table. The group for a table is stored in its
TableDescriptor under the property name `hbase.rsgroup.name`. You can also set
this property on a namespace, in which case all the tables under that
namespace will be placed into the group. RegionServers and tables can only
belong to one rsgroup at a time. By default, all tables and regionservers
belong to the `default` rsgroup. System tables can also be put into a
rsgroup using the regular APIs. A custom balancer implementation tracks
assignments per rsgroup and makes sure to move regions to the relevant
regionservers in that rsgroup. The rsgroup information is stored in a regular
HBase table, and a zookeeper-based read-only cache is used at cluster bootstrap
time.
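Because the group of a table is just a property on its TableDescriptor, it can also be set when the table is created via the Java API. A minimal sketch, with a hypothetical table and group name, assuming `TableDescriptorBuilder.setValue`:
[source,java]
----
void createTableInRSGroup(Admin admin) throws IOException {
  TableDescriptor desc = TableDescriptorBuilder
      .newBuilder(TableName.valueOf("my_table"))            // hypothetical table
      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
      .setValue("hbase.rsgroup.name", "my_group")           // place the table in rsgroup 'my_group'
      .build();
  admin.createTable(desc);
}
----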
To enable, add the following to your hbase-site.xml and restart your Master:
[source,xml]
----
<property>
<name>hbase.balancer.rsgroup.enabled</name>
<value>true</value>
</property>
----
Then use the admin/shell _rsgroup_ methods/commands to create and manipulate
RegionServer groups: e.g. to add a rsgroup and then add a server to it.
To see the list of rsgroup commands available in the hbase shell, type:
[source, bash]
----
hbase(main):008:0> help 'rsgroup'
Took 0.5610 seconds
----
At a high level, you create a rsgroup other than the `default` group using the
_add_rsgroup_ command. You then add servers and tables to this group with the
_move_servers_rsgroup_ and _move_tables_rsgroup_ commands. If tables are slow to
migrate to the group's dedicated servers, run a balance for the group with the
_balance_rsgroup_ command (usually this is not needed). To monitor the effect of
the commands, see the `Tables` tab toward the end of the Master UI home page.
If you click on a table, you can see what servers it is deployed across. You
should see here a reflection of the grouping done with your shell commands.
View the Master log if there are issues.
Here is an example using a few of the rsgroup commands. To add a group, do as
follows:
[source, bash]
----
hbase(main):008:0> add_rsgroup 'my_group'
Took 0.5610 seconds
----
.RegionServer Groups must be Enabled
[NOTE]
====
If you have not enabled the rsgroup feature and you call any of the rsgroup
admin methods or shell commands the call will fail with a
`DoNotRetryIOException` with a detail message that says the rsgroup feature
is disabled.
====
Add a server (specified by hostname + port) to the just-made group using the
_move_servers_rsgroup_ command as follows:
[source, bash]
----
hbase(main):010:0> move_servers_rsgroup 'my_group',['k.att.net:51129']
----
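The same can be done through the `Admin` API. A minimal sketch, assuming the `addRSGroup` and `moveServersToRSGroup` methods and reusing the group and server from the shell examples above:
[source,java]
----
void addGroupAndServer(Admin admin) throws IOException {
  admin.addRSGroup("my_group");
  // rsgroups identify servers by hostname and port only (see the note below).
  admin.moveServersToRSGroup(
      Collections.singleton(Address.fromString("k.att.net:51129")), "my_group");
}
----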
.Hostname and Port vs ServerName
[NOTE]
====
The rsgroup feature refers to servers in a cluster with hostname and port only.
It does not make use of the HBase ServerName type identifying RegionServers;
i.e. hostname + port + starttime to distinguish RegionServer instances. The
rsgroup feature keeps working across RegionServer restarts so the starttime of
ServerName -- and hence the ServerName type -- is not appropriate.
====
=== Administration
Servers come and go over the lifetime of a cluster. Currently, you must
manually align the servers referenced in rsgroups with the actual state of
nodes in the running cluster: if you decommission a server, you must update
the rsgroups as part of your decommission process, removing any references to it.
Note that calling `clearDeadServers` manually will also remove the dead servers
from any rsgroups, but the Master loses track of those dead servers after a
restart, so you still need to update the rsgroups yourself.
Please use `Admin.removeServersFromRSGroup` or the shell command
_remove_servers_rsgroup_ to remove decommissioned servers from an rsgroup.
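For example, a decommissioned server (the hostname and port below are hypothetical) could be dropped from its rsgroup as follows:
[source,java]
----
void removeDecommissionedServer(Admin admin) throws IOException {
  admin.removeServersFromRSGroup(
      Collections.singleton(Address.fromString("rs1.example.com:16020")));
}
----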
The `default` group is not like other rsgroups in that it is dynamic. Its server
list mirrors the current state of the cluster; i.e. if you shut down a server that
was part of the `default` rsgroup, and then do a _get_rsgroup_ `default` to list
its content in the shell, the server will no longer be listed. For non-default
groups, though a node may be offline, it will persist in the non-default group's
list of servers. But if you move the offline server from the non-default rsgroup
to `default`, it will not show up in the `default` list; it will simply be dropped.
=== Best Practice
The authors of the rsgroup feature, the Yahoo! HBase Engineering team, have been
running it on their grid for a good while now and have come up with a few best
practices informed by their experience.
==== Isolate System Tables
Either keep all the system tables in a dedicated system rsgroup, or leave the
system tables in the `default` rsgroup and put all user-space tables in
non-default rsgroups.
==== Dead Nodes
Yahoo! has found it useful at their scale to keep a special rsgroup of dead or
questionable nodes; this is one means of keeping them out of the running until they are repaired.
Be careful replacing dead nodes in an rsgroup: ensure there are enough live nodes
before you start moving out the dead ones, and move in good live nodes first if you have to.
=== Troubleshooting
Viewing the Master log will give you insight on rsgroup operation.
If it appears stuck, restart the Master process.
=== Remove RegionServer Grouping
Disabling the RegionServer Grouping feature is easy: just remove
`hbase.balancer.rsgroup.enabled` from hbase-site.xml, or explicitly set it to
false in hbase-site.xml.
[source,xml]
----
<property>
<name>hbase.balancer.rsgroup.enabled</name>
<value>false</value>
</property>
----
But if you later set `hbase.balancer.rsgroup.enabled` back to true, the old rsgroup
configuration will take effect again. So if you want to completely remove the
RegionServer Grouping feature from a cluster, so that re-enabling the feature in
the future will not be affected by the old metadata, there are a few more steps
to perform.
- Move all tables in non-default rsgroups to `default` regionserver group
[source,bash]
----
#Reassigning table t1 from non default group - hbase shell
hbase(main):005:0> move_tables_rsgroup 'default',['t1']
----
- Move all regionservers in non-default rsgroups to `default` regionserver group
[source, bash]
----
#Reassigning all the servers in the non-default rsgroup to default - hbase shell
hbase(main):008:0> move_servers_rsgroup 'default',['rs1.xxx.com:16206','rs2.xxx.com:16202','rs3.xxx.com:16204']
----
- Remove all non-default rsgroups. The implicitly created `default` rsgroup does not have to be removed
[source,bash]
----
#removing non default rsgroup - hbase shell
hbase(main):009:0> remove_rsgroup 'group2'
----
- Remove the changes made in `hbase-site.xml` and restart the cluster
- Drop the table `hbase:rsgroup` from `hbase`
[source, bash]
----
#Through hbase shell drop table hbase:rsgroup
hbase(main):001:0> disable 'hbase:rsgroup'
0 row(s) in 2.6270 seconds
hbase(main):002:0> drop 'hbase:rsgroup'
0 row(s) in 1.2730 seconds
----
- Remove znode `rsgroup` from the cluster ZooKeeper using zkCli.sh
[source, bash]
----
#From ZK remove the node /hbase/rsgroup through zkCli.sh
rmr /hbase/rsgroup
----
=== ACL
To enable ACL, add the following to your hbase-site.xml and restart your Master:
[source,xml]
----
<property>
<name>hbase.security.authorization</name>
<value>true</value>
</property>
----
[[migrating.rsgroup]]
=== Migrating From Old Implementation
The coprocessor `org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is
deprecated, but for compatibility, if you want a pre-3.0.0 hbase client/shell
to communicate with the new hbase cluster, you still need to add this
coprocessor to the master.
The `hbase.rsgroup.grouploadbalancer.class` config has been deprecated, as now
the top level load balancer will always be `RSGroupBasedLoadBalancer`, and the
`hbase.master.loadbalancer.class` config is for configuring the balancer within
a group. This also means you should not set `hbase.master.loadbalancer.class`
to `RSGroupBasedLoadBalancer` any more, even if the rsgroup feature is enabled.
We have also made some special changes for compatibility. First, if the coprocessor
`org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is specified, the
`hbase.balancer.rsgroup.enabled` flag will be set to true automatically to
enable the rsgroup feature. Second, we will load
`hbase.rsgroup.grouploadbalancer.class` prior to
`hbase.master.loadbalancer.class`. And last, if you do not set
`hbase.rsgroup.grouploadbalancer.class` but only set
`hbase.master.loadbalancer.class` to `RSGroupBasedLoadBalancer`, we will load
the default load balancer to avoid infinite nesting. This means you do not need
to change anything when upgrading if you have already enabled the rsgroup feature.
The main difference compared to the old implementation is that the rsgroup for
a table is now stored in `TableDescriptor` instead of in `RSGroupInfo`, so the
`getTables` method of `RSGroupInfo` has been deprecated. And if you use the
`Admin` methods to get the `RSGroupInfo`, its `getTables` method will always
return empty. This is because, in the old implementation, this method was a bit
broken: you could set an rsgroup on a namespace and thereby place all the tables
under that namespace into the group, yet those tables could not be retrieved
through `RSGroupInfo.getTables`. You should now use the two new `Admin` methods,
`listTablesInRSGroup` and `getConfiguredNamespacesAndTablesInRSGroup`, to get the
tables and namespaces in a rsgroup.
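A minimal sketch of the two new methods, assuming `getConfiguredNamespacesAndTablesInRSGroup` returns a `Pair` of namespace names and table names:
[source,java]
----
void printRSGroupContents(Admin admin, String groupName) throws IOException {
  // Tables actually placed in the group, whether set on the table or inherited from a namespace.
  for (TableName table : admin.listTablesInRSGroup(groupName)) {
    System.out.println("table: " + table);
  }
  // Namespaces and tables explicitly configured with this group.
  Pair<List<String>, List<TableName>> configured =
      admin.getConfiguredNamespacesAndTablesInRSGroup(groupName);
  System.out.println("namespaces: " + configured.getFirst());
  System.out.println("tables: " + configured.getSecond());
}
----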
Of course the behavior of the old RSGroupAdminEndpoint is not changed:
we will fill in the tables field of the RSGroupInfo before returning, to keep it
compatible with old hbase clients/shells.
When upgrading, the migration between RSGroupInfo and TableDescriptor will
be done automatically. It will take some time, but it is fine to restart the Master
in the middle; the migration will continue after the restart. During the
migration, the rsgroup feature will still work and, in most cases, regions
will not be misplaced (since this is only a one-time job and does not last
long, it has not been tested exhaustively enough to guarantee that regions are
never misplaced, hence 'in most cases'). The implementation is a
bit tricky; you can see the code in `RSGroupInfoManagerImpl.migrate` if
interested.
[[normalizer]]
== Region Normalizer
The Region Normalizer tries to make all Regions in a table about the same
size. It does this by first calculating total table size and average size per
region. It splits any region that is larger than twice this size. Any region
that is much smaller is merged into an adjacent region. The Normalizer runs on
a regular schedule, which is configurable. It can also be disabled entirely via
a runtime "switch", and it can be run manually via the shell or an Admin API call.
Even if it is normally disabled, it is good to run it manually after the cluster has
been running for a while, or after a burst of activity such as a large delete.
The Normalizer works well for bringing a table's region boundaries into
alignment with the reality of data distribution after an initial effort at
pre-splitting a table. It is also a nice complement to the data TTL feature
when the schema includes a timestamp in the rowkey, as it will automatically
merge away regions whose contents have expired.
(The bulk of the below detail was copied wholesale from the blog by Romil Choksi at
link:https://community.hortonworks.com/articles/54987/hbase-region-normalizer.html[HBase Region Normalizer]).
The Region Normalizer is a feature available since HBase 1.2. It runs a set of
pre-calculated merge/split actions to resize regions that are either too
large or too small compared to the average region size for a given table. When
invoked, the Region Normalizer computes a normalization 'plan' for all of the tables in
HBase. System tables (such as hbase:meta, hbase:namespace, Phoenix system tables,
etc.) and user tables with normalization disabled are ignored while computing the
plan. For normalization-enabled tables, the normalization plan is carried out in
parallel across multiple tables.
The Normalizer can be enabled or disabled globally for the entire cluster using the
`normalizer_switch` command in the HBase shell. Normalization can also be
controlled on a per-table basis, and it is disabled by default when a table is
created. Normalization for a table can be enabled or disabled by setting the
NORMALIZATION_ENABLED table attribute to true or false.
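The per-table switch can also be flipped programmatically through the table descriptor; a minimal sketch, assuming `TableDescriptorBuilder.setNormalizationEnabled`:
[source,java]
----
void enableNormalizationForTable(Admin admin, TableName tableName) throws IOException {
  TableDescriptor desc = TableDescriptorBuilder
      .newBuilder(admin.getDescriptor(tableName))  // start from the existing descriptor
      .setNormalizationEnabled(true)               // equivalent to NORMALIZATION_ENABLED => 'true'
      .build();
  admin.modifyTable(desc);
}
----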
To check normalizer status and enable/disable the normalizer:
[source,bash]
----
hbase(main):001:0> normalizer_enabled
true
0 row(s) in 0.4870 seconds
hbase(main):002:0> normalizer_switch false
true
0 row(s) in 0.0640 seconds
hbase(main):003:0> normalizer_enabled
false
0 row(s) in 0.0120 seconds
hbase(main):004:0> normalizer_switch true
false
0 row(s) in 0.0200 seconds
hbase(main):005:0> normalizer_enabled
true
0 row(s) in 0.0090 seconds
----
When enabled, the Normalizer is invoked in the background every 5 minutes (by default),
which can be configured using `hbase.normalizer.period` in `hbase-site.xml`.
The Normalizer can also be invoked manually/programmatically at will using the HBase shell's
`normalize` command. HBase by default uses `SimpleRegionNormalizer`, but users can
design their own normalizer as long as they implement the RegionNormalizer interface.
Details about the logic used by `SimpleRegionNormalizer` to compute its normalization
plan can be found link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html[here].
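Programmatic invocation also goes through the `Admin` API; a minimal sketch, assuming the `isNormalizerEnabled`, `normalizerSwitch`, and `normalize` methods:
[source,java]
----
void runNormalizerOnce(Admin admin) throws IOException {
  if (!admin.isNormalizerEnabled()) {
    admin.normalizerSwitch(true);   // turn the cluster-wide normalizer switch on
  }
  boolean ran = admin.normalize();  // ask the active Master to run normalization now
  System.out.println("normalizer ran: " + ran);
}
----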
The below example shows a normalization plan being computed for a user table, and
a merge action being taken as a result of the normalization plan computed by SimpleRegionNormalizer.
Consider a user table with some pre-split regions: 3 equally large regions
(about 100K rows each) and 1 relatively small region (about 25K rows). Following is the
snippet from an hbase:meta table scan showing each of the pre-split regions for
the user table.
----
table_p8ddpd6q5z,,1469494305548.68b9892220865cb6048 column=info:regioninfo, timestamp=1469494306375, value={ENCODED => 68b9892220865cb604809c950d1adf48, NAME => 'table_p8ddpd6q5z,,1469494305548.68b989222 09c950d1adf48. 0865cb604809c950d1adf48.', STARTKEY => '', ENDKEY => '1'}
....
table_p8ddpd6q5z,1,1469494317178.867b77333bdc75a028 column=info:regioninfo, timestamp=1469494317848, value={ENCODED => 867b77333bdc75a028bb4c5e4b235f48, NAME => 'table_p8ddpd6q5z,1,1469494317178.867b7733 bb4c5e4b235f48. 3bdc75a028bb4c5e4b235f48.', STARTKEY => '1', ENDKEY => '3'}
....
table_p8ddpd6q5z,3,1469494328323.98f019a753425e7977 column=info:regioninfo, timestamp=1469494328486, value={ENCODED => 98f019a753425e7977ab8636e32deeeb, NAME => 'table_p8ddpd6q5z,3,1469494328323.98f019a7 ab8636e32deeeb. 53425e7977ab8636e32deeeb.', STARTKEY => '3', ENDKEY => '7'}
....
table_p8ddpd6q5z,7,1469494339662.94c64e748979ecbb16 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 94c64e748979ecbb166f6cc6550e25c6, NAME => 'table_p8ddpd6q5z,7,1469494339662.94c64e74 6f6cc6550e25c6. 8979ecbb166f6cc6550e25c6.', STARTKEY => '7', ENDKEY => '8'}
....
table_p8ddpd6q5z,8,1469494339662.6d2b3f5fd1595ab8e7 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 6d2b3f5fd1595ab8e7c031876057b1ee, NAME => 'table_p8ddpd6q5z,8,1469494339662.6d2b3f5f c031876057b1ee. d1595ab8e7c031876057b1ee.', STARTKEY => '8', ENDKEY => ''}
----
Invoking the normalizer using `normalize` in the HBase shell, the below log snippet
from the HMaster log shows the normalization plan computed as per the logic defined for
SimpleRegionNormalizer. Since the total size (in MB) of the two adjacent smallest
regions in the table is less than the average region size, the normalizer computes a
plan to merge these two regions.
----
2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto
normalization turned on
2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't have auto normalization turned on
2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: table_h2osxu3wat, as it's either system table or doesn't have autonormalization turned on
2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 5
2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 2.4
2016-07-26 07:08:26,929 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, small region size: 0 plus its neighbor size: 0, less thanthe avg size 2.4, merging them
2016-07-26 07:08:26,971 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.MergeNormalizationPlan: Executing merging normalization plan: MergeNormalizationPlan{firstRegion={ENCODED=> d51df2c58e9b525206b1325fd925a971, NAME => 'table_p8ddpd6q5z,,1469514755237.d51df2c58e9b525206b1325fd925a971.', STARTKEY => '', ENDKEY => '1'}, secondRegion={ENCODED => e69c6b25c7b9562d078d9ad3994f5330, NAME => 'table_p8ddpd6q5z,1,1469514767669.e69c6b25c7b9562d078d9ad3994f5330.',
STARTKEY => '1', ENDKEY => '3'}}
----
The Region Normalizer, as per its computed plan, merged the region with start key ''
and end key '1' with another region having start key '1' and end key '3'.
Now that these regions have been merged, we see a single new region with start key
'' and end key '3':
----
table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeA, timestamp=1469516907431,
value=PBUF\x08\xA5\xD9\x9E\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x00"\x011(\x000\x00 ea74d246741ba. 8\x00
table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeB, timestamp=1469516907431,
value=PBUF\x08\xB5\xBA\x9F\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x011"\x013(\x000\x0 ea74d246741ba. 08\x00
table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:regioninfo, timestamp=1469516907431, value={ENCODED => e06c9b83c4a252b130eea74d246741ba, NAME => 'table_p8ddpd6q5z,,1469516907210.e06c9b83c ea74d246741ba. 4a252b130eea74d246741ba.', STARTKEY => '', ENDKEY => '3'}
....
table_p8ddpd6q5z,3,1469514778736.bf024670a847c0adff column=info:regioninfo, timestamp=1469514779417, value={ENCODED => bf024670a847c0adffb74b2e13408b32, NAME => 'table_p8ddpd6q5z,3,1469514778736.bf024670 b74b2e13408b32. a847c0adffb74b2e13408b32.' STARTKEY => '3', ENDKEY => '7'}
....
table_p8ddpd6q5z,7,1469514790152.7c5a67bc755e649db2 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 7c5a67bc755e649db22f49af6270f1e1, NAME => 'table_p8ddpd6q5z,7,1469514790152.7c5a67bc 2f49af6270f1e1. 755e649db22f49af6270f1e1.', STARTKEY => '7', ENDKEY => '8'}
....
table_p8ddpd6q5z,8,1469514790152.58e7503cda69f98f47 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 58e7503cda69f98f4755178e74288c3a, NAME => 'table_p8ddpd6q5z,8,1469514790152.58e7503c 55178e74288c3a. da69f98f4755178e74288c3a.', STARTKEY => '8', ENDKEY => ''}
----
A similar example can be seen for a user table with 3 smaller regions and 1
relatively large region. For this example, we have a user table with 1 large region containing 100K rows, and 3 relatively smaller regions with about 33K rows each. As seen from the normalization plan, since the larger region is more than twice the average region size, it ends up being split into two regions: one with start key '1' and end key '154717', and the other with start key '154717' and end key '3'.
----
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 4
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 3.0
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: No normalization needed, regions look good for table: table_p8ddpd6q5z
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_h2osxu3wat, number of regions: 5
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, total aggregated regions size: 7
2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, average region size: 1.4
2016-07-26 07:39:45,636 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, large region table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db. has size 4, more than twice avg size, splitting
2016-07-26 07:39:45,640 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SplitNormalizationPlan: Executing splitting normalization plan: SplitNormalizationPlan{regionInfo={ENCODED => 27f2fdbb2b6612ea163eb6b40753c3db, NAME => 'table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db.', STARTKEY => '1', ENDKEY => '3'}, splitPoint=null}
2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto normalization turned on
2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't
have auto normalization turned on …..…..….
2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined 54de97dae764b864504704c1c8d3674a on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => 54de97dae764b864504704c1c8d3674a, NAME => 'table_h2osxu3wat,1,1469518785661.54de97dae764b864504704c1c8d3674a.', STARTKEY => '1', ENDKEY => '154717'}
2016-07-26 07:39:46,246 INFO [AM.ZK.Worker-pool2-t278] master.RegionStates: Transition {d6b5625df331cfec84dce4f1122c567f state=SPLITTING_NEW, ts=1469518786246, server=hbase-test-rc-5.openstacklocal,16020,1469419333913} to {d6b5625df331cfec84dce4f1122c567f state=OPEN, ts=1469518786246,
server=hbase-test-rc-5.openstacklocal,16020,1469419333913}
2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined d6b5625df331cfec84dce4f1122c567f on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => d6b5625df331cfec84dce4f1122c567f, NAME => 'table_h2osxu3wat,154717,1469518785661.d6b5625df331cfec84dce4f1122c567f.', STARTKEY => '154717', ENDKEY => '3'}
----
[[auto_reopen_regions]]
== Auto Region Reopen
We can leak store reader references if a coprocessor or core function somehow
opens a scanner, or wraps one, and then does not take care to call close on the
scanner or the wrapped instance. Leaked store files cannot be removed even
after they are invalidated via compaction.
A reasonable mitigation for a reader reference leak is a fast reopen of the
region on the same server. This will release all resources, like the refcount,
leases, etc. The clients should gracefully ride over this like any other region
in transition.
By default, this auto region reopen feature is disabled.
To enable it, provide a high reference count value for the config
`hbase.regions.recovery.store.file.ref.count`.
Please refer to the config descriptions for
`hbase.master.regions.recovery.check.interval` and
`hbase.regions.recovery.store.file.ref.count`.
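A minimal sketch of the two configs involved; the values are hypothetical, and in practice they belong in _hbase-site.xml_ on the Master:
[source,java]
----
void enableAutoRegionReopen(Configuration conf) {
  // How often the Master checks store file reference counts (hypothetical: 5 minutes, in ms).
  conf.setLong("hbase.master.regions.recovery.check.interval", 300000);
  // Reopen a region once any of its store files reaches this reference count
  // (hypothetical threshold; the feature stays disabled unless a positive value is set).
  conf.setInt("hbase.regions.recovery.store.file.ref.count", 256);
}
----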