| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| <html> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <head> |
| <title>HBase</title> |
| </head> |
| <body bgcolor="white"> |
| <a href="http://hbase.org">HBase</a> is a scalable, distributed database built on <a href="http://hadoop.apache.org/core">Hadoop Core</a>. |
| |
| <h2>Table of Contents</h2> |
| <ul> |
| <li> |
| <a href="#requirements">Requirements</a> |
| <ul> |
| <li><a href="#windows">Windows</a></li> |
| </ul> |
| </li> |
| <li> |
| <a href="#getting_started" >Getting Started</a> |
| <ul> |
| <li><a href="#standalone">Standalone</a></li> |
| <li> |
| <a href="#distributed">Distributed Operation: Pseudo- and Fully-distributed modes</a> |
| <ul> |
| <li><a href="#pseudo-distrib">Pseudo-distributed</a></li> |
| <li><a href="#fully-distrib">Fully-distributed</a></li> |
| </ul> |
| </li> |
| </ul> |
| </li> |
| <li><a href="#runandconfirm">Running and Confirming Your Installation</a></li> |
| <li><a href="#upgrading" >Upgrading</a></li> |
| <li><a href="#client_example">Example API Usage</a></li> |
| <li><a href="#related" >Related Documentation</a></li> |
| </ul> |
| |
| <h2><a name="requirements">Requirements</a></h2> |
| <ul> |
| <li>Java 1.6.x, preferably from <a href="http://www.java.com/download/">Sun</a>. Use the latest version available except u18 (u19 is fine).</li> |
| <li>This version of HBase will only run on <a href="http://hadoop.apache.org/common/releases.html">Hadoop 0.20.x</a>. |
| HBase will lose data unless it is running on an HDFS that has a durable sync operation. |
| Currently only the <a href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</a> |
branch has this attribute. No official releases have been made from this branch as of this writing,
so you will have to build your own Hadoop from the tip of this branch
| (or install Cloudera's <a href="http://archive.cloudera.com/docs/">CDH3</a> (as of this writing, it is in beta); it has the |
| 0.20-append patches needed to add a durable sync). |
| See <a href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</a> |
in branch-0.20-append to see the list of patches involved.</li>
| <li> |
| <em>ssh</em> must be installed and <em>sshd</em> must be running to use Hadoop's scripts to manage remote Hadoop daemons. |
| You must be able to ssh to all nodes, including your local node, using passwordless login |
| (Google "ssh passwordless login"). |
| </li> |
| <li> |
| HBase depends on <a href="http://hadoop.apache.org/zookeeper/">ZooKeeper</a> as of release 0.20.0. |
| HBase keeps the location of its root table, who the current master is, and what regions are |
| currently participating in the cluster in ZooKeeper. |
| Clients and Servers now must know their <em>ZooKeeper Quorum locations</em> before |
| they can do anything else (Usually they pick up this information from configuration |
| supplied on their CLASSPATH). By default, HBase will manage a single ZooKeeper instance for you. |
| In <em>standalone</em> and <em>pseudo-distributed</em> modes this is usually enough, but for |
| <em>fully-distributed</em> mode you should configure a ZooKeeper quorum (more info below). |
| </li> |
| <li>Hosts must be able to resolve the fully-qualified domain name of the master.</li> |
| <li> |
The clocks on cluster members should be in basic alignment. Some skew is tolerable, but
| wild skew could generate odd behaviors. Run <a href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</a> |
| on your cluster, or an equivalent. |
| </li> |
| <li> |
| The default <b>ulimit -n</b> of 1024 on *nix systems will be insufficient. |
| Any significant amount of loading will lead you to |
| <a href="http://wiki.apache.org/hadoop/Hbase/FAQ#A6">FAQ: Why do I see "java.io.IOException...(Too many open files)" in my logs?</a>. |
| You will also notice errors like: |
| <pre> |
| 2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException |
| 2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901 |
| </pre> |
Do yourself a favor and change this to more than 10k; see the FAQ on the HBase wiki for how
(a sketch of one way to raise the limit follows this list).
Also, HDFS has an upper bound on the number of files that it can serve at the same time,
called xcievers (yes, this is <em>misspelled</em>). Again, before doing any loading,
make sure you have configured Hadoop's conf/hdfs-site.xml with this:
| <pre> |
| <property> |
| <name>dfs.datanode.max.xcievers</name> |
| <value>2047</value> |
| </property> |
| </pre> |
| See the background of this issue here: <a href="http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A5">Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 256"</a>. |
| Failure to follow these instructions will result in <b>data loss</b>. |
| </li> |
| </ul> |
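<p>As a rough sketch of one way to raise the file-descriptor limit on a Linux system
(assuming, for illustration, that the Hadoop and HBase daemons run as a user named <em>hadoop</em>;
consult the wiki FAQ and your distribution's documentation for the authoritative procedure):</p>
<blockquote>
<pre>
# /etc/security/limits.conf -- raise the open-files limit for the (hypothetical) hadoop user
hadoop  -  nofile  32768

# then verify from a fresh login shell for that user
$ ulimit -n
32768
</pre>
</blockquote>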
| |
| <h3><a name="windows">Windows</a></h3> |
| If you are running HBase on Windows, you must install |
| <a href="http://cygwin.com/">Cygwin</a> |
| to have a *nix-like environment for the shell scripts. The full details |
| are explained in |
| the <a href="../cygwin.html">Windows Installation</a> |
| guide. |
| |
| |
| <h2><a name="getting_started" >Getting Started</a></h2> |
<p>What follows presumes you have obtained a copy of HBase
(see <a href="http://hadoop.apache.org/hbase/releases.html">Releases</a>) and are installing it
for the first time. If you are upgrading an existing HBase instance, see <a href="#upgrading">Upgrading</a>.</p>
| |
| <p>Three modes are described: <em>standalone</em>, <em>pseudo-distributed</em> (where all servers are run on |
| a single host), and <em>fully-distributed</em>. If new to HBase start by following the standalone instructions.</p> |
| |
| <p>Begin by reading <a href="#requirements">Requirements</a>.</p> |
| |
| <p>Whatever your mode, define <code>${HBASE_HOME}</code> to be the location of the root of your HBase installation, e.g. |
<code>/usr/local/hbase</code>. Edit <code>${HBASE_HOME}/conf/hbase-env.sh</code>. In this file you can
| set the heapsize for HBase, etc. At a minimum, set <code>JAVA_HOME</code> to point at the root of |
| your Java installation.</p> |
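<p>For example, a minimal <code>hbase-env.sh</code> edit might look like the following
(the Java path and heap size are illustrative; substitute values appropriate to your machine):</p>
<blockquote>
<pre>
# ${HBASE_HOME}/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# The maximum amount of heap to use, in MB.
export HBASE_HEAPSIZE=1000
</pre>
</blockquote>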
| |
| <h3><a name="standalone">Standalone mode</a></h3> |
| <p>If you are running a standalone operation, there should be nothing further to configure; proceed to |
| <a href="#runandconfirm">Running and Confirming Your Installation</a>. If you are running a distributed |
| operation, continue reading.</p> |
| |
| <h3><a name="distributed">Distributed Operation: Pseudo- and Fully-distributed modes</a></h3> |
| <p>Distributed modes require an instance of the <em>Hadoop Distributed File System</em> (DFS). |
| See the Hadoop <a href="http://hadoop.apache.org/common/docs/r0.20.1/api/overview-summary.html#overview_description"> |
| requirements and instructions</a> for how to set up a DFS.</p> |
| |
| <h4><a name="pseudo-distrib">Pseudo-distributed mode</a></h4> |
| <p>A pseudo-distributed mode is simply a distributed mode run on a single host. |
Use this configuration for testing and prototyping on HBase. Do not use this configuration
for production or for evaluating HBase performance.
| </p> |
| <p>Once you have confirmed your DFS setup, configuring HBase for use on one host requires modification of |
| <code>${HBASE_HOME}/conf/hbase-site.xml</code>, which needs to be pointed at the running Hadoop DFS instance. |
| Use <code>hbase-site.xml</code> to override the properties defined in |
| <code>${HBASE_HOME}/conf/hbase-default.xml</code> (<code>hbase-default.xml</code> itself |
| should never be modified) and for HDFS client configurations. |
| At a minimum, the <code>hbase.rootdir</code>, |
| which points HBase at the Hadoop filesystem to use, |
and the <code>dfs.replication</code>, an HDFS client-side
configuration stipulating how many replicas to keep,
| should be redefined in <code>hbase-site.xml</code>. For example, |
| adding the properties below to your <code>hbase-site.xml</code> says that HBase |
| should use the <code>/hbase</code> |
| directory in the HDFS whose namenode is at port 9000 on your local machine, and that |
| it should run with one replica only (recommended for pseudo-distributed mode):</p> |
| <blockquote> |
| <pre> |
| <configuration> |
| ... |
| <property> |
| <name>hbase.rootdir</name> |
| <value>hdfs://localhost:9000/hbase</value> |
| <description>The directory shared by region servers. |
| </description> |
| </property> |
| <property> |
| <name>dfs.replication</name> |
| <value>1</value> |
| <description>The replication count for HLog & HFile storage. Should not be greater than HDFS datanode count. |
| </description> |
| </property> |
| ... |
| </configuration> |
| </pre> |
| </blockquote> |
| |
<p>Note: Let HBase create the directory. If you don't, you'll get a warning saying HBase
| needs a migration run because the directory is missing files expected by HBase (it'll |
| create them if you let it).</p> |
| <p>Also Note: Above we bind to localhost. This means that a remote client cannot |
connect. Amend accordingly if you want to connect from a remote location.</p>
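<p>For instance, if your NameNode runs on a (hypothetical) host named <em>namenode.example.com</em>,
the <code>hbase.rootdir</code> value above would become:</p>
<blockquote>
<pre>
&lt;value&gt;hdfs://namenode.example.com:9000/hbase&lt;/value&gt;
</pre>
</blockquote>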
| |
| <h4><a name="fully-distrib">Fully-Distributed Operation</a></h4> |
| <p>For running a fully-distributed operation on more than one host, the following |
| configurations must be made <em>in addition</em> to those described in the |
| <a href="#pseudo-distrib">pseudo-distributed operation</a> section above.</p> |
| |
| <p>In <code>hbase-site.xml</code>, set <code>hbase.cluster.distributed</code> to <code>true</code>.</p> |
| <blockquote> |
| <pre> |
| <configuration> |
| ... |
| <property> |
| <name>hbase.cluster.distributed</name> |
| <value>true</value> |
| <description>The mode the cluster will be in. Possible values are |
| false: standalone and pseudo-distributed setups with managed Zookeeper |
| true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh) |
| </description> |
| </property> |
| ... |
| </configuration> |
| </pre> |
| </blockquote> |
| |
<p>In fully-distributed mode, you probably want to change your <code>hbase.rootdir</code>
from localhost to the name of the node running the HDFS NameNode, and you should set
<code>dfs.replication</code> to the number of DataNodes you have in your cluster or 3,
whichever is smaller.
</p>
| <p>In addition |
| to <code>hbase-site.xml</code> changes, a fully-distributed mode requires that you |
| modify <code>${HBASE_HOME}/conf/regionservers</code>. |
The <code>regionservers</code> file lists all hosts running <code>HRegionServer</code>s, one host per line
(this file in HBase is analogous to the Hadoop slaves file at <code>${HADOOP_HOME}/conf/slaves</code>).</p>
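<p>For example, the <code>regionservers</code> file for a small cluster might look like the
following (hostnames are illustrative):</p>
<blockquote>
<pre>
rs1.example.com
rs2.example.com
rs3.example.com
</pre>
</blockquote>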
| |
| <p>A distributed HBase depends on a running ZooKeeper cluster. All participating nodes and clients |
| need to be able to get to the running ZooKeeper cluster. |
| HBase by default manages a ZooKeeper cluster for you, or you can manage it on your own and point HBase to it. |
| To toggle HBase management of ZooKeeper, use the <code>HBASE_MANAGES_ZK</code> variable in <code>${HBASE_HOME}/conf/hbase-env.sh</code>. |
| This variable, which defaults to <code>true</code>, tells HBase whether to |
| start/stop the ZooKeeper quorum servers alongside the rest of the servers.</p> |
| |
| <p>When HBase manages the ZooKeeper cluster, you can specify ZooKeeper configuration |
| using its canonical <code>zoo.cfg</code> file (see below), or |
just specify ZooKeeper options directly in <code>${HBASE_HOME}/conf/hbase-site.xml</code>
(if you are new to ZooKeeper, take the path of specifying your configuration in HBase's hbase-site.xml).
| Every ZooKeeper configuration option has a corresponding property in the HBase hbase-site.xml |
| XML configuration file named <code>hbase.zookeeper.property.OPTION</code>. |
| For example, the <code>clientPort</code> setting in ZooKeeper can be changed by |
| setting the <code>hbase.zookeeper.property.clientPort</code> property. |
| For the full list of available properties, see ZooKeeper's <code>zoo.cfg</code>. |
| For the default values used by HBase, see <code>${HBASE_HOME}/conf/hbase-default.xml</code>.</p> |
| |
| <p>At minimum, you should set the list of servers that you want ZooKeeper to run |
| on using the <code>hbase.zookeeper.quorum</code> property. |
| This property defaults to <code>localhost</code> which is not suitable for a |
| fully distributed HBase (it binds to the local machine only and remote clients |
| will not be able to connect). |
| It is recommended to run a ZooKeeper quorum of 3, 5 or 7 machines, and give each |
| ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk. |
| For very heavily loaded clusters, run ZooKeeper servers on separate machines from the |
| Region Servers (DataNodes and TaskTrackers).</p> |
| |
| <p>To point HBase at an existing ZooKeeper cluster, add |
| a suitably configured <code>zoo.cfg</code> to the <code>CLASSPATH</code>. |
| HBase will see this file and use it to figure out where ZooKeeper is. |
| Additionally set <code>HBASE_MANAGES_ZK</code> in <code>${HBASE_HOME}/conf/hbase-env.sh</code> |
| to <code>false</code> so that HBase doesn't mess with your ZooKeeper setup:</p> |
| <blockquote> |
| <pre> |
| ... |
# Tell HBase whether it should manage its own instance of ZooKeeper or not.
| export HBASE_MANAGES_ZK=false |
| </pre> |
| </blockquote> |
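<p>A minimal <code>zoo.cfg</code> for an externally managed three-node ensemble might look like the
following (hostnames and the data directory are illustrative; see the ZooKeeper documentation for details):</p>
<blockquote>
<pre>
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
</pre>
</blockquote>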
| |
| <p>As an example, to have HBase manage a ZooKeeper quorum on nodes |
| <em>rs{1,2,3,4,5}.example.com</em>, bound to port 2222 (the default is 2181), use:</p> |
| <blockquote> |
| <pre> |
| ${HBASE_HOME}/conf/hbase-env.sh: |
| |
| ... |
# Tell HBase whether it should manage its own instance of ZooKeeper or not.
| export HBASE_MANAGES_ZK=true |
| |
| ${HBASE_HOME}/conf/hbase-site.xml: |
| |
| <configuration> |
| ... |
| <property> |
| <name>hbase.zookeeper.property.clientPort</name> |
| <value>2222</value> |
| <description>Property from ZooKeeper's config zoo.cfg. |
| The port at which the clients will connect. |
| </description> |
| </property> |
| ... |
| <property> |
| <name>hbase.zookeeper.quorum</name> |
| <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value> |
| <description>Comma separated list of servers in the ZooKeeper Quorum. |
| For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". |
| By default this is set to localhost for local and pseudo-distributed modes |
| of operation. For a fully-distributed setup, this should be set to a full |
| list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh |
| this is the list of servers which we will start/stop ZooKeeper on. |
| </description> |
| </property> |
| ... |
| </configuration> |
| </pre> |
| </blockquote> |
| |
| <p>When HBase manages ZooKeeper, it will start/stop the ZooKeeper servers as a part |
| of the regular start/stop scripts. If you would like to run it yourself, you can |
| do:</p> |
| <blockquote> |
| <pre>${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper</pre> |
| </blockquote> |
| |
<p>If you do let HBase manage ZooKeeper for you, make sure you configure
where its data is stored. By default, it will be stored in /tmp, which is
sometimes cleaned on live systems. Do modify this configuration so the data lives somewhere persistent:</p>
| <pre> |
| <property> |
| <name>hbase.zookeeper.property.dataDir</name> |
| <value>${hbase.tmp.dir}/zookeeper</value> |
| <description>Property from ZooKeeper's config zoo.cfg. |
| The directory where the snapshot is stored. |
| </description> |
| </property> |
| |
| </pre> |
| |
| <p>Note that you can use HBase in this manner to spin up a ZooKeeper cluster, |
| unrelated to HBase. Just make sure to set <code>HBASE_MANAGES_ZK</code> to |
| <code>false</code> if you want it to stay up so that when HBase shuts down it |
| doesn't take ZooKeeper with it.</p> |
| |
| <p>For more information about setting up a ZooKeeper cluster on your own, see |
| the ZooKeeper <a href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting Started Guide</a>. |
| HBase currently uses ZooKeeper version 3.3.1, so any cluster setup with a |
| 3.x.x version of ZooKeeper should work.</p> |
| |
| <p>Of note, if you have made <em>HDFS client configuration</em> on your Hadoop cluster, HBase will not |
| see this configuration unless you do one of the following:</p> |
| <ul> |
<li>Add a pointer to your <code>HADOOP_CONF_DIR</code> to the <code>CLASSPATH</code> in <code>hbase-env.sh</code>;</li>
<li>Add a copy of <code>hdfs-site.xml</code> (or <code>hadoop-site.xml</code>) to <code>${HBASE_HOME}/conf</code>; or</li>
<li>If only a small set of HDFS client configurations is involved, add them directly to <code>hbase-site.xml</code>.</li>
| </ul> |
| |
<p>An example of such an HDFS client configuration is <code>dfs.replication</code>. If, for example,
you want to run with a replication factor of 5, HBase will create files with the default replication of 3 unless
you do one of the above to make the configuration available to HBase.</p>
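<p>As a sketch of the first option, the pointer can be a single line in
<code>${HBASE_HOME}/conf/hbase-env.sh</code> (the Hadoop configuration directory shown is illustrative):</p>
<blockquote>
<pre>
# Extra Java CLASSPATH elements; make Hadoop's client configuration visible to HBase.
export HBASE_CLASSPATH=/usr/local/hadoop/conf
</pre>
</blockquote>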
| |
| |
| <h2><a name="runandconfirm">Running and Confirming Your Installation</a></h2> |
| <p>If you are running in standalone, non-distributed mode, HBase by default uses the local filesystem.</p> |
| |
| <p>If you are running a distributed cluster you will need to start the Hadoop DFS daemons and |
| ZooKeeper Quorum before starting HBase and stop the daemons after HBase has shut down.</p> |
| |
<p>Start the Hadoop DFS daemons by running <code>${HADOOP_HOME}/bin/start-dfs.sh</code>
(and stop them later with <code>stop-dfs.sh</code>).
You can ensure DFS started properly by putting and getting files in the Hadoop filesystem.
HBase does not normally use the MapReduce daemons; these do not need to be started.</p>
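<p>For example, a quick smoke test of the DFS might look like the following
(the file name is illustrative):</p>
<blockquote>
<pre>
${HADOOP_HOME}/bin/start-dfs.sh
${HADOOP_HOME}/bin/hadoop fs -put ${HADOOP_HOME}/conf/hdfs-site.xml /smoke-test.xml
${HADOOP_HOME}/bin/hadoop fs -ls /
${HADOOP_HOME}/bin/hadoop fs -cat /smoke-test.xml
</pre>
</blockquote>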
| |
| <p>Start up your ZooKeeper cluster.</p> |
| |
| <p>Start HBase with the following command:</p> |
| <blockquote> |
| <pre>${HBASE_HOME}/bin/start-hbase.sh</pre> |
| </blockquote> |
| |
| <p>Once HBase has started, enter <code>${HBASE_HOME}/bin/hbase shell</code> to obtain a |
| shell against HBase from which you can execute commands. |
Type 'help' at the shell's prompt to get a list of commands.
| Test your running install by creating tables, inserting content, viewing content, and then dropping your tables. |
| For example:</p> |
| <blockquote> |
| <pre> |
| hbase> # Type "help" to see shell help screen |
| hbase> help |
| hbase> # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type |
| hbase> create "mylittletable", "mylittlecolumnfamily" |
hbase> # To see the schema for the "mylittletable" table you just created and its single "mylittlecolumnfamily", type
| hbase> describe "mylittletable" |
| hbase> # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do |
| hbase> put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v" |
| hbase> # To get the cell just added, do |
| hbase> get "mylittletable", "myrow" |
hbase> # To scan your new table, do
| hbase> scan "mylittletable" |
| </pre> |
| </blockquote> |
| |
| <p>To stop HBase, exit the HBase shell and enter:</p> |
| <blockquote> |
| <pre>${HBASE_HOME}/bin/stop-hbase.sh</pre> |
| </blockquote> |
| |
| <p>If you are running a distributed operation, be sure to wait until HBase has shut down completely |
| before stopping the Hadoop daemons.</p> |
| |
| <p>The default location for logs is <code>${HBASE_HOME}/logs</code>.</p> |
| |
<p>HBase also puts up a UI listing vital attributes. By default it is deployed on the master host
at port 60010 (HBase RegionServers listen on port 60020 by default and put up an informational
HTTP server at port 60030).</p>
| |
| <h2><a name="upgrading" >Upgrading</a></h2> |
<p>After installing a new HBase on top of data written by a previous HBase version, before
starting your cluster, run the <code>${HBASE_HOME}/bin/hbase migrate</code> migration script.
It will make any adjustments to the filesystem data under <code>hbase.rootdir</code> that are necessary to run
the new HBase version. It does not change your install unless you explicitly ask it to.</p>
| |
| <h2><a name="client_example">Example API Usage</a></h2> |
| <p>For sample Java code, see <a href="org/apache/hadoop/hbase/client/package-summary.html#package_description">org.apache.hadoop.hbase.client</a> documentation.</p> |
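<p>As a minimal sketch, assuming the Get/Put client API introduced in HBase 0.20 (class and method
names are those documented in the client package linked above; the table and column family names
follow the shell example earlier in this document):</p>
<blockquote>
<pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MyLittleHBaseClient {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml (and hbase-default.xml) from the CLASSPATH.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "mylittletable");

    // Put a value "v" into row "myrow", column "mylittlecolumnfamily:x".
    Put put = new Put(Bytes.toBytes("myrow"));
    put.add(Bytes.toBytes("mylittlecolumnfamily"), Bytes.toBytes("x"), Bytes.toBytes("v"));
    table.put(put);

    // Read the cell back.
    Get get = new Get(Bytes.toBytes("myrow"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("mylittlecolumnfamily"), Bytes.toBytes("x"));
    System.out.println(Bytes.toString(value));
  }
}
</pre>
</blockquote>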
| |
| <p>If your client is NOT Java, consider the Thrift or REST libraries.</p> |
| |
| <h2><a name="related" >Related Documentation</a></h2> |
| <ul> |
| <li><a href="http://hbase.org">HBase Home Page</a> |
| </li> |
| <li><a href="http://wiki.apache.org/hadoop/Hbase">HBase Wiki</a> |
| </li> |
| <li><a href="http://hadoop.apache.org/">Hadoop Home Page</a> |
| </li> |
| <li><a href="http://wiki.apache.org/hadoop/Hbase/MultipleMasters">Setting up Multiple HBase Masters</a> |
| </li> |
| <li><a href="http://wiki.apache.org/hadoop/Hbase/RollingRestart">Rolling Upgrades</a> |
| </li> |
| <li><a href="org/apache/hadoop/hbase/client/transactional/package-summary.html#package_description">Transactional HBase</a> |
| </li> |
| <li><a href="org/apache/hadoop/hbase/client/tableindexed/package-summary.html">Table Indexed HBase</a> |
| </li> |
| <li><a href="org/apache/hadoop/hbase/stargate/package-summary.html#package_description">Stargate</a> -- a RESTful Web service front end for HBase. |
| </li> |
| </ul> |
| |
| </body> |
| </html> |