blob: 5179a5e91b4ce4503c7a1a5877be7b90049e4241 [file] [log] [blame]
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Overview </title>
</head>
<body>
<h1>Overview </h1>
<div id="front-matter">
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#HCatalog">HCatalog </a>
</li>
<li>
<a href="#HCatalog+Architecture">HCatalog Architecture</a>
<ul class="minitoc">
<li>
<a href="#Interfaces">Interfaces</a>
</li>
<li>
<a href="#Data+Model">Data Model</a>
</li>
</ul>
</li>
<li>
<a href="#Data+Flow+Example">Data Flow Example</a>
</li>
</ul>
</div>
</div>
<a name="HCatalog"></a>
<h2 class="h3">HCatalog </h2>
<div class="section">
<p>HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools &ndash; Pig, MapReduce, and Hive &ndash; to more easily read and write data on the grid. HCatalog&rsquo;s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored &ndash; RCFile format, text files, or SequenceFiles. </p>
<p>HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.</p>
<p></p>
<a name="HCatalog+Architecture"></a>
<h2 class="h3">HCatalog Architecture</h2>
<div class="section">
<p>HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands.</p>
<p></p>
<a name="Interfaces"></a>
<h3 class="h4">Interfaces</h3>
<p>The HCatalog interface for Pig consists of HCatLoader and HCatStorer, which implement the Pig load and store interfaces respectively. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatLoader is implemented on top of HCatInputFormat and HCatStorer is implemented on top of HCatOutputFormat (see <a href="loadstore.html">HCatalog Load and Store</a>).</p>
<p>HCatInputFormat and HCatOutputFormat are HCatalog's interface for MapReduce; they implement Hadoop's InputFormat and OutputFormat, respectively. HCatInputFormat accepts a table to read data from and optionally a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. (See <a href="inputoutput.html">HCatalog Input and Output</a>.)</p>
<p>Note: There is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly.</p>
<p>Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, drop tables, etc. (Unsupported Hive DDL includes import/export, CREATE TABLE AS SELECT, ALTER TABLE options REBUILD and CONCATENATE, and ANALYZE TABLE ... COMPUTE STATISTICS.) The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the <a href="cli.html">HCatalog Command Line Interface</a>).</p>
<a name="Data+Model"></a>
<h3 class="h4">Data Model</h3>
<p>HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.</p>
<p>Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see <a href="loadstore.html">HCatalog Load and Store</a>). </p>
</div>
<a name="Data+Flow+Example"></a>
<h2 class="h3">Data Flow Example</h2>
<div class="section">
<p>This simple data flow example shows how HCatalog can help grid users share and access data.</p>
<p>
<strong>First</strong> Joe in data acquisition uses distcp to get data onto the grid.</p>
<pre class="code">
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
</pre>
<p>
<strong>Second</strong> Sally in data processing uses Pig to cleanse and prepare the data.</p>
<p>Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.</p>
<pre class="code">
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, &hellip;);
B = filter A by bot_finder(zeta) = 0;
&hellip;
store Z into 'data/processedevents/20100819/data';
</pre>
<p>With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started.</p>
<pre class="code">
A = load 'rawevents' using HCatLoader;
B = filter A by date == '20100819' and by bot_finder(zeta) = 0;
&hellip;
store Z into 'processedevents' using HCatStorer("date=20100819");
</pre>
<p>
<strong>Third</strong> Robert in client management uses Hive to analyze his clients' results.</p>
<p>Without HCatalog, Robert must alter the table to add the required partition. </p>
<pre class="code">
alter table processedevents add partition 20100819 hdfs://data/processedevents/20100819/data
select advertiser_id, count(clicks)
from processedevents
where date = '20100819'
group by advertiser_id;
</pre>
<p>With HCatalog, Robert does not need to modify the table structure.</p>
<pre class="code">
select advertiser_id, count(clicks)
from processedevents
where date = &lsquo;20100819&rsquo;
group by advertiser_id;
</pre>
</div>
<div class="copyright">
Copyright &copy;
2012 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>
</div>
</div>
</body>
</html>