<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<document>
<header>
<title>Gora Tutorial</title>
</header>
<body>
<p><hr/> <b>Author :</b> Enis Söztutar, enis [at] apache [dot] org<hr/> </p>
<section>
<title>Introduction</title>
<p>This is the official tutorial for Apache Gora. For this tutorial, we
will be implementing a system to store our web server logs in Apache HBase,
analyze them using Apache Hadoop, and store the results in either HSQLDB or MySQL.</p>
<p> In this tutorial we will first look at how to set up the environment and
configure Gora and the data stores. Later, we will go over the data we will use and
define the data beans that will be used to interact with the persistency layer.
Next, we will go over the API of Gora to do some basic tasks such as storing objects,
fetching and querying objects, and deleting objects. Last, we will go over an example
program which uses Hadoop MapReduce to analyze the web server logs, and discuss the Gora
MapReduce API in some detail. </p>
<section>
<title>Introduction to Gora</title>
<p> The Apache Gora open source framework provides an in-memory data
model and persistence for big data. Gora supports persisting to
column stores, key value stores, document stores and RDBMSs, and
analyzing the data with extensive Apache Hadoop MapReduce support. Gora uses
Apache Avro to define its data beans: in Avro, the
beans to hold the data and RPC interfaces are defined using a JSON
schema. To map the data beans to data store specific settings,
Gora depends on mapping files, which are specific to each data store.
Unlike other ORM implementations, in Gora the mapping from data bean to data store
specific schema is explicit. This has the advantage that,
when using data models such as HBase and Cassandra, you can always
know how the values are persisted. </p>
<p>Gora has a modular architecture. Each of the data stores in Gora
has its own module, such as <code>gora-hbase, gora-cassandra</code>,
and <code>gora-sql</code>. In your projects, you only need to include
the artifacts from the modules you use. You can consult the <a href="quickstart.html#Setting+up+your+project">
Setting up your project</a> section in the quick start guide.</p>
</section>
</section>
<section>
<title>Setting up the environment</title>
<section>
<title>Setting up Gora</title>
<p>As a first step, we need to download and compile the Gora source code. The source code
for the tutorial is in the <code>gora-tutorial</code> module. If you have
not already downloaded Gora, please go
over the steps in the <a href="site:quickstart">Quick Start</a> guide for
how to download and compile Gora. </p>
<p>
Now that the source code for Gora is at hand, let's have a look at the files under the
directory <code>gora-tutorial</code>. </p>
<p>
<code>$ cd gora-tutorial</code><br/>
<code>$ tree</code><br/>
<source>
|-- build.xml
|-- conf
|   |-- gora-hbase-mapping.xml
|   |-- gora-sql-mapping.xml
|   `-- gora.properties
|-- ivy
|   `-- ivy.xml
`-- src
    |-- examples
    |   `-- java
    |-- main
    |   |-- avro
    |   |   |-- metricdatum.json
    |   |   `-- pageview.json
    |   |-- java
    |   |   `-- org
    |   |       `-- apache
    |   |           `-- gora
    |   |               `-- tutorial
    |   |                   `-- log
    |   |                       |-- KeyValueWritable.java
    |   |                       |-- LogAnalytics.java
    |   |                       |-- LogManager.java
    |   |                       |-- TextLong.java
    |   |                       `-- generated
    |   |                           |-- MetricDatum.java
    |   |                           `-- Pageview.java
    |   `-- resources
    |       `-- access.log.tar.gz
    `-- test
        |-- conf
        `-- java
</source>
</p>
<p>Since gora-tutorial is a top level module of Gora, it depends on the directory
structure imposed by Gora's main build scripts (<code>build.xml</code> and
<code>build-common.xml</code> with Ivy and pom.xml for Maven). The Java source code resides in directory <code>
src/main/java/</code>, Avro schemas in <code>src/main/avro/</code>, and data in
<code>src/main/resources/</code>.</p>
</section>
<section>
<title>Setting up HBase</title>
<p> For this tutorial we will be using <a href="ext:hbase"> HBase</a> to
store the logs. For those of you not familiar with HBase, it is a NoSQL
column store with an architecture very similar to Google's BigTable. </p>
<!-- TODO: Tutorial for SQL and Cassandra -->
<p> If you don't already have HBase set up, you can go over the steps in the
<a href="http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#overview_description"> HBase Overview </a>
documentation. Although Gora aims to support the most recent HBase versions, the above tutorial is
specifically for HBase 0.20.6 (don't worry, the principles are the same), so download a version from
<a href="http://hbase.apache.org/releases.html">HBase releases</a>. After extracting
the file, cd to the hbase-${dist} directory and start the HBase server. </p>
<p><code>$ bin/start-hbase.sh</code> </p>
<p> and make sure that HBase is available by using the HBase shell. </p>
<p><code>$ bin/hbase shell</code> </p>
</section>
<section>
<title>Configuring Gora</title>
<p> Gora is configured through a file in the classpath named <code>gora.properties</code>.
We will be using the following file, <code>gora-tutorial/conf/gora.properties</code>: </p>
<p><source>
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
gora.datastore.autocreateschema=true
</source></p>
<p> This file states that the default store will be <code>HBaseStore</code>,
and that schemas (tables) should be created automatically. </p>
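<p> As a rough sketch of what such a file amounts to, the snippet below reads the same two keys
with the standard <code>java.util.Properties</code> class. The class name
<code>GoraPropertiesSketch</code> is hypothetical, and the text is parsed from a string to keep the
example self-contained; how Gora itself loads the file from the classpath may differ. </p>
<p><source><![CDATA[
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of Gora: shows the plain key=value nature of gora.properties.
class GoraPropertiesSketch {
  /** Parses properties text; a real project would read the file from the classpath instead. */
  static Properties load(String text) throws IOException {
    Properties props = new Properties();
    props.load(new StringReader(text));
    return props;
  }

  public static void main(String[] args) throws IOException {
    String text = "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n"
        + "gora.datastore.autocreateschema=true\n";
    Properties props = load(text);
    // The default data store class name, as a string
    System.out.println(props.getProperty("gora.datastore.default"));
    // The auto-create flag, converted to a boolean
    System.out.println(Boolean.parseBoolean(props.getProperty("gora.datastore.autocreateschema")));
  }
}
]]></source></p>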
<p> More information for configuring different settings in gora.properties
can be found <a href="site:gora-conf"> here </a>. </p>
</section>
</section>
<section>
<title> Modelling the data </title>
<section>
<title>Data for the tutorial</title>
<p>For this tutorial, we will be parsing and storing the logs of a web server.
Some example logs are at <code>src/main/resources/access.log.tar.gz</code>, which
belong to the (now shut down) server at http://www.buldinle.com/. The example logs contain 10,000 lines, dated between 2009/03/10 and 2009/03/15. <br/>
The first thing we need to do is to extract the logs. </p>
<p><code>$ tar zxvf src/main/resources/access.log.tar.gz -C src/main/resources/</code></p>
<p> You can also use your own log files, given that the log
format is <a href="http://httpd.apache.org/docs/current/logs.html">
Combined Log Format</a>. Some example lines from the log are: </p>
<code>88.254.190.73 - - [10/Mar/2009:20:40:26 +0200] "GET / HTTP/1.1" 200 43 "http://www.buldinle.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB5; .NET CLR 2.0.50727; InfoPath.2)"</code><br/>
<code>78.179.56.27 - - [11/Mar/2009:00:07:40 +0200] "GET /index.php?i=3&amp;a=1__6x39kovbji8&amp;k=3750105 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?i=3&amp;a=1__6X39Kovbji8&amp;k=3750105" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)"</code><br/>
<code>78.163.99.14 - - [12/Mar/2009:18:18:25 +0200] "GET /index.php?a=3__x7l72c&amp;k=4476881 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?a=3__x7l72c&amp;k=4476881" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1)"</code><br/>
<p>The fields, in order, are: the user's IP address, two ignored fields, date and
time, HTTP method, URL, HTTP protocol version, HTTP status code, number of bytes
returned, referrer, and user agent.</p>
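<p> To make the field order concrete, here is a minimal, self-contained sketch that splits one
Combined Log Format line with a regular expression. It is an illustration only: the class name
<code>CombinedLogLineSketch</code> and the layout of the returned array are assumptions, and the
tutorial's own <code>LogManager</code> parses lines differently (with a <code>StringTokenizer</code>). </p>
<p><source><![CDATA[
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based parser; not the tutorial's parseLine() implementation.
class CombinedLogLineSketch {
  // ip, ignored, ignored, [date], "method url protocol", status, bytes, "referrer", "user agent"
  private static final Pattern LINE = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+|-) "
      + "\"([^\"]*)\" \"([^\"]*)\"$");
  private static final DateTimeFormatter DATE =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

  /** Returns {ip, epochMillis, method, url, status, bytes, referrer, userAgent}, or null if no match. */
  static String[] parse(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.matches()) {
      return null;
    }
    // Convert the bracketed date (with its UTC offset) to milliseconds since the epoch
    long millis = OffsetDateTime.parse(m.group(2), DATE).toInstant().toEpochMilli();
    return new String[] {
        m.group(1), Long.toString(millis), m.group(3), m.group(4),
        m.group(6), m.group(7), m.group(8), m.group(9)};
  }

  public static void main(String[] args) {
    String line = "88.254.190.73 - - [10/Mar/2009:20:40:26 +0200] \"GET / HTTP/1.1\" 200 43 "
        + "\"http://www.buldinle.com/\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\"";
    for (String field : parse(line)) {
      System.out.println(field);
    }
  }
}
]]></source></p>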
</section>
<section>
<title>Defining data beans</title>
<p> Data beans are the main way to hold the data in memory and persist it in Gora. Gora
needs to explicitly keep track of the status of the data in memory, so
we use <a href="ext:avro">Apache Avro</a> for defining the beans. Using
Avro gives us the possibility to explicitly keep track of an object's persistency state,
and a way to serialize an object's data. </p>
<p>Defining data beans is a very easy task, but for the exact syntax, please
consult the <a href="ext:avrospec"> Avro Specification</a>.</p>
<p> First, we need to define the bean <b><code>Pageview</code></b> to hold a
single URL access in the logs. Let's go over the class at <code> src/main/avro/pageview.json </code>
</p>
<p>
<source>
{
"type": "record",
"name": "Pageview",
"namespace": "org.apache.gora.tutorial.log.generated",
"fields" : [
{"name": "url", "type": "string"},
{"name": "timestamp", "type": "long"},
{"name": "ip", "type": "string"},
{"name": "httpMethod", "type": "string"},
{"name": "httpStatusCode", "type": "int"},
{"name": "responseSize", "type": "int"},
{"name": "referrer", "type": "string"},
{"name": "userAgent", "type": "string"}
]
}
</source>
</p>
<p>Avro schemas are declared in JSON.
<a href="http://avro.apache.org/docs/current/spec.html#schema_record">
Records</a> are defined with type
<code>"record"</code>, with a name as the name of the class, and a
namespace which is mapped to the package name in Java. The fields
are listed in the <code>"fields"</code> element. Each field is given
with its type. </p>
</section>
<section>
<title>Compiling Avro Schemas</title>
<p>The next step after defining the data beans is to compile the schemas
into Java classes. For that we will use <code>GoraCompiler</code>.
Invoking the Gora compiler from the Gora top level directory with </p>
<p>
<code>
$ bin/gora compile
</code>
</p> <p>results in:</p>
<p>
<code>
$ Usage: SpecificCompiler &lt;schema file&gt; &lt;output dir&gt;
</code>
</p> <p>so we will issue:</p>
<p>
<code>
$ bin/gora compile gora-tutorial/src/main/avro/pageview.json gora-tutorial/src/main/java/
</code>
</p>
<p>to compile the Pageview class into
<code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/generated/Pageview.java</code>.
However, the tutorial Java classes are already committed, so you do not need to do that
now. </p>
<p> The Gora compiler extends Avro's <code>SpecificCompiler</code> to convert a JSON definition
into a Java class. Generated classes implement
the <a href="ext:api/org/apache/gora/persistency/persistent">Persistent</a> interface.
Most of the methods of the <code>Persistent</code> interface deal with bookkeeping for
persistence, and state tracking, so most of the time they are not used explicitly by the
user. Now, let's look at the internals of the generated class <code>Pageview.java</code>.
</p>
<p>
<source>
public class Pageview extends PersistentBase {
private Utf8 url;
private long timestamp;
private Utf8 ip;
private Utf8 httpMethod;
private int httpStatusCode;
private int responseSize;
private Utf8 referrer;
private Utf8 userAgent;
...
public static final Schema _SCHEMA = Schema.parse("{\"type\":\"record\", ... ");
public static enum Field {
URL(0,"url"),
TIMESTAMP(1,"timestamp"),
IP(2,"ip"),
HTTP_METHOD(3,"httpMethod"),
HTTP_STATUS_CODE(4,"httpStatusCode"),
RESPONSE_SIZE(5,"responseSize"),
REFERRER(6,"referrer"),
USER_AGENT(7,"userAgent"),
;
private int index;
private String name;
Field(int index, String name) {this.index=index;this.name=name;}
public int getIndex() {return index;}
public String getName() {return name;}
public String toString() {return name;}
};
public static final String[] _ALL_FIELDS = {"url","timestamp","ip","httpMethod"
,"httpStatusCode","responseSize","referrer","userAgent",};
...
}
</source>
</p>
<p> We can see the actual field declarations in the class. Note that Avro uses the <code>Utf8</code>
class as a placeholder for string fields. We can also see the embedded Avro
schema declaration and an inner enum named <code>Field</code>. This enum and
the <code>_ALL_FIELDS</code> field will come in handy when we use them
to query the datastore for specific fields.
</p>
</section>
<section>
<title>Defining data store mappings</title>
<p>Gora is designed to flexibly work with various types of data modeling,
including column stores (such as HBase, Cassandra, etc.), SQL databases, flat files (binary,
JSON, XML encoded), and key-value stores. The mapping between the data bean and
the data store is thus defined in XML mapping files. Each data store has its own
mapping format, so that data-store specific settings can be leveraged more easily.
The mapping files declare how the fields of the classes declared in Avro schemas
are serialized and persisted to the data store.</p>
<section>
<title> HBase mappings </title>
<p> HBase mappings are stored in a file named <code>gora-hbase-mapping.xml</code>.
For this tutorial we will be using the file <code>gora-tutorial/conf/gora-hbase-mapping.xml</code>.</p>
<!-- This is gora-sql-mapping.xml
<source>
&lt;gora-orm&gt;
&lt;class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog"&gt;
&lt;primarykey column="line"/&gt;
&lt;field name="url" column="url" length="512" primarykey="true"/&gt;
&lt;field name="timestamp" column="timestamp"/&gt;
&lt;field name="ip" column="ip" length="16"/&gt;
&lt;field name="httpMethod" column="httpMethod" length="6"/&gt;
&lt;field name="httpStatusCode" column="httpStatusCode"/&gt;
&lt;field name="responseSize" column="responseSize"/&gt;
&lt;field name="referrer" column="referrer" length="512"/&gt;
&lt;field name="userAgent" column="userAgent" length="512"/&gt;
&lt;/class&gt;
...
&lt;/gora-orm&gt;
</source>
-->
<p><source>
&lt;gora-orm&gt;
&lt;table name="AccessLog"&gt; &lt;!-- optional descriptors for tables --&gt;
&lt;family name="common"/&gt; &lt;!-- This can also have params like compression, bloom filters --&gt;
&lt;family name="http"/&gt;
&lt;family name="misc"/&gt;
&lt;/table&gt;
&lt;class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog"&gt;
&lt;field name="url" family="common" qualifier="url"/&gt;
&lt;field name="timestamp" family="common" qualifier="timestamp"/&gt;
&lt;field name="ip" family="common" qualifier="ip" /&gt;
&lt;field name="httpMethod" family="http" qualifier="httpMethod"/&gt;
&lt;field name="httpStatusCode" family="http" qualifier="httpStatusCode"/&gt;
&lt;field name="responseSize" family="http" qualifier="responseSize"/&gt;
&lt;field name="referrer" family="misc" qualifier="referrer"/&gt;
&lt;field name="userAgent" family="misc" qualifier="userAgent"/&gt;
&lt;/class&gt;
...
&lt;/gora-orm&gt;
</source> </p>
<p>
Every mapping file starts with the top level element <code>&lt;gora-orm&gt;</code>.
Gora HBase mapping files can have two types of child elements, <code>table</code> and
<code>class</code> declarations. All of the table and class definitions should be
listed at this level.</p>
<p>The <code>table</code> declaration is optional and most of the time, Gora infers the table
declaration from the <code>class</code> sub elements. However, some of the HBase
specific table configuration such as compression, blockCache, etc. can be given here,
if Gora is used to auto-create the tables. The exact syntax for the file can be found
<a href="gora-hbase.html#Gora+HBase+mappings">here</a>.</p>
<p>In Gora, data store access is always
done in a key-value data model, since most of the target backends support this model.
DataStore API expects to know the class names of the key and persistent classes, so that
they can be instantiated. The key value pair is declared in the <code>class</code> element.
The <code>name</code> attribute is the fully qualified name of the class,
and the <code>keyClass</code> attribute is
the fully qualified class name of the key class. </p>
<p>Children of the <code>&lt;class&gt;</code> element are <code>&lt;field&gt;</code>
elements. Each field element has a <code>name</code> and a <code>family</code> attribute, and
an optional <code>qualifier</code> attribute. The <code>name</code> attribute contains the name
of the field in the persistent class, and <code>family</code> declares the column family
of the HBase data model. If the qualifier is not given, the name of the field is used
as the column qualifier. Note that map and array type fields are stored in unique column
families, so the configuration should list a unique column family for each map and
array type field, and no qualifier should be given. The exact data model is discussed further
at the <a href="site:gora-hbase">gora-hbase documentation</a>. </p>
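<p> The qualifier fallback described above can be sketched in a few lines. This is not Gora's
mapping code; <code>HBaseColumnNameSketch</code> and its <code>column()</code> helper are
hypothetical names, used only to show how a field's <code>family</code> and optional
<code>qualifier</code> could combine into the <code>family:qualifier</code> column names that the
HBase shell displays. </p>
<p><source><![CDATA[
// Hypothetical illustration of the mapping rule; not taken from the gora-hbase module.
class HBaseColumnNameSketch {
  /** Builds "family:qualifier"; the field name is used when no qualifier is given. */
  static String column(String fieldName, String family, String qualifier) {
    String effective = (qualifier == null || qualifier.isEmpty()) ? fieldName : qualifier;
    return family + ":" + effective;
  }

  public static void main(String[] args) {
    // Explicit qualifier, as in the mapping file above
    System.out.println(column("url", "common", "url"));
    // Omitted qualifier: falls back to the field name
    System.out.println(column("httpMethod", "http", null));
  }
}
]]></source></p>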
</section>
</section>
</section>
<section>
<title> Basic API </title>
<section>
<title>Parsing the logs</title>
<p> Now that we have the basic setup, we can see the Gora API in action. As you will see below, the API
is pretty simple to use. We will be using the class <code>LogManager</code> (which is located at
<code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogManager.java</code>) for parsing
and storing the logs, deleting some lines and querying.</p>
<p> First of all, let us look at the constructor. The only real thing it does is to call the
<code>init()</code> method. <code>init()</code> method constructs the
<code>DataStore</code> instance so that it can be used by the <code>LogManager</code>'s methods.</p>
<p><source>
public LogManager() {
try {
init();
} catch (IOException ex) {
throw new RuntimeException(ex);
}
}
private void init() throws IOException {
dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class);
}
</source></p>
<p> <a href="ext:api/org/apache/gora/store/datastore">DataStore</a> is probably the most important
class in the Gora API. <code>DataStore</code> handles actual object persistence. Objects can be persisted,
fetched, queried or deleted by the DataStore methods. Every data store that Gora supports defines its own
implementation of DataStore. For example, the <code>gora-hbase</code> module defines <code>HBaseStore</code>, and
the <code>gora-sql</code> module defines <code>SqlStore</code>. However, these implementations are not explicitly
used by the user. </p>
<p> DataStores always have associated key and value (persistent) classes. The key class is the class of the keys of the
data store, and the value class is the actual data bean's class. The value class is almost always generated from
Avro schema definitions using the Gora compiler. </p>
<p> Data store objects are created by <a href="ext:api/org/apache/gora/store/datastorefactory">DataStoreFactory</a>. It is necessary to
provide the key and value class. The datastore class is optional,
and if not specified it will be read from the configuration (gora.properties).</p>
<p> For this tutorial, we have already defined the Avro schema to use and compiled
our data bean into <code>Pageview</code> class. For keys in the data store, we will be using <code>Long</code>s.
The keys will hold the line of the pageview in the data file. </p>
<p>Next, let's look at the main function of the <code>LogManager</code> class.</p>
<p><source>
public static void main(String[] args) throws Exception {
if(args.length &lt; 2) {
System.err.println(USAGE);
System.exit(1);
}
LogManager manager = new LogManager();
if("-parse".equals(args[0])) {
manager.parse(args[1]);
} else if("-get".equals(args[0])) {
manager.get(Long.parseLong(args[1]));
} else if("-query".equals(args[0])) {
if(args.length == 2)
manager.query(Long.parseLong(args[1]));
else
manager.query(Long.parseLong(args[1]), Long.parseLong(args[2]));
} else if("-delete".equals(args[0])) {
manager.delete(Long.parseLong(args[1]));
} else if("-deleteByQuery".equalsIgnoreCase(args[0])) {
manager.deleteByQuery(Long.parseLong(args[1]), Long.parseLong(args[2]));
} else {
System.err.println(USAGE);
System.exit(1);
}
manager.close();
}
</source></p>
<p>We can use the example log manager program from the command line (in the top level Gora directory): </p>
<p><code>
$ bin/gora logmanager
</code></p>
<p> which lists the usage as: </p>
<p><source>
LogManager -parse &lt;input_log_file&gt;
-get &lt;lineNum&gt;
-query &lt;lineNum&gt;
-query &lt;startLineNum&gt; &lt;endLineNum&gt;
-delete &lt;lineNum&gt;
-deleteByQuery &lt;startLineNum&gt; &lt;endLineNum&gt;
</source></p>
<p> So to parse and store our logs located at <code>gora-tutorial/src/main/resources/access.log</code>, we will issue: </p>
<p><code>
$ bin/gora logmanager -parse gora-tutorial/src/main/resources/access.log
</code></p>
<p> This should output something like: </p>
<p><source>
10/09/30 18:30:17 INFO log.LogManager: Parsing file:gora-tutorial/src/main/resources/access.log
10/09/30 18:30:23 INFO log.LogManager: finished parsing file. Total number of log lines:10000
</source></p>
<p> Now, let's look at the code which parses the data and stores the logs. </p>
<p><source>
private void parse(String input) throws IOException, ParseException {
BufferedReader reader = new BufferedReader(new FileReader(input));
long lineCount = 0;
try {
String line;
//read until the end of the file; an empty file is handled without a null line
while((line = reader.readLine()) != null) {
Pageview pageview = parseLine(line);
if(pageview != null) {
//store the pageview
storePageview(lineCount++, pageview);
}
}
} finally {
reader.close();
}
}
</source></p>
<p> The file is iterated line by line. Notice that the <code>parseLine(line)</code>
function does the actual parsing, converting the string to a <code>Pageview</code> object
defined earlier. </p>
<p><source>
private Pageview parseLine(String line) throws ParseException {
StringTokenizer matcher = new StringTokenizer(line);
//parse the log line
String ip = matcher.nextToken();
...
//construct and return pageview object
Pageview pageview = new Pageview();
pageview.setIp(new Utf8(ip));
pageview.setTimestamp(timestamp);
...
return pageview;
}
</source></p>
<p><code>parseLine()</code> uses standard <code>StringTokenizer</code>s for the job
and constructs and returns a <code>Pageview</code> object.</p>
</section>
<section>
<title>Storing objects in the DataStore</title>
<p> If we look back at the <code>parse()</code> method above, we can see that the
<code>Pageview</code> objects returned by <code>parseLine() </code> are stored via
<code>storePageview()</code> method. </p>
<p> The <code>storePageview()</code> method is where the magic happens, but if we look at the code,
we can see that it is dead simple. </p>
<p><source>
/** Stores the pageview object with the given key */
private void storePageview(long key, Pageview pageview) throws IOException {
dataStore.put(key, pageview);
}
</source></p>
<p> All we need to do is to call the <a href="ext:api/org/apache/gora/store/datastore/put">
put()</a> method, which expects a long as key and an instance of <code>Pageview</code>
as a value.</p>
</section>
<section>
<title> Closing the DataStore</title>
<p> <code>DataStore</code> implementations can do a lot of caching for performance.
However, this means that data is not always flushed to persistent storage right away.
So once we have finished storing objects, we need to close the datastore
instance by calling its <a href="ext:api/org/apache/gora/store/datastore/close">close()</a> method.
LogManager always closes its datastore in its own <code>close()</code> method. </p>
<p><source>
private void close() throws IOException {
//It is very important to close the datastore properly, otherwise
//some data loss might occur.
if(dataStore != null)
dataStore.close();
}
</source></p>
<p>If you are pushing a lot of data, or if you want your data to be accessible before closing
the data store, you can also use the <a href="ext:api/org/apache/gora/store/datastore/flush">flush()</a>
method which, as expected, flushes the data to the underlying data store. However, the actual flush
semantics can vary by the data store backend. For example, in SQL, flush calls <code>commit()</code>
on the JDBC <code>Connection</code> object, whereas in HBase, <code>HTable#flushCommits()</code> is called.
Also note that even if you call <code>flush()</code> at the end of all data manipulation operations,
you still need to call <code>close()</code> on the datastore.
</p>
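<p> The reason <code>close()</code> matters can be illustrated with a toy write buffer. The sketch
below is a stand-in built on assumptions, not a Gora <code>DataStore</code>:
<code>BufferingStoreSketch</code> is a hypothetical class that holds writes in memory and only
copies them to its backing map on <code>flush()</code> or <code>close()</code>. </p>
<p><source><![CDATA[
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy store; real Gora data stores buffer and flush in backend-specific ways.
class BufferingStoreSketch implements AutoCloseable {
  private final Map<Long, String> backend;                        // stands in for persistent storage
  private final Map<Long, String> buffer = new LinkedHashMap<>(); // pending, unflushed writes

  BufferingStoreSketch(Map<Long, String> backend) {
    this.backend = backend;
  }

  /** Buffers a write; nothing reaches the backend yet. */
  void put(long key, String value) {
    buffer.put(key, value);
  }

  /** Pushes all pending writes to the backend. */
  void flush() {
    backend.putAll(buffer);
    buffer.clear();
  }

  /** Closing flushes whatever is still pending. */
  @Override
  public void close() {
    flush();
  }

  public static void main(String[] args) {
    Map<Long, String> backend = new TreeMap<>();
    try (BufferingStoreSketch store = new BufferingStoreSketch(backend)) {
      store.put(0L, "first log line");
      System.out.println("before close: " + backend.size());
    }
    System.out.println("after close: " + backend.size());
  }
}
]]></source></p>
<p> Skipping the <code>close()</code> call in this sketch would leave the write in the buffer,
which is the kind of data loss the comment in <code>LogManager</code> warns about. </p>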
</section>
<section>
<title>Persisted data in HBase</title>
<p>Now that we have stored the web access log data in HBase, we can look at
how the data is stored in HBase. For that, start the HBase shell.</p>
<p><code>$ cd ../hbase-0.20.6</code></p>
<p><code>$ bin/hbase shell</code></p>
<p> If you have a fresh HBase installation, there should be one table.</p>
<p><code>hbase(main):010:0> list</code></p>
<p><source>
AccessLog
1 row(s) in 0.0470 seconds
</source></p>
<p> Remember that AccessLog is the name of the table we specified in
<code>gora-hbase-mapping.xml</code>. Looking at the contents of the table: </p>
<p><code>hbase(main):010:0> scan 'AccessLog', {LIMIT=>1}</code></p>
<p><source>
ROW COLUMN+CELL
\x00\x00\x00\x00\x00\x00\x0 column=common:ip, timestamp=1285860617341, value=88.240.129.183
0\x00
\x00\x00\x00\x00\x00\x00\x0 column=common:timestamp, timestamp=1285860617341, value=\x00\x00\x01\x1F\xF1\xAEl
0\x00 P
\x00\x00\x00\x00\x00\x00\x0 column=common:url, timestamp=1285860617341, value=/index.php?a=1__wwv40pdxdpo&amp;k=2
0\x00 18978
\x00\x00\x00\x00\x00\x00\x0 column=http:httpMethod, timestamp=1285860617341, value=GET
0\x00
\x00\x00\x00\x00\x00\x00\x0 column=http:httpStatusCode, timestamp=1285860617341, value=\x00\x00\x00\xC8
0\x00
\x00\x00\x00\x00\x00\x00\x0 column=http:responseSize, timestamp=1285860617341, value=\x00\x00\x00+
0\x00
\x00\x00\x00\x00\x00\x00\x0 column=misc:referrer, timestamp=1285860617341, value=http://www.buldinle.com/inde
0\x00 x.php?a=1__WWV40pdxdpo&amp;k=218978
\x00\x00\x00\x00\x00\x00\x0 column=misc:userAgent, timestamp=1285860617341, value=Mozilla/4.0 (compatible; MS
0\x00 IE 6.0; Windows NT 5.1)
</source></p>
<p>The output shows all the columns matching the first line with key 0. We can see
the columns <code>common:ip, common:timestamp, common:url, </code> etc. Remember that
these are the columns that we have described in the <code>gora-hbase-mapping.xml</code>
file. </p>
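<p> The row keys and the integer values in the scan output look the way they do because numbers
are stored as fixed-width binary rather than text. The sketch below assumes big-endian encoding
(which is consistent with the output above) and uses only <code>java.nio.ByteBuffer</code>; the
class name <code>RowKeyBytesSketch</code> and the <code>\xNN</code> formatting helper are
illustrative. Note that the HBase shell prints printable bytes as plain characters, so
<code>\x2B</code> appears there as <code>+</code>. </p>
<p><source><![CDATA[
import java.nio.ByteBuffer;

// Hypothetical illustration of fixed-width big-endian encoding; not Gora or HBase code.
class RowKeyBytesSketch {
  /** Encodes a long as 8 bytes; ByteBuffer is big-endian by default. */
  static byte[] longBytes(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
  }

  /** Encodes an int as 4 bytes, big-endian. */
  static byte[] intBytes(int value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
  }

  /** Formats every byte as \xNN (the shell shows printable bytes as characters instead). */
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("\\x%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(hex(longBytes(0L)));   // row key of the first line
    System.out.println(hex(intBytes(200)));   // httpStatusCode 200
    System.out.println(hex(intBytes(43)));    // responseSize 43
  }
}
]]></source></p>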
<p> You can also count the number of entries in the table to make sure that all the records
have been stored.</p>
<p><code>hbase(main):010:0> count 'AccessLog'</code></p>
<p><source>
...
10000 row(s) in 1.0580 seconds
</source></p>
</section>
<section>
<title>Fetching objects from data store</title>
<p> Fetching objects from the data store is as easy as storing them. There are essentially
two methods for fetching objects. The first is to fetch a single object given its key. The
second is to run a query through the data store. </p>
<p>To fetch objects one by one, we can use one of the overloaded
<a href="ext:api/org/apache/gora/store/datastore/get">get()</a> methods.
The method with signature <code>get(K key)</code> returns the object corresponding to the given key fetching all the
fields. On the other hand <code>get(K key, String[] fields) </code> returns the object corresponding to the
given key, but fetching only the fields given as the second argument.</p>
<p>When run with the <code>-get</code> argument, the <code>LogManager</code> class fetches the pageview object
from the data store and prints the results. </p>
<p><source>
/** Fetches a single pageview object and prints it*/
private void get(long key) throws IOException {
Pageview pageview = dataStore.get(key);
printPageview(pageview);
}
</source></p>
<p> To display the 42nd line of the access log : </p>
<p><code>$ bin/gora logmanager -get 42 </code></p>
<p><source>
org.apache.gora.tutorial.log.generated.Pageview@321ce053 {
"url":"/index.php?i=0&amp;a=1__rntjt9z0q9w&amp;k=398179"
"timestamp":"1236710649000"
"ip":"88.240.129.183"
"httpMethod":"GET"
"httpStatusCode":"200"
"responseSize":"43"
"referrer":"http://www.buldinle.com/index.php?i=0&amp;a=1__RnTjT9z0Q9w&amp;k=398179"
"userAgent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
}
</source></p>
</section>
<section>
<title> Querying objects </title>
<p> The DataStore API defines a <a href="ext:api/org/apache/gora/query/query">Query</a>
interface to query the objects in the data store. Each data store implementation
can use a specific implementation of the <code>Query</code> interface. Queries are
instantiated by calling <a href="ext:api/org/apache/gora/store/datastore/newquery">
DataStore#newQuery()</a>. When the query is run through the datastore, the results
are returned via the <a href="ext:api/org/apache/gora/query/result"> Result</a>
interface. Let's see how we can run a query and display the results below in
the LogManager class. </p>
<p><source>
/** Queries and prints pageview objects that have keys between startKey and endKey*/
private void query(long startKey, long endKey) throws IOException {
Query&lt;Long, Pageview&gt; query = dataStore.newQuery();
//set the properties of query
query.setStartKey(startKey);
query.setEndKey(endKey);
Result&lt;Long, Pageview&gt; result = query.execute();
printResult(result);
}
</source> </p>
<p> After constructing a <a href="ext:api/org/apache/gora/query/query">Query</a>, its properties
are set via the setter methods. Then calling
<a href="ext:api/org/apache/gora/query/query/execute">query.execute()</a> returns
the Result object.</p>
<p> The <a href="ext:api/org/apache/gora/query/result"> Result</a> interface allows us to
iterate the results one by one by calling the <a href="ext:api/org/apache/gora/query/result/next">
next()</a> method. The <a href="ext:api/org/apache/gora/query/result/getkey">
getKey()</a> method returns the current key and <a href="ext:api/org/apache/gora/query/result/get">
get()</a> returns the current persistent object. </p>
<p><source>
private void printResult(Result&lt;Long, Pageview&gt; result) throws IOException {
while(result.next()) { //advances the Result object and breaks if at end
long resultKey = result.getKey(); //obtain current key
Pageview resultPageview = result.get(); //obtain current value object
//print the results
System.out.println(resultKey + ":");
printPageview(resultPageview);
}
System.out.println("Number of pageviews from the query:" + result.getOffset());
}
</source> </p>
<p>With these functions defined, we can run the <code>LogManager</code> class to query the
access logs in HBase. For example, to display the log records between lines 10 and 12
we can use </p>
<p><code>$ bin/gora logmanager -query 10 12 </code></p>
<p>which results in:</p>
<p> <source>
10:
org.apache.gora.tutorial.log.generated.Pageview@d38d0eaa {
"url":"/"
"timestamp":"1236710442000"
"ip":"144.122.180.55"
"httpMethod":"GET"
"httpStatusCode":"200"
"responseSize":"43"
"referrer":"http://buldinle.com/"
"userAgent":"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6"
}
11:
org.apache.gora.tutorial.log.generated.Pageview@b513110a {
"url":"/index.php?i=7&amp;a=1__gefuumyhl5c&amp;k=5143555"
"timestamp":"1236710453000"
"ip":"85.100.75.104"
"httpMethod":"GET"
"httpStatusCode":"200"
"responseSize":"43"
"referrer":"http://www.buldinle.com/index.php?i=7&amp;a=1__GeFUuMyHl5c&amp;k=5143555"
"userAgent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
}
</source></p>
</section>
<section>
<title>Deleting objects</title>
<p> Just like fetching objects, there are two main methods to delete
objects from the data store. The first one is to delete objects one by
one using the <a href="ext:api/org/apache/gora/store/datastore/delete">
DataStore#delete(K)</a> method, which takes the key of the object.
Alternatively we can delete all of the data that matches a given query by
calling the <a href="ext:api/org/apache/gora/store/datastore/deletebyquery">
DataStore#deleteByQuery(Query)</a> method. By using deleteByQuery, we can
do fine-grained deletes, for example deleting just a specific field
from several records. </p>
<p>Continuing with the LogManager class, the APIs for both are given below.</p>
<p> <source>
/**Deletes the pageview with the given line number */
private void delete(long lineNum) throws Exception {
dataStore.delete(lineNum);
dataStore.flush(); //write changes may need to be flushed before
//they are committed
}
/** This method illustrates delete by query call */
private void deleteByQuery(long startKey, long endKey) throws IOException {
//Constructs a query from the dataStore. The matching rows to this query will be deleted
Query&lt;Long, Pageview&gt; query = dataStore.newQuery();
//set the properties of query
query.setStartKey(startKey);
query.setEndKey(endKey);
dataStore.deleteByQuery(query);
}
</source></p>
<p>And from the command line: </p>
<p><code> bin/gora logmanager -delete 12 </code></p>
<p><code> bin/gora logmanager -deleteByQuery 40 50 </code></p>
</section>
</section>
<section>
<title>MapReduce Support</title>
<p>Gora has first class MapReduce support for <a href="ext:hadoop">Apache Hadoop</a>.
Gora data stores can be used as inputs and outputs of jobs. Moreover, the objects can
be serialized, and passed between tasks keeping their persistency state. For the
serialization, Gora extends Avro DatumWriters. </p>
<section>
<title> Log analytics in MapReduce </title>
<p> For this part of the tutorial, we will be analyzing the logs that have been
stored at HBase earlier. Specifically, we will develop a MapReduce program to
calculate the number of daily pageviews for each URL in the site. </p>
<p> We will be using the <code>LogAnalytics</code> class to analyze the logs, which can
be found at <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogAnalytics.java</code>.
For computing the analytics, the mapper takes in pageviews, and outputs tuples of
&lt;URL, timestamp&gt; pairs, with 1 as the value. The timestamp represents the day
in which the pageview occurred, so that the daily pageviews are accumulated.
The reducer just sums up the values, and outputs <code>MetricDatum</code> objects
to be sent to the output Gora data store.</p>
</section>
<section>
<title>Setting up the environment</title>
<p> We will be using the logs stored at HBase by the <code>LogManager</code> class.
We will push the output of the job to an HSQL database, since it has a zero conf
set up. However, you can also use MySQL or HBase for storing the analytics results.
If you want to continue with HBase, you can skip the next sections. </p>
<section>
<title> Setting up the database </title>
<p> First we need to download the HSQL dependencies. To do that, uncomment the following line
in <code>gora-tutorial/ivy/ivy.xml</code> (if you are using Maven, hsqldb should already be available).
Of course, MySQL users should uncomment the mysql dependency instead. </p>
<p><code>&lt;!--&lt;dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="*->default"/&gt;--&gt;
</code></p>
<p> Then we need to run ant so that the new dependencies can be downloaded. </p>
<p><code> $ ant </code></p>
<p> If you are using MySQL, you should also set up the database server, create the database,
and grant the permissions needed to create tables, etc., so that Gora can run properly. </p>
</section>
<section>
<title> Configuring Gora </title>
<p> We will put the configuration necessary to connect to the database to
<code>gora-tutorial/conf/gora.properties</code>. </p>
<p> <source>
#JDBC properties for gora-sql module using HSQL
gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/goratest
#JDBC properties for gora-sql module using MySQL
#gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
#gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/goratest
#gora.sqlstore.jdbc.user=root
#gora.sqlstore.jdbc.password=
</source></p>
<p> As expected, the <code>jdbc.driver</code> property is the JDBC driver class,
and <code>jdbc.url</code> is the JDBC connection URL. Moreover, <code>jdbc.user</code>
and <code>jdbc.password</code> can be specified if needed. More information about these
parameters can be found in the <a href="site:gora-sql">gora-sql</a> documentation. </p>
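<p> Since <code>gora.properties</code> uses the standard Java properties syntax, the effect of the commented-out
lines can be checked with plain Java. The following is a minimal sketch and not the way Gora itself loads its
configuration; the class name <code>JdbcSettings</code> and the <code>parse</code> helper are made up for
illustration. </p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: parses the same keys as gora.properties using java.util.Properties.
class JdbcSettings {
    /** Parses properties-file text held in memory. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        return props;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver",
            "gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/goratest",
            "#gora.sqlstore.jdbc.user=root");
        Properties props = parse(text);
        System.out.println(props.getProperty("gora.sqlstore.jdbc.url"));
        // Lines starting with '#' are comments, so the key is absent and the default is returned
        System.out.println(props.getProperty("gora.sqlstore.jdbc.user", "(not set)"));
    }
}
```

<p> Only the part after the first <code>=</code> is treated as the value, so the colons inside the JDBC URL
need no escaping. </p>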
</section>
</section>
<section>
<title> Modelling the data </title>
<section>
<title>Data Beans for Analytics</title>
<p> For web site analytics, we will be using a generic <code>MetricDatum</code>
data structure. It holds a string <code>metricDimension</code> field, a long
<code>timestamp</code> field, and a long <code>metric</code> field. The first two fields
are the dimensions of the web analytics data, and the last is the actual aggregate
metric value. For example we might have an instance <code>{metricDimension="/index",
timestamp=101, metric=12}</code>, representing that there have been 12 pageviews to
the URL "/index" for the given time interval 101. </p>
<p>The Avro schema definition for <code>MetricDatum</code> can be found at
<code>gora-tutorial/src/main/avro/metricdatum.json</code>, and the compiled source
code at <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/generated/MetricDatum.java</code>.</p>
<p><source>
{
"type": "record",
"name": "MetricDatum",
"namespace": "org.apache.gora.tutorial.log.generated",
"fields" : [
{"name": "metricDimension", "type": "string"},
{"name": "timestamp", "type": "long"},
{"name": "metric", "type" : "long"}
]
}
</source></p>
</section>
<section>
<title>Data store mappings </title>
<p> We will be using the SQL backend to store the job output data, just to
demonstrate the SQL backend. </p>
<p> Similar to what we have seen with HBase, the gora-sql plugin reads its configuration from the
<code>gora-sql-mapping.xml</code> file.
Specifically, we will use the <code>gora-tutorial/conf/gora-sql-mapping.xml</code> file. </p>
<p><source>
&lt;gora-orm&gt;
...
&lt;class name="org.apache.gora.tutorial.log.generated.MetricDatum" keyClass="java.lang.String" table="Metrics"&gt;
&lt;primarykey column="id" length="512"/&gt;
&lt;field name="metricDimension" column="metricDimension" length="512"/&gt;
&lt;field name="timestamp" column="ts"/&gt;
&lt;field name="metric" column="metric/&gt;
&lt;/class&gt;
&lt;/gora-orm&gt;
</source></p>
<p> SQL mapping files contain one or more <code>class</code> elements as the children of <code>gora-orm</code>.
The key value pair is declared in the <code>class</code> element. The <code>name</code> attribute is the
fully qualified name of the class, and the <code>keyClass</code> attribute is the fully qualified class
name of the key class. </p>
<p>Children of the <code>class</code> element are <code>field</code> elements and one
<code>primarykey</code> element. Each <code>field</code>
element has a <code>name</code> and a <code>column</code> attribute, and optional
<code>jdbc-type</code>, <code>length</code> and <code>scale</code> attributes.
The <code>name</code> attribute contains
the name of the field in the persistent class, and the <code>column</code> attribute is the name of the
column in the database. The <code>primarykey</code> element holds the actual key as the primary key field. Currently,
Gora only supports tables with one primary key. </p>
</section>
</section>
<section>
<title> Constructing the job </title>
<p> In constructing the job object for Hadoop, we need to define whether we will use
Gora as job input, output or both. Gora defines
its own <a href="ext:api/org/apache/gora/mapreduce/gorainputformat">GoraInputFormat</a>,
and <a href="ext:api/org/apache/gora/mapreduce/goraoutputformat">GoraOutputFormat</a>, which
use <code>DataStore</code>s as input sources and output sinks for the jobs.
<code>Gora{In|Out}putFormat</code> classes define static methods to set up the job properly.
However, if the mapper or reducer extends Gora's mapper and reducer classes,
you can use the static methods defined in <a href="ext:api/org/apache/gora/mapreduce/goramapper">GoraMapper</a> and
<a href="ext:api/org/apache/gora/mapreduce/gorareducer">GoraReducer</a> since they are more convenient. </p>
<p> For this tutorial we will use Gora as both input and output. As can be seen from the
<code>createJob()</code> function, quoted below, we create the job
as normal, and set the input parameters via
<a href="ext:api/org/apache/gora/mapreduce/goramapper/initmapperjob">GoraMapper#initMapperJob()</a>,
and <a href="ext:api/org/apache/gora/mapreduce/gorareducer/initreducerjob">GoraReducer#initReducerJob()
</a>. <code>GoraMapper#initMapperJob()</code> takes a store and an optional query to fetch the data from.
When a query is given, only the results of the query are used as the input of the job; otherwise, all the records
are used.
The actual Mapper class and the map output key and value classes are passed to the <code>initMapperJob()</code>
function as well. <code>GoraReducer#initReducerJob()</code> accepts
the data store in which to store the job's output, as well as the actual reducer class.
The <code>initMapperJob</code> and
<code>initReducerJob</code> functions also have overloaded variants that take the data store class
rather than a data store instance.</p>
<p>
<source>
public Job createJob(DataStore&lt;Long, Pageview&gt; inStore
, DataStore&lt;String, MetricDatum&gt; outStore, int numReducer) throws IOException {
Job job = new Job(getConf());
job.setJobName("Log Analytics");
job.setNumReduceTasks(numReducer);
job.setJarByClass(getClass());
/* Mappers are initialized with GoraMapper.initMapperJob() or
* GoraInputFormat.setInput()*/
GoraMapper.initMapperJob(job, inStore, TextLong.class, LongWritable.class
, LogAnalyticsMapper.class, true);
/* Reducers are initialized with GoraReducer.initReducerJob().
* If the output is not to be persisted via Gora, any reducer
* can be used instead. */
GoraReducer.initReducerJob(job, outStore, LogAnalyticsReducer.class);
return job;
}
</source>
</p>
</section>
<section>
<title> Gora mappers and using Gora as input </title>
<p> Typically, if Gora is used as job input, the Mapper class extends
<a href="ext:api/org/apache/gora/mapreduce/goramapper">GoraMapper</a>. However, currently
this is not forced by the API so other class hierarchies can be used instead.
The mapper receives the key value pairs that are the results of the input query, and emits
the results of the custom map task. Note that output records from map are independent
from the input and output data stores, so any Hadoop serializable key value class can be used.
However, Gora persistent classes are also Hadoop serializable. Hadoop serialization is
handled by the <a href="ext:api/org/apache/gora/mapreduce/persistentserialization">
PersistentSerialization</a> class. Gora also defines a <a href="ext:api/org/apache/gora/mapreduce/stringserialization">
StringSerialization</a> class, to serialize strings easily.
</p>
<p> Coming back to the code for the tutorial, we can see that <code>LogAnalytics</code>
class defines an inner class <code>LogAnalyticsMapper</code> which extends
<code>GoraMapper</code>. The map function receives <code>Long</code> keys which are the line
numbers, and <code>Pageview</code> values as read from the input data store. The map simply
rolls up the timestamp up to the day (meaning that only the day of the timestamp is used),
and outputs the key as a tuple of <code>&lt;URL,day&gt;</code>.
</p>
<p><source>
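private LongWritable one = new LongWritable(1L); //constant map output value
private TextLong tuple; //reused output key; must be initialized before map() is called
protected void map(Long key, Pageview pageview, Context context)
throws IOException, InterruptedException {
Utf8 url = pageview.getUrl();
long day = getDay(pageview.getTimestamp()); //roll the timestamp up to the day
tuple.getKey().set(url.toString());
tuple.getValue().set(day);
context.write(tuple, one); //emit the (URL, day) tuple with a count of 1
}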
</source></p>
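<p> The <code>getDay()</code> helper is not quoted above. A minimal sketch of such a rollup, assuming that timestamps
are milliseconds since the epoch and that days are UTC days, could look like the following. The class name
<code>DayRollup</code> and the method name <code>rollToDay</code> are illustrative and are not part of Gora or of
the tutorial sources. </p>

```java
// Hypothetical stand-in for the getDay() helper used by the mapper; not Gora code.
class DayRollup {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    /** Truncates an epoch-millisecond timestamp to the start of its UTC day. */
    static long rollToDay(long timestampMillis) {
        // floorDiv keeps the result correct for negative (pre-1970) timestamps as well
        return Math.floorDiv(timestampMillis, DAY_MILLIS) * DAY_MILLIS;
    }

    public static void main(String[] args) {
        // Timestamp of pageview 10 from the query output earlier in this tutorial
        System.out.println(rollToDay(1236710442000L)); // prints 1236643200000
    }
}
```

<p> Every pageview from the same UTC day maps to the same value, which is what lets the reducer accumulate
daily totals per URL. </p>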
</section>
<section>
<title> Gora reducers and using Gora as output</title>
<p>Similar to the input, typically, if Gora is used as job output, the Reducer extends
<a href="ext:api/org/apache/gora/mapreduce/gorareducer">GoraReducer</a>. The values
emitted by the reducer are persisted to the output data store as a result of the job.
</p>
<p> For this tutorial, the <code>LogAnalyticsReducer</code> inner class,
which extends <code>GoraReducer</code>, is used as the reducer. The reducer
just sums up all the values that correspond to the <code>&lt;URL,day&gt;</code> tuple.
Then the <code>MetricDatum</code> object is populated and emitted, which
will be stored in the output data store.
</p>
<p><source>
protected void reduce(TextLong tuple
, Iterable&lt;LongWritable&gt; values, Context context)
throws IOException ,InterruptedException {
long sum = 0L; //sum up the values
for(LongWritable value: values) {
sum+= value.get();
}
String dimension = tuple.getKey().toString();
long timestamp = tuple.getValue().get();
metricDatum.setMetricDimension(new Utf8(dimension));
metricDatum.setTimestamp(timestamp);
//the row key combines the URL and the day, so that each day gets its own row
String key = metricDatum.getMetricDimension().toString() + "_" + timestamp;
metricDatum.setMetric(sum);
context.write(key, metricDatum);
};
</source></p>
</section>
<section>
<title> Running the job </title>
<p> Now that the job is constructed, we can run the Hadoop job as usual. Note that the <code>run</code> function
of the <code>LogAnalytics</code> class parses the arguments and runs the job. We can run the program by </p>
<p><code>$ bin/gora loganalytics [&lt;input data store&gt; [&lt;output data store&gt;]] </code></p>
<section>
<title> Running the job with SQL </title>
<p>Now, let's run the log analytics tool with the SQL backend (either HSQLDB or MySQL). The input data store will be
<code>org.apache.gora.hbase.store.HBaseStore</code> and output store will be
<code>org.apache.gora.sql.store.SqlStore</code>. Remember that we have already configured the database
connection properties and which database will be used at the <a href="#Setting+up+the+environment-N103D7">
Setting up the environment</a> section. </p>
<p><code>$ bin/gora loganalytics org.apache.gora.hbase.store.HBaseStore org.apache.gora.sql.store.SqlStore</code></p>
<p> Now we should see some logging output from the job, including whether it finished successfully. If we are using HSQLDB,
the following command can be used to inspect the output. </p>
<p><code>$ java -jar gora-tutorial/lib/hsqldb-2.0.0.jar</code></p>
<p>In the connection URL, use the same URL that we provided in <code>gora.properties</code>. If, on the other hand,
MySQL is used, then we should be able to see the output using the mysql command line utility. </p>
<p> The results of the job are stored in the Metrics table, which is defined in the <code>gora-sql-mapping.xml</code>
file. Running a select query over this data confirms that the daily pageview metrics for the web site are indeed stored.
To see the most popular pages, run: </p>
<p><code>&gt; SELECT METRICDIMENSION, TS, METRIC FROM metrics order by metric desc</code></p>
<p><table>
<tr><th>METRICDIMENSION</th> <th>TS</th> <th>METRIC</th></tr>
<tr><td>/</td> <td> 1236902400000</td> <td> 220</td></tr>
<tr><td>/</td> <td> 1236988800000</td> <td> 212</td></tr>
<tr><td>/</td> <td> 1236816000000</td> <td> 191</td></tr>
<tr><td>/</td> <td> 1237075200000</td> <td> 155</td></tr>
<tr><td>/</td> <td> 1241395200000</td> <td> 111</td></tr>
<tr><td>/</td> <td> 1236643200000</td> <td> 110</td></tr>
<tr><td>/</td> <td> 1236729600000</td> <td> 95</td></tr>
<tr><td>/index.php?a=3__x8g0vi&amp;k=5508310</td> <td> 1236816000000</td> <td> 45</td></tr>
<tr><td>/index.php?a=1__5kf9nvgrzos&amp;k=208773</td> <td> 1236816000000</td> <td> 37</td></tr>
<tr><td>...</td> <td>...</td> <td>...</td></tr>
</table></p>
<p>As you can see, the home page (<code>/</code>) for various days and some other pages are listed.
In total, 3033 rows are present in the metrics table. </p>
</section>
<section>
<title>Running the job with HBase </title>
<p> Since HBaseStore is already defined as the default data store at <code>gora.properties</code>
we can run the job with HBase as:</p>
<p><code>$ bin/gora loganalytics</code></p>
<p>The outputs of the job will be saved in the Metrics table, whose layout is defined at
<code>gora-hbase-mapping.xml</code> file. To see the results:</p>
<p><code>hbase(main):010:0> scan 'Metrics', {LIMIT=>1}</code></p>
<p><source>
ROW COLUMN+CELL
/?a=1__-znawtuabsy&amp;k=96804_ column=common:metric, timestamp=1289815441740, value=\x00\x00\x00\x00\x00\x00\x00
1236902400000 \x09
/?a=1__-znawtuabsy&amp;k=96804_ column=common:metricDimension, timestamp=1289815441740, value=/?a=1__-znawtuabsy&amp;
1236902400000 k=96804
/?a=1__-znawtuabsy&amp;k=96804_ column=common:ts, timestamp=1289815441740, value=\x00\x00\x01\x1F\xFD \xD0\x00
1236902400000
1 row(s) in 0.0490 seconds
</source></p>
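<p> The cell values in the scan above are printed as escaped bytes. As a reading aid, the following sketch decodes
the <code>common:ts</code> and <code>common:metric</code> cells, assuming that long fields are stored as 8-byte
big-endian values. The class name <code>CellDecoder</code> is made up for illustration and is not part of Gora or
HBase. </p>

```java
import java.nio.ByteBuffer;

// Sketch only: assumes long fields are stored as 8-byte big-endian values.
class CellDecoder {
    /** Interprets an 8-byte cell value as a big-endian long. */
    static long toLong(byte[] cell) {
        return ByteBuffer.wrap(cell).getLong(); // ByteBuffer is big-endian by default
    }

    public static void main(String[] args) {
        // common:ts from the scan above; the shell prints byte 0x20 as a space
        byte[] ts = {0x00, 0x00, 0x01, 0x1F, (byte) 0xFD, 0x20, (byte) 0xD0, 0x00};
        // common:metric from the scan above; the last byte, \x09, wraps onto the next line
        byte[] metric = {0, 0, 0, 0, 0, 0, 0, 0x09};
        System.out.println(toLong(ts));     // prints 1236902400000
        System.out.println(toLong(metric)); // prints 9
    }
}
```

<p> Under that assumption, the row records 9 pageviews for this URL on the day starting at 1236902400000, which
is the same day value that appears as the suffix of the row key. </p>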
</section>
</section>
</section>
<section>
<title>More Examples</title>
<p> Other than this tutorial, there are several places that you can find
examples of Gora in action. </p>
<p>The first place to look at is the examples directories
under various Gora modules. All the modules have a <code>&lt;gora-module&gt;/src/examples/</code> directory
under which some example classes can be found. In particular, there are some classes that are used for tests under
<code>&lt;gora-core&gt;/src/examples/</code></p>
<p>Second, the various unit tests of the Gora modules can be consulted to see the API in use. The unit tests can be found
at <code>&lt;gora-module&gt;/src/test/</code> </p>
<p>The source code for the projects using Gora can also be checked out as a reference. <a href="ext:nutch">Apache Nutch</a> is
one of the first class users of Gora; so looking into how Nutch uses Gora is always a good idea.
</p>
<p> Please feel free to grab our <a href="http://gora.apache.org/images/powered-by-gora.png">poweredBy</a> sticker and embed it in anything backed by Apache Gora.</p>
</section>
<section>
<title>Feedback</title>
<p> Finally, thanks for trying out Gora. If you find any bugs or you have suggestions for improvement,
do not hesitate to give feedback on the dev@gora.apache.org <a href="ext:devmail">mailing list</a>. </p>
</section>
</body>
</document>