| <?xml version="1.0"?> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <document> |
| <header> |
| <title>Gora Tutorial</title> |
| </header> |
| |
| <body> |
| |
| <p><hr/> <b>Author :</b> Enis Söztutar, enis [at] apache [dot] org<hr/> </p> |
| |
| <section> |
| <title>Introduction</title> |
| <p>This is the official tutorial for Apache Gora. In this tutorial, we |
| will implement a system that stores web server logs in Apache HBase, |
| analyzes them with Apache Hadoop, and stores the results in either HSQLDB or MySQL.</p> |
| |
| <p> In this tutorial we will first look at how to set up the environment and |
| configure Gora and the data stores. Later, we will go over the data we will use and |
| define the data beans that will be used to interact with the persistency layer. |
| Next, we will go over the API of Gora to do some basic tasks such as storing objects, |
| fetching and querying objects, and deleting objects. Last, we will go over an example |
| program which uses Hadoop MapReduce to analyze the web server logs, and discuss the Gora |
| MapReduce API in some detail. </p> |
| |
| <section> |
| <title>Introduction to Gora</title> |
| <p> The Apache Gora open source framework provides an in-memory data |
| model and persistence for big data. Gora supports persisting to |
| column stores, key-value stores, document stores and RDBMSs, and |
| analyzing the data with extensive Apache Hadoop MapReduce support. Gora |
| data beans are defined with Apache Avro: in Avro, the |
| beans to hold the data and the RPC interfaces are defined using a JSON |
| schema. To map the data beans to data-store-specific settings, |
| Gora depends on mapping files, which are specific to each data store. |
| Unlike in other ORM implementations, the mapping from data bean to data-store-specific |
| schema is explicit in Gora. This has the advantage that, |
| when using data models such as those of HBase and Cassandra, you always |
| know how the values are persisted. </p> |
| |
| <p>Gora has a modular architecture. Most of the data stores in Gora |
| have their own module, such as <code>gora-hbase</code>, <code>gora-cassandra</code>, |
| and <code>gora-sql</code>. In your projects, you only need to include |
| the artifacts from the modules you use. You can consult the <a href="quickstart.html#Setting+up+your+project"> |
| Setting up your project</a> section in the quick start guide.</p> |
| </section> |
| </section> |
| |
| <section> |
| <title>Setting up the environment</title> |
| |
| <section> |
| <title>Setting up Gora</title> |
| <p>As a first step, we need to download and compile the Gora source code. The source code |
| for the tutorial is in the <code>gora-tutorial</code> module. If you have |
| already downloaded Gora, you are all set; otherwise, please go |
| over the steps in the <a href="site:quickstart">Quick Start</a> guide to see |
| how to download and compile Gora. </p> |
| <p> |
| Now, after the source code for Gora is at hand, let's have a look at the files under the |
| directory <code>gora-tutorial</code>. </p> |
| |
| <p> |
| <code>$ cd gora-tutorial</code><br/> |
| <code>$ tree</code><br/> |
| <source> |
| |-- build.xml |
| |-- conf |
| |   |-- gora-hbase-mapping.xml |
| |   |-- gora-sql-mapping.xml |
| |   `-- gora.properties |
| |-- ivy |
| |   `-- ivy.xml |
| `-- src |
|     |-- examples |
|     |   `-- java |
|     |-- main |
|     |   |-- avro |
|     |   |   |-- metricdatum.json |
|     |   |   `-- pageview.json |
|     |   |-- java |
|     |   |   `-- org |
|     |   |       `-- apache |
|     |   |           `-- gora |
|     |   |               `-- tutorial |
|     |   |                   `-- log |
|     |   |                       |-- KeyValueWritable.java |
|     |   |                       |-- LogAnalytics.java |
|     |   |                       |-- LogManager.java |
|     |   |                       |-- TextLong.java |
|     |   |                       `-- generated |
|     |   |                           |-- MetricDatum.java |
|     |   |                           `-- Pageview.java |
|     |   `-- resources |
|     |       `-- access.log.tar.gz |
|     `-- test |
|         |-- conf |
|         `-- java |
| </source> |
| </p> |
| |
| <p>Since gora-tutorial is a top-level module of Gora, it follows the directory |
| structure imposed by Gora's main build scripts (<code>build.xml</code> and |
| <code>build-common.xml</code> for Ivy, and <code>pom.xml</code> for Maven). The Java source code resides in the directory |
| <code>src/main/java/</code>, Avro schemas in <code>src/main/avro/</code>, and data in |
| <code>src/main/resources/</code>.</p> |
| </section> |
| |
| <section> |
| <title>Setting up HBase</title> |
| <p> For this tutorial we will be using <a href="ext:hbase"> HBase</a> to |
| store the logs. For those of you not familiar with HBase, it is a NoSQL |
| column store with an architecture very similar to Google's BigTable. </p> |
| <!-- TODO: Tutorial for SQL and Cassandra --> |
| <p> If you do not already have HBase set up, you can go over the steps in the |
| <a href="http://hbase.apache.org/docs/r0.20.6/api/overview-summary.html#overview_description"> HBase Overview </a> |
| documentation. Although Gora aims to support the most recent HBase versions, the overview above is |
| written specifically for HBase 0.20.6 (don't worry, the principles are the same), so download a version from |
| <a href="http://hbase.apache.org/releases.html">HBase releases</a>. After extracting |
| the file, cd to the hbase-${dist} directory and start the HBase server. </p> |
| <p><code>$ bin/start-hbase.sh</code> </p> |
| <p> and make sure that HBase is available by using the HBase shell.</p> |
| <p><code>$ bin/hbase shell</code> </p> |
| </section> |
| |
| <section> |
| <title>Configuring Gora</title> |
| <p> Gora is configured through a file in the classpath named <code>gora.properties</code>. |
| We will be using the following file <code>gora-tutorial/conf/gora.properties</code> </p> |
| |
| <p><source> |
| gora.datastore.default=org.apache.gora.hbase.store.HBaseStore |
| gora.datastore.autocreateschema=true |
| </source></p> |
| |
| <p> This file states that the default store will be <code>HBaseStore</code>, |
| and that schemas (tables) should be created automatically. </p> |
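| |
| <p> To make the key/value structure of this file concrete, the following sketch reads the same |
| two settings with the standard <code>java.util.Properties</code> class. It is only an illustration: |
| the class name <code>GoraPropertiesSketch</code> is hypothetical, the text is loaded from a string |
| rather than from the classpath, and this is not a description of how Gora itself loads its configuration. </p> |
| <p><source> |
```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: reads the two tutorial settings with java.util.Properties. */
class GoraPropertiesSketch {

  /** The same two lines as gora-tutorial/conf/gora.properties in this tutorial. */
  static final String TUTORIAL_PROPERTIES =
      "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n"
      + "gora.datastore.autocreateschema=true\n";

  /** Parses properties text; a stand-in for loading the file from the classpath. */
  static Properties load(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    Properties props = load(TUTORIAL_PROPERTIES);
    String storeClass = props.getProperty("gora.datastore.default");
    // Treat a missing key as "false" in this sketch.
    boolean autoCreate =
        Boolean.parseBoolean(props.getProperty("gora.datastore.autocreateschema", "false"));
    System.out.println("default store class: " + storeClass);
    System.out.println("auto-create schema:  " + autoCreate);
  }
}
```
| </source></p> |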
| |
| <p> More information for configuring different settings in gora.properties |
| can be found <a href="site:gora-conf"> here </a>. </p> |
| </section> |
| |
| </section> |
| |
| <section> |
| <title> Modelling the data </title> |
| <section> |
| <title>Data for the tutorial</title> |
| <p>For this tutorial, we will be parsing and storing the logs of a web server. |
| Some example logs are at <code>src/main/resources/access.log.tar.gz</code>; they |
| belong to the (now shut down) server at http://www.buldinle.com/. The example logs contain 10,000 lines, dated between 2009/03/10 and 2009/03/15. <br/> |
| The first thing we need to do is to extract the logs. </p> |
| <p><code>$ tar zxvf src/main/resources/access.log.tar.gz -C src/main/resources/</code></p> |
| <p> You can also use your own log files, given that the log |
| format is <a href="http://httpd.apache.org/docs/current/logs.html"> |
| Combined Log Format</a>. Some example lines from the log are: </p> |
| <code>88.254.190.73 - - [10/Mar/2009:20:40:26 +0200] "GET / HTTP/1.1" 200 43 "http://www.buldinle.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB5; .NET CLR 2.0.50727; InfoPath.2)"</code><br/> |
| <code>78.179.56.27 - - [11/Mar/2009:00:07:40 +0200] "GET /index.php?i=3&a=1__6x39kovbji8&k=3750105 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?i=3&a=1__6X39Kovbji8&k=3750105" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)"</code><br/> |
| <code>78.163.99.14 - - [12/Mar/2009:18:18:25 +0200] "GET /index.php?a=3__x7l72c&k=4476881 HTTP/1.1" 200 43 "http://www.buldinle.com/index.php?a=3__x7l72c&k=4476881" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1)"</code><br/> |
| |
| <p>The fields, in order, are: user's IP, ignored, ignored, date and |
| time, HTTP method, URL, HTTP protocol version, HTTP status code, number of bytes |
| returned, referrer, and user agent.</p> |
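| |
| <p>The field order above can be sketched as a small stand-alone parser. This is a minimal |
| illustration using only the JDK; the class name <code>CombinedLogLineSketch</code>, the regular |
| expression, and the choice to store a missing byte count (<code>-</code>) as 0 are assumptions made |
| for this example. The tutorial's own <code>LogManager</code>, shown later, uses a |
| <code>StringTokenizer</code> instead. </p> |
| <p><source> |
```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative Combined Log Format parser; not the tutorial's LogManager code. */
class CombinedLogLineSketch {

  // Groups: ip, date, method, url, protocol, status, bytes, referrer, user agent.
  // The two fields after the IP are matched but ignored.
  private static final Pattern LINE = Pattern.compile(
      "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\""
      + " (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

  // Matches dates such as 10/Mar/2009:20:40:26 +0200.
  private static final DateTimeFormatter DATE =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

  final String ip;
  final long timestampMillis;
  final String httpMethod;
  final String url;
  final int httpStatusCode;
  final int responseSize;
  final String referrer;
  final String userAgent;

  private CombinedLogLineSketch(Matcher m) {
    ip = m.group(1);
    timestampMillis = OffsetDateTime.parse(m.group(2), DATE).toInstant().toEpochMilli();
    httpMethod = m.group(3);
    url = m.group(4);
    // group(5) is the protocol version, which the Pageview bean does not store.
    httpStatusCode = Integer.parseInt(m.group(6));
    responseSize = "-".equals(m.group(7)) ? 0 : Integer.parseInt(m.group(7));
    referrer = m.group(8);
    userAgent = m.group(9);
  }

  /** Returns the parsed line, or null when the line is not in Combined Log Format. */
  static CombinedLogLineSketch parse(String line) {
    Matcher m = LINE.matcher(line);
    return m.matches() ? new CombinedLogLineSketch(m) : null;
  }

  public static void main(String[] args) {
    String line = "88.254.190.73 - - [10/Mar/2009:20:40:26 +0200] \"GET / HTTP/1.1\" 200 43 "
        + "\"http://www.buldinle.com/\" \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\"";
    CombinedLogLineSketch p = parse(line);
    System.out.println(p.ip + " " + p.httpMethod + " " + p.url + " " + p.httpStatusCode);
  }
}
```
| </source></p> |
| <p>Returning <code>null</code> for lines that do not match keeps the caller free to skip malformed |
| entries, which mirrors the null check in the <code>parse()</code> method shown later in this tutorial.</p> |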
| |
| </section> |
| |
| <section> |
| <title>Defining data beans</title> |
| |
| <p> Data beans are the main way to hold data in memory and to persist it in Gora. Gora |
| needs to explicitly keep track of the status of the data in memory, so |
| we use <a href="ext:avro">Apache Avro</a> for defining the beans. Using |
| Avro gives us the possibility to explicitly keep track of an object's persistence state, |
| and a way to serialize the object's data. </p> |
| <p>Defining data beans is a very easy task, but for the exact syntax, please |
| consult the <a href="ext:avrospec"> Avro Specification</a>.</p> |
| <p> First, we need to define the bean <b><code>Pageview</code></b> to hold a |
| single URL access in the logs. Let's go over the class at <code> src/main/avro/pageview.json </code> |
| </p> |
| <p> |
| <source> |
| { |
|   "type": "record", |
|   "name": "Pageview", |
|   "namespace": "org.apache.gora.tutorial.log.generated", |
|   "fields" : [ |
|     {"name": "url", "type": "string"}, |
|     {"name": "timestamp", "type": "long"}, |
|     {"name": "ip", "type": "string"}, |
|     {"name": "httpMethod", "type": "string"}, |
|     {"name": "httpStatusCode", "type": "int"}, |
|     {"name": "responseSize", "type": "int"}, |
|     {"name": "referrer", "type": "string"}, |
|     {"name": "userAgent", "type": "string"} |
|   ] |
| } |
| </source> |
| </p> |
| |
| <p>Avro schemas are declared in JSON. |
| <a href="http://avro.apache.org/docs/current/spec.html#schema_record"> |
| Records</a> are defined with type |
| <code>"record"</code>, with a name as the name of the class, and a |
| namespace which is mapped to the package name in Java. The fields |
| are listed in the <code>"fields"</code> element. Each field is given |
| with its type. </p> |
| |
| </section> |
| |
| <section> |
| <title>Compiling Avro Schemas</title> |
| |
| <p>The next step after defining the data beans is to compile the schemas |
| into Java classes. For that we will use <code>GoraCompiler</code>. |
| Invoking the Gora compiler without arguments (from the Gora top-level directory) </p> |
| <p> |
| <code> |
| $ bin/gora compile |
| </code> |
| </p> <p>results in:</p> |
| <p> |
| <code> |
| Usage: SpecificCompiler <schema file> <output dir> |
| </code> |
| </p> <p>so we will issue:</p> |
| <p> |
| <code> |
| $ bin/gora compile gora-tutorial/src/main/avro/pageview.json gora-tutorial/src/main/java/ |
| </code> |
| </p> |
| <p>to compile the Pageview class into |
| <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/generated/Pageview.java</code>. |
| However, the tutorial's Java classes are already committed, so you do not need to do that |
| now. </p> |
| |
| <p> The Gora compiler extends Avro's <code>SpecificCompiler</code> to convert a JSON definition |
| into a Java class. Generated classes extend <code>PersistentBase</code> and thereby implement |
| the <a href="ext:api/org/apache/gora/persistency/persistent">Persistent</a> interface. |
| Most of the methods of the <code>Persistent</code> interface deal with bookkeeping for |
| persistence and state tracking, so most of the time they are not used explicitly by the |
| user. Now, let's look at the internals of the generated class <code>Pageview.java</code>. |
| </p> |
| <p> |
| <source> |
| public class Pageview extends PersistentBase { |
| |
|   private Utf8 url; |
|   private long timestamp; |
|   private Utf8 ip; |
|   private Utf8 httpMethod; |
|   private int httpStatusCode; |
|   private int responseSize; |
|   private Utf8 referrer; |
|   private Utf8 userAgent; |
| |
|   ... |
| |
|   public static final Schema _SCHEMA = Schema.parse("{\"type\":\"record\", ... "); |
|   public static enum Field { |
|     URL(0,"url"), |
|     TIMESTAMP(1,"timestamp"), |
|     IP(2,"ip"), |
|     HTTP_METHOD(3,"httpMethod"), |
|     HTTP_STATUS_CODE(4,"httpStatusCode"), |
|     RESPONSE_SIZE(5,"responseSize"), |
|     REFERRER(6,"referrer"), |
|     USER_AGENT(7,"userAgent"), |
|     ; |
|     private int index; |
|     private String name; |
|     Field(int index, String name) {this.index=index;this.name=name;} |
|     public int getIndex() {return index;} |
|     public String getName() {return name;} |
|     public String toString() {return name;} |
|   }; |
|   public static final String[] _ALL_FIELDS = {"url","timestamp","ip","httpMethod" |
|     ,"httpStatusCode","responseSize","referrer","userAgent",}; |
| |
|   ... |
| } |
| </source> |
| </p> |
| |
| <p> We can see the actual field declarations in the class. Note that Avro uses the <code>Utf8</code> |
| class as a placeholder for string fields. We can also see the embedded Avro |
| Schema declaration and an inner enum named <code>Field</code>. This enum and |
| the <code>_ALL_FIELDS</code> field will come in handy when we |
| query the datastore for specific fields. |
| </p> |
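| |
| <p>As a rough illustration of why such an enum is convenient, the sketch below builds an array of |
| field names from enum constants instead of string literals. The enum here is a hand-written, |
| shortened stand-in for the generated <code>Pageview.Field</code> (only three constants), and the |
| names <code>FieldSelectionSketch</code> and <code>names()</code> are hypothetical. An array of this |
| shape is what the <code>get(K key, String[] fields)</code> method described later in this tutorial expects. </p> |
| <p><source> |
```java
import java.util.Arrays;

/** Illustrative only: selecting a subset of fields by enum constant. */
class FieldSelectionSketch {

  /** Shortened stand-in for the generated Pageview.Field enum. */
  enum Field {
    URL(0, "url"),
    TIMESTAMP(1, "timestamp"),
    IP(2, "ip");

    private final int index;
    private final String name;

    Field(int index, String name) {
      this.index = index;
      this.name = name;
    }

    int getIndex() { return index; }
    String getName() { return name; }
  }

  /** Converts enum constants to the field-name strings used in the schema. */
  static String[] names(Field... fields) {
    String[] result = new String[fields.length];
    for (int i = 0; i < fields.length; i++) {
      result[i] = fields[i].getName();
    }
    return result;
  }

  public static void main(String[] args) {
    // A typo in a constant name fails at compile time; a typo in a string literal would not.
    String[] wanted = names(Field.URL, Field.IP);
    System.out.println(Arrays.toString(wanted));
  }
}
```
| </source></p> |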
| </section> |
| |
| |
| <section> |
| <title>Defining data store mappings</title> |
| <p>Gora is designed to work flexibly with various types of data modeling, |
| including column stores (such as HBase and Cassandra), SQL databases, flat files (binary, |
| JSON, or XML encoded), and key-value stores. The mapping between the data bean and |
| the data store is therefore defined in XML mapping files. Each data store has its own |
| mapping format, so that data-store-specific settings can be leveraged more easily. |
| The mapping files declare how the fields of the classes declared in Avro schemas |
| are serialized and persisted to the data store.</p> |
| |
| <section> |
| <title> HBase mappings </title> |
| <p> HBase mappings are stored in a file named <code>gora-hbase-mapping.xml</code>. |
| For this tutorial we will be using the file <code>gora-tutorial/conf/gora-hbase-mapping.xml</code>.</p> |
| |
| <!-- This is gora-sql-mapping.xml |
| <source> |
| <gora-orm> |
| <class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog"> |
| <primarykey column="line"/> |
| <field name="url" column="url" length="512" primarykey="true"/> |
| <field name="timestamp" column="timestamp"/> |
| <field name="ip" column="ip" length="16"/> |
| <field name="httpMethod" column="httpMethod" length="6"/> |
| <field name="httpStatusCode" column="httpStatusCode"/> |
| <field name="responseSize" column="responseSize"/> |
| <field name="referrer" column="referrer" length="512"/> |
| <field name="userAgent" column="userAgent" length="512"/> |
| </class> |
| |
| ... |
| |
| </gora-orm> |
| |
| </source> |
| --> |
| |
| <p><source> |
| <gora-orm> |
|   <table name="Pageview"> <!-- optional descriptors for tables --> |
|     <family name="common"/> <!-- This can also have params like compression, bloom filters --> |
|     <family name="http"/> |
|     <family name="misc"/> |
|   </table> |
| |
|   <class name="org.apache.gora.tutorial.log.generated.Pageview" keyClass="java.lang.Long" table="AccessLog"> |
|     <field name="url" family="common" qualifier="url"/> |
|     <field name="timestamp" family="common" qualifier="timestamp"/> |
|     <field name="ip" family="common" qualifier="ip" /> |
|     <field name="httpMethod" family="http" qualifier="httpMethod"/> |
|     <field name="httpStatusCode" family="http" qualifier="httpStatusCode"/> |
|     <field name="responseSize" family="http" qualifier="responseSize"/> |
|     <field name="referrer" family="misc" qualifier="referrer"/> |
|     <field name="userAgent" family="misc" qualifier="userAgent"/> |
|   </class> |
| |
|   ... |
| |
| </gora-orm> |
| </source> </p> |
| |
| <p> |
| Every mapping file starts with the top level element <code><gora-orm></code>. |
| Gora HBase mapping files can have two types of child elements, <code>table</code> and |
| <code>class</code> declarations. All of the table and class definitions should be |
| listed at this level.</p> |
| |
| <p>The <code>table</code> declaration is optional; most of the time, Gora infers the table |
| declaration from the <code>class</code> elements. However, some HBase-specific |
| table configuration, such as compression and blockCache, can be given here |
| if Gora is used to auto-create the tables. The exact syntax for the file can be found |
| <a href="gora-hbase.html#Gora+HBase+mappings">here</a>.</p> |
| |
| <p>In Gora, data store access is always |
| done in a key-value data model, since most of the target backends support this model. |
| DataStore API expects to know the class names of the key and persistent classes, so that |
| they can be instantiated. The key value pair is declared in the <code>class</code> element. |
| The <code>name</code> attribute is the fully qualified name of the class, |
| and the <code>keyClass</code> attribute is |
| the fully qualified class name of the key class. </p> |
| |
| <p>Children of the <code><class></code> element are <code><field></code> |
| elements. Each field element has a <code>name</code> and a <code>family</code> attribute, and |
| an optional <code>qualifier</code> attribute. The <code>name</code> attribute contains the name |
| of the field in the persistent class, and <code>family</code> declares the column family |
| in the HBase data model. If the qualifier is not given, the name of the field is used |
| as the column qualifier. Note that map and array type fields are stored in unique column |
| families, so the configuration should list a unique column family for each map and |
| array type field, and no qualifier should be given for them. The exact data model is discussed further |
| in the <a href="site:gora-hbase">gora-hbase documentation</a>. </p> |
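| |
| <p>The qualifier fallback rule can be written down in a few lines. The sketch below is only a |
| restatement of the rule described above, not Gora's mapping code; the class and method names |
| (<code>HBaseColumnSketch</code>, <code>columnFor</code>) are hypothetical, and the |
| <code>family:qualifier</code> notation is the one the HBase shell uses when printing columns. </p> |
| <p><source> |
```java
/** Illustrative only: how a mapped field resolves to an HBase column name. */
class HBaseColumnSketch {

  /**
   * Returns "family:qualifier". When no qualifier is mapped, the field name
   * is used as the qualifier, as described in the text above.
   */
  static String columnFor(String fieldName, String family, String qualifier) {
    String effective = (qualifier == null || qualifier.isEmpty()) ? fieldName : qualifier;
    return family + ":" + effective;
  }

  public static void main(String[] args) {
    // Explicit qualifier, as in the tutorial mapping file.
    System.out.println(columnFor("httpStatusCode", "http", "httpStatusCode"));
    // No qualifier given: the field name is used instead.
    System.out.println(columnFor("url", "common", null));
  }
}
```
| </source></p> |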
| </section> |
| </section> |
| </section> |
| |
| <section> |
| <title> Basic API </title> |
| |
| <section> |
| <title>Parsing the logs</title> |
| <p> Now that we have the basic setup, we can see the Gora API in action. As you will see below, the API |
| is pretty simple to use. We will be using the class <code>LogManager</code> (which is located at |
| <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogManager.java</code>) for parsing |
| and storing the logs, deleting some lines and querying.</p> |
| |
| |
| <p> First of all, let us look at the constructor. The only real thing it does is to call the |
| <code>init()</code> method. <code>init()</code> method constructs the |
| <code>DataStore</code> instance so that it can be used by the <code>LogManager</code>'s methods.</p> |
| <p><source> |
| public LogManager() { |
|   try { |
|     init(); |
|   } catch (IOException ex) { |
|     throw new RuntimeException(ex); |
|   } |
| } |
| private void init() throws IOException { |
|   dataStore = DataStoreFactory.getDataStore(Long.class, Pageview.class); |
| } |
| </source></p> |
| |
| <p> <a href="ext:api/org/apache/gora/store/datastore">DataStore</a> is probably the most important |
| class in the Gora API. <code>DataStore</code> handles actual object persistence. Objects can be persisted, |
| fetched, queried or deleted by the DataStore methods. Every data store that Gora supports defines its own subclass |
| of the DataStore class. For example, the <code>gora-hbase</code> module defines <code>HBaseStore</code>, and |
| the <code>gora-sql</code> module defines <code>SqlStore</code>. However, these subclasses are not explicitly |
| used by the user. </p> |
| |
| <p> DataStores always have associated key and value (persistent) classes. The key class is the class of the keys of the |
| data store, and the value class is the actual data bean's class. The value class is almost always generated from |
| Avro schema definitions using the Gora compiler. </p> |
| |
| <p> Data store objects are created by <a href="ext:api/org/apache/gora/store/datastorefactory">DataStoreFactory</a>. It is necessary to |
| provide the key and value class. The datastore class is optional, |
| and if not specified it will be read from the configuration (gora.properties).</p> |
| |
| <p> For this tutorial, we have already defined the Avro schema to use and compiled |
| our data bean into the <code>Pageview</code> class. For keys in the data store, we will be using <code>Long</code>s. |
| The keys will hold the line number of the pageview in the data file. </p> |
| |
| <p>Next, let's look at the main function of the <code>LogManager</code> class.</p> |
| <p><source> |
| public static void main(String[] args) throws Exception { |
|   if(args.length < 2) { |
|     System.err.println(USAGE); |
|     System.exit(1); |
|   } |
| |
|   LogManager manager = new LogManager(); |
| |
|   if("-parse".equals(args[0])) { |
|     manager.parse(args[1]); |
|   } else if("-get".equals(args[0])) { |
|     manager.get(Long.parseLong(args[1])); |
|   } else if("-query".equals(args[0])) { |
|     if(args.length == 2) |
|       manager.query(Long.parseLong(args[1])); |
|     else |
|       manager.query(Long.parseLong(args[1]), Long.parseLong(args[2])); |
|   } else if("-delete".equals(args[0])) { |
|     manager.delete(Long.parseLong(args[1])); |
|   } else if("-deleteByQuery".equalsIgnoreCase(args[0])) { |
|     manager.deleteByQuery(Long.parseLong(args[1]), Long.parseLong(args[2])); |
|   } else { |
|     System.err.println(USAGE); |
|     System.exit(1); |
|   } |
| |
|   manager.close(); |
| } |
| </source></p> |
| |
| <p>We can use the example log manager program from the command line (in the top level Gora directory): </p> |
| <p><code> |
| $ bin/gora logmanager |
| </code></p> |
| <p> which lists the usage as: </p> |
| <p><source> |
| LogManager -parse <input_log_file> |
| -get <lineNum> |
| -query <lineNum> |
| -query <startLineNum> <endLineNum> |
| -delete <lineNum> |
| -deleteByQuery <startLineNum> <endLineNum> |
| </source></p> |
| |
| <p> So to parse and store our logs located at <code>gora-tutorial/src/main/resources/access.log</code>, we will issue: </p> |
| <p><code> |
| $ bin/gora logmanager -parse gora-tutorial/src/main/resources/access.log |
| </code></p> |
| |
| <p> This should output something like: </p> |
| <p><source> |
| 10/09/30 18:30:17 INFO log.LogManager: Parsing file:gora-tutorial/src/main/resources/access.log |
| 10/09/30 18:30:23 INFO log.LogManager: finished parsing file. Total number of log lines:10000 |
| </source></p> |
| <p> Now, let's look at the code which parses the data and stores the logs. </p> |
| <p><source> |
| private void parse(String input) throws IOException, ParseException { |
|   BufferedReader reader = new BufferedReader(new FileReader(input)); |
|   long lineCount = 0; |
|   try { |
|     String line = reader.readLine(); |
|     do { |
|       Pageview pageview = parseLine(line); |
| |
|       if(pageview != null) { |
|         //store the pageview |
|         storePageview(lineCount++, pageview); |
|       } |
| |
|       line = reader.readLine(); |
|     } while(line != null); |
| |
|   } finally { |
|     reader.close(); |
|   } |
| } |
| </source></p> |
| |
| <p> The file is iterated line by line. Notice that the <code>parseLine(line)</code> |
| function does the actual parsing, converting each line to a <code>Pageview</code> object |
| of the class defined earlier. </p> |
| |
| <p><source> |
| private Pageview parseLine(String line) throws ParseException { |
|   StringTokenizer matcher = new StringTokenizer(line); |
|   //parse the log line |
|   String ip = matcher.nextToken(); |
|   ... |
| |
|   //construct and return pageview object |
|   Pageview pageview = new Pageview(); |
|   pageview.setIp(new Utf8(ip)); |
|   pageview.setTimestamp(timestamp); |
|   ... |
| |
|   return pageview; |
| } |
| </source></p> |
| <p><code>parseLine()</code> uses standard <code>StringTokenizer</code>s for the job |
| and constructs and returns a <code>Pageview</code> object.</p> |
| </section> |
| |
| |
| <section> |
| <title>Storing objects in the DataStore</title> |
| |
| <p> If we look back at the <code>parse()</code> method above, we can see that the |
| <code>Pageview</code> objects returned by <code>parseLine()</code> are stored via the |
| <code>storePageview()</code> method. </p> |
| |
| <p> The <code>storePageview()</code> method is where the magic happens, but if we look at the code, |
| we can see that it is dead simple. </p> |
| |
| <p><source> |
| /** Stores the pageview object with the given key */ |
| private void storePageview(long key, Pageview pageview) throws IOException { |
|   dataStore.put(key, pageview); |
| } |
| </source></p> |
| |
| <p> All we need to do is to call the <a href="ext:api/org/apache/gora/store/datastore/put"> |
| put()</a> method, which expects a long as key and an instance of <code>Pageview</code> |
| as a value.</p> |
| |
| </section> |
| |
| <section> |
| <title> Closing the DataStore</title> |
| <p> <code>DataStore</code> implementations can do a lot of caching for performance. |
| However, this means that data is not always flushed to persistent storage immediately. |
| So, once we have finished storing objects, we need to close the datastore |
| instance by calling its <a href="ext:api/org/apache/gora/store/datastore/close">close()</a> method. |
| LogManager always closes its datastore in its own <code>close()</code> method. </p> |
| |
| <p><source> |
| private void close() throws IOException { |
|   //It is very important to close the datastore properly, otherwise |
|   //some data loss might occur. |
|   if(dataStore != null) |
|     dataStore.close(); |
| } |
| </source></p> |
| |
| <p>If you are pushing a lot of data, or if you want your data to be accessible before closing |
| the data store, you can also call the <a href="ext:api/org/apache/gora/store/datastore/flush">flush()</a> |
| method which, as expected, flushes the data to the underlying data store. However, the actual flush |
| semantics can vary by the data store backend. For example, in SQL, flush calls <code>commit()</code> |
| on the JDBC <code>Connection</code> object, whereas in HBase, <code>HTable#flushCommits()</code> is called. |
| Also note that even if you call <code>flush()</code> at the end of all data manipulation operations, |
| you still need to call <code>close()</code> on the datastore. |
| </p> |
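| |
| <p>The effect of client-side buffering can be seen with plain JDK classes. The sketch below is an |
| analogy only: it uses <code>java.io.BufferedWriter</code> over a <code>StringWriter</code> in place of a |
| Gora data store, and the class name <code>FlushAnalogySketch</code> is hypothetical. It shows that |
| buffered data becomes visible in the underlying sink only after <code>flush()</code> or |
| <code>close()</code>; the exact behavior of a real <code>DataStore</code> depends on the backend. </p> |
| <p><source> |
```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

/** Analogy only: buffered writes are not visible in the sink until flushed or closed. */
class FlushAnalogySketch {

  /** Returns the sink length before flush(), after flush(), and after close(). */
  static int[] visibleLengths(String first, String second) {
    StringWriter sink = new StringWriter();           // stands in for persistent storage
    try {
      BufferedWriter writer = new BufferedWriter(sink); // stands in for a caching data store
      writer.write(first);
      int beforeFlush = sink.getBuffer().length();    // still held in the writer's buffer

      writer.flush();
      int afterFlush = sink.getBuffer().length();     // first record is now visible

      writer.write(second);
      writer.close();                                 // close() flushes what is left
      int afterClose = sink.getBuffer().length();

      return new int[] {beforeFlush, afterFlush, afterClose};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    int[] lengths = visibleLengths("pageview-0", "pageview-1");
    System.out.println("before flush: " + lengths[0]);
    System.out.println("after flush:  " + lengths[1]);
    System.out.println("after close:  " + lengths[2]);
  }
}
```
| </source></p> |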
| |
| </section> |
| |
| <section> |
| <title>Persisted data in HBase</title> |
| <p>Now that we have stored the web access log data in HBase, we can look at |
| how the data is stored in HBase. For that, start the HBase shell.</p> |
| <p><code>$ cd ../hbase-0.20.6</code></p> |
| <p><code>$ bin/hbase shell</code></p> |
| |
| <p> If you have a fresh HBase installation, there should be one table.</p> |
| <p><code>hbase(main):010:0> list</code></p> |
| <p><source> |
| AccessLog |
| 1 row(s) in 0.0470 seconds |
| </source></p> |
| <p> Remember that AccessLog is the name of the table we specified in |
| <code>gora-hbase-mapping.xml</code>. Looking at the contents of the table: </p> |
| |
| <p><code>hbase(main):010:0> scan 'AccessLog', {LIMIT=>1}</code></p> |
| <p><source> |
| ROW COLUMN+CELL |
| \x00\x00\x00\x00\x00\x00\x0 column=common:ip, timestamp=1285860617341, value=88.240.129.183 |
| 0\x00 |
| \x00\x00\x00\x00\x00\x00\x0 column=common:timestamp, timestamp=1285860617341, value=\x00\x00\x01\x1F\xF1\xAEl |
| 0\x00 P |
| \x00\x00\x00\x00\x00\x00\x0 column=common:url, timestamp=1285860617341, value=/index.php?a=1__wwv40pdxdpo&k=2 |
| 0\x00 18978 |
| \x00\x00\x00\x00\x00\x00\x0 column=http:httpMethod, timestamp=1285860617341, value=GET |
| 0\x00 |
| \x00\x00\x00\x00\x00\x00\x0 column=http:httpStatusCode, timestamp=1285860617341, value=\x00\x00\x00\xC8 |
| 0\x00 |
| \x00\x00\x00\x00\x00\x00\x0 column=http:responseSize, timestamp=1285860617341, value=\x00\x00\x00+ |
| 0\x00 |
| \x00\x00\x00\x00\x00\x00\x0 column=misc:referrer, timestamp=1285860617341, value=http://www.buldinle.com/inde |
| 0\x00 x.php?a=1__WWV40pdxdpo&k=218978 |
| \x00\x00\x00\x00\x00\x00\x0 column=misc:userAgent, timestamp=1285860617341, value=Mozilla/4.0 (compatible; MS |
| 0\x00 IE 6.0; Windows NT 5.1) |
| </source></p> |
| |
| <p>The output shows all the columns of the first row, which has key 0. We can see |
| the columns <code>common:ip</code>, <code>common:timestamp</code>, <code>common:url</code>, etc. Remember that |
| these are the columns that we described in the <code>gora-hbase-mapping.xml</code> |
| file. </p> |
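| |
| <p>The escaped byte strings in the scan output are easier to read with a small decoder at hand. |
| The sketch below assumes, based on the output above, that <code>long</code> and <code>int</code> values |
| are stored as fixed-width big-endian bytes; it is not taken from the Gora or HBase source. The class |
| <code>RowKeyBytesSketch</code> and its methods are hypothetical, and <code>toShellString()</code> only |
| approximates how the HBase shell prints bytes (printable ASCII as-is, everything else as <code>\xNN</code>). </p> |
| <p><source> |
```java
import java.nio.ByteBuffer;

/** Illustrative only: big-endian encoding of keys and values as seen in the scan output. */
class RowKeyBytesSketch {

  /** Encodes a long as 8 big-endian bytes (ByteBuffer's default byte order). */
  static byte[] encodeLong(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
  }

  /** Encodes an int as 4 big-endian bytes. */
  static byte[] encodeInt(int value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
  }

  /** Decodes 8 big-endian bytes back into a long. */
  static long decodeLong(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getLong();
  }

  /** Approximates the shell rendering: printable ASCII as-is, other bytes as \xNN. */
  static String toShellString(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      int ch = b & 0xFF;
      if (ch >= 0x20 && ch <= 0x7E && ch != '\\') {
        sb.append((char) ch);
      } else {
        sb.append(String.format("\\x%02X", ch));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println("key 0      -> " + toShellString(encodeLong(0L)));
    System.out.println("status 200 -> " + toShellString(encodeInt(200)));
    System.out.println("size 43    -> " + toShellString(encodeInt(43)));
  }
}
```
| </source></p> |
| <p>Under these assumptions, the row key of eight <code>\x00</code> bytes is the long 0, the |
| <code>http:httpStatusCode</code> value <code>\x00\x00\x00\xC8</code> is the int 200, and the trailing |
| <code>+</code> in <code>http:responseSize</code> is simply byte 43 printed as an ASCII character.</p> |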
| |
| <p> You can also count the number of entries in the table to make sure that all the records |
| have been stored.</p> |
| <p><code>hbase(main):010:0> count 'AccessLog'</code></p> |
| <p><source> |
| ... |
| 10000 row(s) in 1.0580 seconds |
| </source></p> |
| </section> |
| |
| <section> |
| <title>Fetching objects from data store</title> |
| <p> Fetching objects from the data store is as easy as storing them. There are essentially |
| two methods for fetching objects. The first is to fetch a single object given its key. The |
| second is to run a query through the data store. </p> |
| |
| <p>To fetch objects one by one, we can use one of the overloaded |
| <a href="ext:api/org/apache/gora/store/datastore/get">get()</a> methods. |
| The method with signature <code>get(K key)</code> returns the object corresponding to the given key fetching all the |
| fields. On the other hand <code>get(K key, String[] fields) </code> returns the object corresponding to the |
| given key, but fetching only the fields given as the second argument.</p> |
| |
| <p>When run with the argument <code>-get</code>, the <code>LogManager</code> class fetches the pageview object |
| from the data store and prints the results. </p> |
| |
| <p><source> |
| /** Fetches a single pageview object and prints it*/ |
| private void get(long key) throws IOException { |
|   Pageview pageview = dataStore.get(key); |
|   printPageview(pageview); |
| } |
| </source></p> |
| |
| <p> To display the pageview stored under key 42: </p> |
| <p><code>$ bin/gora logmanager -get 42 </code></p> |
| <p><source> |
| org.apache.gora.tutorial.log.generated.Pageview@321ce053 { |
| "url":"/index.php?i=0&a=1__rntjt9z0q9w&k=398179" |
| "timestamp":"1236710649000" |
| "ip":"88.240.129.183" |
| "httpMethod":"GET" |
| "httpStatusCode":"200" |
| "responseSize":"43" |
| "referrer":"http://www.buldinle.com/index.php?i=0&a=1__RnTjT9z0Q9w&k=398179" |
| "userAgent":"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" |
| } |
| </source></p> |
| </section> |
| |
| <section> |
| <title> Querying objects </title> |
| <p> The DataStore API defines a <a href="ext:api/org/apache/gora/query/query">Query</a> |
| interface to query the objects in the data store. Each data store implementation |
| can use a specific implementation of the <code>Query</code> interface. Queries are |
| instantiated by calling <a href="ext:api/org/apache/gora/store/datastore/newquery"> |
| DataStore#newQuery()</a>. When the query is run through the datastore, the results |
| are returned via the <a href="ext:api/org/apache/gora/query/result"> Result</a> |
| interface. Let's see how the LogManager class runs a query and displays the |
| results. </p> |
| |
| <p><source> |
| /** Queries and prints pageview objects that have keys between startKey and endKey*/ |
| private void query(long startKey, long endKey) throws IOException { |
|   Query<Long, Pageview> query = dataStore.newQuery(); |
|   //set the properties of query |
|   query.setStartKey(startKey); |
|   query.setEndKey(endKey); |
| |
|   Result<Long, Pageview> result = query.execute(); |
| |
|   printResult(result); |
| } |
| </source> </p> |
| |
| <p> After constructing a <a href="ext:api/org/apache/gora/query/query">Query</a>, its properties |
| are set via the setter methods. Then calling |
| <a href="ext:api/org/apache/gora/query/query/execute">query.execute()</a> returns |
| the Result object.</p> |
| |
| <p> The <a href="ext:api/org/apache/gora/query/result"> Result</a> interface allows us to |
| iterate over the results one by one by calling the <a href="ext:api/org/apache/gora/query/result/next"> |
| next()</a> method. The <a href="ext:api/org/apache/gora/query/result/getkey"> |
| getKey()</a> method returns the current key and <a href="ext:api/org/apache/gora/query/result/get"> |
| get()</a> returns the current persistent object. </p> |
| |
| <p><source> |
| private void printResult(Result<Long, Pageview> result) throws IOException { |
| |
| while(result.next()) { //advances the Result object and breaks if at end |
| long resultKey = result.getKey(); //obtain current key |
| Pageview resultPageview = result.get(); //obtain current value object |
| |
| //print the results |
| System.out.println(resultKey + ":"); |
| printPageview(resultPageview); |
| } |
| |
| System.out.println("Number of pageviews from the query:" + result.getOffset()); |
| } |
| </source> </p> |
| |
| <p>With these functions defined, we can run the LogManager class to query the |
| access logs in HBase. For example, to display the log records between lines 10 and 12, |
| we can use: </p> |
| |
| <p><code> bin/gora logmanager -query 10 12 </code></p> |
| |
| <p>Which results in:</p> |
| <p> <source> |
| 10: |
| org.apache.gora.tutorial.log.generated.Pageview@d38d0eaa { |
| "url":"/" |
| "timestamp":"1236710442000" |
| "ip":"144.122.180.55" |
| "httpMethod":"GET" |
| "httpStatusCode":"200" |
| "responseSize":"43" |
| "referrer":"http://buldinle.com/" |
| "userAgent":"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6" |
| } |
| 11: |
| org.apache.gora.tutorial.log.generated.Pageview@b513110a { |
| "url":"/index.php?i=7&a=1__gefuumyhl5c&k=5143555" |
| "timestamp":"1236710453000" |
| "ip":"85.100.75.104" |
| "httpMethod":"GET" |
| "httpStatusCode":"200" |
| "responseSize":"43" |
| "referrer":"http://www.buldinle.com/index.php?i=7&a=1__GeFUuMyHl5c&k=5143555" |
| "userAgent":"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7" |
| } |
| </source></p> |
| |
| </section> |
| |
| |
| <section> |
| <title>Deleting objects</title> |
| <p> Just like fetching objects, there are two main methods to delete |
| objects from the data store. The first one is to delete objects one by |
| one using the <a href="ext:api/org/apache/gora/store/datastore/delete"> |
| DataStore#delete(K)</a> method, which takes the key of the object. |
| Alternatively we can delete all of the data that matches a given query by |
| calling the <a href="ext:api/org/apache/gora/store/datastore/deletebyquery"> |
| DataStore#deleteByQuery(Query)</a> method. By using deleteByQuery, we can |
| do fine-grained deletes, for example deleting just a specific field |
| from several records. </p> |
| <p>Continuing with the LogManager class, the methods for both are given below.</p> |
| |
| <p> <source> |
| /**Deletes the pageview with the given line number */ |
| private void delete(long lineNum) throws Exception { |
| dataStore.delete(lineNum); |
| dataStore.flush(); //write changes may need to be flushed before |
| //they are committed |
| } |
| |
| /** This method illustrates delete by query call */ |
| private void deleteByQuery(long startKey, long endKey) throws IOException { |
| //Constructs a query from the dataStore. The matching rows to this query will be deleted |
| Query<Long, Pageview> query = dataStore.newQuery(); |
| //set the properties of query |
| query.setStartKey(startKey); |
| query.setEndKey(endKey); |
| |
| dataStore.deleteByQuery(query); |
| } |
| </source></p> |
| |
| <p>And from the command line: </p> |
| <p><code> bin/gora logmanager -delete 12 </code></p> |
| <p><code> bin/gora logmanager -deleteByQuery 40 50 </code></p> |
| |
| </section> |
| </section> |
| |
| <section> |
| <title>MapReduce Support</title> |
| <p>Gora has first-class MapReduce support for <a href="ext:hadoop">Apache Hadoop</a>. |
| Gora data stores can be used as inputs and outputs of jobs. Moreover, the objects can |
| be serialized and passed between tasks while keeping their persistence state. For the |
| serialization, Gora extends Avro DatumWriters. </p> |
| |
| <section> |
| <title> Log analytics in MapReduce </title> |
| <p> For this part of the tutorial, we will be analyzing the logs that were |
| stored in HBase earlier. Specifically, we will develop a MapReduce program to |
| calculate the number of daily pageviews for each URL in the site. </p> |
| |
| <p> We will be using the <code>LogAnalytics</code> class to analyze the logs, which can |
| be found at <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogAnalytics.java</code>. |
| To compute the analytics, the mapper takes in pageviews and emits |
| <URL, timestamp> tuples as keys, each with 1 as the value. The timestamp represents the day |
| on which the pageview occurred, so that the daily pageviews are accumulated. |
| The reducer just sums up the values and outputs <code>MetricDatum</code> objects |
| to be sent to the output Gora data store.</p> |
| </section> |
| |
| <section> |
| <title>Setting up the environment</title> |
| <p> We will be using the logs stored in HBase by the <code>LogManager</code> class. |
| We will push the output of the job to an HSQL database, since it requires no |
| configuration to set up. However, you can also use MySQL or HBase for storing the analytics results. |
| If you want to continue with HBase, you can skip the next sections. </p> |
| |
| <section> |
| <title> Setting up the database </title> |
| <p> First we need to download the HSQL dependencies. To do that, uncomment the following line |
| in <code>gora-tutorial/ivy/ivy.xml</code> (if you are using Maven, hsqldb should already be available). |
| Of course, MySQL users should uncomment the mysql dependency instead. </p> |
| <p><code><!--<dependency org="org.hsqldb" name="hsqldb" rev="2.0.0" conf="*->default"/>--> |
| </code></p> |
| |
| <p> Then we need to run ant so that the new dependencies can be downloaded. </p> |
| <p><code> $ ant </code></p> |
| |
| <p> If you are using MySQL, you should also set up the database server, create the database, |
| and grant the permissions needed to create tables, etc., so that Gora can run properly. </p> |
| </section> |
| |
| <section> |
| <title> Configuring Gora </title> |
| <p> We will put the configuration necessary to connect to the database in |
| <code>gora-tutorial/conf/gora.properties</code>. </p> |
| |
| <p> <source> |
| #JDBC properties for gora-sql module using HSQL |
| gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver |
| gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/goratest |
| |
| #JDBC properties for gora-sql module using MySQL |
| #gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver |
| #gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/goratest |
| #gora.sqlstore.jdbc.user=root |
| #gora.sqlstore.jdbc.password= |
| </source></p> |
| |
| <p> As expected, the <code>jdbc.driver</code> property is the JDBC driver class, |
| and <code>jdbc.url</code> is the JDBC connection URL. Moreover, <code>jdbc.user</code> |
| and <code>jdbc.password</code> can be specified if needed. More information on these |
| parameters can be found in the <a href="site:gora-sql">gora-sql</a> documentation. </p> |
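| <p> As an informal illustration of how such a file is read, the sketch below parses a fragment of |
| the properties shown above with the standard <code>java.util.Properties</code> class. The class name |
| <code>GoraPropertiesCheck</code> is hypothetical and not part of Gora, and this is not how Gora itself |
| loads its configuration; it only shows that lines starting with <code>#</code> are ignored as comments. </p> |
| <p><source> |
```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for inspecting gora.properties content; not part of Gora. */
final class GoraPropertiesCheck {

  /** Parses properties-file text; lines starting with '#' are treated as comments. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      // Reading from an in-memory string should not fail; rethrow unchecked if it does.
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) {
    String text = String.join("\n",
        "#JDBC properties for gora-sql module using HSQL",
        "gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver",
        "gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/goratest",
        "#gora.sqlstore.jdbc.user=root");

    Properties props = parse(text);
    System.out.println(props.getProperty("gora.sqlstore.jdbc.driver"));
    System.out.println(props.getProperty("gora.sqlstore.jdbc.url"));
    // The user line is commented out, so the fallback value is printed instead.
    System.out.println(props.getProperty("gora.sqlstore.jdbc.user", "(not set)"));
  }
}
```
| </source></p> |
| <p> Switching to MySQL therefore amounts to commenting out the HSQL lines and removing the |
| <code>#</code> from the MySQL ones. </p> |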
| </section> |
| </section> |
| |
| <section> |
| <title> Modelling the data </title> |
| |
| <section> |
| <title>Data Beans for Analytics</title> |
| <p> For web site analytics, we will be using a generic <code>MetricDatum</code> |
| data structure. It holds a string <code>metricDimension</code> field, a long |
| <code>timestamp</code> field, and a long <code>metric</code> field. The first two fields |
| are the dimensions of the web analytics data, and the last is the actual aggregate |
| metric value. For example we might have an instance <code>{metricDimension="/index", |
| timestamp=101, metric=12}</code>, representing that there have been 12 pageviews to |
| the URL "/index" for the given time interval 101. </p> |
| |
| <p>The Avro schema definition for <code>MetricDatum</code> can be found at |
| <code>gora-tutorial/src/main/avro/metricdatum.json</code>, and the compiled source |
| code at <code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/generated/MetricDatum.java</code>.</p> |
| <p><source> |
| { |
| "type": "record", |
| "name": "MetricDatum", |
| "namespace": "org.apache.gora.tutorial.log.generated", |
| "fields" : [ |
| {"name": "metricDimension", "type": "string"}, |
| {"name": "timestamp", "type": "long"}, |
| {"name": "metric", "type" : "long"} |
| ] |
| } |
| </source></p> |
| </section> |
| |
| <section> |
| <title>Data store mappings </title> |
| <p> We will store the job output data in the SQL backend, in order to |
| demonstrate how it is used. </p> |
| |
| <p> Similar to what we have seen with HBase, the gora-sql plugin reads its configuration from the |
| <code>gora-sql-mapping.xml</code> file. |
| Specifically, we will use the <code>gora-tutorial/conf/gora-sql-mapping.xml</code> file. </p> |
| |
| <p><source> |
| <gora-orm> |
| ... |
| <class name="org.apache.gora.tutorial.log.generated.MetricDatum" keyClass="java.lang.String" table="Metrics"> |
| <primarykey column="id" length="512"/> |
| <field name="metricDimension" column="metricDimension" length="512"/> |
| <field name="timestamp" column="ts"/> |
| <field name="metric" column="metric"/> |
| </class> |
| </gora-orm> |
| </source></p> |
| |
| <p> SQL mapping files contain one or more <code>class</code> elements as the children of <code>gora-orm</code>. |
| The key-value pair is declared in the <code>class</code> element. The <code>name</code> attribute is the |
| fully qualified name of the persistent class, and the <code>keyClass</code> attribute is the fully qualified |
| name of the key class. </p> |
| |
| <p>Children of the <code>class</code> element are <code>field</code> elements and one |
| <code>primarykey</code> element. Each <code>field</code> |
| element has a <code>name</code> and a <code>column</code> attribute, and optional |
| <code>jdbc-type</code>, <code>length</code> and <code>scale</code> attributes. |
| The <code>name</code> attribute contains |
| the name of the field in the persistent class, and the <code>column</code> attribute is the name of the |
| column in the database. The <code>primarykey</code> element maps the key to the primary key column. Currently, |
| Gora only supports tables with one primary key. </p> |
| |
| </section> |
| </section> |
| |
| <section> |
| <title> Constructing the job </title> |
| <p> In constructing the job object for Hadoop, we need to define whether we will use |
| Gora as job input, output or both. Gora defines |
| its own <a href="ext:api/org/apache/gora/mapreduce/gorainputformat">GoraInputFormat</a>, |
| and <a href="ext:api/org/apache/gora/mapreduce/goraoutputformat">GoraOutputFormat</a>, which |
| use <code>DataStore</code>s as input sources and output sinks for the jobs. |
| <code>Gora{In|Out}putFormat</code> classes define static methods to set up the job properly. |
| However, if the mapper or reducer extends Gora's mapper and reducer classes, |
| you can use the static methods defined in <a href="ext:api/org/apache/gora/mapreduce/goramapper">GoraMapper</a> and |
| <a href="ext:api/org/apache/gora/mapreduce/gorareducer">GoraReducer</a> since they are more convenient. </p> |
| |
| |
| <p> For this tutorial we will use Gora as both input and output. As can be seen from the |
| <code>createJob()</code> function, quoted below, we create the job |
| as normal, and set the input parameters via |
| <a href="ext:api/org/apache/gora/mapreduce/goramapper/initmapperjob">GoraMapper#initMapperJob()</a>, |
| and <a href="ext:api/org/apache/gora/mapreduce/gorareducer/initreducerjob">GoraReducer#initReducerJob() |
| </a>. <code>GoraMapper#initMapperJob()</code> takes a store and an optional query to fetch the data from. |
| When a query is given, only the results of the query are used as the input of the job; otherwise, all the records |
| are used. |
| The actual Mapper class and the map output key and value classes are passed to the <code>initMapperJob()</code> |
| function as well. <code>GoraReducer#initReducerJob()</code> accepts |
| the data store in which to store the job's output, as well as the actual reducer class. |
| The <code>initMapperJob</code> and |
| <code>initReducerJob</code> functions also have overloaded variants that take the data store class |
| rather than data store instances.</p> |
| |
| <p> |
| <source> |
| public Job createJob(DataStore<Long, Pageview> inStore |
| , DataStore<String, MetricDatum> outStore, int numReducer) throws IOException { |
| Job job = new Job(getConf()); |
| |
| job.setJobName("Log Analytics"); |
| job.setNumReduceTasks(numReducer); |
| job.setJarByClass(getClass()); |
| |
| /* Mappers are initialized with GoraMapper.initMapperJob() or |
| * GoraInputFormat.setInput()*/ |
| GoraMapper.initMapperJob(job, inStore, TextLong.class, LongWritable.class |
| , LogAnalyticsMapper.class, true); |
| |
| /* Reducers are initialized with GoraReducer#initReducerJob(). |
| * If the output is not to be persisted via Gora, any reducer |
| * can be used instead. */ |
| GoraReducer.initReducerJob(job, outStore, LogAnalyticsReducer.class); |
| |
| return job; |
| } |
| </source> |
| </p> |
| </section> |
| |
| <section> |
| <title> Gora mappers and using Gora as input </title> |
| <p> Typically, if Gora is used as job input, the Mapper class extends |
| <a href="ext:api/org/apache/gora/mapreduce/goramapper">GoraMapper</a>. However, the API |
| currently does not enforce this, so other class hierarchies can be used instead. |
| The mapper receives the key value pairs that are the results of the input query, and emits |
| the results of the custom map task. Note that output records from map are independent |
| from the input and output data stores, so any Hadoop serializable key value class can be used. |
| However, Gora persistent classes are also Hadoop serializable. Hadoop serialization is |
| handled by the <a href="ext:api/org/apache/gora/mapreduce/persistentserialization"> |
| PersistentSerialization</a> class. Gora also defines a <a href="ext:api/org/apache/gora/mapreduce/stringserialization"> |
| StringSerialization</a> class, to serialize strings easily. |
| </p> |
| |
| <p> Coming back to the code for the tutorial, we can see that <code>LogAnalytics</code> |
| class defines an inner class <code>LogAnalyticsMapper</code> which extends |
| <code>GoraMapper</code>. The map function receives <code>Long</code> keys which are the line |
| numbers, and <code>Pageview</code> values as read from the input data store. The map simply |
| rolls the timestamp up to the day (meaning that only the day of the timestamp is used), |
| and outputs the key as a tuple of <code><URL,day></code>. |
| </p> |
| |
| <p><source> |
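| //constant map output value; each pageview counts once |
| private LongWritable one = new LongWritable(1L); |
| //reusable <URL,day> output key, assumed to be initialized in setup() |
| private TextLong tuple; |
| |
| protected void map(Long key, Pageview pageview, Context context) |
| throws IOException, InterruptedException { |
| |
| Utf8 url = pageview.getUrl(); |
| long day = getDay(pageview.getTimestamp()); |
| |
| tuple.getKey().set(url.toString()); |
| tuple.getValue().set(day); |
| |
| context.write(tuple, one); |
| } |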
| </source></p> |
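| <p> The <code>getDay()</code> helper is not shown above. A minimal sketch of what such a roll-up could |
| look like is given below, assuming that timestamps are epoch milliseconds and that a day starts at |
| 00:00 UTC. The class name <code>DayRollup</code> and the method body are illustrative assumptions, not the |
| actual <code>LogAnalytics</code> code. </p> |
| <p><source> |
```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the tutorial's getDay() helper; not the actual LogAnalytics code. */
final class DayRollup {

  /** Number of milliseconds in one day. */
  static final long DAY_MILLIS = TimeUnit.DAYS.toMillis(1);

  /** Truncates an epoch-millisecond timestamp to 00:00 UTC of the same day. */
  static long getDay(long timestampMillis) {
    // Math.floorMod keeps the result correct for timestamps before 1970 as well.
    return timestampMillis - Math.floorMod(timestampMillis, DAY_MILLIS);
  }

  public static void main(String[] args) {
    // Timestamp of the pageview with key 10 in the query output shown earlier.
    long timestamp = 1236710442000L;
    System.out.println(getDay(timestamp)); // prints 1236643200000
  }
}
```
| </source></p> |
| <p> With this definition, every pageview from the same UTC day maps to the same value, so all of |
| them end up under the same <code><URL,day></code> key. </p> |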
| </section> |
| |
| <section> |
| <title> Gora reducers and using Gora as output</title> |
| <p>Similar to the input, typically, if Gora is used as job output, the Reducer extends |
| <a href="ext:api/org/apache/gora/mapreduce/gorareducer">GoraReducer</a>. The values |
| emitted by the reducer are persisted to the output data store as a result of the job. |
| </p> |
| |
| <p> For this tutorial, the <code>LogAnalyticsReducer</code> inner class, |
| which extends <code>GoraReducer</code>, is used as the reducer. The reducer |
| just sums up all the values that correspond to the <code><URL,day></code> tuple. |
| Then the <code>MetricDatum</code> object is constructed and emitted, to be |
| stored in the output data store. |
| </p> |
| |
| <p><source> |
| protected void reduce(TextLong tuple |
| , Iterable<LongWritable> values, Context context) |
| throws IOException ,InterruptedException { |
| |
| long sum = 0L; //sum up the values |
| for(LongWritable value: values) { |
| sum+= value.get(); |
| } |
| |
| String dimension = tuple.getKey().toString(); |
| long timestamp = tuple.getValue().get(); |
| |
| metricDatum.setMetricDimension(new Utf8(dimension)); |
| metricDatum.setTimestamp(timestamp); |
| |
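| //the row key combines the dimension and the day, so that each day gets its own row |
| String key = metricDatum.getMetricDimension().toString(); |
| key += "_" + Long.toString(timestamp); |
| metricDatum.setMetric(sum); |
| |
| context.write(key, metricDatum); |
| } |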
| </source></p> |
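| <p> To make the combined effect of the mapper and the reducer concrete, the following sketch performs the |
| same aggregation in plain Java, without Hadoop or Gora. The class <code>DailyPageviewCount</code>, the |
| sample data, and the <code>url + "_" + day</code> key format (modeled on the row keys visible in the HBase |
| scan output) are assumptions made for illustration only. </p> |
| <p><source> |
```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the daily pageview aggregation; it does not use Hadoop or Gora. */
final class DailyPageviewCount {

  static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

  /**
   * Counts pageviews per "url_day" key.
   * urls[i] and timestampsMillis[i] together describe the i-th pageview.
   */
  static Map<String, Long> countDaily(String[] urls, long[] timestampsMillis) {
    Map<String, Long> counts = new TreeMap<>();
    for (int i = 0; i < urls.length; i++) {
      // "map" step: roll the timestamp up to the start of its UTC day
      long day = timestampsMillis[i] - Math.floorMod(timestampsMillis[i], DAY_MILLIS);
      // "reduce" step: add 1 to the running total for this (url, day) pair
      counts.merge(urls[i] + "_" + day, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up sample data, not taken from the tutorial's log file.
    String[] urls = {"/", "/", "/index", "/"};
    long[] timestamps = {1236710442000L, 1236710453000L, 1236710453000L, 1236902400000L};

    countDaily(urls, timestamps)
        .forEach((key, count) -> System.out.println(key + " -> " + count));
  }
}
```
| </source></p> |
| <p> In the real job the grouping by key is done by the Hadoop shuffle rather than by an in-memory map, |
| which is what allows the computation to scale beyond a single machine. </p> |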
| </section> |
| |
| <section> |
| <title> Running the job </title> |
| <p> Now that the job is constructed, we can run the Hadoop job as usual. Note that the <code>run</code> function |
| of the <code>LogAnalytics</code> class parses the arguments and runs the job. We can run the program with: </p> |
| <p><code>$ bin/gora loganalytics [<input data store> [<output data store>]] </code></p> |
| |
| <section> |
| <title> Running the job with SQL </title> |
| <p>Now, let's run the log analytics tool with the SQL backend (either HSQLDB or MySQL). The input data store will be |
| <code>org.apache.gora.hbase.store.HBaseStore</code> and the output store will be |
| <code>org.apache.gora.sql.store.SqlStore</code>. Remember that we have already configured the database |
| connection properties and which database will be used in the <a href="#Setting+up+the+environment-N103D7"> |
| Setting up the environment</a> section. </p> |
| |
| <p><code>$ bin/gora loganalytics org.apache.gora.hbase.store.HBaseStore org.apache.gora.sql.store.SqlStore</code></p> |
| |
| <p> Now we should see some logging output from the job, including whether it finished successfully. To inspect the output |
| when using HSQLDB, the command below can be used. </p> |
| |
| <p><code>$ java -jar gora-tutorial/lib/hsqldb-2.0.0.jar</code></p> |
| |
| <p>For the connection URL, use the same URL that we provided in gora.properties. If, on the other hand, |
| MySQL is used, then we should be able to see the output using the mysql command line utility. </p> |
| |
| <p> The results of the job are stored in the Metrics table, which is defined in the <code>gora-sql-mapping.xml</code> |
| file. Running a select query over this data confirms that the daily pageview metrics for the web site are indeed stored. |
| To see the most popular pages, run: </p> |
| |
| <p><code>> SELECT METRICDIMENSION, TS, METRIC FROM metrics order by metric desc</code></p> |
| |
| <p><table> |
| <tr><th>METRICDIMENSION</th> <th>TS</th> <th>METRIC</th></tr> |
| <tr><td>/</td> <td> 1236902400000</td> <td> 220</td></tr> |
| <tr><td>/</td> <td> 1236988800000</td> <td> 212</td></tr> |
| <tr><td>/</td> <td> 1236816000000</td> <td> 191</td></tr> |
| <tr><td>/</td> <td> 1237075200000</td> <td> 155</td></tr> |
| <tr><td>/</td> <td> 1241395200000</td> <td> 111</td></tr> |
| <tr><td>/</td> <td> 1236643200000</td> <td> 110</td></tr> |
| <tr><td>/</td> <td> 1236729600000</td> <td> 95</td></tr> |
| <tr><td>/index.php?a=3__x8g0vi&k=5508310</td> <td> 1236816000000</td> <td> 45</td></tr> |
| <tr><td>/index.php?a=1__5kf9nvgrzos&k=208773</td> <td> 1236816000000</td> <td> 37</td></tr> |
| <tr><td>...</td> <td>...</td> <td>...</td></tr> |
| </table></p> |
| |
| <p>As you can see, the home page (<code>/</code>) for various days and some other pages are listed. |
| In total, 3033 rows are present in the metrics table. </p> |
| </section> |
| |
| <section> |
| <title>Running the job with HBase </title> |
| <p> Since HBaseStore is already defined as the default data store in <code>gora.properties</code>, |
| we can run the job with HBase as:</p> |
| <p><code>$ bin/gora loganalytics</code></p> |
| |
| <p>The output of the job will be saved in the Metrics table, whose layout is defined in the |
| <code>gora-hbase-mapping.xml</code> file. To see the results:</p> |
| |
| <p><code>hbase(main):010:0> scan 'Metrics', {LIMIT=>1}</code></p> |
| <p><source> |
| ROW COLUMN+CELL |
| /?a=1__-znawtuabsy&k=96804_ column=common:metric, timestamp=1289815441740, value=\x00\x00\x00\x00\x00\x00\x00 |
| 1236902400000 \x09 |
| /?a=1__-znawtuabsy&k=96804_ column=common:metricDimension, timestamp=1289815441740, value=/?a=1__-znawtuabsy& |
| 1236902400000 k=96804 |
| /?a=1__-znawtuabsy&k=96804_ column=common:ts, timestamp=1289815441740, value=\x00\x00\x01\x1F\xFD \xD0\x00 |
| 1236902400000 |
| 1 row(s) in 0.0490 seconds |
| </source></p> |
| |
| </section> |
| </section> |
| </section> |
| |
| <section> |
| <title>More Examples</title> |
| <p> Other than this tutorial, there are several places where you can find |
| examples of Gora in action. </p> |
| |
| <p>The first place to look is the examples directories |
| under the various Gora modules. All the modules have a <code><gora-module>/src/examples/</code> directory |
| under which some example classes can be found. In particular, there are some classes that are used for tests under |
| <code><gora-core>/src/examples/</code></p> |
| |
| <p>Second, the unit tests of the Gora modules can be consulted to see the API in use. The unit tests can be found |
| at <code><gora-module>/src/test/</code> </p> |
| |
| <p>The source code of projects using Gora can also be checked out as a reference. <a href="ext:nutch">Apache Nutch</a> is |
| one of the first-class users of Gora, so looking into how Nutch uses Gora is always a good idea. |
| </p> |
| <p> Please feel free to grab our <a href="http://gora.apache.org/images/powered-by-gora.png">poweredBy</a> sticker and embed it in anything backed by Apache Gora.</p> |
| </section> |
| |
| <section> |
| <title>Feedback</title> |
| <p> Finally, thank you for trying out Gora. If you find any bugs or have suggestions for improvement, |
| do not hesitate to give feedback on the dev@gora.apache.org <a href="ext:devmail">mailing list</a>. </p> |
| </section> |
| |
| </body> |
| </document> |