<!DOCTYPE html>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<head>
<title>Using Blur - Apache Blur (Incubator) Documentation</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Bootstrap -->
<link href="resources/css/bootstrap.min.css" rel="stylesheet" media="screen">
<link href="resources/css/bs-docs.css" rel="stylesheet" media="screen">
</head>
<body>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="http://incubator.apache.org/blur">Apache Blur (Incubator)</a>
</div>
<div class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Main</a></li>
<li><a href="getting-started.html">Getting Started</a></li>
<li><a href="platform.html">Platform</a></li>
<li><a href="data-model.html">Data Model</a></li>
<li><a href="cluster-setup.html">Cluster Setup</a></li>
<li class="active"><a href="using-blur.html">Using Blur</a></li>
<li><a href="Blur.html">Blur API</a></li>
<li><a href="console.html">Console</a></li>
</ul>
</div>
</div>
</div>
<div class="container bs-docs-container">
<div class="row">
<div class="col-md-3">
<div class="bs-sidebar hidden-print affix" role="complementary">
<ul class="nav bs-sidenav">
<li><a href="#thrift-client">Thrift Client</a>
<ul class="nav">
<li><a href="#simple_connection_example">Getting A Client Example</a>
<ul class="nav">
<li><a href="#simple_connection_example_constr">&nbsp;&nbsp;Connection String</a></li>
<li><a href="#simple_connection_example_thrift">&nbsp;&nbsp;Thrift Client</a></li>
</ul>
</li>
<li><a href="#simple_query_example">Query Example</a></li>
<li><a href="#simple_query_example_data">Query Example with Data</a></li>
<li><a href="#simple_query_sorting">Query Example with Sorting</a></li>
<li><a href="#simple_faceting_example">Faceting Example</a></li>
<li><a href="#simple_fetch_data">Fetch Data</a></li>
<li><a href="#simple_mutate_example">Mutate Example</a></li>
<li><a href="#simple_shortened_mutate_example">Shortened Mutate Example</a></li>
</ul>
</li>
<li><a href="#shell">Shell</a>
<ul class="nav">
|||Shell-Menu|||
</ul>
</li>
<li><a href="#map-reduce">Map Reduce</a></li>
<li><a href="#csv-loader">CSV Loader</a></li>
<li><a href="#jdbc">JDBC</a></li>
</ul>
</div>
</div>
<div class="col-md-9" role="main">
<section>
<div class="page-header">
<h1 id="thrift-client">Thrift Client</h1>
</div>
<p>
The following examples use the Thrift API directly. You will need the following libraries at a minimum:
<ul>
<li>blur-thrift-*.jar</li>
<li>blur-util-*.jar</li>
<li>slf4j-api-1.6.1.jar</li>
<li>slf4j-log4j12-1.6.1.jar</li>
<li>commons-logging-1.1.1.jar</li>
<li>log4j-1.2.15.jar</li>
</ul>
<div class="bs-callout bs-callout-info"><h4>Note</h4>Other versions of these libraries could work, but these are the versions that Blur currently uses.</div>
</p>
<h3 id="simple_connection_example">Getting A Client Example</h3>
<p>
<br/>
<h4 id="simple_connection_example_constr">Connection String</h4>
The connection string can be parsed or constructed through the "Connection" object. If you are using the
parsed version there are several options. At a minimum you have to provide a hostname and port:
<pre><code class="bash">host1:40010</code></pre>
You can list multiple hosts:
<pre><code class="bash">host1:40010,host2:40010</code></pre>
You can add a SOCKS proxy server for each host:
<pre><code class="bash">host1:40010/proxyhost1:6001</code></pre>
You can also set the socket timeout in milliseconds, here 90 seconds (the default is 60 seconds):
<pre><code class="bash">host1:40010/proxyhost1:6001#90000</code></pre>
Multiple hosts with a different timeout:
<pre><code class="bash">host1:40010,host2:40010,host3:40010#90000</code></pre>
Here are all the options together:
<pre><code class="bash">host1:40010/proxyhost1:6001,host2:40010/proxyhost1:6001#90000</code></pre>
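To make the layout of the connection string concrete, here is a minimal parsing sketch. It is an illustration only and is not Blur's "Connection" class: the class name ConnectionStringSketch, the Entry type, and the rule that a trailing "#millis" applies to every host in the list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the connection string layout; this is NOT Blur's Connection class.
final class ConnectionStringSketch {

  // The 60 second default mentioned above, in milliseconds.
  static final int DEFAULT_TIMEOUT_MS = 60000;

  // One parsed entry. proxyHost is null and proxyPort is -1 when no proxy is given.
  static final class Entry {
    final String host;
    final int port;
    final String proxyHost;
    final int proxyPort;
    final int timeoutMs;

    Entry(String host, int port, String proxyHost, int proxyPort, int timeoutMs) {
      this.host = host;
      this.port = port;
      this.proxyHost = proxyHost;
      this.proxyPort = proxyPort;
      this.timeoutMs = timeoutMs;
    }
  }

  static List<Entry> parse(String connectionStr) {
    String hosts = connectionStr.trim();
    int timeoutMs = DEFAULT_TIMEOUT_MS;
    // Assumption: a trailing "#millis" applies to every host in the list.
    int hash = hosts.lastIndexOf('#');
    if (hash >= 0) {
      timeoutMs = Integer.parseInt(hosts.substring(hash + 1).trim());
      hosts = hosts.substring(0, hash);
    }
    List<Entry> entries = new ArrayList<Entry>();
    for (String part : hosts.split(",")) {
      // "host:port/proxyhost:proxyport" -> target and optional proxy.
      String[] targetAndProxy = part.trim().split("/", 2);
      String[] target = splitHostPort(targetAndProxy[0]);
      String proxyHost = null;
      int proxyPort = -1;
      if (targetAndProxy.length == 2) {
        String[] proxy = splitHostPort(targetAndProxy[1]);
        proxyHost = proxy[0];
        proxyPort = Integer.parseInt(proxy[1]);
      }
      entries.add(new Entry(target[0], Integer.parseInt(target[1]), proxyHost, proxyPort, timeoutMs));
    }
    return entries;
  }

  private static String[] splitHostPort(String hostPort) {
    int colon = hostPort.lastIndexOf(':');
    if (colon <= 0 || colon == hostPort.length() - 1) {
      throw new IllegalArgumentException("Expected host:port but got [" + hostPort + "]");
    }
    return new String[] { hostPort.substring(0, colon).trim(), hostPort.substring(colon + 1).trim() };
  }

  public static void main(String[] args) {
    for (Entry e : parse("host1:40010/proxyhost1:6001,host2:40010/proxyhost1:6001#90000")) {
      System.out.println(e.host + ":" + e.port + " via " + e.proxyHost + ":" + e.proxyPort
          + " timeout=" + e.timeoutMs + "ms");
    }
  }
}
```

In real code, pass the connection string to BlurClient.getClient or BlurClientManager as shown in the next section rather than parsing it yourself.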
<br/>
<h4 id="simple_connection_example_thrift">Thrift Client</h4>
Client Example 1:
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");</code></pre>
Client Example 2:
<pre><code class="java">Connection connection = new Connection("controller1:40010");
Iface client = BlurClient.getClient(connection);</code></pre>
Client Example 3:
<pre><code class="java">BlurClientManager.execute("controller1:40010,controller2:40010", new BlurCommand&lt;T&gt;() {
@Override
public T call(Client client) throws BlurException, TException {
// your code here...
}
});</code></pre>
Client Example 4:
<pre><code class="java">List&lt;Connection&gt; connections = BlurClientManager.getConnections("controller1:40010,controller2:40010");
BlurClientManager.execute(connections, new BlurCommand&lt;T&gt;() {
@Override
public T call(Client client) throws BlurException, TException {
// your code here...
}
});</code></pre>
</p>
<h3 id="simple_query_example">Query Example</h3>
<p>
This is a simple example of how to run a query via the Thrift API and get back search results. By default,
the first 10 results are returned and each result contains only the row id.
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery(&quot;+docs.body:\&quot;Hadoop is awesome\&quot;&quot;);
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
BlurResults results = client.query(&quot;table1&quot;, blurQuery);
System.out.println(&quot;Total Results: &quot; + results.totalResults);
for (BlurResult result : results.getResults()) {
&nbsp;&nbsp;System.out.println(result);
}
</code></pre>
<h3 id="simple_query_example_data">Query Example with Data</h3>
<p>
This is an example of how to run a query via the Thrift API and get back search results with data. All
the columns in the "fam0" family, plus the "col1" and "col2" columns in the "fam1" family, are returned for each Record in the Row.
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery(&quot;+docs.body:\&quot;Hadoop is awesome\&quot;&quot;);
Selector selector = new Selector();
// This will fetch all the columns in family "fam0".
selector.addToColumnFamiliesToFetch("fam0");
// This will fetch the "col1", "col2" columns in family "fam1".
Set&lt;String&gt; cols = new HashSet&lt;String&gt;();
cols.add("col1");
cols.add("col2");
selector.putToColumnsToFetch("fam1", cols);
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
blurQuery.setSelector(selector);
BlurResults results = client.query(&quot;table1&quot;, blurQuery);
System.out.println(&quot;Total Results: &quot; + results.totalResults);
for (BlurResult result : results.getResults()) {
&nbsp;&nbsp;System.out.println(result);
}
</code></pre>
<h3 id="simple_query_sorting">Query Example with Sorting</h3>
<p>
This is an example of how to run a query via the Thrift API and get back search results with
data being sorted by the &quot;docs.timestamp&quot; column. All the columns in the records
will be returned.
<div class="bs-callout bs-callout-info"><h4>Note</h4>Sorting is only allowed on Record queries at this point.</div>
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery(&quot;+docs.body:\&quot;Hadoop is awesome\&quot;&quot;);
query.setRowQuery(false);
Selector selector = new Selector();
selector.setRecordOnly(true);
BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
blurQuery.setSelector(selector);
blurQuery.addToSortFields(new SortField(&quot;docs&quot;, &quot;timestamp&quot;, true));
BlurResults results = client.query(&quot;table1&quot;, blurQuery);
System.out.println(&quot;Total Results: &quot; + results.totalResults);
for (BlurResult result : results.getResults()) {
&nbsp;&nbsp;System.out.println(result);
}
</code></pre>
<h3 id="simple_faceting_example">Faceting Example</h3>
<p>
This is an example of how to use the faceting feature in a query. This API will likely be updated in a future version.
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Query query = new Query();
query.setQuery(&quot;+docs.body:\&quot;Hadoop is awesome\&quot;&quot;);
final BlurQuery blurQuery = new BlurQuery();
blurQuery.setQuery(query);
// This facet stops counting once the count reaches 10000. The limit is applied on each
// server independently, so the merged count you receive can be larger than your maximum.
blurQuery.addToFacets(new Facet("fam1.col1:value1 OR fam1.col1:value2", 10000));
blurQuery.addToFacets(new Facet("fam1.col1:value100 AND fam1.col1:value200", Long.MAX_VALUE));
BlurResults results = client.query(&quot;table1&quot;, blurQuery);
System.out.println(&quot;Facet Results:&quot;);
List&lt;Long&gt; facetCounts = results.getFacetCounts();
List&lt;Facet&gt; facets = blurQuery.getFacets();
for (int i = 0; i &lt; facets.size(); i++) {
&nbsp;&nbsp;System.out.println("Facet [" + facets.get(i) + "] got [" + facetCounts.get(i) + "]");
}
// The same response also carries the regular search results.
System.out.println(&quot;Total Results: &quot; + results.totalResults);
for (BlurResult result : results.getResults()) {
&nbsp;&nbsp;System.out.println(result);
}</code></pre>
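<p>
The comment in the example above notes that a facet limit is applied on each server, so the merged count can exceed the limit. The following sketch models that arithmetic. It is not part of the Blur API: the class FacetLimitSketch and the assumption that each server stops exactly at the limit are made up for illustration.
</p>

```java
// Hypothetical model of per-server facet limits; not part of the Blur API.
final class FacetLimitSketch {

  // Count reported by a single server.
  // Assumption: the server stops counting exactly when the limit is reached.
  static long countOnServer(long matchesOnServer, long limit) {
    return Math.min(matchesOnServer, limit);
  }

  // Count seen by the client: the sum of the per-server counts.
  static long mergedCount(long[] matchesPerServer, long limit) {
    long total = 0;
    for (long matches : matchesPerServer) {
      total += countOnServer(matches, limit);
    }
    return total;
  }

  public static void main(String[] args) {
    // Three servers holding 25000, 4000 and 12000 matching records.
    long[] matchesPerServer = { 25000L, 4000L, 12000L };
    // 10000 + 4000 + 10000, which is larger than the limit of 10000.
    System.out.println(mergedCount(matchesPerServer, 10000L)); // prints 24000
  }
}
```

<p>
With three servers and a limit of 10000, the merged count in this model can be anywhere up to 30000, so treat the limit as a per-server cap rather than a cap on the final number.
</p>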
<h3 id="simple_fetch_data">Fetch Data</h3>
<p>
This is an example of how to fetch data via the Thrift API. All the records
of the Row "rowid1" are returned. If the Row is not found, the returned Row is null.
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Selector selector = new Selector();
selector.setRowId("rowid1");
FetchResult fetchRow = client.fetchRow("table1", selector);
FetchRowResult rowResult = fetchRow.getRowResult();
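Row row = rowResult.getRow();
// The Row is null when "rowid1" does not exist in the table.
if (row != null) {
&nbsp;&nbsp;for (Record record : row.getRecords()) {
&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(record);
&nbsp;&nbsp;}
}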
</code></pre>
<h3 id="simple_mutate_example">Mutate Example</h3>
<p>
This is an example of how to perform a mutate on a table to add a new Row or replace an existing one.
</p>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
Record record1 = new Record();
record1.setRecordId("recordid1");
record1.setFamily("fam0");
record1.addToColumns(new Column("col0", "val0"));
record1.addToColumns(new Column("col1", "val1"));
Record record2 = new Record();
record2.setRecordId("recordid2");
record2.setFamily("fam1");
record2.addToColumns(new Column("col4", "val4"));
record2.addToColumns(new Column("col5", "val5"));
List&lt;RecordMutation&gt; recordMutations = new ArrayList&lt;RecordMutation&gt;();
recordMutations.add(new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record1));
recordMutations.add(new RecordMutation(RecordMutationType.REPLACE_ENTIRE_RECORD, record2));
// This will replace the existing Row of "rowid1" (if one exists) in table "table1". It will
// write the mutate to the write ahead log (WAL) and it will not block waiting for the
// mutate to become visible.
RowMutation mutation = new RowMutation("table1", "rowid1", true, RowMutationType.REPLACE_ROW,
recordMutations, false);
mutation.setRecordMutations(recordMutations);
client.mutate(mutation);
</code></pre>
<h3 id="simple_shortened_mutate_example">Shortened Mutate Example</h3>
<p>
This is the same example as above, shortened by using a helper class.
</p>
<pre><code class="java">import static org.apache.blur.thrift.util.BlurThriftHelper.*;
Iface client = BlurClient.getClient("controller1:40010,controller2:40010");
// This will replace the existing Row of "rowid1" (if one exists) in table "table1". It will
// write the mutate to the write ahead log (WAL) and it will not block waiting for the
// mutate to become visible.
RowMutation mutation = newRowMutation("table1", "rowid1",
newRecordMutation("fam0", "recordid1", newColumn("col0", "val0"), newColumn("col1", "val1")),
newRecordMutation("fam1", "recordid2", newColumn("col4", "val4"), newColumn("col5", "val5")));
client.mutate(mutation);
</code></pre>
</section>
<section>
<div class="page-header">
<h1 id="shell">Shell</h1>
</div>
<p>
The shell can be invoked by running:
<pre><code class="bash">$BLUR_HOME/bin/blur shell</code></pre>
Any shell command can also be invoked as a CLI command by running:
<pre><code class="bash">$BLUR_HOME/bin/blur &lt;command&gt;
# For example to get help
$BLUR_HOME/bin/blur help
</code></pre>
The following rules are used when interacting with the shell:
<ul>
<li>Arguments are denoted by &quot;&lt; &gt;&quot;.</li>
<li>Optional arguments are denoted by &quot;[ ]&quot;.</li>
<li>Options are denoted by &quot;-&quot;.</li>
<li>Multiple options / arguments are denoted by &quot;*&quot;.</li>
</ul>
</p>
|||Shell-Body|||
</section>
<section>
<div class="page-header">
<h1 id="map-reduce">Map Reduce</h1>
</div>
<p>Here is an example of the typical usage of the BlurOutputFormat. The Blur table has to be created before the MapReduce job is started. The setupJob method configures the following:</p>
<ul>
<li>The reducer class to be DefaultBlurReducer</li>
<li>The number of reducers to be equal to the number of shards in the table</li>
<li>The output key class to be the standard Text writable from the Hadoop library</li>
<li>The output value class to be the BlurMutate writable from the Blur library</li>
<li>The output format to be BlurOutputFormat</li>
<li>The TableDescriptor in the Configuration</li>
<li>The output path to be the TableDescriptor.getTableUri() value</li>
<li>The output committer to be BlurOutputCommitter, which commits or rolls back the MapReduce job</li>
</ul>
<h3>Example Usage</h3>
<pre><code class="java">Iface client = BlurClient.getClient("controller1:40010");
TableDescriptor tableDescriptor = client.describe(tableName);
Job job = new Job(jobConf, "blur index");
job.setJarByClass(BlurOutputFormatTest.class);
job.setMapperClass(CsvBlurMapper.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(input));
CsvBlurMapper.addColumns(job, "cf1", "col");
BlurOutputFormat.setupJob(job, tableDescriptor);
BlurOutputFormat.setIndexLocally(job, true);
BlurOutputFormat.setOptimizeInFlight(job, true);
job.waitForCompletion(true);</code></pre>
<h3>Options</h3>
<ul>
<li>
BlurOutputFormat.setIndexLocally(Job,boolean)
<ul><li>Enabled by default. The index is built locally on the machine where the task is running, and when the RecordWriter closes, the index is copied to the remote destination in HDFS.</li></ul>
</li>
<li>
BlurOutputFormat.setMaxDocumentBufferSize(Job,int)
<ul><li>Sets the maximum number of documents that the buffer will hold in memory before overflowing to disk. The default is 1000, which is probably too low for most systems.</li></ul>
</li>
<li>
BlurOutputFormat.setOptimizeInFlight(Job,boolean)
<ul><li>Enabled by default. The index is optimized while it is copied from the local index to the remote destination in HDFS. Used in conjunction with setIndexLocally.</li></ul>
</li>
<li>
BlurOutputFormat.setReducerMultiplier(Job,int)
<ul><li>This will multiply the number of reducers for this job. For example, if the table has 256 shards, the normal number of reducers is 256. However, if the reducer multiplier is set to 4, the number of reducers will be 1024 and each shard will get 4 new segments instead of the normal 1.</li></ul>
</li>
</ul>
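<p>
The reducer multiplier arithmetic described above can be checked with a few lines of code. This is a sketch under stated assumptions: ReducerMultiplierSketch and its method are hypothetical helpers and are not part of BlurOutputFormat.
</p>

```java
// Hypothetical helper that reproduces the reducer arithmetic; not part of BlurOutputFormat.
final class ReducerMultiplierSketch {

  // Total reducers for a job: one reducer per shard, multiplied by the reducer multiplier.
  // Each shard then receives one new segment per reducer assigned to it.
  static int totalReducers(int shardCount, int reducerMultiplier) {
    if (shardCount < 1 || reducerMultiplier < 1) {
      throw new IllegalArgumentException("shardCount and reducerMultiplier must be at least 1");
    }
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    return Math.multiplyExact(shardCount, reducerMultiplier);
  }

  public static void main(String[] args) {
    System.out.println(totalReducers(256, 1)); // prints 256
    System.out.println(totalReducers(256, 4)); // prints 1024
  }
}
```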
</section>
<section>
<div class="page-header">
<h1 id="csv-loader">CSV Loader</h1>
</div>
<p>
The CSV Loader program can be invoked by running:<pre><code class="bash">$BLUR_HOME/bin/blur csvloader</code></pre>
<div class="bs-callout bs-callout-warning"><h4>Caution</h4>The machine that executes this command must have Hadoop installed and configured locally;
otherwise the scripts will not work correctly.</div>
<pre><code class="bash">usage: csvloader
The "csvloader" command is used to load delimited data into a Blur table.
The required options are "-c", "-t", "-d". The standard format for the contents of a file
is: "rowid,recordid,family,col1,col2,...". However there are several options: the rowid and
recordid can be generated based on the data in the record via the "-A" and "-a" options. The family
can be assigned based on the path via the "-I" option. The column name order can be mapped via the "-d"
option. Also you can set the input format to sequence files via the "-S" option or leave the
default text files.
-A No Row Ids - Automatically generate row ids for each record based on an MD5
hash of the data within the record.
-a No Record Ids - Automatically generate record ids for each record based on an
MD5 hash of the data within the record.
-b &lt;size&gt; The maximum number of Lucene documents to buffer in the reducer for a single
row before spilling over to disk. (default 1000)
-c &lt;controller*&gt; * Thrift controller connection string. (host1:40010 host2:40010 ...)
-C &lt;minimum maximum&gt; Enables a combine file input to help deal with many small files as the
input. Provide the minimum and maximum size per mapper. For a minimum of
1GB and a maximum of 2.5GB: (1000000000 2500000000)
-d &lt;family column*&gt; * Define the mapping of fields in the CSV file to column names. (family col1
col2 col3 ...)
-I &lt;family path*&gt; The directory to index with a family name, the family name is assumed to NOT
be present in the file contents. (family hdfs://namenode/input/in1)
-i &lt;path*&gt; The directory to index, the family name is assumed to BE present in the file
contents. (hdfs://namenode/input/in1)
-l Disable the use of local storage on the server that is running the reducing
task and copy to Blur table once complete. (enabled by default)
-o Disable optimize indexes during copy, this has very little overhead.
(enabled by default)
-p &lt;codec&gt; Sets the compression codec for the map compress output setting.
(SNAPPY,GZIP,BZIP,DEFAULT, or classname)
-r &lt;multiplier&gt; The reducer multiplier allows for an increase in the number of reducers per
shard in the given table. For example if the table has 128 shards and the
reducer multiplier is 4 the total number of reducers will be 512, 4 reducers
per shard. (default 1)
-s &lt;delimiter&gt; The file delimiter to be used. (default value ',') NOTE: For special
characters like the default hadoop separator of ASCII value 1, you can use
standard java escaping (\u0001)
-S The input files are sequence files.
-t &lt;tablename&gt; * Blur table name.</code></pre>
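The usage text above says that "-A" and "-a" generate ids from an MD5 hash of the data within the record. The sketch below shows the general idea with the standard Java MessageDigest API. It is an assumption-laden illustration: the class Md5IdSketch is hypothetical, and joining the column values with a comma before hashing is a guess, so ids produced here are not expected to match the ids csvloader generates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of deriving an id from record data; not csvloader's actual code.
final class Md5IdSketch {

  // Returns the MD5 digest of the given text as 32 lowercase hex characters.
  static String md5Hex(String recordData) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(recordData.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform implementation is required to provide MD5.
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  // Assumption: the record data is the column values joined with a comma.
  static String generateId(String... columnValues) {
    return md5Hex(String.join(",", columnValues));
  }

  public static void main(String[] args) {
    // The same record data always yields the same id.
    System.out.println(generateId("val0", "val1"));
    System.out.println(generateId("val0", "val1"));
  }
}
```

Because the id is derived only from the record data, loading the same record twice in this model produces the same id, so the second load replaces the first instead of creating a duplicate.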
</p>
</section>
<section>
<div class="page-header">
<h1 id="jdbc">JDBC</h1>
</div>
<p>
The JDBC driver is experimental and currently read-only. It supports a very basic SQL-like
language that should cover most Blur queries.
Basic SQL syntax will work, for example:<br/>
<pre><code class="bash">select * from testtable where fam1.col1 = 'val1'</code></pre>
You may also use Lucene syntax by wrapping the Lucene query in a &quot;query()&quot; function:<br/>
<pre><code class="bash">select * from testtable where query(fam1.col1:val?)</code></pre>
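The two query forms above can be issued through the standard java.sql API once a connection exists. The sketch below is a hedged illustration: BlurJdbcQuerySketch and its methods are hypothetical, the driver class and JDBC URL are not covered on this page and are left to the caller, and it assumes the driver supports ResultSetMetaData. The string builders do no escaping and are for illustration only.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for the two query forms shown above; not part of Blur.
final class BlurJdbcQuerySketch {

  // Builds the basic SQL form. No escaping is done; illustration only.
  static String sqlQuery(String table, String column, String value) {
    return "select * from " + table + " where " + column + " = '" + value + "'";
  }

  // Wraps a Lucene query in the "query()" function.
  static String luceneQuery(String table, String lucene) {
    return "select * from " + table + " where query(" + lucene + ")";
  }

  // Runs a read-only query on a connection supplied by the caller and prints each row.
  // Assumption: the driver implements ResultSetMetaData.
  static void printRows(Connection connection, String sql) throws SQLException {
    try (Statement statement = connection.createStatement();
         ResultSet rs = statement.executeQuery(sql)) {
      ResultSetMetaData meta = rs.getMetaData();
      while (rs.next()) {
        StringBuilder row = new StringBuilder();
        for (int i = 1; i <= meta.getColumnCount(); i++) {
          if (i > 1) {
            row.append('\t');
          }
          row.append(meta.getColumnLabel(i)).append('=').append(rs.getString(i));
        }
        System.out.println(row);
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(sqlQuery("testtable", "fam1.col1", "val1"));
    System.out.println(luceneQuery("testtable", "fam1.col1:val?"));
  }
}
```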
Here is a screenshot of the JDBC driver in the SQuirreL SQL Client:
<img src="resources/img/SQuirrel.png">
<br/>
<br/>
<br/>
<br/>
</p>
</section>
</div>
</div>
</div>
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="resources/js/jquery-2.0.3.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="resources/js/bootstrap.min.js"></script>
<!-- Enable responsive features in IE8 with Respond.js (https://github.com/scottjehl/Respond) -->
<script src="resources/js/respond.min.js"></script>
<script src="resources/js/docs.js"></script>
</body>
</html>