blob: fa736b8a5216b3fdb4cc7579498a50618a5d83ec [file] [log] [blame]
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Executing a Query in HDFS
* 1. Connecting VXQuery with HDFS
In order to read HDFS data, VXQuery needs access to the HDFS configuration
directory, which contains:
core-site.xml
hdfs-site.xml
mapred-site.xml
Some systems may automatically set this directory as a system environment
variable ("HADOOP_CONF_DIR"). If this is the case, VXQuery will retrieve
this automatically when attempting to perform HDFS queries.
When this variable is not set, users will need to provide this directory as
a Command Line Option when executing VXQuery:
-hdfs-conf /path/to/hdfs/conf_folder
* 2. Running the Query
For files stored in HDFS there are 2 ways to access them from VXQuery.
[[a]] Reading them as whole files.
[[b]] Reading them block by block.
** a. Reading them as whole files.
For this option you only need to change the path to files. To define that your
file(s) exist and should be read from HDFS you must add <"hdfs:/"> in front
of the path. VXQuery will read the path of the files you request in your query
and try to locate them.
So in order to run a query that will read the input files from HDFS you need
to make sure that
a) The environmental variable is set for "HADOOP_CONF_DIR" or you pass the
directory location using -hdfs-conf
b) The path defined in your query begins with <hdfs://> and the full path to
the file(s).
c) The path exists on HDFS and the user that runs the query has read permission
to these files.
*** Example
I want to find all the <books> that are published after 2004.
The file is located in HDFS in this path </user/hduser/store/books.xml>
My query will look like this:
----------
for $x in collection("hdfs://user/hduser/store")
where $x/year>2004
return $x/title
----------
If I want only one file, the <<books.xml>> to be parsed from HDFS, my query will
look like this:
----------
for $x in doc("hdfs://user/hduser/store/books.xml")
where $x/year>2004
return $x/title
----------
** b. Reading them block by block
In order to use that option you need to modify your query. Instead of using the
<collection> or <doc> function to define your input file(s) you need to use
<collection-with-tag>.
<collection-with-tag> accepts two arguments, one is the path to the HDFS directory
you have stored your input files, and the second is a specific <<tag>> that exists
in the input file(s). This is the tag of the element that contains the fields that
your query is looking for.
Other than these arguments, you do not need to change anything else in the query.
Note: since this strategy is optimized to read block by block, the result will
include all elements with the given tag, regardless of depth within the xml tree.
*** Example
The same example, using <<collection-with-tag>>.
My input file <books.xml>:
-----------------------------
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book>
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
----------------------------
My query will look like this:
----------------------------
for $x in collection-with-tag("hdfs://user/hduser/store","book")/book
where $x/year>2004
return $x/title
----------------------------
Take notice that I defined the path to the directory containing the file(s)
and not the file, <collection-with-tag> expects the path to the directory. I also
added the </book> after the function. This is also needed, like <collection> and
<doc> functions, for the query to be parsed correctly.