| ~~ Licensed to the Apache Software Foundation (ASF) under one or more |
| ~~ contributor license agreements. See the NOTICE file distributed with |
| ~~ this work for additional information regarding copyright ownership. |
| ~~ The ASF licenses this file to You under the Apache License, Version 2.0 |
| ~~ (the "License"); you may not use this file except in compliance with |
| ~~ the License. You may obtain a copy of the License at |
| ~~ |
| ~~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~~ |
| ~~ Unless required by applicable law or agreed to in writing, software |
| ~~ distributed under the License is distributed on an "AS IS" BASIS, |
| ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ~~ See the License for the specific language governing permissions and |
| ~~ limitations under the License. |
| |
| Executing a Query in HDFS |
| |
| |
| * 1. Connecting VXQuery with HDFS |
| |
| In order to read HDFS data, VXQuery needs access to the HDFS configuration |
| directory, which contains: |
| |
| core-site.xml |
| hdfs-site.xml |
| mapred-site.xml |
| |
| Some systems may automatically set this directory as a system environment |
| variable ("HADOOP_CONF_DIR"). If this is the case, VXQuery will retrieve |
| this automatically when attempting to perform HDFS queries. |
| |
| When this variable is not set, users will need to provide this directory as |
| a Command Line Option when executing VXQuery: |
| -hdfs-conf /path/to/hdfs/conf_folder |
| |
| |
| * 2. Running the Query |
| |
| |
| For files stored in HDFS there are 2 ways to access them from VXQuery. |
| |
| |
| [[a]] Reading them as whole files. |
| |
| |
| [[b]] Reading them block by block. |
| |
| |
| ** a. Reading them as whole files. |
| |
| For this option you only need to change the path to files. To define that your |
| file(s) exist and should be read from HDFS you must add <"hdfs:/"> in front |
| of the path. VXQuery will read the path of the files you request in your query |
| and try to locate them. |
| |
| |
| So in order to run a query that will read the input files from HDFS you need |
| to make sure that |
| |
| |
| a) The environmental variable is set for "HADOOP_CONF_DIR" or you pass the |
| directory location using -hdfs-conf |
| |
| |
| b) The path defined in your query begins with <hdfs://> and the full path to |
| the file(s). |
| |
| |
| c) The path exists on HDFS and the user that runs the query has read permission |
| to these files. |
| |
| |
| *** Example |
| |
| I want to find all the <books> that are published after 2004. |
| |
| |
| The file is located in HDFS in this path </user/hduser/store/books.xml> |
| |
| |
| My query will look like this: |
| |
| |
| ---------- |
| for $x in collection("hdfs://user/hduser/store") |
| where $x/year>2004 |
| return $x/title |
| ---------- |
| |
| |
| If I want only one file, the <<books.xml>> to be parsed from HDFS, my query will |
| look like this: |
| |
| |
| ---------- |
| for $x in doc("hdfs://user/hduser/store/books.xml") |
| where $x/year>2004 |
| return $x/title |
| ---------- |
| |
| |
| ** b. Reading them block by block |
| |
| |
| In order to use that option you need to modify your query. Instead of using the |
| <collection> or <doc> function to define your input file(s) you need to use |
| <collection-with-tag>. |
| |
| |
| <collection-with-tag> accepts two arguments, one is the path to the HDFS directory |
| you have stored your input files, and the second is a specific <<tag>> that exists |
| in the input file(s). This is the tag of the element that contains the fields that |
| your query is looking for. |
| |
| Other than these arguments, you do not need to change anything else in the query. |
| |
| Note: since this strategy is optimized to read block by block, the result will |
| include all elements with the given tag, regardless of depth within the xml tree. |
| |
| |
| *** Example |
| |
| The same example, using <<collection-with-tag>>. |
| |
| My input file <books.xml>: |
| |
| ----------------------------- |
| <?xml version="1.0" encoding="UTF-8"?> |
| <bookstore> |
| |
| <book> |
| <title lang="en">Everyday Italian</title> |
| <author>Giada De Laurentiis</author> |
| <year>2005</year> |
| <price>30.00</price> |
| </book> |
| |
| <book> |
| <title lang="en">Harry Potter</title> |
| <author>J K. Rowling</author> |
| <year>2005</year> |
| <price>29.99</price> |
| </book> |
| |
| <book> |
| <title lang="en">XQuery Kick Start</title> |
| <author>James McGovern</author> |
| <author>Per Bothner</author> |
| <author>Kurt Cagle</author> |
| <author>James Linn</author> |
| <author>Vaidyanathan Nagarajan</author> |
| <year>2003</year> |
| <price>49.99</price> |
| </book> |
| |
| <book> |
| <title lang="en">Learning XML</title> |
| <author>Erik T. Ray</author> |
| <year>2003</year> |
| <price>39.95</price> |
| </book> |
| |
| </bookstore> |
| ---------------------------- |
| |
| |
| My query will look like this: |
| |
| |
| ---------------------------- |
| for $x in collection-with-tag("hdfs://user/hduser/store","book")/book |
| where $x/year>2004 |
| return $x/title |
| ---------------------------- |
| |
| |
| Take notice that I defined the path to the directory containing the file(s) |
| and not the file, <collection-with-tag> expects the path to the directory. I also |
| added the </book> after the function. This is also needed, like <collection> and |
| <doc> functions, for the query to be parsed correctly. |