<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Quick Start
This tutorial provides a quick introduction to using the CarbonData Hive integration module (`integration/hive`).
## Build (in 1.2.0, the Hive integration only supports Spark 2.1 and Hadoop 2.7.2)
```
mvn -DskipTests -Pspark-2.1 -Phadoop-2.7.2 clean package
```
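If the build succeeds, the two jars used later in this guide should appear under the following paths (relative to the directory containing the carbondata checkout; exact file names vary with the version and Scala profile):
```
ls carbondata/assembly/target/scala-2.11/carbondata_2.11*.jar
ls carbondata/integration/hive/target/carbondata-hive-*.jar
```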
## Prepare CarbonData in Spark
* Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.
```
cd carbondata
cat > sample.csv << EOF
id,name,scale,country,salary
1,yuhai,1.77,china,33000.1
2,runlin,1.70,china,33000.2
EOF
```
* Copy the data to HDFS
```
$HADOOP_HOME/bin/hadoop fs -put sample.csv <hdfs store path>/sample.csv
```
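Optionally verify that the file arrived intact:
```
$HADOOP_HOME/bin/hadoop fs -cat <hdfs store path>/sample.csv
```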
* Add the following property to `$SPARK_CONF_DIR/hive-site.xml`
```xml
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.carbondata.hive.CarbonHiveMetastoreListener</value>
</property>
```
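A quick sanity check that the property is present in the file Hive will actually read (assuming the path above):
```
grep -A 2 "hive.metastore.pre.event.listeners" $SPARK_CONF_DIR/hive-site.xml
```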
* Start the Spark shell by running the following command from the Spark directory
```
./bin/spark-shell --jars <carbondata assembly jar path, carbon hive jar path>
```
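For example, with the jars produced by the build step above (these file names are hypothetical; substitute the actual paths from your build):
```
./bin/spark-shell --jars /path/to/carbondata/assembly/target/scala-2.11/carbondata_2.11-1.2.0-shade-hadoop2.7.2.jar,/path/to/carbondata/integration/hive/target/carbondata-hive-1.2.0.jar
```
Then, in the shell: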
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// HDFS locations for the CarbonData store, the Spark SQL warehouse
// and the Hive metastore database
val rootPath = "hdfs:///user/hadoop/carbon"
val storeLocation = s"$rootPath/store"
val warehouse = s"$rootPath/warehouse"
val metastoredb = s"$rootPath/metastore_db"

// create a Hive-enabled CarbonSession pointing at the store location
val carbon = SparkSession
  .builder()
  .enableHiveSupport()
  .config("spark.sql.warehouse.dir", warehouse)
  .config(org.apache.carbondata.core.constants.CarbonCommonConstants.STORE_LOCATION, storeLocation)
  .getOrCreateCarbonSession(storeLocation, metastoredb)

// create a CarbonData table, load the sample data into it and query it
carbon.sql("CREATE TABLE hive_carbon(id int, name string, scale decimal, country string, salary double) STORED BY 'carbondata'")
carbon.sql("LOAD DATA INPATH '<hdfs store path>/sample.csv' INTO TABLE hive_carbon")
carbon.sql("SELECT * FROM hive_carbon").show()
```
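After the load completes, the table's files should be visible under the store location (a sketch; the exact layout of segment directories depends on the CarbonData version):
```
$HADOOP_HOME/bin/hadoop fs -ls -R hdfs:///user/hadoop/carbon/store/default/hive_carbon
```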
## Query Data in Hive
### Configure the Hive classpath
```
mkdir hive/auxlibs/
cp carbondata/assembly/target/scala-2.11/carbondata_2.11*.jar hive/auxlibs/
cp carbondata/integration/hive/target/carbondata-hive-*.jar hive/auxlibs/
cp $SPARK_HOME/jars/spark-catalyst*.jar hive/auxlibs/
cp $SPARK_HOME/jars/scala*.jar hive/auxlibs/
export HIVE_AUX_JARS_PATH=hive/auxlibs/
```
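Before starting Hive, it is worth confirming that all five jars landed in the aux directory; in practice `HIVE_AUX_JARS_PATH` is safest as an absolute path:
```
ls hive/auxlibs/
export HIVE_AUX_JARS_PATH=$(pwd)/hive/auxlibs
```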
### Fix the Snappy native library issue
```
# copy the snappy-java jar shipped with Spark into the JVM extensions
# directory (the paths below are macOS examples; adjust for your platform)
cp $SPARK_HOME/jars/snappy-java-*.jar /Library/Java/Extensions/
export HADOOP_OPTS="-Dorg.xerial.snappy.lib.path=/Library/Java/Extensions -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/Users/apple/DEMO/tmp"
```
### Start the Hive client
```
$HIVE_HOME/bin/hive
```
### Query data from the Hive table
```
-- CarbonData writes table data into segment subdirectories,
-- so Hive must be told to read input directories recursively
set hive.mapred.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;

select * from hive_carbon;
select count(*) from hive_carbon;
select * from hive_carbon order by id;
```