<!DOCTYPE html><html><head><meta charset="utf-8"><title>Installation Guide</title><style></style></head><body id="preview">
<p>&lt;!--<br>
Licensed to the Apache Software Foundation (ASF) under one<br>
or more contributor license agreements. See the NOTICE file<br>
distributed with this work for additional information<br>
regarding copyright ownership. The ASF licenses this file<br>
to you under the Apache License, Version 2.0 (the<br>
“License”); you may not use this file except in compliance<br>
with the License. You may obtain a copy of the License at</p>
<pre><code> http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
&quot;AS IS&quot; BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
</code></pre>
<p>--&gt;</p>
<h1><a id="Installation_Guide_19"></a>Installation Guide</h1>
<p>This tutorial guides you through the installation and configuration of CarbonData in the following two modes:</p>
<ul>
<li><a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Cluster+deployment+guide">On Standalone Spark cluster</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Cluster+deployment+guide">On Spark on Yarn cluster</a></li>
</ul>
<p>followed by:</p>
<ul>
<li><a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Cluster+deployment+guide">Query Execution using Carbon Thrift Server</a></li>
</ul>
<h2><a id="Installing_and_Configuring_CarbonData_on_Standalone_Spark_Cluster_28"></a>Installing and Configuring CarbonData on “Standalone Spark” Cluster</h2>
<h3><a id="Prerequisite_30"></a>Prerequisite</h3>
<ul>
<li>Hadoop HDFS and YARN should be installed and running.</li>
<li>Spark should be installed and running on all the client machines.</li>
<li>The CarbonData user should have permission to access HDFS.</li>
</ul>
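<p>To check the HDFS prerequisite, the CarbonData user can run a quick write and list test (a minimal sketch; the store path is only an example):</p>
<pre><code># Run as the CarbonData user; verifies write access to HDFS (example path)
hdfs dfs -mkdir -p /Opt/CarbonStore
hdfs dfs -ls /Opt
</code></pre>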
<h3><a id="Procedure_35"></a>Procedure</h3>
<p>The following steps apply only to driver nodes. (Driver nodes are the nodes that start the Spark context.) A shell sketch of steps 1-4 follows the list.</p>
<ol>
<li>
<p><a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Building+CarbonData+And+IDE+Configuration">Build the CarbonData</a> project and get the assembly jar from “./assembly/target/scala-2.10/carbondata_xxx.jar” and put in the “&lt;SPARK_HOME&gt;/carbonlib” folder.</p>
<p>(Note: - Create the carbonlib folder if does not exists inside “&lt;SPARK_HOME&gt;” path.)</p>
</li>
<li>
<p>The carbonlib folder path must be added to the Spark classpath. (Edit the “&lt;SPARK_HOME&gt;/conf/spark-env.sh” file and modify the value of SPARK_CLASSPATH by appending “&lt;SPARK_HOME&gt;/carbonlib/*” to the existing value.)</p>
</li>
<li>
<p>Copy “carbon.properties.template” from the “./conf/” folder of the CarbonData repository to “&lt;SPARK_HOME&gt;/conf/carbon.properties”.</p>
</li>
<li>
<p>Copy the “carbonplugins” folder from the “./processing/” folder of the CarbonData repository to the “&lt;SPARK_HOME&gt;/carbonlib” folder.</p>
<p>(Note: carbonplugins contains the .kettle folder.)</p>
</li>
<li>
<p>On the Spark node, configure the properties mentioned in the table below in the “&lt;SPARK_HOME&gt;/conf/spark-defaults.conf” file.</p>
</li>
</ol>
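<p>The first four steps can be scripted as follows (a minimal sketch, assuming the commands are run from the root of the CarbonData repository, SPARK_HOME is set, and carbondata_xxx.jar stands in for the actual assembly jar name):</p>
<pre><code># 1. Put the assembly jar into the carbonlib folder (create it if missing)
mkdir -p &quot;$SPARK_HOME/carbonlib&quot;
cp ./assembly/target/scala-2.10/carbondata_xxx.jar &quot;$SPARK_HOME/carbonlib/&quot;

# 2. Append the carbonlib folder to the Spark classpath
echo &quot;SPARK_CLASSPATH=\$SPARK_CLASSPATH:$SPARK_HOME/carbonlib/*&quot; &gt;&gt; &quot;$SPARK_HOME/conf/spark-env.sh&quot;

# 3. Copy the carbon.properties template
cp ./conf/carbon.properties.template &quot;$SPARK_HOME/conf/carbon.properties&quot;

# 4. Copy the carbonplugins folder (contains the .kettle folder)
cp -r ./processing/carbonplugins &quot;$SPARK_HOME/carbonlib/&quot;
</code></pre>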
<table class="table table-striped table-bordered">
<thead>
<tr>
<th>Property</th>
<th>Description</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>carbon.kettle.home</td>
<td>Path that will be used by CarbonData internally to create the graph for data loading.</td>
<td>$SPARK_HOME/carbonlib/carbonplugins</td>
</tr>
<tr>
<td>spark.driver.extraJavaOptions</td>
<td>A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.</td>
<td>-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties</td>
</tr>
<tr>
<td>spark.executor.extraJavaOptions</td>
<td>A string of extra JVM options to pass to executors. For instance, GC settings or other logging. NOTE: You can enter multiple values separated by a space.</td>
<td>-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties</td>
</tr>
</tbody>
</table>
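<p>After this step, the appended lines in “&lt;SPARK_HOME&gt;/conf/spark-defaults.conf” might look like the following (a sketch; replace /opt/spark with the absolute path of your Spark installation, since spark-defaults.conf does not expand environment variables):</p>
<pre><code>carbon.kettle.home               /opt/spark/carbonlib/carbonplugins
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties
</code></pre>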
<ol start="6">
<li>Add the following properties in “&lt;SPARK_HOME&gt;/conf/carbon.properties”:</li>
</ol>
<table class="table table-striped table-bordered">
<thead>
<tr>
<th>Property</th>
<th>Required</th>
<th>Description</th>
<th>Example</th>
<th>Remark</th>
</tr>
</thead>
<tbody>
<tr>
<td>carbon.storelocation</td>
<td>NO</td>
<td>Location where CarbonData will create the store and write the data in its own format.</td>
<td>hdfs://IP:PORT/Opt/CarbonStore</td>
<td>Propose to set HDFS directory</td>
</tr>
<tr>
<td>carbon.kettle.home</td>
<td>YES</td>
<td>Path that will be used by CarbonData internally to create the graph for data loading.</td>
<td>$SPARK_HOME/carbonlib/carbonplugins</td>
<td></td>
</tr>
</tbody>
</table>
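<p>For example, the added entries in “&lt;SPARK_HOME&gt;/conf/carbon.properties” could look like this (a sketch; the HDFS host, port, and paths are placeholders):</p>
<pre><code>carbon.storelocation=hdfs://IP:PORT/Opt/CarbonStore
carbon.kettle.home=/opt/spark/carbonlib/carbonplugins
</code></pre>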
<ol start="7">
<li>Installation verification, for example:</li>
</ol>
<pre><code>./bin/spark-shell --master spark://IP:PORT --total-executor-cores 2 --executor-memory 2G
</code></pre>
<p>Note: Make sure the user has access permission to the CarbonData jars and files through which the driver and executors will start.</p>
<p>To get started with CarbonData: <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Quick+Start">Quick Start</a>, <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/DDL+operations+on+CarbonData">DDL Operations</a></p>
<h2><a id="Installing_and_Configuring_Carbon_on_Spark_on_YARN_Cluster_72"></a>Installing and Configuring Carbon on “Spark on YARN” Cluster</h2>
<p>This section provides the procedure to install CarbonData on a “Spark on YARN” cluster.</p>
<h3><a id="Prerequisite_75"></a>Prerequisite</h3>
<ul>
<li>Hadoop HDFS and YARN should be installed and running.</li>
<li>Spark should be installed and running on all the client machines.</li>
<li>The CarbonData user should have permission to access HDFS.</li>
</ul>
<h3><a id="Procedure_80"></a>Procedure</h3>
<ol>
<li>
<p><a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Building+CarbonData+And+IDE+Configuration">Build the CarbonData</a> project and get the assembly jar from “./assembly/target/scala-2.10/carbondata_xxx.jar” and put in the “&lt;SPARK_HOME&gt;/carbonlib” folder.</p>
<p>(Note: - Create the carbonlib folder if does not exists inside “&lt;SPARK_HOME&gt;” path.)</p>
</li>
<li>
<p>Copy the “carbonplugins” folder from the “./processing/” folder of the CarbonData repository to the “&lt;SPARK_HOME&gt;/carbonlib” folder.<br>
(Note: carbonplugins contains the .kettle folder.)</p>
</li>
<li>
<p>Copy “carbon.properties.template” from the “./conf/” folder of the CarbonData repository to “&lt;SPARK_HOME&gt;/conf/carbon.properties”.</p>
</li>
<li>
<p>Modify the parameters listed in the table below in “spark-defaults.conf” located in “&lt;SPARK_HOME&gt;/conf”.</p>
</li>
</ol>
<table class="table table-striped table-bordered">
<thead>
<tr>
<th>Property</th>
<th>Description</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>spark.master</td>
<td>Set this value to run Spark on a YARN cluster.</td>
<td>Set “yarn-client” to run Spark in yarn-client mode.</td>
</tr>
<tr>
<td>spark.yarn.dist.files</td>
<td>Comma-separated list of files to be placed in the working directory of each executor.</td>
<td>“&lt;YOUR_SPARK_HOME_PATH&gt;”/conf/carbon.properties</td>
</tr>
<tr>
<td>spark.yarn.dist.archives</td>
<td>Comma-separated list of archives to be extracted into the working directory of each executor.</td>
<td>“&lt;YOUR_SPARK_HOME_PATH&gt;”/carbonlib/carbondata_xxx.jar</td>
</tr>
<tr>
<td>spark.executor.extraJavaOptions</td>
<td>A string of extra JVM options to pass to executors. NOTE: You can enter multiple values separated by a space.</td>
<td>-Dcarbon.properties.filepath=carbon.properties</td>
</tr>
<tr>
<td>spark.executor.extraClassPath</td>
<td>Extra classpath entries to prepend to the classpath of executors. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append the value to this parameter instead.</td>
<td>“&lt;YOUR_SPARK_HOME_PATH&gt;”/carbonlib/carbondata_xxx.jar</td>
</tr>
<tr>
<td>spark.driver.extraClassPath</td>
<td>Extra classpath entries to prepend to the classpath of the driver. NOTE: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append the value to this parameter instead.</td>
<td>“&lt;YOUR_SPARK_HOME_PATH&gt;”/carbonlib/carbondata_xxx.jar</td>
</tr>
<tr>
<td>spark.driver.extraJavaOptions</td>
<td>A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.</td>
<td>-Dcarbon.properties.filepath=&quot;&lt;YOUR_SPARK_HOME_PATH&gt;&quot;/conf/carbon.properties</td>
</tr>
<tr>
<td>carbon.kettle.home</td>
<td>Path that will be used by CarbonData internally to create the graph for data loading.</td>
<td>“&lt;YOUR_SPARK_HOME_PATH&gt;”/carbonlib/carbonplugins</td>
</tr>
</tbody>
</table>
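<p>Putting the table together, the modified section of “&lt;SPARK_HOME&gt;/conf/spark-defaults.conf” might look like the following (a sketch; /opt/spark and carbondata_xxx.jar are placeholders for the actual Spark home and assembly jar name):</p>
<pre><code>spark.master                     yarn-client
spark.yarn.dist.files            /opt/spark/conf/carbon.properties
spark.yarn.dist.archives         /opt/spark/carbonlib/carbondata_xxx.jar
spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=carbon.properties
spark.executor.extraClassPath    /opt/spark/carbonlib/carbondata_xxx.jar
spark.driver.extraClassPath      /opt/spark/carbonlib/carbondata_xxx.jar
spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties
carbon.kettle.home               /opt/spark/carbonlib/carbonplugins
</code></pre>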
<ol start="5">
<li>Add the following properties in &lt;SPARK_HOME&gt;/conf/carbon.properties:</li>
</ol>
<table class="table table-striped table-bordered">
<thead>
<tr>
<th>Property</th>
<th>Required</th>
<th>Description</th>
<th>Example</th>
<th>Remark</th>
</tr>
</thead>
<tbody>
<tr>
<td>carbon.storelocation</td>
<td>NO</td>
<td>Location where CarbonData will create the store and write the data in its own format.</td>
<td>hdfs://IP:PORT/Opt/CarbonStore</td>
<td>Propose to set HDFS directory</td>
</tr>
<tr>
<td>carbon.kettle.home</td>
<td>YES</td>
<td>Path that will be used by CarbonData internally to create the graph for data loading.</td>
<td>$SPARK_HOME/carbonlib/carbonplugins</td>
<td></td>
</tr>
</tbody>
</table>
<ol start="6">
<li>Installation verification, for example:</li>
</ol>
<pre><code>./bin/spark-shell --master yarn-client --driver-memory 1g --executor-cores 2 --executor-memory 2G
</code></pre>
<p>Note: Make sure the user has access permission to the CarbonData jars and files through which the driver and executors will start.</p>
<p>To get started with CarbonData: <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Quick+Start">Quick Start</a>, <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/DDL+operations+on+CarbonData">DDL Operations</a></p>
<h2><a id="Query_execution_using_Carbon_thrift_server_118"></a>Query execution using Carbon thrift server</h2>
<h3><a id="Start_Thrift_server_120"></a>Start Thrift server</h3>
<p>a. cd &lt;SPARK_HOME&gt;</p>
<p>b. Run the below command to start the Carbon Thrift server:</p>
<pre><code>./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
$SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR &lt;carbon_store_path&gt;
</code></pre>
<table class="table table-striped table-bordered">
<thead>
<tr>
<th>Parameter</th>
<th>Description</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>CARBON_ASSEMBLY_JAR</td>
<td>Carbon assembly jar name present in the “&lt;SPARK_HOME&gt;”/carbonlib/ folder.</td>
<td>carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar</td>
</tr>
<tr>
<td>carbon_store_path</td>
<td>This is a parameter to the CarbonThriftServer class: an HDFS path where CarbonData files will be kept. It is strongly recommended to set this to the same value as the carbon.storelocation parameter in carbon.properties.</td>
<td>hdfs://hacluster/user/hive/warehouse/carbon.store<br>
hdfs://10.10.10.10:54310/user/hive/warehouse/carbon.store</td>
</tr>
</tbody>
</table>
<h3><a id="Examples_133"></a>Examples</h3>
<ol>
<li>Start with default memory and executors</li>
</ol>
<pre><code>./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer $SPARK_HOME/carbonlib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
</code></pre>
<ol start="2">
<li>Start with fixed executors and resources</li>
</ol>
<pre><code>./bin/spark-submit --conf spark.sql.hive.thriftServer.singleSession=true --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer --num-executors 3 --driver-memory 20g --executor-memory 250g --executor-cores 32 /srv/OSCON/BigData/HACluster/install/spark/sparkJdbc/lib/carbondata_2.10-0.1.0-incubating-SNAPSHOT-shade-hadoop2.7.2.jar hdfs://hacluster/user/hive/warehouse/carbon.store
</code></pre>
<h3><a id="Connecting_to_Carbon_Thrift_Server_Using_Beeline_142"></a>Connecting to Carbon Thrift Server Using Beeline</h3>
<pre><code>cd &lt;SPARK_HOME&gt;
./bin/beeline -u jdbc:hive2://&lt;thriftserver_host&gt;:&lt;port&gt;
Example
./bin/beeline -u jdbc:hive2://10.10.10.10:10000
</code></pre>
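<p>Once connected, a simple statement can confirm that the Thrift server is responding (a sketch; beeline’s -e option runs a single statement and exits):</p>
<pre><code>./bin/beeline -u jdbc:hive2://10.10.10.10:10000 -e &quot;show tables;&quot;
</code></pre>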
</body></html>