| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
| <link href='images/favicon.ico' rel='shortcut icon' type='image/x-icon'> |
| <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> |
| <title>CarbonData</title> |
| <style> |
| |
| </style> |
| <!-- Bootstrap --> |
| |
| <link rel="stylesheet" href="css/bootstrap.min.css"> |
| <link href="css/style.css" rel="stylesheet"> |
| <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> |
| <!-- WARNING: Respond.js doesn't work if you view the page via file:// --> |
| <!--[if lt IE 9]> |
| <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> |
| <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> |
| <![endif]--> |
| <script src="js/jquery.min.js"></script> |
| <script src="js/bootstrap.min.js"></script> |
| <script defer src="https://use.fontawesome.com/releases/v5.0.8/js/all.js"></script> |
| |
| |
| </head> |
| <body> |
| <header> |
| <nav class="navbar navbar-default navbar-custom cd-navbar-wrapper"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button aria-controls="navbar" aria-expanded="false" data-target="#navbar" data-toggle="collapse" |
| class="navbar-toggle collapsed" type="button"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| <a href="index.html" class="logo"> |
| <img src="images/CarbonDataLogo.png" alt="CarbonData logo" title="CarbonData logo"/> |
| </a> |
| </div> |
| <div class="navbar-collapse collapse cd_navcontnt" id="navbar"> |
| <ul class="nav navbar-nav navbar-right navlist-custom"> |
| <li><a href="index.html" class="hidden-xs"><i class="fa fa-home" aria-hidden="true"></i> </a> |
| </li> |
| <li><a href="index.html" class="hidden-lg hidden-md hidden-sm">Home</a></li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle " data-toggle="dropdown" role="button" aria-haspopup="true" |
| aria-expanded="false"> Download <span class="caret"></span></a> |
| <ul class="dropdown-menu"> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/2.2.0/" |
| target="_blank">Apache CarbonData 2.2.0</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/2.1.1/" |
| target="_blank">Apache CarbonData 2.1.1</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/2.1.0/" |
| target="_blank">Apache CarbonData 2.1.0</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/2.0.1/" |
| target="_blank">Apache CarbonData 2.0.1</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/2.0.0/" |
| target="_blank">Apache CarbonData 2.0.0</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.6.1/" |
| target="_blank">Apache CarbonData 1.6.1</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.6.0/" |
| target="_blank">Apache CarbonData 1.6.0</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.4/" |
| target="_blank">Apache CarbonData 1.5.4</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.3/" |
| target="_blank">Apache CarbonData 1.5.3</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.2/" |
| target="_blank">Apache CarbonData 1.5.2</a></li> |
| <li> |
| <a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.1/" |
| target="_blank">Apache CarbonData 1.5.1</a></li> |
| <li> |
| <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Releases" |
| target="_blank">Release Archive</a></li> |
| </ul> |
| </li> |
| <li><a href="documentation.html" class="active">Documentation</a></li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" |
| aria-expanded="false">Community <span class="caret"></span></a> |
| <ul class="dropdown-menu"> |
| <li> |
| <a href="https://github.com/apache/carbondata/blob/master/docs/how-to-contribute-to-apache-carbondata.md" |
| target="_blank">Contributing to CarbonData</a></li> |
| <li> |
| <a href="https://github.com/apache/carbondata/blob/master/docs/release-guide.md" |
| target="_blank">Release Guide</a></li> |
| <li> |
| <a href="https://cwiki.apache.org/confluence/display/CARBONDATA/PMC+and+Committers+member+list" |
| target="_blank">Project PMC and Committers</a></li> |
| <li> |
| <a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=66850609" |
| target="_blank">CarbonData Meetups</a></li> |
| <li><a href="security.html">Apache CarbonData Security</a></li> |
| <li><a href="https://issues.apache.org/jira/browse/CARBONDATA" target="_blank">Apache |
| Jira</a></li> |
| <li><a href="videogallery.html">CarbonData Videos </a></li> |
| </ul> |
| </li> |
| <li class="dropdown"> |
| <a href="http://www.apache.org/" class="apache_link hidden-xs dropdown-toggle" |
| data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache</a> |
| <ul class="dropdown-menu"> |
| <li><a href="http://www.apache.org/" target="_blank">Apache Homepage</a></li> |
| <li><a href="http://www.apache.org/licenses/" target="_blank">License</a></li> |
| <li><a href="http://www.apache.org/foundation/sponsorship.html" |
| target="_blank">Sponsorship</a></li> |
| <li><a href="http://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="http://www.apache.org/" class="hidden-lg hidden-md hidden-sm dropdown-toggle" |
| data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache</a> |
| <ul class="dropdown-menu"> |
| <li><a href="http://www.apache.org/" target="_blank">Apache Homepage</a></li> |
| <li><a href="http://www.apache.org/licenses/" target="_blank">License</a></li> |
| <li><a href="http://www.apache.org/foundation/sponsorship.html" |
| target="_blank">Sponsorship</a></li> |
| <li><a href="http://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li> |
| </ul> |
| </li> |
| |
| <li> |
| <a href="#" id="search-icon"><i class="fa fa-search" aria-hidden="true"></i></a> |
| |
| </li> |
| |
| </ul> |
| </div><!--/.nav-collapse --> |
| <div id="search-box"> |
| <form method="get" action="http://www.google.com/search" target="_blank"> |
| <div class="search-block"> |
| <table border="0" cellpadding="0" width="100%"> |
| <tr> |
| <td style="width:80%"> |
| <input type="text" name="q" size=" 5" maxlength="255" value="" |
| class="search-input" placeholder="Search...." required/> |
| </td> |
| <td style="width:20%"> |
| <input type="submit" value="Search"/></td> |
| </tr> |
| <tr> |
| <td align="left" style="font-size:75%" colspan="2"> |
| <input type="checkbox" name="sitesearch" value="carbondata.apache.org" checked/> |
| <span style=" position: relative; top: -3px;"> Only search for CarbonData</span> |
| </td> |
| </tr> |
| </table> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| </header> <!-- end Header part --> |
| |
| <div class="fixed-padding"></div> <!-- top padding with fixed header --> |
| |
| <section><!-- Dashboard nav --> |
| <div class="container-fluid q"> |
| <div class="col-sm-12 col-md-12 maindashboard"> |
| <div class="verticalnavbar"> |
| <nav class="b-sticky-nav"> |
| <div class="nav-scroller"> |
| <div class="nav__inner"> |
| <a class="b-nav__intro nav__item" href="./introduction.html">introduction</a> |
| <a class="b-nav__quickstart nav__item" href="./quick-start-guide.html">quick start</a> |
| <a class="b-nav__uses nav__item" href="./usecases.html">use cases</a> |
| |
| <div class="nav__item nav__item__with__subs"> |
| <a class="b-nav__docs nav__item nav__sub__anchor" href="./language-manual.html">Language Reference</a> |
| <a class="nav__item nav__sub__item" href="./ddl-of-carbondata.html">DDL</a> |
| <a class="nav__item nav__sub__item" href="./dml-of-carbondata.html">DML</a> |
| <a class="nav__item nav__sub__item" href="./streaming-guide.html">Streaming</a> |
| <a class="nav__item nav__sub__item" href="./configuration-parameters.html">Configuration</a> |
| <a class="nav__item nav__sub__item" href="./index-developer-guide.html">Indexes</a> |
| <a class="nav__item nav__sub__item" href="./supported-data-types-in-carbondata.html">Data Types</a> |
| </div> |
| |
| <div class="nav__item nav__item__with__subs"> |
| <a class="b-nav__datamap nav__item nav__sub__anchor" href="./index-management.html">Index Management</a> |
| <a class="nav__item nav__sub__item" href="./bloomfilter-index-guide.html">Bloom Filter</a> |
| <a class="nav__item nav__sub__item" href="./lucene-index-guide.html">Lucene</a> |
| <a class="nav__item nav__sub__item" href="./secondary-index-guide.html">Secondary Index</a> |
| <a class="nav__item nav__sub__item" href="../spatial-index-guide.html">Spatial Index</a> |
| <a class="nav__item nav__sub__item" href="../mv-guide.html">MV</a> |
| </div> |
| |
| <div class="nav__item nav__item__with__subs"> |
| <a class="b-nav__api nav__item nav__sub__anchor" href="./sdk-guide.html">API</a> |
| <a class="nav__item nav__sub__item" href="./sdk-guide.html">Java SDK</a> |
| <a class="nav__item nav__sub__item" href="./csdk-guide.html">C++ SDK</a> |
| </div> |
| |
| <a class="b-nav__perf nav__item" href="./performance-tuning.html">Performance Tuning</a> |
| <a class="b-nav__s3 nav__item" href="./s3-guide.html">S3 Storage</a> |
| <a class="b-nav__indexserver nav__item" href="./index-server.html">Index Server</a> |
| <a class="b-nav__prestodb nav__item" href="./prestodb-guide.html">PrestoDB Integration</a> |
| <a class="b-nav__prestosql nav__item" href="./prestosql-guide.html">PrestoSQL Integration</a> |
| <a class="b-nav__flink nav__item" href="./flink-integration-guide.html">Flink Integration</a> |
| <a class="b-nav__scd nav__item" href="./scd-and-cdc-guide.html">SCD & CDC</a> |
| <a class="b-nav__faq nav__item" href="./faq.html">FAQ</a> |
| <a class="b-nav__contri nav__item" href="./how-to-contribute-to-apache-carbondata.html">Contribute</a> |
| <a class="b-nav__security nav__item" href="./security.html">Security</a> |
| <a class="b-nav__release nav__item" href="./release-guide.html">Release Guide</a> |
| </div> |
| </div> |
| <div class="navindicator"> |
| <div class="b-nav__intro navindicator__item"></div> |
| <div class="b-nav__quickstart navindicator__item"></div> |
| <div class="b-nav__uses navindicator__item"></div> |
| <div class="b-nav__docs navindicator__item"></div> |
| <div class="b-nav__datamap navindicator__item"></div> |
| <div class="b-nav__api navindicator__item"></div> |
| <div class="b-nav__perf navindicator__item"></div> |
| <div class="b-nav__s3 navindicator__item"></div> |
| <div class="b-nav__indexserver navindicator__item"></div> |
| <div class="b-nav__prestodb navindicator__item"></div> |
| <div class="b-nav__prestosql navindicator__item"></div> |
| <div class="b-nav__flink navindicator__item"></div> |
| <div class="b-nav__scd navindicator__item"></div> |
| <div class="b-nav__faq navindicator__item"></div> |
| <div class="b-nav__contri navindicator__item"></div> |
| <div class="b-nav__security navindicator__item"></div> |
| </div> |
| </nav> |
| </div> |
| <div class="mdcontent"> |
| <section> |
| <div style="padding:10px 15px;"> |
| <div id="viewpage" name="viewpage"> |
| <div class="row"> |
| <div class="col-sm-12 col-md-12"> |
| <div> |
| <h1> |
| <a id="quick-start" class="anchor" href="#quick-start" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Quick Start</h1> |
| <p>This tutorial provides a quick introduction to using CarbonData. To follow along with this guide, download a packaged release of CarbonData from the <a href="https://dist.apache.org/repos/dist/release/carbondata/" target=_blank rel="nofollow">CarbonData website</a>. Alternatively, build it yourself by following the <a href="https://github.com/apache/carbondata/tree/master/build" target=_blank>Building CarbonData</a> steps.</p> |
| <h2> |
| <a id="prerequisites" class="anchor" href="#prerequisites" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Prerequisites</h2> |
| <ul> |
| <li> |
| <p>CarbonData supports Spark versions up to 2.4. Download the Spark package from the <a href="https://spark.apache.org/downloads.html" target=_blank rel="nofollow">Spark website</a>.</p> |
| </li> |
| <li> |
| <p>Create a <code>sample.csv</code> file using the following commands. The CSV file is required for loading data into CarbonData.</p> |
| <pre><code>cd carbondata |
| cat > sample.csv << EOF |
| id,name,city,age |
| 1,david,shenzhen,31 |
| 2,eason,shenzhen,27 |
| 3,jarry,wuhan,35 |
| EOF |
| </code></pre> |
| </li> |
| </ul> |
| <h2> |
| <a id="integration" class="anchor" href="#integration" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Integration</h2> |
| <h3> |
| <a id="integration-with-execution-engines" class="anchor" href="#integration-with-execution-engines" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Integration with Execution Engines</h3> |
| <p>CarbonData can be integrated with the Spark, Presto, Flink and Hive execution engines. The documentation below describes how to install and configure CarbonData with these engines.</p> |
| <h4> |
| <a id="spark" class="anchor" href="#spark" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Spark</h4> |
| <p><a href="#installing-and-configuring-carbondata-to-run-locally-with-spark-sql-cli">Installing and Configuring CarbonData to run locally with Spark SQL CLI</a></p> |
| <p><a href="#installing-and-configuring-carbondata-to-run-locally-with-spark-shell">Installing and Configuring CarbonData to run locally with Spark Shell</a></p> |
| <p><a href="#installing-and-configuring-carbondata-on-standalone-spark-cluster">Installing and Configuring CarbonData on Standalone Spark Cluster</a></p> |
| <p><a href="#installing-and-configuring-carbondata-on-spark-on-yarn-cluster">Installing and Configuring CarbonData on Spark on YARN Cluster</a></p> |
| <p><a href="#query-execution-using-carbondata-thrift-server">Installing and Configuring CarbonData Thrift Server for Query Execution</a></p> |
| <h4> |
| <a id="presto" class="anchor" href="#presto" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Presto</h4> |
| <p><a href="#installing-and-configuring-carbondata-on-presto">Installing and Configuring CarbonData on Presto</a></p> |
| <h4> |
| <a id="hive" class="anchor" href="#hive" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Hive</h4> |
| <p><a href="./hive-guide.html">Installing and Configuring CarbonData on Hive</a></p> |
| <h3> |
| <a id="integration-with-storage-engines" class="anchor" href="#integration-with-storage-engines" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Integration with Storage Engines</h3> |
| <h4> |
| <a id="hdfs" class="anchor" href="#hdfs" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>HDFS</h4> |
| <p><a href="#installing-and-configuring-carbondata-on-standalone-spark-cluster">CarbonData supports read and write with HDFS</a></p> |
| <h4> |
| <a id="s3" class="anchor" href="#s3" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>S3</h4> |
| <p><a href="./s3-guide.html">CarbonData supports read and write with S3</a></p> |
| <h4> |
| <a id="alluxio" class="anchor" href="#alluxio" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Alluxio</h4> |
| <p><a href="./alluxio-guide.html">CarbonData supports read and write with Alluxio</a></p> |
| <h2> |
| <a id="installing-and-configuring-carbondata-to-run-locally-with-spark-sql-cli" class="anchor" href="#installing-and-configuring-carbondata-to-run-locally-with-spark-sql-cli" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Installing and Configuring CarbonData to run locally with Spark SQL CLI</h2> |
| <p>This works with Spark 2.3 and later. The Spark SQL CLI uses CarbonExtensions to customize the SparkSession with CarbonData's parser, analyzer, optimizer and physical planning strategy rules. |
| To enable CarbonExtensions, add the following configuration.</p> |
| <table> |
| <thead> |
| <tr> |
| <th>Key</th> |
| <th>Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>spark.sql.extensions</td> |
| <td>org.apache.spark.sql.CarbonExtensions</td> |
| </tr> |
| </tbody> |
| </table> |
| <p>Start Spark SQL CLI by running the following command in the Spark directory:</p> |
| <pre><code>./bin/spark-sql --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars <carbondata assembly jar path> |
| </code></pre> |
| <h6> |
| <a id="creating-a-table" class="anchor" href="#creating-a-table" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Creating a Table</h6> |
| <pre><code>CREATE TABLE IF NOT EXISTS test_table ( |
| id string, |
| name string, |
| city string, |
| age Int) |
| STORED AS carbondata; |
| </code></pre> |
| <p><strong>NOTE</strong>: CarbonExtensions only supports "STORED AS carbondata" and "USING carbondata".</p> |
| <h6> |
| <a id="loading-data-to-a-table" class="anchor" href="#loading-data-to-a-table" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Loading Data to a Table</h6> |
| <pre><code>LOAD DATA INPATH '/local-path/sample.csv' INTO TABLE test_table; |
| |
| LOAD DATA INPATH 'hdfs://hdfs-path/sample.csv' INTO TABLE test_table; |
| </code></pre> |
| <pre><code>insert into table test_table select '1', 'name1', 'city1', 1; |
| </code></pre> |
| <p><strong>NOTE</strong>: Replace the path in the above script with the real file path of <code>sample.csv</code>. |
| If you encounter a "tablestatus.lock" issue, refer to the <a href="faq.html">FAQ</a>.</p> |
| <h6> |
| <a id="query-data-from-a-table" class="anchor" href="#query-data-from-a-table" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Query Data from a Table</h6> |
| <pre><code>SELECT * FROM test_table; |
| </code></pre> |
| <pre><code>SELECT city, avg(age), sum(age) |
| FROM test_table |
| GROUP BY city; |
| </code></pre> |
| <h2> |
| <a id="installing-and-configuring-carbondata-to-run-locally-with-spark-shell" class="anchor" href="#installing-and-configuring-carbondata-to-run-locally-with-spark-shell" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Installing and Configuring CarbonData to run locally with Spark Shell</h2> |
| <p>Apache Spark Shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. Please visit <a href="http://spark.apache.org/docs/latest/" target=_blank rel="nofollow">Apache Spark Documentation</a> for more details on the Spark shell.</p> |
| <h4> |
| <a id="basics" class="anchor" href="#basics" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Basics</h4> |
| <h6> |
| <a id="option-1-using-carbonsession-deprecated-since-20" class="anchor" href="#option-1-using-carbonsession-deprecated-since-20" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Option 1: Using CarbonSession (deprecated since 2.0)</h6> |
| <p>Start Spark shell by running the following command in the Spark directory:</p> |
| <pre><code>./bin/spark-shell --jars <carbondata assembly jar path> |
| </code></pre> |
| <p><strong>NOTE</strong>: The assembly jar is located in the path where the packaged release of CarbonData was downloaded. If you are <a href="https://github.com/apache/carbondata/blob/master/build/README.md" target=_blank>building CarbonData</a> yourself, copy it from <code>./assembly/target/scala-2.1x/apache-carbondata_xxx.jar</code> after the build.</p> |
| <p>In this shell, SparkSession is readily available as <code>spark</code> and Spark context is readily available as <code>sc</code>.</p> |
| <p>To create a CarbonSession, configure it explicitly as follows:</p> |
| <ul> |
| <li>Import the following:</li> |
| </ul> |
| <pre><code>import org.apache.spark.sql.SparkSession |
| import org.apache.spark.sql.CarbonSession._ |
| </code></pre> |
| <ul> |
| <li>Create a CarbonSession:</li> |
| </ul> |
| <pre><code>val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<carbon_store_path>") |
| </code></pre> |
| <p><strong>NOTE</strong></p> |
| <ul> |
| <li>By default, the metastore location points to <code>../carbon.metastore</code>. You can provide your own metastore location to CarbonSession, for example |
| <code>SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<carbon_store_path>", "<local metastore path>")</code>.</li> |
| <li>The data storage location can be specified with <code><carbon_store_path></code>, for example <code>/carbon/data/store</code>, <code>hdfs://localhost:9000/carbon/data/store</code> or <code>s3a://carbon/data/store</code>.</li> |
| </ul> |
| <h6> |
| <a id="option-2-using-sparksession-with-carbonextensions" class="anchor" href="#option-2-using-sparksession-with-carbonextensions" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Option 2: Using SparkSession with CarbonExtensions</h6> |
| <p>Start Spark shell by running the following command in the Spark directory:</p> |
| <pre><code>./bin/spark-shell --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars <carbondata assembly jar path> |
| </code></pre> |
| <p><strong>NOTE</strong></p> |
| <ul> |
| <li>In this flow, we can use the built-in SparkSession <code>spark</code> instead of <code>carbon</code>. |
| If needed, we can also create a new SparkSession instead of using the built-in <code>spark</code>. |
| To do so, add "org.apache.spark.sql.CarbonExtensions" to the Spark configuration "spark.sql.extensions". |
| <pre><code>val newSpark = SparkSession |
| .builder() |
| .config(sc.getConf) |
| .enableHiveSupport() |
| .config("spark.sql.extensions","org.apache.spark.sql.CarbonExtensions") |
| .getOrCreate() |
| </code></pre> |
| </li> |
| <li>Data storage location can be specified by "spark.sql.warehouse.dir".</li> |
| </ul> |
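| <p>As a rough sketch of how the storage location setting combines with CarbonExtensions, the standalone application below sets both options when the session is first created. It is an illustration only: the object name, the <code>local[*]</code> master and the warehouse path are assumptions, and the CarbonData assembly jar is assumed to be on the classpath. Both options are read when a session is first created, so setting them on an already running session (such as the built-in <code>spark</code> in the shell) may not take effect.</p> |
| <pre><code>import org.apache.spark.sql.SparkSession |
|  |
| // Hypothetical application; names and paths are placeholders. |
| object CarbonWarehouseExample { |
|   def main(args: Array[String]): Unit = { |
|     // Assumed location; a local, HDFS or S3 path can be used instead. |
|     val warehouse = "/tmp/carbon/warehouse" |
|  |
|     val spark = SparkSession |
|       .builder() |
|       .appName("CarbonWarehouseExample") |
|       .master("local[*]") |
|       .config("spark.sql.extensions", "org.apache.spark.sql.CarbonExtensions") |
|       .config("spark.sql.warehouse.dir", warehouse) |
|       .getOrCreate() |
|  |
|     // Print the effective setting to confirm that it was applied. |
|     println(spark.conf.get("spark.sql.warehouse.dir")) |
|  |
|     spark.stop() |
|   } |
| } |
| </code></pre> |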
| <h4> |
| <a id="executing-queries" class="anchor" href="#executing-queries" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Executing Queries</h4> |
| <h6> |
| <a id="creating-a-table-1" class="anchor" href="#creating-a-table-1" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Creating a Table</h6> |
| <pre><code>carbon.sql( |
| s""" |
| | CREATE TABLE IF NOT EXISTS test_table( |
| | id string, |
| | name string, |
| | city string, |
| | age Int) |
| | STORED AS carbondata |
| """.stripMargin) |
| </code></pre> |
| <p><strong>NOTE</strong>: |
| The following table lists all supported syntax:</p> |
| <table> |
| <thead> |
| <tr> |
| <th>create table</th> |
| <th>SparkSession with CarbonExtensions</th> |
| <th>CarbonSession</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>STORED AS carbondata</td> |
| <td>yes</td> |
| <td>yes</td> |
| </tr> |
| <tr> |
| <td>USING carbondata</td> |
| <td>yes</td> |
| <td>yes</td> |
| </tr> |
| <tr> |
| <td>STORED BY 'carbondata'</td> |
| <td>no</td> |
| <td>yes</td> |
| </tr> |
| <tr> |
| <td>STORED BY 'org.apache.carbondata.format'</td> |
| <td>no</td> |
| <td>yes</td> |
| </tr> |
| </tbody> |
| </table> |
| <p>We recommend using CarbonExtensions instead of CarbonSession.</p> |
| <h6> |
| <a id="loading-data-to-a-table-1" class="anchor" href="#loading-data-to-a-table-1" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Loading Data to a Table</h6> |
| <pre><code>carbon.sql("LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table") |
| </code></pre> |
| <p><strong>NOTE</strong>: Replace the path in the above script with the real file path of <code>sample.csv</code>. |
| If you encounter a "tablestatus.lock" issue, refer to the <a href="faq.html">FAQ</a>.</p> |
| <h6> |
| <a id="query-data-from-a-table-1" class="anchor" href="#query-data-from-a-table-1" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Query Data from a Table</h6> |
| <pre><code>carbon.sql("SELECT * FROM test_table").show() |
| |
| carbon.sql( |
| s""" |
| | SELECT city, avg(age), sum(age) |
| | FROM test_table |
| | GROUP BY city |
| """.stripMargin).show() |
| </code></pre> |
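| <p>The statements in this section are written against the <code>carbon</code> session from Option 1. With Option 2, the session is the built-in <code>spark</code>, so the same flow might look like the sketch below. It assumes a shell started with CarbonExtensions and uses the same placeholder CSV path; <code>USING carbondata</code> is chosen here only to show the second supported syntax.</p> |
| <pre><code>// Run inside spark-shell started with CarbonExtensions (Option 2). |
| spark.sql( |
|   """ |
|     | CREATE TABLE IF NOT EXISTS test_table( |
|     |   id string, |
|     |   name string, |
|     |   city string, |
|     |   age Int) |
|     | USING carbondata |
|   """.stripMargin) |
|  |
| // Placeholder path; replace it with the real location of sample.csv. |
| spark.sql("LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE test_table") |
|  |
| // Aggregate the loaded rows per city. |
| spark.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show() |
| </code></pre> |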
| <h2> |
| <a id="installing-and-configuring-carbondata-on-standalone-spark-cluster" class="anchor" href="#installing-and-configuring-carbondata-on-standalone-spark-cluster" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Installing and Configuring CarbonData on Standalone Spark Cluster</h2> |
| <h3> |
| <a id="prerequisites-1" class="anchor" href="#prerequisites-1" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Prerequisites</h3> |
| <ul> |
| <li>Hadoop HDFS and Yarn should be installed and running.</li> |
| <li>Spark should be installed and running on all the cluster nodes.</li> |
| <li>CarbonData user should have permission to access HDFS.</li> |
| </ul> |
| <h3> |
| <a id="procedure" class="anchor" href="#procedure" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Procedure</h3> |
| <ol> |
| <li> |
| <p><a href="https://github.com/apache/carbondata/blob/master/build/README.md" target=_blank>Build the CarbonData</a> project and get the assembly jar from <code>./assembly/target/scala-2.1x/apache-carbondata_xxx.jar</code>.</p> |
| </li> |
| <li> |
| <p>Copy <code>./assembly/target/scala-2.1x/apache-carbondata_xxx.jar</code> to <code>$SPARK_HOME/carbonlib</code> folder.</p> |
| <p><strong>NOTE</strong>: Create the carbonlib folder if it does not exist inside <code>$SPARK_HOME</code> path.</p> |
| </li> |
| <li> |
| <p>Add the carbonlib folder path in the Spark classpath. (Edit <code>$SPARK_HOME/conf/spark-env.sh</code> file and modify the value of <code>SPARK_CLASSPATH</code> by appending <code>$SPARK_HOME/carbonlib/*</code> to the existing value)</p> |
| </li> |
| <li> |
| <p>Copy the <code>./conf/carbon.properties.template</code> file from CarbonData repository to <code>$SPARK_HOME/conf/</code> folder and rename the file to <code>carbon.properties</code>.</p> |
| </li> |
| <li> |
| <p>Repeat Step 2 to Step 4 on all the nodes of the cluster.</p> |
| </li> |
| <li> |
| <p>On the Spark master node, configure the properties listed in the following table in the <code>$SPARK_HOME/conf/spark-defaults.conf</code> file.</p> |
| </li> |
| </ol> |
| <table> |
| <thead> |
| <tr> |
| <th>Property</th> |
| <th>Value</th> |
| <th>Description</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>spark.driver.extraJavaOptions</td> |
| <td><code>-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties</code></td> |
| <td>A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.</td> |
| </tr> |
| <tr> |
| <td>spark.executor.extraJavaOptions</td> |
| <td><code>-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties</code></td> |
| <td>A string of extra JVM options to pass to executors. For instance, GC settings or other logging. <strong>NOTE</strong>: You can enter multiple values separated by space.</td> |
| </tr> |
| </tbody> |
| </table> |
| <ol start="7"> |
| <li>Verify the installation. For example:</li> |
| </ol> |
| <pre><code>./bin/spark-shell \ |
| --master spark://HOSTNAME:PORT \ |
| --total-executor-cores 2 \ |
| --executor-memory 2G |
| </code></pre> |
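| <p>Once the shell is up, one possible way to check that the properties from <code>spark-defaults.conf</code> were picked up is to inspect them from the driver, as sketched below. This is a suggestion rather than part of the official procedure; each line prints <code>None</code> if the corresponding option was not applied.</p> |
| <pre><code>// Run inside the spark-shell started above. |
| // A -D option passed to the driver JVM is visible as a system property. |
| println(Option(System.getProperty("carbon.properties.filepath"))) |
|  |
| // Executor options are kept in the Spark configuration of the context. |
| println(sc.getConf.getOption("spark.executor.extraJavaOptions")) |
| </code></pre> |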
| <p><strong>NOTE</strong>:</p> |
| <ul> |
| <li>The property "carbon.storelocation" is deprecated in CarbonData 2.0. Only users who used this property in previous versions can still use it in CarbonData 2.0.</li> |
| <li>Make sure you have permissions for the CarbonData JARs and files with which the driver and executors start.</li> |
| </ul> |
| <h2> |
| <a id="installing-and-configuring-carbondata-on-spark-on-yarn-cluster" class="anchor" href="#installing-and-configuring-carbondata-on-spark-on-yarn-cluster" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Installing and Configuring CarbonData on Spark on YARN Cluster</h2> |
| <p>This section provides the procedure to install CarbonData on "Spark on YARN" cluster.</p> |
| <h3> |
| <a id="prerequisites-2" class="anchor" href="#prerequisites-2" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Prerequisites</h3> |
| <ul> |
| <li>Hadoop HDFS and Yarn should be installed and running.</li> |
| <li>Spark should be installed and running on all the clients.</li> |
| <li>CarbonData user should have permission to access HDFS.</li> |
| </ul> |
| <h3> |
| <a id="procedure-1" class="anchor" href="#procedure-1" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Procedure</h3> |
| <p>The following steps are only for driver nodes. (A driver node is the node that starts the Spark context.)</p> |
| <ol> |
| <li> |
| <p><a href="https://github.com/apache/carbondata/blob/master/build/README.md" target=_blank>Build the CarbonData</a> project and get the assembly jar from <code>./assembly/target/scala-2.1x/apache-carbondata_xxx.jar</code> and copy to <code>$SPARK_HOME/carbonlib</code> folder.</p> |
| <p><strong>NOTE</strong>: Create the carbonlib folder if it does not exist inside <code>$SPARK_HOME</code> path.</p> |
| </li> |
| <li> |
| <p>Copy the <code>./conf/carbon.properties.template</code> file from CarbonData repository to <code>$SPARK_HOME/conf/</code> folder and rename the file to <code>carbon.properties</code>.</p> |
| </li> |
| <li> |
| <p>Create <code>tar.gz</code> file of carbonlib folder and move it inside the carbonlib folder.</p> |
| </li> |
| </ol> |
| <pre><code>cd $SPARK_HOME |
| tar -zcvf carbondata.tar.gz carbonlib/ |
| mv carbondata.tar.gz carbonlib/ |
| </code></pre> |
| <ol start="4"> |
| <li>Configure the properties mentioned in the following table in <code>$SPARK_HOME/conf/spark-defaults.conf</code> file.</li> |
| </ol> |
| <table> |
| <thead> |
| <tr> |
| <th>Property</th> |
| <th>Description</th> |
| <th>Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>spark.master</td> |
| <td>Set this value to run Spark on YARN.</td> |
| <td>Set <code>yarn-client</code> to run Spark in YARN client mode.</td> |
| </tr> |
| <tr> |
| <td>spark.yarn.dist.files</td> |
| <td>Comma-separated list of files to be placed in the working directory of each executor.</td> |
| <td><code>$SPARK_HOME/conf/carbon.properties</code></td> |
| </tr> |
| <tr> |
| <td>spark.yarn.dist.archives</td> |
| <td>Comma-separated list of archives to be extracted into the working directory of each executor.</td> |
| <td><code>$SPARK_HOME/carbonlib/carbondata.tar.gz</code></td> |
| </tr> |
| <tr> |
| <td>spark.executor.extraJavaOptions</td> |
| <td>A string of extra JVM options to pass to executors. For instance, GC settings or other logging. <strong>NOTE</strong>: You can enter multiple values separated by space.</td> |
| <td><code>-Dcarbon.properties.filepath=carbon.properties</code></td> |
| </tr> |
| <tr> |
| <td>spark.executor.extraClassPath</td> |
| <td>Extra classpath entries to prepend to the classpath of executors. <strong>NOTE</strong>: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append its values to the spark.driver.extraClassPath parameter below.</td> |
| <td><code>carbondata.tar.gz/carbonlib/*</code></td> |
| </tr> |
| <tr> |
| <td>spark.driver.extraClassPath</td> |
| <td>Extra classpath entries to prepend to the classpath of the driver. <strong>NOTE</strong>: If SPARK_CLASSPATH is defined in spark-env.sh, comment it out and append its value to this parameter (spark.driver.extraClassPath).</td> |
| <td><code>$SPARK_HOME/carbonlib/*</code></td> |
| </tr> |
| <tr> |
| <td>spark.driver.extraJavaOptions</td> |
| <td>A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.</td> |
| <td><code>-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties</code></td> |
| </tr> |
| </tbody> |
| </table> |
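| <p>As a rough illustration, the table above might translate into a <code>spark-defaults.conf</code> fragment like the one below. This is a sketch, not a verified configuration: <code>/opt/spark</code> is a hypothetical stand-in for your actual <code>$SPARK_HOME</code>, written out in full on the assumption that environment variables are not expanded in this file.</p> |
| <pre><code># Sketch only: /opt/spark is an assumed install path; replace it with your own $SPARK_HOME. |
| spark.master                     yarn-client |
| spark.yarn.dist.files            /opt/spark/conf/carbon.properties |
| spark.yarn.dist.archives         /opt/spark/carbonlib/carbondata.tar.gz |
| # JVM system properties take the form -Dname=value, with no spaces around '='. |
| spark.executor.extraJavaOptions  -Dcarbon.properties.filepath=carbon.properties |
| spark.executor.extraClassPath    carbondata.tar.gz/carbonlib/* |
| spark.driver.extraClassPath      /opt/spark/carbonlib/* |
| spark.driver.extraJavaOptions    -Dcarbon.properties.filepath=/opt/spark/conf/carbon.properties |
| </code></pre> |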
| <ol start="5"> |
| <li>Verify the installation.</li> |
| </ol> |
| <pre><code>./bin/spark-shell \ |
| --master yarn-client \ |
| --driver-memory 1G \ |
| --executor-memory 2G \ |
| --executor-cores 2 |
| </code></pre> |
| <p><strong>NOTE</strong>:</p> |
| <ul> |
| <li>The <code>carbon.storelocation</code> property is deprecated in CarbonData 2.0. Only users who set this property in earlier versions can continue to use it in CarbonData 2.0.</li> |
| <li>Make sure you have permissions for the CarbonData JARs and files with which the driver and executors will start.</li> |
| <li>If you use Spark with Hive 1.1.X, add the CarbonData assembly JAR and the carbondata-hive JAR to the <code>spark.sql.hive.metastore.jars</code> parameter in the <code>spark-defaults.conf</code> file.</li> |
| </ul> |
| <h2> |
| <a id="query-execution-using-carbondata-thrift-server" class="anchor" href="#query-execution-using-carbondata-thrift-server" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Query Execution Using CarbonData Thrift Server</h2> |
| <h3> |
| <a id="starting-carbondata-thrift-server" class="anchor" href="#starting-carbondata-thrift-server" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Starting CarbonData Thrift Server</h3> |
| <p>a. Run <code>cd $SPARK_HOME</code>.</p> |
| <p>b. Run the following command to start the CarbonData thrift server.</p> |
| <pre><code>./bin/spark-submit \ |
| --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \ |
| $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR |
| </code></pre> |
| <table> |
| <thead> |
| <tr> |
| <th>Parameter</th> |
| <th>Description</th> |
| <th>Example</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>CARBON_ASSEMBLY_JAR</td> |
| <td>CarbonData assembly jar name present in the <code>$SPARK_HOME/carbonlib/</code> folder.</td> |
| <td>apache-carbondata-xx.jar</td> |
| </tr> |
| </tbody> |
| </table> |
| <p>c. Run the following command to work with S3 storage.</p> |
| <pre><code>./bin/spark-submit \ |
| --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \ |
| $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR &lt;access_key&gt; &lt;secret_key&gt; &lt;endpoint&gt; |
| </code></pre> |
| <table> |
| <thead> |
| <tr> |
| <th>Parameter</th> |
| <th>Description</th> |
| <th>Example</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>CARBON_ASSEMBLY_JAR</td> |
| <td>CarbonData assembly jar name present in the <code>$SPARK_HOME/carbonlib/</code> folder.</td> |
| <td>apache-carbondata-xx.jar</td> |
| </tr> |
| <tr> |
| <td>access_key</td> |
| <td>Access key for S3 storage</td> |
| <td></td> |
| </tr> |
| <tr> |
| <td>secret_key</td> |
| <td>Secret key for S3 storage</td> |
| <td></td> |
| </tr> |
| <tr> |
| <td>endpoint</td> |
| <td>Endpoint for connecting to S3 storage</td> |
| <td></td> |
| </tr> |
| </tbody> |
| </table> |
| <p><strong>NOTE</strong>: From Spark 1.6 onward, the Thrift server runs in multi-session mode by default, which means each JDBC/ODBC connection owns its own copy of the SQL configuration and temporary function registry. Cached tables are still shared, though. If you prefer to run the Thrift server in single-session mode and share all SQL configuration and the temporary function registry, set the option <code>spark.sql.hive.thriftServer.singleSession</code> to <code>true</code>. You may either add this option to <code>spark-defaults.conf</code>, or pass it to <code>spark-submit</code> via <code>--conf</code>:</p> |
| <pre><code>./bin/spark-submit \ |
| --conf spark.sql.hive.thriftServer.singleSession=true \ |
| --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \ |
| $SPARK_HOME/carbonlib/$CARBON_ASSEMBLY_JAR |
| </code></pre> |
| <p><strong>Note</strong> that in single-session mode, if one user changes the database from one connection, the database of all other connections changes too.</p> |
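| <p>For the <code>spark-defaults.conf</code> route mentioned above, a minimal sketch of the entry (assuming the usual whitespace-separated key/value layout of that file) could look like this:</p> |
| <pre><code># Assumed addition to $SPARK_HOME/conf/spark-defaults.conf |
| spark.sql.hive.thriftServer.singleSession   true |
| </code></pre> |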
| <p><strong>Examples</strong></p> |
| <ul> |
| <li>Start with default memory and executors.</li> |
| </ul> |
| <pre><code>./bin/spark-submit \ |
| --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \ |
| $SPARK_HOME/carbonlib/apache-carbondata-xxx.jar |
| </code></pre> |
| <ul> |
| <li>Start with fixed executors and resources.</li> |
| </ul> |
| <pre><code>./bin/spark-submit \ |
| --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \ |
| --num-executors 3 \ |
| --driver-memory 20G \ |
| --executor-memory 250G \ |
| --executor-cores 32 \ |
| $SPARK_HOME/carbonlib/apache-carbondata-xxx.jar |
| </code></pre> |
| <h3> |
| <a id="connecting-to-carbondata-thrift-server-using-beeline" class="anchor" href="#connecting-to-carbondata-thrift-server-using-beeline" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Connecting to CarbonData Thrift Server Using Beeline</h3> |
| <pre><code>cd $SPARK_HOME |
| ./sbin/start-thriftserver.sh |
| ./bin/beeline -u jdbc:hive2://&lt;thriftserver_host&gt;:&lt;port&gt; |
| |
| # Example |
| ./bin/beeline -u jdbc:hive2://10.10.10.10:10000 |
| </code></pre> |
| <h2> |
| <a id="installing-and-configuring-carbondata-on-presto" class="anchor" href="#installing-and-configuring-carbondata-on-presto" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Installing and Configuring CarbonData on Presto</h2> |
| <p><strong>NOTE:</strong> <strong>CarbonData tables cannot be created or loaded from Presto. Users need to create a CarbonData table and load data into it |
| with either <a href="#installing-and-configuring-carbondata-to-run-locally-with-spark-shell">Spark</a>, the <a href="./sdk-guide.html">SDK</a>, or the <a href="./csdk-guide.html">C++ SDK</a>. |
| Once the table is created, it can be queried from Presto.</strong></p> |
| <p>Please refer to the Presto guides linked below.</p> |
| <p>prestodb guide - <a href="./prestodb-guide.html">prestodb</a></p> |
| <p>prestosql guide - <a href="./prestosql-guide.html">prestosql</a></p> |
| <p>Once Presto is installed with CarbonData as described in the guides above, |
| you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.</p> |
| <p>List the available schemas (databases)</p> |
| <pre><code>show schemas; |
| </code></pre> |
| <p>Select the schema where the CarbonData table resides</p> |
| <pre><code>use carbonschema; |
| </code></pre> |
| <p>List the available tables</p> |
| <pre><code>show tables; |
| </code></pre> |
| <p>Query from the available tables</p> |
| <pre><code>select * from carbon_table; |
| </code></pre> |
| <p><strong>Note:</strong> Tables must be created and data loaded before executing queries, because CarbonData tables cannot be created from this interface.</p> |
| <script> |
| // Show selected style on nav item |
| $(function() { $('.b-nav__quickstart').addClass('selected'); }); |
| </script></div> |
| </div> |
| </div> |
| </div> |
| <div class="doc-footer"> |
| <a href="#top" class="scroll-top">Top</a> |
| </div> |
| </div> |
| </section> |
| </div> |
| </div> |
| </div> |
| </section><!-- End systemblock part --> |
| <script src="js/custom.js"></script> |
| </body> |
| </html> |