<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href='images/favicon.ico' rel='shortcut icon' type='image/x-icon'>
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<title>CarbonData</title>
<style>
</style>
<!-- Bootstrap -->
<link rel="stylesheet" href="css/bootstrap.min.css">
<link href="css/style.css" rel="stylesheet">
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<script src="js/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<script defer src="https://use.fontawesome.com/releases/v5.0.8/js/all.js"></script>
</head>
<body>
<header>
<nav class="navbar navbar-default navbar-custom cd-navbar-wrapper">
<div class="container">
<div class="navbar-header">
<button aria-controls="navbar" aria-expanded="false" data-target="#navbar" data-toggle="collapse"
class="navbar-toggle collapsed" type="button">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="index.html" class="logo">
<img src="images/CarbonDataLogo.png" alt="CarbonData logo" title="CarbonData logo"/>
</a>
</div>
<div class="navbar-collapse collapse cd_navcontnt" id="navbar">
<ul class="nav navbar-nav navbar-right navlist-custom">
<li><a href="index.html" class="hidden-xs"><i class="fa fa-home" aria-hidden="true"></i> </a>
</li>
<li><a href="index.html" class="hidden-lg hidden-md hidden-sm">Home</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle " data-toggle="dropdown" role="button" aria-haspopup="true"
aria-expanded="false"> Download <span class="caret"></span></a>
<ul class="dropdown-menu">
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/2.2.0/"
target="_blank">Apache CarbonData 2.2.0</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/2.1.1/"
target="_blank">Apache CarbonData 2.1.1</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/2.1.0/"
target="_blank">Apache CarbonData 2.1.0</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/2.0.1/"
target="_blank">Apache CarbonData 2.0.1</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/2.0.0/"
target="_blank">Apache CarbonData 2.0.0</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.6.1/"
target="_blank">Apache CarbonData 1.6.1</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.6.0/"
target="_blank">Apache CarbonData 1.6.0</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.4/"
target="_blank">Apache CarbonData 1.5.4</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.3/"
target="_blank">Apache CarbonData 1.5.3</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.2/"
target="_blank">Apache CarbonData 1.5.2</a></li>
<li>
<a href="https://dist.apache.org/repos/dist/release/carbondata/1.5.1/"
target="_blank">Apache CarbonData 1.5.1</a></li>
<li>
<a href="https://cwiki.apache.org/confluence/display/CARBONDATA/Releases"
target="_blank">Release Archive</a></li>
</ul>
</li>
<li><a href="documentation.html" class="active">Documentation</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true"
aria-expanded="false">Community <span class="caret"></span></a>
<ul class="dropdown-menu">
<li>
<a href="https://github.com/apache/carbondata/blob/master/docs/how-to-contribute-to-apache-carbondata.md"
target="_blank">Contributing to CarbonData</a></li>
<li>
<a href="https://github.com/apache/carbondata/blob/master/docs/release-guide.md"
target="_blank">Release Guide</a></li>
<li>
<a href="https://cwiki.apache.org/confluence/display/CARBONDATA/PMC+and+Committers+member+list"
target="_blank">Project PMC and Committers</a></li>
<li>
<a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=66850609"
target="_blank">CarbonData Meetups</a></li>
<li><a href="security.html">Apache CarbonData Security</a></li>
<li><a href="https://issues.apache.org/jira/browse/CARBONDATA" target="_blank">Apache
Jira</a></li>
<li><a href="videogallery.html">CarbonData Videos </a></li>
</ul>
</li>
<li class="dropdown">
<a href="http://www.apache.org/" class="apache_link hidden-xs dropdown-toggle"
data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache</a>
<ul class="dropdown-menu">
<li><a href="http://www.apache.org/" target="_blank">Apache Homepage</a></li>
<li><a href="http://www.apache.org/licenses/" target="_blank">License</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html"
target="_blank">Sponsorship</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
</ul>
</li>
<li class="dropdown">
<a href="http://www.apache.org/" class="hidden-lg hidden-md hidden-sm dropdown-toggle"
data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Apache</a>
<ul class="dropdown-menu">
<li><a href="http://www.apache.org/" target="_blank">Apache Homepage</a></li>
<li><a href="http://www.apache.org/licenses/" target="_blank">License</a></li>
<li><a href="http://www.apache.org/foundation/sponsorship.html"
target="_blank">Sponsorship</a></li>
<li><a href="http://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
</ul>
</li>
<li>
<a href="#" id="search-icon"><i class="fa fa-search" aria-hidden="true"></i></a>
</li>
</ul>
</div><!--/.nav-collapse -->
<div id="search-box">
<form method="get" action="http://www.google.com/search" target="_blank">
<div class="search-block">
<table border="0" cellpadding="0" width="100%">
<tr>
<td style="width:80%">
<input type="text" name="q" size=" 5" maxlength="255" value=""
class="search-input" placeholder="Search...." required/>
</td>
<td style="width:20%">
<input type="submit" value="Search"/></td>
</tr>
<tr>
<td align="left" style="font-size:75%" colspan="2">
<input type="checkbox" name="sitesearch" value="carbondata.apache.org" checked/>
<span style=" position: relative; top: -3px;"> Only search for CarbonData</span>
</td>
</tr>
</table>
</div>
</form>
</div>
</div>
</nav>
</header> <!-- end Header part -->
<div class="fixed-padding"></div> <!-- top padding with fixed header -->
<section><!-- Dashboard nav -->
<div class="container-fluid q">
<div class="col-sm-12 col-md-12 maindashboard">
<div class="verticalnavbar">
<nav class="b-sticky-nav">
<div class="nav-scroller">
<div class="nav__inner">
<a class="b-nav__intro nav__item" href="./introduction.html">introduction</a>
<a class="b-nav__quickstart nav__item" href="./quick-start-guide.html">quick start</a>
<a class="b-nav__uses nav__item" href="./usecases.html">use cases</a>
<div class="nav__item nav__item__with__subs">
<a class="b-nav__docs nav__item nav__sub__anchor" href="./language-manual.html">Language Reference</a>
<a class="nav__item nav__sub__item" href="./ddl-of-carbondata.html">DDL</a>
<a class="nav__item nav__sub__item" href="./dml-of-carbondata.html">DML</a>
<a class="nav__item nav__sub__item" href="./streaming-guide.html">Streaming</a>
<a class="nav__item nav__sub__item" href="./configuration-parameters.html">Configuration</a>
<a class="nav__item nav__sub__item" href="./index-developer-guide.html">Indexes</a>
<a class="nav__item nav__sub__item" href="./supported-data-types-in-carbondata.html">Data Types</a>
</div>
<div class="nav__item nav__item__with__subs">
<a class="b-nav__datamap nav__item nav__sub__anchor" href="./index-management.html">Index Management</a>
<a class="nav__item nav__sub__item" href="./bloomfilter-index-guide.html">Bloom Filter</a>
<a class="nav__item nav__sub__item" href="./lucene-index-guide.html">Lucene</a>
<a class="nav__item nav__sub__item" href="./secondary-index-guide.html">Secondary Index</a>
<a class="nav__item nav__sub__item" href="../spatial-index-guide.html">Spatial Index</a>
<a class="nav__item nav__sub__item" href="../mv-guide.html">MV</a>
</div>
<div class="nav__item nav__item__with__subs">
<a class="b-nav__api nav__item nav__sub__anchor" href="./sdk-guide.html">API</a>
<a class="nav__item nav__sub__item" href="./sdk-guide.html">Java SDK</a>
<a class="nav__item nav__sub__item" href="./csdk-guide.html">C++ SDK</a>
</div>
<a class="b-nav__perf nav__item" href="./performance-tuning.html">Performance Tuning</a>
<a class="b-nav__s3 nav__item" href="./s3-guide.html">S3 Storage</a>
<a class="b-nav__indexserver nav__item" href="./index-server.html">Index Server</a>
<a class="b-nav__prestodb nav__item" href="./prestodb-guide.html">PrestoDB Integration</a>
<a class="b-nav__prestosql nav__item" href="./prestosql-guide.html">PrestoSQL Integration</a>
<a class="b-nav__flink nav__item" href="./flink-integration-guide.html">Flink Integration</a>
<a class="b-nav__scd nav__item" href="./scd-and-cdc-guide.html">SCD &amp; CDC</a>
<a class="b-nav__faq nav__item" href="./faq.html">FAQ</a>
<a class="b-nav__contri nav__item" href="./how-to-contribute-to-apache-carbondata.html">Contribute</a>
<a class="b-nav__security nav__item" href="./security.html">Security</a>
<a class="b-nav__release nav__item" href="./release-guide.html">Release Guide</a>
</div>
</div>
<div class="navindicator">
<div class="b-nav__intro navindicator__item"></div>
<div class="b-nav__quickstart navindicator__item"></div>
<div class="b-nav__uses navindicator__item"></div>
<div class="b-nav__docs navindicator__item"></div>
<div class="b-nav__datamap navindicator__item"></div>
<div class="b-nav__api navindicator__item"></div>
<div class="b-nav__perf navindicator__item"></div>
<div class="b-nav__s3 navindicator__item"></div>
<div class="b-nav__indexserver navindicator__item"></div>
<div class="b-nav__prestodb navindicator__item"></div>
<div class="b-nav__prestosql navindicator__item"></div>
<div class="b-nav__flink navindicator__item"></div>
<div class="b-nav__scd navindicator__item"></div>
<div class="b-nav__faq navindicator__item"></div>
<div class="b-nav__contri navindicator__item"></div>
<div class="b-nav__security navindicator__item"></div>
</div>
</nav>
</div>
<div class="mdcontent">
<section>
<div style="padding:10px 15px;">
<div id="viewpage" name="viewpage">
<div class="row">
<div class="col-sm-12 col-md-12">
<div>
<h1>
<a id="carbondata-bloomfilter-index" class="anchor" href="#carbondata-bloomfilter-index" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>CarbonData BloomFilter Index</h1>
<ul>
<li><a href="#index-management">Index Management</a></li>
<li><a href="#bloomfilter-index-introduction">BloomFilter Index Introduction</a></li>
<li><a href="#loading-data">Loading Data</a></li>
<li><a href="#querying-data">Querying Data</a></li>
<li><a href="#data-management-with-bloomfilter-index">Data Management</a></li>
<li><a href="#useful-tips">Useful Tips</a></li>
</ul>
<h4>
<a id="index-management" class="anchor" href="#index-management" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Index Management</h4>
<p>Creating BloomFilter Index</p>
<pre><code>CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE main_table (city,name)
AS 'bloomfilter'
PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001')
</code></pre>
<p>Dropping Specified Index</p>
<pre><code>DROP INDEX [IF EXISTS] index_name
ON [TABLE] main_table
</code></pre>
<p>Showing all Indexes on this table</p>
<pre><code>SHOW INDEXES
ON [TABLE] main_table
</code></pre>
<blockquote>
<p>NOTE: Keywords given inside <code>[]</code> are optional.</p>
</blockquote>
<p>Disable Index</p>
<blockquote>
<p>The index is enabled by default. To help with query tuning, you can disable a specific index while running a query and observe whether the index actually improves performance. The setting applies only to the current session.</p>
</blockquote>
<pre><code>-- disable the index
SET carbon.index.visible.dbName.tableName.indexName = false
-- enable the index
SET carbon.index.visible.dbName.tableName.indexName = true
</code></pre>
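<p>As a rough sketch of this tuning workflow, the session below hides an index, runs a query, then makes the index visible again and reruns the query so the two runs can be compared. The database <code>default</code>, the table <code>index_test</code> with index <code>dm</code> (the example names used further down this page), and the filter value are assumptions for illustration only.</p>
<pre><code>-- hypothetical names: database "default", table "index_test", index "dm"
SET carbon.index.visible.default.index_test.dm = false
-- run the query without the index and note the elapsed time
SELECT * FROM index_test WHERE name = 'n1000'
-- make the index visible again and rerun the same query to compare
SET carbon.index.visible.default.index_test.dm = true
SELECT * FROM index_test WHERE name = 'n1000'
</code></pre>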
<h2>
<a id="bloomfilter-index-introduction" class="anchor" href="#bloomfilter-index-introduction" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>BloomFilter Index Introduction</h2>
<p>A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
CarbonData provides BloomFilter as an index to speed up queries that filter on an exact value.
It is well suited to queries that do exact matching on high-cardinality columns (such as name or ID).
Internally, CarbonData maintains one BloomFilter per blocklet for each index column to indicate whether a given value of the column may be present in that blocklet.
Like the other indexes, a BloomFilter index is managed by CarbonData along with the main table.
Users can create a BloomFilter index on specified columns with specified BloomFilter settings such as size and false-positive probability.</p>
<p>For instance, consider a main table called <strong>index_test</strong>, defined as:</p>
<pre><code>CREATE TABLE index_test (
id string,
name string,
age int,
city string,
country string)
STORED AS carbondata
TBLPROPERTIES('SORT_COLUMNS'='id')
</code></pre>
<p>In the above example, <code>id</code> and <code>name</code> are high-cardinality columns,
and queries always filter on <code>id</code> and <code>name</code> with an exact value.
Since <code>id</code> is in the sort_columns and is therefore ordered,
queries on it are fast because CarbonData can skip all the irrelevant blocklets.
Queries on <code>name</code>, however, may be slow because the blocklet min/max may not help:
in each blocklet the range of <code>name</code> values may be the same, all from A* to z*.
In this case, the user can create a BloomFilter index on column <code>name</code>.
The user can also create a BloomFilter index on the sort_columns.
This is useful if the table has many segments and the value ranges of the sort_columns are almost the same in each of them.</p>
<p>Users can create a BloomFilter index using the CREATE INDEX DDL:</p>
<pre><code>CREATE INDEX dm
ON TABLE index_test (name,id)
AS 'bloomfilter'
PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001', 'BLOOM_COMPRESS'='true')
</code></pre>
<p>Here, (name,id) are the INDEX_COLUMNS. CarbonData generates a BloomFilter index on these columns. Queries on these columns are usually of the form <code>'COL = VAL'</code>.</p>
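<p>For illustration, queries of the following shape filter on the index columns with exact values, which is the kind of predicate a BloomFilter index is meant for. The literal values are made up.</p>
<pre><code>-- equality filter on an index column
SELECT id, name, city FROM index_test WHERE name = 'n1000'
-- IN filter on an index column
SELECT id, name, city FROM index_test WHERE id IN ('1001', '1002')
</code></pre>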
<p><strong>Properties for BloomFilter Index</strong></p>
<table>
<thead>
<tr>
<th>Property</th>
<th>Is Required</th>
<th>Default Value</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>BLOOM_SIZE</td>
<td>NO</td>
<td>640000</td>
<td>BloomFilter uses this value internally as the expected number of insertions, so it affects the size of the BloomFilter index. Because each blocklet has its own BloomFilter, the default is the approximate number of distinct index values in a blocklet, assuming that each blocklet contains 20 pages and each page contains 32000 records (20 &times; 32000 = 640000). The value must be an integer.</td>
</tr>
<tr>
<td>BLOOM_FPP</td>
<td>NO</td>
<td>0.00001</td>
<td>BloomFilter uses this value internally as the false-positive probability. It affects the size of the BloomFilter index as well as the number of hash functions the BloomFilter uses. The value must be in the range (0, 1). In one test scenario, a 96GB TPCH customer table with bloom_size=320000 and bloom_fpp=0.00001 resulted in 18 false-positive samples.</td>
</tr>
<tr>
<td>BLOOM_COMPRESS</td>
<td>NO</td>
<td>true</td>
<td>Whether to compress the BloomFilter index files.</td>
</tr>
</tbody>
</table>
<h2>
<a id="loading-data" class="anchor" href="#loading-data" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Loading Data</h2>
<p>When data is loaded into the main table, BloomFilter files are generated for all the
index columns given in the CREATE statement. Each file contains the blockletId and a BloomFilter for each index column.
These index files are written to a folder named after the index
inside each segment folder.</p>
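<p>As a minimal sketch, an ordinary load into the main table is what triggers the index files to be written. The input path below is hypothetical, and the CSV is assumed to have columns matching <code>index_test</code>.</p>
<pre><code>-- hypothetical input path; BloomFilter files for the table's indexes are written as part of this load
LOAD DATA INPATH '/tmp/index_test.csv' INTO TABLE index_test
</code></pre>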
<h2>
<a id="querying-data" class="anchor" href="#querying-data" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Querying Data</h2>
<p>Users can verify whether a query can leverage the BloomFilter index by running the <code>EXPLAIN</code> command,
which shows the transformed logical plan; from it, users can check whether the BloomFilter index skips blocklets during the scan.
If the index does not prune blocklets well, try increasing the value of property <code>BLOOM_SIZE</code> and decreasing the value of property <code>BLOOM_FPP</code>.</p>
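<p>A minimal sketch of such a check, assuming the <code>index_test</code> table from earlier on this page and a made-up filter value. The exact layout of the output depends on the CarbonData version, so it is not reproduced here.</p>
<pre><code>-- inspect the plan and the pruning information for a query on an index column
EXPLAIN SELECT * FROM index_test WHERE name = 'n1000'
</code></pre>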
<h2>
<a id="data-management-with-bloomfilter-index" class="anchor" href="#data-management-with-bloomfilter-index" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Data Management With BloomFilter Index</h2>
<p>Data management with a BloomFilter index is no different from data management with a Lucene index.
Refer to the corresponding section in <a href="./lucene-index-guide.html" target="_blank">CarbonData Lucene Index</a>.</p>
<h2>
<a id="useful-tips" class="anchor" href="#useful-tips" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Useful Tips</h2>
<ul>
<li>A BloomFilter index is best created on high-cardinality columns
whose query conditions are always simple <code>equal</code> or <code>in</code> predicates,
such as 'col1=XX' or 'col1 in (XX, YY)'.</li>
<li>You can create multiple BloomFilter indexes on one table,
but we recommend creating a single BloomFilter index that contains multiple index columns,
because data loading and query performance will be better.</li>
<li>
<code>BLOOM_FPP</code> is only the false-positive probability the user expects; the actual FPP may be worse.
If the BloomFilter index does not work well,
try increasing <code>BLOOM_SIZE</code> and decreasing <code>BLOOM_FPP</code> at the same time.
Note that a bigger <code>BLOOM_SIZE</code> increases the size of the index file,
and a smaller <code>BLOOM_FPP</code> increases the computation performed at query time.</li>
<li>'0' skipped blocklets for the BloomFilter index in the EXPLAIN output indicates that
the BloomFilter index does not prune better than the main index.
(For example, if the data is not ordered, a specific value may be contained in many blocklets. In that case, the BloomFilter may not work better than the main index.)
If this occurs very often, the current BloomFilter index is not useful, and you can disable or drop it.
Sometimes the EXPLAIN output shows no pruning result for the BloomFilter index at all;
this indicates that the previous index has already pruned all the blocklets, so there is no need to continue pruning.</li>
<li>In some scenarios, the BloomFilter index may not improve query performance significantly,
but if it reduces the number of Spark tasks,
it can still improve performance for concurrent queries.</li>
<li>Note that a BloomFilter index decreases data loading performance and may cause slight storage expansion (for the index files).</li>
</ul>
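<p>Two of the tips above can be sketched in DDL against the <code>index_test</code> table from earlier on this page. The index names are hypothetical, and the sketch assumes no other BloomFilter index already covers these columns.</p>
<pre><code>-- preferred: one index that covers both columns
CREATE INDEX IF NOT EXISTS name_city_bloom
ON TABLE index_test (name, city)
AS 'bloomfilter'
PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001')

-- rather than two single-column indexes:
-- CREATE INDEX name_bloom ON TABLE index_test (name) AS 'bloomfilter'
-- CREATE INDEX city_bloom ON TABLE index_test (city) AS 'bloomfilter'

-- if EXPLAIN keeps reporting 0 skipped blocklets for the index, drop it
DROP INDEX IF EXISTS name_city_bloom ON index_test
</code></pre>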
<script>
$(function() {
// Show selected style on nav item
$('.b-nav__datamap').addClass('selected');
if (!$('.b-nav__datamap').parent().hasClass('nav__item__with__subs--expanded')) {
// Display datamap subnav items
$('.b-nav__datamap').parent().toggleClass('nav__item__with__subs--expanded');
}
});
</script></div>
</div>
</div>
</div>
<div class="doc-footer">
<a href="#top" class="scroll-top">Top</a>
</div>
</div>
</section>
</div>
</div>
</div>
</section><!-- End systemblock part -->
<script src="js/custom.js"></script>
</body>
</html>