docs/release_notes.adoc - kudu - Git at Google

 // Copyright 2015 Cloudera, Inc.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
 // You may obtain a copy of the License at
 //
 //     http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 [[release_notes]]
 = Kudu Release Notes

 :author: Kudu Team
 :imagesdir: ./images
 :icons: font
 :toc: left
 :toclevels: 3
 :doctype: book
 :backend: html5
 :sectlinks:
 :experimental:

 == Introducing Kudu

 Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares
 the common technical properties of Hadoop ecosystem applications: it runs on
 commodity hardware, is horizontally scalable, and supports highly available operation.

 Kudu’s design sets it apart. Some of Kudu’s benefits include:

 * Fast processing of OLAP workloads.
 * Integration with MapReduce, Spark and other Hadoop ecosystem components.
 * Tight integration with Cloudera Impala, making it a good, mutable alternative to
 using HDFS with Parquet. See link:kudu_impala_integration.html[Kudu Impala Integration].
 * Strong but flexible consistency model.
 * Strong performance for running sequential and random workloads simultaneously.
 * Easy to administer and manage with Cloudera Manager.
 * Efficient utilization of hardware resources.
 * High availability. Tablet Servers and Masters use the Raft Consensus Algorithm.
 Given a replication factor of `2f+1`, if `f` tablet servers serving a given tablet
 fail, the tablet is still available.
 +
 NOTE: High availability for masters is not supported during the public beta.

 By combining all of these properties, Kudu targets support for families of
 applications that are difficult or impossible to implement on current generation
 Hadoop storage technologies.

 === Kudu-Impala Integration Features
 `CREATE TABLE`::
   Impala supports creating and dropping tables using Kudu as the persistence layer.
   The tables follow the same internal / external approach as other tables in Impala,
   allowing for flexible data ingestion and querying.
 `INSERT`::
   Data can be inserted into Kudu tables from Impala using the same mechanisms as
   any other table with HDFS or HBase persistence.
 `UPDATE` / `DELETE`::
   Impala supports the `UPDATE` and `DELETE` SQL commands to modify existing data in
   a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen
   to be as compatible as possible to existing solutions. In addition to simple `DELETE`
   or `UPDATE` commands, you can specify complex joins in the `FROM` clause of the query
   using the same syntax as a regular `SELECT` statement.
 Flexible Partitioning::
   Similar to partitioning of tables in Hive, Kudu allows you to dynamically
   pre-split tables by hash or range into a predefined number of tablets, in order
   to distribute writes and queries evenly across your cluster. You can partition by
   any number of primary key columns, by any number of hashes and an optional list of
   split rows.
 Parallel Scan::
   To achieve the highest possible performance on modern hardware, the Kudu client
   within Impala parallelizes scans to multiple tablets.
 High-efficiency queries::
   Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates
   are evaluated as close as possible to the data. Query performance is comparable
   to Parquet in many workloads.

 == About the Kudu Public Beta

 This release of Kudu is a public beta. Do not run this beta release on production clusters. During the
 public beta period, Kudu will be supported via a link:https://issues.cloudera.org/projects/KUDU
 [public JIRA] and a public mailing list link:mailto:kudu-user@cloudera.org[mailing
 list], which will be monitored by the Kudu development team and community members.
 Commercial support is not available at this time.

 * You can submit any issues or feedback related to your Kudu experience via either
 the JIRA system or the mailing list. The Kudu development team and community members
 will respond and assist as quickly as possible.
 * The Kudu team will work with early adopters to fix bugs and release new binary drops
 when fixes or features are ready. However, we cannot commit to issue resolution or
 bug fix delivery times during the public beta period, and it is possible that some
 fixes or enhancements will not be selected for a release.
 * We can't guarantee timeframes or contents for future beta code drops. However,
 they will be announced to the user group when they occur.
 * No guarantees are made regarding upgrades from this release to follow-on releases.
 While multiple drops of beta code are planned, we can't guarantee their schedules
 or contents.

 == Resources

 - link:http://getkudu.io[Kudu Website]
 - link:http://github.com/cloudera/kudu[Kudu Github Repository]
 - link:index.html.html[Kudu Documentation]

 == Installation Options
 * A Quickstart VM is provided to get you up and running quickly.
 * You can install parcels or packages in clusters managed by Cloudera Manager, or
 packages in standalone CDH clusters.
 * You can build Kudu from source.

 For full installation details, see link:installation.html[Kudu Installation].

 == Limitations of the Public Beta

 === Operating System Limitations
 * RHEL 6.4 or newer, CentOS 6.4 or newer, and Ubuntu Trusty are are the only
 operating systems supported for installation in the public beta.
 Others may work but have not been tested.

 === Storage Limitations
 * Kudu has been tested with up to 4 TB of data per tablet server. More testing
 is needed for denser storage configurations.

 === Schema Limitations
 * Testing with more than 20 columns has been limited.
 * Multi-kilobyte rows have not been thoroughly tested.
 * The columns which make up the primary key must be listed first in the schema.
 * Key columns cannot be altered. You must drop and recreate a table to change its keys.
 * Key columns must not be null.
 * Columns with `DOUBLE`, `FLOAT`, or `BOOL` types are not allowed as part of a
 primary key definition.
 * Type and nullability of existing columns cannot be changed by altering the table.
 * A table’s primary key cannot be changed.
 * Dropping a column does not immediately reclaim space. Compaction must run first.
 There is no way to run compaction manually, but dropping the table will reclaim the
 space immediately.

 === Ingest Limitations
 * Ingest via Sqoop or Flume is not supported in the public beta. The recommended
 approach for bulk ingest is to use Impala’s `CREATE TABLE AS SELECT` functionality
 or use the Kudu's Java or C++ API.
 * Tables must be manually pre-split into tablets using simple or compound primary
 keys. Automatic splitting is not yet possible. Instead, add split rows at table creation.
 * Tablets cannot currently be merged. Instead, create a new table with the contents
 of the old tables to be merged.

 === Cloudera Manager Limitations
 * Some metrics, such as latency histograms, are not yet available in Cloudera Manager.
 * Some service and role chart pages are still under development. More charts and
 metrics will be visible in future releases.

 === Replication and Backup Limitations
 * Replication and failover of Kudu masters is considered experimental. It is
 recommended to run a single master and periodically perform a manual backup of its data directories.

 === Impala Limitations
 * To use Kudu with Impala, you must install a special release of Impala. Obtaining
 and installing a compatible Impala release is detailed in Kudu's
 link:kudu_impala_integration.html[Impala Integration] documentation.
 * To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
 * Updates, inserts, and deletes via Impala are non-transactional. If a query
 fails part-way, its partial effects will not be rolled back..
 * All queries will be distributed across all Impala nodes which host a replica
 of the target table(s), even if a predicate on a primary key could correctly
 restrict the query to a single tablet. This limits the maximum concurrency of
 short queries made via Impala.
 * ALTER TABLE on Kudu tables is not yet supported via Impala.
 * No timestamp and decimal type support.
 * The maximum parallelism of a single query is limited to the number of tablets
 in a table. For good analytic performance, aim for 10 or more tablets per host
 or large tables.
 * Impala is only able to push down predicates involving `=`, `<=`, `>=`,
 or `BETWEEN` comparisons between a column and a literal value. Impala pushes down
 predicates`<` and `>` for integer columns only. For example, for a table with
 an integer key `ts`, and a float key `income`, the predicate `WHERE ts >= 12345`
 will convert into an efficient range scan, whereas ‘where income > 1000000.0’
 will currently fetch all data from the table and evaluate the predicate within Impala.

 === Security Limitations
 * Authentication and authorization are not included in the public beta.
 * Data encryption is not included in the public beta.

 === Client and API Limitations
 * Potentially-incompatible C++ and Java API changes may be required during the
 public beta.
 * ALTER TABLE is not yet fully supported via the client APIs. More ALTER TABLE
 operations will become available in future betas.
 * The Python API is not supported.

 === Application Integration Limitations
 * The Spark DataFrame implementation is not yet complete.

 === Other Known Issues
 The following are known bugs and issues with the current beta release. They will
 be addressed in later beta releases.

 * Building Kudu from source using `gcc` 4.6 causes runtime and test failures. Be sure
 you are using a different version of `gcc` if you build Kudu from source.
 * If the Kudu master is configured with the `-log_fsync_all` option, tablet servers
 and clients will experience frequent timeouts, and the cluster may become unusable.
 * If a tablet server has a very large number of tablets, it may take several minutes
 to start up. It is recommended to limit the number of tablets per server to 100 or fewer.
 Consider this limitation when pre-splitting your tables. If you notice slow start-up times,
 you can monitor the number of tablets per server in the web UI.

 == Next Steps
 - link:quickstart.html[Kudu Quickstart]
 - link:installation.html[Installing Kudu]
 - link:configuration.html[Configuring Kudu]
	// Copyright 2015 Cloudera, Inc.
	//
	// Licensed under the Apache License, Version 2.0 (the "License");
	// you may not use this file except in compliance with the License.
	// You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing, software
	// distributed under the License is distributed on an "AS IS" BASIS,
	// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	// See the License for the specific language governing permissions and
	// limitations under the License.

	[[release_notes]]
	= Kudu Release Notes

	:author: Kudu Team
	:imagesdir: ./images
	:icons: font
	:toc: left
	:toclevels: 3
	:doctype: book
	:backend: html5
	:sectlinks:
	:experimental:

	== Introducing Kudu

	Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares
	the common technical properties of Hadoop ecosystem applications: it runs on
	commodity hardware, is horizontally scalable, and supports highly available operation.

	Kudu’s design sets it apart. Some of Kudu’s benefits include:

	* Fast processing of OLAP workloads.
	* Integration with MapReduce, Spark and other Hadoop ecosystem components.
	* Tight integration with Cloudera Impala, making it a good, mutable alternative to
	using HDFS with Parquet. See link:kudu_impala_integration.html[Kudu Impala Integration].
	* Strong but flexible consistency model.
	* Strong performance for running sequential and random workloads simultaneously.
	* Easy to administer and manage with Cloudera Manager.
	* Efficient utilization of hardware resources.
	* High availability. Tablet Servers and Masters use the Raft Consensus Algorithm.
	Given a replication factor of `2f+1`, if `f` tablet servers serving a given tablet
	fail, the tablet is still available.
	+
	NOTE: High availability for masters is not supported during the public beta.

	By combining all of these properties, Kudu targets support for families of
	applications that are difficult or impossible to implement on current generation
	Hadoop storage technologies.

	=== Kudu-Impala Integration Features
	`CREATE TABLE`::
	Impala supports creating and dropping tables using Kudu as the persistence layer.
	The tables follow the same internal / external approach as other tables in Impala,
	allowing for flexible data ingestion and querying.
	`INSERT`::
	Data can be inserted into Kudu tables from Impala using the same mechanisms as
	any other table with HDFS or HBase persistence.
	`UPDATE` / `DELETE`::
	Impala supports the `UPDATE` and `DELETE` SQL commands to modify existing data in
	a Kudu table row-by-row or as a batch. The syntax of the SQL commands is chosen
	to be as compatible as possible to existing solutions. In addition to simple `DELETE`
	or `UPDATE` commands, you can specify complex joins in the `FROM` clause of the query
	using the same syntax as a regular `SELECT` statement.
	Flexible Partitioning::
	Similar to partitioning of tables in Hive, Kudu allows you to dynamically
	pre-split tables by hash or range into a predefined number of tablets, in order
	to distribute writes and queries evenly across your cluster. You can partition by
	any number of primary key columns, by any number of hashes and an optional list of
	split rows.
	Parallel Scan::
	To achieve the highest possible performance on modern hardware, the Kudu client
	within Impala parallelizes scans to multiple tablets.
	High-efficiency queries::
	Where possible, Impala pushes down predicate evaluation to Kudu, so that predicates
	are evaluated as close as possible to the data. Query performance is comparable
	to Parquet in many workloads.

	== About the Kudu Public Beta

	This release of Kudu is a public beta. Do not run this beta release on production clusters. During the
	public beta period, Kudu will be supported via a link:https://issues.cloudera.org/projects/KUDU
	[public JIRA] and a public mailing list link:mailto:kudu-user@cloudera.org[mailing
	list], which will be monitored by the Kudu development team and community members.
	Commercial support is not available at this time.

	* You can submit any issues or feedback related to your Kudu experience via either
	the JIRA system or the mailing list. The Kudu development team and community members
	will respond and assist as quickly as possible.
	* The Kudu team will work with early adopters to fix bugs and release new binary drops
	when fixes or features are ready. However, we cannot commit to issue resolution or
	bug fix delivery times during the public beta period, and it is possible that some
	fixes or enhancements will not be selected for a release.
	* We can't guarantee timeframes or contents for future beta code drops. However,
	they will be announced to the user group when they occur.
	* No guarantees are made regarding upgrades from this release to follow-on releases.
	While multiple drops of beta code are planned, we can't guarantee their schedules
	or contents.

	== Resources

	- link:http://getkudu.io[Kudu Website]
	- link:http://github.com/cloudera/kudu[Kudu Github Repository]
	- link:index.html.html[Kudu Documentation]

	== Installation Options
	* A Quickstart VM is provided to get you up and running quickly.
	* You can install parcels or packages in clusters managed by Cloudera Manager, or
	packages in standalone CDH clusters.
	* You can build Kudu from source.

	For full installation details, see link:installation.html[Kudu Installation].

	== Limitations of the Public Beta

	=== Operating System Limitations
	* RHEL 6.4 or newer, CentOS 6.4 or newer, and Ubuntu Trusty are are the only
	operating systems supported for installation in the public beta.
	Others may work but have not been tested.

	=== Storage Limitations
	* Kudu has been tested with up to 4 TB of data per tablet server. More testing
	is needed for denser storage configurations.

	=== Schema Limitations
	* Testing with more than 20 columns has been limited.
	* Multi-kilobyte rows have not been thoroughly tested.
	* The columns which make up the primary key must be listed first in the schema.
	* Key columns cannot be altered. You must drop and recreate a table to change its keys.
	* Key columns must not be null.
	* Columns with `DOUBLE`, `FLOAT`, or `BOOL` types are not allowed as part of a
	primary key definition.
	* Type and nullability of existing columns cannot be changed by altering the table.
	* A table’s primary key cannot be changed.
	* Dropping a column does not immediately reclaim space. Compaction must run first.
	There is no way to run compaction manually, but dropping the table will reclaim the
	space immediately.

	=== Ingest Limitations
	* Ingest via Sqoop or Flume is not supported in the public beta. The recommended
	approach for bulk ingest is to use Impala’s `CREATE TABLE AS SELECT` functionality
	or use the Kudu's Java or C++ API.
	* Tables must be manually pre-split into tablets using simple or compound primary
	keys. Automatic splitting is not yet possible. Instead, add split rows at table creation.
	* Tablets cannot currently be merged. Instead, create a new table with the contents
	of the old tables to be merged.

	=== Cloudera Manager Limitations
	* Some metrics, such as latency histograms, are not yet available in Cloudera Manager.
	* Some service and role chart pages are still under development. More charts and
	metrics will be visible in future releases.

	=== Replication and Backup Limitations
	* Replication and failover of Kudu masters is considered experimental. It is
	recommended to run a single master and periodically perform a manual backup of its data directories.

	=== Impala Limitations
	* To use Kudu with Impala, you must install a special release of Impala. Obtaining
	and installing a compatible Impala release is detailed in Kudu's
	link:kudu_impala_integration.html[Impala Integration] documentation.
	* To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
	* Updates, inserts, and deletes via Impala are non-transactional. If a query
	fails part-way, its partial effects will not be rolled back..
	* All queries will be distributed across all Impala nodes which host a replica
	of the target table(s), even if a predicate on a primary key could correctly
	restrict the query to a single tablet. This limits the maximum concurrency of
	short queries made via Impala.
	* ALTER TABLE on Kudu tables is not yet supported via Impala.
	* No timestamp and decimal type support.
	* The maximum parallelism of a single query is limited to the number of tablets
	in a table. For good analytic performance, aim for 10 or more tablets per host
	or large tables.
	* Impala is only able to push down predicates involving `=`, `<=`, `>=`,
	or `BETWEEN` comparisons between a column and a literal value. Impala pushes down
	predicates`<` and `>` for integer columns only. For example, for a table with
	an integer key `ts`, and a float key `income`, the predicate `WHERE ts >= 12345`
	will convert into an efficient range scan, whereas ‘where income > 1000000.0’
	will currently fetch all data from the table and evaluate the predicate within Impala.

	=== Security Limitations
	* Authentication and authorization are not included in the public beta.
	* Data encryption is not included in the public beta.

	=== Client and API Limitations
	* Potentially-incompatible C++ and Java API changes may be required during the
	public beta.
	* ALTER TABLE is not yet fully supported via the client APIs. More ALTER TABLE
	operations will become available in future betas.
	* The Python API is not supported.

	=== Application Integration Limitations
	* The Spark DataFrame implementation is not yet complete.

	=== Other Known Issues
	The following are known bugs and issues with the current beta release. They will
	be addressed in later beta releases.

	* Building Kudu from source using `gcc` 4.6 causes runtime and test failures. Be sure
	you are using a different version of `gcc` if you build Kudu from source.
	* If the Kudu master is configured with the `-log_fsync_all` option, tablet servers
	and clients will experience frequent timeouts, and the cluster may become unusable.
	* If a tablet server has a very large number of tablets, it may take several minutes
	to start up. It is recommended to limit the number of tablets per server to 100 or fewer.
	Consider this limitation when pre-splitting your tables. If you notice slow start-up times,
	you can monitor the number of tablets per server in the web UI.

	== Next Steps
	- link:quickstart.html[Kudu Quickstart]
	- link:installation.html[Installing Kudu]
	- link:configuration.html[Configuring Kudu]