 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 [[developing]]
 = Developing Applications With Apache Kudu

 :author: Kudu Team
 :imagesdir: ./images
 :icons: font
 :toc: left
 :toclevels: 3
 :doctype: book
 :backend: html5
 :sectlinks:
 :experimental:

 Kudu provides C++ and Java client APIs, as well as reference examples to illustrate
 their use. A Python API is included, but it is currently considered experimental,
 unstable, and subject to change at any time.

 WARNING: Use of server-side or private interfaces is not supported, and interfaces
 which are not part of public APIs have no stability guarantees.

 == Viewing the API Documentation
 include::installation.adoc[tags=view_api]

 == Working Examples

 Several example applications are provided in the
 link:https://github.com/cloudera/kudu-examples[kudu-examples] GitHub
 repository. Each example includes a `README` that shows how to compile and run
 it. These examples illustrate correct usage of the Kudu APIs, as well as how to
 set up a virtual machine to run Kudu. The following list includes some of the
 examples that are available today. Check the repository itself in case this list goes
 out of date.

 `java-example`::
   A simple Java application which connects to a Kudu instance, creates a table, writes data to it, then drops the table.
 `collectl`::
   A small Java application which listens on a TCP socket for time series data corresponding to the Collectl wire protocol.
   The commonly-available collectl tool can be used to send example data to the server.
 `clients/python`::
   An experimental Python client for Kudu.
 `demo-vm-setup`::
   Scripts to download and run a VirtualBox virtual machine with Kudu already installed.
   See link:quickstart.html[Quickstart] for more information.

 These examples should serve as helpful starting points for your own Kudu applications and integrations.

 === Maven Artifacts
 The following Maven `<dependency>` element is valid for the Kudu public beta:

 [source,xml]
 ----
 <dependency>
   <groupId>org.apache.kudu</groupId>
   <artifactId>kudu-client</artifactId>
   <version>0.5.0</version>
 </dependency>
 ----

 Because the Maven artifacts are not in Maven Central, use the following `<repository>`
 element:

 [source,xml]
 ----
 <repository>
   <id>cdh.repo</id>
   <name>Cloudera Repositories</name>
   <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
   <snapshots>
     <enabled>false</enabled>
   </snapshots>
 </repository>
 ----

 See subdirectories of https://github.com/cloudera/kudu-examples/tree/master/java for
 example Maven `pom.xml` files.
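
 As a rough sketch of how the `<dependency>` and `<repository>` elements above fit
 into a project, a minimal `pom.xml` might look like the following. The project
 coordinates (`com.example`, `kudu-app`, `1.0-SNAPSHOT`) are placeholders chosen for
 illustration and are not taken from the Kudu examples; adjust the Kudu version to
 match your deployment.

 [source,xml]
 ----
 <project xmlns="http://maven.apache.org/POM/4.0.0">
   <modelVersion>4.0.0</modelVersion>

   <!-- Placeholder coordinates for your own application -->
   <groupId>com.example</groupId>
   <artifactId>kudu-app</artifactId>
   <version>1.0-SNAPSHOT</version>

   <!-- Repository hosting the Kudu artifacts -->
   <repositories>
     <repository>
       <id>cdh.repo</id>
       <name>Cloudera Repositories</name>
       <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
       <snapshots>
         <enabled>false</enabled>
       </snapshots>
     </repository>
   </repositories>

   <!-- Kudu Java client -->
   <dependencies>
     <dependency>
       <groupId>org.apache.kudu</groupId>
       <artifactId>kudu-client</artifactId>
       <version>0.5.0</version>
     </dependency>
   </dependencies>
 </project>
 ----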

 == Example Impala Commands With Kudu

 See link:kudu_impala_integration.html[Using Impala With Kudu] for guidance on installing
 and using Impala with Kudu, including several `impala-shell` examples.

 == Kudu Integration with Spark

 Kudu integrates with Spark through the Data Source API as of version 0.9.
 Include the `kudu-spark` jar using the `--jars` option:

 [source]
 ----
 spark-shell --jars kudu-spark-0.9.0.jar
 ----
 Then import `kudu-spark` and create a DataFrame:

 [source,scala]
 ----
 import org.apache.kudu.client._ // provides CreateTableOptions
 import org.apache.kudu.spark.kudu._

 // Read a table from Kudu
 val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051","kudu.table" -> "kudu_table")).kudu

 // Query using the Spark API...
 df.select("id").filter("id >= 5").show()

 // ...or register a temporary table and use SQL
 df.registerTempTable("kudu_table")
 val filteredDF = sqlContext.sql("select id from kudu_table where id >= 5")
 filteredDF.show()

 // Use KuduContext to create, delete, or write to Kudu tables
 val kuduContext = new KuduContext("kudu.master:7051")

 // Create a new Kudu table from a dataframe schema
 // NB: No rows from the dataframe are inserted into the table
 kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

 // Insert data
 kuduContext.insertRows(df, "test_table")

 // Delete data
 kuduContext.deleteRows(filteredDF, "test_table")

 // Upsert data
 kuduContext.upsertRows(df, "test_table")

 // Update data
 val alteredDF = df.select($"id", ($"count" + 1).as("count"))
 kuduContext.updateRows(alteredDF, "test_table")

 // Data can also be inserted into the Kudu table using the data source, though the methods on KuduContext are preferred
 // NB: The default is to upsert rows; to perform standard inserts instead, set operation = insert in the options map
 // NB: Only mode Append is supported
 df.write.options(Map("kudu.master"-> "kudu.master:7051", "kudu.table"-> "test_table")).mode("append").kudu

 // Check for the existence of a Kudu table
 kuduContext.tableExists("another_table")

 // Delete a Kudu table
 kuduContext.deleteTable("unwanted_table")
 ----

 === Spark Integration Known Issues and Limitations

 - The Kudu Spark integration is tested and developed against Spark 1.6 and Scala
   2.10.
 - Kudu tables with a name containing uppercase or non-ASCII characters must be
   assigned an alternate name when registered as a temporary table.
 - Kudu tables with a column name containing uppercase or non-ASCII characters
   may not be used with Spark SQL. Non-primary key columns may be renamed in Kudu
   to work around this issue.
 - `NULL`, `NOT NULL`, `<>`, `OR`, `LIKE`, and `IN` predicates are not pushed to
   Kudu, and instead will be evaluated by the Spark task.
 - Kudu does not support every type that Spark SQL supports. For example, `Date`,
   `Decimal`, and complex types are not supported.
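
 The table-naming and predicate limitations above can be illustrated with a short
 sketch. It assumes a `spark-shell` session started as shown earlier, and a
 hypothetical Kudu table named `Metrics_Table` with a lowercase `id` column; neither
 name comes from the Kudu examples.

 [source,scala]
 ----
 import org.apache.kudu.spark.kudu._

 // Hypothetical table whose name contains uppercase characters
 val metricsDF = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "Metrics_Table")).kudu

 // Register it under an alternate, all-lowercase name for use in SQL
 metricsDF.registerTempTable("metrics_table")

 // >= is not among the predicates listed above as being evaluated by Spark
 sqlContext.sql("select id from metrics_table where id >= 5").show()

 // <> is in that list, so this filter is evaluated by the Spark task
 sqlContext.sql("select id from metrics_table where id <> 5").show()
 ----

 Both queries return correct results; the difference is only where the filtering
 work is done.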

 == Integration with MapReduce, YARN, and Other Frameworks

 Kudu was designed to integrate with MapReduce, YARN, Spark, and other frameworks in
 the Hadoop ecosystem. See
 link:https://github.com/apache/kudu/blob/master/java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/RowCounter.java[RowCounter.java]
 and
 link:https://github.com/apache/kudu/blob/master/java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/ImportCsv.java[ImportCsv.java]
 for examples which you can model your own integrations on. Stay tuned for more examples
 using YARN and Spark in the future.