| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| = Apache NiFi Quickstart |
| |
Below is a brief example that uses Apache NiFi to ingest data into Apache Kudu.
| |
| == Start the Kudu Quickstart Environment |
| |
| See the Apache Kudu |
| link:https://kudu.apache.org/docs/quickstart.html[quickstart documentation] |
to set up and run the Kudu quickstart environment.
| |
| == Run Apache NiFi |
| |
| Use the following command to run the latest Apache NiFi Docker image: |
| |
| [source,bash] |
| ---- |
| docker run -d --name kudu-nifi --network="docker_default" -p 8080:8080 apache/nifi:latest |
| ---- |
| |
| You can view the running NiFi instance at link:http://localhost:8080/nifi[localhost:8080/nifi]. |
| |
NOTE: `--network="docker_default"` is specified to connect the container to the
same network as the quickstart cluster.
| |
| NOTE: You can remove the `-d` flag to run the container in the foreground. |
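
NiFi can take a minute or two to finish starting, so the UI may not be reachable
immediately. As a quick check, you can confirm the container is running and follow
its startup logs (these are standard Docker commands and assume the `kudu-nifi`
container name used above):

[source,bash]
----
# Confirm the NiFi container is up.
docker ps --filter "name=kudu-nifi"
# Follow the NiFi logs until the web UI becomes available (CTRL + C to stop following).
docker logs -f kudu-nifi
----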
| |
| == Create the Kudu table |
| |
Create the `random_user` Kudu table that matches the schema expected by the dataflow template.
| |
To do this without installing any dependencies on your host machine, we will
use the `jshell` REPL in a Docker container to create the table using the
Java API. First start the Docker container, download the jar, and run the REPL:
| |
| [source,bash] |
| ---- |
docker run -it --rm --network="docker_default" maven:latest /bin/bash
| # Download the kudu-client-tools jar which has the kudu-client and all the dependencies. |
| mkdir jars |
| mvn dependency:copy \ |
| -Dartifact=org.apache.kudu:kudu-client-tools:1.16.0 \ |
| -DoutputDirectory=jars |
| # Run the jshell with the jar on the classpath. |
| jshell --class-path jars/* |
| ---- |
| |
NOTE: `--network="docker_default"` is specified to connect the container to the
same network as the quickstart cluster.
| |
| Then, once in the `jshell` REPL, create the table using the Java API: |
| |
| [source,java] |
| ---- |
| import org.apache.kudu.client.CreateTableOptions |
| import org.apache.kudu.client.KuduClient |
| import org.apache.kudu.client.KuduClient.KuduClientBuilder |
| import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder |
| import org.apache.kudu.Schema |
| import org.apache.kudu.Type |
| |
| KuduClient client = |
| new KuduClientBuilder("kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251").build(); |
| |
| if(client.tableExists("random_user")) { |
| client.deleteTable("random_user"); |
| } |
| |
| Schema schema = new Schema(Arrays.asList( |
| new ColumnSchemaBuilder("ssn", Type.STRING).key(true).build(), |
| new ColumnSchemaBuilder("firstName", Type.STRING).build(), |
| new ColumnSchemaBuilder("lastName", Type.STRING).build(), |
| new ColumnSchemaBuilder("email", Type.STRING).build()) |
| ); |
| CreateTableOptions tableOptions = |
| new CreateTableOptions().setNumReplicas(3).addHashPartitions(Arrays.asList("ssn"), 4); |
| client.createTable("random_user", schema, tableOptions); |
| ---- |
| |
| Once complete, you can use `CTRL + D` to exit the REPL and `exit` to exit the container. |
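
Optionally, you can verify the table was created without installing anything on
your host by running the `kudu` CLI inside one of the quickstart master
containers. This is a sketch that assumes the quickstart containers are named as
in the Kudu quickstart documentation (e.g. `kudu-master-1`):

[source,bash]
----
# List the tables on the quickstart cluster; `random_user` should appear in the output.
docker exec -it $(docker ps -aqf "name=kudu-master-1") \
  kudu table list kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
----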
| |
| == Load the Dataflow Template |
| |
| The `Random_User_Kudu.xml` template downloads randomly generated user data from |
http://randomuser.me and then pushes the data into Kudu. The data is pulled
100 records at a time and then split into individual records. The incoming data
is in JSON format.
| |
Next, the user's social security number, first name, last name, and e-mail
address are extracted from the JSON into FlowFile Attributes and the content is
modified to become a new JSON document consisting of only four fields:
`ssn`, `firstName`, `lastName`, and `email`. Finally, this smaller JSON document is
pushed to Kudu as a single row, with each field stored as a separate column in that row.
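
For illustration, the trimmed-down JSON document written to Kudu for each user
might look like the following (the field names match the table schema created
above; the values are made-up placeholders):

[source,json]
----
{
  "ssn": "123-45-6789",
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "jane.doe@example.com"
}
----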
| |
To load the template, follow the NiFi
| link:https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Import_Template["Importing a Template" documentation] |
| to load `Random_User_Kudu.xml`. |
| |
| Then follow the NiFi |
link:https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#instantiating-a-template["Instantiating a Template" documentation]
| to add the `Random User Kudu` template to the canvas. |
| |
Once the template is added to the canvas, you need to start the JsonTreeReader
controller service. You can do this via the PutKudu processor configuration
or via the NiFi Flow configuration in the Operate panel. See the NiFi
| link:https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Controller_Services_for_Dataflows["Controller Service" documentation] |
| for more details. |
| |
| Now you can start individual processors by right-clicking each processor and selecting `Start`. |
| You can also explore the configuration, queue contents, and more by right-clicking on each element. |
Alternatively, you can use the Operate panel to start the entire flow at once.
For more about starting and stopping NiFi components, see the NiFi
| link:https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#starting-a-component["Starting a Component" documentation]. |
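
Once the flow is running, you can optionally confirm that rows are arriving in
Kudu. As a rough check, assuming the quickstart containers are named as in the
Kudu quickstart documentation and your Kudu CLI includes the `kudu table scan`
subcommand, you can scan the table from one of the master containers:

[source,bash]
----
# Print the rows currently in the random_user table (output grows as the flow runs).
docker exec -it $(docker ps -aqf "name=kudu-master-1") \
  kudu table scan kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251 random_user
----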
| |
| == Shutdown NiFi |
| |
Once you are done with the NiFi container, you can shut it down in a couple of ways.
If you ran NiFi without the `-d` flag, you can use `CTRL + C` to stop the container.
| |
| If you ran NiFi with the `-d` flag, you can use the following to |
gracefully shut down the container:
| |
| [source,bash] |
| ---- |
| docker stop kudu-nifi |
| ---- |
| |
To permanently remove the container, run the following:
| |
| [source,bash] |
| ---- |
| docker rm kudu-nifi |
| ---- |
| |
| == Next steps |
| |
| The above example showed how to ingest data into Kudu using Apache NiFi. |
Next, explore the other quickstart guides to learn how to query or process
| the data using other tools. |
| |
| For example, the link:https://github.com/apache/kudu/tree/master/examples/quickstart/spark[Spark quickstart guide] |
will walk you through how to set up and query Kudu tables with the `kudu-spark`
integration.
| |
If you have already run through the Spark quickstart, the following is a brief
example of how to query the `random_user` table:
| |
| [source,bash] |
| ---- |
| spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.16.0 |
| ---- |
| |
| [source,scala] |
| ---- |
| :paste |
| val random_user = spark.read |
| .option("kudu.master", "localhost:7051,localhost:7151,localhost:7251") |
| .option("kudu.table", "random_user") |
| // We need to use leader_only because Kudu on Docker currently doesn't |
| // support Snapshot scans due to `--use_hybrid_clock=false`. |
| .option("kudu.scanLocality", "leader_only") |
| .format("kudu").load |
| random_user.createOrReplaceTempView("random_user") |
| spark.sql("SELECT count(*) FROM random_user").show() |
| spark.sql("SELECT * FROM random_user LIMIT 5").show() |
| ---- |
| |
| == Help |
| |
If you have questions, issues, or feedback on this quickstart guide, please reach out to the
| link:https://kudu.apache.org/community.html[Apache Kudu community]. |