# Running the benchmark
For instructions on how to run the Pinot/Druid benchmark, please refer to the
```run_benchmark.sh``` file.
In order to run the Apache Pinot benchmark you'll need to create the appropriate
data segments, which are too large to be included in this GitHub repository and
may need to be recreated for new Apache Pinot versions.
To create the necessary segment data for the benchmark, follow the
instructions below.
# Creating Apache Pinot benchmark segments from TPC-H data
To run the Pinot/Druid benchmark with Apache Pinot you'll need to download and run
the TPC-H tools to generate the benchmark data sets.
## Downloading and building the TPC-H tools
The TPC-H tools can be downloaded from the [TPC-H Website](http://www.tpc.org/tpch/default5.asp).
Registration is required.
**Note:** The instructions below for dbgen assume a Linux OS.
After downloading and extracting the TPC-H tools, you'll need to build the
database generator tool, ```dbgen```. To do so, go into the ```dbgen```
subdirectory of the extracted package and edit the ```makefile``` file.
Set the following variables in the makefile to:
```
CC = gcc
...
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
```
Next, build the dbgen tool as per the README instructions in the dbgen directory.
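On a typical Linux setup this amounts to running ```make``` inside the ```dbgen``` directory (a minimal sketch; the README bundled with the tools is authoritative if your setup differs):
```
cd dbgen
make
```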
## Generating the TPC-H data and converting it for use in Apache Pinot
After building ```dbgen``` run the following command line in the ```dbgen``` directory:
```
./dbgen -TL -s8
```
The command above generates a single large file called ```lineitem.tbl```
(```-T L``` restricts generation to the lineitem table; ```-s 8``` sets the
scale factor to 8). This is the raw data file for the TPC-H benchmark, which
we'll need to post-process a bit before it can be imported into Apache Pinot.
Next, build the Pinot/Druid benchmark code if you haven't done so already.
**Note:** Apache Pinot has JDK 11 support; however, for now it's
best to use JDK 8 for all build and run operations in this manual.
Inside ```pinot_directory/contrib/pinot-druid-benchmark``` run:
```
./mvnw clean install
```
Next, inside the same directory split the ```lineitem``` table:
```
./target/appassembler/bin/data-separator.sh <Path to lineitem.tbl> <Output Directory>
```
Use the output directory from the split as the input directory for the merge
command below:
```
./target/appassembler/bin/data-merger.sh <Input Directory> <Output Directory> YEAR
```
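For example, assuming ```lineitem.tbl``` was generated under ```~/tpch/dbgen``` and using ```data_out``` as a working directory (the paths here are illustrative; the merge output directory lines up with the ```inputDirURI``` used in the ingestion job spec further below):
```
./target/appassembler/bin/data-separator.sh ~/tpch/dbgen/lineitem.tbl data_out/split
./target/appassembler/bin/data-merger.sh data_out/split data_out/raw_data YEAR
```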
If all went well, you should see seven CSV files produced, 1992.csv through 1998.csv.
These files are the starting point for creating our Apache Pinot segments.
## Create the Apache Pinot segments
The first step in the process is to launch a standalone Apache Pinot cluster on a
single server. This cluster will serve as a host for the initial segments,
which we'll extract and copy for later re-use in the benchmark.
Follow the steps outlined in the
[Apache Pinot manual cluster setup](https://docs.pinot.apache.org/basics/getting-started/advanced-pinot-setup)
to launch the cluster.
You don't need the Kafka service as we won't be using it.
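A minimal sketch of such a launch (the linked page is authoritative; the ports and flags below are the defaults):
```
bin/pinot-admin.sh StartZookeeper -zkPort 2181 &
bin/pinot-admin.sh StartController -zkAddress localhost:2181 &
bin/pinot-admin.sh StartBroker -zkAddress localhost:2181 &
bin/pinot-admin.sh StartServer -zkAddress localhost:2181 &
```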
Next, we'll follow instructions similar to those described in
the [Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot)
in the Apache Pinot documentation.
### Create the Apache Pinot tables
Run:
```
pinot-admin.sh AddTable \
  -tableConfigFile /absolute/path/to/table_config.json \
  -schemaFile /absolute/path/to/schema.json -exec
```
The command above requires the following configuration files:
```table_config.json```
```
{
  "tableName": "tpch_lineitem",
  "segmentsConfig": {
    "replication": "1",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "starTreeIndexConfigs": [{
      "maxLeafRecords": 100,
      "functionColumnPairs": ["SUM__l_extendedprice", "SUM__l_discount", "SUM__l_quantity"],
      "dimensionsSplitOrder": ["l_receiptdate", "l_shipdate", "l_shipmode", "l_returnflag"],
      "skipStarNodeCreationForDimensions": [],
      "skipMaterializationForDimensions": ["l_partkey", "l_commitdate", "l_linestatus", "l_comment", "l_orderkey", "l_shipinstruct", "l_linenumber", "l_suppkey"]
    }]
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
```
```schema.json```
```
{
  "schemaName": "tpch_lineitem",
  "dimensionFieldSpecs": [
    {
      "name": "l_orderkey",
      "dataType": "INT"
    },
    {
      "name": "l_partkey",
      "dataType": "INT"
    },
    {
      "name": "l_suppkey",
      "dataType": "INT"
    },
    {
      "name": "l_linenumber",
      "dataType": "INT"
    },
    {
      "name": "l_returnflag",
      "dataType": "STRING"
    },
    {
      "name": "l_linestatus",
      "dataType": "STRING"
    },
    {
      "name": "l_shipdate",
      "dataType": "STRING"
    },
    {
      "name": "l_commitdate",
      "dataType": "STRING"
    },
    {
      "name": "l_receiptdate",
      "dataType": "STRING"
    },
    {
      "name": "l_shipinstruct",
      "dataType": "STRING"
    },
    {
      "name": "l_shipmode",
      "dataType": "STRING"
    },
    {
      "name": "l_comment",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "l_quantity",
      "dataType": "LONG"
    },
    {
      "name": "l_extendedprice",
      "dataType": "DOUBLE"
    },
    {
      "name": "l_discount",
      "dataType": "DOUBLE"
    },
    {
      "name": "l_tax",
      "dataType": "DOUBLE"
    }
  ]
}
```
**Note:** The configuration above produces segments with the
**optimal star tree index**. The index configuration is
specified in the ```tableIndexConfig``` section of the ```table_config.json``` file. If
you want to generate a different type of indexed segment,
modify the ```tableIndexConfig``` section to reflect the correct index
type, as described in the [Indexing Section](https://docs.pinot.apache.org/basics/features/indexing)
of the Apache Pinot documentation.
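Once the table has been added, you can verify that the controller knows about it (assuming the default controller port 9000; the second URI is the same one referenced by the ingestion job spec below):
```
curl http://localhost:9000/tables
curl http://localhost:9000/tables/tpch_lineitem/schema
```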
### Create the Apache Pinot segments
Next, we'll create the segments for this Apache Pinot table using the optimal
star tree index configuration.
For this purpose you'll need a job specification YAML file. Here's an example
that does the TPC-H data import:
```job-spec.yml```
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/absolute/path/to/pinot/contrib/pinot-druid-benchmark/data_out/raw_data/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/absolute/path/to/pinot/contrib/pinot-druid-benchmark/data_out/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    delimiter: '|'
    multiValueDelimiterEnabled: false
    header: 'l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment|'
tableSpec:
  tableName: 'tpch_lineitem'
  schemaURI: 'http://localhost:9000/tables/tpch_lineitem/schema'
  tableConfigURI: 'http://localhost:9000/tables/tpch_lineitem'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
**Note:** Make sure you modify the absolute paths for **inputDirURI** and **outputDirURI**
above. The inputDirURI should point to the directory containing the 7 yearly
CSV files, 1992.csv through 1998.csv.
After you have modified the input and output directories, run the job as described in the
[Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot) document:
```
pinot-admin.sh LaunchDataIngestionJob \
-jobSpecFile /absolute/path/to/job-spec.yml
```
The segment creation output on the console will tell you where Apache Pinot
stores the created segments (it should be your output directory). You should see a
line in the output such as:
```
...
outputDirURI: /absolute/path/to/pinot/contrib/pinot-druid-benchmark/data_out/segments/
...
```
Once segment creation has finished, you'll find a ```tpch_lineitem_OFFLINE```
directory inside containing 7 separate segments, 0 through 6. Tar/gzip the whole
directory; the resulting archive is the optimal_startree_small_yearly temp
segment that the benchmark requires.
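For example (illustrative; the archive name is whatever your benchmark setup expects):
```
cd /absolute/path/to/pinot/contrib/pinot-druid-benchmark/data_out/segments
tar -czf optimal_startree_small_yearly.tar.gz tpch_lineitem_OFFLINE/
```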
Try a few queries to ensure that the segments are working. You can find some
sample queries under the benchmark directory ```src/main/resources/pinot_queries```.
Watch the console output from the Apache Pinot cluster as you run the queries, and make sure
it does not complain that a query was slow because an index wasn't found.
If you see a message saying a query was slow, it means that the indexes weren't
created properly. With the optimal star tree index your total query time should be
a few milliseconds at most.
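For instance, an illustrative query (not one of the bundled benchmark queries) that exercises the star-tree dimensions and metrics defined above can be posted through the broker, assuming the default broker port 8099:
```
pinot-admin.sh PostQuery \
  -brokerPort 8099 \
  -query "SELECT SUM(l_extendedprice) FROM tpch_lineitem WHERE l_shipmode = 'AIR' AND l_returnflag = 'R'"
```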
You can now shut down the Apache Pinot cluster that you started manually; when you
launch the benchmark server cluster, it will pick up your new segments.