This tutorial is designed to provide a quick introduction to create and use partition table in Apache CarbonData.
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type , ...)] PARTITIONED BY (partition_col_name data_type) STORED BY 'carbondata' [TBLPROPERTIES ('PARTITION_TYPE'='HASH', 'PARTITION_NUM'='N' ...)] //N is the number of hash partitions
Example:
create table if not exists hash_partition_table( col_A String, col_B Int, col_C Long, col_D Decimal(10,2), col_F Timestamp ) partitioned by (col_E Long) stored by 'carbondata' tblproperties('partition_type'='Hash','partition_num'='9')
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type , ...)] PARTITIONED BY (partition_col_name data_type) STORED BY 'carbondata' [TBLPROPERTIES ('PARTITION_TYPE'='RANGE', 'RANGE_INFO'='2014-01-01, 2015-01-01, 2016-01-01' ...)]
Note:
The ‘RANGE_INFO’ must be defined in ascending order in the table properties.
The default format for partition column of Date/Timestamp type is yyyy-MM-dd. Alternate formats for Date/Timestamp could be defined in CarbonProperties.
Example:
create table if not exists hash_partition_table( col_A String, col_B Int, col_C Long, col_D Decimal(10,2), col_E Long ) partitioned by (col_F Timestamp) stored by 'carbondata' tblproperties('partition_type'='Range', 'range_info'='2015-01-01, 2016-01-01, 2017-01-01, 2017-02-01')
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type , ...)] PARTITIONED BY (partition_col_name data_type) STORED BY 'carbondata' [TBLPROPERTIES ('PARTITION_TYPE'='LIST', 'LIST_INFO'='A, B, C' ...)]
Note :
Example:
create table if not exists hash_partition_table( col_B Int, col_C Long, col_D Decimal(10,2), col_E Long, col_F Timestamp ) partitioned by (col_A String) stored by 'carbondata' tblproperties('partition_type'='List', 'list_info'='aaaa, bbbb, (cccc, dddd), eeee')
The following command is executed to get the partition information of the table
SHOW PARTITIONS [db_name.]table_name
ALTER TABLE [db_name].table_name ADD PARTITION('new_partition')
ALTER TABLE [db_name].table_name SPLIT PARTITION(partition_id) INTO('new_partition1', 'new_partition2'...)
//Drop partition definition only and keep data ALTER TABLE [db_name].table_name DROP PARTITION(partition_id) //Drop both partition definition and data ALTER TABLE [db_name].table_name DROP PARTITION(partition_id) WITH DATA
Note:
In the first case where the data in the table is preserved there can be multiple scenarios as described below :
if the table is a range partition table, data will be merged into the next partition, and if the dropped partition is the last partition, then data will be merged into the default partition.
if the table is a list partition table, data will be merged into default partition.
Dropping the default partition is not allowed, but DELETE statement can be used to delete data in default partition.
The partition_id could be fetched using the SHOW PARTITIONS command.
Hash partition table is not supported for ADD, SPLIT and DROP commands.
In CarbonData like the hive, folders are not used to divide partitions instead partition id is used to replace the task id. It could make use of the characteristic and meanwhile reduce some metadata.
SegmentDir/0_batchno0-0-1502703086921.carbonindex ^ SegmentDir/part-0-0_batchno0-0-1502703086921.carbondata ^
Here are some useful tips to improve query performance of carbonData partition table:
Prior analysis of proper partition column
The distribution of data based on some random column could be skewed, building a skewed partition table is meaningless. Some basic statistical analysis before the creation of partition table can avoid an extremely skewed column.
Exclude partition column from sort columns
If you have many dimensions, that need to be sorted then one must exclude column present in the partition from sort columns, this will allow another dimension to do the efficient sorting.
Remember to add filter on partition column when writing SQL
When writing SQL on a partition table, try to use filters on the partition column.