Hive Bitmap UDF provides UDFs for generating bitmaps and performing bitmap operations in Hive tables. The bitmap in Hive is exactly the same as the Doris bitmap, so a bitmap built in Hive can be imported into Doris through Spark Load.
Typical uses are generating bitmaps in Hive, running bitmap operations there (union, count, and, or, xor), and importing the resulting bitmaps into Doris.
```sql
-- Example: Create Hive Bitmap Table
CREATE TABLE IF NOT EXISTS `hive_bitmap_table`(
    `k1`   int     COMMENT '',
    `k2`   String  COMMENT '',
    `k3`   String  COMMENT '',
    `uuid` binary  COMMENT 'bitmap'
) comment 'comment'

-- Example: Create Hive Table
CREATE TABLE IF NOT EXISTS `hive_table`(
    `k1`   int     COMMENT '',
    `k2`   String  COMMENT '',
    `k3`   String  COMMENT '',
    `uuid` int     COMMENT ''
) comment 'comment'
```
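For quick testing, a few sample rows can be inserted into `hive_table`; the values below are made up for illustration only and require a Hive version that supports `INSERT ... VALUES`:

```sql
-- Hypothetical test data: three uuids spread over two keys
insert into hive_table values
    (1, 'a', 'x', 101),
    (1, 'a', 'x', 102),
    (2, 'b', 'y', 103);
```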
Hive Bitmap UDF is used in Hive/Spark. First, you need to compile fe to get hive-udf-jar-with-dependencies.jar. Compilation preparation: if you have already compiled the ldb source code, you can compile fe directly; if you have not compiled the ldb source code, you need to manually install thrift first. Reference: Setting Up dev env for FE.
```shell
# clone doris code
git clone https://github.com/apache/doris.git
cd doris
git submodule update --init --recursive

# install thrift (see: Setting Up dev env for FE)

# enter the fe directory
cd fe

# execute the maven packaging command (all sub-modules of fe will be packaged)
mvn package -Dmaven.test.skip=true

# you can also just package the hive-udf module
mvn package -pl hive-udf -am -Dmaven.test.skip=true
```
After packaging and compilation is complete, a target directory will be generated under the hive-udf directory, containing the hive-udf.jar package.
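The compiled jar then needs to be uploaded to HDFS so that Hive/Spark can load it; the `add jar` command below assumes it sits at `hdfs://node:9001/`. A minimal sketch, with the namenode address taken from the examples that follow and the local path adjusted to where Maven put the jar in your build:

```shell
# upload the compiled jar to HDFS; adjust the local path and namenode address to your environment
hdfs dfs -put fe/hive-udf/target/hive-udf-jar-with-dependencies.jar hdfs://node:9001/
```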
```sql
-- Load the Hive Bitmap UDF jar package (upload the compiled hive-udf jar package to HDFS first)
add jar hdfs://node:9001/hive-udf-jar-with-dependencies.jar;

-- Create Hive Bitmap UDAF functions
create temporary function to_bitmap as 'org.apache.doris.udf.ToBitmapUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_union as 'org.apache.doris.udf.BitmapUnionUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';

-- Create Hive Bitmap UDF functions
create temporary function bitmap_count as 'org.apache.doris.udf.BitmapCountUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_and as 'org.apache.doris.udf.BitmapAndUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_or as 'org.apache.doris.udf.BitmapOrUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_xor as 'org.apache.doris.udf.BitmapXorUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';

-- Example: generate a bitmap with the to_bitmap function and write it to the Hive Bitmap table
insert into hive_bitmap_table
select
    k1,
    k2,
    k3,
    to_bitmap(uuid) as uuid
from hive_table
group by k1, k2, k3

-- Example: the bitmap_count function calculates the number of elements in a bitmap
select k1, k2, k3, bitmap_count(uuid) from hive_bitmap_table

-- Example: the bitmap_union function calculates the grouped bitmap union
select k1, bitmap_union(uuid) from hive_bitmap_table group by k1
```
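The scalar functions bitmap_and, bitmap_or, and bitmap_xor registered above combine two bitmaps at a time. A sketch of how bitmap_and might be used against the example table (`'a'` and `'b'` are hypothetical values of k2, not from the original example):

```sql
-- Hypothetical example: for each k1, count the uuids that appear under both k2='a' and k2='b'
select
    t1.k1,
    bitmap_count(bitmap_and(t1.uuid, t2.uuid)) as common_uuid_cnt
from
    (select k1, bitmap_union(uuid) as uuid from hive_bitmap_table where k2 = 'a' group by k1) t1
join
    (select k1, bitmap_union(uuid) as uuid from hive_bitmap_table where k2 = 'b' group by k1) t2
on t1.k1 = t2.k1
```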
When a Hive table is created with the storage format TEXT, Hive saves Binary columns as base64-encoded strings. Therefore, the binary data can be loaded into Doris as Bitmap values directly through the bitmap_from_base64 function by using Doris's Hive Catalog.
Here is a full example:
```sql
-- Hive: create a text-format table with a binary bitmap column
CREATE TABLE IF NOT EXISTS `test`.`hive_bitmap_table`(
    `k1`   int     COMMENT '',
    `k2`   String  COMMENT '',
    `k3`   String  COMMENT '',
    `uuid` binary  COMMENT 'bitmap'
) stored as textfile
```
```sql
-- Doris: create a Hive Catalog pointing to the Hive Metastore
CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
);
```
```sql
-- Doris: create an aggregate-model table with a BITMAP column
CREATE TABLE IF NOT EXISTS `test`.`doris_bitmap_table`(
    `k1`   int                  COMMENT '',
    `k2`   String               COMMENT '',
    `k3`   String               COMMENT '',
    `uuid` BITMAP BITMAP_UNION  COMMENT 'bitmap'
)
AGGREGATE KEY(k1, k2, k3)
DISTRIBUTED BY HASH(`k1`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);
```
```sql
-- Doris: read the base64-encoded bitmaps from Hive through the Hive Catalog and load them into Doris
insert into doris_bitmap_table
select
    k1,
    k2,
    k3,
    bitmap_from_base64(uuid)
from hive.test.hive_bitmap_table;
```
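After the insert finishes, the load can be checked from Doris with its built-in bitmap aggregate functions; a sketch, assuming the tables created above:

```sql
-- Verify the import: merge the bitmaps per k1 and return their cardinality
select k1, bitmap_union_count(uuid) from test.doris_bitmap_table group by k1;
```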
For details, see: Spark Load -> Basic operation -> Create load (Example 3: when the upstream data source is a Hive binary type table).