This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.
Set CONTAINER_UID in custom_settings.env. It is the prefix for container names (doris-- in the examples in this document). For example:

    CONTAINER_UID="doris--bender--"

Port configuration (hudi.env.tpl):

- HIVE_METASTORE_PORT: Port for Hive Metastore Thrift service (default: 19083)
- MINIO_API_PORT: MinIO S3 API port (default: 19100)
- MINIO_CONSOLE_PORT: MinIO web console port (default: 19101)
- SPARK_UI_PORT: Spark web UI port (default: 18080)

MinIO configuration (hudi.env.tpl):

- MINIO_ROOT_USER: MinIO access key (default: minio)
- MINIO_ROOT_PASSWORD: MinIO secret key (default: minio123)
- HUDI_BUCKET: S3 bucket name for Hudi data (default: datalake)

⚠️ Important: Hadoop versions must match Spark's built-in Hadoop version.
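The version requirement above can be checked before starting the environment. The following is a minimal sketch, not part of the repository's scripts: the function names are hypothetical, and it assumes Spark's Hadoop version can be read from a jar file name of the form hadoop-client-api-<version>.jar in the Spark jars directory, which may differ in your image.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of run-thirdparties-docker.sh or init.sh.

# Extract the version from a jar file name such as
# hadoop-client-api-3.3.4.jar (assumed naming; adjust to your Spark image).
hadoop_version_from_jar() {
  local name
  name="$(basename "$1")"
  name="${name%.jar}"            # strip the .jar suffix
  printf '%s\n' "${name##*-}"    # keep the text after the last dash
}

# Return 0 when the configured hadoop-aws version equals Spark's Hadoop version.
check_hadoop_versions() {
  local spark_hadoop="$1"
  local hadoop_aws="$2"
  if [ "$spark_hadoop" = "$hadoop_aws" ]; then
    echo "OK: hadoop-aws $hadoop_aws matches Spark's Hadoop $spark_hadoop"
  else
    echo "MISMATCH: hadoop-aws $hadoop_aws vs Spark's Hadoop $spark_hadoop" >&2
    return 1
  fi
}
```

For example, `check_hadoop_versions "$(hadoop_version_from_jar hadoop-client-api-3.3.4.jar)" "$HADOOP_AWS_VERSION"` succeeds only when HADOOP_AWS_VERSION from hudi.env.tpl is 3.3.4.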
JAR dependencies (hudi.env.tpl): all JAR file versions and URLs are configurable:

- HUDI_BUNDLE_VERSION / HUDI_BUNDLE_URL: Hudi Spark bundle
- HADOOP_AWS_VERSION / HADOOP_AWS_URL: Hadoop S3A filesystem support
- AWS_SDK_BUNDLE_VERSION / AWS_SDK_BUNDLE_URL: AWS Java SDK Bundle v1 (required for Hadoop 3.3.4 S3A support, 1.12.x series)
- POSTGRESQL_JDBC_VERSION / POSTGRESQL_JDBC_URL: PostgreSQL JDBC driver

Note: hadoop-common is already included in Spark's built-in Hadoop distribution, so it is not configured here.

    # Start Hudi environment
    ./docker/thirdparties/run-thirdparties-docker.sh -c hudi

    # Stop Hudi environment
    ./docker/thirdparties/run-thirdparties-docker.sh -c hudi --stop
⚠️ Important: To ensure data consistency after Docker restarts, only use SQL scripts to add data. Data added through the spark-sql interactive shell is temporary and will not persist after a container restart.

Add new SQL files in the scripts/create_preinstalled_scripts/hudi/ directory:

- Follow the numbered naming of the existing files (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, etc.)
- Scripts can reference the variables ${HIVE_METASTORE_URIS} and ${HUDI_BUCKET}

Example: Create 08_create_custom_table.sql:
    USE regression_hudi;

    CREATE TABLE IF NOT EXISTS my_hudi_table (
      id BIGINT,
      name STRING,
      created_at TIMESTAMP
    ) USING hudi
    TBLPROPERTIES (
      type = 'cow',
      primaryKey = 'id',
      preCombineField = 'created_at',
      hoodie.datasource.hive_sync.enable = 'true',
      hoodie.datasource.hive_sync.metastore.uris = '${HIVE_METASTORE_URIS}',
      hoodie.datasource.hive_sync.mode = 'hms'
    )
    LOCATION 's3a://${HUDI_BUCKET}/warehouse/regression_hudi/my_hudi_table';

    INSERT INTO my_hudi_table VALUES
      (1, 'Alice', TIMESTAMP '2024-01-01 10:00:00'),
      (2, 'Bob', TIMESTAMP '2024-01-02 11:00:00');
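To preview what a script looks like once ${HIVE_METASTORE_URIS} and ${HUDI_BUCKET} are filled in, the placeholders can be expanded with sed. This is only a sketch of one possible substitution: how init.sh actually performs it is not shown here, render_sql is a hypothetical helper, and the metastore URI in the usage note is a made-up value.

```shell
#!/usr/bin/env bash
# render_sql <metastore_uris> <bucket>
# Reads SQL on stdin and writes it to stdout with the two placeholders
# replaced. Hypothetical helper, not the repository's init.sh logic.
# Limitation: values must not contain '|' or '&', which are special to sed here.
render_sql() {
  local uris="$1"
  local bucket="$2"
  # [$] matches a literal dollar sign; { and } are literal in basic regexes.
  sed -e "s|[$]{HIVE_METASTORE_URIS}|${uris}|g" \
      -e "s|[$]{HUDI_BUCKET}|${bucket}|g"
}
```

For example, `render_sql thrift://hive-metastore:9083 datalake < 08_create_custom_table.sql` prints the script with both variables expanded, which makes typos in placeholder names easy to spot before restarting the container.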
After adding SQL files, restart the container to execute them:
docker restart doris--hudi-spark
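After a restart it can be useful to block until initialization has finished. A minimal polling sketch follows; wait_for is a hypothetical helper, and the usage note assumes that the SUCCESS marker under /opt/hudi-scripts/ is recreated once the scripts have run.

```shell
#!/usr/bin/env bash
# wait_for <timeout_seconds> <command...>
# Re-runs the command once per second until it succeeds.
# Returns 0 on success, 1 if the timeout is reached first.
wait_for() {
  local timeout="$1"
  local waited=0
  shift
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

For example, `wait_for 300 docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS` waits up to roughly five minutes for the marker file to appear.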
After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:
    -- Create Hudi catalog
    CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
        'type' = 'hms',
        'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
        's3.endpoint' = 'http://<externalEnvIp>:19100',
        's3.access_key' = 'minio',
        's3.secret_key' = 'minio123',
        's3.region' = 'us-east-1',
        'use_path_style' = 'true'
    );

    -- Switch to Hudi catalog
    SWITCH hudi_catalog;

    -- Use database
    USE regression_hudi;

    -- Show tables
    SHOW TABLES;

    -- Query Hudi table
    SELECT * FROM user_activity_log_cow_partition LIMIT 10;
Configuration Parameters:
- hive.metastore.uris: Hive Metastore Thrift service address (default port: 19083)
- s3.endpoint: MinIO S3 API endpoint (default port: 19100)
- s3.access_key: MinIO access key (default: minio)
- s3.secret_key: MinIO secret key (default: minio123)
- s3.region: S3 region (default: us-east-1)
- use_path_style: Use path-style access for MinIO (required: true)

Replace <externalEnvIp> with your actual external environment IP address (e.g., 127.0.0.1 for localhost).
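When ports or credentials have been changed in hudi.env.tpl, the catalog statement can be generated instead of edited by hand. The sketch below is a hypothetical helper, not part of the repository. It assumes the same variable names are exported in the current shell and otherwise falls back to the defaults listed above.

```shell
#!/usr/bin/env bash
# print_hudi_catalog_sql [host]
# Prints a CREATE CATALOG statement for Doris using the given host
# (default 127.0.0.1) and the port/credential variables, if set.
print_hudi_catalog_sql() {
  local host="${1:-127.0.0.1}"
  local hms_port="${HIVE_METASTORE_PORT:-19083}"
  local s3_port="${MINIO_API_PORT:-19100}"
  local access_key="${MINIO_ROOT_USER:-minio}"
  local secret_key="${MINIO_ROOT_PASSWORD:-minio123}"
  cat <<EOF
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://${host}:${hms_port}',
    's3.endpoint' = 'http://${host}:${s3_port}',
    's3.access_key' = '${access_key}',
    's3.secret_key' = '${secret_key}',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);
EOF
}
```

Running `print_hudi_catalog_sql 127.0.0.1` prints a statement that can be pasted into a Doris SQL session.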
⚠️ Note: The methods below are for debugging purposes only. Data created through spark-sql interactive shell will not persist after Docker restart. To add persistent data, use SQL scripts as described in the “Adding Data” section.
docker exec -it doris--hudi-spark bash
    /opt/spark/bin/spark-sql \
      --master local[*] \
      --name hudi-debug \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.sql.catalogImplementation=hive \
      --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
      --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
      --conf spark.sql.warehouse.dir=s3a://datalake/warehouse
    -- Show databases
    SHOW DATABASES;

    -- Use database
    USE regression_hudi;

    -- Show tables
    SHOW TABLES;

    -- Describe table structure
    DESCRIBE EXTENDED user_activity_log_cow_partition;

    -- Query data
    SELECT * FROM user_activity_log_cow_partition LIMIT 10;

    -- Check Hudi table properties
    SHOW TBLPROPERTIES user_activity_log_cow_partition;

    -- View Spark configuration
    SET -v;

    -- Check Hudi-specific configurations
    SET hoodie.datasource.write.hive_style_partitioning;
Access Spark Web UI at: http://localhost:18080 (or configured SPARK_UI_PORT)
    # View Spark container logs
    docker logs doris--hudi-spark --tail 100 -f

    # View Hive Metastore logs
    docker logs doris--hudi-metastore --tail 100 -f

    # View MinIO logs
    docker logs doris--hudi-minio --tail 100 -f
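To collect recent logs from every container in one pass, the names can be generated from the prefix. This is a sketch with hypothetical helper names. It assumes that a custom CONTAINER_UID simply replaces the doris-- prefix in the container names used in this document, which is not confirmed here.

```shell
#!/usr/bin/env bash
# hudi_containers [prefix]
# Prints the container names used in this document, one per line.
# Assumption: names have the form <prefix>hudi-<service>.
hudi_containers() {
  local prefix="${1:-doris--}"
  local name
  for name in hudi-spark hudi-metastore hudi-metastore-db hudi-minio hudi-minio-mc; do
    printf '%s%s\n' "$prefix" "$name"
  done
}

# tail_hudi_logs [prefix] [lines]
# Prints the last <lines> log lines (default 100) of each container.
tail_hudi_logs() {
  local prefix="${1:-doris--}"
  local lines="${2:-100}"
  local c
  for c in $(hudi_containers "$prefix"); do
    echo "==== $c ===="
    docker logs --tail "$lines" "$c" 2>&1
  done
}
```

For example, `tail_hudi_logs doris-- 50 > hudi-logs.txt` saves a snapshot that can be attached to a bug report.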
    # Access MinIO console
    # URL: http://localhost:19101 (or configured MINIO_CONSOLE_PORT)
    # Username: minio (or MINIO_ROOT_USER)
    # Password: minio123 (or MINIO_ROOT_PASSWORD)

    # Or use MinIO client
    docker exec -it doris--hudi-minio-mc mc ls myminio/datalake/warehouse/regression_hudi/
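Troubleshooting commands:

    # Spark container logs
    docker logs doris--hudi-spark

    # Check that the initialization marker exists
    docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS

    # Check that the Hive Metastore container is running
    docker ps | grep metastore

    # List the downloaded JAR files
    docker exec doris--hudi-spark ls -lh /opt/hudi-cache/

Check hudi.env.tpl for correct version numbers.

    # Check that the MinIO containers are running
    docker ps | grep minio

Check the MinIO settings in hudi.env.tpl.

    # List MinIO buckets through the MinIO client
    docker exec doris--hudi-minio-mc mc ls myminio/

    # Check the Hive Metastore readiness message
    docker logs doris--hudi-metastore | grep "Metastore is ready"

    # Check that the metastore database container is running
    docker ps | grep metastore-db

    # Check that PostgreSQL accepts connections
    docker exec doris--hudi-metastore-db pg_isready -U hive

Directory layout:

    hudi/
    ├── hudi.yaml.tpl          # Docker Compose template
    ├── hudi.env.tpl           # Environment variables template
    ├── scripts/
    │   ├── init.sh            # Initialization script
    │   ├── create_preinstalled_scripts/
    │   │   └── hudi/          # SQL scripts (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, ...)
    │   └── SUCCESS            # Initialization marker (generated)
    └── cache/                 # Downloaded JAR files (generated)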
- Generated files (.yaml, .env, cache/, SUCCESS) are ignored by Git
- Variables are referenced using ${VARIABLE_NAME} syntax