Hive2/Hive3 Docker Compose templates and bootstrap scripts used by Doris thirdparty regression tests.
Chinese version: README_ZH.md
Hive startup is structured in three independent layers.
All services run with network_mode: host, so ports are bound directly on the host.
| Service | Role | Hive3 Port | Hive2 Port |
|---|---|---|---|
| `hive-server` | HiveServer2 (SQL/JDBC entry) | 13000 | 10000 |
| `hive-metastore` | Hive Metastore (HMS) | 9383 | 9083 |
| `hive-metastore-postgresql` | Metastore backend DB | 5732 | 5432 |
| `namenode` | HDFS NameNode | 8320 | 8020 |
| `datanode` | HDFS DataNode | n/a | n/a |
Container names are prefixed by CONTAINER_UID (set in custom_settings.env). Example: CONTAINER_UID=doris-jack- → container name doris-jack-hive3-server.
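
The port table and the naming rule above can be captured in a small lookup helper. This is a minimal sketch, not something shipped in the repository: `hive_port` and `hive_server_container` are hypothetical names, and the container-name pattern is inferred only from the `doris-jack-hive3-server` example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of the repository.

# Print the host port for a service, using the values from the service table.
hive_port() {
    local version="$1" service="$2"
    case "${version}:${service}" in
        hive3:hive-server)               echo 13000 ;;
        hive2:hive-server)               echo 10000 ;;
        hive3:hive-metastore)            echo 9383 ;;
        hive2:hive-metastore)            echo 9083 ;;
        hive3:hive-metastore-postgresql) echo 5732 ;;
        hive2:hive-metastore-postgresql) echo 5432 ;;
        hive3:namenode)                  echo 8320 ;;
        hive2:namenode)                  echo 8020 ;;
        *) echo "no host port listed for ${version}:${service}" >&2; return 1 ;;
    esac
}

# Build the HiveServer2 container name from CONTAINER_UID and the Hive version.
# Assumption: the pattern is <CONTAINER_UID><hive_version>-server.
hive_server_container() {
    local uid="$1" version="$2"
    echo "${uid}${version}-server"
}
```

For example, `hive_server_container "doris-jack-" hive3` prints `doris-jack-hive3-server`, and `hive_port hive3 hive-server` prints `13000`.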
Modules are selected with `--hive-modules`. Each module maps to a directory under scripts/data/ or a dedicated script set. Modules are refreshed incrementally: only modules whose content SHA changed are re-executed.
| Module | Source path | Content |
|---|---|---|
| `default` | `scripts/data/default/` | Basic external tables in the default database |
| `multi_catalog` | `scripts/data/multi_catalog/` | Multi-format, multi-path external table cases |
| `partition_type` | `scripts/data/partition_type/` | Partition type coverage (int, string, date, …) |
| `statistics` | `scripts/data/statistics/` | Table stats and empty-table stats cases |
| `tvf` | `scripts/data/tvf/` | TVF test data (HDFS upload) |
| `regression` | `scripts/data/regression/` | Special regression datasets (serde, delimiters, …) |
| `test` | `scripts/data/test/` | Lightweight smoke-test datasets |
| `preinstalled_hql` | `scripts/create_preinstalled_scripts/*.hql` | ~77 HQL files, executed in parallel via `xargs -P` |
| `view` | `scripts/create_view_scripts/create_view.hql` | View definitions |
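
The incremental refresh idea (re-run a module only when its content SHA changed) can be sketched as follows. This README does not describe how the real tracker computes or stores its hashes, so everything here is an assumption: the function names are hypothetical, the digest is a SHA-256 over the sorted per-file SHA-256 list, and the state file holds a single digest.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of SHA-based change detection; the real tracker may differ.

# Print one digest for a directory: hash every file in a stable order,
# then hash the resulting "digest  path" listing.
module_sha() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum
    ) | sha256sum | cut -d' ' -f1
}

# Succeed (exit 0) when the module has no recorded digest or the digest differs.
module_changed() {
    local dir="$1" state_file="$2" current
    current="$(module_sha "$dir")"
    [ ! -f "$state_file" ] || [ "$(cat "$state_file")" != "$current" ]
}

# Record the current digest after a successful refresh.
record_module_sha() {
    local dir="$1" state_file="$2"
    module_sha "$dir" > "$state_file"
}
```

A refresh loop built on this would call `module_changed` first, run the module only when it succeeds, and call `record_module_sha` afterwards so that an unchanged module is skipped on the next run.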
The startup scripts automatically choose the right file set for each Hive version:
- `bootstrap/hive2_only.*.list`
- `bootstrap/hive3_only.*.list`

This selection is an internal implementation detail; developers normally do not need to configure it manually.
Hive state (HDFS data, Postgres metastore, and the module SHA tracker) lives in four Docker named volumes per version, not host bind mounts. The shared volume prefix is fixed to doris-shared.
| Volume | Mounted into |
|---|---|
| `doris-shared-<hive_version>-namenode` | NameNode metadata |
| `doris-shared-<hive_version>-datanode` | DataNode blocks |
| `doris-shared-<hive_version>-pgdata` | Hive Metastore Postgres data |
| `doris-shared-<hive_version>-state` | `/mnt/state`: per-module SHA files used for incremental refresh |
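
As a small illustration of the naming scheme in the table, the four volume names for a version can be generated rather than typed. `hive_volume_names` is a hypothetical helper; the names it prints follow the `doris-shared-<hive_version>-<suffix>` pattern stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the four named volumes for one Hive version.
hive_volume_names() {
    local hive_version="$1" suffix
    for suffix in namenode datanode pgdata state; do
        echo "doris-shared-${hive_version}-${suffix}"
    done
}
```

The output can be fed to other commands, for instance `hive_volume_names hive3 | xargs -n 1 docker volume inspect` to check that all four volumes exist.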
Lifecycle:
- `--hive-mode fast`: volumes are preserved across runs.
- `--hive-mode refresh`: volumes are reset, then restored from the published baseline tarball before module refresh.
- `--hive-mode rebuild`: volumes are removed (`docker volume rm`) and recreated empty.

The script primes volumes from a pre-built baseline tarball in two cases:

- `--hive-mode refresh`: always reset the volumes, then restore the published baseline before reconciling changed modules.
- `--hive-mode fast`: restore the baseline only when the volumes are empty (fresh CI host, or after manual cleanup).

Baseline restore flow:

1. Look for a cached tarball at `${HIVE_BASELINE_TARBALL_CACHE:-docker/thirdparties/docker-compose/hive/scripts/baseline}/<hive_version>-baseline-<version>.tar.gz`.
2. If it is not cached, download it from `https://${s3BucketName}.${s3Endpoint}/regression/datalake/pipeline_data/hive_baseline/<hive_version>-baseline-<version>.tar.gz`.
3. Look for an extracted copy at `${HIVE_BASELINE_TARBALL_CACHE:-docker/thirdparties/docker-compose/hive/scripts/baseline}/<hive_version>-baseline-<version>/`; if missing, extract the tarball there once.
4. Populate the volumes from the extracted copy using an `alpine` `tar` container.

Changing `HIVE_BASELINE_VERSION` changes both the cache filename and the auto-constructed OSS URL, so CI hosts fetch the newly published tarball instead of reusing an older cached artifact.

Relevant env vars:
| Variable | Default | Purpose |
|---|---|---|
| `HIVE_BASELINE_TARBALL_CACHE` | `docker/thirdparties/docker-compose/hive/scripts/baseline` in `custom_settings.env` | Local cache dir for downloaded tarballs and extracted baseline directories; cache names include `HIVE_BASELINE_VERSION` |
| `HIVE_BASELINE_VERSION` | `20260415` in `custom_settings.env` | Baseline publication key: embedded in the cache filename and the auto-constructed OSS tarball URL |
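
To make the relationship between these variables concrete, the cache path and download URL described above can be derived like this. It is a sketch under stated assumptions: the function names are hypothetical, and it expects `HIVE_BASELINE_VERSION`, `s3BucketName` and `s3Endpoint` to already be set in the shell (for example by sourcing `custom_settings.env`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the path and URL conventions described above.

# <hive_version>-baseline-<version>.tar.gz
baseline_tarball_name() {
    local hive_version="$1"
    echo "${hive_version}-baseline-${HIVE_BASELINE_VERSION}.tar.gz"
}

# Local cache location, honouring HIVE_BASELINE_TARBALL_CACHE when it is set.
baseline_cache_path() {
    local hive_version="$1"
    local cache_dir="${HIVE_BASELINE_TARBALL_CACHE:-docker/thirdparties/docker-compose/hive/scripts/baseline}"
    echo "${cache_dir}/$(baseline_tarball_name "$hive_version")"
}

# Download URL built from the bucket and endpoint settings.
baseline_url() {
    local hive_version="$1"
    echo "https://${s3BucketName}.${s3Endpoint}/regression/datalake/pipeline_data/hive_baseline/$(baseline_tarball_name "$hive_version")"
}
```

Because both functions share `baseline_tarball_name`, bumping `HIVE_BASELINE_VERSION` moves the cache file and the URL together, which is the behaviour the restore flow relies on.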
After bootstrapping a clean Hive stack, stop the containers and run:

    sudo docker compose -p "${CONTAINER_UID}hive3" \
        -f docker/thirdparties/docker-compose/hive/hive-3x.yaml down
    bash docker/thirdparties/docker-compose/hive/scripts/snapshot-hive-baseline.sh \
        "${CONTAINER_UID}hive3" /tmp/hive3-baseline.tar.gz

Then upload the resulting tarball to OSS at oss://<s3BucketName>/regression/datalake/pipeline_data/hive_baseline/hive3-baseline-<version>.tar.gz (same convention for hive2). To publish a new baseline, update HIVE_BASELINE_VERSION once in docker/thirdparties/custom_settings.env, produce the new tarballs, and upload them with the matching versioned filenames.
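
Before uploading, it can be worth confirming that the snapshot is a readable gzip tarball. The check below is an optional extra step, not part of the documented workflow, and `verify_baseline_tarball` is a hypothetical name; it only uses standard `tar` options.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upload sanity check for a baseline tarball.
verify_baseline_tarball() {
    local tarball="$1"
    # The file must exist and be non-empty.
    if [ ! -s "$tarball" ]; then
        echo "missing or empty: ${tarball}" >&2
        return 1
    fi
    # Listing the archive fails if it is truncated or not gzip-compressed tar.
    if ! tar -tzf "$tarball" >/dev/null 2>&1; then
        echo "not a readable gzip tarball: ${tarball}" >&2
        return 1
    fi
    echo "ok: ${tarball}"
}
```

Running `verify_baseline_tarball /tmp/hive3-baseline.tar.gz` after the snapshot step catches an interrupted export before it is published under a versioned filename.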

    # Start Hive3 (default: refresh mode)
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3

    # Start Hive2
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive2

    # Start both
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive2,hive3

    # Stop Hive3
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --stop

Startup modes are selected with `--hive-mode`:

| Mode | Behavior | When to use |
|---|---|---|
| `fast` | Reuse existing volumes, skip compose up if the stack is already healthy, and skip data refresh entirely | Machine reboot / Docker restart recovery when you want the previous Hive environment back as quickly as possible |
| `refresh` | Reset volumes to the published baseline, then re-run only modules/HQL files whose SHA changed (default) | Daily development and PR verification when case scripts or HQL changed and you want a clean baseline before reconciling your changes |
| `rebuild` | Tear down stack, wipe all volumes, and rebuild everything from scratch without baseline restore | Full local bootstrap from current scripts, typically before exporting or validating a new baseline tarball |
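
The rule for when a baseline restore happens in each mode can be written as a small decision function. This is a sketch of the documented behaviour, not the script's actual code: `needs_baseline_restore` is a hypothetical name, and the second argument (`yes` or `no`) stands in for however the real script detects empty volumes.

```shell
#!/usr/bin/env bash
# Hypothetical decision helper: exit 0 when the baseline should be restored.
#   $1 = mode (fast | refresh | rebuild)
#   $2 = "yes" if the volumes are currently empty, otherwise "no"
needs_baseline_restore() {
    local mode="$1" volumes_empty="$2"
    case "$mode" in
        refresh) return 0 ;;                    # always reset and restore
        fast)    [ "$volumes_empty" = "yes" ] ;; # restore only into empty volumes
        rebuild) return 1 ;;                    # build from scratch, no baseline
        *) echo "unknown mode: ${mode}" >&2; return 2 ;;
    esac
}
```

For example, `needs_baseline_restore fast no` fails, which matches the table: a healthy existing environment is reused untouched in `fast` mode.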

    # Fast: reuse the existing volumes and restore the previous docker environment quickly
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode fast

    # Refresh: reset to baseline and reconcile changed HQL/scripts (default)
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh

    # Rebuild: clean slate from local scripts, typically before exporting a new baseline
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode rebuild

Use `--hive-modules` to refresh only the modules you care about:

    # Re-run only changed preinstalled HQL files (parallel execution)
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 \
        --hive-mode refresh --hive-modules preinstalled_hql

    # Refresh two specific modules
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 \
        --hive-mode refresh --hive-modules default,multi_catalog

    # All modules (explicit)
    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 \
        --hive-mode refresh --hive-modules all

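
A wrapper script that passes a module list through may want to reject typos before starting Docker. The check below is a hypothetical sketch: `validate_hive_modules` is not part of the repository, and the accepted names are simply the modules listed in this README plus `all`; the real script may accept or reject values differently.

```shell
#!/usr/bin/env bash
# Hypothetical validation of a --hive-modules value (comma-separated, or "all").
KNOWN_HIVE_MODULES="default multi_catalog partition_type statistics tvf regression test preinstalled_hql view"

validate_hive_modules() {
    local spec="$1" module
    [ -n "$spec" ] || { echo "empty module list" >&2; return 1; }
    [ "$spec" = "all" ] && return 0
    local IFS=','
    # Unquoted on purpose: split the list on commas via IFS.
    for module in $spec; do
        case " ${KNOWN_HIVE_MODULES} " in
            *" ${module} "*) ;;
            *) echo "unknown module: ${module}" >&2; return 1 ;;
        esac
    done
    return 0
}
```

`validate_hive_modules default,multi_catalog` succeeds, while a misspelled name such as `multicatalog` is reported on stderr and returns a non-zero status.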
Each refresh ends with a summary line showing what was actually re-executed, for example:

    [hive-refresh] summary refreshed_modules=2 modules=multi_catalog,preinstalled_hql
    [hive-refresh] summary details=multi_catalog:run_sh=74;preinstalled_hql:files=3(create_preinstalled_scripts/run40.hql,create_preinstalled_scripts/run69.hql,create_preinstalled_scripts/run76.hql)

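
The summary line is easy to pick out of a log when a CI step needs to know which modules were refreshed. The filter below is a minimal sketch that assumes the line format shown in the example above; `refreshed_modules` is a hypothetical name and reads the log on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical filter: print the comma-separated module list from the
# "[hive-refresh] summary refreshed_modules=N modules=..." line.
refreshed_modules() {
    sed -n 's/.*\[hive-refresh\] summary refreshed_modules=[0-9][0-9]* modules=\(.*\)$/\1/p'
}
```

For example, `refreshed_modules < docker/thirdparties/logs/start_hive3.log` would print the module list if the startup log contains the summary line; lines that do not match, including the `details=` line, are ignored.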
- Use `fast` when the machine or Docker daemon restarted and you only need the previous Hive containers and data back without any refresh work.
- Use `refresh` for normal development. This is the safe default when you changed Hive case data, `run.sh`, or HQL files and want those changes applied on top of a clean published baseline.
- Use `rebuild` when you intentionally want to ignore the published baseline and bootstrap everything from the current repository state, usually before generating a new baseline tarball.

Typical commands:

- After changing preinstalled HQL files: `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql`
- After changing data under `scripts/data/multi_catalog`: `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules multi_catalog`
- After a machine or Docker restart: `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode fast`
- Before generating a new baseline tarball: `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode rebuild`

To add test data, there are two patterns depending on where the data should live.
`run.sh` (HDFS data + DDL)

Use this when the test data files need to be uploaded to HDFS.
Create a directory under the appropriate module:

    scripts/data/<module>/<your_dataset>/
    ├── run.sh          # required: executed during module refresh
    └── <data files>    # csv, parquet, orc, etc.

Write run.sh to be idempotent (safe to run multiple times):

    #!/bin/bash
    set -x

    CUR_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

    # Upload data only if not already present
    hadoop fs -mkdir -p /user/doris/preinstalled_data/your_dataset
    if [[ -z "$(hadoop fs -ls /user/doris/preinstalled_data/your_dataset 2>/dev/null)" ]]; then
        hadoop fs -put "${CUR_DIR}"/data/* /user/doris/preinstalled_data/your_dataset/
    fi

    # Create table (drop-then-create for idempotency)
    hive -e "
    DROP TABLE IF EXISTS your_table;
    CREATE EXTERNAL TABLE your_table (...)
    STORED AS PARQUET
    LOCATION '/user/doris/preinstalled_data/your_dataset';
    "

If the dataset is Hive2-only or Hive3-only, add the run.sh path to the corresponding list:

    bootstrap/hive2_only.run_sh.list
    bootstrap/hive3_only.run_sh.list

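
How a version-only list might be consulted can be sketched as below. The list file format is not specified in this README, so this assumes one `run.sh` path per line, matched exactly; `runs_on_version` is a hypothetical helper, not the selection logic the startup scripts actually use.

```shell
#!/usr/bin/env bash
# Hypothetical check: should this run.sh be executed for the given Hive version?
#   $1 = run.sh path as it would appear in the list file
#   $2 = hive2 | hive3
#   $3 = directory containing the *_only.run_sh.list files
runs_on_version() {
    local path="$1" version="$2" bootstrap_dir="$3" other
    case "$version" in
        hive2) other=hive3 ;;
        hive3) other=hive2 ;;
        *) echo "unknown version: ${version}" >&2; return 2 ;;
    esac
    local list="${bootstrap_dir}/${other}_only.run_sh.list"
    # Skip the script when the other version claims it exclusively.
    if [ -f "$list" ] && grep -Fxq -- "$path" "$list"; then
        return 1
    fi
    return 0
}
```

A dataset listed in `hive3_only.run_sh.list` is then skipped for `hive2` and still runs for `hive3`; unlisted datasets run for both.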
`create_preinstalled_scripts/` (HQL only)

Use this when no HDFS file upload is needed (external table pointing to pre-existing HDFS data, or managed table with inline INSERT values).
Add a new file scripts/create_preinstalled_scripts/runNN.hql:

    use default;

    DROP TABLE IF EXISTS `your_new_table`;
    CREATE EXTERNAL TABLE `your_new_table` (
        id INT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION '/user/doris/preinstalled_data/existing_path';

Rules:
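- Always `DROP TABLE IF EXISTS` before `CREATE`; never use `CREATE IF NOT EXISTS` alone.
- Give the file an unused `runNN` number.
- If the file is Hive2-only or Hive3-only, add it to `bootstrap/hive2_only.preinstalled_hql.list` or `bootstrap/hive3_only.preinstalled_hql.list`.
- TPC-H HQL files are tracked in `bootstrap/tpch.preinstalled_hql.list`.

Trigger a refresh to pick it up: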

    ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 \
        --hive-mode refresh --hive-modules preinstalled_hql

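
Choosing an unused `runNN` number for a new HQL file can be automated. The helper below is a hypothetical sketch: it assumes files are named `run<digits>.hql` with at least two digits (as in `run02.hql` and `run76.hql` elsewhere in this README) and simply returns the highest existing number plus one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the next free runNN.hql filename in a directory.
next_run_hql() {
    local dir="$1" max=0 f n
    for f in "$dir"/run*.hql; do
        [ -e "$f" ] || continue          # no match: the glob stays literal
        n="${f##*/}"                     # strip the directory part
        n="${n#run}"
        n="${n%.hql}"
        case "$n" in
            ''|*[!0-9]*) continue ;;     # ignore names that are not run<digits>.hql
        esac
        n="$(printf '%s' "$n" | sed 's/^0*//')"   # drop leading zeros (avoid octal)
        [ -n "$n" ] || n=0
        if [ "$n" -gt "$max" ]; then
            max="$n"
        fi
    done
    printf 'run%02d.hql\n' "$((max + 1))"
}
```

Calling `next_run_hql scripts/create_preinstalled_scripts` in a checkout prints the filename to create; gaps in the numbering are not reused, which keeps the choice predictable.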
All containers use network_mode: host, so ports are directly accessible on the host.

    # Enter the hive-server container
    docker exec -it ${CONTAINER_UID}hive3-server bash

    # Connect via beeline (the hive shim on PATH routes here automatically)
    beeline -u "jdbc:hive2://localhost:13000/default" -n root

    # Or use the hive shim shorthand
    hive -e "show databases;"
    hive -e "show tables in default;"
    hive -f /path/to/your.hql


    # Requires beeline on PATH locally; use the host IP
    beeline -u "jdbc:hive2://127.0.0.1:13000/default" -n root


    # Execute a single query
    docker exec ${CONTAINER_UID}hive3-server \
        beeline -u "jdbc:hive2://localhost:13000/default" -n root \
        -e "SELECT * FROM default.your_table LIMIT 10;"

    # Run an HQL file (file must exist inside the container or on a mounted path)
    docker exec ${CONTAINER_UID}hive3-server \
        hive -f /mnt/scripts/create_preinstalled_scripts/run02.hql


    # List top-level HDFS directories
    docker exec ${CONTAINER_UID}hadoop3-namenode \
        hadoop fs -ls /user/doris/

    # Check if a specific path exists
    docker exec ${CONTAINER_UID}hadoop3-namenode \
        hadoop fs -ls /user/doris/preinstalled_data/your_dataset/


    # Connect to the metastore DB directly (port 5732 for Hive3)
    psql -h 127.0.0.1 -p 5732 -U postgres -d metastore \
        -c "SELECT TBL_NAME, DB_ID FROM TBLS LIMIT 20;"

| Log file | Content |
|---|---|
| `docker/thirdparties/logs/start_hive3.log` | Full Hive3 startup output |
| `docker/thirdparties/logs/start_hive2.log` | Full Hive2 startup output |
Enable verbose xtrace for detailed script execution:

    HIVE_DEBUG=1 ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh

Startup timing is printed at the end of each phase:

    [14:02:31] [hive3] compose up done took=18s
    [14:02:49] [hive3] init-hive-baseline begin
    [14:03:11] [hive3] init-hive-baseline done took=22s
    [14:03:11] [hive3] refresh-hive-modules begin (mode=refresh modules=all)
    [14:05:44] [hive3] refresh-hive-modules done took=153s

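
The per-phase `took=` values can be summed to get a single startup figure, which is handy when comparing modes. This is a minimal sketch that assumes the `took=<seconds>s` format shown above, with at most one such token per line; `total_startup_seconds` is a hypothetical name and reads the log on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up every "took=<N>s" value found on stdin.
total_startup_seconds() {
    awk 'match($0, /took=[0-9]+s/) {
             # Skip the 5 characters of "took=" and drop the trailing "s".
             total += substr($0, RSTART + 5, RLENGTH - 6)
         }
         END { print total + 0 }'
}
```

Fed the five sample lines above, it prints `193` (18 + 22 + 153); lines without a `took=` token, such as the `begin` markers, contribute nothing.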
Metastore health check fails

- Check that `${CONTAINER_UID}hive3-metastore-postgresql` is healthy: `docker ps`
- Check the startup log: `tail -100 docker/thirdparties/logs/start_hive3.log`

HiveServer2 not reachable

- Check that the container is running: `docker ps | grep hive3-server`
- Check that the port is open: `nc -z 127.0.0.1 13000`
- Check the HiveServer2 log: `docker exec ${CONTAINER_UID}hive3-server tail -50 /tmp/hive-server2.log`

JuiceFS format/init fails

- Check that `JFS_CLUSTER_META` is reachable (default: `mysql://root:123456@(127.0.0.1:3316)/juicefs_meta`)
- Override it if needed: `export JFS_CLUSTER_META=<your_uri>`

Refresh is unexpectedly slow

- Limit the refresh to the modules you changed, for example `--hive-modules preinstalled_hql`

State is stale after a hard container kill

- Use `--hive-mode rebuild` to reset cleanly

Baseline download is slow or fails

- Download the tarball manually from `https://${s3BucketName}.${s3Endpoint}/regression/datalake/pipeline_data/hive_baseline/` and place it at `${HIVE_BASELINE_TARBALL_CACHE:-docker/thirdparties/docker-compose/hive/scripts/baseline}/<hive_version>-baseline-<version>.tar.gz` to skip the download
- Check that `s3BucketName` and `s3Endpoint` are set correctly in `docker/thirdparties/custom_settings.env`

Inspect or delete volumes manually

    # List the four volumes for a version (names use the fixed doris-shared prefix)
    docker volume ls | grep "doris-shared-hive3-"

    # Remove all four (equivalent to --hive-mode rebuild's reset step)
    for s in namenode datanode pgdata state; do
        docker volume rm -f "doris-shared-hive3-${s}"
    done