Debugging the Druid Image and Integration Tests

The integration setup has as a primary goal the ability to quickly debug the Druid image and any individual tests. A first step is to move the image build into a separate project. A second step is to ensure each test can run in JUnit in an IDE against a cluster you start by hand.

This section discusses how to use the various debugging features.

See:

Docker Configuration for information on debugging docker builds.

General Debug Process

Ease of debugging is a key goal of the revised structure.

Rebuild the Docker image only when the Druid code changes. Do a normal distribution build, then build a new image.
Reuse the same image over and over if you only change tests (such as when adding a new test.)
Reuse the same shared directory when the test does not make permanent changes.
Change Druid configuration by changing the Docker compose files, no need to rebuild the image.
Work primarily in the IDE when debugging tests.
To add more logging, change the log4j2.xml file in the shared directory to increase the logging level.
Remote debug Druid services if needed.

Exploring the Test Cluster

When run in Docker Compose, the endpoints known to Druid nodes differ from those needed by a client sitting outside the cluster. We could provide an explicit mapping. Better is to use the Router to proxy requests. Fortunately, the Druid Console already does this.

Docker Build Output

Modern Docker seems to hide the output of commands, which is a hassle to debug a build. Oddly, the details appear for a failed build, but not for success. Use the followig to see at least some output:

export DOCKER_BUILDKIT=0

Once the base container is built, you can run it, log in and poke around. First identify the name. See the last line of the container build:

Successfully tagged org.apache.druid/test:<version>

Or ask Docker:

docker images

Debug the Docker Image

You can log into the Docker image and poke around to see what's what:

docker run --rm -it --entrypoint bash org.apache.druid/test:<version>

Quite a few environment variables are provided by Docker and the setup scripts to see them, within the container, use:

env

Debug an Integration Test

To debug an integration test, you need a Docker image with the latest Druid. To get that, you need a full Druid build. So, we break the debugging process down into steps that depend on the state of your code. Assume DRUID_DEV points to your Druid development area.

On Each Druid Build

If you need to rebuild Druid (because you fixed something), do:

Do a distribution build of Druid:
Build the test image.

See quickstart for the commands.

Start the Test Cluster

Pick a test “group” to use.
Start a test cluster configured for this test.
Run a test from the command line:

Again, see quickstart for the commands.

Debug the Test

To run from your IDE, find the test to run and run it as a JUnit test (with the cluster up.)

Depending on the test, you may be able to run the test over and over against the same cluster. (In fact, you should try to design your tests so that this is true: clean up after each run.)

The tests are just plain old JUnit tests that happen to reach out to the test cluster and/or Docker to do their work. You can set breakpoints and debug in the usual way.

Each test will first verify that the cluster is fully up before it starts, so you can launch your debug session immediately after starting the cluster: the tests will wait as needed.

Stop the Test Cluster

When done, stop the cluster: quickstart again for details.

Typical Issues

For the most part, you can stop and restart the Druid services as often as you like and Druid will just work. There are a few combinations that can lead to trouble, however.

Services won‘t start: When doing a new build, stop the existing cluster before doing the build. The build removes and rebuilds the shared directory: services can’t survive that.
Metastore failure: The metastore container will recreate the DB on each restart. This will fail if your shared directory already contains a DB. Do a rm -r target/<category>/db before restarting the DB container.
Coordinator fails with DB errors. The Coordinator will create the Druid tables when it starts. This means the DB has to be created. If the DB is removed after the Coordinator starts (to fix the above issue, say) then you have to restart the Coordinator so it can create the needed tables.