---
id: hdfs
title: "HDFS"
---

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements. See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership. The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License. You may obtain a copy of the License at
  ~
  ~ http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied. See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-hdfs-storage` in the extensions load list. If you use Google Cloud Storage as deep storage via HDFS, also run Druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` set in the environment.
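As a sketch, the load list entry in `common.runtime.properties` could look like the following (keep any other extensions you already load in the list):

```properties
# Load the HDFS deep storage extension alongside whatever else you already use
druid.extensions.loadList=["druid-hdfs-storage"]
```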

## Deep Storage

### Configuration for HDFS

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`||Directory for storing segments.|Must be set.|
|`druid.hadoop.security.kerberos.principal`|`druid@EXAMPLE.COM`|Principal user name.|empty|
|`druid.hadoop.security.kerberos.keytab`|`/etc/security/keytabs/druid.headlessUser.keytab`|Path to keytab file.|empty|
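For example, a minimal deep storage configuration in `common.runtime.properties` might look like the sketch below; the NameNode host, port, and segment path are placeholders to replace with your own HDFS location:

```properties
druid.storage.type=hdfs
# Placeholder NameNode address and path; point this at your own HDFS location
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments
```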

Besides the above settings, you also need to include all Hadoop configuration files (such as `core-site.xml` and `hdfs-site.xml`)
in the Druid classpath. One way to do this is to copy those files into `${DRUID_HOME}/conf/_common`.

If you are using Hadoop ingestion, set your output directory to a location on HDFS and it will work.
If you want to eagerly authenticate against a secured Hadoop/HDFS cluster, you must set `druid.hadoop.security.kerberos.principal` and `druid.hadoop.security.kerberos.keytab`. This is an alternative to the cron job method that runs the `kinit` command periodically.
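As a sketch, eager Kerberos authentication could be configured with the example principal and keytab path from the table above:

```properties
# Authenticate eagerly against a secured HDFS cluster instead of relying on a periodic kinit cron job
druid.hadoop.security.kerberos.principal=druid@EXAMPLE.COM
druid.hadoop.security.kerberos.keytab=/etc/security/keytabs/druid.headlessUser.keytab
```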

### Configuration for Cloud Storage

You can also use AWS S3 or Google Cloud Storage as deep storage via HDFS.

#### Configuration for AWS S3

To use AWS S3 as deep storage, you need to configure `druid.storage.storageDirectory` properly.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs| |Must be set.|
|`druid.storage.storageDirectory`|`s3a://bucket/example/directory` or `s3n://bucket/example/directory`|Path to the deep storage directory.|Must be set.|
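For example, a sketch of the corresponding properties, where the bucket name and prefix are placeholders:

```properties
druid.storage.type=hdfs
# Placeholder bucket and prefix; an s3n:// path works as well
druid.storage.storageDirectory=s3a://your-bucket/druid/segments
```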

You also need to include the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/), especially the `hadoop-aws.jar`, in the Druid classpath.
Run the command below to install the `hadoop-aws.jar` file under `${DRUID_HOME}/extensions/druid-hdfs-storage` on all nodes.

```bash
# Pull the hadoop-aws dependency that matches your Hadoop version
${DRUID_HOME}/bin/run-java -classpath "${DRUID_HOME}/lib/*" org.apache.druid.cli.Main tools pull-deps -h "org.apache.hadoop:hadoop-aws:${HADOOP_VERSION}";
# Copy the pulled jar into the druid-hdfs-storage extension directory
cp ${DRUID_HOME}/hadoop-dependencies/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar ${DRUID_HOME}/extensions/druid-hdfs-storage/
```

Finally, you need to add the following properties to `core-site.xml`.
For more configuration options, see the [Hadoop AWS module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/).

```xml
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>

<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID. Omit for IAM role-based or provider-based authentication.</description>
  <value>your access key</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key. Omit for IAM role-based or provider-based authentication.</description>
  <value>your secret key</value>
</property>
```

#### Configuration for Google Cloud Storage

To use Google Cloud Storage as deep storage, you need to configure `druid.storage.storageDirectory` properly.

|Property|Possible Values|Description|Default|
|--------|---------------|-----------|-------|
|`druid.storage.type`|hdfs||Must be set.|
|`druid.storage.storageDirectory`|`gs://bucket/example/directory`|Path to the deep storage directory.|Must be set.|
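Similarly, a sketch of the corresponding properties for GCS, with a placeholder bucket and prefix:

```properties
druid.storage.type=hdfs
# Placeholder bucket and prefix
druid.storage.storageDirectory=gs://your-bucket/druid/segments
```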

All services that need to access GCS must have the [GCS connector jar](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) on their classpath.
Please read the [install instructions](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
to properly set up the necessary libraries and configurations.
One option is to place this jar in `${DRUID_HOME}/lib/` and `${DRUID_HOME}/extensions/druid-hdfs-storage/`.

Finally, you need to configure the `core-site.xml` file with the filesystem
and authentication properties needed for GCS. You may want to copy the example
properties below. Please follow the instructions at
[https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md)
for more details.
For more configuration options, see the [GCS core default](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.0.0/gcs/conf/gcs-core-default.xml)
and the [GCS core template](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/src/test/resources/core-site.xml).

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>

<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for gs: uris.</description>
</property>

<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
  <description>
    Whether to use a service account for GCS authorization.
    Setting this property to `false` will disable use of service accounts for
    authentication.
  </description>
</property>

<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
  <description>
    The JSON key file of the service account used for GCS
    access when google.cloud.auth.service.account.enable is true.
  </description>
</property>
```

## Reading data from HDFS or Cloud Storage

### Native batch ingestion

The [HDFS input source](../../ingestion/input-sources.md#hdfs-input-source) is supported by the [Parallel task](../../ingestion/native-batch.md)
to read files directly from HDFS storage. You may be able to read objects from cloud storage
with the HDFS input source, but we highly recommend using a purpose-built
[input source](../../ingestion/input-sources.md) instead if possible, because
it is simpler to set up. For now, only the [S3 input source](../../ingestion/input-sources.md#s3-input-source)
and the [Google Cloud Storage input source](../../ingestion/input-sources.md#google-cloud-storage-input-source)
are supported for cloud storage types, so you may still want to use the HDFS input source
to read from cloud storage types other than those two.

### Hadoop-based ingestion

If you use [Hadoop ingestion](../../ingestion/hadoop.md), you can read data from HDFS
by specifying the paths in your [`inputSpec`](../../ingestion/hadoop.md#inputspec).
See the [Static](../../ingestion/hadoop.md#static) inputSpec for details.