..
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
..
.. warning:: The documentation is not up-to-date and has moved to `Apache Pinot Docs <https://docs.pinot.apache.org/>`_.
.. _pluggable-storage:
Pluggable Storage
=================
Pinot enables its users to write a PinotFS abstraction layer and store realtime and offline segments in the storage backend of their choice.
Some examples of storage backends (other than local storage) currently supported are:
* `HadoopFS <https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html>`_
* `Azure Data Lake <https://azure.microsoft.com/en-us/solutions/data-lake/>`_
If neither of these filesystems meets your needs, you can extend the current `PinotFS <https://github.com/apache/incubator-pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/filesystem/PinotFS.java>`_ to add support for a storage backend of your own.
New Storage Type implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to add a new type of storage backend (say, Amazon S3), implement the following class:
S3FS extends `PinotFS <https://github.com/apache/incubator-pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/filesystem/PinotFS.java>`_
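Below is a minimal, illustrative sketch of what such an implementation could look like. The class name S3PinotFS and the method bodies are hypothetical, only a handful of methods are shown, and the exact set of abstract methods and their signatures should be taken from the PinotFS class linked above.

.. code-block:: java

   import java.io.File;
   import java.net.URI;
   import org.apache.commons.configuration.Configuration;
   import org.apache.pinot.filesystem.PinotFS;

   // Illustrative skeleton only. It is declared abstract because a complete
   // implementation must also override the remaining PinotFS methods
   // (copy, move, listFiles, length, and so on).
   public abstract class S3PinotFS extends PinotFS {

     @Override
     public void init(Configuration config) {
       // Pinot passes the storage.factory.* properties in here; read
       // credentials/endpoints and create the S3 client.
     }

     @Override
     public boolean mkdir(URI uri) {
       // Object stores have no real directories; a marker object or a
       // simple no-op is a common approach.
       return true;
     }

     @Override
     public boolean delete(URI segmentUri, boolean forceDelete) {
       // Delete the object at segmentUri (recursively if forceDelete is set).
       return true;
     }

     @Override
     public boolean exists(URI fileUri) {
       // Check whether an object exists at fileUri (e.g. via a HEAD request).
       return false;
     }

     @Override
     public void copyToLocalFile(URI srcUri, File dstFile) {
       // Download the remote segment to dstFile; used when fetching segments.
     }

     @Override
     public void copyFromLocalFile(File srcFile, URI dstUri) {
       // Upload srcFile to dstUri; used when pushing segments.
     }
   }

Once the class is on the Pinot classpath, it is wired in through the same storage.factory configs shown in the sections below (for example, pinot.controller.storage.factory.class.s3 for an s3:// URI scheme).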
Configurations for Realtime Tables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The example here uses the existing org.apache.pinot.filesystem.HadoopPinotFS to store realtime segments in an HDFS filesystem. In the Pinot controller config, add the following new configs:

.. code-block:: none

   "controller.data.dir": "SET_TO_YOUR_HDFS_ROOT_DIR"
   "controller.local.temp.dir": "SET_TO_A_LOCAL_FILESYSTEM_DIR"
   "pinot.controller.storage.factory.class.hdfs": "org.apache.pinot.filesystem.HadoopPinotFS"
   "pinot.controller.storage.factory.hdfs.hadoop.conf.path": "SET_TO_YOUR_HDFS_CONFIG_DIR"
   "pinot.controller.storage.factory.hdfs.hadoop.kerberos.principle": "SET_IF_YOU_USE_KERBEROS"
   "pinot.controller.storage.factory.hdfs.hadoop.kerberos.keytab": "SET_IF_YOU_USE_KERBEROS"
   "controller.enable.split.commit": "true"
In the Pinot server config, add the following new config:

.. code-block:: none

   "pinot.server.instance.enable.split.commit": "true"
Note: there is currently a bug in the controller (see `this issue <https://github.com/apache/incubator-pinot/issues/3847>`_). For now, you can cherry-pick `PR 3849 <https://github.com/apache/incubator-pinot/pull/3849>`_, which has already been tested and is under review, to fix it.
Configurations for Offline Tables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
These properties for the filesystem implementation are to be set in your controller and server configurations.
In your controller and server configs, set pinot.controller.storage.factory.class.${YOUR_URI_SCHEME} (and the corresponding pinot.server property) to the fully qualified class name of the FS class you would like to support.
You also need to configure pinot.controller.local.temp.dir to point to a local directory on the controller machine.
Filesystem-specific configs can be passed in with either the pinot.controller prefix or the pinot.server prefix.
All the following configs additionally need to be prefixed with storage.factory.
AzurePinotFS requires the following configs according to your environment:
adl.accountId, adl.authEndpoint, adl.clientId, adl.clientSecret
Sample Controller Config

.. code-block:: none

   "pinot.controller.storage.factory.class.adl": "org.apache.pinot.filesystem.AzurePinotFS"
   "pinot.controller.storage.factory.adl.accountId": "xxxx"
   "pinot.controller.storage.factory.adl.authEndpoint": "xxxx"
   "pinot.controller.storage.factory.adl.clientId": "xxxx"
   "pinot.controller.storage.factory.adl.clientSecret": "xxxx"
   "pinot.controller.segment.fetcher.protocols": "adl"
Sample Server Config

.. code-block:: none

   "pinot.server.storage.factory.class.adl": "org.apache.pinot.filesystem.AzurePinotFS"
   "pinot.server.storage.factory.adl.accountId": "xxxx"
   "pinot.server.storage.factory.adl.authEndpoint": "xxxx"
   "pinot.server.storage.factory.adl.clientId": "xxxx"
   "pinot.server.storage.factory.adl.clientSecret": "xxxx"
   "pinot.server.segment.fetcher.protocols": "adl"
You can find these parameters for your account as described in `this StackOverflow post <https://stackoverflow.com/questions/56349040/what-is-clientid-authtokenendpoint-clientkey-for-accessing-azure-data-lake>`_.
Please also make sure to set the following config with the value "adl":

.. code-block:: none

   "segment.fetcher.protocols" : "adl"
To see how to upload segments to different storage systems, check
:file:`../segment_fetcher.rst`.
HadoopPinotFS requires the following configs according to your environment:
hadoop.kerberos.principle, hadoop.kerberos.keytab, hadoop.conf.path
Please also make sure to set the following config with the value "hdfs":

.. code-block:: none

   "segment.fetcher.protocols" : "hdfs"