 <!---
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. See accompanying LICENSE file.
 -->

 Apache Hadoop ${project.version}
 ================================

 Apache Hadoop ${project.version} is an update to the Hadoop 3.3.x release branch.

 Overview of Changes
 ===================

 Users are encouraged to read the full set of release notes.
 This page provides an overview of the major changes.

 Azure ABFS: Critical Stream Prefetch Fix
 ---------------------------------------------

 The ABFS connector has a critical bug fix:
 [HADOOP-18546](https://issues.apache.org/jira/browse/HADOOP-18546).
 *ABFS. Disable purging list of in-progress reads in abfs stream close().*

 All users of the abfs connector in Hadoop releases 3.3.2+ MUST either upgrade
 or disable prefetching by setting `fs.azure.readaheadqueue.depth` to `0`.
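
For deployments that cannot upgrade immediately, the workaround could be applied
as a property in `core-site.xml`. This is a minimal sketch, assuming the setting is
applied cluster-wide rather than per job or per filesystem instance:

```xml
<!-- Assumption: placed in core-site.xml; a depth of 0 disables ABFS prefetching. -->
<property>
  <name>fs.azure.readaheadqueue.depth</name>
  <value>0</value>
</property>
```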

 Consult the parent JIRA [HADOOP-18521](https://issues.apache.org/jira/browse/HADOOP-18521)
 *ABFS ReadBufferManager buffer sharing across concurrent HTTP requests*
 for root cause analysis, details on what is affected, and mitigations.


 Vectored IO API
 ---------------

 [HADOOP-18103](https://issues.apache.org/jira/browse/HADOOP-18103).
 *High performance vectored read API in Hadoop*

 The `PositionedReadable` interface now has an operation for
 Vectored IO (also known as Scatter/Gather IO):

 ```java
 void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate)
 ```

 All the requested ranges will be retrieved into the supplied byte buffers, possibly asynchronously,
 possibly in parallel, with results potentially arriving out of order.

 1. The default implementation uses a series of `readFully()` calls, so it delivers
    equivalent performance.
 2. The local filesystem uses Java native IO calls for higher performance reads than `readFully()`.
 3. The S3A filesystem issues parallel HTTP GET requests in different threads.

 Benchmarking of enhanced Apache ORC and Apache Parquet clients through `file://` and `s3a://`
 shows significant improvements in query performance.
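
The sequential fallback behaviour described above can be illustrated with plain JDK classes.
This is only a sketch under stated assumptions, not Hadoop's implementation: the `Range` class
and the `VectoredReadSketch.readVectored()` helper are hypothetical stand-ins, and a
`FileChannel` takes the place of a Hadoop input stream. Each range is filled by a blocking
positioned read into a buffer obtained from the caller's allocator, and the result is
published through a future so callers can consume ranges in any order.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

class VectoredReadSketch {

  /** Hypothetical stand-in for a file range: offset, length, and a future holding the data. */
  static final class Range {
    final long offset;
    final int length;
    final CompletableFuture<ByteBuffer> data = new CompletableFuture<>();

    Range(long offset, int length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Sequential fallback: one blocking positioned read per range, completing its future. */
  static void readVectored(FileChannel channel, List<Range> ranges,
      IntFunction<ByteBuffer> allocate) {
    for (Range range : ranges) {
      try {
        ByteBuffer buffer = allocate.apply(range.length);
        long position = range.offset;
        // Loop until the buffer is full, because a single read may return fewer bytes.
        while (buffer.hasRemaining()) {
          int read = channel.read(buffer, position);
          if (read < 0) {
            throw new EOFException("EOF before range at offset " + range.offset + " was filled");
          }
          position += read;
        }
        buffer.flip();
        range.data.complete(buffer);
      } catch (IOException e) {
        range.data.completeExceptionally(e);
      }
    }
  }

  /** Writes a small temporary file, reads two ranges from it, and returns them as strings. */
  static List<String> demo() {
    try {
      Path file = Files.createTempFile("vectored", ".bin");
      try {
        Files.write(file, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
        List<Range> ranges = List.of(new Range(2, 3), new Range(10, 4));
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
          readVectored(channel, ranges, ByteBuffer::allocate);
        }
        List<String> results = new ArrayList<>();
        for (Range range : ranges) {
          // join() waits for the range; in this sequential sketch it is already complete.
          results.add(StandardCharsets.US_ASCII.decode(range.data.join()).toString());
        }
        return results;
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Running `main` prints `[234, abcd]`. A parallel implementation would keep the same contract
but complete the futures from worker threads, which is why callers should not assume that
ranges complete in the order they were requested.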

 Further Reading: [FsDataInputStream](./hadoop-project-dist/hadoop-common/filesystem/fsdatainputstream.html).

 MapReduce: Manifest Committer for Azure ABFS and Google GCS
 ----------------------------------------------------------

 The new _Intermediate Manifest Committer_ uses a manifest file
 to commit the work of successful task attempts, rather than
 renaming directories.
 Job commit is a matter of reading all the manifests, creating the
 destination directories (parallelized) and renaming the files,
 again in parallel.

 This is both fast and correct on Azure Storage and Google GCS,
 and should be used there instead of the classic v1/v2 file
 output committers.

 It is also safe to use on HDFS, where it should be faster
 than the v1 committer. However, it is optimized for
 cloud storage, where list and rename operations are significantly
 slower, so the benefits on HDFS may be smaller.
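
The commit sequence can be sketched with JDK file operations. This is an illustration under
stated assumptions rather than the committer's actual code: `ManifestCommitSketch` and `Entry`
are hypothetical names, in-memory lists stand in for the manifest files that task attempts
would write, and the local filesystem stands in for cloud storage. Job commit gathers every
manifest entry, creates each distinct destination directory in parallel, then renames the
files in parallel, without renaming any directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ManifestCommitSketch {

  /** Hypothetical manifest entry: a file written by a task attempt and its final destination. */
  static final class Entry {
    final Path source;
    final Path destination;

    Entry(Path source, Path destination) {
      this.source = source;
      this.destination = destination;
    }
  }

  /** Job commit: gather all manifest entries, create directories, then rename files. */
  static void commitJob(List<List<Entry>> manifests) {
    List<Entry> entries = manifests.stream()
        .flatMap(List::stream)
        .collect(Collectors.toList());

    // Each distinct destination directory is created once, in parallel.
    Set<Path> directories = entries.stream()
        .map(entry -> entry.destination.getParent())
        .collect(Collectors.toSet());
    directories.parallelStream().forEach(directory -> {
      try {
        Files.createDirectories(directory);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });

    // Files are renamed individually, again in parallel.
    entries.parallelStream().forEach(entry -> {
      try {
        Files.move(entry.source, entry.destination);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  /** Simulates two task attempts, commits the job, and lists the committed file names. */
  static List<String> demo() {
    try {
      Path root = Files.createTempDirectory("manifest");
      Path partition = root.resolve("output").resolve("year=2023");
      List<List<Entry>> manifests = new ArrayList<>();
      for (int task = 0; task < 2; task++) {
        Path attemptDir = Files.createDirectories(root.resolve("attempt_" + task));
        Path file = Files.write(attemptDir.resolve("part-" + task), new byte[] {(byte) task});
        manifests.add(List.of(new Entry(file, partition.resolve("part-" + task))));
      }
      commitJob(manifests);
      try (Stream<Path> files = Files.list(partition)) {
        return files.map(path -> path.getFileName().toString())
            .sorted()
            .collect(Collectors.toList());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Running `main` prints `[part-0, part-1]`. Because only the files listed in manifests are
moved, output from task attempts that never produced a manifest is simply left behind.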

 More details are available in the
 [manifest committer](./hadoop-mapreduce-client/hadoop-mapreduce-client-core/manifest_committer.html)
 documentation.


 HDFS: Router Based Federation
 -----------------------------

 A lot of effort has been invested in stabilizing and improving the HDFS Router Based Federation feature.

 1. HDFS-13522, HDFS-16767 and related JIRAs: Allow Observer Reads in HDFS Router Based Federation.
 2. HDFS-13248: RBF supports Client Locality.

 HDFS: Dynamic Datanode Reconfiguration
 --------------------------------------

 HDFS-16400, HDFS-16399, HDFS-16396, HDFS-16397, HDFS-16413, HDFS-16457.

 A number of DataNode configuration options can be changed without having to restart
 the DataNode. This makes it possible to tune deployment configurations without
 cluster-wide DataNode restarts.

 See [DataNode.java](https://github.com/apache/hadoop/blob/branch-3.3.5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L346-L361)
 for the list of dynamically reconfigurable attributes.


 Transitive CVE fixes
 --------------------

 A lot of dependencies have been upgraded to address recent CVEs.
 Many of the CVEs were not actually exploitable through Hadoop,
 so much of this work is just due diligence.
 However, applications which have all the libraries on their classpath may
 be vulnerable, and the upgrades should also reduce the number of false
 positives that security scanners report.

 We have not been able to upgrade every single dependency to the latest
 version there is. Some of those changes are just going to be incompatible.
 If you have concerns about the state of a specific library, consult the Apache JIRA
 issue tracker to see whether a JIRA has been filed, whether discussions have taken place about
 the library in question, and whether or not there is already a fix in the pipeline.
 *Please don't file new JIRAs about dependency-X.Y.Z having a CVE without
 searching for any existing issue first.*

 As an open source project, contributions in this area are always welcome,
 especially in testing the active branches, testing applications downstream of
 those branches, and testing whether updated dependencies trigger regressions.

 Getting Started
 ===============

 The Hadoop documentation includes the information you need to get started using
 Hadoop. Begin with the
 [Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html)
 which shows you how to set up a single-node Hadoop installation.
 Then move on to the
 [Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
 to learn how to set up a multi-node Hadoop installation.