<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Apache Hadoop ${project.version}
================================
Apache Hadoop ${project.version} incorporates a number of significant
enhancements over the previous major release line (hadoop-2.x).

This is an alpha release to facilitate testing and the collection of
feedback from downstream application developers and users. There are
no guarantees regarding API stability or quality.

Overview
========
Users are encouraged to read the full set of release notes.
This page provides an overview of the major changes.

Minimum required Java version increased from Java 7 to Java 8
------------------
All Hadoop JARs are now compiled targeting a runtime version of Java 8.
Users still using Java 7 or below must upgrade to Java 8.
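
Before upgrading, it can be worth confirming which JVM Hadoop will pick up. A
minimal check, assuming a standard tarball layout (the JDK path below is only
an example):

```
# Verify that a Java 8 (or later) runtime is available
java -version

# Point Hadoop at the desired JDK in etc/hadoop/hadoop-env.sh
# (the path is an example; adjust it for your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```
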
Support for erasure coding in HDFS
------------------
Erasure coding is a method for durably storing data with significant space
savings compared to replication. Standard encodings like Reed-Solomon (10,4)
have a 1.4x space overhead (4 parity blocks are stored for every 10 data
blocks, i.e. 14 blocks stored per 10 blocks of data), compared to the 3x
overhead of standard HDFS replication.

Since erasure coding imposes additional overhead during reconstruction
and performs mostly remote reads, it has traditionally been used for
storing colder, less frequently accessed data. Users should consider
the network and CPU overheads of erasure coding when deploying this
feature.
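
In current 3.x releases erasure coding policies are administered with the
`hdfs ec` command; the exact subcommands, flags, and built-in policy names can
differ between releases (and the directory below is only an example), so treat
the following as a sketch and consult the linked guide for the syntax shipped
with this version:

```
# List the erasure coding policies known to the cluster
hdfs ec -listPolicies

# Apply a Reed-Solomon (10,4) policy to a directory of cold data
hdfs ec -setPolicy -path /data/cold -policy RS-10-4-1024k

# Confirm which policy is in effect on the directory
hdfs ec -getPolicy -path /data/cold
```
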
More details are available in the
[HDFS Erasure Coding](./hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html)
documentation.

YARN Timeline Service v.2
-------------------
We are introducing an early preview (alpha 1) of a major revision of YARN
Timeline Service: v.2. YARN Timeline Service v.2 addresses two major
challenges: improving scalability and reliability of Timeline Service, and
enhancing usability by introducing flows and aggregation.

YARN Timeline Service v.2 alpha 1 is provided so that users and developers
can test it and provide feedback and suggestions for making it a ready
replacement for Timeline Service v.1.x. It should be used only in a test
capacity. Most importantly, security is not enabled; if security is a
critical requirement, do not set up or use Timeline Service v.2 until
security is implemented.

More details are available in the
[YARN Timeline Service v.2](./hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html)
documentation.

Shell script rewrite
-------------------
The Hadoop shell scripts have been rewritten to fix many long-standing
bugs and include some new features. While an eye has been kept towards
compatibility, some changes may break existing installations.
Incompatible changes are documented in the release notes, with related
discussion on [HADOOP-9902](https://issues.apache.org/jira/browse/HADOOP-9902).
More details are available in the
[Unix Shell Guide](./hadoop-project-dist/hadoop-common/UnixShellGuide.html)
documentation. Power users will also be pleased by the
[Unix Shell API](./hadoop-project-dist/hadoop-common/UnixShellAPI.html)
documentation, which describes much of the new functionality, particularly
related to extensibility.
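
As one small illustration of the new extension points, the rewritten scripts
source a per-user `~/.hadooprc` file in which the documented shell functions,
such as `hadoop_add_classpath`, can be called; the jar path below is a
placeholder:

```
# ~/.hadooprc -- per-user hook sourced by the rewritten shell scripts
# Add a site-specific jar to the classpath (path is a placeholder)
hadoop_add_classpath /opt/custom/lib/extra-tools.jar
```
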
MapReduce task-level native optimization
--------------------
MapReduce has added support for a native implementation of the map output
collector. For shuffle-intensive jobs, this can lead to a performance
improvement of 30% or more.
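
The native collector is enabled per job. As a rough sketch (the property and
collector class below are those added by MAPREDUCE-2841, and the jar, job,
and paths are placeholders; verify the exact names against its release
notes), it is selected through the map output collector property:

```
# Run an example job with the native map output collector enabled
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
  /input /output
```
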
See the release notes for
[MAPREDUCE-2841](https://issues.apache.org/jira/browse/MAPREDUCE-2841)
for more detail.

Support for more than 2 NameNodes.
--------------------
The initial implementation of HDFS NameNode high-availability provided
for a single active NameNode and a single Standby NameNode. By replicating
edits to a quorum of three JournalNodes, this architecture is able to
tolerate the failure of any one node in the system.

However, some deployments require higher degrees of fault-tolerance. This new
feature addresses that need by allowing users to run multiple standby
NameNodes. For instance, by configuring three NameNodes and five
JournalNodes, the cluster is able to tolerate the failure of two nodes rather
than just one.
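
Configuration is a straightforward extension of the existing HA setup: the
nameservice simply lists more than two NameNode IDs, each with its own
addresses. A minimal `hdfs-site.xml` sketch (the nameservice, NameNode IDs,
and hostnames are placeholders):

```
<!-- Three NameNodes in one nameservice -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<!-- RPC address for the additional standby (repeat for nn1 and nn2) -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn3</name>
  <value>nn3.example.com:8020</value>
</property>
```
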
The [HDFS high-availability documentation](./hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html)
has been updated with instructions on how to configure more than two
NameNodes.

Default ports of multiple services have been changed.
------------------------
Previously, the default ports of multiple Hadoop services were in the
Linux ephemeral port range (32768-61000). This meant that at startup,
services would sometimes fail to bind to the port due to a conflict
with another application.
These conflicting ports have been moved out of the ephemeral range,
affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our
documentation has been updated appropriately, but see the release
notes for [HDFS-9427](https://issues.apache.org/jira/browse/HDFS-9427) and
[HADOOP-12811](https://issues.apache.org/jira/browse/HADOOP-12811)
for a list of port changes.
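
Rather than memorizing the new defaults, the effective addresses and ports on
a given installation can be read back with the `hdfs getconf` utility; the
configuration keys below are standard HDFS keys:

```
# Address and port of the NameNode web UI
hdfs getconf -confKey dfs.namenode.http-address

# Address and port used by DataNodes for data transfer
hdfs getconf -confKey dfs.datanode.address
```
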
Support for Microsoft Azure Data Lake filesystem connector
---------------------
Hadoop now supports integration with Microsoft Azure Data Lake as
an alternative Hadoop-compatible filesystem.
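
Once the connector has been configured with the appropriate Azure credentials
(see the hadoop-azure-datalake module documentation), Data Lake paths can be
used like any other Hadoop filesystem URI; the account name below is a
placeholder:

```
# List and copy data using the adl:// filesystem scheme
hadoop fs -ls adl://myaccount.azuredatalakestore.net/
hadoop fs -put localfile.txt adl://myaccount.azuredatalakestore.net/staging/
```
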
Intra-datanode balancer
-------------------
A single DataNode manages multiple disks. During normal write operation,
disks will be filled up evenly. However, adding or replacing disks can
lead to significant skew within a DataNode. This situation is not handled
by the existing HDFS balancer, which concerns itself with inter-, not intra-,
DN skew.

The new intra-DataNode balancing functionality handles this situation; it is
invoked via the `hdfs diskbalancer` CLI.
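
A typical workflow is to generate a plan for a DataNode and then execute it;
the hostname and plan file path below are placeholders, and the actual plan
location is printed by the `-plan` step:

```
# Generate a redistribution plan for a specific DataNode
hdfs diskbalancer -plan dn1.example.com

# Execute the generated plan on that DataNode
hdfs diskbalancer -execute /system/diskbalancer/dn1.example.com.plan.json

# Check the progress of the data moves
hdfs diskbalancer -query dn1.example.com
```
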
See the disk balancer section in the
[HDFS Commands Guide](./hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
for more information.

Reworked daemon and task heap management
---------------------
A series of changes has been made to heap management for Hadoop daemons
as well as MapReduce tasks.
[HADOOP-10950](https://issues.apache.org/jira/browse/HADOOP-10950) introduces
new methods for configuring daemon heap sizes.
Notably, auto-tuning is now possible based on the memory size of the host,
and the `HADOOP_HEAPSIZE` variable has been deprecated.
See the full release notes of HADOOP-10950 for more detail.
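
For example, the variables introduced by this change accept units, and leaving
them unset lets the JVM size the heap from the host's memory; the values below
are illustrative and belong in `etc/hadoop/hadoop-env.sh`:

```
# etc/hadoop/hadoop-env.sh (values are examples)
# Cap the daemon heap at 4 GB and start it at 1 GB; units are supported.
export HADOOP_HEAPSIZE_MAX=4g
export HADOOP_HEAPSIZE_MIN=1g
# Leaving both unset lets the JVM auto-tune based on host memory.
```
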
[MAPREDUCE-5785](https://issues.apache.org/jira/browse/MAPREDUCE-5785)
simplifies the configuration of map and reduce task
heap sizes, so the desired heap size no longer needs to be specified
in both the task configuration and as a Java option.
Existing configs that already specify both are not affected by this change.
See the full release notes of MAPREDUCE-5785 for more details.
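
In practice this means a job can set only the container memory sizes and let
the task heap be derived from them (the derivation ratio is controlled by
`mapreduce.job.heap.memory-mb.ratio` in releases that include this change);
the job, paths, and sizes below are placeholders:

```
# Set only the container memory; the task -Xmx is derived automatically
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=4096 \
  /input /output
```
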
Getting Started
===============
The Hadoop documentation includes the information you need to get started using
Hadoop. Begin with the
[Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html)
which shows you how to set up a single-node Hadoop installation.
Then move on to the
[Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html)
to learn how to set up a multi-node Hadoop installation.