| <!--- |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. See accompanying LICENSE file. |
| --> |
| |
| Apache Hadoop ${project.version} |
| ================================ |
| |
| Apache Hadoop ${project.version} is an update to the Hadoop 3.3.x release branch. |
| |
| Overview of Changes |
| =================== |
| |
| Users are encouraged to read the full set of release notes. |
| This page provides an overview of the major changes. |
| |
| Azure ABFS: Critical Stream Prefetch Fix |
| ---------------------------------------- |
| |
| The ABFS connector has a critical bug fix:
| [HADOOP-18546](https://issues.apache.org/jira/browse/HADOOP-18546). |
| *ABFS. Disable purging list of in-progress reads in abfs stream close().* |
| |
| All users of the ABFS connector in Hadoop releases 3.3.2+ MUST either upgrade
| or disable prefetching by setting `fs.azure.readaheadqueue.depth` to `0`.
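|
| For clusters that cannot upgrade immediately, the mitigation can be applied in
| `core-site.xml`; the property name is exactly as above, only the XML wrapping
| is added here:
|
| ```xml
| <property>
|   <name>fs.azure.readaheadqueue.depth</name>
|   <value>0</value>
| </property>
| ```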
| |
| Consult the parent JIRA [HADOOP-18521](https://issues.apache.org/jira/browse/HADOOP-18521) |
| *ABFS ReadBufferManager buffer sharing across concurrent HTTP requests* |
| for root cause analysis, details on what is affected, and mitigations. |
| |
| |
| Vectored IO API |
| --------------- |
| |
| [HADOOP-18103](https://issues.apache.org/jira/browse/HADOOP-18103). |
| *High performance vectored read API in Hadoop* |
| |
| The `PositionedReadable` interface has now added an operation for |
| Vectored IO (also known as Scatter/Gather IO): |
| |
| ```java |
| void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate) |
| ``` |
| |
| All the requested ranges will be retrieved into the supplied byte buffers, possibly
| asynchronously, possibly in parallel, with results potentially arriving out of order.
| |
| 1. The default implementation falls back to a series of `readFully()` calls,
| so delivers performance equivalent to the existing API.
| 2. The local filesystem uses Java native IO calls for higher performance reads than `readFully()`.
| 3. The S3A filesystem issues parallel HTTP GET requests in different threads. |
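|
| As an illustration, a client might issue a vectored read as in the sketch below
| (this assumes Hadoop 3.3.5+ on the classpath; the `s3a://` path and the range
| offsets are illustrative only):
|
| ```java
| import java.nio.ByteBuffer;
| import java.util.Arrays;
| import java.util.List;
|
| import org.apache.hadoop.conf.Configuration;
| import org.apache.hadoop.fs.FSDataInputStream;
| import org.apache.hadoop.fs.FileRange;
| import org.apache.hadoop.fs.FileSystem;
| import org.apache.hadoop.fs.Path;
|
| public class VectoredReadExample {
|   public static void main(String[] args) throws Exception {
|     Path path = new Path("s3a://bucket/data.orc");  // illustrative path
|     FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
|     try (FSDataInputStream in = fs.open(path)) {
|       // Two non-contiguous ranges; S3A may fetch them in parallel.
|       List<FileRange> ranges = Arrays.asList(
|           FileRange.createFileRange(0, 1024),
|           FileRange.createFileRange(1 << 20, 4096));
|       in.readVectored(ranges, ByteBuffer::allocate);
|       // Results complete asynchronously; block on each range's future.
|       for (FileRange range : ranges) {
|         ByteBuffer data = range.getData().get();
|         // ... process the buffer
|       }
|     }
|   }
| }
| ```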
| |
| Benchmarking of enhanced Apache ORC and Apache Parquet clients through `file://` and `s3a://`
| shows significant improvements in query performance.
| |
| Further Reading: |
| * [FsDataInputStream](./hadoop-project-dist/hadoop-common/filesystem/fsdatainputstream.html). |
| * [Hadoop Vectored IO: Your Data Just Got Faster!](https://apachecon.com/acasia2022/sessions/bigdata-1148.html) |
| Apachecon 2022 talk. |
| |
| MapReduce: Manifest Committer for Azure ABFS and Google GCS
| ---------------------------------------------------------- |
| |
| The new _Intermediate Manifest Committer_ uses a manifest file |
| to commit the work of successful task attempts, rather than |
| renaming directories. |
| Job commit is a matter of reading all the manifests, creating the
| destination directories (parallelized) and renaming the files, |
| again in parallel. |
| |
| This is both fast and correct on Azure Storage and Google GCS, |
| and should be used there instead of the classic v1/v2 file |
| output committers. |
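|
| A minimal sketch of how the committer might be bound to `abfs://` and `gs://`
| destinations in job or cluster configuration; the factory class names below
| are those documented for this release, so consult the manifest committer
| documentation before relying on them:
|
| ```xml
| <property>
|   <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
|   <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
| </property>
| <property>
|   <name>mapreduce.outputcommitter.factory.scheme.gs</name>
|   <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value>
| </property>
| ```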
| |
| It is also safe to use on HDFS, where it should be faster |
| than the v1 committer. It is however optimized for
| cloud storage, where list and rename operations are significantly
| slower; on HDFS the benefits will be smaller.
| |
| More details are available in the |
| [manifest committer](./hadoop-mapreduce-client/hadoop-mapreduce-client-core/manifest_committer.html)
| documentation.
| |
| |
| HDFS: Dynamic Datanode Reconfiguration |
| -------------------------------------- |
| |
| HDFS-16400, HDFS-16399, HDFS-16396, HDFS-16397, HDFS-16413, HDFS-16457. |
| |
| A number of Datanode configuration options can be changed without having to restart
| the datanode. This makes it possible to tune deployment configurations without
| cluster-wide datanode restarts.
| |
| See [DataNode.java](https://github.com/apache/hadoop/blob/branch-3.3.5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L346-L361) |
| for the list of dynamically reconfigurable attributes. |
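|
| As a sketch, a reconfiguration is pushed with `hdfs dfsadmin` against the
| datanode's IPC address (the hostname below is illustrative; 9867 is the
| default datanode IPC port):
|
| ```bash
| # Ask the datanode to reload changed properties from hdfs-site.xml.
| hdfs dfsadmin -reconfig datanode dn1.example.com:9867 start
|
| # Poll until the reconfiguration task has completed.
| hdfs dfsadmin -reconfig datanode dn1.example.com:9867 status
|
| # List the properties this datanode supports reconfiguring.
| hdfs dfsadmin -reconfig datanode dn1.example.com:9867 properties
| ```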
| |
| |
| Transitive CVE fixes |
| -------------------- |
| |
| A lot of dependencies have been upgraded to address recent CVEs. |
| Many of the CVEs were not actually exploitable through Hadoop
| itself, so much of this work is simply due diligence.
| However, applications which have these libraries on their classpath may
| be vulnerable, and the upgrades should also reduce the number of false
| positives reported by security scanners.
| |
| We have not been able to upgrade every dependency to its latest
| version; some of those upgrades are fundamentally incompatible.
| If you have concerns about the state of a specific library, consult the Apache JIRA |
| issue tracker to see if an issue has been filed, discussions have taken place about |
| the library in question, and whether or not there is already a fix in the pipeline. |
| *Please don't file new JIRAs about dependency-X.Y.Z having a CVE without
| searching for any existing issue first.*
| |
| As an open-source project, contributions in this area are always welcome, |
| especially in testing the active branches, testing applications downstream of
| those branches, and verifying whether updated dependencies trigger regressions.
| |
| |
| Security Advisory |
| ================= |
| |
| Hadoop HDFS is a distributed filesystem allowing remote |
| callers to read and write data. |
| |
| Hadoop YARN is a distributed job submission/execution |
| engine allowing remote callers to submit arbitrary |
| work into the cluster. |
| |
| Unless a Hadoop cluster is deployed with |
| [caller authentication with Kerberos](./hadoop-project-dist/hadoop-common/SecureMode.html), |
| anyone with network access to the servers has unrestricted access to the data |
| and the ability to run whatever code they want in the system. |
| |
| In production, there are generally three deployment patterns which |
| can, with care, keep data and computing resources private. |
| 1. Physical cluster: *configure Hadoop security*, usually integrated with the
| enterprise Kerberos/Active Directory systems.
| Good. |
| 1. Cloud: transient or persistent single or multiple user/tenant cluster |
| with private VLAN *and security*. |
| Good. |
| Consider [Apache Knox](https://knox.apache.org/) for managing remote |
| access to the cluster. |
| 1. Cloud: transient single user/tenant cluster with private VLAN |
| *and no security at all*. |
| Requires careful network configuration as this is the sole
| means of securing the cluster.
| Consider [Apache Knox](https://knox.apache.org/) for managing |
| remote access to the cluster. |
| |
| *If you deploy a Hadoop cluster in-cloud without security, and without configuring a VLAN |
| to restrict access to trusted users, you are implicitly sharing your data and |
| computing resources with anyone who has network access.*
| |
| If you do deploy an insecure cluster this way then port scanners will inevitably |
| find it and submit crypto-mining jobs. If this happens to you, please do not report |
| this as a CVE or security issue: it is _utterly predictable_. Secure *your cluster* if |
| you want to remain exclusively *your cluster*. |
| |
| Finally, if you are using Hadoop as a service deployed/managed by someone else, |
| do determine what security their products offer and make sure it meets your requirements. |
| |
| |
| Getting Started |
| =============== |
| |
| The Hadoop documentation includes the information you need to get started using |
| Hadoop. Begin with the |
| [Single Node Setup](./hadoop-project-dist/hadoop-common/SingleCluster.html) |
| which shows you how to set up a single-node Hadoop installation. |
| Then move on to the |
| [Cluster Setup](./hadoop-project-dist/hadoop-common/ClusterSetup.html) |
| to learn how to set up a multi-node Hadoop installation. |
| |
| Before deploying Hadoop in production, read |
| [Hadoop in Secure Mode](./hadoop-project-dist/hadoop-common/SecureMode.html), |
| and follow its instructions to secure your cluster. |
| |
| |