<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2021-06-15
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Tencent COS Support &#x2013; Integration of Tencent COS in Hadoop</title>
<style type="text/css" media="all">
@import url("../css/maven-base.css");
@import url("../css/maven-theme.css");
@import url("../css/site.css");
</style>
<link rel="stylesheet" href="../css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20210615" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2021-06-15
&nbsp;| Version: 3.3.1
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Integration of Tencent COS in Hadoop</h1>
<div class="section">
<h2><a name="Introduction"></a>Introduction</h2>
<p><a class="externalLink" href="https://intl.cloud.tencent.com/product/cos">Tencent COS</a> is an object storage system provided by Tencent Corp. Hadoop-COS is a client that lets computing systems built on HDFS use COS as their underlying storage system. The big data processing systems identified for support include Hadoop MapReduce, Spark, and Alluxio. In addition, Druid can use COS as its deep storage by configuring the HDFS-Load-Plugin to integrate with Hadoop-COS.</p></div>
<div class="section">
<h2><a name="Features"></a>Features</h2>
<ul>
<li>
<p>Supports Hadoop MapReduce and Spark writing data to COS and reading from it directly.</p>
</li>
<li>
<p>Implements the Hadoop file system interfaces and provides the same pseudo-hierarchical directory structure as HDFS.</p>
</li>
<li>
<p>Supports multipart uploads for large files. A single file can be up to 19TB.</p>
</li>
<li>
<p>High performance and high availability. The performance difference between Hadoop-COS and HDFS is no more than 30%.</p>
</li>
</ul>
<blockquote>
<p>Notes:</p>
<p>Object storage is not a file system, and it has some limitations:</p>
<ol style="list-style-type: decimal">
<li>
<p>Object storage is a key-value store and does not natively support hierarchical directories. Usually, a directory separator in the object key is used to simulate a hierarchical directory, such as &#x201c;/hadoop/data/words.dat&#x201d;.</p>
</li>
<li>
<p>COS does not currently support appending to an object. This means that you cannot append content to the end of an existing object (file).</p>
</li>
<li>
<p>Both <tt>delete</tt> and <tt>rename</tt> operations are non-atomic, which means that if an operation is interrupted, the result may be left in an inconsistent state.</p>
</li>
<li>
<p>Object storage has a different authorization model:</p>
<ul>
<li>
<p>Directory permissions are reported as 777.</p>
</li>
<li>
<p>File permissions are reported as 666.</p>
</li>
<li>
<p>File owner is reported as the local current user.</p>
</li>
<li>
<p>File group is also reported as the local current user.</p>
</li>
</ul>
</li>
<li>
<p>Multipart upload is supported for large files (up to 40TB), but the number of parts is limited to 10000.</p>
</li>
<li>
<p>The number of files listed in each request is limited to 1000.</p>
</li>
</ol>
</blockquote></div>
<div class="section">
<h2><a name="Quick_Start"></a>Quick Start</h2>
<div class="section">
<h3><a name="Concepts"></a>Concepts</h3>
<ul>
<li>
<p><b>Bucket</b>: A container for storing data in COS. Its name is made up of a user-defined bucket name and the user's appid.</p>
</li>
<li>
<p><b>Appid</b>: A unique resource identifier at the user level.</p>
</li>
<li>
<p><b>SecretId</b>: The ID used to authenticate the user.</p>
</li>
<li>
<p><b>SecretKey</b>: The key used to authenticate the user.</p>
</li>
<li>
<p><b>Region</b>: The region where a bucket is located.</p>
</li>
<li>
<p><b>CosN</b>: Hadoop-COS uses <tt>cosn</tt> as its URI scheme, so CosN is often used to refer to Hadoop-COS.</p>
</li>
</ul></div>
<div class="section">
<h3><a name="Usage"></a>Usage</h3>
<div class="section">
<h4><a name="System_Requirements"></a>System Requirements</h4>
<p>Linux kernel 2.6+</p></div>
<div class="section">
<h4><a name="Dependencies"></a>Dependencies</h4>
<ul>
<li>cos_api (version 5.4.10 or later)</li>
<li>cos-java-sdk (version 2.0.6 recommended)</li>
<li>joda-time (version 2.9.9 recommended)</li>
<li>httpClient (version 4.5.1 or later recommended)</li>
<li>Jackson: jackson-core, jackson-databind, jackson-annotations (version 2.9.8 or later)</li>
<li>bcprov-jdk15on (version 1.59 recommended)</li>
</ul></div>
<div class="section">
<h4><a name="Configure_Properties"></a>Configure Properties</h4>
<div class="section">
<h5><a name="URI_and_Region_Properties"></a>URI and Region Properties</h5>
<p>If you plan to use COS as the default file system for Hadoop or other big data systems, you need to set <tt>fs.defaultFS</tt> to the URI of Hadoop-COS in core-site.xml. Hadoop-COS uses <tt>cosn</tt> as its URI scheme and the bucket as its URI host. You also need to explicitly set <tt>fs.cosn.bucket.region</tt> to indicate the region where your bucket is located.</p>
<p><b>NOTE</b>:</p>
<ul>
<li>
<p>For Hadoop-COS, <tt>fs.defaultFS</tt> is optional. If you only use COS temporarily as a data source for Hadoop, you do not need to set this property; just specify the full URI when you use it. For example: <tt>hadoop fs -ls cosn://testBucket-125236746/testDir/test.txt</tt>.</p>
</li>
<li>
<p><tt>fs.cosn.bucket.region</tt> is a required property for Hadoop-COS, because Hadoop-COS must know the region of the bucket in use in order to construct the correct URL to access it.</p>
</li>
<li>
<p>COS supports multi-region storage, and different regions have different access domains by default. It is recommended to choose the storage region nearest to your business scenario in order to improve object upload and download speed. The available regions are listed at <a class="externalLink" href="https://intl.cloud.tencent.com/document/product/436/6224">https://intl.cloud.tencent.com/document/product/436/6224</a>.</p>
</li>
</ul>
<p>The following is an example of the configuration format:</p>
<div>
<div>
<pre class="source">    &lt;property&gt;
        &lt;name&gt;fs.defaultFS&lt;/name&gt;
        &lt;value&gt;cosn://&lt;bucket-appid&gt;&lt;/value&gt;
        &lt;description&gt;
            Optional: If you don't want to use CosN as the default file system, you don't need to configure it.
        &lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.bucket.region&lt;/name&gt;
        &lt;value&gt;ap-xxx&lt;/value&gt;
        &lt;description&gt;The region where the bucket is located&lt;/description&gt;
    &lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h5><a name="User_Authentication_Properties"></a>User Authentication Properties</h5>
<p>Each user needs to configure credentials (the user&#x2019;s secretId and secretKey) to access objects stored in COS. These credentials can be obtained from the official console provided by Tencent Cloud.</p>
<div>
<div>
<pre class="source">    &lt;property&gt;
        &lt;name&gt;fs.cosn.credentials.provider&lt;/name&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.auth.SimpleCredentialsProvider&lt;/value&gt;
        &lt;description&gt;
            This option allows the user to specify how to get the credentials.
            Comma-separated class names of credential provider classes which implement
            com.qcloud.cos.auth.COSCredentialsProvider:

            1. org.apache.hadoop.fs.cosn.auth.SimpleCredentialsProvider: Obtain the secret id and secret key from fs.cosn.userinfo.secretId and fs.cosn.userinfo.secretKey in core-site.xml
            2. org.apache.hadoop.fs.cosn.auth.EnvironmentVariableCredentialsProvider: Obtain the secret id and secret key from system environment variables named COSN_SECRET_ID and COSN_SECRET_KEY

            If unspecified, the default order of credential providers is:
            1. org.apache.hadoop.fs.cosn.auth.SimpleCredentialsProvider
            2. org.apache.hadoop.fs.cosn.auth.EnvironmentVariableCredentialsProvider
        &lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.userinfo.secretId&lt;/name&gt;
        &lt;value&gt;xxxxxxxxxxxxxxxxxxxxxxxxx&lt;/value&gt;
        &lt;description&gt;Tencent Cloud Secret Id&lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.userinfo.secretKey&lt;/name&gt;
        &lt;value&gt;xxxxxxxxxxxxxxxxxxxxxxxx&lt;/value&gt;
        &lt;description&gt;Tencent Cloud Secret Key&lt;/description&gt;
    &lt;/property&gt;
</pre></div></div>
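<p>As a minimal sketch of the second provider described above, the following fragment selects only the environment-variable provider, so that no secret needs to be stored in core-site.xml. It assumes that <tt>COSN_SECRET_ID</tt> and <tt>COSN_SECRET_KEY</tt> are exported in the environment of every process that accesses COS, and that the provider class lives in the <tt>org.apache.hadoop.fs.cosn.auth</tt> package; verify the class name against the Hadoop-COS jar you deploy.</p>
<div>
<div>
<pre class="source">    &lt;property&gt;
        &lt;name&gt;fs.cosn.credentials.provider&lt;/name&gt;
        &lt;!-- Assumption: credentials come only from the environment variables COSN_SECRET_ID and COSN_SECRET_KEY --&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.auth.EnvironmentVariableCredentialsProvider&lt;/value&gt;
    &lt;/property&gt;
</pre></div></div>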
</div>
<div class="section">
<h5><a name="Integration_Properties"></a>Integration Properties</h5>
<p>For Hadoop to integrate COS as its underlying file system, you must explicitly set two options: <tt>fs.cosn.impl</tt> and <tt>fs.AbstractFileSystem.cosn.impl</tt>.</p>
<p><tt>fs.cosn.impl</tt> must be set to <tt>org.apache.hadoop.fs.cosn.CosNFileSystem</tt>, and <tt>fs.AbstractFileSystem.cosn.impl</tt> must be set to <tt>org.apache.hadoop.fs.cosn.CosN</tt>.</p>
<div>
<div>
<pre class="source">    &lt;property&gt;
        &lt;name&gt;fs.cosn.impl&lt;/name&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.CosNFileSystem&lt;/value&gt;
        &lt;description&gt;The implementation class of the CosN Filesystem&lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.AbstractFileSystem.cosn.impl&lt;/name&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.CosN&lt;/value&gt;
        &lt;description&gt;The implementation class of the CosN AbstractFileSystem.&lt;/description&gt;
    &lt;/property&gt;
</pre></div></div>
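<p>Putting the required properties from the sections above together, a minimal core-site.xml might look like the following sketch. The region <tt>ap-guangzhou</tt> and the masked credential values are placeholders chosen for illustration, not values from a real deployment. The implementation classes are assumed to live in the <tt>org.apache.hadoop.fs.cosn</tt> package. <tt>fs.defaultFS</tt> is left out because it is optional.</p>
<div>
<div>
<pre class="source">&lt;configuration&gt;
    &lt;!-- Region of the bucket (placeholder value) --&gt;
    &lt;property&gt;
        &lt;name&gt;fs.cosn.bucket.region&lt;/name&gt;
        &lt;value&gt;ap-guangzhou&lt;/value&gt;
    &lt;/property&gt;

    &lt;!-- Credentials read from core-site.xml (placeholder values) --&gt;
    &lt;property&gt;
        &lt;name&gt;fs.cosn.userinfo.secretId&lt;/name&gt;
        &lt;value&gt;xxxxxxxxxxxxxxxxxxxxxxxxx&lt;/value&gt;
    &lt;/property&gt;
    &lt;property&gt;
        &lt;name&gt;fs.cosn.userinfo.secretKey&lt;/name&gt;
        &lt;value&gt;xxxxxxxxxxxxxxxxxxxxxxxx&lt;/value&gt;
    &lt;/property&gt;

    &lt;!-- Bind the cosn URI scheme to Hadoop-COS --&gt;
    &lt;property&gt;
        &lt;name&gt;fs.cosn.impl&lt;/name&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.CosNFileSystem&lt;/value&gt;
    &lt;/property&gt;
    &lt;property&gt;
        &lt;name&gt;fs.AbstractFileSystem.cosn.impl&lt;/name&gt;
        &lt;value&gt;org.apache.hadoop.fs.cosn.CosN&lt;/value&gt;
    &lt;/property&gt;
&lt;/configuration&gt;
</pre></div></div>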
</div>
<div class="section">
<h5><a name="Other_Runtime_Properties"></a>Other Runtime Properties</h5>
<p>Hadoop-COS provides many runtime properties, and most of them do not need custom values because sensible defaults are provided.</p>
<p><b>It is important to note that</b>:</p>
<ul>
<li>
<p>Hadoop-COS generates some temporary files and consumes some disk space. All temporary files are placed in the directory specified by the <tt>fs.cosn.tmp.dir</tt> option (default: /tmp/hadoop_cos).</p>
</li>
<li>
<p>The default block size is 8MB, which means that you can only upload a single file of up to 78GB to COS. This is because multipart upload supports at most 10,000 blocks. To support larger single files, you must increase the block size accordingly by setting the <tt>fs.cosn.block.size</tt> property. For example, if the largest single file is 1TB, the block size must be at least (1 * 1024 * 1024 * 1024 * 1024)/10000 = 109951163 bytes. Currently, the maximum supported file size is 19TB (block size: 2147483648).</p>
</li>
</ul>
<div>
<div>
<pre class="source">    &lt;property&gt;
        &lt;name&gt;fs.cosn.tmp.dir&lt;/name&gt;
        &lt;value&gt;/tmp/hadoop_cos&lt;/value&gt;
        &lt;description&gt;Temporary files would be placed here.&lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.buffer.size&lt;/name&gt;
        &lt;value&gt;33554432&lt;/value&gt;
        &lt;description&gt;The total size of the buffer pool.&lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.block.size&lt;/name&gt;
        &lt;value&gt;8388608&lt;/value&gt;
        &lt;description&gt;
            Block size used by the cosn filesystem, which is the part size for MultipartUpload. Because COS supports up to 10000 blocks, users should estimate the maximum size of a single file. For example, an 8MB part size allows writing a single file of up to 78GB.
        &lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.maxRetries&lt;/name&gt;
        &lt;value&gt;3&lt;/value&gt;
        &lt;description&gt;
            The maximum number of retries for reading or writing files to COS, before throwing a failure to the application.
        &lt;/description&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.retry.interval.seconds&lt;/name&gt;
        &lt;value&gt;3&lt;/value&gt;
        &lt;description&gt;The number of seconds to sleep between each COS retry.&lt;/description&gt;
    &lt;/property&gt;
</pre></div></div>
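<p>To illustrate the block size calculation above, the following sketch raises the part size to 128MB (134217728 bytes), which exceeds the 109951163-byte minimum for a 1TB file; 10,000 parts of this size cover roughly 1.22TB. Because the buffer pool must be at least as large as one block, <tt>fs.cosn.buffer.size</tt> is raised to 256MB as well. Both values are illustrative assumptions, not recommended settings.</p>
<div>
<div>
<pre class="source">    &lt;!-- Assumed target: single files of up to about 1TB --&gt;
    &lt;property&gt;
        &lt;name&gt;fs.cosn.block.size&lt;/name&gt;
        &lt;!-- 128MB per part; 134217728 * 10000 parts is about 1.22TB --&gt;
        &lt;value&gt;134217728&lt;/value&gt;
    &lt;/property&gt;

    &lt;property&gt;
        &lt;name&gt;fs.cosn.buffer.size&lt;/name&gt;
        &lt;!-- 256MB; must be greater than or equal to fs.cosn.block.size --&gt;
        &lt;value&gt;268435456&lt;/value&gt;
    &lt;/property&gt;
</pre></div></div>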
</div>
<div class="section">
<h5><a name="Properties_Summary"></a>Properties Summary</h5>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="center"> Property </th>
<th align="left"> Description </th>
<th align="center"> Default value </th>
<th align="center"> Required </th></tr>
</thead><tbody>
<tr class="b">
<td align="center"> fs.defaultFS </td>
<td align="left"> Configure the default file system used by Hadoop.</td>
<td align="center"> None </td>
<td align="center"> NO </td></tr>
<tr class="a">
<td align="center"> fs.cosn.credentials.provider </td>
<td align="left"> Specifies how the credentials are obtained. A comma-separated list of credential provider class names, each implementing com.qcloud.cos.auth.COSCredentialsProvider: <br /> 1. org.apache.hadoop.fs.cosn.auth.SimpleCredentialsProvider: Obtains the secret id and secret key from <tt>fs.cosn.userinfo.secretId</tt> and <tt>fs.cosn.userinfo.secretKey</tt> in core-site.xml; <br /> 2. org.apache.hadoop.fs.cosn.auth.EnvironmentVariableCredentialsProvider: Obtains the secret id and secret key from the system environment variables <tt>COSN_SECRET_ID</tt> and <tt>COSN_SECRET_KEY</tt>. <br /> <br /> If unspecified, the default order of credential providers is: <br /> 1. org.apache.hadoop.fs.cosn.auth.SimpleCredentialsProvider; <br /> 2. org.apache.hadoop.fs.cosn.auth.EnvironmentVariableCredentialsProvider. </td>
<td align="center"> None </td>
<td align="center"> NO </td></tr>
<tr class="b">
<td align="center"> fs.cosn.userinfo.secretId/secretKey </td>
<td align="left"> The API key information of your account </td>
<td align="center"> None </td>
<td align="center"> YES </td></tr>
<tr class="a">
<td align="center"> fs.cosn.bucket.region </td>
<td align="left"> The region where the bucket is located. </td>
<td align="center"> None </td>
<td align="center"> YES </td></tr>
<tr class="b">
<td align="center"> fs.cosn.impl </td>
<td align="left"> The implementation class of the CosN filesystem. </td>
<td align="center"> None </td>
<td align="center"> YES </td></tr>
<tr class="a">
<td align="center"> fs.AbstractFileSystem.cosn.impl </td>
<td align="left"> The implementation class of the CosN AbstractFileSystem. </td>
<td align="center"> None </td>
<td align="center"> YES </td></tr>
<tr class="b">
<td align="center"> fs.cosn.tmp.dir </td>
<td align="left"> Temporary files generated by cosn are stored here while the program is running. </td>
<td align="center"> /tmp/hadoop_cos </td>
<td align="center"> NO </td></tr>
<tr class="a">
<td align="center"> fs.cosn.buffer.size </td>
<td align="left"> The total size of the buffer pool, in bytes. Must be greater than or equal to the block size. </td>
<td align="center"> 33554432 </td>
<td align="center"> NO </td></tr>
<tr class="b">
<td align="center"> fs.cosn.block.size </td>
<td align="left"> The size of each file block, in bytes. Because a file can be split into at most 10,000 blocks for upload, set this option according to the largest single file you expect to write. For example, an 8MB block size allows writing a single file of up to about 78GB. </td>
<td align="center"> 8388608 </td>
<td align="center"> NO </td></tr>
<tr class="a">
<td align="center"> fs.cosn.upload_thread_pool </td>
<td align="left"> Number of threads used for concurrent uploads when files are streamed to COS. </td>
<td align="center"> CPU core number * 3 </td>
<td align="center"> NO </td></tr>
<tr class="b">
<td align="center"> fs.cosn.read.ahead.block.size </td>
<td align="left"> The size of each read-ahead block. </td>
<td align="center"> 524288 (512KB) </td>
<td align="center"> NO </td></tr>
<tr class="a">
<td align="center"> fs.cosn.read.ahead.queue.size </td>
<td align="left"> The length of the read-ahead queue. </td>
<td align="center"> 10 </td>
<td align="center"> NO </td></tr>
<tr class="b">
<td align="center"> fs.cosn.maxRetries </td>
<td align="left"> The maximum number of retries for reading or writing files to COS before a failure is reported to the application. </td>
<td align="center"> 3 </td>
<td align="center"> NO </td></tr>
<tr class="a">
<td align="center"> fs.cosn.retry.interval.seconds </td>
<td align="left"> The number of seconds to sleep between retries. </td>
<td align="center"> 3 </td>
<td align="center"> NO </td></tr>
</tbody>
</table></div></div>
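<p>The block-size rule in the table above can be checked with a little arithmetic. The following is a minimal shell sketch, assuming only the 10,000-block limit stated for <tt>fs.cosn.block.size</tt>. The function names <tt>max_file_size</tt> and <tt>min_block_size</tt> and the 200GB file size are hypothetical and are not part of Hadoop-COS.</p>
<div>
<div>
<pre class="source"># Maximum number of blocks a single file can be split into for upload
# (the limit stated for fs.cosn.block.size in the table above).
MAX_PARTS=10000

# Print the largest file size, in bytes, that a block size of $1 bytes allows.
max_file_size() {
  echo $(( $1 * MAX_PARTS ))
}

# Print the smallest block size, in bytes, that lets a file of $1 bytes
# fit within MAX_PARTS blocks (integer division rounded up).
min_block_size() {
  echo $(( ($1 + MAX_PARTS - 1) / MAX_PARTS ))
}

max_file_size 8388608         # default 8MB block: prints 83886080000 (about 78GB)
min_block_size 214748364800   # hypothetical 200GB file: prints 21474837 (about 20.5MB)
</pre></div></div>
<p>A value chosen this way would then be set as <tt>fs.cosn.block.size</tt>, with <tt>fs.cosn.buffer.size</tt> kept greater than or equal to it.</p>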
<div class="section">
<h4><a name="Command_Usage"></a>Command Usage</h4>
<p>Command format: <tt>hadoop fs -ls -R cosn://bucket-appid/&lt;path&gt;</tt> or <tt>hadoop fs -ls -R /&lt;path&gt;</tt>. The latter requires the <tt>fs.defaultFS</tt> option to point to a <tt>cosn://</tt> URI.</p></div>
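<p>As an illustration of the two forms, the sketch below lists the same directory twice. The bucket name <tt>bucket-appid</tt> is the placeholder used above and <tt>/data</tt> is a hypothetical path. The second command assumes that the generic <tt>-D</tt> option is used to set <tt>fs.defaultFS</tt> for that invocation only, instead of editing core-site.xml.</p>
<div>
<div>
<pre class="source"># Full URI form: works regardless of the configured default file system.
hadoop fs -ls -R cosn://bucket-appid/data

# Short form: /data is resolved against CosN because fs.defaultFS
# is overridden for this single command.
hadoop fs -D fs.defaultFS=cosn://bucket-appid -ls -R /data
</pre></div></div>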
<div class="section">
<h4><a name="Example"></a>Example</h4>
<p>Use CosN as the underlying file system to run the WordCount routine:</p>
<div>
<div>
<pre class="source">bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.x.x.jar wordcount cosn://example/mr/input.txt cosn://example/mr/output
</pre></div></div>
<p>If setting CosN as the default file system for Hadoop, you can run it as follows:</p>
<div>
<div>
<pre class="source">bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.x.x.jar wordcount /mr/input.txt /mr/output
</pre></div></div>
</div></div></div>
<div class="section">
<h2><a name="Testing_the_hadoop-cos_Module"></a>Testing the hadoop-cos Module</h2>
<p>To test the CosN filesystem, the following two files, which pass authentication details to the test runner, are needed:</p>
<ol style="list-style-type: decimal">
<li>auth-keys.xml</li>
<li>core-site.xml</li>
</ol>
<p>Both files must reside under the <tt>hadoop-cloud-storage-project/hadoop-cos/src/test/resources</tt> directory.</p>
<div class="section">
<h3><a name="auth-key.xml"></a><tt>auth-keys.xml</tt></h3>
<p>COS credentials can be specified in <tt>auth-keys.xml</tt>. The presence of this file also triggers the CosN filesystem tests. The COS bucket URL should be provided through the <tt>test.fs.cosn.name</tt> option.</p>
<p>An example of <tt>auth-keys.xml</tt> is as follows:</p>
<div>
<div>
<pre class="source">&lt;configuration&gt;
&lt;property&gt;
&lt;name&gt;test.fs.cosn.name&lt;/name&gt;
&lt;value&gt;cosn://testbucket-12xxxxxx&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.cosn.bucket.region&lt;/name&gt;
&lt;value&gt;ap-xxx&lt;/value&gt;
&lt;description&gt;The region where the bucket is located&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.cosn.userinfo.secretId&lt;/name&gt;
&lt;value&gt;AKIDXXXXXXXXXXXXXXXXXXXX&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.cosn.userinfo.secretKey&lt;/name&gt;
&lt;value&gt;xxxxxxxxxxxxxxxxxxxxxxxxx&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</pre></div></div>
<p>Without this file, all tests in this module will be skipped.</p></div>
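<p>Because the tests are skipped when the file is missing, it can be worth checking for it before starting a run. The following is a minimal sketch, assuming that Maven is installed, that the commands are run from the root of the Hadoop source tree, and that the test resources live in the standard Maven <tt>src/test/resources</tt> directory. The helper name <tt>has_auth_keys</tt> is hypothetical.</p>
<div>
<div>
<pre class="source"># Succeed if auth-keys.xml exists in the test resources of module directory $1.
has_auth_keys() {
  [ -f &quot;$1/src/test/resources/auth-keys.xml&quot; ]
}

MODULE_DIR=hadoop-cloud-storage-project/hadoop-cos

if has_auth_keys &quot;$MODULE_DIR&quot;; then
  # Run the module's tests in a subshell so the caller's directory is unchanged.
  (cd &quot;$MODULE_DIR&quot; &amp;&amp; mvn test)
else
  echo &quot;auth-keys.xml not found; the CosN tests would be skipped&quot; &gt;&amp;2
fi
</pre></div></div>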
<div class="section">
<h3><a name="core-site.xml"></a><tt>core-site.xml</tt></h3>
<p>This file already exists and includes the configuration defined in auth-keys.xml. In most cases no modification is needed, unless a specific, non-default property needs to be set during testing.</p></div>
<div class="section">
<h3><a name="contract-test-options.xml"></a><tt>contract-test-options.xml</tt></h3>
<p>All configuration related to the contract tests needs to be specified in <tt>contract-test-options.xml</tt>. Here is an example of <tt>contract-test-options.xml</tt>:</p>
<div>
<div>
<pre class="source">&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&gt;
&lt;configuration&gt;
&lt;include xmlns=&quot;http://www.w3.org/2001/XInclude&quot;
href=&quot;auth-keys.xml&quot;/&gt;
&lt;property&gt;
&lt;name&gt;fs.contract.test.fs.cosn&lt;/name&gt;
&lt;value&gt;cosn://testbucket-12xxxxxx&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.cosn.bucket.region&lt;/name&gt;
&lt;value&gt;ap-xxx&lt;/value&gt;
&lt;description&gt;The region where the bucket is located&lt;/description&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</pre></div></div>
<p>If the option <tt>fs.contract.test.fs.cosn</tt> is not defined in the file, all contract tests will be skipped.</p></div></div>
<div class="section">
<h2><a name="Other_issues"></a>Other issues</h2>
<div class="section">
<h3><a name="Performance_Loss"></a>Performance Loss</h3>
<p>In general, the I/O performance of COS is lower than that of HDFS, even on virtual clusters running on Tencent CVM.</p>
<p>The main reasons are as follows:</p>
<ul>
<li>
<p>HDFS replicates data for faster queries.</p>
</li>
<li>
<p>HDFS is significantly faster for many &#x201c;metadata&#x201d; operations: listing the contents of a directory, calling getFileStatus() on a path, and creating or deleting directories.</p>
</li>
<li>
<p>HDFS stores data on local hard disks, which avoids network traffic when the code can be executed on the host that holds the data. Accessing an object stored in COS, however, requires network access almost every time, and this is a major factor limiting I/O performance. Hadoop-COS includes several optimizations to mitigate this, such as the pre-read queue, the upload buffer pool, and the concurrent upload thread pool.</p>
</li>
<li>
<p>File I/O that performs many seek calls or positioned read calls will also encounter performance problems due to the size of the HTTP requests made. Despite the pre-read cache optimizations, a large number of random reads can still cause frequent network requests.</p>
</li>
<li>
<p>On HDFS, <tt>rename</tt> and <tt>mv</tt> on a directory or a file are atomic, O(1) operations, but on COS the operation has to be carried out as a <tt>copy</tt> followed by a <tt>delete</tt>. Therefore, rename and move operations on COS objects are not only slow but also make data consistency difficult to guarantee.</p>
</li>
</ul>
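<p>The optimizations mentioned above are controlled by options from the properties table, and they can be overridden for a single command with the generic <tt>-D</tt> option. The sketch below is illustrative only: the bucket name, the file name <tt>big-file.dat</tt>, and all values are hypothetical, and whether such settings help depends on the workload.</p>
<div>
<div>
<pre class="source"># Upload with a larger block size and an explicit upload thread count.
# The default 32MB buffer pool is still greater than or equal to the 16MB block size.
hadoop fs -D fs.cosn.block.size=16777216 \
          -D fs.cosn.upload_thread_pool=16 \
          -put big-file.dat cosn://bucket-appid/data/

# Sequential read with a longer read-ahead queue; the output is discarded.
hadoop fs -D fs.cosn.read.ahead.queue.size=20 \
          -cat cosn://bucket-appid/data/big-file.dat &gt; /dev/null
</pre></div></div>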
<p>At present, using the COS blob storage system through Hadoop-COS incurs a performance loss of about 20% to 25% compared with HDFS. However, the cost of using COS, including both storage and maintenance, is lower than that of HDFS.</p></div></div>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2021
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>