hide:
license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
ReleaseSlots
.WorkerRemove
.Since 0.6.0, Celeborn deprecate celeborn.client.spark.fetch.throwsFetchFailure
. Please use celeborn.client.spark.stageRerun.enabled
instead.
Since 0.6.0, Celeborn modified celeborn.quota.tenant.diskBytesWritten
to celeborn.quota.user.diskBytesWritten
. Please use celeborn.quota.user.diskBytesWritten
if you want to set user level quota.
Since 0.6.0, Celeborn modified celeborn.quota.tenant.diskFileCount
to celeborn.quota.user.diskFileCount
. Please use celeborn.quota.user.diskFileCount
if you want to set user level quota.
Since 0.6.0, Celeborn modified celeborn.quota.tenant.hdfsBytesWritten
to celeborn.quota.user.hdfsBytesWritten
. Please use celeborn.quota.user.hdfsBytesWritten
if you want to set user level quota.
Since 0.6.0, Celeborn modified celeborn.quota.tenant.hdfsFileCount
to celeborn.quota.user.hdfsFileCount
. Please use celeborn.quota.user.hdfsFileCount
if you want to set user level quota.
Since 0.6.0, Celeborn modified celeborn.master.hdfs.expireDirs.timeout
to celeborn.master.dfs.expireDirs.timeout
. Please use cceleborn.master.dfs.expireDirs.timeout
if you want to set timeout for an expired dirs to be deleted.
Since 0.6.0, Celeborn introduced celeborn.master.slot.assign.minWorkers
with default value of 100
, which means Celeborn will involve more workers in offering slots when number of reducers are less.
Since 0.6.0, Celeborn deprecate celeborn.worker.congestionControl.low.watermark
. Please use celeborn.worker.congestionControl.diskBuffer.low.watermark
instead.
Since 0.6.0, Celeborn deprecate celeborn.worker.congestionControl.high.watermark
. Please use celeborn.worker.congestionControl.diskBuffer.high.watermark
instead.
Since 0.6.0, Celeborn changed the default value of celeborn.client.spark.fetch.throwsFetchFailure
from false
to true
, which means Celeborn will enable spark stage rerun at default.
Since 0.6.0, Celeborn changed celeborn.<module>.io.mode
optional, of which the default value changed from NIO
to EPOLL
if epoll mode is available, falling back to NIO
otherwise.
Since 0.6.0, Celeborn removed celeborn.client.shuffle.mapPartition.split.enabled
to enable shuffle partition split at default for MapPartition.
Since 0.6.0, Celeborn has introduced a new RESTful API namespace: /api/v1, which uses the application/json media type for requests and responses. The celeborn-openapi-client
SDK is also available to help users interact with the new RESTful APIs. The legacy RESTful APIs have been deprecated and will be removed in future releases. Access the full RESTful API documentation for detailed information.
The mappings of the old RESTful APIs to the new RESTful APIs for Master.
Old RESTful API | New RESTful API | Note |
---|---|---|
GET /conf | GET /api/v1/conf | |
GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
GET /threadDump | GET /api/v1/thread_dump | |
GET /applications | GET /api/v1/applications | |
GET /hostnames | GET /api/v1/applications/hostnames | |
GET /shuffle | GET /api/v1/shuffles | |
GET /masterGroupInfo | GET /api/v1/masters | |
GET /workerInfo | GET /api/v1/workers | |
GET /lostWorkers | GET /api/v1/workers | get the lostWorkers field in response |
GET /excludedWorkers | GET /api/v1/workers | get the excludedWorkers field in response |
GET /shutdownWorkers | GET /api/v1/workers | get the shutdownWorkers filed in response |
GET /decommissionWorkers | GET /api/v1/workers | get the decommissioningWorkers filed in response |
POST /exclude | POST /api/v1/workers/exclude | |
GET /workerEventInfo | GET /api/v1/workers/events | |
POST /sendWorkerEvent | POST /api/v1/workers/events |
The mappings of the old RESTful APIs to the new RESTful APIs for Worker.
Old RESTful API | New RESTful API | Note |
---|---|---|
GET /conf | GET /api/v1/conf | |
GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
GET /threadDump | GET /api/v1/thread_dump | |
GET /applications | GET /api/v1/applications | |
GET /shuffle | GET /api/v1/shuffles | |
GET /listPartitionLocationInfo | GET /api/v1/shuffles/partitions | |
GET /workerInfo | GET /api/v1/workers | |
GET /isRegistered | GET /api/v1/workers | get the isRegistered field in response |
GET /isDecommissioning | GET /api/v1/workers | get the isDecommissioning field in response |
GET /isShutdown | GET /api/v1/workers | get the isShutdown field in response |
GET /unavailablePeers | GET /api/v1/workers/unavailable_peers | |
POST /exit | POST /api/v1/workers/exit |
Since 0.6.0, the RESTful api /listTopDiskUsedApps
both in Master and Worker has been removed. Please use the following PromQL query instead.
topK(50, sum by (applicationId) (metrics_diskBytesWritten_Value{role="Worker", applicationId!=""}))
Since 0.6.0, the out-of-dated Flink 1.14 and Flink 1.15 have been removed from the official support list.
Since 0.6.0, the client respects the spark.celeborn.storage.availableTypes configuration, ensuring revived partition locations no longer default to memory storage. In contrast, clients prior to 0.6.0 default to memory storage for revived partitions. This means that if memory storage is enabled in worker nodes, clients prior to 0.6.0 may inadvertently utilize memory storage for an application even when memory storage is not enabled for that app.
Since 0.6.0, we have added a new sink org.apache.celeborn.common.metrics.sink.LoggerSink
to make sure that Celeborn metrics will be scraped periodically. It‘s recommended to enable this sink to make sure that worker’s metrics data won‘t be too large to cause worker OOM if you don’t have a collector to scrape metrics periodically. Don't forget to update the metrics.properties
.
Since 0.5.1, Celeborn master REST API /exclude
request uses media type application/x-www-form-urlencoded
instead of text/plain
.
Since 0.5.1, Celeborn master REST API /sendWorkerEvent
request uses POST method and the parameters type
and workers
use form parameters instead, and uses media type application/x-www-form-urlencoded
instead of text/plain
.
Since 0.5.1, Celeborn worker REST API /exit
request uses media type application/x-www-form-urlencoded
instead of text/plain
.
Since 0.5.0, Celeborn master metrics LostWorkers
is renamed as LostWorkerCount
.
Since 0.5.0, Celeborn worker metrics ChunkStreamCount
is renamed as ActiveChunkStreamCount
.
Since 0.5.0, Celeborn worker metrics CreditStreamCount
is renamed as ActiveCreditStreamCount
.
Since 0.5.0, Celeborn configurations support new tag isDynamic
to represent whether the configuration is dynamic config.
Since 0.5.0, Celeborn changed the default value of celeborn.worker.graceful.shutdown.recoverDbBackend
from LEVELDB
to ROCKSDB
, which means Celeborn will use RocksDB store for recovery backend. To restore the behavior before Celeborn 0.5, you can set celeborn.worker.graceful.shutdown.recoverDbBackend
to LEVELDB
.
Since 0.5.0, Celeborn deprecate celeborn.quota.configuration.path
. Please use celeborn.dynamicConfig.store.fs.path
instead.
Since 0.5.0, Celeborn client removes configuration celeborn.client.push.splitPartition.threads
, celeborn.client.flink.inputGate.minMemory
and celeborn.client.flink.resultPartition.minMemory
.
Since 0.5.0, Celeborn deprecate celeborn.client.spark.shuffle.forceFallback.enabled
. Please use celeborn.client.spark.shuffle.fallback.policy
instead.
Since 0.5.0, Celeborn master REST API /exclude
uses POST method and the parameters add
and remove
use form parameters instead.
Since 0.5.0, Celeborn worker REST API /exit
uses POST method and the parameter type
uses form parameter instead.
Since 0.5.0, Celeborn master and worker REST API /shuffles
is renamed as /shuffle
, and will be deprecated since 0.6.0.
Since 0.4.1, Celeborn master adds a limit to the estimated partition size used for computing worker slots. This size is now constrained within the range specified by celeborn.master.estimatedPartitionSize.minSize
and celeborn.master.estimatedPartitionSize.maxSize
.
Since 0.4.1, Celeborn changed the fallback configuration of celeborn.client.rpc.getReducerFileGroup.askTimeout
, celeborn.client.rpc.registerShuffle.askTimeout
and celeborn.client.rpc.requestPartition.askTimeout
from celeborn.<module>.io.connectionTimeout
to celeborn.rpc.askTimeout
.
Since 0.4.0, Celeborn won‘t be compatible with Celeborn client that versions below 0.3.0. Note that: It’s strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
Since 0.4.0, Celeborn won't support org.apache.spark.shuffle.celeborn.RssShuffleManager
.
Since 0.4.0, Celeborn changed the default value of celeborn.<module>.io.numConnectionsPerPeer
from 2
to 1
.
Since 0.4.0, Celeborn has changed the names of the prometheus master and worker configuration as shown in the table below:
Key Before v0.4.0 | Key After v0.4.0 |
---|---|
celeborn.metrics.master.prometheus.host | celeborn.master.http.host |
celeborn.metrics.master.prometheus.port | celeborn.master.http.port |
celeborn.metrics.worker.prometheus.host | celeborn.worker.http.host |
celeborn.metrics.worker.prometheus.port | celeborn.worker.http.port |
Since 0.4.0, Celeborn deprecate celeborn.worker.storage.baseDir.prefix
and celeborn.worker.storage.baseDir.number
. Please use celeborn.worker.storage.dirs
instead.
Since 0.4.0, Celeborn deprecate celeborn.storage.activeTypes
. Please use celeborn.storage.availableTypes
instead.
Since 0.4.0, Celeborn worker removes configuration celeborn.worker.userResourceConsumption.update.interval
.
Since 0.4.0, Celeborn master metrics PartitionWritten
is renamed as ActiveShuffleSize
.
Since 0.4.0, Celeborn master metrics PartitionFileCount
is renamed as ActiveShuffleFileCount
.
Since 0.3.1, Celeborn changed the default value of raft.client.rpc.request.timeout
from 3s
to 10s
.
Since 0.3.1, Celeborn changed the default value of raft.client.rpc.watch.request.timeout
from 10s
to 20s
.
Since 0.3.1, Celeborn changed the default value of celeborn.worker.directMemoryRatioToResume
from 0.5
to 0.7
.
Since 0.3.1, Celeborn changed the default value of celeborn.worker.monitor.disk.check.interval
from 60
to 30
.
Since 0.3.1, name of JVM metrics changed, see details at CELEBORN-1007.
Celeborn 0.2 Client is compatible with 0.3 Master/Server, it allows to upgrade Master/Worker first then Client. Note that: It's strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
Since 0.3.0, the support of deprecated configurations rss.*
is removed. All configurations listed in 0.2.1 docs still take effect, but some of those are deprecated too, please read the bootstrap logs and follow the suggestion to migrate to the new configuration.
From 0.3.0 on the default value for celeborn.client.push.replicate.enabled
is changed from true
to false
, users who want replication on should explicitly enable replication. For example, to enable replication for Spark users should add the spark config when submitting job: spark.celeborn.client.push.replicate.enabled=true
From 0.3.0 on the default value for celeborn.worker.storage.workingDir
is changed from hadoop/rss-worker/shuffle_data
to celeborn-worker/shuffle_data
, users who want to use origin working dir path should set this configuration.
Since 0.3.0, configuration namespace celeborn.ha.master
is deprecated, and will be removed in the future versions. All configurations celeborn.ha.master.*
should migrate to celeborn.master.ha.*
.
Since 0.3.0, environment variables CELEBORN_MASTER_HOST
and CELEBORN_MASTER_PORT
are removed. Instead CELEBORN_LOCAL_HOSTNAME
works on both master and worker, which takes high priority than configurations defined in properties file.
Since 0.3.0, the Celeborn Master URL schema is changed from rss://
to celeborn://
, for users who start Worker by sbin/start-worker.sh rss://<master-host>:<master-port>
, should migrate to sbin/start-worker.sh celeborn://<master-host>:<master-port>
.
Since 0.3.0, Celeborn supports overriding Hadoop configuration(core-site.xml
, hdfs-site.xml
, etc.) from Celeborn configuration with the additional prefix celeborn.hadoop.
. On Spark client side, user should set Hadoop configuration like spark.celeborn.hadoop.foo=bar
, note that spark.hadoop.foo=bar
does not take effect; on Flink client and Celeborn Master/Worker side, user should set like celeborn.hadoop.foo=bar
.
Since 0.3.0, Celeborn master metrics BlacklistedWorkerCount
is renamed as ExcludedWorkerCount
.
Since 0.3.0, Celeborn master http request url /blacklistedWorkers
is renamed as /excludedWorkers
.
Since 0.3.0, introduces a terminology update for Celeborn worker data replication, replacing the previous master/slave
terminology with primary/replica
. In alignment with this change, corresponding metrics keywords have been adjusted. The following table presents a comprehensive overview of the changes:
Key Before v0.3.0 | Key After v0.3.0 |
---|---|
MasterPushDataTime | PrimaryPushDataTime |
MasterPushDataHandshakeTime | PrimaryPushDataHandshakeTime |
MasterRegionStartTime | PrimaryRegionStartTime |
MasterRegionFinishTime | PrimaryRegionFinishTime |
SlavePushDataTime | ReplicaPushDataTime |
SlavePushDataHandshakeTime | ReplicaPushDataHandshakeTime |
SlaveRegionStartTime | ReplicaRegionStartTime |
SlaveRegionFinishTime | ReplicaRegionFinishTime |
Since 0.3.0, Celeborn's spark shuffle manager change from org.apache.spark.shuffle.celeborn.RssShuffleManager
to org.apache.spark.shuffle.celeborn.SparkShuffleManager
. User can set spark property spark.shuffle.manager
to org.apache.spark.shuffle.celeborn.SparkShuffleManager
to use Celeborn remote shuffle service. In 0.3.0, Celeborn still support org.apache.spark.shuffle.celeborn.RssShuffleManager
, it will be removed in 0.4.0.