---
hide:
- navigation
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
# Migration Guide
## Upgrading from 0.6 to 0.7
- Since 0.7.0, Celeborn removed `ReleaseSlots`.
- Since 0.7.0, Celeborn removed `WorkerRemove`.
## Upgrading from 0.5 to 0.6
- Since 0.6.0, Celeborn deprecated `celeborn.client.spark.fetch.throwsFetchFailure`. Please use `celeborn.client.spark.stageRerun.enabled` instead.
- Since 0.6.0, Celeborn renamed `celeborn.quota.tenant.diskBytesWritten` to `celeborn.quota.user.diskBytesWritten`. Please use `celeborn.quota.user.diskBytesWritten` to set a user-level quota.
- Since 0.6.0, Celeborn renamed `celeborn.quota.tenant.diskFileCount` to `celeborn.quota.user.diskFileCount`. Please use `celeborn.quota.user.diskFileCount` to set a user-level quota.
- Since 0.6.0, Celeborn renamed `celeborn.quota.tenant.hdfsBytesWritten` to `celeborn.quota.user.hdfsBytesWritten`. Please use `celeborn.quota.user.hdfsBytesWritten` to set a user-level quota.
- Since 0.6.0, Celeborn renamed `celeborn.quota.tenant.hdfsFileCount` to `celeborn.quota.user.hdfsFileCount`. Please use `celeborn.quota.user.hdfsFileCount` to set a user-level quota.
- Since 0.6.0, Celeborn renamed `celeborn.master.hdfs.expireDirs.timeout` to `celeborn.master.dfs.expireDirs.timeout`. Please use `celeborn.master.dfs.expireDirs.timeout` to set the timeout after which expired dirs are deleted.
- Since 0.6.0, Celeborn introduced `celeborn.master.slot.assign.minWorkers` with a default value of `100`, which means Celeborn will involve more workers when offering slots even if the number of reducers is small.
- Since 0.6.0, Celeborn deprecated `celeborn.worker.congestionControl.low.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.low.watermark` instead.
- Since 0.6.0, Celeborn deprecated `celeborn.worker.congestionControl.high.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.high.watermark` instead.
- Since 0.6.0, Celeborn changed the default value of `celeborn.client.spark.fetch.throwsFetchFailure` from `false` to `true`, which means Celeborn enables Spark stage rerun by default.
- Since 0.6.0, Celeborn changed the default value of `celeborn.<module>.io.mode` from `NIO` to `EPOLL` when epoll mode is available, falling back to `NIO` otherwise.
- Since 0.6.0, Celeborn removed `celeborn.client.shuffle.mapPartition.split.enabled` to enable shuffle partition split by default for MapPartition.
- Since 0.6.0, Celeborn has introduced a new RESTful API namespace: /api/v1, which uses the application/json media type for requests and responses.
The `celeborn-openapi-client` SDK is also available to help users interact with the new RESTful APIs.
The legacy RESTful APIs have been deprecated and will be removed in future releases.
Access the full [RESTful API documentation](./restapi.md) for detailed information. A request example is sketched at the end of this list.
- The mappings of the old RESTful APIs to the new RESTful APIs for Master.
| Old RESTful API | New RESTful API | Note |
|--------------------------|------------------------------------------|--------------------------------------------------|
| GET /conf | GET /api/v1/conf | |
| GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
| GET /threadDump | GET /api/v1/thread_dump | |
| GET /applications | GET /api/v1/applications | |
| GET /hostnames | GET /api/v1/applications/hostnames | |
| GET /shuffle | GET /api/v1/shuffles | |
| GET /masterGroupInfo | GET /api/v1/masters | |
| GET /workerInfo | GET /api/v1/workers | |
| GET /lostWorkers | GET /api/v1/workers | get the lostWorkers field in response |
| GET /excludedWorkers | GET /api/v1/workers | get the excludedWorkers field in response |
| GET /shutdownWorkers     | GET /api/v1/workers                       | get the shutdownWorkers field in response        |
| GET /decommissionWorkers | GET /api/v1/workers                       | get the decommissioningWorkers field in response |
| POST /exclude | POST /api/v1/workers/exclude | |
| GET /workerEventInfo | GET /api/v1/workers/events | |
| POST /sendWorkerEvent | POST /api/v1/workers/events | |
- The mappings of the old RESTful APIs to the new RESTful APIs for Worker.
| Old RESTful API | New RESTful API | Note |
|--------------------------------|------------------------------------------|---------------------------------------------|
| GET /conf | GET /api/v1/conf | |
| GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
| GET /threadDump | GET /api/v1/thread_dump | |
| GET /applications | GET /api/v1/applications | |
| GET /shuffle | GET /api/v1/shuffles | |
| GET /listPartitionLocationInfo | GET /api/v1/shuffles/partitions | |
| GET /workerInfo | GET /api/v1/workers | |
| GET /isRegistered | GET /api/v1/workers | get the isRegistered field in response |
| GET /isDecommissioning | GET /api/v1/workers | get the isDecommissioning field in response |
| GET /isShutdown | GET /api/v1/workers | get the isShutdown field in response |
| GET /unavailablePeers | GET /api/v1/workers/unavailable_peers | |
| POST /exit | POST /api/v1/workers/exit | |
- Since 0.6.0, the RESTful API `/listTopDiskUsedApps` in both Master and Worker has been removed. Please use the following PromQL query instead.
```text
topk(50, sum by (applicationId) (metrics_diskBytesWritten_Value{role="Worker", applicationId!=""}))
```
- Since 0.6.0, the outdated Flink 1.14 and Flink 1.15 have been removed from the official support list.
- Since 0.6.0, the client respects the `spark.celeborn.storage.availableTypes` configuration,
ensuring revived partition locations no longer default to memory storage. In contrast, clients prior
to 0.6.0 default to memory storage for revived partitions. This means that if memory storage is enabled in
worker nodes, clients prior to 0.6.0 may inadvertently utilize memory storage for an application even when memory
storage is not enabled for that app.
- Since 0.6.0, we have added a new sink, `org.apache.celeborn.common.metrics.sink.LoggerSink`, to make sure that Celeborn
metrics are consumed periodically. If you don't have a collector that scrapes metrics periodically, it's recommended to
enable this sink so that the worker's accumulated metrics data doesn't grow large enough to cause a worker OOM. Don't forget
to update `metrics.properties`; a configuration sketch follows this list.
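
A minimal `metrics.properties` sketch enabling the `LoggerSink` mentioned in the last item above. The `*.sink.<name>.class` key layout and the `loggerSink` name are assumptions borrowed from the Spark-style metrics configuration; verify them against the `metrics.properties.template` shipped with your Celeborn release.

```properties
# Assumed Spark-style metrics configuration; verify key names against
# metrics.properties.template in your Celeborn release.
*.sink.loggerSink.class=org.apache.celeborn.common.metrics.sink.LoggerSink
```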
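
For the new `/api/v1` endpoints listed in the mapping tables above, requests and responses use `application/json`. Below is a hedged request example against a Master's HTTP address; the host and port are placeholders, and the authoritative reference is the [RESTful API documentation](./restapi.md).

```bash
# List workers via the v1 API; the JSON response contains fields such as
# lostWorkers and excludedWorkers that used to have dedicated endpoints.
curl -s -H "Accept: application/json" \
  "http://<master-host>:<master-http-port>/api/v1/workers" | jq '.lostWorkers'
```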
## Upgrading from 0.5.0 to 0.5.1
- Since 0.5.1, the Celeborn master REST API `/exclude` request uses the media type `application/x-www-form-urlencoded` instead of `text/plain`, as sketched after this list.
- Since 0.5.1, the Celeborn master REST API `/sendWorkerEvent` request uses the POST method, the parameters `type` and `workers` are passed as form parameters, and the media type is `application/x-www-form-urlencoded` instead of `text/plain`.
- Since 0.5.1, the Celeborn worker REST API `/exit` request uses the media type `application/x-www-form-urlencoded` instead of `text/plain`.
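
A hedged sketch of the 0.5.1-style `/exclude` request using form parameters rather than a plain-text body. The host, port, and worker identifier are placeholders; consult your version's REST documentation for the exact worker identifier format.

```bash
# POST /exclude with form-encoded parameters (0.5.1 and later).
# <worker-id> is a placeholder; "remove" takes worker identifiers analogously.
curl -X POST "http://<master-host>:<master-http-port>/exclude" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode "add=<worker-id>"
```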
## Upgrading from 0.4 to 0.5
- Since 0.5.0, the Celeborn master metric `LostWorkers` is renamed to `LostWorkerCount`.
- Since 0.5.0, the Celeborn worker metric `ChunkStreamCount` is renamed to `ActiveChunkStreamCount`.
- Since 0.5.0, the Celeborn worker metric `CreditStreamCount` is renamed to `ActiveCreditStreamCount`.
- Since 0.5.0, Celeborn configurations support a new tag, `isDynamic`, to indicate whether a configuration is a dynamic config.
- Since 0.5.0, Celeborn changed the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` from `LEVELDB` to `ROCKSDB`, which means Celeborn will use the RocksDB store as the recovery backend.
To restore the behavior before Celeborn 0.5, you can set `celeborn.worker.graceful.shutdown.recoverDbBackend` to `LEVELDB`.
- Since 0.5.0, Celeborn deprecated `celeborn.quota.configuration.path`. Please use `celeborn.dynamicConfig.store.fs.path` instead.
- Since 0.5.0, the Celeborn client removed the configurations `celeborn.client.push.splitPartition.threads`, `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory`.
- Since 0.5.0, Celeborn deprecated `celeborn.client.spark.shuffle.forceFallback.enabled`. Please use `celeborn.client.spark.shuffle.fallback.policy` instead (see the sketch after this list).
- Since 0.5.0, the Celeborn master REST API `/exclude` uses the POST method, and the parameters `add` and `remove` are passed as form parameters.
- Since 0.5.0, the Celeborn worker REST API `/exit` uses the POST method, and the parameter `type` is passed as a form parameter.
- Since 0.5.0, the Celeborn master and worker REST API `/shuffles` is renamed to `/shuffle`, which has been deprecated since 0.6.0.
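
A hedged sketch of migrating from the deprecated force-fallback flag to the fallback policy mentioned in the list above. The policy value `ALWAYS` is an assumption about the accepted values; check the configuration reference for your release.

```properties
# Before 0.5.0 (deprecated): force every shuffle to fall back to Spark's built-in shuffle.
# spark.celeborn.client.spark.shuffle.forceFallback.enabled=true

# Since 0.5.0 (assumed equivalent; verify the accepted policy values for your release).
spark.celeborn.client.spark.shuffle.fallback.policy=ALWAYS
```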
## Upgrading from 0.4.0 to 0.4.1
- Since 0.4.1, Celeborn master adds a limit to the estimated partition size used for computing worker slots.
This size is now constrained within the range specified by `celeborn.master.estimatedPartitionSize.minSize` and `celeborn.master.estimatedPartitionSize.maxSize`.
- Since 0.4.1, Celeborn changed the fallback configuration of `celeborn.client.rpc.getReducerFileGroup.askTimeout`, `celeborn.client.rpc.registerShuffle.askTimeout` and `celeborn.client.rpc.requestPartition.askTimeout` from `celeborn.<module>.io.connectionTimeout` to `celeborn.rpc.askTimeout`.
## Upgrading from 0.3 to 0.4
- Since 0.4.0, Celeborn is not compatible with Celeborn clients whose version is below 0.3.0.
Note: it's strongly recommended to use the same version of the client and Celeborn Master/Worker in production.
- Since 0.4.0, Celeborn no longer supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`.
- Since 0.4.0, Celeborn changed the default value of `celeborn.<module>.io.numConnectionsPerPeer` from `2` to `1`.
- Since 0.4.0, Celeborn has changed the names of the Prometheus-related master and worker configurations as shown in the table below:
| Key Before v0.4.0 | Key After v0.4.0 |
|-------------------------------------------|-----------------------------|
| `celeborn.metrics.master.prometheus.host` | `celeborn.master.http.host` |
| `celeborn.metrics.master.prometheus.port` | `celeborn.master.http.port` |
| `celeborn.metrics.worker.prometheus.host` | `celeborn.worker.http.host` |
| `celeborn.metrics.worker.prometheus.port` | `celeborn.worker.http.port` |
- Since 0.4.0, Celeborn deprecated `celeborn.worker.storage.baseDir.prefix` and `celeborn.worker.storage.baseDir.number`.
Please use `celeborn.worker.storage.dirs` instead (a configuration sketch follows this list).
- Since 0.4.0, Celeborn deprecated `celeborn.storage.activeTypes`. Please use `celeborn.storage.availableTypes` instead.
- Since 0.4.0, the Celeborn worker removed the configuration `celeborn.worker.userResourceConsumption.update.interval`.
- Since 0.4.0, the Celeborn master metric `PartitionWritten` is renamed to `ActiveShuffleSize`.
- Since 0.4.0, the Celeborn master metric `PartitionFileCount` is renamed to `ActiveShuffleFileCount`.
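
A hedged `celeborn-defaults.conf` sketch reflecting the 0.4.0 renames above (HTTP host/port keys and worker storage directories). The paths, ports, and storage type shown are placeholders rather than recommendations; check the configuration reference for the exact accepted values.

```properties
# HTTP endpoints after the 0.4.0 rename (was celeborn.metrics.{master,worker}.prometheus.{host,port}).
celeborn.master.http.host=0.0.0.0
celeborn.master.http.port=9098
celeborn.worker.http.host=0.0.0.0
celeborn.worker.http.port=9096

# Replaces celeborn.worker.storage.baseDir.prefix / baseDir.number;
# a comma-separated list of directories (placeholder paths).
celeborn.worker.storage.dirs=/mnt/disk1/celeborn,/mnt/disk2/celeborn

# Replaces celeborn.storage.activeTypes (value is an assumption; verify against the config reference).
celeborn.storage.availableTypes=HDD
```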
## Upgrading from 0.3.1 to 0.3.2
- Since 0.3.2, Celeborn changed the default value of `raft.client.rpc.request.timeout` from `3s` to `10s`.
- Since 0.3.2, Celeborn changed the default value of `raft.client.rpc.watch.request.timeout` from `10s` to `20s`.
## Upgrading from 0.3.0 to 0.3.1
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.directMemoryRatioToResume` from `0.5` to `0.7`.
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.monitor.disk.check.interval` from `60` to `30`.
- Since 0.3.1, the names of JVM metrics changed; see details in CELEBORN-1007.
## Upgrading from 0.2 to 0.3
- The Celeborn 0.2 client is compatible with the 0.3 Master/Worker, which allows upgrading the Master/Worker first and then the client.
Note: it's strongly recommended to use the same version of the client and Celeborn Master/Worker in production.
- Since 0.3.0, support for the deprecated `rss.*` configurations is removed.
All configurations listed in the 0.2.1 docs still take effect, but some of them are deprecated as well; please read
the bootstrap logs and follow the suggestions to migrate to the new configurations.
- From 0.3.0 on, the default value of `celeborn.client.push.replicate.enabled` is changed from `true` to `false`; users
who want replication must enable it explicitly. For example, to enable replication for Spark,
users should add the Spark config `spark.celeborn.client.push.replicate.enabled=true` when submitting the job.
- From 0.3.0 on, the default value of `celeborn.worker.storage.workingDir` is changed from `hadoop/rss-worker/shuffle_data` to `celeborn-worker/shuffle_data`;
users who want to keep the original working dir path should set this configuration explicitly.
- Since 0.3.0, the configuration namespace `celeborn.ha.master` is deprecated and will be removed in future versions.
All configurations `celeborn.ha.master.*` should migrate to `celeborn.master.ha.*`.
- Since 0.3.0, the environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT` are removed.
Instead, `CELEBORN_LOCAL_HOSTNAME` works on both master and worker and takes higher priority than configurations defined in the properties file.
- Since 0.3.0, the Celeborn Master URL scheme is changed from `rss://` to `celeborn://`; users who start the Worker with
`sbin/start-worker.sh rss://<master-host>:<master-port>` should migrate to `sbin/start-worker.sh celeborn://<master-host>:<master-port>`.
- Since 0.3.0, Celeborn supports overriding Hadoop configuration (`core-site.xml`, `hdfs-site.xml`, etc.) from the Celeborn configuration with the additional prefix `celeborn.hadoop.`.
On the Spark client side, users should set Hadoop configuration like `spark.celeborn.hadoop.foo=bar`; note that `spark.hadoop.foo=bar` does not take effect.
On the Flink client and Celeborn Master/Worker side, users should set it like `celeborn.hadoop.foo=bar`.
- Since 0.3.0, the Celeborn master metric `BlacklistedWorkerCount` is renamed to `ExcludedWorkerCount`.
- Since 0.3.0, the Celeborn master HTTP request URL `/blacklistedWorkers` is renamed to `/excludedWorkers`.
- Since 0.3.0, Celeborn introduces a terminology update for worker data replication, replacing the previous `master/slave` terminology with `primary/replica`. In alignment with this change, the corresponding metrics keywords have been adjusted.
The following table presents a comprehensive overview of the changes:
| Key Before v0.3.0 | Key After v0.3.0 |
|-------------------------------|--------------------------------|
| `MasterPushDataTime` | `PrimaryPushDataTime` |
| `MasterPushDataHandshakeTime` | `PrimaryPushDataHandshakeTime` |
| `MasterRegionStartTime` | `PrimaryRegionStartTime` |
| `MasterRegionFinishTime` | `PrimaryRegionFinishTime` |
| `SlavePushDataTime` | `ReplicaPushDataTime` |
| `SlavePushDataHandshakeTime` | `ReplicaPushDataHandshakeTime` |
| `SlaveRegionStartTime` | `ReplicaRegionStartTime` |
| `SlaveRegionFinishTime` | `ReplicaRegionFinishTime` |
- Since 0.3.0, Celeborn's Spark shuffle manager changed from `org.apache.spark.shuffle.celeborn.RssShuffleManager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager`. Users can set the Spark property `spark.shuffle.manager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager` to use the Celeborn remote shuffle service, as sketched below.
In 0.3.0, Celeborn still supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`; it will be removed in 0.4.0.
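
A hedged `spark-submit` sketch combining the 0.3.0 shuffle manager setting from the last item with the replication flag mentioned earlier in this list. Hosts, ports, the main class, and the jar are placeholders, and `spark.celeborn.master.endpoints` is assumed to be how the client locates the Celeborn masters; check the Spark integration docs for your version.

```bash
# Placeholders throughout; adjust endpoints, class, and jar for your deployment.
spark-submit \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
  --conf spark.celeborn.master.endpoints=<master-host>:<master-port> \
  --conf spark.celeborn.client.push.replicate.enabled=true \
  --class <your.main.Class> <your-application.jar>
```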