Celeborn has three primary components: Master, Worker, and Client. The Master manages all resources and synchronizes shared state based on Raft. The Worker processes read-write requests and merges data for each reducer. The LifecycleManager maintains the metadata of each shuffle and runs within the Spark driver.
We introduce slots to achieve load balancing: partitions are distributed evenly across Celeborn Workers by tracking slot usage. A slot is a logical concept in the Celeborn Worker that represents how many partitions can be allocated to that Worker. A Worker's slot count is decided by total usable disk size / average shuffle file size. The slot count decreases when a partition is allocated and increases when a partition is freed.
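As a rough illustration of that formula (the disk sizes and the 64 MiB average shuffle file size below are assumed values, chosen only for this sketch):

```bash
# Rough slot-count estimate for a single worker (illustrative values only).
# Assume 4 disks with ~450 GiB usable space each and an average shuffle
# file size of 64 MiB.
usable_bytes=$((4 * 450 * 1024 * 1024 * 1024))
avg_shuffle_file_bytes=$((64 * 1024 * 1024))
echo $((usable_bytes / avg_shuffle_file_bytes))   # ~28800 slots
```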
1. Celeborn supports Spark 2.4/3.0/3.1/3.2/3.3/3.4 and Flink 1.14/1.15/1.17.
2. Celeborn is tested under a Java 8 environment.
Build Celeborn
./build/make-distribution.sh -Pspark-2.4/-Pspark-3.0/-Pspark-3.1/-Pspark-3.2/-Pspark-3.3/-Pspark-3.4/-Pflink-1.14/-Pflink-1.15/-Pflink-1.17
The build procedure generates a compressed package named apache-celeborn-${project.version}-bin.tgz.
Spark package layout:
    ├── RELEASE
    ├── bin
    ├── conf
    ├── jars           // common jars for master and worker
    ├── master-jars
    ├── worker-jars
    ├── sbin
    └── spark          // Spark client jars
Flink package layout:
    ├── RELEASE
    ├── bin
    ├── conf
    ├── jars           // common jars for master and worker
    ├── master-jars
    ├── worker-jars
    ├── sbin
    └── flink          // Flink client jars
The Celeborn server is compatible with all clients across the supported engines, but the Celeborn client must match the version of the engine it runs in. For example, if you are running Spark 2.4, you must compile the Celeborn client with -Pspark-2.4; if you are running Spark 3.2, you must compile it with -Pspark-3.2; if you are running Flink 1.14, you must compile it with -Pflink-1.14.
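For instance, a sketch of building a distribution for a Spark 3.2 deployment (only the profile flag changes for other engines; the output location comment is an assumption about a default build):

```bash
# Build Celeborn with the client profile matching the engine that will run the jobs.
# A Spark 3.2 job must use a client compiled with -Pspark-3.2.
./build/make-distribution.sh -Pspark-3.2

# The tarball is typically generated in the project root (path may differ);
# its spark/ directory contains the Spark client jars.
ls apache-celeborn-*-bin.tgz
```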
A Celeborn cluster is composed of Master and Worker nodes; the Master supports both single-node and HA (Raft-based) deployments.
Unzip the package to $CELEBORN_HOME and modify the environment variables in $CELEBORN_HOME/conf/celeborn-env.sh.
EXAMPLE:
    #!/usr/bin/env bash
    CELEBORN_MASTER_MEMORY=4g
    CELEBORN_WORKER_MEMORY=2g
    CELEBORN_WORKER_OFFHEAP_MEMORY=4g
Then modify the configurations in $CELEBORN_HOME/conf/celeborn-defaults.conf.
EXAMPLE: single master cluster
    # used by client and worker to connect to master
    celeborn.master.endpoints clb-master:9097

    # used by master to bootstrap
    celeborn.master.host clb-master
    celeborn.master.port 9097

    celeborn.metrics.enabled true
    celeborn.worker.flush.buffer.size 256k

    # Disk type is HDD by default.
    celeborn.worker.storage.dirs /mnt/disk1:disktype=SSD,/mnt/disk2:disktype=SSD

    # If your hosts have disk raid or use lvm, set celeborn.worker.monitor.disk.enabled to false
    celeborn.worker.monitor.disk.enabled false
EXAMPLE: HA cluster
    # used by client and worker to connect to master
    celeborn.master.endpoints clb-1:9097,clb-2:9097,clb-3:9097

    # used by master nodes to bootstrap; every node should know the topology of the whole cluster.
    # For each node, `celeborn.ha.master.node.id` should be unique, and `celeborn.ha.master.node.<id>.host` is required.
    celeborn.ha.enabled true
    celeborn.ha.master.node.id 1
    celeborn.ha.master.node.1.host clb-1
    celeborn.ha.master.node.1.port 9097
    celeborn.ha.master.node.1.ratis.port 9872
    celeborn.ha.master.node.2.host clb-2
    celeborn.ha.master.node.2.port 9097
    celeborn.ha.master.node.2.ratis.port 9872
    celeborn.ha.master.node.3.host clb-3
    celeborn.ha.master.node.3.port 9097
    celeborn.ha.master.node.3.ratis.port 9872
    celeborn.ha.master.ratis.raft.server.storage.dir /mnt/disk1/rss_ratis/

    celeborn.metrics.enabled true

    # If you want to use HDFS as shuffle storage, make sure the flush buffer size is at least 4MB or larger.
    celeborn.worker.flush.buffer.size 256k

    # Disk type is HDD by default.
    celeborn.worker.storage.dirs /mnt/disk1:disktype=SSD,/mnt/disk2:disktype=SSD

    # If your hosts have disk raid or use lvm, set celeborn.worker.monitor.disk.enabled to false
    celeborn.worker.monitor.disk.enabled false
Flink engine related configurations:
    # if you are using Celeborn for Flink, these settings are needed
    celeborn.worker.directMemoryRatioForReadBuffer 0.4
    celeborn.worker.directMemoryRatioToResume 0.6

    # These settings affect performance.
    # If there is enough off-heap memory, you can try to increase the read buffers.
    # Read buffer max memory usage for a data partition is `taskmanager.memory.segment-size * readBuffersMax`
    celeborn.worker.partition.initial.readBuffersMin 512
    celeborn.worker.partition.initial.readBuffersMax 1024
    celeborn.worker.readBuffer.allocationWait 10ms
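As a rough illustration of the read-buffer comment above (the 32 KiB segment size is an assumed value for taskmanager.memory.segment-size; substitute your own setting):

```bash
# Max read-buffer memory per data partition =
#   taskmanager.memory.segment-size * celeborn.worker.partition.initial.readBuffersMax
# Assuming a 32 KiB segment size and readBuffersMax = 1024:
segment_size_bytes=$((32 * 1024))
read_buffers_max=1024
echo "$(( segment_size_bytes * read_buffers_max / 1024 / 1024 )) MiB per partition"   # 32 MiB
```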
If Celeborn is installed at the same path on every node, fill in $CELEBORN_HOME/conf/hosts and use $CELEBORN_HOME/sbin/start-all.sh to start all services. If the installation paths are not identical, you will need to start the services manually:

Start the master: $CELEBORN_HOME/sbin/start-master.sh

Start each worker: $CELEBORN_HOME/sbin/start-worker.sh
If Celeborn starts successfully, the Master's log should look like this:

    22/10/08 19:29:11,805 INFO [main] Dispatcher: Dispatcher numThreads: 64
    22/10/08 19:29:11,875 INFO [main] TransportClientFactory: mode NIO threads 64
    22/10/08 19:29:12,057 INFO [main] Utils: Successfully started service 'MasterSys' on port 9097.
    22/10/08 19:29:12,113 INFO [main] Master: Metrics system enabled.
    22/10/08 19:29:12,125 INFO [main] HttpServer: master: HttpServer started on port 9098.
    22/10/08 19:29:12,126 INFO [main] Master: Master started.
    22/10/08 19:29:57,842 INFO [dispatcher-event-loop-19] Master: Registered worker Host: 192.168.15.140 RpcPort: 37359 PushPort: 38303 FetchPort: 37569 ReplicatePort: 37093 SlotsUsed: 0() LastHeartbeat: 0 Disks: {/mnt/disk1=DiskInfo(maxSlots: 6679, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk1, usableSpace: 448284381184, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk3=DiskInfo(maxSlots: 6716, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk3, usableSpace: 450755608576, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk2=DiskInfo(maxSlots: 6713, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk2, usableSpace: 450532900864, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk4=DiskInfo(maxSlots: 6712, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk4, usableSpace: 450456805376, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs } WorkerRef: null
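As a quick sanity check on the master node, a sketch that probes the RPC port (9097) and HTTP port (9098) reported in the log above:

```bash
# Minimal check: the master RPC port (9097) and HTTP port (9098) from the
# log above should both be listening on the master node.
for port in 9097 9098; do
  if nc -z localhost "$port"; then
    echo "port $port is up"
  else
    echo "port $port is NOT listening"
  fi
done
```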
Please refer to our website
Copy $CELEBORN_HOME/spark/*.jar to $SPARK_HOME/jars/
To use Celeborn, the following Spark configurations should be added.
    spark.shuffle.manager org.apache.spark.shuffle.celeborn.RssShuffleManager
    # must use the Kryo serializer because the Java serializer does not support relocation
    spark.serializer org.apache.spark.serializer.KryoSerializer

    # celeborn master
    spark.celeborn.master.endpoints clb-1:9097,clb-2:9097,clb-3:9097
    spark.shuffle.service.enabled false

    # options: hash, sort
    # Hash shuffle writer uses (partition count) * (celeborn.push.buffer.max.size) * (spark.executor.cores) memory.
    # Sort shuffle writer uses less memory than hash shuffle writer; if your shuffle partition count is large, try the sort shuffle writer.
    spark.celeborn.shuffle.writer hash

    # We recommend setting spark.celeborn.push.replicate.enabled to true to enable server-side data replication.
    # If you have only one worker, this setting must be false.
    spark.celeborn.push.replicate.enabled true

    # Support for Spark AQE only tested under Spark 3
    # we recommend setting localShuffleReader to false to get better performance with Celeborn
    spark.sql.adaptive.localShuffleReader.enabled false

    # we recommend enabling AQE support to gain better performance
    spark.sql.adaptive.enabled true
    spark.sql.adaptive.skewJoin.enabled true
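Alternatively, the same settings can be supplied per job on spark-submit instead of in spark-defaults.conf; a sketch with a placeholder application class and jar:

```bash
# Passing the Celeborn settings per job instead of via spark-defaults.conf.
# com.example.YourApp and YourApp.jar are placeholders.
spark-submit \
  --class com.example.YourApp \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.RssShuffleManager \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.celeborn.master.endpoints=clb-1:9097,clb-2:9097,clb-3:9097 \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.celeborn.push.replicate.enabled=true \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.localShuffleReader.enabled=false \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  YourApp.jar
```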
Copy $CELEBORN_HOME/flink/*.jar to $FLINK_HOME/lib/
To use Celeborn, the following Flink configurations should be added.
    shuffle-service-factory.class: org.apache.celeborn.plugin.flink.RemoteShuffleServiceFactory
    celeborn.master.endpoints: clb-1:9097,clb-2:9097,clb-3:9097
    celeborn.shuffle.batchHandleReleasePartition.enabled: true
    celeborn.push.maxReqsInFlight: 128

    # Network connections between peers
    celeborn.data.io.numConnectionsPerPeer: 16
    # the thread count may vary according to your cluster, but do not set it to 1
    celeborn.data.io.threads: 32
    celeborn.shuffle.batchHandleCommitPartition.threads: 32
    celeborn.rpc.dispatcher.numThreads: 32

    # Floating buffers may require changing `taskmanager.network.memory.fraction` and `taskmanager.network.memory.max`
    taskmanager.network.memory.floating-buffers-per-gate: 4096
    taskmanager.network.memory.buffers-per-channel: 0
    taskmanager.memory.task.off-heap.size: 512m
If you want to set up a production-ready Celeborn cluster, it should have at least 3 masters and at least 4 workers. Masters and workers can be deployed on the same node, but do not deploy multiple masters or multiple workers on the same node. See more detail in CONFIGURATIONS.
We provide a patch that enables users to run Spark with both Dynamic Resource Allocation (DRA) and Celeborn. For Spark 2.x check the Spark2 Patch; for Spark 3.x (including 3.4) check the Spark3 Patch.
Celeborn exposes various metrics; see METRICS for details.
The mailing list is the most recognized form of communication in the Apache community. You can contact us through the following mailing list.
| Name | Scope | | | |
|---|---|---|---|---|
| dev@celeborn.apache.org | Development-related discussions | Subscribe | Unsubscribe | Archives |
If you have any questions, feel free to file a 🔗Jira Ticket, or contact us and fix it by submitting a 🔗Pull Request.
| IM | Contact Info |
|---|---|
| Slack | 🔗Slack |
| DingTalk | 🔗DingTalk |
This is an active open-source project. We are always open to developers who want to use the system or contribute to it. See more detail in Contributing.
If you need to fully restart a Celeborn cluster in HA mode, you must clean the Ratis metadata storage first, because it retains stale state from the previous run of the cluster.
Here are some instructions:
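A sketch of the restart procedure, assuming the standard start/stop scripts under $CELEBORN_HOME/sbin and the Ratis storage directory configured earlier (/mnt/disk1/rss_ratis/):

```bash
# Full restart of an HA cluster (sketch; script names assume the standard
# $CELEBORN_HOME/sbin layout -- run each step on the relevant nodes).

# 1. Stop all workers, then all masters.
$CELEBORN_HOME/sbin/stop-worker.sh     # on every worker node
$CELEBORN_HOME/sbin/stop-master.sh     # on every master node

# 2. Clean the Ratis metadata storage on every master node
#    (path from celeborn.ha.master.ratis.raft.server.storage.dir above).
rm -rf /mnt/disk1/rss_ratis/*

# 3. Start masters first, then workers.
$CELEBORN_HOME/sbin/start-master.sh    # on every master node
$CELEBORN_HOME/sbin/start-worker.sh    # on every worker node
```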