license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

| Key | Default | isDynamic | Description | Since | Deprecated |
| --- | --- | --- | --- | --- | --- |
| celeborn.<module>.fetch.timeoutCheck.interval | 5s | false | Interval for checking fetch data timeout. It only supports setting to data, since it works for shuffle client fetch data. | 0.3.0 |  |
| celeborn.<module>.fetch.timeoutCheck.threads | 4 | false | Number of threads for checking fetch data timeout. It only supports setting to data, since it works for shuffle client fetch data. | 0.3.0 |  |
| celeborn.<module>.heartbeat.interval | 60s | false | The heartbeat interval between worker and client. If setting to push, it works for worker receiving push data. If setting to fetch, it works for worker fetch server. If you are using the “celeborn.client.heartbeat.interval”, please use the new configs for each module according to your needs or replace it with “celeborn.push.heartbeat.interval” and “celeborn.fetch.heartbeat.interval”. | 0.3.0 | celeborn.client.heartbeat.interval |
| celeborn.<module>.io.backLog | 0 | false | Requested maximum length of the queue of incoming connections. Default 0 for no backlog. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. |  |  |
| celeborn.<module>.io.clientThreads | 0 | false | Number of threads used in the client thread pool. Default to 0, which is 2x#cores. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data, of which default value is determined by celeborn.<module>.io.threads. If setting to replicate, it works for replicate client of worker replicating data to peer worker. |  |  |
| celeborn.<module>.io.conflictAvoidChooser.enable | false | false | Whether to use conflict avoid event executor chooser in the client thread pool. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. | 0.5.4 |  |
| celeborn.<module>.io.connectTimeout | <value of celeborn.network.connect.timeout> | false | Socket connect timeout. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to replicate, it works for the replicate client of worker replicating data to peer worker. |  |  |
| celeborn.<module>.io.connectionTimeout | <value of celeborn.network.timeout> | false | Connection active timeout. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server or client of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. |  |  |
| celeborn.<module>.io.lazyFD | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. If setting to fetch, it works for worker fetch server. |  |  |
| celeborn.<module>.io.maxRetries | 3 | false | Max number of times we will retry on IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. If setting to data, it works for shuffle client push and fetch data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. If setting to push, it works for Flink shuffle client push data. |  |  |
| celeborn.<module>.io.mode | <undefined> | false | Netty EventLoopGroup backend, available options: NIO, EPOLL, KQUEUE. For Linux environments, EPOLL is used if available before using NIO. For macOS/BSD environments, KQUEUE is used if available before using NIO. |  |  |
| celeborn.<module>.io.numConnectionsPerPeer | 1 | false | Number of concurrent connections between two nodes. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. |  |  |
| celeborn.<module>.io.preferDirectBufs | true | false | If true, we will prefer allocating off-heap byte buffers within Netty. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server or client of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. |  |  |
| celeborn.<module>.io.receiveBuffer | 0b | false | Receive buffer size (SO_RCVBUF). Note: the optimal size for receive buffer and send buffer should be latency * network_bandwidth. Assuming latency = 1ms and network_bandwidth = 10Gbps, the buffer size should be ~1.25MB. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server or client of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. | 0.2.0 |  |
| celeborn.<module>.io.retryWait | 5s | false | Time that we will wait in order to perform a retry after an IOException. Only relevant if celeborn.<module>.io.maxRetries > 0. If setting to data, it works for shuffle client push and fetch data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. If setting to push, it works for Flink shuffle client push data. | 0.2.0 |  |
| celeborn.<module>.io.saslTimeout | 30s | false | Timeout for a single round trip of auth message exchange, in milliseconds. | 0.5.0 |  |
| celeborn.<module>.io.sendBuffer | 0b | false | Send buffer size (SO_SNDBUF). If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to data, it works for shuffle client push and fetch data. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server or client of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. | 0.2.0 |  |
| celeborn.<module>.io.serverThreads | 0 | false | Number of threads used in the server thread pool. Default to 0, which is 2x#cores. If setting to rpc_app, works for shuffle client. If setting to rpc_service, works for master or worker. If setting to push, it works for worker receiving push data. If setting to replicate, it works for replicate server of worker replicating data to peer worker. If setting to fetch, it works for worker fetch server. |  |  |
| celeborn.<module>.io.threads | 8 | false | Default number of threads used in the server and client thread pool. This specifies thread configuration based on JVM's allocation of cores. If setting to data, it works for shuffle client push and fetch data. |  |  |
| celeborn.<module>.push.timeoutCheck.interval | 5s | false | Interval for checking push data timeout. If setting to data, it works for shuffle client push data. If setting to push, it works for Flink shuffle client push data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. | 0.3.0 |  |
| celeborn.<module>.push.timeoutCheck.threads | 4 | false | Number of threads for checking push data timeout. If setting to data, it works for shuffle client push data. If setting to push, it works for Flink shuffle client push data. If setting to replicate, it works for replicate client of worker replicating data to peer worker. | 0.3.0 |  |
| celeborn.<role>.rpc.dispatcher.threads | <value of celeborn.rpc.dispatcher.threads> | false | Number of threads of the message dispatcher event loop for roles. |  |  |
| celeborn.io.maxDefaultNettyThreads | 64 | false | Max default netty threads. | 0.3.2 |  |
| celeborn.network.advertise.preferIpAddress | <value of celeborn.network.bind.preferIpAddress> | false | When true, prefer to use IP address, otherwise FQDN for advertise address. | 0.6.0 |  |
| celeborn.network.bind.preferIpAddress | true | false | When true, prefer to use IP address, otherwise FQDN. This configuration only takes effect when the bind hostname is not set explicitly; in that case, Celeborn will find the first non-loopback address to bind. | 0.3.0 |  |
| celeborn.network.bind.wildcardAddress | false | false | When true, the bind address will be set to a wildcard address, while the advertise address will remain as whatever is set by celeborn.network.advertise.preferIpAddress. The wildcard address is a special local IP address, and usually refers to ‘any’ and can only be used for bind operations. In the case of IPv4, this is 0.0.0.0 and in the case of IPv6 this is ::0. This is helpful in dual-stack environments, where the service must listen to both IPv4 and IPv6 clients. | 0.6.0 |  |
| celeborn.network.connect.timeout | 10s | false | Default socket connect timeout. | 0.2.0 |  |
| celeborn.network.memory.allocator.numArenas | <undefined> | false | Number of arenas for pooled memory allocator. Default value is Runtime.getRuntime.availableProcessors, min value is 2. | 0.3.0 |  |
| celeborn.network.memory.allocator.verbose.metric | false | false | Whether to enable verbose metric for pooled allocator. | 0.3.0 |  |
| celeborn.network.timeout | 240s | false | Default timeout for network operations. | 0.2.0 |  |
| celeborn.port.maxRetries | 1 | false | When the port is occupied, we will retry up to this many times. | 0.2.0 |  |
| celeborn.rpc.askTimeout | 60s | false | Timeout for RPC ask operations. It's recommended to set at least 240s when HDFS is enabled in celeborn.storage.availableTypes. | 0.2.0 |  |
| celeborn.rpc.connect.threads | 64 | false |  | 0.2.0 |  |
| celeborn.rpc.dispatcher.threads | 0 | false | Number of threads of the message dispatcher event loop. Default to 0, which is availableCore. | 0.3.0 | celeborn.rpc.dispatcher.numThreads |
| celeborn.rpc.dump.interval | 60s | false | Minimum interval (ms) for the RPC framework to dump performance summary. | 0.6.0 |  |
| celeborn.rpc.inbox.capacity | 0 | false | Specifies the size of the in-memory bounded capacity. | 0.5.0 |  |
| celeborn.rpc.io.threads | <undefined> | false | Netty IO thread number of NettyRpcEnv to handle RPC request. The default threads number is the number of runtime available processors. | 0.2.0 |  |
| celeborn.rpc.lookupTimeout | 30s | false | Timeout for RPC lookup operations. | 0.2.0 |  |
| celeborn.rpc.retryWait | 1s | false | Time to wait before next retry on RpcTimeoutException. | 0.5.4 |  |
| celeborn.rpc.slow.interval | <undefined> | false | Minimum interval (ms) for the RPC framework to log slow RPC. | 0.6.0 |  |
| celeborn.rpc.slow.threshold | 1s | false | Threshold for the RPC framework to log slow RPC. | 0.6.0 |  |
| celeborn.shuffle.io.maxChunksBeingTransferred | <undefined> | false | The max number of chunks allowed to be transferred at the same time on shuffle service. Note that new incoming connections will be closed when the max number is hit. The client will retry according to the shuffle retry configs (see celeborn.<module>.io.maxRetries and celeborn.<module>.io.retryWait); if those limits are reached, the task will fail with fetch failure. | 0.2.0 |  |
| celeborn.ssl.<module>.enabled | false | false | Enables SSL for securing wire traffic. | 0.5.0 |  |
| celeborn.ssl.<module>.enabledAlgorithms | <undefined> | false | A comma-separated list of ciphers. The specified ciphers must be supported by JVM. The reference list of cipher suites can be found in the “JSSE Cipher Suite Names” section of the Java security guide. The list for Java 11, for example, can be found at this page. Note: If not set, the default cipher suite for the JRE will be used. | 0.5.0 |  |
| celeborn.ssl.<module>.keyStore | <undefined> | false | Path to the key store file. The path can be absolute or relative to the directory in which the process is started. | 0.5.0 |  |
| celeborn.ssl.<module>.keyStorePassword | <undefined> | false | Password to the key store. | 0.5.0 |  |
| celeborn.ssl.<module>.protocol | TLSv1.2 | false | TLS protocol to use. The protocol must be supported by JVM. The reference list of protocols can be found in the “Additional JSSE Standard Names” section of the Java security guide. For Java 11, for example, the list can be found here. | 0.5.0 |  |
| celeborn.ssl.<module>.sslHandshakeTimeoutMs | 10s | false | The timeout for the SSL handshake (in milliseconds). The default value is set to the current Netty default. This is applicable for rpc_app and rpc_service modules. | 0.5.4 |  |
| celeborn.ssl.<module>.trustStore | <undefined> | false | Path to the trust store file. The path can be absolute or relative to the directory in which the process is started. | 0.5.0 |  |
| celeborn.ssl.<module>.trustStorePassword | <undefined> | false | Password for the trust store. | 0.5.0 |  |
| celeborn.ssl.<module>.trustStoreReloadIntervalMs | 10s | false | The interval at which the trust store should be reloaded (in milliseconds), when enabled. This setting is mostly only useful for server components, not applications. | 0.5.0 |  |
| celeborn.ssl.<module>.trustStoreReloadingEnabled | false | false | Whether the trust store should be reloaded periodically. This setting is mostly only useful for Celeborn services (masters, workers), and not applications. | 0.5.0 |  |
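
The sizing note for celeborn.<module>.io.receiveBuffer (buffer size of roughly latency * network_bandwidth) can be worked through with a small script. The following is a minimal sketch, not part of Celeborn: the helper names `bdp_bytes` and `buffer_settings` are hypothetical, the module names are the ones listed in the table, and the `<n>b` value format is an assumption based on the documented default of `0b`.

```python
# Hypothetical helpers for illustration only; they are not part of Celeborn.

# Module names taken from the receiveBuffer / sendBuffer descriptions.
MODULES = ("rpc_app", "rpc_service", "data", "push", "replicate", "fetch")


def bdp_bytes(latency_s: float, bandwidth_bps: float) -> int:
    """Bandwidth-delay product in bytes: latency (s) * bandwidth (bits/s) / 8."""
    return round(latency_s * bandwidth_bps / 8)


def buffer_settings(module: str, latency_s: float, bandwidth_bps: float) -> dict:
    """Build receive/send buffer keys for one module, sized to the BDP.

    The "<n>b" byte-size suffix is assumed from the documented default "0b".
    """
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    size = f"{bdp_bytes(latency_s, bandwidth_bps)}b"
    return {
        f"celeborn.{module}.io.receiveBuffer": size,
        f"celeborn.{module}.io.sendBuffer": size,
    }


if __name__ == "__main__":
    # 1 ms latency on a 10 Gbps link, as in the table's example.
    for key, value in buffer_settings("fetch", 0.001, 10e9).items():
        print(key, value)
```

With 1 ms latency and 10 Gbps bandwidth this gives 1,250,000 bytes, matching the ~1.25MB figure in the description. Whether a value that large suits a given deployment depends on the operating system's socket buffer limits, which this sketch does not check.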