Object stores offered by cloud service providers (CSPs), such as AWS S3, are a common place for Gluten users to store their data. This document describes the configurations and use cases for using Gluten with object stores. To use an S3 endpoint as your data source, make sure the following S3 configurations are set in your spark-defaults.conf. If you experience issues authenticating to S3 with additional authentication mechanisms, please reach out to us through the "Issues" tab.
S3 can be accessed through an endpoint. The following is an example configuration; you may need to modify some values to match your setup.

    spark.hadoop.fs.s3a.impl                    org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
    spark.hadoop.fs.s3a.access.key              XXXXXXXXX
    spark.hadoop.fs.s3a.secret.key              XXXXXXXXX
    spark.hadoop.fs.s3a.endpoint                https://s3.us-west-1.amazonaws.com
    spark.hadoop.fs.s3a.connection.ssl.enabled  true
    spark.hadoop.fs.s3a.path.style.access       false
S3 supports other access methods as well. For example, you can use instance credentials by setting the following configuration:

    spark.hadoop.fs.s3a.use.instance.credentials true

Note that in this case, "spark.hadoop.fs.s3a.endpoint" won't take effect, because Gluten uses the endpoint set during instance creation.
You can also use IAM role credentials by setting the following configurations. Instance credentials take priority over IAM role credentials.

    spark.hadoop.fs.s3a.iam.role              xxxx
    spark.hadoop.fs.s3a.iam.role.session.name xxxx

Note that spark.hadoop.fs.s3a.iam.role.session.name is optional.
You can change the log granularity of the AWS C++ SDK by setting the spark.gluten.velox.awsSdkLogLevel configuration. The allowed values are: "OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE".
You can control whether the S3 C++ client uses the proxy settings from the environment by setting the spark.gluten.velox.s3UseProxyFromEnv configuration. The allowed values are: "false", "true".
You can change the S3 payload signing policy by setting the spark.gluten.velox.s3PayloadSigningPolicy configuration. The allowed values are: "Always", "RequestDependent", "Never".
You can set the log location with the spark.gluten.velox.s3LogLocation configuration.
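As an illustration, the four settings above could be combined in spark-defaults.conf as follows. This is a sketch only: the values are examples picked from the allowed values listed above, and the log location path is a hypothetical placeholder, not a recommended default.

```properties
# Example values only; adjust them to your environment.
spark.gluten.velox.awsSdkLogLevel          WARN
spark.gluten.velox.s3UseProxyFromEnv       true
spark.gluten.velox.s3PayloadSigningPolicy  RequestDependent
# Hypothetical path, used here purely for illustration.
spark.gluten.velox.s3LogLocation           /tmp/gluten-s3-logs
```

Comments are placed on their own lines because text following a value on the same line would be read as part of the value.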
Velox supports a local cache when reading data from S3, but it has not been strictly tested and has several limitations. Please refer to the Velox Local Cache part for more detailed configurations.
All configuration names listed below are prefixed with spark.hadoop.fs.s3a. (including the trailing dot).
Legend: ✅ Supported, ❌ Not supported, ⚠️ Partial support, 🔄 In progress, 🚫 Not applicable or transparent to Gluten
Here is the list of Hadoop S3A file system configurations:
| Name | Default Value | Gluten Honored |
|---|---|---|
| aws.credentials.provider | (empty) | ⚠️ |
| security.credential.provider.path | (empty) | ❌ |
| assumed.role.arn | (empty) | ❌ |
| assumed.role.session.name | (empty) | ❌ |
| assumed.role.policy | (empty) | ❌ |
| assumed.role.session.duration | 30m | ❌ |
| assumed.role.sts.endpoint | (empty) | ❌ |
| assumed.role.sts.endpoint.region | (empty) | ❌ |
| assumed.role.credentials.provider | org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider | ❌ |
| delegation.token.binding | (empty) | ❌ |
| attempts.maximum | 5 | ❌ |
| socket.send.buffer | 8192 | ❌ |
| socket.recv.buffer | 8192 | ❌ |
| paging.maximum | 5000 | ❌ |
| multipart.size | 64M | ❌ |
| multipart.threshold | 128M | ❌ |
| multiobjectdelete.enable | true | ❌ |
| acl.default | (empty) | ❌ |
| multipart.purge | false | ❌ |
| multipart.purge.age | 86400 | ❌ |
| encryption.algorithm | (empty) | ❌ |
| encryption.key | (empty) | ❌ |
| signing-algorithm | (empty) | ❌ |
| block.size | 32M | ❌ |
| buffer.dir | ${env.LOCAL_DIRS:-${hadoop.tmp.dir}}/s3a | ❌ |
| fast.upload.buffer | disk | ❌ |
| fast.upload.active.blocks | 4 | ❌ |
| readahead.range | 64K | ❌ |
| user.agent.prefix | (empty) | |
| impl | org.apache.hadoop.fs.s3a.S3AFileSystem | ❌ |
| retry.limit | 7 | ✅ |
| retry.interval | 500ms | ❌ |
| retry.throttle.limit | 20 | ❌ |
| retry.throttle.interval | 100ms | ❌ |
| committer.name | file | 🚫 |
| committer.magic.enabled | true | 🚫 |
| committer.threads | 8 | 🚫 |
| committer.staging.tmp.path | tmp/staging | 🚫 |
| committer.staging.unique-filenames | true | 🚫 |
| committer.staging.conflict-mode | append | 🚫 |
| committer.abort.pending.uploads | true | 🚫 |
| list.version | 2 | 🚫 |
| etag.checksum.enabled | false | ❌ |
| change.detection.source | etag | ❌ |
| change.detection.mode | server | ❌ |
| change.detection.version.required | true | ❌ |
| ssl.channel.mode | default_jsse | ❌ |
| downgrade.syncable.exceptions | true | ❌ |
| create.checksum.algorithm | (empty) | ❌ |
| audit.enabled | true | ❌ |
| vectored.read.min.seek.size | 128K | ❌ |
| vectored.read.max.merged.size | 2M | ❌ |
| vectored.active.ranged.reads | 4 | ❌ |
| experimental.input.fadvise | random | ❌ |
| threads.max | 96 | ❌ |
| threads.keepalivetime | 60s | ❌ |
| executor.capacity | 16 | ❌ |
| max.total.tasks | 16 | ❌ |
| connection.maximum | 25 | ✅ |
| connection.keepalive | false | ❌ |
| connection.acquisition.timeout | 60s | ❌ |
| connection.establish.timeout | 30s | ❌ |
| connection.idle.time | 60s | ❌ |
| connection.request.timeout | 60s | ❌ |
| connection.timeout | 200s | ✅ |
| connection.ttl | 5m | ❌ |
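Because the names in the table omit the common prefix, an honored setting is written with the full spark.hadoop.fs.s3a. prefix in spark-defaults.conf. The sketch below spells out the three entries marked ✅, using the default values listed in the table as placeholder values; treat them as a starting point rather than tuned recommendations.

```properties
# Hadoop S3A settings that the table marks as honored by Gluten.
# Values shown are the defaults listed in the table.
spark.hadoop.fs.s3a.retry.limit         7
spark.hadoop.fs.s3a.connection.maximum  25
spark.hadoop.fs.s3a.connection.timeout  200s
```

Settings marked ❌ can still appear in the same file, but according to the table Gluten does not honor them.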
New parameters introduced by Gluten:
| Name | Default Value |
|---|---|
| access.key | (none) |
| secret.key | (none) |
| endpoint | (none) |
| connection.ssl.enabled | false |
| path.style.access | false |
| retry.limit | (none) |
| retry.mode | legacy |
| instance.credentials | false |
| iam.role | (none) |
| iam.role.session.name | gluten-session |
| endpoint.region | (none) |
| aws.imds.enabled | true |
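For illustration, the parameters above might be combined as follows to reach an S3-compatible store that is addressed with path-style URLs over plain HTTP. This is a hedged sketch: the endpoint host and port are hypothetical, the keys are placeholders, and whether a particular store works with Gluten this way is an assumption that should be verified against your deployment.

```properties
# Placeholder credentials; replace with your own.
spark.hadoop.fs.s3a.access.key              XXXXXXXXX
spark.hadoop.fs.s3a.secret.key              XXXXXXXXX
# Hypothetical endpoint of an S3-compatible service.
spark.hadoop.fs.s3a.endpoint                http://object-store.example.internal:9000
# Plain HTTP, so SSL is disabled; path-style access instead of virtual-hosted buckets.
spark.hadoop.fs.s3a.connection.ssl.enabled  false
spark.hadoop.fs.s3a.path.style.access       true
```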
Gluten configurations:
| Name | Default Value |
|---|---|
| spark.gluten.velox.awsSdkLogLevel | FATAL |
| spark.gluten.velox.s3UseProxyFromEnv | false |
| spark.gluten.velox.s3PayloadSigningPolicy | Never |
| spark.gluten.velox.s3LogLocation | (none) |