docs/faq/lakehouse-faq.md

{ “title”: “Data Lakehouse FAQ”, “language”: “en” }

Certificate Issues

When querying, an error curl 77: Problem with the SSL CA cert. occurs. This indicates that the current system certificate is too old and needs to be updated locally.
- You can download the latest CA certificate from https://curl.haxx.se/docs/caextract.html.
- Place the downloaded cacert-xxx.pem into the /etc/ssl/certs/ directory, for example: sudo cp cacert-xxx.pem /etc/ssl/certs/ca-certificates.crt.
When querying, an error occurs: ERROR 1105 (HY000): errCode = 2, detailMessage = (x.x.x.x)[CANCELLED][INTERNAL_ERROR]error setting certificate verify locations: CAfile: /etc/ssl/certs/ca-certificates.crt CApath: none.

yum install -y ca-certificates
ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-certificates.crt

Kerberos

When connecting to a Hive Metastore authenticated with Kerberos, an error GSS initiate failed is encountered.
This is usually due to incorrect Kerberos authentication information. You can troubleshoot by following these steps:
1. In versions prior to 1.2.1, the libhdfs3 library that Doris depends on did not enable gsasl. Please update to versions 1.2.2 and later.
2. Ensure that correct keytab and principal are set for each component and verify that the keytab file exists on all FE and BE nodes.
  - hadoop.kerberos.keytab/hadoop.kerberos.principal: Used for Hadoop hdfs access, fill in the corresponding values for hdfs.
  - hive.metastore.kerberos.principal: Used for hive metastore.
3. Try replacing the IP in the principal with a domain name (do not use the default _HOST placeholder).
4. Ensure that the /etc/krb5.conf file exists on all FE and BE nodes.
When connecting to a Hive database through the Hive Catalog, an error occurs: RemoteException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS].
If the error occurs during the query when there are no issues with show databases and show tables, follow these two steps:
- Place core-site.xml and hdfs-site.xml in the fe/conf and be/conf directories.
- Execute Kerberos kinit on the BE node, restart BE, and then proceed with the query.
When encountering the error GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos Ticket) while querying a table configured with Kerberos, restarting FE and BE nodes usually resolves the issue.
- Before restarting all nodes, configure -Djavax.security.auth.useSubjectCredsOnly=false in the JAVA_OPTS parameter in "${DORIS_HOME}/be/conf/be.conf" to obtain JAAS credentials information through the underlying mechanism rather than the application.
- Refer to JAAS Troubleshooting for solutions to common JAAS errors.
To resolve the error Unable to obtain password from user when configuring Kerberos in the Catalog:
- Ensure the principal used is listed in klist by checking with klist -kt your.keytab.
- Verify the catalog configuration for any missing settings such as yarn.resourcemanager.principal.
- If the above checks are fine, it may be due to the JDK version installed by the system's package manager not supporting certain encryption algorithms. Consider installing JDK manually and setting the JAVA_HOME environment variable.
- Kerberos typically uses AES-256 for encryption. For Oracle JDK, JCE must be installed. Some distributions of OpenJDK automatically provide unlimited strength JCE, eliminating the need for separate installation.
- JCE versions correspond to JDK versions; download the appropriate JCE zip package and extract it to the $JAVA_HOME/jre/lib/security directory based on the JDK version:
  - JDK6: JCE6
  - JDK7: JCE7
  - JDK8: JCE8
When encountering the error java.security.InvalidKeyException: Illegal key size while accessing HDFS with KMS, upgrade the JDK version to >= Java 8 u162 or install the corresponding JCE Unlimited Strength Jurisdiction Policy Files.
If configuring Kerberos in the Catalog results in the error SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS], place the core-site.xml file in the "${DORIS_HOME}/be/conf" directory.
If accessing HDFS results in the error No common protection layer between client and server, ensure that the hadoop.rpc.protection properties on the client and server are consistent.
```
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>

</configuration>
```
When using Broker Load with Kerberos configured and encountering the error Cannot locate default realm.:
Add the configuration item -Djava.security.krb5.conf=/your-path to the JAVA_OPTS in the start_broker.sh script for Broker Load.
When using Kerberos configuration in the Catalog, the hadoop.username property cannot be used simultaneously.
Accessing Kerberos with JDK 17
When running Doris with JDK 17 and accessing Kerberos services, you may encounter issues accessing due to the use of deprecated encryption algorithms. You need to add the allow_weak_crypto=true property in krb5.conf or upgrade the encryption algorithm in Kerberos.
For more details, refer to: https://seanjmullan.org/blog/2021/09/14/jdk17#kerberos

JDBC Catalog

Error connecting to SQLServer via JDBC Catalog: unable to find valid certification path to requested target
Add the trustServerCertificate=true option in the jdbc_url.
Connecting to MySQL database via JDBC Catalog results in Chinese character garbling or incorrect Chinese character query conditions
Add useUnicode=true&characterEncoding=utf-8 in the jdbc_url.
Note: Starting from version 1.2.3, when connecting to MySQL database via JDBC Catalog, these parameters will be automatically added.
Error connecting to MySQL database via JDBC Catalog: Establishing SSL connection without server's identity verification is not recommended
Add useSSL=true in the jdbc_url.
When synchronizing MySQL data to Doris using JDBC Catalog, date data synchronization error occurs. Verify if the MySQL version matches the MySQL driver package, for example, MySQL 8 and above require the driver com.mysql.cj.jdbc.Driver.

Hive Catalog

Error accessing Iceberg table via Hive Metastore: failed to get schema or Storage schema reading not supported
Place the relevant iceberg runtime jar files in Hive's lib/ directory.
Configure in hive-site.xml:
```
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
```
After configuration, restart the Hive Metastore.

Error connecting to Hive Catalog: Caused by: java.lang.NullPointerException

If the fe.log contains the following stack trace:

Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.getFilteredObjects(AuthorizationMetaStoreFilterHook.java:78) ~[hive-exec-3.1.3-core.jar:3.1.3]
    at org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook.filterDatabases(AuthorizationMetaStoreFilterHook.java:55) ~[hive-exec-3.1.3-core.jar:3.1.3]
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1548) ~[doris-fe.jar:3.1.3]
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1542) ~[doris-fe.jar:3.1.3]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]

Try adding "metastore.filter.hook" = "org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl" in the create catalog statement to resolve.

If after creating Hive Catalog, show tables works fine but querying results in java.net.UnknownHostException: xxxxx
Add the following in the CATALOG's PROPERTIES:
```
'fs.defaultFS' = 'hdfs://<your_nameservice_or_actually_HDFS_IP_and_port>'
```
Tables in orc format in Hive 1.x may encounter system column names in the underlying orc file schema as _col0, _col1, _col2, etc. In this case, add hive.version as 1.x.x in the catalog configuration to map with the column names in the hive table.
```
CREATE CATALOG hive PROPERTIES (
    'hive.version' = '1.x.x'
);
```
When querying table data using Catalog, errors related to Hive Metastore such as Invalid method name are encountered, set the hive.version parameter.
When querying a table in ORC format, if the FE reports Could not obtain block or Caused by: java.lang.NoSuchFieldError: types, it may be due to the FE accessing HDFS to retrieve file information and perform file splitting by default. In some cases, the FE may not be able to access HDFS. This can be resolved by adding the following parameter: "hive.exec.orc.split.strategy" = "BI". Other options include HYBRID (default) and ETL.

In Hive, you can find the partition field values of a Hudi table, but in Doris, you cannot. Doris and Hive currently have different ways of querying Hudi. In Doris, you need to add the partition fields in the avsc file structure of the Hudi table. If not added, Doris will query with partition_val being empty (even if hoodie.datasource.hive_sync.partition_fields=partition_val is set).

{
    "type": "record",
    "name": "record",
    "fields": [{
        "name": "partition_val",
        "type": [
            "null",
            "string"
            ],
        "doc": "Preset partition field, empty string when not partitioned",
        "default": null
        },
        {
        "name": "name",
        "type": "string",
        "doc": "Name"
        },
        {
        "name": "create_time",
        "type": "string",
        "doc": "Creation time"
        }
    ]
}

When querying a Hive external table, if you encounter the error java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found, search for hadoop-lzo-*.jar in the Hadoop environment, place it in the "${DORIS_HOME}/fe/lib/" directory, and restart the FE. Starting from version 2.0.2, you can place this file in the custom_lib/ directory of the FE (if it does not exist, create it manually) to prevent file loss when upgrading the cluster due to the lib directory being replaced.
When creating a Hive table specifying the serde as org.apache.hadoop.hive.contrib.serde2.MultiDelimitserDe, and encountering the error storage schema reading not supported when accessing the table, add the following configuration to the hive-site.xml file and restart the HMS service:
```
<property>
  <name>metastore.storage.schema.reader.impl</name>
  <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property> 
```

Error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty. The complete error message in the FE log is as follows:

org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
org.apache.doris.common.UserException: errCode = 2, detailMessage = S3 list path exception. path=s3://bucket/part-*, err: errCode = 2, detailMessage = S3 list path failed. path=s3://bucket/part-*,msg=errors while get file status listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
org.apache.hadoop.fs.s3a.AWSClientIOException: listStatus on s3://bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: javax.net.ssl.SSLException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty
Caused by: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty

Try updating the CA certificate on the FE node using update-ca-trust (CentOS/RockyLinux), and then restart the FE process.

BE error: java.lang.InternalError. If you see an error similar to the following in be.INFO:

W20240506 15:19:57.553396 266457 jni-util.cpp:259] java.lang.InternalError
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.init(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.<init>(ZlibDecompressor.java:114)
        at org.apache.hadoop.io.compress.GzipCodec$GzipZlibDecompressor.<init>(GzipCodec.java:229)
        at org.apache.hadoop.io.compress.GzipCodec.createDecompressor(GzipCodec.java:188)
        at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:183)
        at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:99)
        at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:223)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:212)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:43)

It is because the Doris built-in libz.a conflicts with the system environment's libz.so. To resolve this issue, first execute export LD_LIBRARY_PATH=/path/to/be/lib:$LD_LIBRARY_PATH, and then restart the BE process.

HDFS

When accessing HDFS 3.x, if you encounter the error java.lang.VerifyError: xxx, in versions prior to 1.2.1, Doris depends on Hadoop version 2.8. You need to update to 2.10.2 or upgrade Doris to versions after 1.2.2.
Using Hedged Read to optimize slow HDFS reads. In some cases, high load on HDFS may lead to longer read times for data replicas on a specific HDFS, thereby slowing down overall query efficiency. The HDFS Client provides the Hedged Read feature. This feature initiates another read thread to read the same data if a read request exceeds a certain threshold without returning, and the result returned first is used.
Note: This feature may increase the load on the HDFS cluster, so use it judiciously.
You can enable this feature in two ways:
- Specify it in the parameters when creating the Catalog:
```
create catalog regression properties (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.16.47:7004',
    'dfs.client.hedged.read.threadpool.size' = '128',
    'dfs.client.hedged.read.threshold.millis' = "500"
);
```

dfs.client.hedged.read.threadpool.size represents the number of threads used for Hedged Read, which are shared by an HDFS Client. Typically, for an HDFS cluster, BE nodes will share an HDFS Client.

dfs.client.hedged.read.threshold.millis is the read threshold in milliseconds. When a read request exceeds this threshold without returning, a Hedged Read is triggered.

When enabled, you can see the related parameters in the Query Profile:

TotalHedgedRead: Number of times Hedged Read was initiated.

HedgedReadWins: Number of successful Hedged Reads (times when the request was initiated and returned faster than the original request)

Note that these values are cumulative for a single HDFS Client, not for a single query. The same HDFS Client can be reused by multiple queries.

Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
In the start scripts of FE and BE, the environment variable HADOOP_CONF_DIR is added to the CLASSPATH. If HADOOP_CONF_DIR is set incorrectly, such as pointing to a non-existent or incorrect path, it may load the wrong xxx-site.xml file, resulting in reading incorrect information.
Check if HADOOP_CONF_DIR is configured correctly or remove this environment variable.
BlockMissingExcetpion: Could not obtain block: BP-XXXXXXXXX No live nodes contain current block
Possible solutions include:
- Use hdfs fsck file -files -blocks -locations to check if the file is healthy.
- Check connectivity with datanodes using telnet.
- Check datanode logs.
If you encounter the following error:
org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected SASL data transfer protection handshake from client at /XXX.XXX.XXX.XXX:XXXXX. Perhaps the client is running an older version of Hadoop which does not support SASL data transfer protection it means that the current hdfs has enabled encrypted transmission, but the client has not, causing the error.
Use any of the following solutions:
- Copy hdfs-site.xml and core-site.xml to be/conf and fe/conf directories. (Recommended)
- In hdfs-site.xml, find the corresponding configuration dfs.data.transfer.protection and set this parameter in the catalog.

DLF Catalog

When using the DLF Catalog, if Invalid address occurs during BE reading JindoFS data, add the domain name appearing in the logs to IP mapping in /etc/hosts.
If there is no permission to read data, use the hadoop.username property to specify a user with permission.
The metadata in the DLF Catalog should be consistent with DLF. When managing metadata using DLF, newly imported partitions in Hive may not be synchronized by DLF, leading to inconsistencies between DLF and Hive metadata. To address this, ensure that Hive metadata is fully synchronized by DLF.

Other Issues

Query results in garbled characters after mapping Binary type to Doris
Doris natively does not support the Binary type, so when mapping Binary types from various data lakes or databases to Doris, it is usually done using the String type. The String type can only display printable characters. If you need to query the content of Binary data, you can use the TO_BASE64() function to convert it to Base64 encoding before further processing.
Analyzing Parquet files
When querying Parquet files, due to potential differences in the format of Parquet files generated by different systems, such as the number of RowGroups, index values, etc., sometimes it is necessary to check the metadata of Parquet files for issue identification or performance analysis. Here is a tool provided to help users analyze Parquet files more conveniently:
1. Download and unzip Apache Parquet Cli 1.14.0
2. Download the Parquet file to be analyzed to your local machine, assuming the path is /path/to/file.parquet
3. Use the following command to analyze the metadata of the Parquet file:
  ./parquet-tools meta /path/to/file.parquet
4. For more functionalities, refer to Apache Parquet Cli documentation