| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="s3" rev="2.2.0"> |
| |
| <title>Using Impala with Amazon S3 Object Store</title> |
| <titlealts audience="PDF"><navtitle>S3 Tables</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Amazon"/> |
| <data name="Category" value="S3"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Preview Features"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p rev="2.2.0"> You can use Impala to query data residing on the Amazon S3 |
| object store. This capability allows convenient access to a storage system |
| that is remotely managed, accessible from anywhere, and integrated with |
| various cloud-based services. Impala can query files in any supported file |
format from S3. The S3 storage location can apply to an entire table or
      to individual partitions in a partitioned table. </p>
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| <concept id="s3_best_practices" rev="2.6.0 IMPALA-1878"> |
| <title>Best Practices for Using Impala with S3</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Guidelines"/> |
| <data name="Category" value="Best Practices"/> |
| </metadata> |
| </prolog> |
| <conbody> |
| <p> The following guidelines summarize the best practices described in the |
| rest of this topic: </p> |
| <ul> |
| <li> |
<p> Any reference to an S3 location must be fully qualified when S3 is
          not designated as the default storage, for example,
            <codeph>s3a://[s3-bucket-name]</codeph>.</p>
| </li> |
| <li> |
<p> Set <codeph>fs.s3a.connection.maximum</codeph> to 1500 for
            <cmdname>impalad</cmdname>, as in the configuration sketch after
          this list. </p>
| </li> |
| <li> |
| <p> Set <codeph>fs.s3a.block.size</codeph> to 134217728 (128 MB in |
| bytes) if most Parquet files queried by Impala were written by Hive |
| or ParquetMR jobs. </p> |
| <p>Set the block size to 268435456 (256 MB in bytes) if most Parquet |
| files queried by Impala were written by Impala. </p> |
| <p>Starting in Impala 3.4.0, instead of |
| <codeph>fs.s3a.block.size</codeph>, the |
| <codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option |
| controls the Parquet-specific split size. The default value is 256 |
| MB.</p> |
| </li> |
| <li> |
| <p> |
| <codeph>DROP TABLE .. PURGE</codeph> is much faster than the default |
| <codeph>DROP TABLE</codeph>. The same applies to <codeph>ALTER |
| TABLE ... DROP PARTITION PURGE</codeph> versus the default |
| <codeph>DROP PARTITION</codeph> operation. Due to the eventually |
| consistent nature of S3, the files for that table or partition could |
| remain for some unbounded time when using <codeph>PURGE</codeph>. |
| The default <codeph>DROP TABLE/PARTITION</codeph> is slow because |
| Impala copies the files to the HDFS trash folder, and Impala waits |
| until all the data is moved. <codeph>DROP TABLE/PARTITION .. |
| PURGE</codeph> is a fast delete operation, and the Impala |
| statement finishes quickly even though the change might not have |
| propagated fully throughout S3. </p> |
| </li> |
| <li> |
| <p> |
| <codeph>INSERT</codeph> statements are faster than <codeph>INSERT |
| OVERWRITE</codeph> for S3. The query option |
| <codeph>S3_SKIP_INSERT_STAGING</codeph>, which is set to |
| <codeph>true</codeph> by default, skips the staging step for |
| regular <codeph>INSERT</codeph> (but not <codeph>INSERT |
| OVERWRITE</codeph>). This makes the operation much faster, but |
| consistency is not guaranteed: if a node fails during execution, the |
table could end up with inconsistent data. Set this option to
            <codeph>false</codeph> if stronger consistency is required;
          however, this setting makes <codeph>INSERT</codeph>
          operations slower. </p>
| <ul> |
| <li> |
<p> For Impala-ACID tables, both <codeph>INSERT</codeph> and
                <codeph>INSERT OVERWRITE</codeph> statements for S3 tables
              are fast, regardless of the setting of
                <codeph>S3_SKIP_INSERT_STAGING</codeph>. In addition,
              consistency is guaranteed with ACID tables.</p>
| </li> |
| </ul> |
| </li> |
| <li>Enable <xref href="impala_data_cache.xml#data_cache">data cache for |
| remote reads</xref>.</li> |
| <li>Enable <xref |
| href="https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/s3guard.html" |
| format="html" scope="external">S3Guard</xref> in your cluster for |
| data consistency.</li> |
| <li> |
<p> A large number of files in a table can make metadata loading and
          updating slow in S3. If too many requests are made to S3, S3
          throttles them with a back-off mechanism and responds more slowly
          than usual.</p>
| <ul> |
| <li>If you have many small files due to over-granular partitioning, |
| configure partitions with many megabytes of data so that even a |
| query against a single partition can be parallelized effectively. </li> |
| <li>If you have many small files because of many small |
| <codeph>INSERT</codeph> queries, use bulk |
| <codeph>INSERT</codeph>s so that more data is written to fewer |
| files. </li> |
| </ul> |
| </li> |
| </ul> |
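    <p> As a minimal sketch, the two <codeph>fs.s3a</codeph> settings above
      are specified in the <filepath>core-site.xml</filepath> file used by
        <cmdname>impalad</cmdname> (the values shown assume Parquet files
      written by Hive or ParquetMR jobs): </p>
    <codeblock><property>
  <name>fs.s3a.connection.maximum</name>
  <value>1500</value>
</property>
<property>
  <name>fs.s3a.block.size</name>
  <value>134217728</value>
</property>
</codeblock>
    <p> The <codeph>PURGE</codeph> variants are ordinary SQL statements, for
      example (the table and partition names here are hypothetical): </p>
    <codeblock>DROP TABLE staging_s3 PURGE;
ALTER TABLE sales_s3 DROP PARTITION (year=2017) PURGE;</codeblock>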
| </conbody> |
| </concept> |
| |
| <concept id="s3_sql"> |
| <title>How Impala SQL Statements Work with S3</title> |
| <conbody> |
| <p> Impala SQL statements work with data in S3 as follows: </p> |
| <ul> |
| <li> |
| <p> The <xref href="impala_create_table.xml#create_table">CREATE |
| TABLE</xref> or <xref href="impala_alter_table.xml#alter_table" |
| >ALTER TABLE</xref> statement can specify that a table resides in |
| the S3 object store by encoding an <codeph>s3a://</codeph> prefix |
| for the <codeph>LOCATION</codeph> property. <codeph>ALTER |
| TABLE</codeph> can also set the <codeph>LOCATION</codeph> property |
| for an individual partition so that some data in a table resides in |
S3 and other data in the same table resides on HDFS (see the
          sketch after this list). </p>
| </li> |
| <li> |
| <p> Once a table or partition is designated as residing in S3, the |
| <xref href="impala_select.xml#select"/> statement transparently |
| accesses the data files from the appropriate storage layer. </p> |
| </li> |
| <li> |
| <p> |
| If the S3 table is an internal table, the <xref |
| href="impala_drop_table.xml#drop_table">DROP TABLE</xref> statement |
| removes the corresponding data files from S3 when the table is dropped. |
| </p> |
| </li> |
| <li> |
| <p> The <xref href="impala_truncate_table.xml#truncate_table">TRUNCATE |
| TABLE</xref> statement always removes the corresponding |
| data files from S3 when the table is truncated. </p> |
| </li> |
| <li> |
| <p> |
| The <xref href="impala_load_data.xml#load_data">LOAD DATA</xref> |
| statement can move data files residing in HDFS into |
| an S3 table. |
| </p> |
| </li> |
| <li> |
| <p> |
| The <xref href="impala_insert.xml#insert">INSERT</xref> statement, or the <codeph>CREATE TABLE AS SELECT</codeph> |
| form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another S3 |
| table into an S3 table. The <xref |
| href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING</xref> |
| query option chooses whether or not to use a fast code path for these write operations to S3, |
| with the tradeoff of potential inconsistency in the case of a failure during the statement. |
| </p> |
| </li> |
| </ul> |
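    <p> The following minimal sketch shows several of these statements
      together; the table names, bucket, and paths are hypothetical: </p>
    <codeblock>-- Create a partitioned table whose base location is in S3.
CREATE TABLE sales_s3 (id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (year INT)
  LOCATION 's3a://impala-demo/sales';

-- Add a partition; its data files live under the S3 base location.
ALTER TABLE sales_s3 ADD PARTITION (year=2018);

-- Move data files residing in HDFS into the S3 table.
LOAD DATA INPATH '/user/etl/staging/sales_2018'
  INTO TABLE sales_s3 PARTITION (year=2018);

-- Copy data from another table, then query transparently.
INSERT INTO sales_s3 PARTITION (year=2018)
  SELECT id, amount FROM sales_hdfs WHERE year = 2018;
SELECT COUNT(*) FROM sales_s3 WHERE year = 2018;</codeblock>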
| <p> |
| For usage information about Impala SQL statements with S3 tables, see <xref href="impala_s3.xml#s3_ddl"/> |
| and <xref href="impala_s3.xml#s3_dml"/>. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="s3_creds"> |
| |
| <title>Specifying Impala Credentials to Access Data in S3</title> |
| |
| <conbody> |
| |
| <p> To allow Impala to access data in S3, specify values for the following |
| configuration settings in your <filepath>core-site.xml</filepath> file: </p> |
| <codeblock> |
| <property> |
| <name>fs.s3a.access.key</name> |
| <value><varname>your_access_key</varname></value> |
| </property> |
| <property> |
| <name>fs.s3a.secret.key</name> |
| <value><varname>your_secret_key</varname></value> |
| </property> |
| </codeblock> |
| |
| <p> After specifying the credentials, restart both the Impala and Hive |
| services. Restarting Hive is required because Impala statements, such as |
| <codeph>CREATE TABLE</codeph>, go through the Hive Metastore. </p> |
| |
| <note type="important"> |
| <p> |
| Although you can specify the access key ID and secret key as part of the <codeph>s3a://</codeph> URL in the |
| <codeph>LOCATION</codeph> attribute, doing so makes this sensitive information visible in many places, such |
| as <codeph>DESCRIBE FORMATTED</codeph> output and Impala log files. Therefore, specify this information |
| centrally in the <filepath>core-site.xml</filepath> file, and restrict read access to that file to only |
| trusted users. |
| </p> |
| </note> |
<p>See <xref
        href="https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3"
        format="html" scope="external">Authenticating with S3</xref> for
      additional authentication mechanisms to access S3.</p>
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="s3_etl"> |
| |
| <title>Loading Data into S3 for Impala Queries</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| If your ETL pipeline involves moving data into S3 and then querying through Impala, |
| you can either use Impala DML statements to create, move, or copy the data, or |
| use the same data loading techniques as you would for non-Impala data. |
| </p> |
| |
| </conbody> |
| |
| <concept id="s3_dml" rev="2.6.0 IMPALA-1878"> |
| <title>Using Impala DML Statements for S3 Data</title> |
| <conbody> |
| <p>The Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD |
| DATA</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>) can |
| write data into a table or partition that resides in S3. The syntax of |
| the DML statements is the same as for any other tables because the S3 |
| location for tables and partitions is specified by an |
| <codeph>s3a://</codeph> prefix in the <codeph>LOCATION</codeph> |
| attribute of <codeph>CREATE TABLE</codeph> or <codeph>ALTER |
| TABLE</codeph> statements. If you bring data into S3 using the |
| normal S3 transfer mechanisms instead of Impala DML statements, issue |
| a <codeph>REFRESH</codeph> statement for the table before using Impala |
| to query the S3 data.</p> |
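      <p> As a brief sketch (the table names are hypothetical), a typical
        sequence might be: </p>
      <codeblock>-- Data written through Impala DML needs no REFRESH afterward.
INSERT INTO logs_s3 SELECT * FROM logs_staging;

-- After files arrive through S3 transfer tools instead of Impala,
-- make Impala aware of the new data files.
REFRESH logs_s3;</codeblock>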
| <p conref="../shared/impala_common.xml#common/s3_dml_performance"/> |
| </conbody> |
| </concept> |
| |
| <concept id="s3_manual_etl"> |
| <title>Manually Loading Data into Impala Tables in S3</title> |
| <conbody> |
| <p> |
| As an alternative, or on earlier Impala releases without DML support for S3, |
| you can use the Amazon-provided methods to bring data files into S3 for querying through Impala. See |
| <xref href="http://aws.amazon.com/s3/" scope="external" format="html">the Amazon S3 web site</xref> for |
| details. |
| </p> |
| |
| <note type="important"> |
| <p conref="../shared/impala_common.xml#common/s3_drop_table_purge"/> |
| </note> |
| |
| <p> After you upload data files to a location already mapped to an |
| Impala table or partition, or if you delete files in S3 from such a |
| location, issue the <codeph>REFRESH</codeph> statement to make Impala |
| aware of the new set of data files. </p> |
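      <p> For example, a minimal sketch assuming the AWS CLI and a
        hypothetical table <codeph>t1</codeph> mapped to the target S3
        path: </p>
      <codeblock>$ aws s3 cp localfile.parq s3://impala-demo/dir1/dir2/dir3/t1/
$ impala-shell -q "REFRESH t1"</codeblock>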
| |
| </conbody> |
| </concept> |
| |
| </concept> |
| |
| <concept id="s3_ddl"> |
| |
| <title>Creating Impala Databases, Tables, and Partitions for Data Stored in |
| S3</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Databases"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| <p>To create a table that resides in S3, run the <codeph>CREATE |
| TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement with the |
| <codeph>LOCATION</codeph> clause. </p> |
| <p><codeph>ALTER TABLE</codeph> can set the <codeph>LOCATION</codeph> |
| property for an individual partition, so that some data in a table |
| resides in S3 and other data in the same table resides on HDFS.</p> |
| <p>The syntax for the <codeph>LOCATION</codeph> clause is:</p> |
| <codeblock>LOCATION 's3a://<varname>bucket_name</varname>/<varname>path</varname>/<varname>to</varname>/<varname>file</varname>'</codeblock> |
| <p>The file system prefix is always <codeph>s3a://</codeph>. Impala does |
| not support the <codeph>s3://</codeph> or <codeph>s3n://</codeph> |
| prefixes. </p> |
<p> For a partitioned table, either specify a separate
        <codeph>LOCATION</codeph> clause for each new partition, or specify a
      base <codeph>LOCATION</codeph> for the table and set up a directory
      structure in S3 to mirror the way Impala partitioned tables are
      structured in HDFS. </p>
| |
<p> You point a nonpartitioned table or an individual partition at S3 by
      specifying a single directory path in S3, which could be any arbitrary
      directory. Replicating the structure of an entire Impala partitioned
      table or database in S3 requires more care, with directories and
      subdirectories nested and named to match the equivalent directory tree
      in HDFS. Consider setting up an empty staging area in HDFS if
      necessary, and recording the complete directory structure so that you
      can replicate it in S3. </p>
| <p> When working with multiple tables with data files stored in S3, you can |
| create a database with a <codeph>LOCATION</codeph> attribute pointing to |
| an S3 path. Specify a URL of the form |
| <codeph>s3a://<varname>bucket</varname>/<varname>root</varname>/<varname>path</varname>/<varname>for</varname>/<varname>database</varname></codeph> |
| for the <codeph>LOCATION</codeph> attribute of the database. Any tables |
| created inside that database automatically create directories underneath |
| the one specified by the database <codeph>LOCATION</codeph> attribute. </p> |
<p>The following example creates a table where the partition for year
      2017 resides on HDFS and the partition for year 2018 resides in
      S3.</p>
| |
| <p>The partition for year 2018 includes a <codeph>LOCATION</codeph> |
| attribute with an <codeph>s3a://</codeph> URL, and so refers to data |
| residing in S3, under a specific path underneath the bucket |
| <codeph>impala-demo</codeph>. </p> |
| |
| <codeblock>CREATE TABLE mostly_on_hdfs (x int) PARTITIONED BY (year INT); |
| ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2017); |
| ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2018) |
| LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1'; |
| </codeblock> |
| |
<p> The following session creates a database and a table partitioned by
      multiple columns, residing entirely in S3. </p>
| <ul> |
| <li>Because a <codeph>LOCATION</codeph> attribute with an |
| <codeph>s3a://</codeph> URL is specified for the database, the |
| tables inside that database are automatically created in S3 underneath |
| the database directory. </li> |
| <li>To see the names of the associated subdirectories, including the |
| partition key values, use an S3 client tool to examine how the |
| directory structure is organized in S3. </li> |
| </ul> |
| |
| <codeblock>CREATE DATABASE db_on_s3 LOCATION 's3a://impala-demo/dir1/dir2/dir3'; |
| CREATE TABLE partitioned_multiple_keys (x INT) |
| PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT); |
| |
| ALTER TABLE partitioned_multiple_keys |
| ADD PARTITION (year=2015,month=1,day=1); |
| ALTER TABLE partitioned_multiple_keys |
| ADD PARTITION (year=2015,month=1,day=31); |
| |
| !hdfs dfs -ls -R s3a://impala-demo/dir1/dir2/dir3 |
| 2015-03-17 13:56:34 0 dir1/dir2/dir3/ |
| 2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/ |
| 2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/ |
| 2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/</codeblock> |
| |
| <p> |
| The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated |
| directory paths if they do not already exist. You can specify multiple levels of directories, and the |
| <codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir |
| -p</codeph>. |
| </p> |
| |
| <p> Use the standard S3 file upload methods to put the actual data files |
| into the right locations. You can also put the directory paths and data |
| files in place before creating the associated Impala databases or |
| tables, and Impala automatically uses the data from the appropriate |
| location after the associated databases and tables are created. </p> |
| <p>Use the <codeph>ALTER TABLE</codeph> statement with the |
| <codeph>LOCATION</codeph> clause to switch whether an existing table |
| or partition points to data in HDFS or S3. For example, if you have an |
| Impala table or partition pointing to data files in HDFS or S3, and you |
| later transfer those data files to the other filesystem, use the |
| <codeph>ALTER TABLE</codeph> statement to adjust the |
| <codeph>LOCATION</codeph> attribute of the corresponding table or |
| partition to reflect that change. </p> |
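    <p> For example, if the <codeph>year=2017</codeph> partition of the
      earlier <codeph>mostly_on_hdfs</codeph> table were later copied to S3,
      a sketch of the adjustment (the S3 path is hypothetical) would
      be: </p>
    <codeblock>ALTER TABLE mostly_on_hdfs PARTITION (year=2017)
  SET LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1_2017';</codeblock>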
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="s3_internal_external"> |
| |
| <title>Internal and External Tables Located in S3</title> |
| |
| <conbody> |
| |
| <p> Just as with tables located on HDFS storage, you can designate |
| S3-based tables as either internal (managed by Impala) or external, by |
| using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE |
| EXTERNAL TABLE</codeph> respectively. </p> |
| <p>When you drop an internal table, the files associated with the table |
| are removed, even if they are in S3 storage. When you drop an external |
| table, the files associated with the table are left alone, and are still |
| available for access by other tools or components.</p> |
| |
| <p> If the data in S3 is intended to be long-lived and accessed by other |
| tools in addition to Impala, create any associated S3 tables with the |
| <codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are |
| not deleted from S3 when the table is dropped. </p> |
| |
| <p> If the data in S3 is only needed for querying by Impala and can be |
| safely discarded once the Impala workflow is complete, create the |
| associated S3 tables using the <codeph>CREATE TABLE</codeph> syntax, so |
| that dropping the table also deletes the corresponding data files in S3. </p> |
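    <p> A minimal sketch of both variants, with hypothetical names and
      paths: </p>
    <codeblock>-- External: dropping the table leaves the S3 data files in place.
CREATE EXTERNAL TABLE audit_archive (event STRING)
  LOCATION 's3a://impala-demo/audit';

-- Internal: dropping the table also removes the S3 data files.
CREATE TABLE scratch_results (event STRING)
  LOCATION 's3a://impala-demo/scratch';</codeblock>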
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="s3_queries"> |
| |
| <title>Running and Tuning Impala Queries for Data Stored in S3</title> |
| |
| <conbody> |
| <p> Once a table or partition is designated as residing in S3, the |
| <codeph>SELECT</codeph> statement transparently accesses the data |
| files from the appropriate storage layer. </p> |
| |
| <ul> |
| <li> |
| Queries against S3 data support all the same file formats as for HDFS data. |
| </li> |
| |
| <li> |
| Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3 |
| corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ... |
| ADD PARTITION</codeph> to set up the appropriate paths in S3. |
| </li> |
| |
| <li> |
HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined with each other; see the sketch after this list.
| </li> |
| |
| <li> Authorization to control access to databases, tables, or columns |
| works the same whether the data is in HDFS or in S3. </li> |
| <li> The Catalog Server (<cmdname>catalogd</cmdname>) daemon caches |
| metadata for both HDFS and S3 tables.</li> |
| |
| <li> |
| Queries against S3 tables are subject to the same kinds of admission control and resource management as |
| HDFS tables. |
| </li> |
| |
| <li> Metadata about S3 tables is stored in the same Metastore database |
| as for HDFS tables. </li> |
| |
| <li> |
| You can set up views referring to S3 tables, the same as for HDFS tables. |
| </li> |
| |
| <li> The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE |
| STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> statements |
| work for S3 tables. </li> |
| </ul> |
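    <p> A brief sketch combining several of these points; the tables are
      hypothetical, with <codeph>orders_s3</codeph> in S3 and
        <codeph>customers</codeph> on HDFS: </p>
    <codeblock>-- Statistics work for S3 tables the same as for HDFS tables.
COMPUTE STATS orders_s3;

-- Joins between HDFS and S3 tables are transparent.
SELECT c.name, SUM(o.amount)
  FROM orders_s3 o JOIN customers c ON o.customer_id = c.id
 GROUP BY c.name;</codeblock>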
| |
| </conbody> |
| |
| <concept id="s3_performance"> |
| |
| <title>Understanding and Tuning Impala Query Performance for S3 Data</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p>Here are techniques you can use to interpret explain plans and |
| profiles for queries against S3 data, and tips to achieve the best |
| performance possible for such queries. </p> |
| |
<p> All else being equal, performance is expected to be lower for
        queries running against data in S3 than for data in HDFS. The actual
        mechanics of the <codeph>SELECT</codeph> statement are somewhat
        different when the data is in S3. Although the work is still
        distributed across the hosts of the cluster, Impala might
        parallelize the work for a distributed query differently for data on
        HDFS and S3.</p>
| <p>S3 does not have the same block notion as HDFS, so Impala uses |
| heuristics to determine how to split up large S3 files for processing |
| in parallel. Because all hosts can access any S3 data file with equal |
| efficiency, the distribution of work might be different than for HDFS |
| data, where the data blocks are physically read using short-circuit |
| local reads by hosts that contain the appropriate block replicas. |
| Although the I/O to read the S3 data might be spread evenly across the |
| hosts of the cluster, the fact that all data is initially retrieved |
| across the network means that the overall query performance is likely |
| to be lower for S3 data than for HDFS data. </p> |
| <p>Use the <codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option |
| to control the Parquet-specific split size. The default value is 256 |
| MB.</p> |
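      <p> For example, to match Parquet files written with a 128 MB block
        size (a per-session sketch; the value is in bytes): </p>
      <codeblock>SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;</codeblock>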
| |
| <p> When optimizing aspects of complex queries, such as the join order, |
| Impala treats tables on HDFS and S3 the same way. Therefore, follow |
| all the same tuning recommendations for S3 tables as for HDFS ones, |
| such as using the <codeph>COMPUTE STATS</codeph> statement to help |
| Impala construct accurate estimates of row counts and cardinality. See |
| <xref href="impala_performance.xml#performance"/> for details. </p> |
| |
| <p> In query profile reports, the numbers for |
| <codeph>BytesReadLocal</codeph>, |
| <codeph>BytesReadShortCircuit</codeph>, |
| <codeph>BytesReadDataNodeCached</codeph>, and |
| <codeph>BytesReadRemoteUnexpected</codeph> are blank because those |
| metrics come from HDFS. By definition, all the I/O for S3 tables |
| involves remote reads. </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="s3_restrictions"> |
| |
| <title>Restrictions on Impala Support for S3</title> |
| |
| <conbody> |
| |
| <p>The following restrictions apply when using Impala with S3:</p> |
| <ul> |
<li> Impala does not support the old <codeph>s3://</codeph> block-based
        or <codeph>s3n://</codeph> filesystem schemes; it supports only
          <codeph>s3a://</codeph>. </li>
| <li>Although S3 is often used to store JSON-formatted data, the current |
| Impala support for S3 does not include directly querying JSON data. |
| For Impala queries, use data files in one of the file formats listed |
| in <xref href="impala_file_formats.xml#file_formats"/>. If you have |
| data in JSON format, you can prepare a flattened version of that data |
| for querying by Impala as part of your ETL cycle. </li> |
| <li>You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph> |
| statement for tables or partitions that are located in S3. </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| |
| </concept> |