commit | e53d649f8a88f42a70237fe7c2663baa126fed1a | |
---|---|---|
author | Vihang Karajgaonkar <vihangk1@apache.org> | Tue Sep 08 12:47:01 2020 -0700 |
committer | Vihang Karajgaonkar <vihang@cloudera.com> | Mon Sep 28 22:39:24 2020 +0000 |
tree | 5f65df95db3b340394f6edb3c9c186bdc1c5e53d | |
parent | ee9904bc504edb19655bbee045c9b9e6a711eb7b | |
IMPALA-9664: Support hive replication

This patch makes some improvements to the INSERT event generated by Impala. Specifically, the INSERT event will now include new file information when Impala inserts into a table. This information can be used by external tools like Hive Replication to replicate the changes made by Impala in their source databases. Additionally, this patch modifies the truncate table execution so that it uses HMS API to truncate the table instead of deleting the files directly on the filesystem.

Following changes were made.
1. Fires insert events for insert overwrite.
2. Has the names of the new files in the events. In case of insert overwrite, this is just a list of files which were added by the insert overwrite operation.
3. In case of ACID tables, fires transactional notification API for all the partitions in which data is inserted.
4. For tables which have replication enabled, the truncate table operation now uses a HMS API to truncate the table. This is necessary since HMS API moves the files to a replication change manager location if needed. Additionally, it generates ALTER_TABLE events with truncate flag set to true.

TODO:
1. For external tables, replication does not seem to work in the dev environment. This will be done as a followup.

Testing:
1. Created a new test in test_events_processing.py which inserts into managed tables which are being replicated. It makes sure that hive replication detects the new rows which are added into the tables. The test also exercises insert overwrite and truncate statements and makes sure that the table is replicated correctly.

Change-Id: Icaf3fe0adff755ff853960f270ceb45b11a84f0a
Reviewed-on: http://gerrit.cloudera.org:8080/16439
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
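The behaviour described in the commit message can be illustrated with a small, self-contained Python model. This is only a sketch and not the patch's actual code: `InsertEvent`, `build_insert_event`, and `truncate_strategy` are hypothetical names, file listings are modelled as plain lists of paths, and "replication enabled" is reduced to a boolean. It assumes that files written by an insert get names not already present in the table or partition.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class InsertEvent:
    """Hypothetical stand-in for the INSERT event payload."""
    table: str
    partition: str
    overwrite: bool
    new_files: List[str] = field(default_factory=list)


def build_insert_event(table: str, partition: str,
                       files_before: Iterable[str],
                       files_after: Iterable[str],
                       overwrite: bool) -> InsertEvent:
    """Build an event listing only the files added by the insert.

    For INSERT OVERWRITE the old files are gone afterwards, but the event
    still carries just the files the operation added.
    """
    added = sorted(set(files_after) - set(files_before))
    return InsertEvent(table, partition, overwrite, added)


def truncate_strategy(replication_enabled: bool) -> str:
    """Pick how TRUNCATE is executed (hypothetical labels).

    With replication enabled, the truncate goes through the metastore API so
    files can be moved to the change manager location; otherwise files are
    deleted directly on the filesystem.
    """
    return "hms_api" if replication_enabled else "delete_files"


# A plain insert adds one file next to the existing one.
plain = build_insert_event("t", "p=1", ["f1"], ["f1", "f2"], overwrite=False)

# An insert overwrite replaces the old files; only the new file is reported.
overwrite = build_insert_event("t", "p=1", ["f1", "f2"], ["f3"], overwrite=True)
```

In this model a consumer such as a replication tool would only need `new_files` from each event to know which files to copy, which is the point of including file information in the INSERT event.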
Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources.
To learn more about Impala as a business user, or to try Impala live or in a VM, please visit the Impala homepage. Detailed documentation for administrators and users is available at Apache Impala documentation.
If you are interested in contributing to Impala as a developer, or learning more about Impala's internals and architecture, visit the Impala wiki.
Impala only supports Linux at the moment. Impala supports x86_64 and has experimental support for arm64 (as of Impala 4.0). Impala Requirements contains more detailed information on the minimum CPU requirements.
This distribution uses cryptographic software and may be subject to export controls. Please refer to EXPORT_CONTROL.md for more information.
See Impala's developer documentation to get started.
Detailed build notes has more information on the project layout and build.