<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="lineage" rev="2.2.0">
<title>Viewing Lineage Information for Impala Data</title>
<titlealts audience="PDF"><navtitle>Viewing Lineage Info</navtitle></titlealts>
<data name="Category" value="Impala"/>
<data name="Category" value="Lineage"/>
<data name="Category" value="Governance"/>
<data name="Category" value="Data Management"/>
<data name="Category" value="Navigator"/>
<data name="Category" value="Administrators"/>
<p rev="2.2.0">
<indexterm audience="hidden">lineage</indexterm>
<indexterm audience="hidden">column lineage</indexterm>
<term>Lineage</term> is a feature that helps you track where data originated and how
data propagates through the system by way of SQL statements such as
<codeph>SELECT</codeph>, <codeph>INSERT</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>.
This type of tracking is important in high-security configurations, especially in
highly regulated industries such as healthcare, pharmaceuticals, financial services, and
intelligence. For this kind of sensitive data, it is important to know every place
in the system that contains the data or other data derived from it; to verify who has accessed
the data; and to be able to double-check that the data used to make a decision was processed correctly and
not tampered with.
<section id="column_lineage">
<title>Column Lineage</title>
<term>Column lineage</term> tracks information in fine detail, at the level of
particular columns rather than entire tables.
For example, if you have a table with information derived from web logs, you might copy that data into
other tables as part of the ETL process. The ETL operations might involve transformations through
expressions and function calls, and rearranging the columns into more or fewer tables
(<term>normalizing</term> or <term>denormalizing</term> the data). Then for reporting, you might issue
queries against multiple tables and views. In this example, column lineage helps you determine that data
that entered the system as <codeph>RAW_LOGS.FIELD1</codeph> was then turned into
<codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph> through an <codeph>INSERT ... SELECT</codeph> statement. Or,
conversely, you could start with a reporting query against a view, and trace the origin of the data in a
field such as <codeph>TOP_10_VISITORS.USER_ID</codeph> back to the underlying table and even further back
to the point where the data was first loaded into Impala.
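The two directions of tracing described above can be sketched as a walk over a graph of
column-to-column edges. This is a minimal illustration in Python, not how Impala stores or
queries lineage. The names <codeph>RAW_LOGS.FIELD1</codeph>,
<codeph>WEBSITE_REPORTS.IP_ADDRESS</codeph>, and <codeph>TOP_10_VISITORS.USER_ID</codeph> come from
the example; <codeph>RAW_LOGS.FIELD2</codeph>, <codeph>WEBSITE_REPORTS.USER_ID</codeph>, and the
helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical column-to-column edges, each meaning "target was derived from source".
# RAW_LOGS.FIELD2 and WEBSITE_REPORTS.USER_ID are invented for illustration.
EDGES = [
    ("RAW_LOGS.FIELD1", "WEBSITE_REPORTS.IP_ADDRESS"),
    ("RAW_LOGS.FIELD2", "WEBSITE_REPORTS.USER_ID"),
    ("WEBSITE_REPORTS.USER_ID", "TOP_10_VISITORS.USER_ID"),
]


def _reachable(start, adjacency):
    """Return every column reachable from start by following adjacency."""
    seen = set()
    stack = [start]
    while stack:
        column = stack.pop()
        for neighbor in adjacency.get(column, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def downstream(column, edges=EDGES):
    """Columns that contain data derived, directly or indirectly, from column."""
    forward = defaultdict(set)
    for source, target in edges:
        forward[source].add(target)
    return _reachable(column, forward)


def upstream(column, edges=EDGES):
    """Columns that column was derived from, directly or indirectly."""
    backward = defaultdict(set)
    for source, target in edges:
        backward[target].add(source)
    return _reachable(column, backward)


print(sorted(downstream("RAW_LOGS.FIELD1")))
print(sorted(upstream("TOP_10_VISITORS.USER_ID")))
```

Calling <codeph>downstream()</codeph> answers "where did this raw field end up?", while
<codeph>upstream()</codeph> answers "where did this report column come from?".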
When you have tables where you need to track or control access to sensitive information at the column
level, see <xref href="impala_authorization.xml#authorization"/> for how to implement column-level
security. You set up authorization using the Sentry framework, create views that refer to specific sets of
columns, and then assign authorization privileges to those views rather than the underlying tables.
<section id="lineage_data">
<title>Lineage Data for Impala</title>
The lineage feature is enabled by default. When lineage logging is enabled, Impala computes the column
lineage graph for each query, serializes it, and stores it in a dedicated lineage log file in JSON format.
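Because the log is JSON, a script can turn each entry back into column-to-column relationships.
The layout of a log entry is not specified in this topic, so the following Python sketch rests on
assumptions: that each entry is one JSON object on its own line, and that it carries
<codeph>vertices</codeph> and <codeph>edges</codeph> arrays with the field names shown. Check
these names against real log output before relying on them. The sample record and the
<codeph>column_pairs()</codeph> helper are hypothetical.

```python
import json

# One hypothetical lineage record. The field names ("queryText", "vertices", "edges",
# "vertexId", "edgeType", and so on) are assumptions for illustration only.
SAMPLE_LINE = json.dumps({
    "queryText": "INSERT INTO website_reports SELECT field1 FROM raw_logs",
    "user": "etl_user",
    "vertices": [
        {"id": 0, "vertexType": "COLUMN",
         "vertexId": "default.website_reports.ip_address"},
        {"id": 1, "vertexType": "COLUMN",
         "vertexId": "default.raw_logs.field1"},
    ],
    "edges": [
        {"sources": [1], "targets": [0], "edgeType": "PROJECTION"},
    ],
})


def column_pairs(line):
    """Return (source_column, target_column, edge_type) tuples for one log line."""
    record = json.loads(line)
    # Map numeric vertex ids to their fully qualified column names.
    names = {v["id"]: v["vertexId"] for v in record.get("vertices", [])}
    pairs = []
    for edge in record.get("edges", []):
        for source in edge["sources"]:
            for target in edge["targets"]:
                pairs.append((names[source], names[target], edge.get("edgeType")))
    return pairs


for source, target, kind in column_pairs(SAMPLE_LINE):
    print(f"{source} -> {target} ({kind})")
```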
Impala records queries in the lineage log if they complete successfully or fail due to authorization
errors. For write operations such as <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph>,
the statement is recorded in the lineage log only if it completes successfully. Therefore, the lineage
feature tracks data that successful queries accessed, and data that unsuccessful queries tried to access
before being blocked by an authorization failure. These queries represent data
that really was accessed, or attempted access that could indicate malicious activity.
Impala does not record queries in the lineage log if they fail due to syntax errors, or if they fail or are
cancelled before they reach the stage of requesting rows from the result set.
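The recording rules above can be condensed into a small decision function. This Python sketch
is only one reading of the rules as stated in this topic, not Impala's implementation. The
<codeph>Outcome</codeph> enumeration and the <codeph>is_recorded()</codeph> function are
hypothetical names, and the sketch assumes that a write operation blocked by an authorization
error is not recorded, because write operations are recorded only on success.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for how a statement ended."""
    SUCCESS = "success"
    AUTHORIZATION_ERROR = "authorization_error"
    SYNTAX_ERROR = "syntax_error"
    FAILED_OR_CANCELLED_BEFORE_FETCH = "failed_or_cancelled_before_fetch"


def is_recorded(outcome, is_write):
    """Return True if, under the rules described above, the statement is logged.

    is_write is True for statements such as INSERT and CREATE TABLE AS SELECT.
    """
    if outcome is Outcome.SUCCESS:
        return True
    if outcome is Outcome.AUTHORIZATION_ERROR:
        # Assumption: write operations are logged only when they succeed.
        return not is_write
    # Syntax errors, and failures or cancellations before rows are requested.
    return False


print(is_recorded(Outcome.AUTHORIZATION_ERROR, is_write=False))
print(is_recorded(Outcome.AUTHORIZATION_ERROR, is_write=True))
```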
To enable or disable this feature, set or remove the <codeph>-lineage_event_log_dir</codeph>
configuration option for the <cmdname>impalad</cmdname> daemon.