/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Provides run-time semantic analysis of the projection list for the
* scan operator. The project list can include table columns and a
* variety of special columns. Requested columns can exist in the table,
* or may be "missing" with null values applied. The code here prepares
* a run-time projection plan based on the actual table schema.
* <p>
* Resolves a scan schema throughout the scan lifecycle. Schema resolution
* comes from a variety of sources. Resolution starts with preparing the
* schema for the first reader:
* <ul>
* <li>Project list (wildcard, empty, or explicit)</li>
* <li>Optional provided schema (strict or lenient)</li>
* <li>Implicit columns</li>
* <li>An "early" reader schema (one determined before reading any
* data).</li>
* </ul>
* The result is a <i>defined schema</i> which may include:
* <ul>
* <li>Dynamic columns: those from the project list where we know only
* the column name, but not its type.</li>
* <li>Resolved columns: implicit or provided columns where we know
* the name and type.</li>
* </ul>
* The schema itself can be one of two forms:
* <ul>
* <li>Open: meaning that the reader can add other columns. An open
* schema results from a wildcard projection. Since the wildcard can appear
* along with implicit columns, the schema can be open and have a set of
* columns. If a provided schema appears, then the provided schema is
* expanded here. If the schema is "lenient", then the reader can add
* additional columns as it discovers them.</li>
* <li>Closed: meaning that the reader cannot add additional columns.
* A closed schema results from an empty or explicit projection list. A closed
* schema also results from a wildcard projection and a strict schema.</li>
* </ul>
* <p>
* Internally, the schema may start as open (has a wildcard), but may transition
* to closed when processing a strict provided schema.
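* <p>
* The open/closed rules above can be summarized in a minimal sketch. The names
* {@code ProjectionKind}, {@code ProvidedSchema} and {@code SchemaOpenness} are
* hypothetical, chosen only for illustration; they are not the classes of this package.
* <pre>{@code
```java
// Illustrative only: hypothetical types, not the actual implementation.
enum ProjectionKind { WILDCARD, EMPTY, EXPLICIT }

enum ProvidedSchema { NONE, LENIENT, STRICT }

final class SchemaOpenness {
  private SchemaOpenness() { }

  // A schema is open only when the project list has a wildcard and
  // no strict provided schema closes it.
  static boolean isOpen(ProjectionKind projection, ProvidedSchema provided) {
    return projection == ProjectionKind.WILDCARD
        && provided != ProvidedSchema.STRICT;
  }
}
```
* }</pre>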
* <p>
* Once this class is complete, the scan can add columns only to an open schema.
* All such columns are inserted at the wildcard location. If the wildcard appears
* by itself, columns are appended. If the wildcard appears along with implicit columns,
* then the reader columns appear at the wildcard location, before the implicit columns.
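* <p>
* As a rough sketch of the insertion rule, assume the project list and the reader
* columns are plain lists of names. The class {@code WildcardExpansion} and its
* method are hypothetical, and the real mechanism adds columns incrementally as the
* reader discovers them rather than in one pass.
* <pre>{@code
```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: column names stand in for full column metadata.
final class WildcardExpansion {
  private WildcardExpansion() { }

  // Replaces the "*" entry with the reader's columns, leaving any
  // columns listed before or after the wildcard in place.
  static List<String> expand(List<String> projectList, List<String> readerCols) {
    List<String> result = new ArrayList<>();
    for (String col : projectList) {
      if ("*".equals(col)) {
        result.addAll(readerCols);
      } else {
        result.add(col);
      }
    }
    return result;
  }
}
```
* }</pre>
* Under these assumptions, the project list {@code filename, *, dir0} with reader
* columns {@code a, b} yields {@code filename, a, b, dir0}.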
* <p>
* Once we have the initial reader input schema, we can then further refine
* the schema with:
* <ul>
* <li>The reader "output" schema: the columns actually read by the
* reader.</li>
* <li>The set of "missing" columns: those projected, but which the reader did
* not provide. We must make up a type for missing columns (and hope we guess
* correctly.) In fact, the purpose of the provided (and possibly early reader)
* schema is to avoid the need to guess.</li>
* </ul>
*
* <h4>Implicit (Wildcard) Projection</h4>
*
* A query can contain a wildcard ({@code *}). In this case, the set of columns is
* driven by the reader. Each scan might drive one, two or many readers. In an ideal
* world, every reader would produce the same schema. In the real world, files tend
* to evolve: early files have three columns, later files have five. In this case
* some readers will produce one schema, other readers another. Much of the complexity
* of Drill comes from this simple fact that Drill is a SQL engine that requires a
* single schema for all rows, but Drill reads data sources which are free to return
* any schema that they want.
* <p>
* A wildcard projection starts by accepting the schema produced by the first reader.
* In "classic" mode, later readers can add columns (causing a schema change to be
* sent downstream), but cannot change the types of existing columns. The code
* here supports a "no schema change" mode in which the first reader discovers the
* schema, which is then fixed for all subsequent readers. This mode cannot, however,
* prevent schema conflicts across scans running in different fragments.
*
* <h4>Explicit Projection</h4>
*
* Explicit projection provides the list of columns, but not their types.
* Example: SELECT a, b, c.
* <p>
* The projection list holds the columns
* as requested by the user in the {@code SELECT} clause of the query,
* in the order in which columns appear in that clause, along with additional
* columns implied by other columns. The planner
* determines which columns to project. In Drill, projection is speculative:
* it is a list of names which the planner hopes will appear in the data
* files. The reader must make up columns (the infamous nullable INT) when
* it turns out that no such column exists. Otherwise, the reader must figure out
* the data type for each column that does exist.
* <p>
* An explicit projection starts with the requested set of columns,
* then looks in the table schema to find matches. Columns not in the project list
* are not projected (not written to vectors). The reader columns provide the types
* of the projected columns, "resolving" them to a concrete type.
* <p>
* An explicit projection may include columns that do not exist in
* the source schema. In this case, we fill in null columns for
* unmatched projections.
* <p>
* The challenge in this case is that Drill cannot know the type of missing columns;
* Drill can only guess. If a reader in Scan 1 guesses a type, but a reader in
* Scan 2 reads a column with a different type, then a schema conflict will
* occur downstream.
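* <p>
* The matching and fill-in steps might look like the sketch below. It assumes
* columns are modeled as a map from name to a type string and that names match
* exactly; both are simplifications, and {@code MissingColumns} is a hypothetical
* name. The guessed type follows the nullable INT convention mentioned above.
* <pre>{@code
```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: types are plain strings, names match exactly.
final class MissingColumns {
  static final String GUESSED_TYPE = "NULLABLE INT";

  private MissingColumns() { }

  // Resolves each projected name against the reader's columns.
  // Unmatched names get the guessed type; reader columns that are
  // not in the project list are simply not projected.
  static Map<String, String> resolve(List<String> projected,
      Map<String, String> readerCols) {
    Map<String, String> resolved = new LinkedHashMap<>();
    for (String name : projected) {
      String type = readerCols.get(name);
      resolved.put(name, type == null ? GUESSED_TYPE : type);
    }
    return resolved;
  }
}
```
* }</pre>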
*
* <h4>Maps</h4>
*
* Maps introduce a large amount of additional complexity. First, maps appear
* in the project list as either:
* <ul>
* <li>A generic projection: just the name {@code m}, where {@code m} is a map.
* In this case, we project all members of the map. That is, the map itself
* is open in the above sense. Note that a map can be open even if the scan
* schema itself is closed. That is, if the projection list contains only
* {@code m}, the scan schema is closed, but the map is open (the reader will
* discover the fields that make up the map.)</li>
* <li>A specific projection: a list of map members: {@code m.x, m.y}. In this
* case, we know that the downstream Project operator will pull just those two
* members to the top level and discard the rest of the map. We can thus
* project just those two members in the scan. As a result, the map is closed
* in the above sense: any additional map members discovered by the reader will
* be unprojected.</li>
* <li>Hybrid: a projection list that includes both: {@code m, m.x}. Here, the
* generic projection takes precedence. If the specific projection includes
* qualifiers, {@code m, m.x[1]}, then that information is used to check the
* type of column {@code x}.</li>
* <li>Implied: in a wildcard projection, a column may turn out to be a map.
* In this case, the map is open when the schema itself is open. (Remember that
* a wildcard projection can result in a closed schema if paired with a strict
* provided schema.)</li>
* </ul>
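* <p>
* The first three cases can be sketched as follows, assuming the project list is
* a list of dotted path strings. {@code MapProjection} is a hypothetical name; the
* sketch ignores array qualifiers such as {@code m.x[1]} and the implied (wildcard)
* case.
* <pre>{@code
```java
import java.util.List;

// Illustrative only: paths are plain strings such as "m" or "m.x".
final class MapProjection {
  private MapProjection() { }

  // A map is open when it is projected generically by name. This also
  // covers the hybrid case {m, m.x}, where the generic form wins.
  static boolean isMapOpen(List<String> projectList, String map) {
    return projectList.contains(map);
  }

  // A member is projected if its map is open, or if the member is
  // named specifically in the project list.
  static boolean isMemberProjected(List<String> projectList, String map,
      String member) {
    return isMapOpen(projectList, map)
        || projectList.contains(map + "." + member);
  }
}
```
* }</pre>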
*
* <h4>Schema Definition</h4>
*
* This resolver is the first step in the scan schema process. The result is a
* (typically dynamic) <i>defined schema</i>. To understand this concept, it helps
* to compare Drill with other query engines. In most engines, the planner is
* responsible for working out the scan schema from table metadata, from the
* project list and so on. The scan is given a fully-defined schema which it
* must use.
* <p>
* Drill is unique in that it uses a <i>dynamic schema</i> with columns and/or types
* "to be named later." The scan must convert the dynamic schema into a concrete
* schema sent downstream. This class implements some of the steps in doing so.
* <p>
* The result of this class is a schema identical to a defined schema that a
* planner might produce. Since Drill is dynamic, the planner must be able to
* produce a dynamic schema of the form described above. If the planner has table
* metadata (here represented by a provided schema), then the planner could produce
* a concrete defined schema (all types are defined.) Or, with a lenient provided
* schema, the planner might produce a dynamic defined schema: one with some
* concrete columns, some dynamic (name-only) columns.
*
* <h4>Implicit Columns</h4>
*
* This class handles one additional source of schema information: implicit
* columns, which are defined by Drill itself. Examples include {@code filename,
* dir0}, etc. Implicit columns are available (at present) only for the file
* storage plugin, but could be added for other storage plugins. The project list
* can contain the names of implicit columns. If the query contains a wildcard,
* then the project list may also contain implicit columns:
* {@code filename, *, dir0}.
* <p>
* Implicit columns are known to Drill, so Drill itself can provide type information
* for those columns, using an external implicit column parser. That parser locates
* implicit columns by name, marks the columns as implicit, and takes care of
* populating the columns at read time. We use a column property,
* {@code IMPLICIT_COL_TYPE}, to mark a column as implicit. Later the scan mechanism
* will omit such columns when preparing the <i>reader schema</i>.
* <p>
* If the planner were to provide a defined schema, then the planner would have
* parsed out the implicit columns, provided their types, and marked them as
* implicit. So, again, we see that this class produces, at scan time, the same
* defined schema that the planner might produce at plan time.
* <p>
* Because of the way we handle implicit columns, we can allow the provided
* schema to include them. The provided schema simply adds a column (with any
* name), and sets the {@code IMPLICIT_COL_TYPE} property to indicate which
* implicit column definition to use for that column. This is handy for exposing
* partition directories as regular columns.
* <p>
* We now have a parsing flow for this package:
* <ul>
* <li>Projection list (so we know what to include)</li>
* <li>Provided schema (to add/mark columns as implicit)</li>
* <li>Implicit columns, where the parser looks only for a) columns tagged as
* implicit or b) dynamic columns (those not defined in the provided
* schema).</li>
* </ul>
* <p>
* Drill has long had a source of ambiguity: what happens if the reader has a column
* with the same name as an implicit column? In this flow, the ambiguity is resolved
* as follows:
* <ul>
* <li>If a provided schema has a column explicitly tagged as an implicit column,
* then that column is unambiguously an implicit column independent of name.</li>
* <li>If a provided schema has a column with the same name as an implicit column
* (the names can be changed by a system/session option), then the fact that the
* column is not marked as implicit unambiguously tells us that the column is not
* implicit, despite the name.</li>
* <li>If a column appears in the project list, but not in the provided schema,
* and that column matches the (effective) name of some implicit column, then
* the column is marked as implicit and is not passed to the reader. Further, the
* projection filter will mark that column as unprojected in the reader, even if
* the reader otherwise has a wildcard schema.</li>
* </ul>
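* <p>
* These three rules reduce to a small decision function. The sketch below assumes
* the provided schema is modeled as a map from column name to an "is tagged
* implicit" flag, and that the effective implicit names are available as a set.
* {@code ImplicitColumnRules} is a hypothetical name, not part of this package.
* <pre>{@code
```java
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified model of the disambiguation rules.
final class ImplicitColumnRules {
  private ImplicitColumnRules() { }

  // providedCols: column name -> true if tagged implicit in the provided schema.
  // implicitNames: the effective names of the implicit columns.
  static boolean isImplicit(String name, Map<String, Boolean> providedCols,
      Set<String> implicitNames) {
    Boolean tagged = providedCols.get(name);
    if (tagged != null) {
      // Rules 1 and 2: the provided schema decides, regardless of name.
      return tagged;
    }
    // Rule 3: not in the provided schema, so match by effective name.
    return implicitNames.contains(name);
  }
}
```
* }</pre>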
*
* <h4>Projection</h4>
*
* In prior versions of the scan operator, projection tended to be quite simple:
* just check if a name appears in the project list. As we've seen from the above,
* projection is actually quite complex with the need to reuse type information
* where available, open and closed top-level and map schemas, the need to avoid
* projecting columns with the same name as implicit columns, etc.
* <p>
* The {@code ProjectionFilter} classes handle projection. As it turns out, this
* class must follow (variations of) the same rules when merging the provided
* schema with the projection list and so on. To ensure a single implementation
* of the complex projection rules, this class uses a projection filter when
* resolving the provided schema. The devil is in the details: knowing when
* a map is open or closed, enforcing consistency with known information, etc.
*
* <h4>Provided Schema</h4>
*
* With the advent of provided schema in Drill 1.16, the query plan can provide
* not just column names (dynamic columns) but also the data type (concrete
* columns.) In this case, the scan schema can resolve projected columns against
* the provided schema, rather than waiting for the reader schema. Readers can use
* the provided schema to choose a column type when the choice is ambiguous, or multiple
* choices are possible.
* <p>
* If the projection list is a wildcard, then the wildcard expands to include all
* columns from the provided schema, in the order of that schema. If the schema
* is strict, then the scan schema becomes fixed, as if an explicit projection list
* were used.
* <p>
* If the projection list is explicit, then each column is resolved against
* the provided schema. If the projection list includes a column not in the
* provided schema, then it falls to the reader (or missing columns mechanism)
* to resolve that particular column.
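* <p>
* A minimal sketch of both cases, assuming the provided schema is an ordered map
* from column name to a type string and that a {@code null} type marks a column
* that is still dynamic. {@code ProvidedSchemaResolver} is a hypothetical name.
* <pre>{@code
```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: types are strings, null means "still dynamic".
final class ProvidedSchemaResolver {
  private ProvidedSchemaResolver() { }

  static Map<String, String> resolve(List<String> projectList,
      Map<String, String> provided) {
    Map<String, String> scanSchema = new LinkedHashMap<>();
    for (String name : projectList) {
      if ("*".equals(name)) {
        // Wildcard: expand to all provided columns, in provided order.
        scanSchema.putAll(provided);
      } else {
        // Explicit: use the provided type if there is one; otherwise
        // leave the column for the reader or missing-column handling.
        scanSchema.put(name, provided.get(name));
      }
    }
    return scanSchema;
  }
}
```
* }</pre>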
*
* <h4>Early Reader Schema</h4>
*
* Some readers can declare their schema before reading data. For example, a JDBC
* query gets back a row schema during the initial prepare step. In this case, the
* reader is said to be <i>early schema</i>. The reader indicates an early schema
* via its <i>schema negotiator</i>. The framework then uses this schema to resolve
* the dynamic columns in the scan schema. If all columns are resolved this way,
* then the scan can declare its own schema before reading any data.
* <p>
* An early reader schema can work with a provided schema. In this case, the early
* reader schema must declare the same column type as the provided schema.
* This is not a large obstacle: the provided schema should have originally come
* from the reader (or a description of the reader) so conflicts should not
* occur in normal operation.
*
* <h4>Reader Output Schema</h4>
*
* Once a reader loads a batch of data, it provides (via the
* {@code ResultSetLoader}) the reader's <i>output schema</i>: the set of columns
* actually read by the reader.
* <p>
* If the projection list contained a wildcard, then the reader output schema
* will determine the set of columns that replaces the wildcard. (That is, all reader
* columns are projected and the scan schema expands to reflect the actual columns.)
* <p>
* If the projection list is explicit (or made so by a strict provided schema),
* then the reader output schema must be a subset of the scan schema: it is an error
* for the reader to include extra columns as the scan mechanism won't know what to
* do with those vectors. The projection mechanism (see below) integrates with the
* {@code ResultSetLoader} to project only those columns needed; the others are
* given to the reader as "dummy" column writers: writers that accept, but discard
* their data.
* <p>
* Note the major difference between the early reader schema and the reader output
* schema. The early reader schema includes all the columns that the reader can read.
* The reader output schema includes only those columns that the reader actually read
* (as controlled by the projection filter.) For most readers (CSV, JSON, etc.), there
* is no early reader schema, there is only the reader output schema: the set of columns
* (modulo projection) that turned out to be in the data source.
*
* <h4>Projection</h4>
*
* The projection list tells the reader which columns to read. In this mechanism,
* the projection list undergoes multiple transforms (expanding into a provided
* schema, identifying implicit columns, etc.) Further, as columns are resolved
* (via a provided schema, an earlier reader, etc.), the projection list can provide
* type information as well.
* <p>
* To handle this, projection is driven by the (evolving) scan schema. In fact, the
* schema mechanism uses the same projection implementation when applying the
* provided schema and early reader schema.
*
* <h4>Assembling the Output Schema and Batch</h4>
*
* The <i>scan output schema</i> consists of up to three parts:
* <ul>
* <li>Reader columns (the reader output schema)</li>
* <li>Missing columns (reader input columns which the reader does not
* actually provide.)</li>
* <li>Implicit columns.</li>
* </ul>
* Distinct mechanisms build each kind of schema. The reader builds the vectors
* for the reader schema. A missing column handler builds the missing columns
* (using provided or inferred types and values.) An implicit column manager
* fills in the implicit columns based on file information.
* <p>
* The scan schema tracker tracks all three schemas together to form the
* scan output schema. Tracking the combined schema ensures we preserve the
* user's requested project ordering. The reader manager builds the vectors
* using the above mechanisms, then merges the vectors (very easy to do in a
* columnar system) to produce the output batch which matches the scan schema.
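* <p>
* The merge step can be pictured as a lookup in scan schema order. In this
* sketch a string stands in for a value vector, and each of the three sources is
* a map from column name to its vector. {@code OutputAssembly} and its method are
* hypothetical names used only for illustration.
* <pre>{@code
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: a String stands in for a value vector.
final class OutputAssembly {
  private OutputAssembly() { }

  // Walks the scan schema in order and picks each column's vector from
  // whichever source built it, preserving the requested ordering.
  static List<String> merge(List<String> scanSchema,
      Map<String, String> readerCols,
      Map<String, String> missingCols,
      Map<String, String> implicitCols) {
    List<String> batch = new ArrayList<>();
    for (String name : scanSchema) {
      String vector = readerCols.get(name);
      if (vector == null) {
        vector = missingCols.get(name);
      }
      if (vector == null) {
        vector = implicitCols.get(name);
      }
      if (vector == null) {
        throw new IllegalStateException("No source for column: " + name);
      }
      batch.add(vector);
    }
    return batch;
  }
}
```
* }</pre>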
*
* <h4>Architecture Overview</h4>
*
* <pre>
* Scan Plan
* |
* v
* +--------------+
* | Project List |
* | Parser |
* +--------------+
* |
* v
* +-------------+
* | Scan Schema | +-------------------+
* | Tracker | --->| Projection Filter |
* +-------------+ +-------------------+
* | |
* v v
* +------+ +------------+ +------------+ +-----------+
* | File | ---> | Reader |---->| Result Set | ---> | Data File |
* | Data | | | | Loader | <--- | Reader |
* +------+ +------------+ +------------+ +-----------+
* | |
* v |
* +------------+ Reader |
* | Reader | Schema |
* | Lifecycle | <----------+
* +------------+ |
* | |
* v |
* +---------+ Loaded |
* | Output | Vectors |
* | Builder | <-----------+
* +---------+
* |
* v
* Output Batch
* </pre>
*
* Omitted are the details of implicit and missing columns. The scan lifecycle
* (not shown) orchestrates the whole process.
* <p>
* The result is a scan schema which can start entirely dynamic (just a wildcard
* or list of column names), which is then resolved via a series of steps (some
* of which involve the real work of the scanner: reading data.) At the bottom is
* the output: a fully resolved scan schema which exactly describes an output
* data batch.
*/
package org.apache.drill.exec.physical.impl.scan.v3.schema;