/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* Defines a mock data source that generates dummy data for use
* in tests. The data source operates in two modes:
* <ul>
* <li><b>Classic:</b> used in physical plans in many unit tests.
* The plan specifies a set of columns; data is generated by the
* vectors themselves based on two alternating values.</li>
* <li><b>Enhanced:</b> available for use in newer unit tests.
* Enhances the physical plan description to allow specifying a data
* generator class (for various types, data formats, etc.). It also
* provides a data storage engine framework to allow using mock
* tables in SQL queries.</li>
* </ul>
* <h3>Classic Mode</h3>
* Create a scan operator that looks like the following (from
* <tt>/src/test/resources/functions/cast/two_way_implicit_cast.json</tt>,
* used in {@link TestReverseImplicitCast}):
* <pre><code>
* graph:[
* {
* {@literal @}id:1,
* pop:"mock-scan",
* url: "http://apache.org",
* entries:[
* {records: 1, types: [
* {name: "col1", type: "FLOAT4", mode: "REQUIRED"},
* {name: "col2", type: "FLOAT8", mode: "REQUIRED"}
* ]}
* ]
* }, ...
* </code></pre>
* Here:
* <ul>
* <li>The <tt>pop</tt> must be <tt>mock-scan</tt>.</li>
* <li>The <tt>url</tt> is unused.</li>
* <li>The <tt>entries</tt> section can have one or more entries. If
* there is more than one entry, the storage engine will enable parallel
* scans up to the number of entries, as though each entry were a
* different file or group.</li>
* <li>The entry <tt>name</tt> is arbitrary, though color names seem
* to be the traditional names used in Drill tests.</li>
* <li>The <tt>type</tt> is one of the supported Drill
* {@link MinorType} names.</li>
* <li>The <tt>mode</tt> is one of the supported Drill
* {@link DataMode} names: usually <tt>OPTIONAL</tt> or <tt>REQUIRED</tt>.</li>
* </ul>
* <p>
* Recent extensions include:
* <ul>
* <li><tt>repeat</tt> in either the "entry" or "record" elements allows
* repeating entries (simulating multiple blocks or row groups) and
* repeating fields (to easily create a dozen fields of some type).</li>
* <li><tt>generator</tt> in a field definition lets you specify a
* specific data generator (see below).</li>
* <li><tt>properties</tt> in a field definition lets you pass
* generator-specific values to the data generator (such as, say,
* a minimum and maximum value).</li>
* </ul>
*
* <h3>Enhanced Mode</h3>
* Enhanced mode builds on Classic mode to add further capabilities.
* Enhanced mode can be used either in a physical plan or in SQL. Data
* is randomly generated over a wide range of values and can be
* controlled by custom generator classes. When used
* in a physical plan, the <tt>records</tt> section has additional
* attributes as described in {@link MockTableDef.MockColumn}:
* <ul>
* <li>The <tt>generator</tt> lets you specify a class to generate the
* sample data. The class name can either include a full package path
* or be just a class name. If just a class name, the
* class is assumed to reside in this package. For example, to generate
* an ISO date into a string, use <tt>DateGen</tt>. Additional generators
* can (and should) be added as the need arises.</li>
* <li>The <tt>repeat</tt> attribute lets you create a very wide row by
* repeating a column the specified number of times. Actual column names
* have a numeric suffix. For example, if the base name is "blue" and
* is repeated twice, actual columns are "blue1" and "blue2".</li>
* </ul>
* When used in SQL, use the <tt>mock</tt> name space as follows:
* <pre><code>
* SELECT id_i, name_s50 FROM `mock`.`employee_500`;
* </code></pre>
* Both the column names and table names encode information that specifies
* what data to generate.
* <p>
* Columns are of the form <tt><i>name</i>_<i>type</i><i>length</i>?</tt>.
* <ul>
* <li>The name is anything you want ("id" and "name" in the example.)</li>
* <li>The underscore is required to separate the type from the name.</li>
* <li>The type is one of "i" (integer), "d" (double) or "s" (string).
* Other types can be added as needed: n (decimal number), l (long), etc.</li>
* <li>The length is optional and is used only for string (<tt>VARCHAR</tt>)
* columns. The default string length is 10.</li>
* <li>Columns do not yet support nulls. When they do, the encoding will
* be "_n<i>percent</i>" where the percent specifies the percent of rows
* that should contain null values in this column.</li>
* <li>The column is known to SQL as its full name, that is "id_i" or
* "name_s50".</li>
* </ul>
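* <p>
* As an illustration of the column encoding, the following query requests
* an integer, a double, a string of the default length (10) and a string of
* length 50. The column names are made up for this example; only the
* suffixes carry meaning:
* <pre><code>
* SELECT id_i, salary_d, code_s, name_s50 FROM `mock`.`employee_500`;
* </code></pre>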
* <p>
* Tables are of the form <tt><i>name</i>_<i>rows</i><i>unit</i>?</tt> where:
* <ul>
* <li>The name is anything you want. ("employee" in the example.)</li>
* <li>The underscore is required to separate the row count from the name.</li>
* <li>The row count specifies the number of rows to return.</li>
* <li>The count unit can be none, K (multiply count by 1000) or M
* (multiply row count by one million), case insensitive.</li>
* <li>Another field (not yet implemented) might specify the split count.</li>
* </ul>
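* <p>
* For example, assuming the unit suffixes behave as described above, the
* following queries would return 500, 10,000 and 2,000,000 rows
* respectively. The table names are made up for illustration:
* <pre><code>
* SELECT id_i FROM `mock`.`employee_500`;
* SELECT id_i FROM `mock`.`orders_10K`;
* SELECT id_i FROM `mock`.`events_2m`;
* </code></pre>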
* <h3>Enhanced Mode with Definition File</h3>
* You can reference a mock data definition file directly from SQL as follows:
* <pre><code>SELECT * FROM `mock`.`your_defn_file.json`</code></pre>
* <h3>Data Generators</h3>
* The classic mode uses data generators built into each vector to generate
* the sample data. These generators use a very simple black/white alternating
* series of two values. Simple, but limited. The enhanced mode allows custom
* data generators. Unfortunately, this requires a separate generator class for
* each data type. As a result, we presently support just a few key data types.
* On the other hand, the custom generators do allow tests to specify a custom
* generator class to generate the kind of data needed for that test.
* <p>
* All data generators implement the {@link FieldGen} interface, and must have
* a no-argument constructor to allow dynamic instantiation. The mock data
* source either picks a default generator (if no <tt>generator</tt> is provided)
* or uses the custom generator specified in <tt>generator</tt>. Generators
* are independent (though one could, perhaps, write generators that correlate
* field values).
*/
package org.apache.drill.exec.store.mock;