blob: fdab1ed6b2dff180a841b03ddd2f7be932160675 [file] [log] [blame] [view]
Tutorial
========
Optiq-csv is a fully functional adapter for <a
href="https://github.com/julianhyde/optiq">Optiq</a> that reads text
files in <a
href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV
(comma-separated values)</a> format. It is remarkable that a couple of
hundred lines of Java code are sufficient to provide full SQL query
capability.
Optiq-csv also serves as a template for building adapters to other
data formats. Even though there are not many lines of code, it covers
several important concepts:
* user-defined schema using SchemaFactory and Schema interfaces;
* declaring schemas in a model JSON file;
* declaring views in a model JSON file;
* user-defined table using the Table interface;
* determining the record type of a table;
* a simple implementation of Table that enumerates all rows directly;
* advanced implementation of Table that translates to relational operators using planner rules.
## First queries
Without further ado, let's get started.
If you haven't already installed, follow the instructions in the <a href="README.md">README</a>.
It's just two commands: <code>git clone</code> followed <code>mvn install</code>.
Now let's connect to Optiq using
<a href="https://github.com/julianhyde/sqlline">sqlline</a>, a SQL shell
that is included in the optiq-csv github project.
```bash
$ ./sqlline
sqlline> !connect jdbc:optiq:model=target/test-classes/model.json admin admin
```
Execute a metadata query:
```bash
sqlline> !tables
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null | SALES | DEPTS | TABLE | null | null |
| null | SALES | EMPS | TABLE | null | null |
| null | metadata | COLUMNS | SYSTEM_TABLE | null | null |
| null | metadata | TABLES | SYSTEM_TABLE | null | null |
+------------+--------------+-------------+---------------+----------+------+
```
(JDBC experts, note: sqlline's <code>!tables</code> command is just executing
<a href="http://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getTables(java.lang.String, java.lang.String, java.lang.String, java.lang.String[])"><code>DatabaseMetaData.getTables()</code></a>
behind the scenes.
It has other commands to query JDBC metadata, such as <code>!columns</code> and <code>!describe</code>.)
As you can see there are 4 tables in the system: tables
<code>EMPS</code> and <code>DEPTS</code> in the current
<code>SALES</code> schema, and <code>COLUMNS</code> and
<code>TABLES</code> in the system <code>metadata</code> schema. The
system tables are always present in Optiq, but the other tables are
provided by the specific implementation of the schema; in this case,
the <code>EMPS</code> and <code>DEPTS</code> tables are based on the
<code>EMPS.csv</code> and <code>DEPTS.csv</code> files in the
<code>target/test-classes</code> directory.
Let's execute some queries on those tables, to show that Optiq is providing
a full implementation of SQL. First, a table scan:
```bash
sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+-------+---+
| EMPNO | NAME | DEPTNO | GENDER | CITY | EMPID | AGE | S |
+--------+--------+---------+---------+----------------+--------+-------+---+
| 100 | Fred | 10 | | | 30 | 25 | t |
| 110 | Eric | 20 | M | San Francisco | 3 | 80 | n |
| 110 | John | 40 | M | Vancouver | 2 | null | f |
| 120 | Wilma | 20 | F | | 1 | 5 | n |
| 130 | Alice | 40 | F | Vancouver | 2 | null | f |
+--------+--------+---------+---------+----------------+--------+-------+---+
```
Now JOIN and GROUP BY:
```bash
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
+------------+---------+
| NAME | EXPR$1 |
+------------+---------+
| Sales | 1 |
| Marketing | 2 |
+------------+---------+
```
Last, the VALUES operator generates a single row, and is a convenient
way to test expressions and SQL built-in functions:
```bash
sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+
```
Optiq has many other SQL features. We don't have time to cover them
here. Write some more queries to experiment.
## Schema discovery
Now, how did Optiq find these tables? Remember, core Optiq does not
know anything about CSV files. (As a "database without a storage
layer", Optiq doesn't know about any file formats.) Optiq knows about
those tables because we told it to run code in the optiq-csv
project.
There are a couple of steps in that chain. First, we define a schema
based on a schema factory class in a model file. Then the schema
factory creates a schema, and the schema creates several tables, each
of which knows how to get data by scanning a CSV file. Last, after
Optiq has parsed the query and planned it to use those tables, Optiq
invokes the tables to read the data as the query is being
executed. Now let's look at those steps in more detail.
On the JDBC connect string we gave the path of a model in JSON
format. Here is the model:
```json
{
version: '1.0',
defaultSchema: 'SALES',
schemas: [
{
name: 'SALES',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.csv.CsvSchemaFactory',
operand: {
directory: 'target/test-classes/sales'
}
}
]
}
```
The model defines a single schema called 'SALES'. The schema is
powered by a plugin class,
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvSchemaFactory.java">net.hydromatic.optiq.impl.csv.CsvSchemaFactory</a>, which is part of the
optiq-csv project and implements the Optiq interface
<a href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/SchemaFactory.html">SchemaFactory</a>. Its <code>create</code> method instantiates a
schema, passing in the <code>directory</code> argument from the model file:
```java
public Schema create(SchemaPlus parentSchema, String name,
Map<String, Object> operand) {
String directory = (String) operand.get("directory");
Boolean smart = (Boolean) operand.get("smart");
return new CsvSchema(
parentSchema,
name,
new File(directory),
smart != null && smart);
}
```
Driven by the model, the schema factory instantiates a single schema
called 'SALES'. The schema is an instance of
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvSchema.java">net.hydromatic.optiq.impl.csv.CsvSchema</a>
and implements the Optiq interface <a
href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/Schema.html">Schema</a>.
A schema's job is to produce a list of tables. (It can also list sub-schemas and
table-functions, but these are advanced features and optiq-csv does
not support them.) The tables implement Optiq's <a
href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/Table.html">Table</a> interface. <code>CsvSchema</code> produces tables that are instances of
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvTable.java">CsvTable</a>.
Here is the relevant code from <code>CsvSchema</code>, overriding the
<code><a href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/impl/AbstractSchema.html#getTableMap()">getTableMap()</a></code>
method in the <code>AbstractSchema</code> base class.
```java
protected Map<String, Table> getTableMap() {
final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
File[] files = directoryFile.listFiles(
new FilenameFilter() {
public boolean accept(File dir, String name) {
return name.endsWith(".csv");
}
});
if (files == null) {
System.out.println("directory " + directoryFile + " not found");
files = new File[0];
}
for (File file : files) {
String tableName = file.getName();
if (tableName.endsWith(".csv")) {
tableName = tableName.substring(
0, tableName.length() - ".csv".length());
}
final CsvTable table;
if (smart) {
table = new CsvSmartTable(file, null);
} else {
table = new CsvTable(file, null);
}
builder.put(tableName, table);
}
return builder.build();
}
```
The schema scans the directory and finds all files whose name ends
with ".csv" and creates tables for them. In this case, the directory
is <code>target/test-classes/sales</code> and contains files
<code>EMPS.csv</code> and <code>DEPTS.csv</code>, which these become
the tables <code>EMPS</code> and <code>DEPTS</code>.
## Tables and views in schemas
Note how we did not need to define any tables in the model; the schema
generated the tables automatically.
You can define extra tables,
beyond those that are created automatically,
using the <code>tables</code> property of a schema.
Let's see how to create
an important and useful type of table, namely a view.
A view looks like a table when you are writing a query, but it doesn't store data.
It derives its result by executing a query.
The view is expanded while the query is being planned, so the query planner
can often perform optimizations like removing expressions from the SELECT
clause that are not used in the final result.
Here is a schema that defines a view:
```json
{
version: '1.0',
defaultSchema: 'SALES',
schemas: [
{
name: 'SALES',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.csv.CsvSchemaFactory',
operand: {
directory: 'target/test-classes/sales'
},
tables: [
{
name: 'FEMALE_EMPS',
type: 'view',
sql: 'SELECT * FROM emps WHERE gender = \'F\''
}
]
}
]
}
```
The line <code>type: 'view'</code> tags <code>FEMALE_EMPS</code> as a view, as opposed to a regular table
or a custom table. Note that single-quotes within the view definition are escaped using a
back-slash, in the normal way for JSON.
We can now execute queries using that view just as if it were a table:
```sql
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
| NAME | NAME |
+--------+------------+
| Wilma | Marketing |
+--------+------------+
```
## Custom tables
Custom tables are tables whose implementation is driven by user-defined code.
They don't need to live in a custom schema.
There is an example in <code>model-with-custom-table.json</code>:
```json
{
version: '1.0',
defaultSchema: 'CUSTOM_TABLE',
schemas: [
{
name: 'CUSTOM_TABLE',
tables: [
{
name: 'EMPS',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.csv.CsvTableFactory',
operand: {
file: 'target/test-classes/sales/EMPS.csv',
smart: false
}
}
]
}
]
}
```
We can query the table in the usual way:
```sql
sqlline> !connect jdbc:optiq:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO | NAME |
+--------+--------+
| 100 | Fred |
| 110 | Eric |
| 110 | John |
| 120 | Wilma |
| 130 | Alice |
+--------+--------+
```
The schema is a regular one, and contains a custom table powered by
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvTableFactory.java">net.hydromatic.optiq.impl.csv.CsvTableFactory</a>,
which implements the Optiq interface
<a href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/TableFactory.html">TableFactory</a>.
Its <code>create</code> method instantiates a
table, passing in the <code>file</code> argument from the model file:
```java
public CsvTable create(SchemaPlus schema, String name,
Map<String, Object> map, RelDataType rowType) {
String fileName = (String) map.get("file");
Boolean smart = (Boolean) map.get("smart");
final File file = new File(fileName);
final RelProtoDataType protoRowType =
rowType != null ? RelDataTypeImpl.proto(rowType) : null;
if (smart != null && smart) {
return new CsvSmartTable(file, protoRowType);
} else {
return new CsvTable(file, protoRowType);
}
}
```
Implementing a custom table is often a simpler alternative to implementing
a custom schema. Both approaches might end up creating a similar implementation
of the <code>Table</code> interface, but for the custom table you don't
need to implement metadata discovery. (<code>CsvTableFactory</code>
creates a <code>CsvTable</code>, just as <code>CsvSchema</code> does,
but the table implementation does not scan the filesystem for .csv files.)
Custom tables require more work for the author of the model (the author
needs to specify each table and its file explicitly) but also give the author
more control (say, providing different parameters for each table).
## Optimizing queries using planner rules
The table implementations we have seen so far are fine as long as the tables
don't contain a great deal of data. But if your customer table has, say, a
hundred columns and a million rows, you would rather that the system did not
retrieve all of the data for every query. You would like Optiq to negotiate
with the adapter and find a more efficient way of accessing the data.
This negotiation is a simple form of query optimization. Optiq supports query
optimization by adding <i>planner rules</i>. Planner rules operate by
looking for patterns in the query parse tree (for instance a project on top
of a certain kind of table), and
Planner rules are also extensible, like schemas and tables. So, if you have a
data store that you want to access via SQL, you first define a custom table or
schema, and then you define some rules to make the access efficient.
To see this in action, let's use a planner rule to access
a subset of columns from a CSV file. Let's run the same query against two very
similar schemas:
```sql
sqlline> !connect jdbc:optiq:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| EnumerableTableAccessRel(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
sqlline> !connect jdbc:optiq:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| CsvTableScan(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
```
What causes the difference in plan? Let's follow the trail of evidence. In the
<code>smart.json</code> model file, there is just one extra line:
```json
smart: true
```
This causes <code>CsvSchema</code> to be created with <code>smart = true</code>,
and its <code>createTable</code> method creates instances of
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvSmartTable.java">CsvSmartTable</a>
rather than a <code>CsvTable</code>.
<code>CsvSmartTable</code> overrides the
<code><a href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/TranslatableTable#toRel()">TranslatableTable.toRel()</a></code>
method to create
<a href="https://github.com/julianhyde/optiq-csv/blob/master/src/main/java/net/hydromatic/optiq/impl/csv/CsvTableScan.java">CsvTableScan</a>.
Table scans are the leaves of a query operator tree.
The usual implementation is
<code><a href="http://www.hydromatic.net/optiq/apidocs/net/hydromatic/optiq/impl/java/JavaRules.EnumerableTableAccessRel.html">EnumerableTableAccessRel</a></code>,
but we have created a distinctive sub-type that will cause rules to fire.
Here is the rule in its entirety:
```java
public class CsvPushProjectOntoTableRule extends RelOptRule {
public static final CsvPushProjectOntoTableRule INSTANCE =
new CsvPushProjectOntoTableRule();
private CsvPushProjectOntoTableRule() {
super(
operand(ProjectRel.class,
operand(CsvTableScan.class, none())),
"CsvPushProjectOntoTableRule");
}
@Override
public void onMatch(RelOptRuleCall call) {
final ProjectRel project = call.rel(0);
final CsvTableScan scan = call.rel(1);
int[] fields = getProjectFields(project.getProjects());
if (fields == null) {
// Project contains expressions more complex than just field references.
return;
}
call.transformTo(
new CsvTableScan(
scan.getCluster(),
scan.getTable(),
scan.csvTable,
fields));
}
private int[] getProjectFields(List<RexNode> exps) {
final int[] fields = new int[exps.size()];
for (int i = 0; i < exps.size(); i++) {
final RexNode exp = exps.get(i);
if (exp instanceof RexInputRef) {
fields[i] = ((RexInputRef) exp).getIndex();
} else {
return null; // not a simple projection
}
}
return fields;
}
}
```
The constructor declares the pattern of relational expressions that will cause
the rule to fire.
The <code>onMatch</code> method generates a new relational expression and calls
<code><a href="http://www.hydromatic.net/optiq/apidocs/org/eigenbase/relopt/RelOptRuleCall.html#transformTo(org.eigenbase.rel.RelNode)">RelOptRuleCall.transformTo()</a></code>
to indicate that the rule has fired successfully.
## The query optimization process
There's a lot to say about how clever Optiq's query planner is, but we won't say
it here. The cleverness is designed to take the burden off you, the writer of
planner rules.
First, Optiq doesn't fire rules in a prescribed order. The query optimization process
follows many branches of a branching tree, just like a chess playing program
examines many possible sequences of moves. If rules A and B both match a given
section of the query operator tree, then Optiq can fire both.
Second, Optiq uses cost in choosing between plans, but the cost model doesn't
prevent rules from firing which may seem to be more expensive in the short term.
Many optimizers have a linear optimization scheme. Faced with a choice between
rule A and rule B, as above, such an optimizer needs to choose immediately. It
might have a policy such as "apply rule A to the whole tree, then apply rule B
to the whole tree", or apply a cost-based policy, applying the rule that
produces the cheaper result.
Optiq doesn't require such compromises.
This makes it simple to combine various sets of rules.
If, say you want to combine rules to recognize materialized views with rules to
read from CSV and JDBC source systems, you just give Optiq the set of all rules
and tell it to go at it.
Optiq does use a cost model. The cost model decides which plan to ultimately use,
and sometimes to prune the search tree to prevent the search space from
exploding, but it never forces you to choose between rule A and rule B. This is
important, because it avoids falling into local minima in the search space that
are not actually optimal.
Also (you guessed it) the cost model is pluggable, as are the table and query
operator statistics it is based upon. But that can be a subject for later.
## JDBC adapter
The JDBC adapter maps a schema in a JDBC data source as an Optiq schema.
For example, this schema reads from a MySQL "foodmart" database:
```json
{
version: '1.0',
defaultSchema: 'FOODMART',
schemas: [
{
name: 'FOODMART',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.jdbc.JdbcSchema.Factory',
operand: {
driver: 'com.mysql.jdbc.Driver',
url: 'jdbc:mysql://localhost/foodmart',
user: 'foodmart',
password: 'foodmart'
}
}
]
}
```
(The FoodMart database will be familiar to those of you who have used
the Mondrian OLAP engine, because it is Mondrian's main test data
set. To load the data set, follow <a
href="http://mondrian.pentaho.com/documentation/installation.php#2_Set_up_test_data">Mondrian's
installation instructions</a>.)
<b>Current limitations</b>: The JDBC adapter currently only pushes
down table scan operations; all other processing (filtering, joins,
aggregations and so forth) occurs within Optiq. Our goal is to push
down as much processing as possible to the source system, translating
syntax, data types and built-in functions as we go. If an Optiq query
is based on tables from a single JDBC database, in principle the whole
query should go to that database. If tables are from multiple JDBC
sources, or a mixture of JDBC and non-JDBC, Optiq will use the most
efficient distributed query approach that it can.
## The cloning JDBC adapter
The cloning JDBC adapter creates a hybrid database. The data is
sourced from a JDBC database but is read into in-memory tables the
first time each table is accessed. Optiq evaluates queries based on
those in-memory tables, effectively a cache of the database.
For example, the following model reads tables from a MySQL
"foodmart" database:
```json
{
version: '1.0',
defaultSchema: 'FOODMART_CLONE',
schemas: [
{
name: 'FOODMART_CLONE',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.clone.CloneSchema.Factory',
operand: {
driver: 'com.mysql.jdbc.Driver',
url: 'jdbc:mysql://localhost/foodmart',
user: 'foodmart',
password: 'foodmart'
}
}
]
}
```
Another technique is to build a clone schema on top of an existing
schema. You use the <code>source</code> property to reference a schema
defined earlier in the model, like this:
```json
{
version: '1.0',
defaultSchema: 'FOODMART_CLONE',
schemas: [
{
name: 'FOODMART',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.jdbc.JdbcSchema.Factory',
operand: {
driver: 'com.mysql.jdbc.Driver',
url: 'jdbc:mysql://localhost/foodmart',
user: 'foodmart',
password: 'foodmart'
}
},
{
name: 'FOODMART_CLONE',
type: 'custom',
factory: 'net.hydromatic.optiq.impl.clone.CloneSchema.Factory',
operand: {
source: 'FOODMART'
}
}
]
}
```
You can use this approach to create a clone schema on any type of
schema, not just JDBC.
The cloning adapter isn't the be-all and end-all. We plan to develop
more sophisticated caching strategies, and a more complete and
efficient implementation of in-memory tables, but for now the cloning
JDBC adapter shows what is possible and allows us to try out our
initial implementations.
## Further topics
### Defining a custom schema
(To be written.)
### Modifying data
How to enable DML operations (INSERT, UPDATE and DELETE) on your schema.
(To be written.)
### Calling conventions
(To be written.)
### Statistics and cost
(To be written.)
### Defining and using user-defined functions
(To be written.)
### Defining tables in a schema
(To be written.)
### Defining custom tables
(To be written.)
### Built-in SQL implementation
How does Optiq implement SQL, if an adapter does not implement all of the core relational operators?
(To be written.)
### Table functions
(To be written.)
## Further resources
* <a href="http://github.com/julianhyde/optiq">Optiq</a> home page
* <a href="http://github.com/julianhyde/optiq-csv">Optiq-csv</a> home page