Tutorial

Optiq-csv is a fully functional adapter for Optiq that reads text files in CSV (comma-separated values) format. It is remarkable that a couple of hundred lines of Java code are sufficient to provide full SQL query capability.

Optiq-csv also serves as a template for building adapters to other data formats. Even though there are not many lines of code, it covers several important concepts:

  • user-defined schema using SchemaFactory and Schema interfaces;
  • declaring schemas in a model JSON file;
  • declaring views in a model JSON file;
  • user-defined table using the Table interface;
  • determining the record type of a table;
  • a simple implementation of Table that enumerates all rows directly;
  • advanced implementation of Table that translates to relational operators using planner rules.

First queries

Without further ado, let's get started.

If you haven‘t already installed, follow the instructions in the README. It’s just two commands: git clone followed mvn install.

Now let's connect to Optiq using sqlline, a SQL shell that is included in the optiq-csv github project.

$ ./sqlline
sqlline> !connect jdbc:optiq:model=target/test-classes/model.json admin admin

Execute a metadata query:

sqlline> !tables
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null       | SALES        | DEPTS       | TABLE         | null     | null |
| null       | SALES        | EMPS        | TABLE         | null     | null |
| null       | metadata     | COLUMNS     | SYSTEM_TABLE  | null     | null |
| null       | metadata     | TABLES      | SYSTEM_TABLE  | null     | null |
+------------+--------------+-------------+---------------+----------+------+

(JDBC experts, note: sqlline's !tables command is just executing DatabaseMetaData.getTables() behind the scenes. It has other commands to query JDBC metadata, such as !columns and !describe.)

As you can see there are 4 tables in the system: tables EMPS and DEPTS in the current SALES schema, and COLUMNS and TABLES in the system metadata schema. The system tables are always present in Optiq, but the other tables are provided by the specific implementation of the schema; in this case, the EMPS and DEPTS tables are based on the EMPS.csv and DEPTS.csv files in the target/test-classes directory.

Let's execute some queries on those tables, to show that Optiq is providing a full implementation of SQL. First, a table scan:

sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+-------+---+
| EMPNO  |  NAME  | DEPTNO  | GENDER  |      CITY      | EMPID  |  AGE  | S |
+--------+--------+---------+---------+----------------+--------+-------+---+
| 100    | Fred   | 10      |         |                | 30     | 25    | t |
| 110    | Eric   | 20      | M       | San Francisco  | 3      | 80    | n |
| 110    | John   | 40      | M       | Vancouver      | 2      | null  | f |
| 120    | Wilma  | 20      | F       |                | 1      | 5     | n |
| 130    | Alice  | 40      | F       | Vancouver      | 2      | null  | f |
+--------+--------+---------+---------+----------------+--------+-------+---+

Now JOIN and GROUP BY:

sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
+------------+---------+
|    NAME    | EXPR$1  |
+------------+---------+
| Sales      | 1       |
| Marketing  | 2       |
+------------+---------+

Last, the VALUES operator generates a single row, and is a convenient way to test expressions and SQL built-in functions:

sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0  |
+---------+
| 13      |
+---------+

Optiq has many other SQL features. We don't have time to cover them here. Write some more queries to experiment.

Schema discovery

Now, how did Optiq find these tables? Remember, core Optiq does not know anything about CSV files. (As a “database without a storage layer”, Optiq doesn't know about any file formats.) Optiq knows about those tables because we told it to run code in the optiq-csv project.

There are a couple of steps in that chain. First, we define a schema based on a schema factory class in a model file. Then the schema factory creates a schema, and the schema creates several tables, each of which knows how to get data by scanning a CSV file. Last, after Optiq has parsed the query and planned it to use those tables, Optiq invokes the tables to read the data as the query is being executed. Now let's look at those steps in more detail.

On the JDBC connect string we gave the path of a model in JSON format. Here is the model:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}

The model defines a single schema called ‘SALES’. The schema is powered by a plugin class, net.hydromatic.optiq.impl.csv.CsvSchemaFactory, which is part of the optiq-csv project and implements the Optiq interface SchemaFactory. Its create method instantiates a schema, passing in the directory argument from the model file:

public Schema create(MutableSchema parentSchema, String name,
    Map<String, Object> operand) {
  Map map = (Map) operand;
  String directory = (String) map.get("directory");
  Boolean smart = (Boolean) map.get("smart");
  final CsvSchema schema =
      new CsvSchema(
          parentSchema,
          new File(directory),
          parentSchema.getSubSchemaExpression(name, CsvSchema.class),
          smart != null && smart);
  parentSchema.addSchema(name, schema);
  return schema;
}

Driven by the model, the schema factory instantiates a single schema called ‘SALES’. The schema is an instance of net.hydromatic.optiq.impl.csv.CsvSchema and implements the Optiq interface Schema.

A schema‘s job is to produce a list of tables. (It can also list sub-schemas and table-functions, but these are advanced features and optiq-csv does not support them.) The tables implement Optiq’s Table interface. CsvSchema produces tables that are instances of CsvTable.

Here is the relevant code from CsvSchema, overriding the initialTables method in the MapSchema base class.

  protected Collection<TableInSchema> initialTables() {
    final List<TableInSchema> list = new ArrayList<TableInSchema>();
    File[] files = directoryFile.listFiles(
        new FilenameFilter() {
          public boolean accept(File dir, String name) {
            return name.endsWith(".csv");
          }
        });
    for (File file : files) {
      String tableName = file.getName();
      if (tableName.endsWith(".csv")) {
        tableName = tableName.substring(
            0, tableName.length() - ".csv".length());
      }
      final List<CsvFieldType> fieldTypes = new ArrayList<CsvFieldType>();
      final RelDataType rowType =
          CsvTable.deduceRowType(typeFactory, file, fieldTypes);
      final CsvTable table;
      if (smart) {
        table = new CsvSmartTable(this, tableName, file, rowType, fieldTypes);
      } else {
        table = new CsvTable(this, tableName, file, rowType, fieldTypes);
      }
      list.add(
          new TableInSchemaImpl(this, tableName, TableType.TABLE, table));
    }
    return list;
  }

The schema scans the directory and finds all files whose name ends with “.csv” and creates tables for them. In this case, the directory is target/test-classes/sales and contains files EMPS.csv and DEPTS.csv, which these become the tables EMPS and DEPTS.

Tables and views in schemas

Note how we did not need to define any tables in the model; the schema generated the tables automatically.

Some schema types allow you to define extra tables, beyond those that are created automatically, using the tables property of a schema.

(Specifically, Optiq checks whether the schema returned from SchemaFactory.create implements the MutableSchema interface. If it does, you can define tables in the model. Further, if the schema extends the MapSchema class, Optiq will call MapSchema.initialize() to create the automatic tables, then go ahead and create the explicit tables.)

CsvSchema allows explicit tables, so let's see how to create an important and useful type of table, namely a view.

A view looks like a table when you are writing a query, but it doesn't store data. It derives its result by executing a query. The view is expanded while the query is being planned, so the query planner can often perform optimizations like removing expressions from the SELECT clause that are not used in the final result.

Here is a schema that defines a view:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}

The line type: ‘view’ tags it as a view, as opposed to a regular table or a custom table. Note that single-quotes within the view definition are escaped using a back-slash, in the normal way for JSON.

We can now execute queries using that view just as if it were a table:

sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
|  NAME  |    NAME    |
+--------+------------+
| Wilma  | Marketing  |
+--------+------------+

JDBC adapter

The JDBC adapter maps a schema in a JDBC data source as an Optiq schema.

For example, this schema reads from a MySQL “foodmart” database:

{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.jdbc.JdbcSchema.Factory',
      operand: {
        driver: 'com.mysql.jdbc.Driver',
        url: 'jdbc:mysql://localhost/foodmart',
        user: 'foodmart',
        password: 'foodmart'
      }
    }
  ]
}

(The FoodMart database will be familiar to those of you who have used the Mondrian OLAP engine, because it is Mondrian‘s main test data set. To load the data set, follow Mondrian’s installation instructions.)

Current limitations: The JDBC adapter currently only pushes down table scan operations; all other processing (filting, joins, aggregations and so forth) occurs within Optiq. Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If an Optiq query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Optiq will use the most efficient distributed query approach that it can.

The cloning JDBC adapter

The cloning JDBC adapter creates a hybrid database. The data is sourced from a JDBC database but is read into in-memory tables the first time each table is accessed. Optiq evaluates queries based on those in-memory tables, effectively a cache of the database.

For example, the following model reads tables from a MySQL “foodmart” database:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.clone.CloneSchema.Factory',
      operand: {
        driver: 'com.mysql.jdbc.Driver',
        url: 'jdbc:mysql://localhost/foodmart',
        user: 'foodmart',
        password: 'foodmart'
      }
    }
  ]
}

Another technique is to build a clone schema on top of an existing schema. You use the source property to reference a schema defined earlier in the model, like this:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.jdbc.JdbcSchema.Factory',
      operand: {
        driver: 'com.mysql.jdbc.Driver',
        url: 'jdbc:mysql://localhost/foodmart',
        user: 'foodmart',
        password: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'net.hydromatic.optiq.impl.clone.CloneSchema.Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}

You can use this approach for any type of schema, not just JDBC.

We plan to develop more sophisticated caching strategies, and a more complete and efficient implementation of in-memory tables, but for now the cloning JDBC adapter shows what is possible and allows us to try out our initial implementations.

Further topics

Defining a custom schema

(To be written.)

Modifying data

How to enable DML operations (INSERT, UPDATE and DELETE) on your schema.

(To be written.)

Defining views in a model file

(To be written.)

Calling conventions

(To be written.)

Statistics and cost

Defining and using user-defined functions

(To be written.)

Defining tables in a schema

(To be written.)

Defining custom tables

(To be written.)

Built-in SQL implementation

How does Optiq implement SQL, if an adapter does not implement all of the core relational operators?

(To be written.)

Further resources

  • Optiq home page
  • Optiq-csv home page