SQE: Import wiki documentation
diff --git a/docs/storm-sqe-commands.md b/docs/storm-sqe-commands.md
new file mode 100644
index 0000000..65c589e
--- /dev/null
+++ b/docs/storm-sqe-commands.md
@@ -0,0 +1,160 @@
+# Commands
+
+SQE currently supports the following commands:
+
+* CreateStream
+* Query
+* Set
+
+All commands are formatted as a JSON map with a single entry. The key is the name of the command and the value is a map formatted according to the command type. This command map can contain multiple fields, sections or clauses. Commands are typically loaded from one or more JSON files that are an array of commands:
+
+    [
+      {"Set": {...}},
+      {"Set": {...}},
+      {"CreateStream": {...}},
+      {"Query": {...}},
+      {"Query": {...}}
+    ]
+
+## CreateStream
+
+CreateStream allows for creating input streams to the topology. For example, you can create a stream from a topic in a Kafka cluster. The JSON representation looks like:
+
+    {
+      "createStream": {
+        "streamName": "<STREAM_NAME>",
+        "objectName": "<OBJECT_NAME>",
+        "spoutName": "<SPOUT_NAME>",
+        "spoutType": "<SPOUT_TYPE>",
+        "deserializer": "<DESERIALIZER>",
+        "options": {
+          "option1": value1,
+          "optionN": valueN
+        }
+      }
+    }
+
+* streamName - This is the name of the stream used by SQE. Queries reference streams by using this name in the FROM clause.
+* objectName - This is the name of the object to read from. For example, with Kafka, this would be the topic. Different spouts will handle this differently, but it's always a reference to some division of data from the data source.
+* spoutName - This is the name of the spout to use. Currently, Kafka and Fixed are supported.
+* spoutType (optional) - This determines the type of spout, which corresponds to the state types (NON_TRANSACTIONAL, TRANSACTIONAL, OPAQUE). Exactly-once processing guarantees are determined by the combination of spout type and state type. Some spouts require this field, while others do not, and not all types are supported by all spouts (see the Trident documentation for more information).
+* deserializer (optional) - This takes the output of the spout and deserializes it. The only currently supported deserializer is Avro, which takes a byte array, converts it into an Avro record, then adds the required fields (based on supplied queries) to the tuple. If no deserializer is specified, then the tuples emitted by the spout are unchanged. For example, streams from a Kafka spout will typically need a deserializer since data is stored as a single message. Other spouts may return data split into different fields that can be queried against directly.
+* options (optional) - This object allows you to provide command level spout options. It can include new options that affect how the spout emits data or overrides for global SQE options that are passed to the spout. For example, you can specify the hostname of the spout you are reading from.
+
+### Example
+
+    {
+      "CreateStream": {
+        "streamName": "big.query.data",
+        "objectName": "big.query.data",
+        "spoutName": "FIXED",
+        "spoutType": "NON_TRANSACTIONAL",
+        "options": {
+          "jw.sqe.spout.fixed.fields": ["DateString", "AccountToken", "UserName", "HappyDanceCount"],
+          "jw.sqe.spout.fixed.values": [
+            ["2015-05-01 00:00", "Account1", "Joe", 1],
+            ["2015-05-01 01:00", "Account1", "Bob", -1],
+            ["2015-05-01 02:00", "Account1", "Susy", 1],
+            ["2015-05-01 03:00", "Account1", "Mr. Fancy Pants", 1],
+            ["2015-05-01 04:00", "Account1", "Joe", -1],
+            ["2015-05-01 05:00", "Account1", "Bob", 1],
+            ["2015-05-01 06:00", "Account1", "Susy", 1],
+            ["2015-05-01 07:00", "Account1", "Mr. Fancy Pants", 1],
+            ["2015-05-01 08:00", "Account2", "Joe", 1],
+            ["2015-05-01 09:00", "Account2", "Bob", 1],
+            ["2015-05-01 10:00", "Account2", "Susy", -1],
+            ["2015-05-01 11:00", "Account2", "Mr. Fancy Pants", 1],
+            ["2015-05-01 12:00", "Account2", "Joe", 1],
+            ["2015-05-01 13:00", "Account2", "Bob", 1],
+            ["2015-05-01 14:00", "Account2", "Susy", 1],
+            ["2015-05-01 15:00", "Account2", "Mr. Fancy Pants", -1]
+          ]
+        }
+      }
+    }
+
+## Set
+
+Set adds or overwrites entries in the global config map. This can be used to set config options from within a command file. The JSON representation looks like:
+
+    {"Set": {"key":"<KEY>","value":"<VALUE>"}}
+
+### Example
+
+    {"Set": {"key":"jw.sqe.state.redis.datatype","value":"HASH"}}
+
+## Query
+
+Queries allow you to run SQL-like queries against an input stream and persist/aggregate the results to a Trident state. The JSON representation of a query looks like:
+
+    {
+      "Query": {
+        "insertInto":{
+          "objectName":"<OBJECT_NAME>",
+          "stateName": "<STATE_NAME>",
+          "stateType": "<STATE_TYPE>",
+          "fields":[<FIELD_LIST>],
+          "options": {
+            "option1": value1,
+            "optionN": valueN
+          }
+        },
+        "select":{"expressions":[<EXPRESSION_LIST>]},
+        "from":{"objectName":"<OBJECT_NAME>"},
+        "where":<PREDICATE_EXPRESSION>
+      }
+    }
+
+### InsertInto
+
+This clause tells the query engine what object (table, view, stream, etc.) the results are delivered to, what the fields are named, and if it needs to persist data to a certain state. Similar to SQL, the ordering of the fields in this clause lines up with the expressions in the Select clause. InsertInto contains the following fields:
+
+* objectName - If SQE is sending the results to an output stream, then this is the name of the stream in the output stream map returned by build(). If a state is supplied (see below), then this is the name of the table (or other object) the results are persisted to.
+* stateName (optional) - If this field is supplied, then SQE will persist the results to a data store using this state. The name here must be a supported state, such as Redis. Otherwise, the results are streamed through the output stream supplied by build().
+* stateType (optional) - This determines the type of state (NON_TRANSACTIONAL, TRANSACTIONAL, OPAQUE). It must be supplied if stateName is.
+* fields - When persisting to a state, these fields are the names of the keys or columns. When using an output stream, they are the names of the fields in each tuple. The fields here are ordered and line up with the expressions in the Select clause.
+* options (optional) - This object allows you to provide query level state options. It can include new options that affect how the state persists data or overrides for global SQE options that are passed to the state. For example, you can specify the hostname of the data store you are persisting data to.
+
+### Select
+
+This clause represents the selection of fields, along with transformations and aggregations on those fields, from the input streams. The only field of the Select clause is a list of expressions. Expressions come in one of three types (an illustrative example follows the list):
+
+* Field - Field expressions directly reference fields in the input streams and are represented as a string literal of the field name in JSON.
+* Constant - A constant value that can be a number, string, boolean, or null. The JSON representation looks like: `{"C":<CONSTANT>}`. Non-string constants can also be represented as just the literal. Internally, numerical constants are represented as either an Integer, Long or Double value.
+* Function - Functions are expressions that represent processing of input data. There are two kinds of function expressions: transform and aggregation. There is also a special type of transform function called a predicate expression that always evaluates to true/false. Functions contain an argument list of expressions, which can include fields, constants and transform expressions. Aggregation expressions cannot be contained in another expression. Function expressions are represented in JSON as: `{"<FUNCTION_NAME>":[<EXPRESSION_LIST>]}`.
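+
+As an illustrative sketch (reusing the Device and Plays fields from the example query below), a select clause mixing all three expression types could look like this; the constant value and the nesting shown here are purely for demonstration:
+
+    {
+      "select": {
+        "expressions": [
+          "Device",
+          {"C": 100},
+          {"Sum": [{"If": [{">": ["Plays", 0]}, {"C": 1}, {"C": 0}]}]}
+        ]
+      }
+    }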
+
+### From
+
+The From clause tells SQE which input stream the query reads from. Its only field is "objectName". Typically this is the name of the Avro schema of the input data, though it may also support named input streams of tuples in the future.
+
+### Where
+
+The Where clause is an optional clause that filters the data in the query by specifying which data should be kept. It accepts a single predicate expression, but predicates can be chained together using operators such as And and Or, similar to other languages.
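+
+As a sketch, a where clause that only keeps plays on a particular device might look like the following (the "Mobile" value is hypothetical; Device and Plays come from the example query below):
+
+    "where": {"And": [{">": ["Plays", 0]}, {"=": ["Device", {"C": "Mobile"}]}]}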
+
+### No Group By?
+
+SQE automatically determines which expressions are key fields and which expressions are aggregate/value fields. If a top level expression in the Select clause is a field, constant or transform expression, then it is a key field for the purposes of aggregations and persisting state. If a top level expression is an aggregation expression, then it is a value field.
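+
+As a sketch of the SQL analogy, in the select clause below (field names borrowed from the example query) Device is a key field and the Sum is a value field, so the query behaves like `SELECT Device, SUM(Plays) ... GROUP BY Device`:
+
+    {
+      "select": {
+        "expressions": [
+          "Device",
+          {"Sum": ["Plays"]}
+        ]
+      }
+    }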
+
+### Example
+
+This is an example of a query of top-level device analytics by the minute. It includes sums for embeds, plays, completes, time watched, and ad impressions. Additionally, it creates a HyperLogLog bitmap of all UserTokens for play events. The input stream is filtered to only include data that has positive embeds, plays, or completes. The query uses the transactional RedisMapState to persist the results.
+
+    {
+      "Query": {
+        "insertInto":{
+          "objectName":"MinuteDeviceMeasures",
+          "stateName": "REDIS",
+          "stateType": "TRANSACTIONAL",
+          "fields":["DateTime","Device","Embeds","Plays","Completes","TimeWatched","AdImpressions","HllViewers"]
+        },
+        "select":{
+          "expressions":[
+            {"FormatDate":[{"ParseDate":["DateGMTISO",{"C":"yyyy-MM-dd HH:mm:ss Z"}]},{"C":"yyyy-MM-dd'T'HH"}]}, "Device",
+            {"Sum":["Embeds"]},{"Sum":["Plays"]},{"Sum":["Completes"]},{"Sum":["TimeWatched"]},{"Sum":["AdImpressions"]},
+            {"CreateHll":[{"If": [{"GreaterThan":["Plays",0]}, "UserToken",null]}]}
+          ]
+        },
+        "from":{"objectName":"com.jwplayer.analytics.avro.Ping"},
+        "where":{"Or":[{"Or":[{">":["Embeds",0]},{">":["Plays",0]}]},{">":["Completes",0]}]}
+      }
+    }
\ No newline at end of file
diff --git a/docs/storm-sqe-expressions.md b/docs/storm-sqe-expressions.md
new file mode 100644
index 0000000..38bc2d5
--- /dev/null
+++ b/docs/storm-sqe-expressions.md
@@ -0,0 +1,59 @@
+# Supported Expressions
+
+## Aggregate
+
+All aggregate expressions take a single argument and output a single value.
+
+* CreateHll - Creates a bitmap representing a HyperLogLog object using the Clearspring implementation
+* CreateHllp - Similar to CreateHll, but uses the HyperLogLogPlus algorithm
+* Max - Returns the maximum value in a set of objects. If the input values are numbers, clojure.lang.Numbers.max() is used, otherwise Java's Comparable interface is used to perform comparisons. Objects that do not implement Comparable are not supported.
+* Sum - Sums a set of numbers into a single value. Works similarly to Sum in SQL. Uses Trident's built in Sum aggregator.
+
+## Transform
+
+Conditionals
+
+* If - `[<PREDICATE>,<TRUE_VALUE>,<FALSE_VALUE>]` - Returns \<TRUE_VALUE\> if \<PREDICATE\> is true, otherwise returns \<FALSE_VALUE\>
+
+Date/Time
+
+* FormatDate - `[<DATE>,<OUTPUT_DATE_FORMAT>]` - Takes \<DATE\> and formats it into a string using \<OUTPUT_DATE_FORMAT\>. Uses Java's SimpleDateFormat and the date is adjusted to UTC.
+* GetTime - `[<DATE>]` - Turns a date object into a Long timestamp using the getTime method.
+* ParseDate - `[<DATE_STRING>,<INPUT_DATE_FORMAT>]` - Takes \<DATE_STRING\> and parses it into a Date object using \<INPUT_DATE_FORMAT\>. Uses Java's SimpleDateFormat and the date is adjusted to UTC.
+* RoundDate - `[<DATE>,<AMOUNT>,<UNIT>]` - Takes a date object and rounds it to the nearest \<AMOUNT\> \<UNIT\>. For example, if \<AMOUNT\> = 10 and \<UNIT\> = Second, then it will round to the nearest 10th second. Allowed units are Second, Minute, Hour, and Day. 
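+
+As a sketch of chaining the Date/Time expressions (the DateString field and its format come from the Fixed stream example; rounding to the nearest 5 minutes is just an illustration):
+
+    {"RoundDate": [{"ParseDate": ["DateString", {"C": "yyyy-MM-dd HH:mm"}]}, 5, {"C": "Minute"}]}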
+
+Math
+
+* `+` - Addition
+* `/` - Division
+* `%` - Modulus
+* `*` - Multiplication
+* `-` - Subtraction
+
+Predicate
+
+* = - `[<VALUE_A>,<VALUE_B>]` - Compares \<VALUE_A\> to \<VALUE_B\>. If both values are Numbers, clojure.lang.Numbers.equiv is used to perform the comparison. Otherwise, the equals method of \<VALUE_A\> is called.
+* In - Returns true if the first argument is equal to one of the remaining arguments. Uses clojure.lang.Numbers.equiv for Number objects, or the equals method for other object types. Similar to the SQL keyword of the same name.
+* Logical Operators - And, Not, Or, Xor - Each takes 2 or more arguments, except for Not, which takes only one. If more than 2 arguments are supplied for And/Or/Xor, the operator is chained, i.e. `{"And":[A,B,C,D]}` is the same as `{"And":[{"And":[{"And":[A,B]},C]},D]}`, which is the same as `A and B and C and D`.
+* Numerical Comparators - >, >=, <, <=, != - Each takes 2 arguments. Uses clojure.lang.Numbers to perform the comparisons.
+* RLike - `[<STRING>,<REG_EX>]` - Similar to RLike in MySQL or Hive. Uses the String method matches to evaluate the regular expression. The function caches compilations of the `<REG_EX>` values it sees, though typically that expression is a constant.
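+
+For example, a hedged sketch combining In and RLike (Device and UserName are fields from the examples above; the "Mobile"/"Tablet" values and the regular expression are hypothetical):
+
+    {"And": [
+      {"In": ["Device", {"C": "Mobile"}, {"C": "Tablet"}]},
+      {"RLike": ["UserName", {"C": "^Mr\\..*"}]}
+    ]}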
+
+Map
+
+* ExpandKeys - Takes a Java Map object and emits each key. Unlike most other transformation expressions, which add a single value onto a tuple, this takes an input tuple and creates a new tuple for each key.
+* ExpandValues - Takes a Java Map object and emits each value. Similar to ExpandKeys, this can create multiple tuples for each input tuple.
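+
+A minimal sketch, assuming a hypothetical map-valued field named Tags on the input stream:
+
+    {"ExpandKeys": ["Tags"]}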
+
+Other
+
+* Hash - `[<ALGORITHM_NAME>,<VALUE>]` - Hashes the value using the given algorithm. The hash is returned as a byte array. Supported algorithms:
+    * Murmur2 - Uses Clearspring's Murmur2 implementation. The output is always 8 bytes.
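+
+For example, a sketch that hashes the UserToken field with Murmur2 (assuming the algorithm name is passed as a string constant):
+
+    {"Hash": [{"C": "Murmur2"}, "UserToken"]}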
+
+## Adding new expressions
+
+Adding a new expression requires the following changes:
+
+* Sub-classing a new expression from the appropriate parent class (AggregationExpression, TransformExpression, or PredicateExpression) and overriding the abstract methods. These classes allow the query engine to parse and operate on queries and to interact with the input streams.
+* Updating the /META-INF/services/com.jwplayer.sqe.language.expression.FunctionExpression resource file to include the expression's class. You can also add "UDFs" to a project that includes SQE as a dependency by adding a new /META-INF/services/com.jwplayer.sqe.language.expression.FunctionExpression resource file in that project. Expression names used in queries are pulled from the getFunctionName method in each expression's class; these names are case insensitive to the parser. Because SQE already includes a version of this resource file, you'll need to use something like Maven's Shade plugin (https://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#AppendingTransformer) to merge the original resource file with your custom one so that both the base SQE expressions and your UDFs are available.
+* Creating new classes from the Trident BaseAggregator, CombinerAggregator, ReducerAggregator or BaseFunction classes. These classes operate directly on the stream. One or more expressions will parse to one of the supported expression types, which in turn interface with one or more Trident Aggregators or Functions.
+
+When writing new expressions, it shouldn't be important to know what types of expressions the arguments are. Field expressions are handled naturally because the values of that field come in through each tuple you process. Constant and Transform expressions that are nested in the argument list appear in tuples the same way a field does. Behind the scenes, SQE processes nested transform expressions first, so by the time a transform function higher up in an expression tree is processed, its transform-function arguments already appear as fields in the tuple.
\ No newline at end of file
diff --git a/docs/storm-sqe-options.md b/docs/storm-sqe-options.md
new file mode 100644
index 0000000..3c917a0
--- /dev/null
+++ b/docs/storm-sqe-options.md
@@ -0,0 +1,20 @@
+# Options
+
+## Topology Config
+
+The config file for the SQE Topology class is a YAML file that lets you specify both Storm and SQE options. Its format looks like:
+
+    Storm:
+        topology.workers: 2
+        other.storm.option: value
+    SQE:
+        jw.sqe.parallelism.hint: 10
+        jw.sqe.spout.kafka.zkhosts:
+            - zookeeper:2181
+        other.sqe.option: value
+
+The key/value pairs in the Storm section are added to the Topology config. Pairs in the SQE section are added to the SQE config. Options specified in a command file (either through the SET command or within an options section) override options specified in the config file.
+
+## Query Engine
+
+* jw.sqe.parallelism.hint - This determines the parallelism for each input stream. So a parallelism hint of 20 with one input stream plus the primary processing bolt would translate into 40 executors. This does not affect the number of executors for downstream global aggregations or bolts used for persisting state; these bolts currently get a single executor each.
\ No newline at end of file
diff --git a/docs/storm-sqe-replay-filtering.md b/docs/storm-sqe-replay-filtering.md
new file mode 100644
index 0000000..8e150b0
--- /dev/null
+++ b/docs/storm-sqe-replay-filtering.md
@@ -0,0 +1,33 @@
+# Replay Filtering
+
+Replay filtering in SQE is a lightweight mechanism for removing replays when reading from a data source that can't be written to using the standard transactional or opaque semantics. Currently, this means writing to Kafka and later removing replays while reading from that data source in downstream topologies, though the solution here can potentially be used with other data stores, streams and states. The solution to this problem is inspired by the Kafka idempotent producer proposal [here](https://cwiki.apache.org/confluence/display/KAFKA/Idempotent+Producer).
+
+## How it works
+
+You can read a more detailed explanation in the link above, but the idea is pretty simple. Let's say you have two topologies. The first reads from topic 1 and writes to topic 2. The second reads from topic 2 and writes to a transactional data store. The first topology can replay records and write them into Kafka multiple times, for example when a batch fails.
+
+What we want to do is ignore those replays in the second topology. We can do this if the first topology meets the following criteria:
+
+* Tuples are read and processed in the order they appear in the data source. For Kafka, this means reading and processing the records in offset order. If the spout reads a batch of data and emits it in a random order, then replays cannot be removed using this method.
+* Partitioning is deterministic and maintained between the data source, inside Storm, and in the data store. This means no re-partitioning of streams inside Storm. We include a special partitioner that determines the destination partition based on the source partition. If the destination topic has the same number of partitions (or more), the source partition should equal the destination partition. If the destination has fewer partitions, then destination partition = source partition % number of destination partitions (e.g. source partition 7 written to a 4-partition topic goes to partition 3).
+
+With the above, we are able to filter replayed messages by tracking a "highwater mark" (offset) per partition. We also include a PID that identifies the producer and source, which can be an SQE topology or another producer. This allows us to track different highwater marks for different producers and sources. If we see a record with an offset that is below the current highwater mark for its PID and partition, we know we've already processed that record and we can toss the replay. If instead the offset is above the highwater mark, we emit the record normally and update the highwater mark.
+
+To track the needed information, SQE adds a stream metadata object to each tuple. This metadata includes a PID (represented as a long), a partition (integer), and a message offset (long). For a Kafka stream, the PID is generated by hashing the topology name, stream name, ZK hosts, and ZK path together. Even though partition and offset map directly to the same concepts in Kafka, this can be extended to other streams/spouts. For storage in Kafka, the stream metadata is converted to a 20-byte array in PID, partition, offset order.
+
+In the future, we may be able to extend this to (partition) local aggregations, as long as we can produce a single piece of metadata from multiple tuples (group on PID and partition, max on offset?) and maintain offset ordering for each aggregation batch. This is not supported at the moment, however.
+
+## How to enable replay filtering
+
+### State Options
+
+Set the following options for the Kafka state to record the stream metadata and to ensure records are written to the appropriate partitions:
+
+* jw.sqe.state.kafka.keytype - StreamMetadata
+* jw.sqe.state.kafka.partitionClass - com.jwplayer.sqe.language.state.kafka.SourcePartitionPartitioner
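+
+In the producing query's InsertInto clause, these correspond to an options block like the following sketch (see the Kafka state example on the States page):
+
+    "options": {
+      "jw.sqe.state.kafka.keytype": "StreamMetadata",
+      "jw.sqe.state.kafka.partitionClass": "com.jwplayer.sqe.language.state.kafka.SourcePartitionPartitioner"
+    }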
+
+### Stream Options
+
+If the stream metadata is included in the key of each record (records with NULL or invalid keys don't get filtered), all that needs to be done on the stream is to enable replay filtering:
+
+* jw.sqe.spout.kafka.filterReplays - true
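+
+In the consuming topology's CreateStream command, this is a single entry in the options block, sketched below (assuming the value is passed as a boolean):
+
+    "options": {
+      "jw.sqe.spout.kafka.filterReplays": true
+    }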
\ No newline at end of file
diff --git a/docs/storm-sqe-states.md b/docs/storm-sqe-states.md
new file mode 100644
index 0000000..c620f04
--- /dev/null
+++ b/docs/storm-sqe-states.md
@@ -0,0 +1,123 @@
+# States
+
+## Kafka
+
+SQE uses a slightly modified version of the storm-kafka project for its Kafka state implementation. The objectName in the InsertInto clause is the topic that messages are written to. Depending on the particular key type that is used, this state accepts one or two fields, which correspond to the key and value written to the Kafka topic.
+
+### Options
+
+Most, but not all, of these options map to the underlying Kafka producer options.
+
+* jw.sqe.state.kafka.brokers - An array of Kafka brokers. Maps to metadata.broker.list as a comma delimited string.
+* jw.sqe.state.kafka.serializerClass (default: org.apache.kafka.common.serialization.ByteArraySerializer) - Maps to value.serializer
+* jw.sqe.state.kafka.key.serializerClass (default: org.apache.kafka.common.serialization.ByteArraySerializer) - Maps to key.serializer
+* jw.sqe.state.kafka.partitionClass (default: org.apache.kafka.clients.producer.internals.DefaultPartitioner) - Maps to partitioner.class
+* jw.sqe.state.kafka.producerType (default: async) - Maps to producer.type
+* jw.sqe.state.kafka.request.requiredAcks (default: 1) - Maps to acks
+* jw.sqe.state.kafka.keytype (default: MessageHash) - This determines how the key for each Kafka record is created. The three options are:
+    * Field - Two fields are required: the first is used as the key and the second as the value. (With the other key types, only one field is allowed; it is written as the value and the key is generated.)
+    * MessageHash - Generates the key by hashing (using Murmur2) the value 
+    * StreamMetadata - Generates a 20 byte key from the internal metadata SQE attaches to each tuple. This is useful for lightweight replay filtering in downstream SQE topologies. See [Replay Filtering](https://github.com/jwplayer/sqe/wiki/replay-filtering) for more information.
+
+### Example
+
+    "insertInto": {
+      "objectName":"ping-avro-sqe-all-pings-test",
+      "stateName": "Kafka",
+      "stateType": "NON_TRANSACTIONAL",
+      "fields" : ["message"],
+      "jw.sqe.state.kafka.keytype": "StreamMetadata",
+      "jw.sqe.state.kafka.producerType": "sync",
+      "jw.sqe.state.kafka.partitionClass": "com.jwplayer.sqe.language.state.kafka.SourcePartitionPartitioner",
+      "jw.sqe.state.kafka.brokers": ["mykafka-01", "mykafka-02"]
+    }
+
+## MongoDB
+
+SQE uses an internal map state implementation to persist data to MongoDB. The objectName in the InsertInto clause is the collection to write to and the fields are the field names in each record. One record is written per value field for each set of keys. The _id field is generated from a hash of the keys plus the value field name, written into a byte array. The order of the keys here matches the order of the key field names in the InsertInto clause. For transactional and opaque states, value fields are written as sub-documents containing the appropriate transaction ID and other fields.
+
+### Options
+
+* jw.sqe.state.mongodb.cachesize (default: 5000) - The maximum number of objects in the map cache
+* jw.sqe.state.mongodb.db - The database to write to
+* jw.sqe.state.mongodb.hosts - A list of hostnames to connect to
+* jw.sqe.state.mongodb.password - The password
+* jw.sqe.state.mongodb.port (default: 27017) - The port
+* jw.sqe.state.mongodb.replicaSet - The replica set name
+* jw.sqe.state.mongodb.userName - The username
+
+### Example
+
+    "insertInto":{
+      "objectName":"DailyAcctEPCUniq",
+      "stateName": "MONGO",
+      "stateType": "TRANSACTIONAL",
+      "fields":["date","accounttoken","ttlmarker","uniques","embeds","plays","completes","tw"]
+      "options": {
+        "jw.sqe.state.mongodb.db": "mydb", 
+        "jw.sqe.state.mongodb.hosts": ["mymongodb.com"],
+        "jw.sqe.state.mongodb.password": "password",
+        "jw.sqe.state.mongodb.replicaSet": "myreplicaset", 
+        "jw.sqe.state.mongodb.userName": "bob"
+      }
+    }
+
+## Redis
+
+SQE uses a modified version of the storm-redis project for its Redis map state. Query data can be persisted as Redis strings or hashes, with string being the default. SQE uses a custom Gson-based serializer. For transactional states, the format is JSON: `[<TXID>,<VALUE_TYPE>,<VALUE>]`. Opaque states are formatted like: `[<TXID>,<VALUE_TYPE>,<VALUE>]`. Currently supported value types are:
+
+* N - null
+* I - Integer
+* L - Long
+* F - Float
+* D - Double
+* S - String
+* B - Byte Array
+
+An example transactional value is: `[6491,"I",0]`
+
+How the InsertInto fields are translated into Redis data structures depends on the Redis data type being used. The two supported types are string and hash. 
+
+### String
+
+String is the default data type used by SQE. A custom KeyFactory is used that formats the key for each aggregated value like: `<OBJECT_NAME>:<KEY_FIELD1>:...:<KEY_FIELDN>:<VALUE_FIELD>`. ':' is the default delimiter, but it can be configured (see the Options section below). For example, if you have the key field "Timestamp" and value field "AdImpressions" in your InsertInto clause, a key/value pair in Redis could look like:
+
+* Key: `HourlyEvents:2015-05-18T20:AdImpressions`
+* Value: `[1234,"I",5678]`
+
+This is the simplest way to send data to Redis. However, for large amounts of data, it's often better to format the data using the hash data type. 
+
+### Hash
+
+This data type splits the InsertInto key fields between the hash's key name and field name based on the "jw.sqe.state.redis.keyname.fields" and "jw.sqe.state.redis.fieldname.fields" options in the query (see the Options section below). For example, if you have the following InsertInto clause in your query:
+
+    "insertInto":{
+      "objectName":"TestQuery",
+      "stateName": "REDIS",
+      "stateType": "TRANSACTIONAL",
+      "fields":["Account","Timestamp","Domain","Views","Clicks"],
+      "options": {
+        "jw.sqe.state.redis.datatype": "HASH",
+        "jw.sqe.state.redis.keyname.fields": ["Account","Timestamp"],
+        "jw.sqe.state.redis.fieldname.fields": ["Domain"]
+      }
+    }
+
+The data in Redis would look like:
+
+* Key Name: `TestQuery:ABCD-1234:2015-05-18T20:`
+* Field Name: `bob.com:Views`
+* Value: `[1234,"I",5678]`
+
+Another way to think of this is that you are splitting the standard Redis string representation in two: the first half is used for the Redis key name, the second for the Redis field name. The objectName always appears first in the Redis key name, and the value field name always appears at the end of the Redis field name. The same delimiter option is used as for Redis strings.
+
+### Options
+
+* jw.sqe.state.redis.database (default: 0) - The Redis DB to write to
+* jw.sqe.state.redis.datatype (default: STRING) - The Redis data type to use for storing the queried data in Redis. The currently available options are STRING and HASH. When using the HASH data type, you should also include the "jw.sqe.state.redis.fieldname.fields" and "jw.sqe.state.redis.keyname.fields" options so the RedisMapState knows how to build the key names and field names from the given keys from the query.
+* jw.sqe.state.redis.delimiter (default: :) - The delimiter used by the KeyFactory
+* jw.sqe.state.redis.expireintervalsec (default: 0) - Sets the expiration TTL for all keys. No TTL is set if this is 0.
+* jw.sqe.state.redis.fieldname.fields - A list of key fields used to create the field name in a Redis hash
+* jw.sqe.state.redis.host - The host of the Redis server
+* jw.sqe.state.redis.keyname.fields - A list of key fields used to create the key name in a Redis hash
+* jw.sqe.state.redis.port (default: 6379) - The port of the Redis server
\ No newline at end of file
diff --git a/docs/storm-sqe-streams.md b/docs/storm-sqe-streams.md
new file mode 100644
index 0000000..8d39dfa
--- /dev/null
+++ b/docs/storm-sqe-streams.md
@@ -0,0 +1,104 @@
+# Streams
+
+Streams are ways of accessing data from data stores as streams of tuples. Some spouts emit data that may not be immediately useful to an SQE query. For example, the Kafka spout emits tuples that contain a message field. These messages may be serialized data or otherwise contain complex information and objects you want to access. Deserializers allow us to access this data and pull out fields that can then be queried against. See more information about deserializers below.
+
+## Fixed
+
+SQE uses the FixedBatchSpout provided by Storm for testing. This allows you to write a list of fields and values that will be emitted onto the stream as part of a query.
+
+### Options
+
+* jw.sqe.spout.fixed.fields - The fields that are emitted onto the stream as an array 
+* jw.sqe.spout.fixed.values - An array of value arrays that the spout emits onto the stream. Each array is emitted as a tuple and the order of values in each array corresponds to the order of the fields.
+
+### Example
+
+    {
+      "CreateStream": {
+        "streamName": "big.query.data",
+        "objectName": "big.query.data",
+        "spoutName": "FIXED",
+        "spoutType": "NON_TRANSACTIONAL",
+        "options": {
+          "jw.sqe.spout.fixed.fields": ["DateString", "AccountToken", "UserName", "HappyDanceCount"],
+          "jw.sqe.spout.fixed.values": [
+            ["2015-05-01 00:00", "Account1", "Joe", 1],
+            ["2015-05-01 01:00", "Account1", "Bob", -1],
+            ["2015-05-01 02:00", "Account1", "Susy", 1],
+            ["2015-05-01 03:00", "Account1", "Mr. Fancy Pants", 1],
+            ["2015-05-01 04:00", "Account1", "Joe", -1],
+            ["2015-05-01 05:00", "Account1", "Bob", 1],
+            ["2015-05-01 06:00", "Account1", "Susy", 1],
+            ["2015-05-01 07:00", "Account1", "Mr. Fancy Pants", 1],
+            ["2015-05-01 08:00", "Account2", "Joe", 1],
+            ["2015-05-01 09:00", "Account2", "Bob", 1],
+            ["2015-05-01 10:00", "Account2", "Susy", -1],
+            ["2015-05-01 11:00", "Account2", "Mr. Fancy Pants", 1],
+            ["2015-05-01 12:00", "Account2", "Joe", 1],
+            ["2015-05-01 13:00", "Account2", "Bob", 1],
+            ["2015-05-01 14:00", "Account2", "Susy", 1],
+            ["2015-05-01 15:00", "Account2", "Mr. Fancy Pants", -1]
+          ]
+        }
+      }
+    }
+
+## Kafka
+
+The Kafka stream type allows you to read data from a Kafka topic onto a stream. Typically, though not necessarily, you will use a deserializer on the message to create the appropriate fields on the stream. The object name in the CreateStream command is the topic you are reading from. By default, without a deserializer, the key and value are output under the fields _key and _value, respectively.
+
+### Options
+
+* jw.sqe.spout.kafka.zkhosts - An array of Zookeeper hosts, including the port. This is used to locate the Kafka cluster.
+* jw.sqe.spout.kafka.clientid - The client ID used by the Kafka spout
+* jw.sqe.spout.kafka.bufferSizeBytes (optional)
+* jw.sqe.spout.kafka.fetchSizeBytes (optional)
+* jw.sqe.spout.kafka.maxOffsetBehind (optional)
+* jw.sqe.spout.kafka.filterReplays (default: false) - Enables replay filtering based on stream metadata recorded in the key of each record. See [Replay Filtering](https://github.com/jwplayer/sqe/wiki/replay-filtering) for more information.
+* jw.sqe.spout.kafka.filterReplays.metadata.ttl (default 172800) - The TTL (in seconds) of any individual metadata/highwater marks recorded by the replay filtering functionality. This prevents new PIDs from accumulating over time.
+
+### Example
+
+    {
+      "CreateStream": {
+        "streamName": "big.query.data",
+        "objectName": "my-topic",
+        "spoutName": "KAFKA",
+        "spoutType": "TRANSACTIONAL",
+        "deserializer": "avro"
+        "options": {
+          "jw.sqe.spout.kafka.zkhosts": ["zk-01.host.com:2181","zk-02.host.com:2181"],
+          "jw.sqe.spout.kafka.clientid": "my-client-id",
+          "jw.sqe.spout.deserializer.avro.schemaname": "my.avro.schema"
+        }
+      }
+    }
+
+# Deserializers
+
+Deserializers allow us to take serialized or otherwise packed data that is emitted by a spout, parse it, and split its constituent parts into fields on the tuple. Kafka is a good example, since it stores data packed as a message that can be formatted in any number of ways, such as Avro. If no deserializer is specified, then the data remains on the stream as it is emitted by the spout, without any pre-processing between the spout and the queries.
+
+## Avro
+
+The Avro deserializer takes an Avro record as a byte array along with a schema. The fields needed by the queries are pulled from each record and added to each tuple. For example, suppose messages are stored in Kafka in Avro with the following schema:
+
+    {
+      "fields": [
+        {"name": "DateTime", "type": "string"},
+        {"name": "Embeds", "type": "int"},
+        {"name": "Plays", "type": "int"},
+        {"name": "Completes", "type": "int"},
+        {"name": "AdImpressions", "type": "int"},
+        {"name": "TimeWatched", "type": "long"}
+      ],
+      "name": "HourlyMeasures",
+      "namespace": "com.jwplayer.analytics.avro",
+      "type": "record"
+    }
+
+Now suppose a query references the DateTime, Embeds, and Plays fields. Each byte array message is deserialized into an Avro record, and those three fields are added to the tuple under the same names as in the Avro schema.
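+
+As a sketch, such a query could reference the schema fields directly in its select and from clauses (the Sum aggregations here are just for illustration):
+
+    "select": {"expressions": ["DateTime", {"Sum": ["Embeds"]}, {"Sum": ["Plays"]}]},
+    "from": {"objectName": "com.jwplayer.analytics.avro.HourlyMeasures"}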
+
+### Options
+
+* jw.sqe.spout.deserializer.avro.schemaname - The name of the Avro schema used to deserialize the byte array. The full namespace should be included. This should point to a Java class that implements SpecificRecord.
+**TODO: Allow specifying avro schemas either inline or from another file as another way of deserializing Avro objects**
\ No newline at end of file
diff --git a/docs/storm-sqe-using.md b/docs/storm-sqe-using.md
new file mode 100644
index 0000000..e96ffc8
--- /dev/null
+++ b/docs/storm-sqe-using.md
@@ -0,0 +1,15 @@
+# Using SQE
+
+Using the SQE topology builder is straightforward. You just launch the topology using Storm with appropriate command line options:
+
+    storm jar sqe.jar com.jwplayer.sqe.Topology --config ./conf/conf.yaml --commands ./commands/commands1.json /my/commands/commands2.json --name=my-topology
+
+The locations of config and command files can be specified using a URI. For example:
+
+    file:///conf/conf.yaml
+
+You can see all of the command line options by running:
+
+    java -cp sqe.jar com.jwplayer.sqe.Topology --help
+
+Refer to the [Commands](storm-sqe-commands.html) section for more information on the JSON command format and the [Options](storm-sqe-options.html) section for more information on the options and config file format.
\ No newline at end of file
diff --git a/docs/storm-sqe.md b/docs/storm-sqe.md
new file mode 100644
index 0000000..f107761
--- /dev/null
+++ b/docs/storm-sqe.md
@@ -0,0 +1,28 @@
+# Streaming Query Engine
+
+SQE is a query engine written using Storm's Trident framework that takes a set of SQL-like commands, including queries, and runs them against one or more input streams. By using Trident, input streams are processed in micro-batches with both good latency and high throughput while guaranteeing "exactly-once" processing. Results can be returned through a list of output streams or SQE can handle persisting to a data store directly using one of its supported Trident states. SQE is designed to make it easy to query against large streams of data with good performance for many different use cases.
+
+* [Using SQE](storm-sqe-using.html)
+* [Options](storm-sqe-options.html)
+* [Commands](storm-sqe-commands.html)
+* [Expressions](storm-sqe-expressions.html)
+* [States](storm-sqe-states.html)
+* [Streams](storm-sqe-streams.html)
+* [Replay Filtering](storm-sqe-replay-filtering.html)
+
+## Potential Future Features
+
+* More expressions - String expressions, Average, etc.
+* Query planning and optimization
+* Better state/stream support and optimization
+* More options
+* Support for joins
+* Support for sub-queries, in-line queries and query chaining
+* Split out streams, states and other hard-coded factories into configuration files. Factories should use the appropriate file to create the appropriate objects. Then users can add additional expressions, states, etc. by building a jar with SQE as a dependency and including their own versions of the configuration files with their new expressions, states, etc. (Note: we already do this for expressions.)
+
+## Links
+
+* [Trident State](http://storm.apache.org/documentation/Trident-state.html) - Very useful for understanding how state and persisting data works in Trident
+* [Trident API Overview](http://storm.apache.org/documentation/Trident-API-Overview.html) - Basic overview of Trident. Not necessary to know to use SQE, but still helpful.
+* [Squall](https://github.com/epfldata/squall) - "A streaming / online query processing / analytics engine based on Apache Storm" - Another project, from EPFL Data, that does complex SQL-like queries on top of Storm. Definitely something to keep an eye on, though the use cases may be different. It doesn't support "exactly-once" processing yet, something we do by using Trident.
+