site/_posts/2016-06-28-file-format-talk.md

---
layout: news_item
title: "File format benchmark"
date: "2016-06-28 08:00:00 -0800"
author: omalley
categories: [talk]
---

I gave a talk at Hadoop Summit San Jose 2016 about a file format benchmark that I've contributed as ORC-72. The benchmark focuses on real data sets that are publicly available. The data sets represent a wide variety of use cases:

* NYC Taxi Data - very dense data with mostly numeric types
* Github Archives - very sparse data with a lot of complex structure
* Sales - a real production schema from a sales table with a synthetic generator

The benchmarks look at a set of three very common use cases:

* Full table scan - read all columns and rows
* Column projection - read some columns, but all of the rows
* Column projection and predicate push down - read some columns and some rows
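
To make the three access patterns concrete, here is a minimal Python sketch over a toy in-memory columnar layout. It is not the ORC-72 benchmark code and not the API of any of the formats. The `Stripe` class, the function names, and the sample taxi-like values are all hypothetical, and the min/max values are computed on the fly, whereas a real columnar format would typically store such statistics when the file is written.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Hypothetical row group: one list of values per column."""
    columns: dict  # column name -> list of values

    def num_rows(self):
        return len(next(iter(self.columns.values())))

    def min_max(self, name):
        # Assumption: computed here for simplicity instead of being stored.
        values = self.columns[name]
        return min(values), max(values)


def project(stripes, names):
    """Column projection: read only the named columns, but every row."""
    rows = []
    for stripe in stripes:
        for i in range(stripe.num_rows()):
            rows.append(tuple(stripe.columns[n][i] for n in names))
    return rows


def full_scan(stripes):
    """Full table scan: read all columns and all rows."""
    return project(stripes, list(stripes[0].columns))


def project_with_pushdown(stripes, names, filter_col, lo, hi):
    """Projection plus predicate push down for lo <= filter_col <= hi.

    Stripes whose min/max range cannot satisfy the predicate are skipped
    without touching their rows. Returns (rows, number of skipped stripes).
    """
    rows, skipped = [], 0
    for stripe in stripes:
        smin, smax = stripe.min_max(filter_col)
        if smax < lo or smin > hi:
            skipped += 1
            continue
        for i in range(stripe.num_rows()):
            if lo <= stripe.columns[filter_col][i] <= hi:
                rows.append(tuple(stripe.columns[n][i] for n in names))
    return rows, skipped


# Made-up sample data, loosely shaped like taxi trips.
stripes = [
    Stripe({"trip_id": [1, 2, 3], "fare": [5.0, 7.5, 6.0], "passengers": [1, 2, 1]}),
    Stripe({"trip_id": [4, 5, 6], "fare": [20.0, 25.0, 22.5], "passengers": [3, 1, 2]}),
]

all_rows = full_scan(stripes)
fares = project(stripes, ["fare"])
matches, skipped = project_with_pushdown(stripes, ["trip_id", "fare"], "fare", 20.0, 23.0)
```

In this toy example the third call never reads the rows of the first stripe, because its fare range lies entirely below the predicate. That skipped work is the kind of saving the third use case is meant to measure.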

You can see the slides here:

File Format Benchmarks: Avro, JSON, ORC, & Parquet