site/docs/introduction.md

layout: docs title: Introduction permalink: /docs/introduction/

Pig

One of the big reasons that Apache Pig exists is to provide a much lower-cost entry point to running MapReduce. Writing a MapReduce job typically ends up being expression in hundreds of lines of code to solve problems that are often categorized as embarrassingly parallel. As such, these problems are typically easy to think about conceptually and can be used as a point of introspection to a data set or combination of data sets. Thus, it doesn't make sense to write large amounts of Java code for what may be repeated one-off questions.

Accumulo

Accumulo is one of many storage solutions in the Apache Hadoop ecosystem, so it makes sense that Pig should also be able to read/write data from/to Accumulo. Very similar to Apache HBase, Accumulo is a Key-Value datastore, where a Key is made up of multiple parts.

Data Model

This data model lends itself well to working with columnar data, handling changing, sparse column definitions on the fly. As new columns are created, Accumulo can process these automatically without any user intevention. Many columns can exist across many rows, and rows can contain different sets of columns. This can be thought of as storing a Map of data in each row.

File Storage

Accumulo provides many other desirable features when it comes to data management over “hand rolled” solutions using flat-files in HDFS.

Sorted

All data stored in Accumulo is sorted lexicographically. New data which is written to Accumulo is also written in sorted order, while reads against Accumulo performed a merged-read against these sorted streams of data to provide a globally sorted view over the table. This feature opens the door to many algorithms that can run efficiently over sorted data sets.

Tablets

Accumulo organizes data in tables. Each table is made up of multiple tablets. Each tablet contains at minimum one row and can be composed of many files in HDFS. In practice, most tablets in Accumulo will contain many rows. As new data is inserted into a table, Accumulo will manage how this data is written to HDFS. As new data is ingested and written to a tablet, that tablet will eventually split from one tablet into many. As these splits occurs, Accumulo manages each of these files in HDFS for you transparantly alleviating the necessity to implement data retention and organizational logic in the application.

Indexing

In addition to the trivial “map in row” layout, more advanced table schemas exist such as inverted indexes, document-partitioned indexes, and edge lists to name a few. All of these can be expressed using the same “5 tuple” Key data model that Accumulo provides.

Next, how to use Pig to manipulate “map in row” datasets in Accumulo.