| <HTML> |
| <BODY> |
Pig is a platform for data-flow programming on large data sets in a parallel
environment. It consists of a language for specifying these programs,
<a href="http://wiki.apache.org/pig/PigLatin">Pig Latin</a>,
a compiler for this language, and an execution engine to execute the programs.
| <p> |
Pig runs on <a href="http://hadoop.apache.org/core/">Hadoop</a>
| MapReduce, reading data from and writing data to HDFS, and doing processing via |
| one or more MapReduce jobs. |
| |
| <h2> Design </h2> |
This section gives a very high-level overview of the design of the Pig system.
Throughout the javadocs you can find the design for a particular package or class
by looking for the Design heading in its documentation.
| |
| <h3> Overview </h3> |
| <p> |
Pig's design is guided by our <a href="http://hadoop.apache.org/pig/philosophy.html">
Pig philosophy</a>.
| <p> |
Pig's design shares many similarities with that of a traditional RDBMS: it has a
parser, type checker, optimizer, and operators that perform the data processing.
However, there are some significant differences: Pig does not have a data catalog,
it has no transactions, it does not directly manage data storage, and it does not
implement its own execution framework.
| <p> |
| <h3> High Level Architecture </h3> |
| Pig is split between the front and back ends of the engine. In the front end, |
| the parser transforms a Pig Latin script into a logical plan. |
| Semantic checks (such |
| as type checking) and some optimizations (such as determining which fields in the data need |
to be read to satisfy the script) are done on this logical plan. The logical
plan is then transformed into a
| {@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan}. |
This physical plan contains the operators that will be applied to the data. It is then
| divided into a set of MapReduce jobs by the |
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler} into an |
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.MROperPlan}. This |
MROperPlan (also known as the MapReduce plan) is then optimized (for example, the combiner is used
where possible, and jobs that scan the same input data are combined where possible). Finally, a set
of MapReduce jobs is generated by the
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler}. These are |
| submitted to Hadoop and monitored by the |
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher}. |
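<p>
As an illustration of these stages, {@link org.apache.pig.PigServer} can print the
logical, physical, and MapReduce plans for a query via its <code>explain</code>
method. The script, paths, and the exact <code>explain</code> overload below are
assumptions for this sketch and may vary by release:
<pre>{@code
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ExplainSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("A = LOAD 'input.txt' AS (name:chararray, age:int);");
        pig.registerQuery("B = GROUP A BY age;");
        // Dumps the logical plan, physical plan, and MapReduce plan for alias B.
        pig.explain("B", System.out);
    }
}
}</pre>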
| <p> |
On the back end, each
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Map}, |
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner.Combine}, and |
| {@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce} |
uses the pipeline of physical operators constructed in the front end to load, process, and store
data.
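<p>
The sketch below is conceptual only: the real physical operators use a
getNext-style pull protocol with status codes, so the tiny interface and class
names here are hypothetical, meant only to show the pull-based pipeline idea in
which each operator requests records from its input:
<pre>{@code
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for a physical operator: each call to next() returns
// one record, or null once the input is exhausted.
interface Op {
    String next();
}

// Stands in for the load side of a map task.
class Source implements Op {
    private final Iterator<String> it;
    Source(String... records) { this.it = Arrays.asList(records).iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Stands in for an evaluation operator; pulls from its upstream operator.
class UpperOp implements Op {
    private final Op input;
    UpperOp(Op input) { this.input = input; }
    public String next() {
        String rec = input.next();
        return rec == null ? null : rec.toUpperCase();
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        Op pipeline = new UpperOp(new Source("a", "b"));
        // Each map or reduce task drains its operator pipeline like this.
        for (String rec = pipeline.next(); rec != null; rec = pipeline.next()) {
            System.out.println(rec);
        }
    }
}
}</pre>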
| |
| <h3> Programmatic Interface </h3> |
| <p> |
In addition to the command line and Grunt shell interfaces, users can connect to Pig
from a Java program via the {@link org.apache.pig.PigServer} class.
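<p>
For example, a Java program might run a small Pig Latin script like this (the
input path, schema, and output path are hypothetical):
<pre>{@code
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigClient {
    public static void main(String[] args) throws Exception {
        // Use ExecType.LOCAL to run without a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("A = LOAD 'input.txt' AS (name:chararray, age:int);");
        pig.registerQuery("B = FILTER A BY age >= 18;");
        // store() triggers compilation into MapReduce jobs and executes them.
        pig.store("B", "adults_out");
        pig.shutdown();
    }
}
}</pre>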
| <p> |
Pig makes it easy for users to extend its functionality by implementing User Defined Functions
(UDFs). There are interfaces for defining functions to load data
({@link org.apache.pig.LoadFunc}), store data ({@link org.apache.pig.StoreFunc}), evaluate
fields (including collections of data, so user-defined aggregates are possible)
({@link org.apache.pig.EvalFunc}), and filter data ({@link org.apache.pig.FilterFunc}).
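<p>
As a minimal sketch of an evaluation function (modeled on the classic
<code>UPPER</code> example), the UDF below upper-cases the first field of each
input tuple; the package name is hypothetical:
<pre>{@code
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Upper extends EvalFunc<String> {
    // Called once per input tuple; returns the upper-cased first field.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            String str = (String) input.get(0);
            return str == null ? null : str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Error processing input row", e);
        }
    }
}
}</pre>
Such a function is registered from a script with <code>REGISTER myudfs.jar;</code>
and invoked as, for example, <code>B = FOREACH A GENERATE myudfs.Upper(name);</code>.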
| </BODY> |
| </HTML> |
| |