src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/package.html - pig - Git at Google

 <html>
 <body>

 <p>
 Implementation of physical operators that use hadoop as the execution engine
 and data storage.

 <h2> Design </h2>
 <p>
 Physical operators use the operator, plan, visitor, and optimizer framework
 provided by the {@link org.apache.pig.impl.plan} package.
 <p>
 As with {@link org.apache.pig.impl.logicalLayer}, physical operators consist
 of {@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators} and
 {@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators}.  In many data
 processing systems relational operators and expression operators are modeled
 as different entities because they behave differently.  Pig blurs, though does
 not entirely remove, this distinction because of its support for nested
 operations.
 <p>
 Conceptually, relational operators work on an entire relation (in Pig's case,
 a bag).  In terms of implementation, they operate on one record (tuple) at a
 time.  This avoids needing to load the entire relation into memory before
 operating on it.
 <p>
 Expression operators, on the other hand, operate on the assumption that they
 are provided their entire input at invocation time and provide their entire
 output when they are finished.
 <p>
 Pig's hadoop implementation implements a pull based model, where each operator
 calls getNext() on the operator before it in the plan.  getNext() is
 implemented for each of the different data types, so that operators can
 request the data type they expect.  Relational operators will always expect a
 tuple.  Expression operators can request any data type.
 <p>
 As with the logical plan, physical relational operators often have embedded
 physical plans.  When a relational operator calls getNext() on its predecessor
 and receives a tuple, it will attach that tuple to its embedded physical plan(s)
 and then call getNext() on the root node(s) of those plan(s) in order to get the
 output.  For example, the Pig Latin <code>filter A by $0 != 5</code> will
 produce a POFilter object, with an embedded physical plan that consists of
 POProject(0), POConst(5), both attached to PONotEqual.  Each time
 POFilter.getNext() is called, it will call its predecessors getNext() method,
 and then attach the input to POProject and POConst.  It will then call
 PONotEqual.getNext().  PONotEqual will in turn call POProject.getNext() and
 POConst.getNext(), and then evaluate and return the results.  If the result is
 true, POFilter will return its input tuple.
 If the answer is false, it will call it's predecessor's getNext() method and
 try again.
 <p>
 Given Pig's nested data and execution models, there are places it is necessary
 to move between relational and expression operators.  Consider the following
 Pig Latin script:
 <code>
 A = load 'myfile';
 B = group A by $0;
 C = foreach B {
     C1 = filter $1 by $0 &gt; 0;
     C2 = distinct C1;
     generate group, COUNT(C2), SUM(C1.$0);
 }
 </code>
 In particular, the foreach section presents some interesting challenges.
 <p>
 First, foreach has three separate outputs, all of which require separate but
 parallel executions.  To address this, each element of the foreach is described by a
 separate embedded plan.  This can cause duplication of
 operations, as in this plan.
 In this case splitting the plans for COUNT and SUM cause a double execution of
 the <code>C1 = filter</code> section of the script.  But it avoids needing to
 place a split operator between filter -&gt; distinct and filter -&gt; SUM.
 <p>
 The second issue presented by the nested logic is that the
 foreach operator is going to receive a tuple with the format ($0, bag), where
 bag is a collection of all the tuples with a given value for $0.  It will then
 attach that to the filter.  But filter does not expect a bag.  It expects
 to get tuples.  On the other end, distinct will be outputing tuples.  But
 COUNT() expects C2 to be a bag that can be processed by COUNT as a whole.
 <p>
 To address this issue, some operators have been modified to provide
 "bookend" functionality.  That is, the ability to translate between relational
 and expression operators.
 The embedded plan for calculating the COUNT in the foreach will
 look like:  POProject(1) -&gt; PODistinct -&gt; POProject(*) -&gt; COUNT().
 The first POProject(1) will have a bag attached as its input by POForeach.
 But POFilter will call getNext(Tuple).  In this case, POProject will know to
 open the bag and provide the tuples one at a time, until the bag is empty, at
 which point it will return STATUS_EOP.  The PODistinct will be expecting to
 return tuples, but POProject(*) will call getNext(bag).  In this case all
 relational operators will be able to accumulate all of the tuples by calling
 getNext(tuple) on themselves until they see STATUS_EOP, packaging those tuples
 into a bag, and then returning that bag.
 <p>
 And third, project is being subtly overloaded here.  In cases where the script
 says <code>C = foreach B generate $1</code>, this type of projection means take the second
 element from the tuple and project it.  But in cases like <code>C = foreach B
 generate SUM($1.$0)</code> and $1 is a bag, this type of projection expects to
 receive a bag ($1) and output a modified bag ($1 with only the first field,
 $0, remaining in all the tuples in the bag).  To handle this issue, POProject
 will, when it sees that its predecessor is a POProject and its successors is
 an expression operator it will perform a projection on the bag (that is,
 perform the specified project on each tuple in the bag) rather than on a
 tuple.

 </body>
 </html>
	<html>
	<body>

	<p>
	Implementation of physical operators that use hadoop as the execution engine
	and data storage.

	<h2> Design </h2>
	<p>
	Physical operators use the operator, plan, visitor, and optimizer framework
	provided by the {@link org.apache.pig.impl.plan} package.
	<p>
	As with {@link org.apache.pig.impl.logicalLayer}, physical operators consist
	of {@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators} and
	{@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators}. In many data
	processing systems relational operators and expression operators are modeled
	as different entities because they behave differently. Pig blurs, though does
	not entirely remove, this distinction because of its support for nested
	operations.
	<p>
	Conceptually, relational operators work on an entire relation (in Pig's case,
	a bag). In terms of implementation, they operate on one record (tuple) at a
	time. This avoids needing to load the entire relation into memory before
	operating on it.
	<p>
	Expression operators, on the other hand, operate on the assumption that they
	are provided their entire input at invocation time and provide their entire
	output when they are finished.
	<p>
	Pig's hadoop implementation implements a pull based model, where each operator
	calls getNext() on the operator before it in the plan. getNext() is
	implemented for each of the different data types, so that operators can
	request the data type they expect. Relational operators will always expect a
	tuple. Expression operators can request any data type.
	<p>
	As with the logical plan, physical relational operators often have embedded
	physical plans. When a relational operator calls getNext() on its predecessor
	and receives a tuple, it will attach that tuple to its embedded physical plan(s)
	and then call getNext() on the root node(s) of those plan(s) in order to get the
	output. For example, the Pig Latin <code>filter A by $0 != 5</code> will
	produce a POFilter object, with an embedded physical plan that consists of
	POProject(0), POConst(5), both attached to PONotEqual. Each time
	POFilter.getNext() is called, it will call its predecessors getNext() method,
	and then attach the input to POProject and POConst. It will then call
	PONotEqual.getNext(). PONotEqual will in turn call POProject.getNext() and
	POConst.getNext(), and then evaluate and return the results. If the result is
	true, POFilter will return its input tuple.
	If the answer is false, it will call it's predecessor's getNext() method and
	try again.
	<p>
	Given Pig's nested data and execution models, there are places it is necessary
	to move between relational and expression operators. Consider the following
	Pig Latin script:
	<code>
	A = load 'myfile';
	B = group A by $0;
	C = foreach B {
	C1 = filter $1 by $0 > 0;
	C2 = distinct C1;
	generate group, COUNT(C2), SUM(C1.$0);
	}
	</code>
	In particular, the foreach section presents some interesting challenges.
	<p>
	First, foreach has three separate outputs, all of which require separate but
	parallel executions. To address this, each element of the foreach is described by a
	separate embedded plan. This can cause duplication of
	operations, as in this plan.
	In this case splitting the plans for COUNT and SUM cause a double execution of
	the <code>C1 = filter</code> section of the script. But it avoids needing to
	place a split operator between filter -> distinct and filter -> SUM.
	<p>
	The second issue presented by the nested logic is that the
	foreach operator is going to receive a tuple with the format ($0, bag), where
	bag is a collection of all the tuples with a given value for $0. It will then
	attach that to the filter. But filter does not expect a bag. It expects
	to get tuples. On the other end, distinct will be outputing tuples. But
	COUNT() expects C2 to be a bag that can be processed by COUNT as a whole.
	<p>
	To address this issue, some operators have been modified to provide
	"bookend" functionality. That is, the ability to translate between relational
	and expression operators.
	The embedded plan for calculating the COUNT in the foreach will
	look like: POProject(1) -> PODistinct -> POProject(*) -> COUNT().
	The first POProject(1) will have a bag attached as its input by POForeach.
	But POFilter will call getNext(Tuple). In this case, POProject will know to
	open the bag and provide the tuples one at a time, until the bag is empty, at
	which point it will return STATUS_EOP. The PODistinct will be expecting to
	return tuples, but POProject(*) will call getNext(bag). In this case all
	relational operators will be able to accumulate all of the tuples by calling
	getNext(tuple) on themselves until they see STATUS_EOP, packaging those tuples
	into a bag, and then returning that bag.
	<p>
	And third, project is being subtly overloaded here. In cases where the script
	says <code>C = foreach B generate $1</code>, this type of projection means take the second
	element from the tuple and project it. But in cases like <code>C = foreach B
	generate SUM($1.$0)</code> and $1 is a bag, this type of projection expects to
	receive a bag ($1) and output a modified bag ($1 with only the first field,
	$0, remaining in all the tuples in the bag). To handle this issue, POProject
	will, when it sees that its predecessor is a POProject and its successors is
	an expression operator it will perform a projection on the bag (that is,
	perform the specified project on each tuple in the bag) rather than on a
	tuple.

	</body>
	</html>