---
title: Macros - Guide - Apache DataFu Pig
version: 1.6.0
section_name: Apache DataFu Pig - Guide
license: >
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
## Macros
### Finding the most recent update of a given record — the _dedup_ (de-duplication) macro
A common scenario in data stored in HDFS (the Hadoop Distributed File System) is multiple rows representing updates to the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let's consider the following simplified example.
<br>
<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
<center>Raw customers data, with more than one row per customer</center>
<br>
We can see that although most customers appear only once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we keep just the most recent record for each customer? For this we can use the _dedup_ macro, as below:
```pig
REGISTER datafu-pig-<%= current_page.data.version %>.jar;
IMPORT 'datafu/dedup.pig';
data = LOAD 'customers.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
dedup_data = dedup(data, 'id', 'date_updated');
STORE dedup_data INTO 'dedup_out';
```
Our result will be as expected — each customer only appears once, as you can see below:
<br>
<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
<center>Deduplicated data, with only the most recent record for each customer</center>
<br>
One nice thing about this macro is that you can use more than one field to dedup the data. For example, if we wanted to use both the _id_ and _name_ fields, we would change this line:
```pig
dedup_data = dedup(data, 'id', 'date_updated');
```
to this:
```pig
dedup_data = dedup(data, '(id, name)', 'date_updated');
```
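If you were writing this by hand in plain Pig Latin, a composite key would correspond to grouping by a tuple of fields. A sketch, assuming the same schema as above (again, not the macro's actual implementation):

```pig
-- Group by the composite key (id, name) and keep the
-- most recent tuple for each distinct (id, name) pair.
grouped = GROUP data BY (id, name);
dedup_data = FOREACH grouped {
    sorted = ORDER data BY date_updated DESC;
    newest = LIMIT sorted 1;
    GENERATE FLATTEN(newest);
};
```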
---
<br>
### Comparing expected and actual results for regression tests — the _diff\_macro_
After making changes to an application's logic, we are often interested in the effect they have on our output. One common use case is refactoring — we don't expect our output to change at all. Another is a surgical change that should affect only a small subset of records. To easily perform such regression tests on actual data, we use the _diff\_macro_, which is based on DataFu's _TupleDiff_ UDF.
Let's look at a table that is exactly like _dedup\_out_, but with four changes.
1. We will remove record 1, _quentin_
2. We will change _date\_updated_ for record 2, _julia_
3. We will change _purchases_ and _date\_updated_ for record 4, _alice_
4. We will add a new row, record 8, _amanda_
<br>
<script src="https://gist.github.com/eyala/699942d65471f3c305b0dcda09944a95.js"></script>
<br>
We'll run the following Pig script, using DataFu's _diff\_macro_:
```pig
REGISTER datafu-pig-<%= current_page.data.version %>.jar;
IMPORT 'datafu/diff_macros.pig';
data = LOAD 'dedup_out.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
changed = LOAD 'dedup_out_changed.csv' USING PigStorage(',') AS (id: int, name: chararray, purchases: int, date_updated: chararray);
diffs = diff_macro(data,changed,id,'');
DUMP diffs;
```
The results look like this:
<br>
<script src="https://gist.github.com/eyala/3d36775faf081daad37a102f25add2a4.js"></script>
<br>
Let's take a moment to look at these results. Each row has the same general structure. Rows that start with _missing_ indicate records that were in the first relation, but aren't in the new one. Conversely, rows that start with _added_ indicate records that are in the new relation, but not in the old one. Each of these rows is followed by the relevant tuple from the relations.
The rows that start with _changed_ are more interesting. The word _changed_ is followed by a list of the fields which have changed values in the new table. For the row with _id_ 2, this is the _date\_updated_ field. For the row with _id_ 4, this is the _purchases_ and _date\_updated_ fields.
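Because the diff status is the first field of each output row, the result is easy to post-process with ordinary Pig operators. For example, to keep only the modified records, we could filter on that field — a sketch, where treating `$0` as the status string is an assumption based on the output above:

```pig
-- Keep only rows whose first field starts with 'changed';
-- STARTSWITH is a Pig built-in string function.
changed_only = FILTER diffs BY STARTSWITH((chararray)$0, 'changed');
DUMP changed_only;
```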
One field we might want to ignore is _date\_updated_. If the only difference between two versions of a record is when it was last updated, we may prefer to skip it for a more concise diff. For this, we change the following line in our original Pig script from this:
```pig
diffs = diff_macro(data,changed,id,'');
```
to become this:
```pig
diffs = diff_macro(data,changed,id,'date_updated');
```
If we run our changed Pig script, we'll get the following result.
<br>
<script src="https://gist.github.com/eyala/d9b0d5c60ad4d8bbccc79c3527f99aca.js"></script>
<br>
The row for _julia_ is missing from our diff, because only _date\_updated_ has changed, but the row for _alice_ still appears, because the _purchases_ field has also changed.
There's one implementation detail that's important to know — the macro uses a replicated join in order to be able to run quickly on very large tables, so the sample table needs to be able to fit in memory.
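For reference, this is what a replicated join looks like in plain Pig Latin — a generic sketch with hypothetical relation names, not the macro's source. All relations listed after the first are loaded into memory in each map task, which is where the size constraint comes from:

```pig
-- 'replicated' loads the smaller relation(s) into every map
-- task's memory, performing the join map-side with no reduce phase.
big    = LOAD 'big_table'   USING PigStorage(',') AS (id: int, val: chararray);
small  = LOAD 'small_table' USING PigStorage(',') AS (id: int, val: chararray);
joined = JOIN big BY id, small BY id USING 'replicated';
```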