<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Built In Functions</title>
</header>
<body>
<section id="built-in-functions">
<title>Introduction</title>
<p>
Pig comes with a set of built-in functions (the eval, load/store, math, string, bag, and tuple functions). Two main properties differentiate built-in functions from <a href="udf.html">user defined functions</a> (UDFs). First, built-in functions don't need to be registered because Pig knows where they are. Second, built-in functions don't need to be qualified when they are used because Pig knows where to find them.
</p>
</section>
<!-- ================================================================== -->
<!-- DYNAMIC INVOKERS -->
<section id="dynamic-invokers">
<title>Dynamic Invokers</title>
<p>Often you may need to use a simple function that is already provided by standard Java libraries, but for which a <a href="udf.html">user defined function</a> (UDF) has not been written. Dynamic invokers allow you to refer to Java functions without having to wrap them in custom UDFs, at the cost of doing some Java reflection on every function call.
</p>
<source>
...
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
...
</source>
<p>Currently, dynamic invokers can be used for any static function that: </p>
<ul>
<li>Accepts no arguments or accepts some combination of strings, ints, longs, doubles, floats, or arrays with these same types </li>
<li>Returns a string, an int, a long, a double, or a float</li>
</ul>
<p>Only primitives can be used for numbers; capital-letter numeric classes (such as Integer or Double) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. </p>
<p>The <a href="basic.html#define">DEFINE</a> statement is used to bind a keyword to a Java method, as above. The first argument to the InvokeFor* constructor is the full path to the desired method. The second argument is a space-delimited ordered list of the classes of the method arguments. This can be omitted or an empty string if the method takes no arguments. Valid class names are string, long, float, double, and int. Invokers can also work with array arguments, represented in Pig as DataBags of single-tuple elements. Simply refer to string[], for example. Class names are not case sensitive. </p>
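<p>As a minimal sketch of the argument list described above, the following definitions bind invokers to static methods from the standard Java library. The aliases (Hex, Pow, NowMillis), the relation names, and the input file 'numbers.txt' are illustrative names chosen here, not part of Pig.</p>
<source>
-- One int argument, String result: java.lang.Integer.toHexString(int)
DEFINE Hex InvokeForString('java.lang.Integer.toHexString', 'int');
-- Two double arguments, double result: java.lang.Math.pow(double, double)
DEFINE Pow InvokeForDouble('java.lang.Math.pow', 'double double');
-- No arguments, long result: the argument list is omitted
DEFINE NowMillis InvokeForLong('java.lang.System.currentTimeMillis');
nums = LOAD 'numbers.txt' AS (n:int, x:double);
results = FOREACH nums GENERATE Hex(n), Pow(x, 2.0), NowMillis();
</source>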
<p>The ability to use invokers on methods that take array arguments makes methods like those in org.apache.commons.math.stat.StatUtils available (for processing the results of grouping your datasets, for example). This is helpful, but a word of caution: the resulting UDF will not be optimized for Hadoop, and the very significant benefits one gains from implementing the Algebraic and Accumulator interfaces are lost here. Be careful if you use invokers this way.</p>
</section>
<!-- ======================================================== -->
<!-- EVAL FUNCTIONS -->
<section id="eval-functions">
<title>Eval Functions</title>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="avg">
<title>AVG</title>
<p>Computes the average of the numeric values in a single-column bag. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>AVG(expression)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>Any expression whose result is a bag. The elements of the bag should be data type int, long, float, double, bigdecimal, biginteger or bytearray.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the AVG function to compute the average of the numeric values in a single-column bag.
AVG requires a preceding GROUP ALL statement for global averages and a GROUP BY statement for group averages.</p>
<p>The AVG function ignores NULL values. </p>
</section>
<section>
<title>Example</title>
<p>In this example the average GPA for each student is computed (see the <a href="basic.html#group">GROUP</a> operator for information about the field names in relation B).</p>
<source>
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
</source>
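<p>For a global average, group the whole relation with GROUP ALL first. The following sketch reuses relation A from the example above; the aliases D and E are illustrative.</p>
<source>
D = GROUP A ALL;
E = FOREACH D GENERATE AVG(A.gpa);
</source>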
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p></p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>bytearray </p>
</td>
</tr>
<tr>
<td>
<p>AVG </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal *</p>
</td>
<td>
<p>bigdecimal *</p>
</td>
<td>
<p>error </p>
</td>
<td>
<p>cast as double </p>
</td>
</tr>
</table>
<p>* Average values for datatypes bigdecimal and biginteger have precision setting <a href="http://docs.oracle.com/javase/7/docs/api/java/math/MathContext.html#DECIMAL128">java.math.MathContext.DECIMAL128</a>.</p>
</section></section>
<!-- ======================================================== -->
<section id="bagtostring">
<title>BagToString</title>
<p>Concatenate the elements of a Bag into a chararray string, placing an optional delimiter between each value.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>BagToString(vals:bag [, delimiter:chararray])</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td><p>vals</p></td>
<td><p>A bag of arbitrary values. They will each be cast to chararray if they are not already.</p></td>
</tr>
<tr>
<td><p>delimiter</p></td>
<td><p>A chararray value to place between elements of the bag; defaults to underscore <code>'_'</code>.</p></td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>BagToString creates a single string from the elements of a bag, similar to SQL's <code>GROUP_CONCAT</code> function. Keep in mind the following:</p>
<ul>
<li>Bags can be of arbitrary size, while strings in Java cannot: you will either exhaust available memory or exceed the maximum number of characters (about 2 billion). One of the worst features a production job can have is thresholding behavior: everything will seem nearly fine until the data size of your largest bag grows from nearly-too-big to just-barely-too-big.</li>
<li>Bags are unordered unless you explicitly apply a nested <code>ORDER BY</code> operation as demonstrated below. A nested <code>FOREACH</code> will preserve ordering, letting you order by one combination of fields then project out just the values you'd like to concatenate.</li>
<li>The default string conversion is applied to each element. If the bag's contents are not atoms (tuple, map, etc.), this may not be what you want. Use a nested <code>FOREACH</code> to format values and then compose them with BagToString as shown below.</li>
</ul>
<p>Examples:</p>
<table>
<tr><th>vals</th> <th>delimiter</th> <th>BagToString(vals, delimiter)</th> <th>Notes</th> </tr>
<tr> <td><code>{('BOS'),('NYA'),('BAL')}</code></td> <td><code></code></td> <td><code>BOS_NYA_BAL</code></td> <td>If only one argument is given, the field is delimited with underscore characters</td></tr>
<tr> <td><code>{('BOS'),('NYA'),('BAL')}</code></td> <td><code>'|'</code></td> <td><code>BOS|NYA|BAL</code></td> <td>But you can supply your own delimiter</td></tr>
<tr> <td><code>{('BOS'),('NYA'),('BAL')}</code></td> <td><code>''</code></td> <td><code>BOSNYABAL</code></td> <td>Use an explicit empty string to just smush everything together</td></tr>
<tr> <td><code>{(1),(2),(3)}</code></td> <td><code>'|'</code></td> <td><code>1|2|3</code></td> <td>Elements are type-converted for you (but see examples below)</td></tr>
</table>
</section>
<section>
<title>Examples</title>
<p>Simple delimited strings are simple:</p>
<source>
team_parks = LOAD 'team_parks' AS (team_id:chararray, park_id:chararray, years:bag{(year_id:int)});
-- BOS BOS07 {(1995),(1997),(1996),(1998),(1999)}
-- NYA NYC16 {(1995),(1999),(1998),(1997),(1996)}
-- NYA NYC17 {(1998)}
-- SDN HON01 {(1997)}
-- SDN MNT01 {(1996),(1999)}
-- SDN SAN01 {(1999),(1997),(1998),(1995),(1996)}
team_parkslist = FOREACH (GROUP team_parks BY team_id) GENERATE
group AS team_id, BagToString(team_parks.park_id, ';');
-- BOS BOS07
-- NYA NYC17;NYC16
-- SDN SAN01;MNT01;HON01
</source>
<p>The default handling of complex elements works, but probably isn't what you want.</p>
<source>
team_parkyearsugly = FOREACH (GROUP team_parks BY team_id) GENERATE
group AS team_id,
BagToString(team_parks.(park_id, years));
-- BOS BOS07_{(1995),(1997),(1996),(1998),(1999)}
-- NYA NYC17_{(1998)}_NYC16_{(1995),(1999),(1998),(1997),(1996)}
-- SDN SAN01_{(1999),(1997),(1998),(1995),(1996)}_MNT01_{(1996),(1999)}_HON01_{(1997)}
</source>
<p>Instead, assemble it in pieces. In step 2, we sort on one field but process another; it remains in the sorted order.</p>
<source>
team_park_yearslist = FOREACH team_parks {
years_o = ORDER years BY year_id;
GENERATE team_id, park_id, SIZE(years_o) AS n_years, BagToString(years_o, '/') AS yearslist;
};
team_parkyearslist = FOREACH (GROUP team_park_yearslist BY team_id) {
tpy_o = ORDER team_park_yearslist BY n_years DESC, park_id ASC;
tpy_f = FOREACH tpy_o GENERATE CONCAT(park_id, ':', yearslist);
GENERATE group AS team_id, BagToString(tpy_f, ';');
};
-- BOS BOS07:1995/1996/1997/1998/1999
-- NYA NYC16:1995/1996/1997/1998/1999;NYC17:1998
-- SDN SAN01:1995/1996/1997/1998/1999;MNT01:1996/1999;HON01:1997
</source>
</section>
</section>
<section id="bagtotuple">
<title>BagToTuple</title>
<p>Un-nests the elements of a bag into a tuple.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>BagToTuple(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data type bag.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>BagToTuple creates a tuple from the elements of a bag. It removes only
the first level of nesting; it does not recursively un-nest nested bags.
Unlike FLATTEN, BagToTuple will not generate multiple output records per
input record.
</p>
</section>
<section>
<title>Examples</title>
<p>In this example, a bag containing tuples with one field is converted to a tuple.</p>
<source>
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(f1:chararray)});
DUMP A;
({('a'),('b'),('c')})
({('d'),('e'),('f')})
X = FOREACH A GENERATE BagToTuple(B1);
DUMP X;
(('a','b','c'))
(('d','e','f'))
</source>
<p>In this example, a bag containing tuples with two fields is converted to a tuple.</p>
<source>
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(f1:int,f2:int)});
DUMP A;
({(4,1),(7,8),(4,9)})
({(5,8),(4,3),(3,8)})
X = FOREACH A GENERATE BagToTuple(B1);
DUMP X;
((4,1,7,8,4,9))
((5,8,4,3,3,8))
</source>
</section>
</section>
<section id="bloom">
<title>Bloom</title>
<p>Bloom filters are a common way to select a limited set of records before
moving data for a join or other heavy weight operation.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>BuildBloom(String hashType, String mode, String vectorSize, String nbHash)</p>
</td>
</tr>
<tr>
<td>
<p>Bloom(String filename)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td><p>hashType</p></td>
<td><p>The type of hash function to use. Valid values for the hash functions are 'jenkins' and 'murmur'.</p></td>
</tr>
<tr>
<td><p>mode</p></td>
<td><p>Ignored, though by convention it should be "fixed" or "fixedsize".</p></td>
</tr>
<tr>
<td><p>vectorSize</p></td>
<td><p>The number of bits in the bloom filter.</p></td>
</tr>
<tr>
<td><p>nbHash</p></td>
<td><p>The number of hash functions used in constructing the bloom filter.</p></td>
</tr>
<tr>
<td><p>filename</p></td>
<td><p>File containing the serialized Bloom filter.</p></td>
</tr>
</table>
<p>See <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom Filter</a> for
a discussion of how to select the number of bits and the number of hash
functions.
</p>
</section>
<section>
<title>Usage</title>
<p>Bloom filters are a common way to select a limited set of records before
moving data for a join or other heavy weight operation. For example, if
one wanted to join a very large data set L with a smaller set S, and it
was known that the number of keys in L that will match with S is small,
building a bloom filter on S and then applying it to L before the join
can greatly reduce the number of records from L that have to be moved
from the map to the reduce, thus speeding the join.
</p>
<p>The implementation uses Hadoop's bloom filters
(org.apache.hadoop.util.bloom.BloomFilter) internally.
</p>
</section>
<section>
<title>Examples</title>
<source>
define bb BuildBloom('jenkins', 'fixed', '128', '3');
small = load 'S' as (x, y, z);
grpd = group small all;
fltrd = foreach grpd generate bb(small.x);
store fltrd into 'mybloom';
exec;
define bloom Bloom('mybloom');
large = load 'L' as (a, b, c);
flarge = filter large by bloom(a);
joined = join small by x, flarge by a;
store joined into 'results';
</source>
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="concat">
<title>CONCAT</title>
<p>Concatenates two or more expressions of identical type.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>CONCAT (expression, expression, [...expression])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>Any expression.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the CONCAT function to concatenate two or more expressions. The result values of the expressions must have identical types.</p>
<p>If any subexpression is null, the resulting expression is null.</p>
</section>
<section>
<title>Example</title>
<p>In this example, fields f1, an underscore string literal, f2 and f3 are concatenated.</p>
<source>
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE CONCAT(f1, '_', f2,f3);
DUMP X;
(apache_opensource)
(hadoop_mapreduce)
(pig_piglatin)
</source>
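<p>Because a single null argument makes the whole result null, one possible workaround, sketched here against the same relation A, is to substitute an empty string with a bincond before concatenating. Whether an empty string is the right replacement depends on your data.</p>
<source>
Y = FOREACH A GENERATE CONCAT((f1 IS NULL ? '' : f1), '_', (f2 IS NULL ? '' : f2));
</source>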
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="count">
<title>COUNT</title>
<p>Computes the number of elements in a bag. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>COUNT(expression) </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data type bag.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the COUNT function to compute the number of elements in a bag.
COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.</p>
<p>
The COUNT function follows SQL semantics and ignores nulls.
This means that a tuple in the bag is not counted if the FIRST FIELD in that tuple is NULL.
If you want to include NULL values in the count computation, use
<a href="#count-star">COUNT_STAR</a>.
</p>
<p>
Note: You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not work.
</p>
</section>
<section>
<title>Example</title>
<p>In this example the tuples in the bag are counted (see the <a href="basic.html#group">GROUP</a> operator for information about the field names in relation B).</p>
<source>
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = GROUP A BY f1;
DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
</source>
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p></p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>bytearray </p>
</td>
</tr>
<tr>
<td>
<p>COUNT </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
</tr>
</table>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="count-star">
<title>COUNT_STAR</title>
<p>Computes the number of elements in a bag. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>COUNT_STAR(expression)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data type bag.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the COUNT_STAR function to compute the number of elements in a bag.
COUNT_STAR requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.</p>
<p>COUNT_STAR includes NULL values in the count computation
(unlike <a href="#count">COUNT</a>, which ignores NULL values).
</p>
</section>
<section>
<title>Example</title>
<p>In this example COUNT_STAR is used to count the tuples in a bag.</p>
<source>
X = FOREACH B GENERATE COUNT_STAR(A);
</source>
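<p>The difference from COUNT shows up only when nulls are present. The sketch below assumes a hypothetical input file 'data_with_nulls' whose rows are listed in the comment; the counts given in the comments follow from the rules stated in the Usage sections of COUNT and COUNT_STAR.</p>
<source>
A = LOAD 'data_with_nulls' AS (f1:int, f2:int);
-- assumed contents: (1,2) (,3) (4,)
B = GROUP A ALL;
X = FOREACH B GENERATE COUNT(A), COUNT_STAR(A), COUNT(A.f2);
-- COUNT(A) skips (,3) because its first field is null: 2
-- COUNT_STAR(A) counts every tuple: 3
-- COUNT(A.f2) skips the tuple whose f2 is null: 2
</source>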
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="diff">
<title>DIFF</title>
<p>Compares two fields in a tuple.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>DIFF (expression, expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with any data type.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>The DIFF function takes two bags as arguments and compares them.
Any tuples that are in one bag but not the other are returned in a bag.
If the bags match, an empty bag is returned. If the fields are not bags
then they will be wrapped in tuples and returned in a bag if they do not match,
or an empty bag will be returned if the two records match. The implementation
assumes that both bags being passed to the DIFF function will fit entirely
into memory simultaneously. If this is not the case the UDF will still function
but it will be VERY slow.</p>
</section>
<section>
<title>Example</title>
<p>In this example DIFF compares the tuples in two bags.</p>
<source>
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
X = FOREACH A GENERATE DIFF(B1,B2);
DUMP X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
</source>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="isempty">
<title>IsEmpty</title>
<p>Checks if a bag or map is empty.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>IsEmpty(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with any data type.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>The IsEmpty function checks if a bag or map is empty (has no data). The function can be used to filter data.</p></section>
<section>
<title>Example</title>
<p>In this example all students with an SSN but no name are located.</p>
<source>
SSN = load 'ssn.txt' using PigStorage() as (ssn:long);
SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long, name:chararray);
/* do a cogroup of SSN with SSN_NAME */
X = COGROUP SSN by ssn, SSN_NAME by ssn;
/* only keep those ssn's for which there is no name */
Y = filter X by IsEmpty(SSN_NAME);
</source>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="max">
<title>MAX</title>
<p>Computes the maximum of the numeric values or chararrays in a single-column bag. MAX requires a preceding GROUP ALL statement for global maximums and a GROUP BY statement for group maximums.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>MAX(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data types int, long, float, double, bigdecimal, biginteger, chararray, datetime or bytearray.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the MAX function to compute the maximum of the numeric values or chararrays in a single-column bag.</p>
<p>The MAX function ignores NULL values.</p>
</section>
<section>
<title>Example</title>
<p>In this example the maximum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).</p>
<source>
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
</source>
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p></p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>datetime </p>
</td>
<td>
<p>bytearray </p>
</td>
</tr>
<tr>
<td>
<p>MAX </p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>datetime </p>
</td>
<td>
<p>cast as double</p>
</td>
</tr>
</table>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="min">
<title>MIN</title>
<p>Computes the minimum of the numeric values or chararrays in a single-column bag. MIN requires a preceding GROUP ALL statement for global minimums and a GROUP BY statement for group minimums.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>MIN(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data types int, long, float, double, bigdecimal, biginteger, chararray, datetime or bytearray.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the MIN function to compute the minimum of a set of numeric values or chararrays in a single-column bag.</p>
<p>The MIN function ignores NULL values.</p>
</section>
<section>
<title>Example</title>
<p>In this example the minimum GPA for all terms is computed for each student (see the GROUP operator for information about the field names in relation B).</p>
<source>
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MIN(A.gpa);
DUMP X;
(John,3.7F)
(Mary,3.8F)
</source>
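<p>MIN and MAX can be combined in a single FOREACH over the same grouped relation. A minimal sketch reusing relation B from the example above (the field names lo and hi are illustrative); the output shown follows from the per-student minimums above and the maximums in the MAX example, which uses the same data.</p>
<source>
Y = FOREACH B GENERATE group, MIN(A.gpa) AS lo, MAX(A.gpa) AS hi;
DUMP Y;
(John,3.7F,4.0F)
(Mary,3.8F,4.0F)
</source>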
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p></p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>datetime </p>
</td>
<td>
<p>bytearray </p>
</td>
</tr>
<tr>
<td>
<p>MIN </p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>datetime </p>
</td>
<td>
<p>cast as double</p>
</td>
</tr>
</table>
</section></section>
<section id="plucktuple">
<title>PluckTuple</title>
<p>Allows the user to specify a string prefix or a regex pattern, and then filter for the columns in a relation that begin with that prefix or match that pattern. Optionally, include the flag 'false' to filter for the columns that do not match.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>DEFINE pluck PluckTuple(expression1)</p>
<p>DEFINE pluck PluckTuple(expression1,expression3)</p>
<p>pluck(expression2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression1</p>
</td>
<td>
<p>A prefix or a regex pattern to pluck by</p>
</td>
</tr>
<tr>
<td>
<p>expression2</p>
</td>
<td>
<p>The fields to apply the pluck to, usually '*'</p>
</td>
</tr>
<tr>
<td>
<p>expression3</p>
</td>
<td>
<p>A boolean flag to indicate whether to include or exclude matching columns</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Example:</p>
<source>
a = load 'a' as (x, y);
b = load 'b' as (x, y);
c = join a by x, b by x;
DEFINE pluck PluckTuple('a::');
d = foreach c generate FLATTEN(pluck(*));
describe c;
c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
describe d;
d: {plucked::a::x: bytearray,plucked::a::y: bytearray}
DEFINE pluckNegative PluckTuple('a::','false');
d = foreach c generate FLATTEN(pluckNegative(*));
describe d;
d: {plucked::b::x: bytearray,plucked::b::y: bytearray}
</source>
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="size">
<title>SIZE</title>
<p>Computes the number of elements based on any Pig data type. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SIZE(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with any data type.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the SIZE function to compute the number of elements based on the data type (see the Types Tables below).
SIZE includes NULL values in the size computation. SIZE is not algebraic.</p>
<p>If the tested object is null, the SIZE function returns null.</p>
</section>
<section>
<title>Example</title>
<p>In this example the number of characters in the first field is computed.</p>
<source>
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE SIZE(f1);
DUMP X;
(6L)
(6L)
(3L)
</source>
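<p>SIZE behaves differently for complex types. As a sketch against the same relation A, the following wraps the three fields in a tuple with TOTUPLE and groups the relation so that SIZE is also applied to a bag. Per the Types Tables, the tuple size should be the number of fields and the bag size the number of tuples. The aliases and field names are illustrative.</p>
<source>
Y = FOREACH A GENERATE SIZE(f1) AS n_chars, SIZE(TOTUPLE(f1, f2, f3)) AS n_fields;
G = GROUP A ALL;
Z = FOREACH G GENERATE SIZE(A) AS n_tuples;
</source>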
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p>int </p>
</td>
<td>
<p>returns 1 </p>
</td>
</tr>
<tr>
<td>
<p>long </p>
</td>
<td>
<p>returns 1 </p>
</td>
</tr>
<tr>
<td>
<p>float </p>
</td>
<td>
<p>returns 1 </p>
</td>
</tr>
<tr>
<td>
<p>double </p>
</td>
<td>
<p>returns 1 </p>
</td>
</tr>
<tr>
<td>
<p>chararray </p>
</td>
<td>
<p>returns number of characters in the array </p>
</td>
</tr>
<tr>
<td>
<p>bytearray </p>
</td>
<td>
<p>returns number of bytes in the array </p>
</td>
</tr>
<tr>
<td>
<p>tuple </p>
</td>
<td>
<p>returns number of fields in the tuple</p>
</td>
</tr>
<tr>
<td>
<p>bag </p>
</td>
<td>
<p>returns number of tuples in bag </p>
</td>
</tr>
<tr>
<td>
<p>map </p>
</td>
<td>
<p>returns number of key/value pairs in map </p>
</td>
</tr>
</table></section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="subtract">
<title>SUBTRACT</title>
<p>Bag subtraction: SUBTRACT(bag1, bag2) returns a bag composed of the elements of bag1 that are not in bag2.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SUBTRACT(expression, expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data type bag.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>SUBTRACT takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.</p>
<p>If null, bag arguments are replaced by empty bags.<br></br>If arguments are not bags, an IOException is thrown.</p>
<p>The implementation assumes that both bags being passed to the SUBTRACT function will fit <strong>entirely
into memory</strong> simultaneously. If this is not the case, SUBTRACT will still function but will be <strong>very</strong> slow.</p>
</section>
<section>
<title>Example</title>
<p>In this example, SUBTRACT creates a new bag composed of B1 elements that are not in B2.</p>
<source>
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
DUMP A;
({(8,9),(0,1),(1,2)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7),(3,7)},{(2,2),(3,7)})
DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
X = FOREACH A GENERATE SUBTRACT(B1,B2);
DUMP X;
({(0,1),(1,2)})
({})
({(6,7)})
</source>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="sum">
<title>SUM</title>
<p>Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SUM(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data types int, long, float, double, bigdecimal, biginteger or bytearray cast as double.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the SUM function to compute the sum of a set of numeric values in a single-column bag.</p>
<p>The SUM function ignores NULL values.</p>
</section>
<section>
<title>Example</title>
<p>In this example the number of pets is computed for each owner (see the GROUP operator for information about the field names in relation B).</p>
<source>
A = LOAD 'data' AS (owner:chararray, pet_type:chararray, pet_num:int);
DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
B = GROUP A BY owner;
DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
X = FOREACH B GENERATE group, SUM(A.pet_num);
DUMP X;
(Alice,8L)
(Bob,4L)
</source>
</section>
<section>
<title>Types Tables</title>
<table>
<tr>
<td>
<p></p>
</td>
<td>
<p>int </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>float </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>chararray </p>
</td>
<td>
<p>bytearray </p>
</td>
</tr>
<tr>
<td>
<p>SUM </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>long </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>double </p>
</td>
<td>
<p>bigdecimal </p>
</td>
<td>
<p>biginteger </p>
</td>
<td>
<p>error </p>
</td>
<td>
<p>cast as double </p>
</td>
</tr>
</table>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="in">
<title>IN</title>
<p>The IN operator lets you test whether an expression matches any value in a list of values. It reduces the need for multiple OR conditions.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>IN (expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data types chararray, int, long, float, double, bigdecimal, biginteger or bytearray.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the IN operator to test whether an expression matches any value in a list of values. It helps reduce the need for multiple OR conditions.</p>
</section>
<section>
<title>Example</title>
<p>In this example we keep only the records with ID 4 or 6.</p>
<source>
A = load 'data' using PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
DUMP A;
(1,Christine,Romero,Female)
(2,Sara,Hansen,Female)
(3,Albert,Rogers,Male)
(4,Kimberly,Morrison,Female)
(5,Eugene,Baker,Male)
(6,Ann,Alexander,Female)
(7,Kathleen,Reed,Female)
(8,Todd,Scott,Male)
(9,Sharon,Mccoy,Female)
(10,Evelyn,Rice,Female)
X = FILTER A BY id IN (4, 6);
DUMP X;
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
</source>
</section>
<p>In this example, we pass a biginteger field and use the NOT operator, thereby negating the list of values passed in the IN clause.</p>
<source>
A = load 'data' using PigStorage(',') AS (id:biginteger, first:chararray, last:chararray, gender:chararray);
X = FILTER A BY NOT id IN (1, 3, 5, 7, 9);
DUMP X;
(2,Sara,Hansen,Female)
(4,Kimberly,Morrison,Female)
(6,Ann,Alexander,Female)
(8,Todd,Scott,Male)
(10,Evelyn,Rice,Female)
</source>
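<p>For comparison, the first filter above could also be written with explicit OR conditions. The following sketch is expected to select the same records as <code>id IN (4, 6)</code>:</p>
<source>
A = LOAD 'data' USING PigStorage(',') AS (id:int, first:chararray, last:chararray, gender:chararray);
-- Same test as: X = FILTER A BY id IN (4, 6);
X = FILTER A BY (id == 4) OR (id == 6);
DUMP X;
</source>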
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="tokenize">
<title>TOKENIZE</title>
<p>Splits a string and outputs a bag of words. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TOKENIZE(expression [, 'field_delimiter'])        </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with data type chararray.</p>
</td>
</tr>
<tr>
<td>
<p>'field_delimiter'</p>
</td>
<td>
<p>An optional field delimiter (in single quotes).</p>
<p>If field_delimiter is null or not passed, the following will be used as delimiters: space [ ], double quote [ " ], comma [ , ], parenthesis [ () ], star [ * ].</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). </p>
</section>
<section>
<title>Example</title>
<p>In this example the strings in each row are split.</p>
<source>
A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)
X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
</source>
<p>In this example a field delimiter is specified.</p>
<source>
A = LOAD 'data' AS (f1:chararray);
B = FOREACH A GENERATE TOKENIZE (f1,'||');
DUMP B;
</source>
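<p>A common follow-up, sketched here with hypothetical aliases, is to flatten the bag returned by TOKENIZE so that each word becomes its own record, and then count the occurrences of each word:</p>
<source>
A = LOAD 'data' AS (f1:chararray);
-- FLATTEN turns the bag of words into one record per word
B = FOREACH A GENERATE FLATTEN(TOKENIZE(f1)) AS word;
C = GROUP B BY word;
-- One output record per distinct word, with the number of times it appears
D = FOREACH C GENERATE group AS word, COUNT(B) AS occurrences;
DUMP D;
</source>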
</section></section></section>
<!-- ======================================================================== -->
<section id="load-store-functions">
<title>Load/Store Functions</title>
<p>Load/store functions determine how data goes into Pig and comes out of Pig.
Pig provides a set of built-in load/store functions, described in the sections below.
You can also write your own load/store functions (see <a href="udf.html">User Defined Functions</a>).</p>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="handling-compression">
<title>Handling Compression</title>
<p>Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.</p>
<p>To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.</p>
<source>
A = load 'myinput.gz';
store A into 'myoutput.gz';
</source>
<p>To work with bzip compressed files, the input/output files need to have a .bz or .bz2 extension. Because the compression is block-oriented, bzipped files can be split across multiple maps.</p>
<source>
A = load 'myinput.bz';
store A into 'myoutput.bz';
</source>
<p>Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED bz/bz2 FILES generated in this manner: </p>
<ul>
<li>
<p>cat *.bz > text/concat.bz </p>
</li>
<li>
<p>cat *.bz2 > text/concat.bz2</p>
</li>
</ul>
<p></p>
<p>If you use concatenated bzip files with your Pig jobs, you will NOT see a failure but the results will be INCORRECT.</p>
<p></p>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="binstorage">
<title>BinStorage</title>
<p>Loads and stores data in machine-readable format.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>BinStorage()        </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>none</p>
</td>
<td>
<p>no parameters</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Pig uses BinStorage to load and store the temporary data that is generated between multiple MapReduce jobs.</p>
<ul>
<li>BinStorage works with data that is represented on disk in machine-readable format.
BinStorage does NOT support <a href="#handling-compression">compression</a>.</li>
<li>BinStorage supports multiple locations (files, directories, globs) as input.</li>
</ul>
<p></p>
<p>Occasionally, users use BinStorage to store their own data. However, because BinStorage is a proprietary binary format, the original data is never in BinStorage - it is always a derivation of some other data.</p>
<p>We have seen several examples of users doing something like this:</p>
<source>
a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();
</source>
<p>And then later:</p>
<source>
a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
</source>
<p>There is a problem with this sequence of events. The first script does not define data types and, as a result, the data is stored as a bytearray and a bag with a tuple that contains two bytearrays. The second script attempts to cast the bytearray to double; however, since the data originated from a different loader, it has no way to know the format of the bytearray or how to cast it to a different type. To solve this problem, Pig:</p>
<ul>
<li>Sends an error message when the second script is executed: "ERROR 1118: Cannot cast bytes loaded from BinStorage. Please provide a custom converter."</li>
<li id="custom-converter">Allows you to use a custom converter to perform the casting. <br></br>
<source>
a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
</source>
</li>
</ul>
</section>
<section>
<title>Examples</title>
<p>In this example BinStorage is used with the LOAD and STORE functions.</p>
<source>
A = LOAD 'data' USING BinStorage();
STORE A INTO 'output' USING BinStorage();
</source>
<p>In this example BinStorage is used to load multiple locations.</p>
<source>
A = LOAD 'input1.bin, input2.bin' USING BinStorage();
</source>
<p>BinStorage does not track data lineage. When Pig uses BinStorage to move data between MapReduce jobs, Pig can figure out the correct cast function to use and apply it. However, as shown in the example below, when you store data using BinStorage and then use a separate Pig Latin script to read data (thus losing the type information), it is your responsibility to correctly cast the data before storing it using BinStorage.
</p>
<source>
raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;
B = foreach A generate col1#'bcookie' as reqcolumn;
describe B;
--B: {reqcolumn: bytearray}
X = limit B 5;
dump X;
(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)
B = foreach A generate (chararray)col1#'bcookie' as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;
()
()
()
()
()
</source>
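<p>One way to avoid this situation is to cast the fields while the original loader can still interpret the bytes, and only then write the data with BinStorage. The following is a minimal sketch under the assumption that the original input is a two-field text file; the file name, field names and types are hypothetical.</p>
<source>
-- The data is read by the default loader, which knows how to convert its own bytes
raw = LOAD 'input.txt' AS (id, score);
-- Cast before storing, so that typed values rather than raw bytearrays are written
typed = FOREACH raw GENERATE (chararray)id AS id, (double)score AS score;
STORE typed INTO 'typed_output' USING BinStorage();
</source>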
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="jsonloadstore">
<title>JsonLoader, JsonStorage</title>
<p>Load or store JSON data.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>JsonLoader( ['schema'] ) </p>
</td>
</tr>
<tr>
<td>
<p>JsonStorage( ) </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>schema</p>
</td>
<td>
<p>An optional Pig schema, in single quotes.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>Use JsonLoader to load JSON data. </p>
<p>Use JsonStorage to store JSON data.</p>
<p>Note that there is no concept of a delimiter in JsonLoader or JsonStorage. The data is encoded in standard JSON format. JsonLoader optionally takes a schema as the constructor argument.</p>
</section>
<section>
<title>Examples</title>
<p>In this example data is loaded with a schema. </p>
<source>
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
</source>
<p>In this example data is loaded without a schema; it assumes there is a .pig_schema (produced by JsonStorage) in the input directory. </p>
<source>
a = load 'a.json' using JsonLoader();
</source>
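<p>The two functions can be combined in a round trip. In the sketch below (the input file name and field names are hypothetical), JsonStorage writes the relation as JSON; a later script then uses JsonLoader without arguments and relies on the .pig_schema file in the stored directory.</p>
<source>
-- First script: write the relation as JSON
a = load 'data.txt' as (a0:int, a1:chararray);
store a into 'a.json' using JsonStorage();

-- Later script, run after the store has completed: no schema argument is given
b = load 'a.json' using JsonLoader();
</source>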
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="pigdump">
<title>PigDump</title>
<p>Stores data in UTF-8 format.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>PigDump()        </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>none</p>
</td>
<td>
<p>no parameters</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>PigDump stores data as tuples in human-readable UTF-8 format. </p></section>
<section>
<title>Example</title>
<p>In this example PigDump is used with the STORE function.</p>
<source>
STORE X INTO 'output' USING PigDump();
</source>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="pigstorage">
<title>PigStorage</title>
<p>Loads and stores data as structured text files.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>PigStorage( [field_delimiter] , ['options'] ) </p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p id="field-delimiter">field_delimiter</p>
</td>
<td>
<p>The default field delimiter is tab ('\t'). </p>
<p>You can specify other characters as field delimiters; however, be sure to encase the characters in single quotes.</p>
</td>
</tr>
<tr>
<td>
<p id="pigstorage-options">'options'</p>
</td>
<td>
<p>A string that contains space-separated options ('optionA optionB optionC')</p>
<p>Currently supported options are:</p>
<ul>
<li>('schema') - Stores the schema of the relation using a hidden JSON file.</li>
<li>('noschema') - Ignores a stored schema during the load.</li>
<li>('tagsource') - (deprecated; use tagPath instead) Adds a first column that indicates the input file of the record.</li>
<li>('tagPath') - Adds a first column that indicates the input path of the record.</li>
<li>('tagFile') - Adds a first column that indicates the input file name of the record.</li>
</ul>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>PigStorage is the default function used by Pig to load/store the data. PigStorage supports structured text files (in human-readable UTF-8 format) in compressed or uncompressed form (see <a href="#handling-compression">Handling Compression</a>). All Pig <a href="basic.html#data-types">data types</a> (both simple and complex) can be read/written using this function. The input data to the load can be a file, a directory or a glob.</p>
<p><strong>Load/Store Statements</strong></p>
<p>Load statements – PigStorage expects data to be formatted using field delimiters, either the tab character ('\t') or other specified character.</p>
<p>Store statements – PigStorage outputs data using field delimiters, either the tab character ('\t') or other specified character, and the line feed record delimiter ('\n'). </p>
<p><strong>Field/Record Delimiters</strong></p>
<p>Field Delimiters – For load and store statements the default field delimiter is the tab character ('\t'). You can use other characters as field delimiters, but separators such as ^A or Ctrl-A should be represented in Unicode (\u0001) using UTF-16 encoding (see Wikipedia <a href="http://en.wikipedia.org/wiki/ASCII">ASCII</a>, <a href="http://en.wikipedia.org/wiki/Unicode">Unicode</a>, and <a href="http://en.wikipedia.org/wiki/UTF-16">UTF-16</a>).</p>
<p>Record Delimiters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.</p>
<p><strong>Schemas</strong></p>
<p>If the schema option is specified, a hidden ".pig_schema" file is created in the output directory when storing data. It is used by PigStorage (with or without -schema) during loading to determine the field names and types of the data without the need for a user to explicitly provide the schema in an as clause, unless <code>noschema</code> is specified. No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used. </p>
<p>Additionally, if the schema option is specified, a ".pig_headers" file is created in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data). </p>
<p>If the schema option is NOT specified, a schema will not be written when storing data.</p>
<p>If the noschema option is NOT specified, and a schema is found, it gets loaded when loading data.</p>
<p>Note that regardless of whether or not you store the schema, you always need to specify the correct delimiter to read your data. If you store using delimiter "#" and then load using the default delimiter, your data will not be parsed correctly.</p>
<p><strong>Record Provenance</strong></p>
<p>If the tagPath or tagFile option is specified, PigStorage will add a pseudo-column INPUT_FILE_PATH or INPUT_FILE_NAME respectively to the beginning of the record. As the name suggests, it is the input file path/name containing this particular record. Please note that tagsource is deprecated.</p>
<p><strong>Complex Data Types</strong></p>
<p>The formats for complex data types are shown here:</p>
<ul>
<li><a href="basic.html#tuple">Tuple</a>: enclosed by (), items separated by ","
<ul>
<li>Non-empty tuple: (item1,item2,item3)</li>
<li>Empty tuple is valid: ()</li>
</ul>
</li>
<li><a href="basic.html#bag">Bag</a>: enclosed by {}, tuples separated by ","
<ul>
<li>Non-empty bag: {(tuple1),(tuple2),(tuple3)}</li>
<li>Empty bag is valid: {}</li>
</ul>
</li>
<li><a href="basic.html#map">Map</a>: enclosed by [], items separated by ",", key and value separated by "#"
<ul>
<li>Non-empty map: [key1#value1,key2#value2]</li>
<li>Empty map is valid: []</li>
</ul>
</li>
</ul>
<p>If the load statement specifies a schema, Pig will convert the complex type according to the schema. If conversion fails, the affected item will be null (see <a href="basic.html#nulls">Nulls and Pig Latin</a>). </p>
</section>
<section>
<title>Examples</title>
<p>In this example PigStorage expects the student file to contain tab-separated fields and newline-separated records. The statements are equivalent.</p>
<source>
A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age:int, gpa: float);
A = LOAD 'student' AS (name: chararray, age:int, gpa: float);
</source>
<p>In this example PigStorage stores the contents of X into files with fields that are delimited with an asterisk ( * ). The STORE statement specifies that the files will be located in a directory named output and that the files will be named part-nnnnn (for example, part-00000).</p>
<source>
STORE X INTO 'output' USING PigStorage('*');
</source>
<p>In this example, PigStorage loads data with a complex data type: a bag of tuples, each containing a map and a double.</p>
<source>
a = load '1.txt' as (a0:{t:(m:map[int],d:double)});
{([foo#1,bar#2],34.0),([white#3,yellow#4],45.0)} : valid
{([foo#badint],baddouble)} : conversion fail for badint/baddouble, get {([foo#],)}
{} : valid, empty bag
</source>
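<p>The schema and record provenance options described above can be sketched as follows. This assumes the options are passed with a leading dash, as in the "-schema" form mentioned in the Usage section; the relation X, the paths, and the field names in the AS clause are hypothetical.</p>
<source>
-- Store with ',' as the field delimiter and also write the .pig_schema and .pig_headers files
STORE X INTO 'output' USING PigStorage(',', '-schema');

-- Later script: the stored schema is picked up, so no AS clause is needed
Y = LOAD 'output' USING PigStorage(',');

-- Ignore the stored schema and supply one explicitly instead
Z = LOAD 'output' USING PigStorage(',', '-noschema') AS (name:chararray, age:int);

-- Prepend the input file name to every record
W = LOAD 'input' USING PigStorage(',', '-tagFile');
</source>
<p>Note that the same delimiter is used for the store and for each load of 'output', since the delimiter is not recovered from the stored schema.</p>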
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="textloader">
<title>TextLoader</title>
<p>Loads unstructured data in UTF-8 format.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TextLoader()</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>none</p>
</td>
<td>
<p>no parameters</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>TextLoader works with unstructured data in UTF-8 format. Each resulting tuple contains a single field with one line of input text. TextLoader also supports <a href="#handling-compression">compression</a>.</p>
<p>Currently, TextLoader support for compression is limited.</p>
<p>TextLoader cannot be used to store data.</p>
</section>
<section>
<title>Example</title>
<p>In this example TextLoader is used with the LOAD function.</p>
<source>
A = LOAD 'data' USING TextLoader();
</source>
</section></section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="HBaseStorage">
<title>HBaseStorage</title>
<p>Loads and stores data from an HBase table.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>HBaseStorage('columns', ['options'])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>columns</p>
</td>
<td>
<p>A list of qualified HBase columns to read data from or store data to.
The column family name and column qualifier are separated by a colon (:).
Only the columns used in the Pig script need to be specified. Columns are specified
in one of three different ways as described below.</p>
<ul>
<li>Explicitly specify a column family and column qualifier (e.g., user_info:id). This
will produce a scalar in the resultant tuple.</li>
<li>Specify a column family and a portion of column qualifier name as a prefix followed
by an asterisk (e.g., user_info:address_*). This approach is used to read one or
more columns from the same column family with a matching descriptor prefix.
The datatype for this field will be a map of column descriptor name to field value.
Note that combining this style of prefix with a long list of fully qualified
column descriptor names could cause performance degradation on the HBase scan.
This will produce a Pig map in the resultant tuple with column descriptors as keys.</li>
<li>Specify all the columns of a column family using the column family name followed
by an asterisk (e.g., user_info:*). This will produce a Pig map in the resultant
tuple with column descriptors as keys.</li>
</ul>
</td>
</tr>
<tr>
<td>
<p>'options'</p>
</td>
<td>
<p>A string that contains space-separated options (&lsquo;-optionA=valueA -optionB=valueB -optionC=valueC&rsquo;)</p>
<p>Currently supported options are:</p>
<ul>
<li>-loadKey=(true|false) Load the row key as the first value in every tuple
returned from HBase (default=false)</li>
<li>-gt=minKeyVal Return rows with a rowKey greater than minKeyVal</li>
<li>-lt=maxKeyVal Return rows with a rowKey less than maxKeyVal</li>
<li>-regex=regex Return rows with a rowKey that match this regex on KeyVal</li>
<li>-gte=minKeyVal Return rows with a rowKey greater than or equal to minKeyVal</li>
<li>-lte=maxKeyVal Return rows with a rowKey less than or equal to maxKeyVal</li>
<li>-limit=numRowsPerRegion Max number of rows to retrieve per region</li>
<li>-caching=numRows Number of rows to cache (faster scans, more memory)</li>
<li>-delim=delimiter Column delimiter in columns list (default is whitespace)</li>
<li>-ignoreWhitespace=(true|false) When delim is set to something other than
whitespace, ignore spaces when parsing column list (default=true)</li>
<li>-caster=(HBaseBinaryConverter|Utf8StorageConverter) Class name of Caster to use
to convert values (default=Utf8StorageConverter). The default caster can be
overridden with the pig.hbase.caster config param. Casters must implement LoadStoreCaster.</li>
<li>-noWAL=(true|false) During storage, disables the write-ahead log (WAL) for faster
loading into HBase (default=false). To be used with extreme caution since this
could result in data loss (see <a href="http://hbase.apache.org/book.html#perf.hbase.client.putwal">http://hbase.apache.org/book.html#perf.hbase.client.putwal</a>).</li>
<li>-minTimestamp=timestamp Return cell values that have a creation timestamp
greater than or equal to this value</li>
<li>-maxTimestamp=timestamp Return cell values that have a creation timestamp
less than this value</li>
<li>-timestamp=timestamp Return cell values that have a creation timestamp equal to
this value</li>
<li>-includeTimestamp=Record will include the timestamp after the rowkey on store (rowkey, timestamp, ...)</li>
<li>-includeTombstone=Record will include a tombstone marker on store after the rowKey and timestamp (if included) (rowkey, [timestamp,] tombstone, ...)</li>
</ul>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>HBaseStorage stores and loads data from HBase. The function takes two arguments. The first
argument is a space-separated list of columns. The second, optional argument is a
space-separated list of options. Column syntax and available options are listed above.
Note that HBaseStorage always disables split combination.</p>
</section>
<section>
<title>Load Example</title>
<p>In this example HBaseStorage is used with the LOAD function with an explicit schema.</p>
<source>
raw = LOAD 'hbase://SomeTableName'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name tags:work_* info:*', '-loadKey=true -limit=5') AS
(id:bytearray, first_name:chararray, last_name:chararray, tags_map:map[], info_map:map[]);
</source>
<p>The datatypes of the columns are declared with the "AS" clause. The first_name and last_name
columns are specified as fully qualified column names with a chararray datatype. The third
specification of tags:work_* requests a set of columns in the tags column family that begin
with "work_". There can be zero, one or more columns of that type in the HBase table. The
type is specified as tags_map:map[]. This indicates that the set of column values returned
will be accessed as a map, where the key is the column name and the value is the cell value
of the column. The fourth column specification is also a map of column descriptors to cell
values.</p>
<p>When the type of the column is specified as a map in the "AS" clause, the map keys are the
column descriptor names and the data type is chararray. The datatype of the column values can
be declared explicitly as shown in the examples below:</p>
<ul>
<li>tags_map[chararray] - In this case, the column values are all declared to be of type chararray</li>
<li>tags_map[int] - In this case, the column values are all declared to be of type int.</li>
</ul>
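<p>Putting these pieces together, a load that restricts the row key range and declares the map value type might look like the following sketch. The table name is reused from the example above; the key values row100 and row200 are hypothetical.</p>
<source>
raw = LOAD 'hbase://SomeTableName'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name tags:work_*', '-loadKey=true -gte=row100 -lt=row200') AS
(id:bytearray, first_name:chararray, tags_map:map[chararray]);
</source>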
</section>
<section>
<title>Store Example</title>
<p>In this example HBaseStorage is used to store a relation into HBase.</p>
<source>
A = LOAD 'hdfs_users' AS (id:bytearray, first_name:chararray, last_name:chararray);
STORE A INTO 'hbase://users_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:first_name info:last_name');
</source>
<p>In the example above relation A is loaded from HDFS and stored in HBase. Note that the schema
of relation A is a tuple of size 3, but only two column descriptor names are passed to the
HBaseStorage constructor. This is because the first entry in the tuple is used as the HBase
rowKey.</p>
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="AvroStorage">
<title>AvroStorage</title>
<p>Loads and stores data from Avro files.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>AvroStorage(['schema|record name'], ['options'])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>schema</p>
</td>
<td>
<p>A JSON string specifying the Avro schema for the input. You may specify an explicit schema
when storing data or when loading data. When you manually provide a schema, Pig
will use the provided schema for serialization and deserialization. This means that
you can provide an explicit schema when saving data to simplify the output (for example
by removing nullable unions), or rename fields. This also means that you can provide
an explicit schema when reading data to only read a subset of the fields in each record.</p>
<p>See
<a href="http://avro.apache.org/docs/current/spec.html"> the Apache Avro Documentation</a>
for more details on how to specify a valid schema.</p>
</td>
</tr>
<tr>
<td>
<p>record name</p>
</td>
<td>
<p>When storing a bag of tuples with AvroStorage, if you do not want to specify
the full schema, you may specify the avro record name instead. (AvroStorage will
determine that the argument isn't a valid schema definition and use it as a
variable name instead.)</p>
</td>
</tr>
<tr>
<td>
<p>'options'</p>
</td>
<td>
<p>A string that contains space-separated options (&lsquo;-optionA valueA -optionB valueB -optionC &rsquo;)</p>
<p>Currently supported options are:</p>
<ul>
<li>-namespace nameSpace or -n nameSpace Explicitly specify the namespace
field in Avro records when storing data</li>
<li>-schemafile schemaFile or -f schemaFile Specify the input (or output) schema from
an external file. Pig assumes that the file is located on the default filesystem,
but you may use an explicit URL to unambiguously specify the location. (For example, if
the data was on your local file system in /stuff/schemafile.avsc, you
could specify "-f file:///stuff/schemafile.avsc" to specify the location. If the
data was on HDFS under /yourdirectory/schemafile.avsc, you could specify
"-f hdfs:///yourdirectory/schemafile.avsc"). Pig expects this to be a
text file, containing a valid avro schema.</li>
<li>-examplefile exampleFile or -e exampleFile Specify the input (or output)
schema using another Avro file as an example. Pig assumes that the file is located on the default filesystem,
but you may use an explicit URL to specify the location. Pig
expects this to be an Avro data file.</li>
<li>-allowrecursive or -r Specify whether to allow recursive schema definitions (the
default is to throw an exception if Pig encounters a recursive schema). When
reading objects with recursive definitions, Pig will translate Avro records to
schema-less tuples; the Pig Schema for the object may not match the data exactly.</li>
<li>-doublecolons or -d Specify how to handle Pig schemas that contain double-colons
when writing data in Avro format. (When you join two bags in Pig, Pig will automatically
label the fields in the output Tuples with names that contain double-colons). If
you select this option, AvroStorage will translate names with double colons into
names with double underscores. </li>
</ul>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>AvroStorage stores and loads data from Avro files. Often, you can load and
store data using AvroStorage without knowing much about the Avro serialization format.
AvroStorage will attempt to automatically translate a pig schema and pig data to avro data,
or avro data to pig data.</p>
<p>By default, when you use AvroStorage to load data, AvroStorage will use depth first search to
find a valid Avro file on the input path, then use the schema from that file to load the
data. When you use AvroStorage to store data, AvroStorage will attempt to translate the
Pig schema to an equivalent Avro schema. You can manually specify the schema by providing
an explicit schema in Pig, loading a schema from an external schema file, or explicitly telling
Pig to read the schema from a specific avro file.</p>
<p>To compress your output with AvroStorage, you need to use the correct Avro properties for compression.
For example, to enable compression using deflate level 5, you would specify</p>
<source>
SET avro.output.codec 'deflate';
SET avro.mapred.deflate.level 5;
</source>
<p>Valid values for avro.output.codec include deflate, snappy, and null.</p>
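<p>As a further sketch, the codec property is set in the same script as the STORE statement that should produce compressed output. The relation name and output path below are hypothetical:</p>
<source>
-- Choose the snappy codec for Avro output written by this script
SET avro.output.codec 'snappy';
STORE measurements INTO 'measurements_snappy' USING AvroStorage();
</source>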
<p>There are a few key differences between Avro and Pig data, and in some cases
it helps to understand the differences between the Avro and Pig data models.
Before writing Pig data to Avro (or creating Avro files to use in Pig), keep in
mind that there might not be an equivalent Avro Schema for every Pig Schema (and
vice versa):</p>
<ul>
<li><strong>Recursive schema definitions</strong> You cannot define schemas recursively in Pig,
but you can define schemas recursively in Avro.</li>
<li><strong>Allowed characters</strong> Pig schemas may sometimes contain characters like colons (":")
that are illegal in Avro names.</li>
<li><strong>Unions</strong> In Avro, you can define an object that may be one of several different
types (including complex types such as records). In Pig, you cannot.</li>
<li><strong>Enums</strong> Avro allows you to define enums to efficiently and abstractly
represent categorical variables, but Pig does not.</li>
<li><strong>Fixed Length Byte Arrays</strong> Avro allows you to define fixed length byte arrays,
but Pig does not.</li>
<li><strong>Nullable values</strong> In Pig, all types are nullable. In Avro, they are not. </li>
</ul>
<p>Here is how AvroStorage translates Pig values to Avro:</p>
<table>
<tr>
<td></td>
<td>Original Pig Type</td>
<td>Translated Avro Type</td>
</tr>
<tr>
<td>Integers</td>
<td>int</td>
<td>["int","null"]</td>
</tr>
<tr>
<td>Longs</td>
<td>long</td>
<td>["long","null"]</td>
</tr>
<tr>
<td>Floats</td>
<td>float</td>
<td>["float","null"]</td>
</tr>
<tr>
<td>Doubles</td>
<td>double</td>
<td>["double","null"]</td>
</tr>
<tr>
<td>Strings</td>
<td>chararray</td>
<td>["string","null"]</td>
</tr>
<tr>
<td>Bytes</td>
<td>bytearray</td>
<td>["bytes","null"]</td>
</tr>
<tr>
<td>Booleans</td>
<td>boolean</td>
<td>["boolean","null"]</td>
</tr>
<tr>
<td>Tuples</td>
<td>tuple</td>
<td>The Pig Tuple schema will be translated to a union of an Avro record with an equivalent
schema and null.</td>
</tr>
<tr>
<td>Bags of Tuples</td>
<td>bag</td>
<td>The Pig Bag schema will be translated to a union of an array of records with an equivalent
schema and null.</td>
</tr>
<tr>
<td>Maps</td>
<td>map</td>
<td>The Pig Map schema will be translated to a union of a map of records with an equivalent
schema and null.</td>
</tr>
</table>
<p>Here is how AvroStorage translates Avro values to Pig:</p>
<table>
<tr>
<td></td>
<td>Original Avro Types</td>
<td>Translated Pig Type</td>
</tr>
<tr>
<td>Integers</td>
<td>["int","null"] or "int"</td>
<td>int</td>
</tr>
<tr>
<td>Longs</td>
<td>["long","null"] or "long"</td>
<td>long</td>
</tr>
<tr>
<td>Floats</td>
<td>["float","null"] or "float"</td>
<td>float</td>
</tr>
<tr>
<td>Doubles</td>
<td>["double","null"] or "double"</td>
<td>double</td>
</tr>
<tr>
<td>Strings</td>
<td>["string","null"] or "string"</td>
<td>chararray</td>
</tr>
<tr>
<td>Enums</td>
<td>Either an enum or a union of an enum and null</td>
<td>chararray</td>
</tr>
<tr>
<td>Bytes</td>
<td>["bytes","null"] or "bytes"</td>
<td>bytearray</td>
</tr>
<tr>
<td>Fixed</td>
<td>Either a fixed length byte array, or a union of a fixed length array and null</td>
<td>bytearray</td>
</tr>
<tr>
<td>Booleans</td>
<td>["boolean","null"] or "boolean"</td>
<td>boolean</td>
</tr>
<tr>
<td>Tuples</td>
<td>Either a record type, or a union of a record and null</td>
<td>tuple</td>
</tr>
<tr>
<td>Bags of Tuples</td>
<td>Either an array, or a union of an array and null</td>
<td>bag</td>
</tr>
<tr>
<td>Maps</td>
<td>Either a map, or a union of a map and null</td>
<td>map</td>
</tr>
</table>
<p>In many cases, AvroStorage will automatically translate your data correctly and you will not
need to provide any more information to AvroStorage. But sometimes, it may be convenient to
manually provide a schema to AvroStorage. See the example sections below for examples
of manually specifying a schema with AvroStorage.
</p>
</section>
<section>
<title>Load Examples</title>
<p>Suppose that you are given a file of Avro data (located in 'stuff')
with the following schema:</p>
<source>
{"type" : "record",
"name" : "stuff",
"fields" : [
{"name" : "label", "type" : "string"},
{"name" : "value", "type" : "int"},
{"name" : "marketingPlans", "type" : ["string", "bytes", "null"]}
]
}
</source>
<p>Additionally, suppose that you don't need the value of the field "marketingPlans".
(That's a good thing, because AvroStorage doesn't know how to translate that Avro schema
to a Pig schema.) To load only the fields "label" and "value" into Pig, you can
manually specify the schema passed to AvroStorage:</p>
<source>
measurements = LOAD 'stuff' USING AvroStorage(
'{"type":"record","name":"measurement","fields":[{"name":"label","type":"string"},{"name":"value","type":"int"}]}'
);
</source>
</section>
<section>
<title>Store Examples</title>
<p>Suppose that you are saving a bag called measurements with the schema:</p>
<source>
measurements:{measurement:(label:chararray,value:int)}
</source>
<p>To store this bag into a file called "measurements", you can use a statement like:</p>
<source>
STORE measurements INTO 'measurements' USING AvroStorage('measurement');
</source>
<p>AvroStorage will translate this to the Avro schema</p>
<source>
{"type":"record",
"name":"measurement",
"fields" : [
{"name" : "label", "type" : ["string", "null"]},
{"name" : "value", "type" : ["int", "null"]}
]
}
</source>
<p>But suppose that you knew that the label and value fields would never be null. You could
define a more precise schema manually using a statement like:</p>
<source>
STORE measurements INTO 'measurements' USING AvroStorage(
'{"type":"record","name":"measurement","fields":[{"name":"label","type":"string"},{"name":"value","type":"int"}]}'
);
</source>
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="TrevniStorage">
<title>TrevniStorage</title>
<p>Loads and stores data from Trevni files.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TrevniStorage(['schema|record name'], ['options'])</p>
</td>
</tr>
</table>
</section>
<p>Trevni is a column-oriented storage format that is part of the Apache Avro project. Trevni is
closely related to Avro.</p>
<p>Likewise, TrevniStorage is very closely related to AvroStorage, and shares the same options as
AvroStorage. See <a href="#AvroStorage">AvroStorage</a> for a detailed description of the
arguments for TrevniStorage.</p>
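<p>As a minimal sketch of this relationship, the following script converts Avro data to Trevni and reads
it back. The paths 'measurements' and 'measurements_trevni' and the record name 'measurement' are
illustrative assumptions, not fixed names.</p>
<source>
-- Read existing Avro data (the path is hypothetical).
measurements = LOAD 'measurements' USING AvroStorage();

-- Write the same relation in Trevni format, passing a record name as for AvroStorage.
STORE measurements INTO 'measurements_trevni' USING TrevniStorage('measurement');

-- Read the Trevni data back.
columns = LOAD 'measurements_trevni' USING TrevniStorage();
</source>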
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="AccumuloStorage">
<title>AccumuloStorage</title>
<p>Loads data from or stores data to an Accumulo table. The first element in a Tuple is equivalent to the "row"
from the Accumulo Key, while the columns in that row can be grouped in various static or wildcarded
ways. Basic wildcarding functionality exists to group various column families/qualifiers into a Map for
LOADs, or serialize a Map into some group of column families or qualifiers on STOREs.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>AccumuloStorage(['columns'[, 'options']])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Arguments</title>
<table>
<tr>
<td>
<p>'columns'</p>
</td>
<td>
<p>A comma-separated list of "columns" to read data from or write data to.
Each of these columns can be considered one of three different types:
</p>
<ol>
<li>Literal</li>
<li>Column family prefix</li>
<li>Column qualifier prefix</li>
</ol>
<p><strong>Literal:</strong> this is the simplest specification
which is a colon-delimited string that maps to a column family and column
qualifier. This will read/write a simple scalar from/to Accumulo.
</p>
<p><strong>Column family prefix:</strong> When reading data, this
will fetch data from Accumulo Key-Values in the current row whose column family matches the
given prefix. This will result in a Map being placed into the Tuple. When writing
data, a Map is also expected at the given offset in the Tuple whose Keys will be
appended to the column family prefix, an empty column qualifier is used, and the Map
value will be placed in the Accumulo Value. A valid column family prefix is a literal
asterisk (*) in which case the Map Key will be equivalent to the Accumulo column family.
</p>
<p><strong>Column qualifier prefix:</strong> Similar to the column
family prefix except it operates on the column qualifier. On reads, Accumulo Key-Values
in the same row that match the given column family and column qualifier prefix will be
placed into a single Map. On writes, the provided column family from the column specification
will be used, the Map key will be appended to the column qualifier provided in the specification,
and the Map Value will be the Accumulo Value.
</p>
<p>When "columns" is not provided or is a blank String, it is treated equivalently to "*".
This is to say that when a column specification string is not provided, for reads, all columns
in the given Accumulo row will be placed into a single Map (with the Map keys being colon
delimited to preserve the column family/qualifier from Accumulo). For writes, the Map keys
will be placed into the column family and the column qualifier will be empty.
</p>
</td>
</tr>
<tr>
<td>
<p>'options'</p>
</td>
<td>
<p>A string that contains space-separated options ("-optionA valueA -optionB valueB -optionC valueC")</p>
<p>The currently supported options are:</p>
<ul>
<li>(-c|--caster) LoadStoreCasterImpl An implementation of a LoadStoreCaster to use when serializing types into Accumulo,
usually AccumuloBinaryConverter or UTF8StorageConverter; defaults to UTF8StorageConverter.
</li>
<li>(-auths|--authorizations) auth1,auth2... A comma-separated list of Accumulo authorizations to use when reading
data from Accumulo. Defaults to the empty set of authorizations (none).
</li>
<li>(-s|--start) start_row The Accumulo row to begin reading from, inclusive</li>
<li>(-e|--end) end_row The Accumulo row to read until, inclusive</li>
<li>(-buff|--mutation-buffer-size) num_bytes The number of bytes to buffer when writing data to Accumulo. A higher
value requires more memory</li>
<li>(-wt|--write-threads) num_threads The number of threads used to write data to Accumulo.</li>
<li>(-ml|--max-latency) milliseconds Maximum time in milliseconds before data is flushed to Accumulo.</li>
<li>(-sep|--separator) str The separator character used when parsing the column specification, defaults to comma (,)</li>
<li>(-iw|--ignore-whitespace) (true|false) Should whitespace be stripped from the column specification, defaults to true</li>
</ul>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>AccumuloStorage has the functionality to store or fetch data from Accumulo. Its goal is to provide
a simple, widely applicable table schema compatible with Pig's API. Each Tuple contains some subset
of the columns stored within one row of the Accumulo table, which depends on the columns provided
as an argument to the function. If '*' is provided, all columns in the table will be returned. The
second argument provides control over a variety of options that can be used to change various properties.</p>
<p>When invoking Pig Scripts that use AccumuloStorage, it's important to ensure that Pig has the Accumulo
jars on its classpath. This is easily achieved using the ACCUMULO_HOME environment variable.
</p>
<source>
PIG_CLASSPATH="$ACCUMULO_HOME/lib/*:$PIG_CLASSPATH" pig my_script.pig
</source>
</section>
<section>
<title>Load Example</title>
<p>The following example fetches all columns for airport codes that fall between Boston (BOS) and San Francisco (SFO)
and that can be viewed with the 'auth1' and/or 'auth2' Accumulo authorizations.</p>
<source>
raw = LOAD 'accumulo://airports?instance=accumulo&amp;user=root&amp;password=passwd&amp;zookeepers=localhost'
USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
'*', '-auths auth1,auth2 -s BOS -e SFO') AS
(code:chararray, all_columns:map[]);
</source>
<p>The datatypes of the columns are declared with the "AS" clause. In this example, the row key,
which is the unique airport code, is assigned to the "code" variable while all of the other
columns are placed into the map. When there is a non-empty column qualifier, the key in that
map will have a colon which separates which portion of the key came from the column family and
which portion came from the column qualifier. The Accumulo value is placed in the Map value.</p>
<p>In most cases it is neither necessary nor, for performance reasons, desirable to fetch all columns. Specific columns can be requested instead:</p>
<source>
raw = LOAD 'accumulo://airports?instance=accumulo&amp;user=root&amp;password=passwd&amp;zookeepers=localhost'
USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
'name,building:num_terminals,carrier*,reviews:transportation*') AS
(code:chararray, name:bytearray, num_terminals:bytearray, carrier_map:map[], transportation_reviews_map:map[]);
</source>
<p>An asterisk can be used when requesting columns to group a collection of columns into a single
Map instead of enumerating each column.</p>
</section>
<section>
<title>Store Example</title>
<p>Data can be easily stored into Accumulo.</p>
<source>
A = LOAD 'flights.txt' AS (id:chararray, carrier_name:chararray, src_airport:chararray, dest_airport:chararray, tail_number:int);
STORE A INTO 'accumulo://flights?instance=accumulo&amp;user=root&amp;password=passwd&amp;zookeepers=localhost' USING
org.apache.pig.backend.hadoop.accumulo.AccumuloStorage('carrier_name,src_airport,dest_airport,tail_number');
</source>
<p>Here, we read the file 'flights.txt' out of HDFS and store the results into the relation A.
We extract a unique ID for the flight, its carrier name, its source and destination airports, and the
tail number from the given file. When STORE'ing back into Accumulo, we specify the column specifications
(in this case, just a column family). Note that only four columns are specified for the five fields in A,
because the first element in the Tuple is used as the row in Accumulo.
</p>
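<p>A Map field can be written with a column family prefix, as described under Arguments. The sketch
below is illustrative only: the input file 'airports.txt', its field names, and the option values are
assumptions chosen for the example.</p>
<source>
-- 'carriers' is a Map; each Map key is appended to the column family prefix "carrier".
A = LOAD 'airports.txt' AS (code:chararray, name:chararray, carriers:map[]);

-- '-buff' sets the mutation buffer size in bytes and '-wt' the number of write threads.
STORE A INTO 'accumulo://airports?instance=accumulo&amp;user=root&amp;password=passwd&amp;zookeepers=localhost' USING
org.apache.pig.backend.hadoop.accumulo.AccumuloStorage('name,carrier*', '-buff 10000000 -wt 4');
</source>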
</section>
</section>
<section id="OrcStorage">
<title>OrcStorage</title>
<p>Loads data from or stores data to Orc files.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>OrcStorage(['options'])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Options</title>
<table>
<tr>
<td>
<p>A string that contains space-separated options (&lsquo;-optionA valueA -optionB valueB -optionC valueC&rsquo;). The current options apply only to STORE operations, not to LOAD.</p>
<p>Currently supported options are:</p>
<ul>
<li>--stripeSize or -s Set the stripe size for the file. Default is 268435456 (256 MB).</li>
<li>--rowIndexStride or -r Set the distance between entries in the row index. Default is 10000.</li>
<li>--bufferSize or -b Set the size of the memory buffers used for compressing and storing the stripe in memory. Default is 262144 (256K).</li>
<li>--blockPadding or -p Sets whether the HDFS blocks are padded to prevent stripes from straddling blocks. Default is true.</li>
<li>--compress or -c Sets the generic compression that is used to compress the data. Valid codecs are: NONE, ZLIB, SNAPPY, LZO. Default is ZLIB.</li>
<li>--version or -v Sets the version of the file that will be written</li>
</ul>
</td>
</tr>
</table>
</section>
<section>
<title>Example</title>
<p>OrcStorage as a StoreFunc.</p>
<source>
A = LOAD 'student.txt' as (name:chararray, age:int, gpa:double);
store A into 'student.orc' using OrcStorage('-c SNAPPY'); -- store student.txt into student.orc with SNAPPY compression
</source>
<p>OrcStorage as a LoadFunc.</p>
<source>
A = LOAD 'student.orc' USING OrcStorage();
describe A; -- See the schema of student.orc
B = filter A by age &gt; 25 and gpa &lt; 3; -- filter condition will be pushed up to loader
dump B; -- dump the rows of student.orc that pass the filter
</source>
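<p>Several store options can be combined in a single option string. The following sketch uses only the
flags listed under Options; the output path and the particular values are illustrative assumptions.</p>
<source>
A = LOAD 'student.txt' AS (name:chararray, age:int, gpa:double);

-- 128 MB stripes (134217728 bytes), a row index entry every 5000 rows, ZLIB compression.
STORE A INTO 'student_tuned.orc' USING OrcStorage('-s 134217728 -r 5000 -c ZLIB');
</source>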
</section>
<section>
<title>Data types</title>
<p>Most Orc data types have a one-to-one mapping to Pig data types. The exceptions are:</p>
<p>Loader side:</p>
<ul>
<li>Orc STRING/CHAR/VARCHAR all map to Pig chararray</li>
<li>Orc BYTE/BINARY all map to Pig bytearray</li>
<li>Orc TIMESTAMP/DATE all map to Pig datetime</li>
<li>Orc DECIMAL maps to Pig bigdecimal</li>
</ul>
<p>Storer side:</p>
<ul>
<li>Pig chararray maps to Orc STRING</li>
<li>Pig datetime maps to Orc TIMESTAMP</li>
<li>Pig bigdecimal/biginteger all map to Orc DECIMAL</li>
<li>Pig bytearray maps to Orc BINARY</li>
</ul>
</section>
<section>
<title>Predicate pushdown</title>
<p>If there is a filter statement right after OrcStorage, Pig will push the filter condition to the loader.
OrcStorage will prune any file/stripe/row group in which no data satisfies the condition. For a file/stripe/row group that contains
data satisfying the filter condition, OrcStorage will load the file/stripe/row group and Pig will evaluate the filter condition
again to remove the remaining data that does not satisfy the filter condition.</p>
<p>OrcStorage predicate pushdown currently supports all primitive data types but none of the complex data types. For example, a map condition
cannot be pushed into OrcStorage:</p>
<source>
A = LOAD 'student.orc' USING OrcStorage();
B = filter A by info#'age' > 25; -- map condition cannot push to OrcStorage
dump B;
</source>
<p>Currently, the following expressions in filter condition are supported in OrcStorage predicate pushdown: &gt;, &gt;=, &lt;, &lt;=, ==, !=, between, in, and, or, not. The missing expressions are: is null, is not null, matches.</p>
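<p>As an illustrative sketch of the list above, the first filter below uses only supported expressions,
while the second uses an expression listed as missing. The field names follow the student.orc example and
the literal values are assumptions.</p>
<source>
A = LOAD 'student.orc' USING OrcStorage();

-- Comparisons, and, or, and in are all in the supported list.
B = FILTER A BY (age &gt;= 18 AND age &lt;= 25) OR name IN ('alice', 'bob');

-- "is not null" is listed as missing, so per the list above this condition is not pushed down;
-- Pig still evaluates it after loading.
C = FILTER A BY name IS NOT NULL;
</source>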
</section>
</section>
</section>
<!-- ======================================================== -->
<!-- ======================================================== -->
<!-- Math Functions -->
<section id="math-functions">
<title>Math Functions</title>
<p>For general information about these functions, see the <a href="http://docs.oracle.com/javase/6/docs/api/">Java API Specification</a>,
<a href="http://docs.oracle.com/javase/6/docs/api/java/lang/Math.html">Class Math</a>. Note the following:</p>
<ul>
<li>
<p>Pig function names are case sensitive and UPPER CASE.</p>
</li>
<li>
<p>Pig may process results differently than as stated in the Java API Specification:</p>
<ul>
<li>
<p>If the result value is null or empty, Pig returns null.</p>
</li>
<li>
<p>If the result value is not a number (NaN), Pig returns null.</p>
</li>
<li>
<p>If Pig is unable to process the expression, Pig throws an exception.</p>
</li>
</ul>
</li>
</ul>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="abs">
<title>ABS</title>
<p>Returns the absolute value of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ABS(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>Any expression whose result is type int, long, float, or double.</p>
</td>
</tr>
</table></section>
<section>
<title>Usage</title>
<p>
Use the ABS function to return the absolute value of an expression.
If the result is not negative (x &#8805; 0), the result is returned. If the result is negative (x &lt; 0), the negation of the result is returned.
</p>
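<p>A minimal sketch, assuming a hypothetical input 'data' with two int fields:</p>
<source>
A = LOAD 'data' AS (x:int, y:int);

-- The distance between x and y, regardless of which is larger.
B = FOREACH A GENERATE x, y, ABS(x - y) AS diff;
</source>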
</section>
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="acos">
<title>ACOS</title>
<p>Returns the arc cosine of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ACOS(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ACOS function to return the arc cosine of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="asin">
<title>ASIN</title>
<p>Returns the arc sine of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ASIN(expression)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ASIN function to return the arc sine of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="atan">
<title>ATAN</title>
<p>Returns the arc tangent of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ATAN(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ATAN function to return the arc tangent of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="cbrt">
<title>CBRT</title>
<p>Returns the cube root of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>CBRT(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the CBRT function to return the cube root of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="ceil">
<title>CEIL</title>
<p>Returns the value of an expression rounded up to the nearest integer.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>CEIL(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the CEIL function to return the value of an expression rounded up to the nearest integer.
This function never decreases the result value.
</p>
<table>
<tr>
<td>
<p>x</p>
</td>
<td>
<p>CEIL(x)</p>
</td>
</tr>
<tr>
<td>
<p> 4.6</p>
</td>
<td>
<p> 5</p>
</td>
</tr>
<tr>
<td>
<p> 3.5</p>
</td>
<td>
<p> 4</p>
</td>
</tr>
<tr>
<td>
<p> 2.4</p>
</td>
<td>
<p> 3</p>
</td>
</tr>
<tr>
<td>
<p>1.0</p>
</td>
<td>
<p>1</p>
</td>
</tr>
<tr>
<td>
<p>-1.0</p>
</td>
<td>
<p>-1</p>
</td>
</tr>
<tr>
<td>
<p>-2.4</p>
</td>
<td>
<p>-2</p>
</td>
</tr>
<tr>
<td>
<p>-3.5</p>
</td>
<td>
<p>-3</p>
</td>
</tr>
<tr>
<td>
<p>-4.6</p>
</td>
<td>
<p>-4</p>
</td>
</tr>
</table>
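<p>A minimal sketch, assuming a hypothetical input 'data' with a single double field:</p>
<source>
A = LOAD 'data' AS (x:double);

-- For the inputs in the table above, 2.4 is rounded up to 3 and -2.4 to -2.
B = FOREACH A GENERATE x, CEIL(x) AS x_up;
</source>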
</section>
</section>
<!-- ======================================================== -->
<section id="cos">
<title>COS</title>
<p>Returns the trigonometric cosine of an expression.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>COS(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression (angle) whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the COS function to return the trigonometric cosine of an expression.
</p>
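<p>The underlying Java method expects the angle in radians. The sketch below converts degrees to radians
inline; the input 'angles' and the field name are assumptions for illustration.</p>
<source>
A = LOAD 'angles' AS (degrees:double);

-- Multiply by pi / 180 to convert degrees to radians before calling COS.
B = FOREACH A GENERATE degrees, COS(degrees * 3.141592653589793 / 180.0) AS cosine;
</source>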
</section>
</section>
<!-- ======================================================== -->
<section id="cosh">
<title>COSH</title>
<p>Returns the hyperbolic cosine of an expression.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>COSH(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the COSH function to return the hyperbolic cosine of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="exp">
<title>EXP</title>
<p>Returns Euler's number e raised to the power of x.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>EXP(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the EXP function to return the value of Euler's number e raised to the power of x (where x is the result value of the expression).
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="floor">
<title>FLOOR</title>
<p>Returns the value of an expression rounded down to the nearest integer.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>FLOOR(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the FLOOR function to return the value of an expression rounded down to the nearest integer.
This function never increases the result value.
</p>
<table>
<tr>
<td>
<p>x</p>
</td>
<td>
<p>FLOOR(x)</p>
</td>
</tr>
<tr>
<td>
<p> 4.6</p>
</td>
<td>
<p> 4</p>
</td>
</tr>
<tr>
<td>
<p> 3.5</p>
</td>
<td>
<p> 3</p>
</td>
</tr>
<tr>
<td>
<p> 2.4</p>
</td>
<td>
<p> 2</p>
</td>
</tr>
<tr>
<td>
<p>1.0</p>
</td>
<td>
<p>1</p>
</td>
</tr>
<tr>
<td>
<p>-1.0</p>
</td>
<td>
<p>-1</p>
</td>
</tr>
<tr>
<td>
<p>-2.4</p>
</td>
<td>
<p>-3</p>
</td>
</tr>
<tr>
<td>
<p>-3.5</p>
</td>
<td>
<p>-4</p>
</td>
</tr>
<tr>
<td>
<p>-4.6</p>
</td>
<td>
<p>-5</p>
</td>
</tr>
</table>
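<p>One common use of FLOOR is bucketing values into fixed-width ranges. A minimal sketch, assuming a
hypothetical input 'data' and a bucket width of 10:</p>
<source>
A = LOAD 'data' AS (x:double);

-- Map each value to the lower bound of its width-10 bucket.
B = FOREACH A GENERATE x, FLOOR(x / 10.0) * 10.0 AS bucket;
</source>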
</section>
</section>
<!-- ======================================================== -->
<section id="log">
<title>LOG</title>
<p>Returns the natural logarithm (base e) of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LOG(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LOG function to return the natural logarithm (base e) of an expression.
</p>
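<p>A logarithm in another base can be derived from the natural logarithm by the change-of-base identity
log_b(x) = ln(x) / ln(b). A minimal sketch, assuming a hypothetical input 'data':</p>
<source>
A = LOAD 'data' AS (x:double);

-- Natural logarithm, and base-2 logarithm via change of base.
B = FOREACH A GENERATE x, LOG(x) AS ln_x, LOG(x) / LOG(2.0) AS log2_x;
</source>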
</section>
</section>
<!-- ======================================================== -->
<section id="log10">
<title>LOG10</title>
<p>Returns the base 10 logarithm of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LOG10(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LOG10 function to return the base 10 logarithm of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="random">
<title>RANDOM</title>
<p>Returns a pseudo random number.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>RANDOM( )</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>N/A</p>
</td>
<td>
<p>No terms.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the RANDOM function to return a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.
</p>
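<p>Because the result lies in the range [0.0, 1.0), comparing it against a threshold keeps roughly that
fraction of the rows. A minimal sketch, assuming a hypothetical input 'data' and a 10% sample:</p>
<source>
A = LOAD 'data' AS (x:double);

-- Keep each row with probability of about 0.1.
B = FILTER A BY RANDOM() &lt; 0.1;
</source>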
</section>
</section>
<!-- ======================================================== -->
<section id="round">
<title>ROUND</title>
<p>Returns the value of an expression rounded to an integer.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ROUND(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type float or double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ROUND function to return the value of an expression rounded to an
integer (if the result type is float) or rounded to a long (if the result
type is double).
</p>
<p>
Values are rounded towards positive infinity: <code>round(x) = floor(x + 0.5)</code>.
</p>
<table>
<tr>
<td>
<p>x</p>
</td>
<td>
<p>ROUND(x)</p>
</td>
</tr>
<tr>
<td>
<p> 4.6</p>
</td>
<td>
<p> 5</p>
</td>
</tr>
<tr>
<td>
<p> 3.5</p>
</td>
<td>
<p> 4</p>
</td>
</tr>
<tr>
<td>
<p> 2.4</p>
</td>
<td>
<p> 2</p>
</td>
</tr>
<tr>
<td>
<p>1.0</p>
</td>
<td>
<p>1</p>
</td>
</tr>
<tr>
<td>
<p>-1.0</p>
</td>
<td>
<p>-1</p>
</td>
</tr>
<tr>
<td>
<p>-2.4</p>
</td>
<td>
<p>-2</p>
</td>
</tr>
<tr>
<td>
<p>-3.5</p>
</td>
<td>
<p>-3</p>
</td>
</tr>
<tr>
<td>
<p>-4.6</p>
</td>
<td>
<p>-5</p>
</td>
</tr>
</table>
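<p>The result type depends on the input type, as described above. A minimal sketch, assuming a
hypothetical input 'data' with one float and one double field:</p>
<source>
A = LOAD 'data' AS (f:float, d:double);

-- ROUND of a float yields an int; ROUND of a double yields a long.
B = FOREACH A GENERATE ROUND(f) AS rf, ROUND(d) AS rd;

-- Inspect the resulting field types.
DESCRIBE B;
</source>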
</section>
</section>
<!-- ======================================================== -->
<section id="round_to">
<title>ROUND_TO</title>
<p>Returns the value of an expression rounded to a fixed number of decimal digits.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ROUND_TO(val, digits [, mode])</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>val</p>
</td>
<td>
<p>An expression whose result is type float or double: the value to round.</p>
</td>
</tr>
<tr>
<td>
<p>digits</p>
</td>
<td>
<p>An expression whose result is type int: the number of digits to preserve.</p>
</td>
</tr>
<tr>
<td>
<p>mode</p>
</td>
<td>
<p>
An optional int specifying the
<a href="https://en.wikipedia.org/wiki/Rounding#Tie-breaking">rounding mode</a>,
according to the <a href="http://docs.oracle.com/javase/7/docs/api/constant-values.html#java.math">constants Java provides</a>.
</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ROUND_TO function to return the value of an expression rounded to a
fixed number of digits. Given a float, its result is a float; given a
double its result is a double.
</p>
<p>
The result is a multiple of the <code>digits</code>-th power of ten: 0
leads to no fractional digits; a negative value zeros out correspondingly
many places to the left of the decimal point.
</p>
<p>
When <code>mode</code> is omitted or has the value 6 (<a href="http://docs.oracle.com/javase/7/docs/api/java/math/RoundingMode.html#HALF_EVEN"><code>RoundingMode.HALF_EVEN</code></a>),
the result is rounded towards the nearest neighbor, and ties are
<a href="https://en.wikipedia.org/wiki/Rounding#Round_half_to_even">rounded to the nearest even digit</a>.
This mode minimizes cumulative error and tends to preserve the average of a set of values.
</p>
<p>
When <code>mode</code> has the value 4 (<a href="http://docs.oracle.com/javase/7/docs/api/java/math/RoundingMode.html#HALF_UP"><code>RoundingMode.HALF_UP</code></a>), the result is
rounded towards the nearest neighbor, and ties are
<a href="https://en.wikipedia.org/wiki/Rounding#Round_half_away_from_zero">rounded
away from zero</a>. This mode matches the behavior of most SQL systems.
</p>
<p>
For other rounding modes, consult
<a href="http://docs.oracle.com/javase/7/docs/api/java/math/RoundingMode.html">Java's
documentation</a>. There is no rounding mode that matches
<code>Math.round</code>'s behavior (i.e. round towards positive infinity)
-- blame Java, not Pig.
</p>
<table>
<tr><th><p>val</p></th> <th><p>digits</p></th> <th><p>mode</p></th> <th><p>ROUND_TO(val, digits [, mode])</p></th></tr>
<tr><td><p> 1234.1789</p> </td> <td><p> 8</p></td> <td><p></p></td> <td><p> 1234.1789</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p> 4</p></td> <td><p></p></td> <td><p> 1234.1789</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p> 1</p></td> <td><p></p></td> <td><p> 1234.2</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p> 0</p></td> <td><p></p></td> <td><p> 1234.0</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p>-1</p></td> <td><p></p></td> <td><p> 1230.0</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p>-3</p></td> <td><p></p></td> <td><p> 1000.0</p> </td> </tr>
<tr><td><p> 1234.1789</p> </td> <td><p>-4</p></td> <td><p></p></td> <td><p> 0.0</p> </td> </tr>
<tr><td><p> 3.25000001</p></td> <td><p> 1</p></td> <td><p></p></td> <td><p> 3.3</p> </td> </tr>
<tr><td><p> 3.25</p> </td> <td><p> 1</p></td> <td><p></p></td> <td><p> 3.2</p> </td> </tr>
<tr><td><p> -3.25</p> </td> <td><p> 1</p></td> <td><p></p></td> <td><p> -3.2</p> </td> </tr>
<tr><td><p> 3.15</p> </td> <td><p> 1</p></td> <td><p></p></td> <td><p> 3.2</p> </td> </tr>
<tr><td><p> -3.15</p> </td> <td><p> 1</p></td> <td><p></p></td> <td><p> -3.2</p> </td> </tr>
<tr><td><p> 3.25</p> </td> <td><p> 1</p></td> <td><p>4</p></td> <td><p> 3.3</p> </td> </tr>
<tr><td><p> -3.25</p> </td> <td><p> 1</p></td> <td><p>4</p></td> <td><p> -3.3</p> </td> </tr>
<tr><td><p> 3.5</p> </td> <td><p> 0</p></td> <td><p></p></td> <td><p> 4.0</p> </td> </tr>
<tr><td><p> -3.5</p> </td> <td><p> 0</p></td> <td><p></p></td> <td><p> -4.0</p> </td> </tr>
<tr><td><p> 2.5</p> </td> <td><p> 0</p></td> <td><p></p></td> <td><p> 2.0</p> </td> </tr>
<tr><td><p> -2.5</p> </td> <td><p> 0</p></td> <td><p></p></td> <td><p> -2.0</p> </td> </tr>
<tr><td><p> 3.5</p> </td> <td><p> 0</p></td> <td><p>4</p></td> <td><p> 4.0</p> </td> </tr>
<tr><td><p> -3.5</p> </td> <td><p> 0</p></td> <td><p>4</p></td> <td><p> -4.0</p> </td> </tr>
<tr><td><p> 2.5</p> </td> <td><p> 0</p></td> <td><p>4</p></td> <td><p> 3.0</p> </td> </tr>
<tr><td><p> -2.5</p> </td> <td><p> 0</p></td> <td><p>4</p></td> <td><p> -3.0</p> </td> </tr>
</table>
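<p>A minimal sketch comparing the default mode with mode 4 (HALF_UP), assuming a hypothetical input
'prices' with the fields shown:</p>
<source>
A = LOAD 'prices' AS (item:chararray, price:double);

-- With no mode, ties round to the nearest even digit; with mode 4, ties round away from zero.
-- Per the table above, a price of 3.25 gives 3.2 and 3.3 respectively.
B = FOREACH A GENERATE item, ROUND_TO(price, 1) AS half_even, ROUND_TO(price, 1, 4) AS half_up;
</source>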
</section>
</section>
<!-- ======================================================== -->
<section id="sin">
<title>SIN</title>
<p>Returns the sine of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SIN(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SIN function to return the sine of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="sinh">
<title>SINH</title>
<p>Returns the hyperbolic sine of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SINH(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SINH function to return the hyperbolic sine of an expression.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="sqrt">
<title>SQRT</title>
<p>Returns the positive square root of an expression.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SQRT(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SQRT function to return the positive square root of an expression.
</p>
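<p>A minimal sketch, assuming a hypothetical input 'points' holding two double coordinates per row:</p>
<source>
A = LOAD 'points' AS (x:double, y:double);

-- Euclidean distance of each point from the origin.
B = FOREACH A GENERATE x, y, SQRT(x * x + y * y) AS distance;
</source>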
</section>
</section>
<!-- ======================================================== -->
<section id="tan">
<title>TAN</title>
<p>Returns the trigonometric tangent of an angle.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TAN(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression (angle) whose result is type double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the TAN function to return the trigonometric tangent of an angle.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="tanh">
<title>TANH</title>
<p>Returns the hyperbolic tangent of an expression. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TANH(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is double.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the TANH function to return the hyperbolic tangent of an expression.
</p>
</section>
</section>
</section>
<!-- End Math Functions -->
<!-- ======================================================== -->
<!-- ======================================================== -->
<!-- String Functions -->
<section id="string-functions">
<title>String Functions</title>
<p>For general information about these functions, see the <a href="http://docs.oracle.com/javase/6/docs/api/">Java API Specification</a>,
<a href="http://docs.oracle.com/javase/6/docs/api/java/lang/String.html">Class String</a>. Note the following:</p>
<ul>
<li>
<p>Pig function names are case sensitive. Most are UPPER CASE; mixed-case names such as EqualsIgnoreCase must be written exactly as shown.</p>
</li>
<li>
<p>Pig string functions have an extra, first parameter: the string to which all the operations are applied.</p>
</li>
<li>
<p>Pig may process results differently than as stated in the Java API Specification. If any of the input parameters are null or if an insufficient number of parameters are supplied, NULL is returned.</p>
</li>
</ul>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="endswith">
<title>ENDSWITH</title>
<p>Tests inputs to determine if the first argument ends with the string in the second. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ENDSWITH(string, testAgainst)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be tested.</p>
</td>
</tr>
<tr>
<td>
<p>testAgainst</p>
</td>
<td>
<p>The string to test against.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ENDSWITH function to determine if the first argument ends with the string in the second.
</p>
<p>
For example, ENDSWITH ('foobar', 'foo') will return false, whereas ENDSWITH ('foobar', 'bar') will return true.
</p>
</section>
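<section>
<title>Example</title>
<p>
A minimal sketch that uses ENDSWITH as a filter condition. The file name 'files.txt' and the field name are hypothetical; the input is assumed to hold one file name per line.
</p>
<source>
-- 'files.txt' is a hypothetical input with one file name per line.
A = LOAD 'files.txt' AS (name:chararray);
-- Keep only the names that end with '.log'.
B = FILTER A BY ENDSWITH(name, '.log');
DUMP B;
</source>
</section>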
</section>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="equalsignorecase">
<title>EqualsIgnoreCase</title>
<p>Compares two Strings ignoring case considerations. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>EqualsIgnoreCase(string1, string2)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string1</p>
</td>
<td>
<p>The source string.</p>
</td>
</tr>
<tr>
<td>
<p>string2</p>
</td>
<td>
<p>The string to compare against.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the EqualsIgnoreCase function to determine if two strings are equal ignoring case.
</p>
</section>
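<section>
<title>Example</title>
<p>
A minimal sketch that filters on a case-insensitive comparison. The file name 'users.txt' and both field names are hypothetical; the input is assumed to be tab-delimited.
</p>
<source>
-- 'users.txt' is a hypothetical tab-delimited input: name, country code.
A = LOAD 'users.txt' AS (name:chararray, country:chararray);
-- Matches 'us', 'US', 'Us' and 'uS' alike.
B = FILTER A BY EqualsIgnoreCase(country, 'us');
DUMP B;
</source>
</section>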
</section>
<!-- ======================================================== -->
<section id="indexof">
<title>INDEXOF</title>
<p>Returns the index of the first occurrence of a character in a string, searching forward from a start index. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>INDEXOF(string, 'character', startIndex)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be searched.</p>
</td>
</tr>
<tr>
<td>
<p>'character'</p>
</td>
<td>
<p>The character being searched for, in quotes. </p>
</td>
</tr>
<tr>
<td>
<p>startIndex</p>
</td>
<td>
<p>The index from which to begin the forward search. </p>
<p>The string index begins with zero (0).</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the INDEXOF function to determine the index of the first occurrence of a character in a string. The forward search for the character begins at the designated start index.
</p>
</section>
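<section>
<title>Example</title>
<p>
A minimal sketch that locates the first comma in each line. The file name 'lines.txt' and the field names are hypothetical.
</p>
<source>
-- 'lines.txt' is a hypothetical input with one string per line.
A = LOAD 'lines.txt' AS (line:chararray);
-- Search forward from index 0 for the first comma.
B = FOREACH A GENERATE line, INDEXOF(line, ',', 0) AS first_comma;
DUMP B;
</source>
<p>
For the line a,b,c the value of first_comma is 1. If the function follows Java's String.indexOf, as the general note for this section suggests, a line without a comma should yield -1.
</p>
</section>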
</section>
<!-- ======================================================== -->
<section id="last-index-of">
<title>LAST_INDEX_OF</title>
<p>Returns the index of the last occurrence of a character in a string, searching backward from the end of the string. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LAST_INDEX_OF(string, 'character')</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be searched.</p>
</td>
</tr>
<tr>
<td>
<p>'character'</p>
</td>
<td>
<p>The character being searched for, in quotes.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LAST_INDEX_OF function to determine the index of the last occurrence of a character in a string. The backward search for the character begins at the end of the string.
</p>
</section>
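<section>
<title>Example</title>
<p>
A minimal sketch that finds the last path separator in each line. The file name 'paths.txt' and the field names are hypothetical.
</p>
<source>
-- 'paths.txt' is a hypothetical input with one file path per line.
A = LOAD 'paths.txt' AS (path:chararray);
-- Index of the last '/' in each path.
B = FOREACH A GENERATE path, LAST_INDEX_OF(path, '/') AS last_slash;
DUMP B;
</source>
<p>
For the path /usr/local/bin the value of last_slash is 10.
</p>
</section>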
</section>
<!-- ======================================================== -->
<section id="lcfirst">
<title>LCFIRST</title>
<p>Converts the first character in a string to lower case. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LCFIRST(expression)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result type is chararray.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LCFIRST function to convert only the first character in a string to lower case.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="lower">
<title>LOWER</title>
<p>Converts all characters in a string to lower case. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LOWER(expression)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result type is chararray.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LOWER function to convert all characters in a string to lower case.
</p>
</section>
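<section>
<title>Example</title>
<p>
A minimal sketch that contrasts LOWER with LCFIRST. The file name 'names.txt' and the field names are hypothetical.
</p>
<source>
-- 'names.txt' is a hypothetical input with one string per line.
A = LOAD 'names.txt' AS (name:chararray);
-- LOWER changes every character; LCFIRST changes only the first one.
B = FOREACH A GENERATE LOWER(name) AS all_lower, LCFIRST(name) AS first_lower;
DUMP B;
</source>
<p>
For the input Hello World the output tuple is (hello world,hello World).
</p>
</section>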
</section>
<!-- ======================================================== -->
<section id="ltrim">
<title>LTRIM</title>
<p>Returns a copy of a string with only leading white space removed.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>LTRIM(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is chararray. </p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the LTRIM function to remove leading white space from a string.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="regex-extract">
<title>REGEX_EXTRACT </title>
<p>Performs regular expression matching and extracts the matched group defined by an index parameter. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>REGEX_EXTRACT (string, regex, index)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string in which to perform the match.</p>
</td>
</tr>
<tr>
<td>
<p>regex</p>
</td>
<td>
<p>The regular expression.</p>
</td>
</tr>
<tr>
<td>
<p>index</p>
</td>
<td>
<p>The index of the matched group to return.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the REGEX_EXTRACT function to perform regular expression matching and to extract the matched group defined by the index parameter. The index is 1-based. The function uses Java regular expression syntax.
</p>
<p>
The function returns a string that corresponds to the matched group in the position specified by the index. If there is no matched expression at that position, NULL is returned.
</p>
</section>
<section>
<title>Example</title>
<p>
This example will return the string '192.168.1.5'.
</p>
<source>
REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1);
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="regex-extract-all">
<title>REGEX_EXTRACT_ALL </title>
<p>Performs regular expression matching and extracts all matched groups.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>REGEX_EXTRACT_ALL (string, regex)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string in which to perform the match.</p>
</td>
</tr>
<tr>
<td>
<p>regex</p>
</td>
<td>
<p>The regular expression.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the REGEX_EXTRACT_ALL function to perform regular expression matching and to extract all matched groups. The function uses Java regular expression form.
</p>
<p>
The function returns a tuple where each field represents a matched expression. If there is no match, an empty tuple is returned.
</p>
</section>
<section>
<title>Example</title>
<p>
This example will return the tuple (192.168.1.5,8020).
</p>
<source>
REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="regex-search">
<title>REGEX_SEARCH</title>
<p>Performs regular expression matching and returns all matches found in a string.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>REGEX_SEARCH(string, 'regExp');</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string in which to perform the match.</p>
</td>
</tr>
<tr>
<td>
<p>'regExp'</p>
</td>
<td>
<p>The regular expression to which the string is to be matched, in quotes.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the REGEX_SEARCH function to perform regular expression matching and to find every match in a string.
</p>
<p>
The function returns tuples which are placed in a bag. Each tuple only contains one field which represents a matched expression.
</p>
</section>
<section>
<title>Example</title>
<p>
This example will return the bag {(=04 ),(=06 ),(=96 )}.
</p>
<source>
REGEX_SEARCH('a=04 b=06 c=96 or more', '(=\\d+\\s)');
</source>
<p>
And this example will return the bag {(04),(06),(96)}.
</p>
<source>
REGEX_SEARCH('a=04 b=06 c=96 or more', '=(\\d+)\\s');
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="replace">
<title>REPLACE</title>
<p>Replaces existing characters in a string with new characters.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>REPLACE(string, 'regExp', 'newChar');</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be updated.</p>
</td>
</tr>
<tr>
<td>
<p>'regExp'</p>
</td>
<td>
<p>The regular expression to which the string is to be matched, in quotes.</p>
</td>
</tr>
<tr>
<td>
<p>'newChar'</p>
</td>
<td>
<p>The new characters replacing the existing characters, in quotes.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the REPLACE function to replace existing characters in a string with new characters.
</p>
<p>
For example, to change "open source software" to "open source wiki" use this statement:
REPLACE(string,'software','wiki')
</p>
<p>
Note that the REPLACE function is internally implemented using
<a href="http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String)">
java.lang.String.replaceAll(String regex, String replacement)</a>
where 'regExp' and 'newChar' are passed as the 1st and 2nd arguments respectively.
If you want to replace
<a href="http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#bs">
special characters</a> such as '[' in the string literal, it is necessary to escape them in 'regExp'
by prefixing them with double backslashes (e.g. '\\[').
</p>
</section>
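<section>
<title>Example</title>
<p>
A minimal sketch of the escaping rule described above. The file name 'lines.txt' and the field names are hypothetical.
</p>
<source>
-- 'lines.txt' is a hypothetical input with one string per line.
A = LOAD 'lines.txt' AS (line:chararray);
-- '[' is a regular expression special character, so it is escaped
-- with double backslashes in the Pig string literal.
B = FOREACH A GENERATE REPLACE(line, '\\[', '(') AS cleaned;
DUMP B;
</source>
<p>
Because the underlying call is replaceAll, every occurrence of '[' in the line is replaced, not only the first one.
</p>
</section>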
</section>
<!-- ======================================================== -->
<section id="rtrim">
<title>RTRIM</title>
<p>Returns a copy of a string with only trailing white space removed.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>RTRIM(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is chararray. </p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the RTRIM function to remove trailing white space from a string.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="sprintf">
<title>SPRINTF</title>
<p>Formats a set of values according to a printf-style template, using the <a href="http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html">native Java Formatter</a> library.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SPRINTF(format, [...vals])</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>format</p>
</td>
<td>
<p>The printf-style string describing the template.</p>
</td>
</tr>
<tr>
<td>
<p>vals</p>
</td>
<td>
<p>
The values to place in the template. There must be a tuple element
for each formatting placeholder, and it must have the correct type:
<code>int</code> or <code>long</code> for integer formats such as
<code>%d</code>; <code>float</code> or <code>double</code> for
decimal formats such as <code>%f</code>; and <code>long</code> for
date/time formats such as <code>%t</code>.
</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SPRINTF function to format a string according to a template. For example, SPRINTF('part-%05d', 69) will return 'part-00069'.
</p>
<table>
<tr><th><p>String&nbsp;format&nbsp;specification</p></th> <th><p>arg1</p></th> <th><p>arg2</p></th> <th><p>arg3</p></th> <th><p>SPRINTF(format, arg1, arg2, arg3)</p></th> <th><p>notes</p></th></tr>
<tr><td><p><code>'%8s|%8d|%-8s'</code></p></td>
<td><p><code>1234567</code></p></td> <td><p><code>1234567</code></p></td> <td><p><code>'yay'</code></p></td>
<td><p><code>' 1234567| 1234567|yay '</code></p></td>
<td><p>Format strings with %s, integers with %d. Types are converted for you where reasonable (here, int -&gt; string).</p></td></tr>
<tr><td><p>(null value)</p></td>
<td><p><code>1234567</code></p></td> <td><p><code>1234567</code></p></td> <td><p><code>'yay'</code></p></td>
<td><p>(null value)</p></td>
<td><p>Returns null (no error or warning) with a null format string.</p></td></tr>
<tr><td><p><code>'%8s|%8d|%-8s'</code></p></td>
<td><p><code>1234567</code></p></td> <td><p>(null value)</p></td> <td><p><code>'yay'</code></p></td>
<td><p>(null value)</p></td>
<td><p>Returns null (no error or warning) if any single argument is null.</p></td></tr>
<tr><td><p><code>'%8.3f|%6x'</code></p></td>
<td><p><code>123.14159</code></p></td> <td><p><code>665568</code></p></td> <td><p><code></code></p></td>
<td><p><code>' 123.142| a27e0'</code></p></td>
<td><p>Format floats/doubles with %f, hexadecimal integers with %x (there are others besides -- see the <a href='http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html'>Java docs</a>)</p></td></tr>
<tr><td><p><code>'%,+10d|%(06d'</code></p></td>
<td><p><code>1234567</code></p></td> <td><p><code>-123</code></p></td> <td><p><code></code></p></td>
<td><p><code>'+1,234,567|(0123)'</code></p></td>
<td><p>Numerics take a prefix modifier: <code>,</code> for locale-specific thousands-delimiting, 0 for zero-padding; <code>+</code> to always show a plus sign for positive numbers; space <code> </code> to allow a space preceding positive numbers; <code>(</code> to indicate negative numbers with parentheses (accountant-style).</p></td></tr>
<tr><td><p><code>'%2$5d: %3$6s %1$3s %2$4x (%&lt;4X)'</code></p></td>
<td><p><code>'the'</code></p></td> <td><p><code>48879</code></p></td> <td><p><code>'wheres'</code></p></td>
<td><p><code>'48879: wheres the beef (BEEF)'</code></p></td>
<td><p>Refer to args positionally and as many times as you like using <code>%(pos)$...</code>. Use <code>%&lt;...</code> to refer to the previously-specified arg.</p></td></tr>
<tr><td><p><code>'Launch Time: %14d %s'</code></p></td>
<td><p><code>ToMilliSeconds(CurrentTime())</code></p></td> <td><p><code>ToString(CurrentTime(), 'yyyy-MM-dd HH:mm:ss Z')</code></p></td> <td><p><code></code></p></td>
<td><p><code>'Launch Time: 1400164132000 2014-05-15 09:28:52 -0500'</code></p></td>
<td><p>Rather than the <code>%t</code> specifier, use ToString to format the date/time portions and SPRINTF to lay out the results.</p></td></tr>
<tr><td><p><code>'%8s|%-8s'</code></p></td> <td><p><code>1234567</code></p></td> <td><p><code></code></p></td> <td><p><code></code></p></td>
<td><p><code>MissingFormatArgumentException: Format specifier '%-8s' </code></p></td><td><p>You must supply arguments for all specifiers</p></td></tr>
<tr><td><p><code>'%8s'</code></p></td> <td><p><code>1234567</code></p></td> <td><p><code>'ignored'</code></p></td> <td><p><code>'also'</code></p></td>
<td><p><code> 1234567</code></p></td> <td><p>It's OK to supply too many, though</p></td></tr>
</table>
<p>
<em>Note: although the Java formatter (and thus this function) offers the
<code>%t</code> specifier for date/time elements, it's best avoided: it's
cumbersome, the output and timezone handling may differ from what you
expect, and it doesn't accept datetime objects from Pig. Instead, just
prepare dates using the ToString UDF as shown.</em>
</p>
</section>
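<section>
<title>Example</title>
<p>
A minimal sketch that applies SPRINTF to loaded fields. The file name 'parts.txt', the field names, and the label format are hypothetical; the input is assumed to be tab-delimited.
</p>
<source>
-- 'parts.txt' is a hypothetical tab-delimited input: numeric id, name.
A = LOAD 'parts.txt' AS (id:int, name:chararray);
-- Zero-pad the id to five digits and append the name.
B = FOREACH A GENERATE SPRINTF('part-%05d-%s', id, name) AS label;
DUMP B;
</source>
<p>
For the input row (69, abc) the label is part-00069-abc. A row with a null id or name yields a null label, as the table above describes.
</p>
</section>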
</section>
<!-- ======================================================== -->
<section id="startswith">
<title>STARTSWITH</title>
<p>Tests inputs to determine if the first argument starts with the string in the second. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>STARTSWITH(string, testAgainst)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be tested.</p>
</td>
</tr>
<tr>
<td>
<p>testAgainst</p>
</td>
<td>
<p>The string to test against.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the STARTSWITH function to determine if the first argument starts with the string in the second.
</p>
<p>
For example, STARTSWITH ('foobar', 'foo') will return true, whereas STARTSWITH ('foobar', 'bar') will return false.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="strsplit">
<title>STRSPLIT</title>
<p>Splits a string around matches of a given regular expression. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>STRSPLIT(string, regex, limit)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be split.</p>
</td>
</tr>
<tr>
<td>
<p>regex</p>
</td>
<td>
<p>The regular expression.</p>
</td>
</tr>
<tr>
<td>
<p>limit</p>
</td>
<td>
<p>If the value is positive, the pattern (the compiled representation of the regular expression) is applied at most limit-1 times, so the result tuple has at most limit elements. The last element of the result tuple contains all input after the last match.</p>
<p>If the value is negative, the length of the result tuple is not limited.</p>
<p>If the value is zero, the length of the result tuple is not limited either, and trailing empty strings (if any) are removed.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the STRSPLIT function to split a string around matches of a given regular expression.
</p>
<p>
For example, given the string (open:source:software), STRSPLIT (string, ':',2) will return ((open,source:software)) and STRSPLIT (string, ':',3) will return ((open,source,software)).
</p>
</section>
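<section>
<title>Example</title>
<p>
A minimal sketch that splits each line at the first colon and flattens the resulting tuple into two fields. The file name 'lines.txt' and the field names are hypothetical.
</p>
<source>
-- 'lines.txt' is a hypothetical input with one colon-delimited string per line.
A = LOAD 'lines.txt' AS (line:chararray);
-- A limit of 2 yields at most two elements: the text before the first ':'
-- and everything after it.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ':', 2)) AS (head:chararray, rest:chararray);
DUMP B;
</source>
<p>
For the line open:source:software the output tuple is (open,source:software).
</p>
</section>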
</section>
<!-- ======================================================== -->
<section id="strsplittobag">
<title>STRSPLITTOBAG</title>
<p>Splits a string around matches of a given regular expression and returns a databag.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>STRSPLITTOBAG(string, regex, limit)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string to be split.</p>
</td>
</tr>
<tr>
<td>
<p>regex</p>
</td>
<td>
<p>The regular expression.</p>
</td>
</tr>
<tr>
<td>
<p>limit</p>
</td>
<td>
<p>If the value is positive, the pattern (the compiled representation of the regular expression)
is applied at most limit-1 times, so the result bag has at most limit tuples. The last tuple
of the result bag contains all input after the last match.
</p>
<p>If the value is negative, the size of the result bag is not limited.</p>
<p>If the value is zero, the size of the result bag is not limited either, and trailing
empty strings (if any) are removed.
</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the STRSPLITTOBAG function to split a string around matches of a given regular expression.
</p>
<p>
For example, given the string (open:source:software), STRSPLITTOBAG (string, ':',2) will return
{(open),(source:software)} and STRSPLITTOBAG (string, ':',3) will return {(open),(source),(software)}.
</p>
</section>
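<section>
<title>Example</title>
<p>
A minimal sketch that turns each delimited line into one output row per token by flattening the bag. The file name 'lines.txt' and the field names are hypothetical.
</p>
<source>
-- 'lines.txt' is a hypothetical input with one colon-delimited string per line.
A = LOAD 'lines.txt' AS (line:chararray);
-- A limit of 0 applies no size limit and drops trailing empty strings.
-- FLATTEN turns each tuple in the bag into its own output row.
B = FOREACH A GENERATE FLATTEN(STRSPLITTOBAG(line, ':', 0)) AS (token:chararray);
DUMP B;
</source>
<p>
For the line open:source:software the output rows are (open), (source) and (software).
</p>
</section>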
</section>
<!-- ======================================================== -->
<section id="substring">
<title>SUBSTRING</title>
<p>Returns a substring from a given string. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SUBSTRING(string, startIndex, stopIndex)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>string</p>
</td>
<td>
<p>The string from which a substring will be extracted.</p>
</td>
</tr>
<tr>
<td>
<p>startIndex</p>
</td>
<td>
<p>The index (type integer) of the first character of the substring.</p>
<p>The index of a string begins with zero (0).</p>
</td>
</tr>
<tr>
<td>
<p>stopIndex</p>
</td>
<td>
<p>The index (type integer) of the character <em>following</em> the last character of the substring.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SUBSTRING function to return a substring from a given string.
</p>
<p>
Given a field named alpha whose value is ABCDEF, to return substring BCD use this statement: SUBSTRING(alpha,1,4). Note that 1 is the index of B (the first character of the substring) and 4 is the index of E (the character <em>following</em> the last character of the substring).
</p>
</section>
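<section>
<title>Example</title>
<p>
A minimal sketch of the statement described above inside a FOREACH. The file name 'codes.txt' is hypothetical; alpha is the field named in the usage text.
</p>
<source>
-- 'codes.txt' is a hypothetical input with one string per line.
A = LOAD 'codes.txt' AS (alpha:chararray);
-- Characters at indexes 1, 2 and 3; index 4 is excluded.
B = FOREACH A GENERATE SUBSTRING(alpha, 1, 4) AS middle;
DUMP B;
</source>
<p>
For the value ABCDEF the result is BCD.
</p>
</section>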
</section>
<!-- ======================================================== -->
<section id="trim">
<title>TRIM</title>
<p>Returns a copy of a string with leading and trailing white space removed.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TRIM(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result is chararray. </p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the TRIM function to remove leading and trailing white space from a string.
</p>
</section>
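<section>
<title>Example</title>
<p>
A minimal sketch that compares TRIM with LTRIM and RTRIM on the same field. The file name 'raw.txt' and the field names are hypothetical.
</p>
<source>
-- 'raw.txt' is a hypothetical input whose values may carry surrounding spaces.
A = LOAD 'raw.txt' AS (raw:chararray);
-- TRIM strips both ends; LTRIM only the start; RTRIM only the end.
B = FOREACH A GENERATE TRIM(raw) AS both, LTRIM(raw) AS left_only, RTRIM(raw) AS right_only;
DUMP B;
</source>
</section>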
</section>
<!-- ======================================================== -->
<section id="ucfirst">
<title>UCFIRST</title>
<p>Returns a string with the first character converted to upper case. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>UCFIRST(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result type is chararray.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the UCFIRST function to convert only the first character in a string to upper case.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="upper">
<title>UPPER</title>
<p>Returns a string converted to upper case. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>UPPER(expression)</p>
</td>
</tr>
</table></section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression whose result type is chararray. </p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the UPPER function to convert all characters in a string to upper case.
</p>
</section>
</section>
<section id="uniqueid">
<title>UniqueID</title>
<p>Returns a unique id string for each record in the alias. </p>
<section>
<title>Usage</title>
<p>
UniqueID generates a unique id for each record. The id takes the form "taskindex-sequence".
</p>
</section>
</section>
</section>
<!-- End String Functions -->
<!-- ======================================================== -->
<!-- ======================================================== -->
<!-- Datetime Functions -->
<section id="datetime-functions">
<title>Datetime Functions</title>
<p>
For general information about datetime type operations, see the <a href="http://docs.oracle.com/javase/6/docs/api/">Java API Specification</a>,
<a href="http://docs.oracle.com/javase/6/docs/api/java/util/Date.html">Java Date class</a>, and <a href="http://joda-time.sourceforge.net/apidocs/index.html">JODA DateTime class</a>.
For information about ISO date and time formats, see <a href="http://www.w3.org/TR/NOTE-datetime">Date and Time Formats</a>.
</p>
<!-- ++++++++++++++++++++++++++++++++++++++++++++++ -->
<section id="add-duration">
<title>AddDuration</title>
<p>Returns the result of a DateTime object plus a <a href="http://en.wikipedia.org/wiki/ISO_8601#Durations">Duration object</a>.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>AddDuration(datetime, duration)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>duration</p>
</td>
<td>
<p>The duration string in <a href="http://en.wikipedia.org/wiki/ISO_8601#Durations">ISO 8601 format</a>.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the AddDuration function to create a new datetime object by adding a duration to a given datetime object.
</p>
</section>
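<section>
<title>Example</title>
<p>
A minimal sketch that adds ninety minutes to a timestamp. The file name 'events.txt' and the field names are hypothetical; the input is assumed to hold ISO 8601 datetime strings, which are converted with the ToDate built-in function.
</p>
<source>
-- 'events.txt' is a hypothetical input with one ISO 8601 datetime string
-- per line, for example 2009-01-07T01:07:01.000Z.
A = LOAD 'events.txt' AS (started:chararray);
-- 'PT1H30M' is the ISO 8601 duration for one hour and thirty minutes.
B = FOREACH A GENERATE AddDuration(ToDate(started), 'PT1H30M') AS ends;
DUMP B;
</source>
</section>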
</section>
<!-- ======================================================== -->
<section id="current-time">
<title>CurrentTime</title>
<p>Returns the DateTime object of the current time.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>CurrentTime()</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the CurrentTime function to generate a datetime object for the current timestamp with millisecond accuracy.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="days-between">
<title>DaysBetween</title>
<p>Returns the number of days between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>DaysBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the DaysBetween function to get the number of days between the two given datetime objects.
</p>
</section>
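<section>
<title>Example</title>
<p>
A minimal sketch that computes the number of days between two datetime values. The file name 'orders.txt' and the field names are hypothetical. The later datetime is passed first on the assumption that the result is the first argument minus the second; check the sign against your Pig version.
</p>
<source>
-- 'orders.txt' is a hypothetical tab-delimited input holding two
-- ISO 8601 datetime strings per line: order time, ship time.
A = LOAD 'orders.txt' AS (ordered:chararray, shipped:chararray);
-- Assumption: passing the later datetime first gives a non-negative count.
B = FOREACH A GENERATE DaysBetween(ToDate(shipped), ToDate(ordered)) AS days_to_ship;
DUMP B;
</source>
</section>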
</section>
<!-- ======================================================== -->
<section id="get-day">
<title>GetDay</title>
<p>Returns the day of a month from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetDay(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetDay function to extract the day of a month from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-hour">
<title>GetHour</title>
<p>Returns the hour of a day from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetHour(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetHour function to extract the hour of a day from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-milli-second">
<title>GetMilliSecond</title>
<p>Returns the millisecond of a second from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetMilliSecond(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetMilliSecond function to extract the millisecond of a second from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-minute">
<title>GetMinute</title>
<p>Returns the minute of an hour from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetMinute(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetMinute function to extract the minute of an hour from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-month">
<title>GetMonth</title>
<p>Returns the month of a year from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetMonth(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetMonth function to extract the month of a year from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-second">
<title>GetSecond</title>
<p>Returns the second of a minute from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetSecond(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetSecond function to extract the second of a minute from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-week">
<title>GetWeek</title>
<p>Returns the week of a week year from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetWeek(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetWeek function to extract the week of a week year from the given datetime object.
Note that week year may be different from year.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="get-week-year">
<title>GetWeekYear</title>
<p>Returns the week year from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetWeekYear(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetWeekYear function to extract the week year from the given datetime object.
Note that week year may be different from year.
</p>
</section>
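<section>
<title>Example</title>
<p>
A minimal sketch that shows the week year differing from the calendar year. The file name 'dates.txt' and the field names are hypothetical; the input is assumed to hold ISO 8601 datetime strings, and ISO week numbering is assumed.
</p>
<source>
-- 'dates.txt' is a hypothetical input with one ISO 8601 datetime string
-- per line, for example 2008-12-30T12:00:00.000Z.
A = LOAD 'dates.txt' AS (d:chararray);
B = FOREACH A GENERATE ToDate(d) AS dt;
-- Calendar year next to the week year and the week within that week year.
C = FOREACH B GENERATE GetYear(dt) AS cal_year, GetWeekYear(dt) AS wk_year, GetWeek(dt) AS wk;
DUMP C;
</source>
<p>
30 December 2008 falls in the first ISO week of 2009, so under ISO week numbering the expected tuple for that input is (2008,2009,1).
</p>
</section>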
</section>
<!-- ======================================================== -->
<section id="get-year">
<title>GetYear</title>
<p>Returns the year from a DateTime object.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>GetYear(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the GetYear function to extract the year from the given datetime object.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="hours-between">
<title>HoursBetween</title>
<p>Returns the number of hours between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>HoursBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the HoursBetween function to get the number of hours between the two given datetime objects.
</p>
</section>
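<section>
<title>Example</title>
<p>
The following sketch applies several of the Between functions to the same pair of datetime objects. The input file 'sessions' and the field names are hypothetical. The later datetime is passed first here as an assumption about the argument order; verify the sign of the result on your Pig version.
</p>
<source>
-- 'sessions' is a hypothetical input with two ISO 8601 strings per line
A = LOAD 'sessions' AS (login:chararray, logout:chararray);
B = FOREACH A GENERATE ToDate(login) AS t1, ToDate(logout) AS t2;
C = FOREACH B GENERATE
    HoursBetween(t2, t1) AS diff_hours,
    MinutesBetween(t2, t1) AS diff_minutes,
    SecondsBetween(t2, t1) AS diff_seconds;
DUMP C;
</source>
</section>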
</section>
<!-- ======================================================== -->
<section id="milli-seconds-between">
<title>MilliSecondsBetween</title>
<p>Returns the number of milliseconds between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>MilliSecondsBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the MilliSecondsBetween function to get the number of milliseconds between the two given datetime objects.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="minutes-between">
<title>MinutesBetween</title>
<p>Returns the number of minutes between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>MinutesBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the MinutesBetween function to get the number of minutes between the two given datetime objects.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="months-between">
<title>MonthsBetween</title>
<p>Returns the number of months between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>MonthsBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the MonthsBetween function to get the number of months between the two given datetime objects.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="seconds-between">
<title>SecondsBetween</title>
<p>Returns the number of seconds between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SecondsBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SecondsBetween function to get the number of seconds between the two given datetime objects.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="subtract-duration">
<title>SubtractDuration</title>
<p>Returns the result of a DateTime object minus a <a href="http://en.wikipedia.org/wiki/ISO_8601#Durations">Duration object</a>.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>SubtractDuration(datetime, duration)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>duration</p>
</td>
<td>
<p>The duration string in <a href="http://en.wikipedia.org/wiki/ISO_8601#Durations">ISO 8601 format</a>.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the SubtractDuration function to create a new datetime object by subtracting a duration from a given datetime object.
</p>
</section>
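<section>
<title>Example</title>
<p>
The following sketch subtracts two ISO 8601 durations from a datetime. The input file 'dates' and the field names are hypothetical; 'P1D' denotes one day and 'PT1H' denotes one hour.
</p>
<source>
-- 'dates' is a hypothetical input with one ISO 8601 string per line
A = LOAD 'dates' AS (dt:chararray);
B = FOREACH A GENERATE ToDate(dt) AS d;
C = FOREACH B GENERATE
    SubtractDuration(d, 'P1D') AS one_day_earlier,
    SubtractDuration(d, 'PT1H') AS one_hour_earlier;
DUMP C;
</source>
</section>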
</section>
<!-- ======================================================== -->
<section id="to-date">
<title>ToDate</title>
<p>Returns a DateTime object according to the given parameters.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ToDate(milliseconds)</p>
<p>ToDate(isostring)</p>
<p>ToDate(userstring, format)</p>
<p>ToDate(userstring, format, timezone)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>milliseconds</p>
</td>
<td>
<p>The offset from 1970-01-01T00:00:00.000Z, expressed as a number of milliseconds (either positive or negative).</p>
</td>
</tr>
<tr>
<td>
<p>isostring</p>
</td>
<td>
<p>The datetime string in the <a href="http://www.w3.org/TR/NOTE-datetime">ISO 8601 format</a>.</p>
</td>
</tr>
<tr>
<td>
<p>userstring</p>
</td>
<td>
<p>The datetime string in the user defined format.</p>
</td>
</tr>
<tr>
<td>
<p>format</p>
</td>
<td>
<p>The date time format pattern string (see <a href="http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html">Java SimpleDateFormat class</a>).</p>
</td>
</tr>
<tr>
<td>
<p>timezone</p>
</td>
<td>
<p>The timezone string. Either the UTC offset format or the location-based format can be used as a parameter; internally the timezone is converted to the UTC offset format.</p>
<p>Please see <a href="http://joda-time.sourceforge.net/timezones.html">the Joda-Time doc</a> for available timezone IDs.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ToDate function to generate a DateTime object. Note that if the timezone is not specified with the ISO datetime string or by the timezone parameter, the default timezone will be used.
</p>
</section>
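<section>
<title>Example</title>
<p>
The following sketch shows the four forms of ToDate side by side. The input file 'events', its fields, and the pattern 'yyyy-MM-dd HH:mm:ss' are hypothetical and chosen only for illustration; 'America/Los_Angeles' is one example of a location-based timezone ID.
</p>
<source>
-- 'events' is a hypothetical input with a millisecond offset, an ISO 8601 string
-- and a string in a user defined format such as 2013-01-31 12:30:00
A = LOAD 'events' AS (ms:long, iso:chararray, custom:chararray);
B = FOREACH A GENERATE
    ToDate(ms) AS from_millis,
    ToDate(iso) AS from_iso,
    ToDate(custom, 'yyyy-MM-dd HH:mm:ss') AS from_pattern,
    ToDate(custom, 'yyyy-MM-dd HH:mm:ss', 'America/Los_Angeles') AS from_pattern_tz;
DUMP B;
</source>
</section>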
</section>
<!-- ======================================================== -->
<section id="to-milli-seconds">
<title>ToMilliSeconds</title>
<p>
Returns the number of milliseconds elapsed since January 1, 1970, 00:00:00.000 GMT
for a DateTime object.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ToMilliSeconds(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ToMilliSeconds function to convert the DateTime to the number of
milliseconds that have passed since January 1, 1970 00:00:00.000 GMT.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="to-string">
<title>ToString</title>
<p>
Converts a DateTime object to an ISO string or to a customized string.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ToString(datetime [, format string])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>format string</p>
</td>
<td>
<p>The date time format pattern string (see <a href="http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html">Java SimpleDateFormat class</a>).</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ToString function to convert the DateTime to a string, either in the ISO format or, when a format string is given, in the customized format.
</p>
</section>
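<section>
<title>Example</title>
<p>
The following sketch calls ToString with and without a format string. The input file 'dates', the field names, and the pattern 'yyyy-MM-dd' are hypothetical and used only for illustration.
</p>
<source>
-- 'dates' is a hypothetical input with one ISO 8601 string per line
A = LOAD 'dates' AS (dt:chararray);
B = FOREACH A GENERATE ToDate(dt) AS d;
C = FOREACH B GENERATE
    ToString(d) AS iso_string,               -- no format string: ISO string
    ToString(d, 'yyyy-MM-dd') AS day_string; -- customized string
DUMP C;
</source>
</section>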
</section>
<!-- ======================================================== -->
<section id="to-unix-time">
<title>ToUnixTime</title>
<p>
Returns the Unix Time as a long for a DateTime object. Unix Time is the
number of seconds elapsed since January 1, 1970, 00:00:00.000 GMT.
</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>ToUnixTime(datetime)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the ToUnixTime function to convert the DateTime to Unix Time.
</p>
</section>
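<section>
<title>Example</title>
<p>
The following sketch contrasts ToUnixTime (seconds) with ToMilliSeconds (milliseconds) for the same datetime. The input file 'dates' and the field names are hypothetical.
</p>
<source>
-- 'dates' is a hypothetical input with one ISO 8601 string per line,
-- for example 1970-01-01T00:00:01.000Z
A = LOAD 'dates' AS (dt:chararray);
B = FOREACH A GENERATE ToDate(dt) AS d;
C = FOREACH B GENERATE ToMilliSeconds(d) AS millis, ToUnixTime(d) AS unix_seconds;
DUMP C;
-- for 1970-01-01T00:00:01.000Z the expected result is (1000,1)
</source>
</section>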
</section>
<!-- ======================================================== -->
<section id="weeks-between">
<title>WeeksBetween</title>
<p>Returns the number of weeks between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>WeeksBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the WeeksBetween function to get the number of weeks between the two given datetime objects.
</p>
</section>
</section>
<!-- ======================================================== -->
<section id="years-between">
<title>YearsBetween</title>
<p>Returns the number of years between two DateTime objects.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>YearsBetween(datetime1, datetime2)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>datetime1</p>
</td>
<td>
<p>A datetime object.</p>
</td>
</tr>
<tr>
<td>
<p>datetime2</p>
</td>
<td>
<p>Another datetime object.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
Use the YearsBetween function to get the number of years between the two given datetime objects.
</p>
</section>
</section>
</section>
<!-- End DateTime Functions -->
<!-- ======================================================== -->
<!-- ======================================================== -->
<!-- Other Functions -->
<section id="bag-tuple-functions">
<title>Tuple, Bag, Map Functions</title>
<!-- ======================================================== -->
<section id="totuple">
<title>TOTUPLE</title>
<p>Converts one or more expressions to type tuple. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TOTUPLE(expression [, expression ...])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression of any data type.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the TOTUPLE function to convert one or more expressions to a tuple.</p>
<p>See also: <a href="basic.html#tuple">Tuple</a> data type and <a href="basic.html#type-construction">Type Construction Operators</a></p>
</section>
<section>
<title>Example</title>
<p>
In this example, fields f1, f2 and f3 are converted to a tuple.
</p>
<source>
a = LOAD 'student' AS (f1:chararray, f2:int, f3:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOTUPLE(f1,f2,f3);
DUMP b;
((John,18,4.0))
((Mary,19,3.8))
((Bill,20,3.9))
((Joe,18,3.8))
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="tobag">
<title>TOBAG</title>
<p>Converts one or more expressions to type bag. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TOBAG(expression [, expression ...])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>expression</p>
</td>
<td>
<p>An expression with any data type.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the TOBAG function to convert one or more expressions to individual tuples which are then placed in a bag.</p>
<p>See also: <a href="basic.html#bag">Bag</a> data type and <a href="basic.html#type-construction">Type Construction Operators</a></p>
</section>
<section>
<title>Example</title>
<p>
In this example, fields f1 and f3 are converted to tuples that are then placed in a bag.
</p>
<source>
a = LOAD 'student' AS (f1:chararray, f2:int, f3:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOBAG(f1,f3);
DUMP b;
({(John),(4.0)})
({(Mary),(3.8)})
({(Bill),(3.9)})
({(Joe),(3.8)})
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="tomap">
<title>TOMAP</title>
<p>Converts key/value expression pairs into a map. </p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TOMAP(key-expression, value-expression [, key-expression, value-expression ...])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>key-expression</p>
</td>
<td>
<p>An expression of type chararray.</p>
</td>
</tr>
<tr>
<td>
<p>value-expression</p>
</td>
<td>
<p>An expression of any type supported by a map.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>Use the TOMAP function to convert pairs of expressions into a map. Note the following:</p>
<ul>
<li>You must supply an even number of expressions as parameters</li>
<li>The elements must comply with map type rules:
<ul>
<li>Every odd element (key-expression) must be a chararray since only chararrays can be keys into the map</li>
<li>Every even element (value-expression) can be of any type supported by a map. </li>
</ul>
</li>
</ul>
<p></p>
<p>See also: <a href="basic.html#map">Map</a> data type and <a href="basic.html#type-construction">Type Construction Operators</a></p>
</section>
<section>
<title>Example</title>
<p>
In this example, student names (type chararray) and student GPAs (type float) are used to create three maps.
</p>
<source>
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate TOMAP(name, gpa);
store B into 'results';
Input (students)
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results)
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
</source>
</section>
</section>
<!-- ======================================================== -->
<section id="topx">
<title>TOP</title>
<p>Returns the top-n tuples from a bag of tuples.</p>
<section>
<title>Syntax</title>
<table>
<tr>
<td>
<p>TOP(topN,column,relation)</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>topN</p>
</td>
<td>
<p>The number of top tuples to return (type integer).</p>
</td>
</tr>
<tr>
<td>
<p>column</p>
</td>
<td>
<p>The tuple column whose values are being compared. Note that 0 denotes the first column.</p>
</td>
</tr>
<tr>
<td>
<p>relation</p>
</td>
<td>
<p>The relation (bag of tuples) containing the tuple column.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Usage</title>
<p>
The TOP function returns a bag containing the top N tuples from the input bag, where N is given by the first parameter. Tuples are compared on a single column, whose position is given by the second parameter. The function assumes that all tuples in the bag contain an element of the same type in the compared column.
</p>
<p>
By default, the TOP function uses descending order. The order can be configured with a DEFINE statement.
</p>
<source>
DEFINE asc TOP('ASC'); -- ascending order
DEFINE desc TOP('DESC'); -- descending order
</source>
</section>
<section>
<title>Example</title>
<p>
In this example the top 10 occurrences are returned.
</p>
<source>
DEFINE asc TOP('ASC'); -- ascending order
DEFINE desc TOP('DESC'); -- descending order
A = LOAD 'data' as (first: chararray, second: chararray);
B = GROUP A BY (first, second);
C = FOREACH B generate FLATTEN(group), COUNT(A) as count;
D = GROUP C BY first; -- again group by first
topResults = FOREACH D {
    result = asc(10, 1, C); -- and retain top 10 (in ascending order) occurrences of 'second' in first
    GENERATE FLATTEN(result);
}
bottomResults = FOREACH D {
    result = desc(10, 1, C); -- and retain top 10 (in descending order) occurrences of 'second' in first
    GENERATE FLATTEN(result);
}
</source>
</section>
</section>
</section>
<!-- End Other Functions -->
<!-- ======================================================== -->
<!-- ======================================================== -->
<!-- Other Functions -->
<section id="hive-udf">
<title>Hive UDF</title>
<p>Pig can invoke all types of Hive UDFs, including UDF, GenericUDF, UDAF, GenericUDAF and GenericUDTF. Depending on the Hive UDF you want to use, declare it in Pig with HiveUDF (handles UDF and GenericUDF), HiveUDAF (handles UDAF and GenericUDAF) or HiveUDTF (handles GenericUDTF).</p>
<section>
<title>Syntax</title>
<p>HiveUDF, HiveUDAF and HiveUDTF share the same syntax.</p>
<table>
<tr>
<td>
<p>HiveUDF(name[, constant parameters])</p>
</td>
</tr>
</table>
</section>
<section>
<title>Terms</title>
<table>
<tr>
<td>
<p>name</p>
</td>
<td>
<p>The Hive UDF name. This can be the fully qualified class name of the Hive UDF/UDTF/UDAF class, or a short name registered in the Hive FunctionRegistry (most Hive built-in UDFs are registered this way).</p>
</td>
</tr>
<tr>
<td>
<p>constant parameters</p>
</td>
<td>
<p>An optional tuple representing the constant parameters of a Hive UDF/UDTF/UDAF. If a Hive UDF requires a constant parameter, there is no other way for Pig to pass that information to Hive, since a Pig schema does not record whether a parameter is constant. A null item in the tuple means that the corresponding field is not a constant; a non-null item represents a constant field. The data type of each item is determined by the Pig constant parser.</p>
</td>
</tr>
</table>
</section>
<section>
<title>Example</title>
<p>HiveUDF</p>
<source>
define sin HiveUDF('sin');
A = LOAD 'student' as (name:chararray, age:int, gpa:double);
B = foreach A generate sin(gpa);
</source>
<p>HiveUDTF</p>
<source>
define explode HiveUDTF('explode');
A = load 'mydata' as (a0:{(b0:chararray)});
B = foreach A generate flatten(explode(a0));
</source>
<p>HiveUDAF</p>
<source>
define avg HiveUDAF('avg');
A = LOAD 'student' as (name:chararray, age:int, gpa:double);
B = group A by name;
C = foreach B generate group, avg(A.age);
</source>
</section>
<p>HiveUDF with constant parameter</p>
<source>
define in_file HiveUDF('in_file', '(null, "names.txt")');
A = load 'student' as (name:chararray, age:long, gpa:double);
B = foreach A generate in_file(name, 'names.txt');
</source>
<p>In this example, we pass (null, "names.txt") to the constructor of the in_file UDF, meaning that the first parameter is a regular parameter and the second parameter is a constant. names.txt can be double quoted (unlike in other Pig syntax), or quoted with \'. Note that 'names.txt' must be passed again in line 3. This looks redundant, but it is needed to bridge the semantic gap between Pig and Hive: the constant must also be passed in the data pipeline in line 3, as it would be for a Pig UDF. Initialization code in a Hive UDF takes an ObjectInspector, which captures both the data type and whether or not the parameter is a constant. Initialization code in Pig, however, takes a schema, which captures only the former. An additional mechanism (the constructor parameter) is therefore needed to convey the latter.</p>
<p>Note: A few Hive 0.14 UDFs contain bugs that affect Pig; these are fixed in Hive 1.0. The affected UDFs are: compute_stats, context_ngrams, count, ewah_bitmap, histogram_numeric, collect_list, collect_set, ngrams, case, in, named_struct, stack, percentile_approx.</p>
</section>
</body>
</document>