| <?xml version="1.0" encoding="UTF-8"?> |
| |
| <!-- |
| Copyright 2002-2004 The Apache Software Foundation Licensed under the Apache License, Version |
| 2.0 (the "License"); you may not use this file except in compliance with the License. You may |
| obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by |
| applicable law or agreed to in writing, software distributed under the License is distributed on |
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See |
| the License for the specific language governing permissions and limitations under the License. |
| --> |
| |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" |
| "http://forrest.apache.org/dtd/document-v20.dtd"> |
| |
| <document> |
| <header> |
| <title>PigUnit - Pig script testing simplified.</title> |
| </header> |
| <body> |
| |
| <section> |
| <title>Overview</title> |
| <p>PigUnit is a simple xUnit framework that enables you to easily test your Pig scripts. |
| With PigUnit you can perform unit testing, regression testing, and rapid prototyping. |
| No cluster setup is required if you run Pig in local mode. |
| </p> |
| </section> |
| |
| <section> |
| <title>PigUnit Example</title> |
| <p>We want to compute the top N most common queries. |
| The Pig script is basic and very similar to the Query Phrase Popularity script in the Pig tutorial. |
| It expects as input a file of queries and a parameter n |
| (n is 2 in our case, to compute a top 2). |
| </p> |
| <p>Setting up a test for this script is simple: the argument and the input data are each |
| specified as an array of text. The expected output of the script is specified the same way |
| and is compared to the actual result of executing the Pig script. |
| </p> |
| <p> |
| Many examples are available in the |
| <a href="http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java">PigUnit tests</a>. |
| </p> |
| |
| <section> |
| <title>Java test</title> |
| <source> |
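| import org.apache.pig.pigunit.PigTest; |
| import org.junit.Test; |
| |
| // TopQueriesTest is a hypothetical class name used to make the example complete. |
| public class TopQueriesTest { |
|   @Test |
|   public void testTop2Queries() throws Exception { |
|     // Parameters substituted into the script (here $n). |
|     String[] args = { |
|         "n=2", |
|     }; |
| |
|     PigTest test = new PigTest("top_queries.pig", args); |
| |
|     // Rows used as the content of the "data" alias. |
|     String[] input = { |
|         "yahoo", |
|         "yahoo", |
|         "yahoo", |
|         "twitter", |
|         "facebook", |
|         "facebook", |
|         "linkedin", |
|     }; |
| |
|     // Tuples expected in the "queries_limit" alias. |
|     String[] output = { |
|         "(yahoo,3)", |
|         "(facebook,2)", |
|     }; |
| |
|     // Run the script with the input above and compare "queries_limit" to the expected output. |
|     test.assertOutput("data", input, "queries_limit", output); |
|   } |
| } |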
| </source> |
| </section> |
| |
| <section> |
| <title>top_queries.pig</title> |
| <source> |
| data = |
| LOAD 'input' |
| AS (query:CHARARRAY); |
| |
| queries_group = |
| GROUP data |
| BY query; |
| |
| queries_count = |
| FOREACH queries_group |
| GENERATE |
| group AS query, |
| COUNT(data) AS total; |
| |
| queries_ordered = |
| ORDER queries_count |
| BY total DESC, query; |
| |
| queries_limit = |
| LIMIT queries_ordered $n; |
| |
| STORE queries_limit INTO 'output'; |
| </source> |
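| <p>To cross-check the expected output by hand, the same logic (count each query, order by count |
| descending and then by query, keep the first n rows) can be sketched in plain Java. This is only an |
| illustration and not part of PigUnit; TopQueriesReference and topN are hypothetical names, and the |
| sketch assumes plain ASCII queries so that Java string ordering matches the ordering used by Pig.</p> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java cross-check of the expected output (hypothetical helper, not part of PigUnit). */
final class TopQueriesReference {

    /** Counts each query, orders by count descending then query ascending, and keeps the first n. */
    static List<String> topN(String[] queries, int n) {
        // GROUP data BY query + COUNT(data)
        Map<String, Long> counts = new TreeMap<>();
        for (String query : queries) {
            counts.merge(query, 1L, Long::sum);
        }

        // ORDER queries_count BY total DESC, query
        List<Map.Entry<String, Long>> ordered = new ArrayList<>(counts.entrySet());
        ordered.sort((a, b) -> {
            int byTotal = Long.compare(b.getValue(), a.getValue());
            return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
        });

        // LIMIT queries_ordered $n, formatted like the tuples in the test
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ordered.size()); i++) {
            Map.Entry<String, Long> entry = ordered.get(i);
            result.add("(" + entry.getKey() + "," + entry.getValue() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String[] input = {
            "yahoo", "yahoo", "yahoo", "twitter", "facebook", "facebook", "linkedin",
        };
        for (String tuple : topN(input, 2)) {
            System.out.println(tuple);
        }
    }
}
```

| <p>With the input of the Java test and n=2, this yields the same two tuples as the expected output array.</p> |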
| </section> |
| |
| <section> |
| <title>Run</title> |
| |
| <p>The test can then be run with JUnit (or any other Java testing framework). It |
| requires the following jars on the classpath: |
| </p> |
| <ol> |
| <li>pig.jar</li> |
| <li>pigunit.jar</li> |
| </ol> |
| |
| <p>The test takes about 25 seconds to run and should pass. |
| If it fails (for example, after changing the parameter to n=3), |
| a diff between the expected and the actual output is displayed: |
| </p> |
| |
| <source> |
| junit.framework.ComparisonFailure: null expected:<...ahoo,3) |
| (facebook,2)[]> but was:<...ahoo,3) |
| (facebook,2)[ |
| (linkedin,1)]> |
| at junit.framework.Assert.assertEquals(Assert.java:81) |
| at junit.framework.Assert.assertEquals(Assert.java:87) |
| at org.apache.pig.pigunit.PigTest.assertEquals(PigTest.java:272) |
| </source> |
| </section> |
| </section> |
| |
| <section> |
| <title>Running in Local Mode</title> |
| <p> |
| PigUnit runs in Pig's local mode by default. |
| Local mode is fast and lets you use your local file system in place of HDFS. |
| It does not require a real cluster, but a new local one is created each time. |
| </p> |
| </section> |
| |
| <section> |
| <title>Running in Mapreduce Mode</title> |
| <p>PigUnit also runs in Pig's mapreduce mode. |
| This mode requires a Hadoop cluster. |
| The configuration of the cluster you select must be on the CLASSPATH |
| (similar to the HADOOP_CONF_DIR variable). |
| </p> |
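| <p>The text above does not show how a test selects mapreduce mode. As a hedged sketch: the PigTest |
| source appears to switch to cluster mode when the Java system property pigunit.exectype.cluster is set. |
| The property name is an assumption that is not stated in this document, so verify it against the |
| PigTest source of your PigUnit version. ClusterModeSetup is a hypothetical helper name.</p> |

```java
/** Hypothetical helper that requests cluster mode before any PigTest is created. */
final class ClusterModeSetup {
    // Assumed property name; confirm it in the PigTest source of your version.
    static final String EXEC_CLUSTER_PROPERTY = "pigunit.exectype.cluster";

    /** Sets the system property for the current JVM. */
    static void enableClusterMode() {
        System.setProperty(EXEC_CLUSTER_PROPERTY, "true");
    }

    public static void main(String[] args) {
        enableClusterMode();
        System.out.println(EXEC_CLUSTER_PROPERTY + "=" + System.getProperty(EXEC_CLUSTER_PROPERTY));
    }
}
```

| <p>Under the same assumption, the property could instead be passed to the JVM that runs the tests |
| as -Dpigunit.exectype.cluster=true.</p> |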
| |
| <p>Note that PigUnit comes with a standalone MiniCluster that can be started |
| externally with: |
| </p> |
| |
| <source> |
| java -cp .../pig.jar:.../pigunit.jar org.apache.pig.pigunit.MiniClusterRunner |
| </source> |
| <p>This is useful when prototyping, because a test cluster is then ready to use. |
| </p> |
| </section> |
| |
| <section> |
| <title>Building PigUnit</title> |
| <p>To compile PigUnit (pigunit.jar), run this command from the Pig trunk:</p> |
| <source> |
| $pig_trunk ant pigunit-jar |
| </source> |
| </section> |
| |
| <section> |
| <title>Troubleshooting Tips</title> |
| <p>Common problems you may encounter are discussed below.</p> |
| <section> |
| <title>Classpath in Mapreduce mode</title> |
| <p>When using PigUnit in mapreduce mode, be sure to include the $HADOOP_CONF_DIR of the |
| cluster in your CLASSPATH. If it is missing, Pig fails with the error shown below.</p> |
| <p> |
| MiniCluster generates such a configuration in build/classes. |
| </p> |
| <source> |
| org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use local mode, please put -x local option in command line |
| </source> |
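| <p>Before running a test in mapreduce mode, it can help to confirm that one of the configuration files |
| named in the error above is actually visible on the classpath. The following is a minimal sketch that |
| uses only the JDK. HadoopConfCheck and findFirst are hypothetical names, and Pig's own lookup may go |
| through a different class loader, so treat the result as a hint rather than a guarantee.</p> |

```java
import java.net.URL;

/** Hypothetical diagnostic: looks for Hadoop configuration files on the classpath. */
final class HadoopConfCheck {

    /** Returns the URL of the first named resource found on the classpath, or null if none is found. */
    static URL findFirst(String... names) {
        for (String name : names) {
            URL url = ClassLoader.getSystemResource(name);
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // File names taken from the ERROR 4010 message.
        URL conf = findFirst("hadoop-site.xml", "core-site.xml");
        System.out.println(conf == null
                ? "No Hadoop configuration found on the classpath"
                : "Hadoop configuration found at " + conf);
    }
}
```

| <p>Run it with the same classpath as the tests; if nothing is found, add the cluster's configuration |
| directory to the CLASSPATH.</p> |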
| </section> |
| |
| <section> |
| <title>UDF jars Not Found</title> |
| <p>This warning means that some jars are missing from your test environment.</p> |
| <source> |
| WARN util.JarManager: Couldn't find the jar for org.apache.pig.piggybank.evaluation.string.LOWER, skip it |
| </source> |
| </section> |
| |
| <section> |
| <title>Storing data</title> |
| <p>PigUnit currently drops all STORE and DUMP commands from the script. You can tell PigUnit to |
| keep a command and then execute the script:</p> |
| <source> |
| test = new PigTest(PIG_SCRIPT, args); |
| test.unoverride("STORE"); |
| test.runScript(); |
| </source> |
| </section> |
| |
| <section> |
| <title>Cache archive</title> |
| <p>For the cache archive to work, your test environment needs to have the cache archive options |
| specified as Java properties or in an additional XML configuration file on its CLASSPATH.</p> |
| <p>If you use a local cluster, you need to set the required environment variables before |
| starting it:</p> |
| <source>export LD_LIBRARY_PATH=/home/path/to/lib</source> |
| </section> |
| </section> |
| |
| <section> |
| <title>Future Enhancements</title> |
| <p>Improvements and other components based on PigUnit could be built later.</p> |
| <p>For example, we could build a PigTestCase and PigTestSuite on top of PigTest to:</p> |
| <ol> |
| <li>Add the notion of workspaces for each test.</li> |
| <li>Remove the boilerplate code that appears when there is more than one test method.</li> |
| <li>Add a standalone utility that reads test configurations and generates a test report. |
| </li> |
| </ol> |
| </section> |
| </body> |
| </document> |