{
"paragraphs": [
{
"text": "%md\n\n\n### [Apache Pig](http://pig.apache.org/) is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.\n\nPig\u0027s language layer currently consists of a textual language called Pig Latin, which has the following key properties:\n\n* Ease of programming. It is trivial to achieve parallel execution of simple, \"embarrassingly parallel\" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.\n* Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.\n* Extensibility. Users can create their own functions to do special-purpose processing.\n",
"user": "anonymous",
"dateUpdated": "Jan 22, 2017 12:48:50 PM",
"config": {
"colWidth": 12.0,
"enabled": true,
"results": {},
"editorSetting": {
"language": "markdown",
"editOnDblClick": true
},
"editorMode": "ace/mode/markdown",
"editorHide": true,
"tableHide": false
},
"settings": {
"params": {},
"forms": {}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "HTML",
"data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003ch3\u003e\u003ca href\u003d\"http://pig.apache.org/\"\u003eApache Pig\u003c/a\u003e is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.\u003c/h3\u003e\n\u003cp\u003ePig\u0026rsquo;s language layer currently consists of a textual language called Pig Latin, which has the following key properties:\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eEase of programming. It is trivial to achieve parallel execution of simple, \u0026ldquo;embarrassingly parallel\u0026rdquo; data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.\u003c/li\u003e\n \u003cli\u003eOptimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.\u003c/li\u003e\n \u003cli\u003eExtensibility. Users can create their own functions to do special-purpose processing.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e"
}
]
},
"apps": [],
"jobName": "paragraph_1483277502513_1156234051",
"id": "20170101-213142_1565013608",
"dateCreated": "Jan 1, 2017 9:31:42 PM",
"dateStarted": "Jan 22, 2017 12:48:50 PM",
"dateFinished": "Jan 22, 2017 12:48:51 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
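{
"text": "%md\n\nTo make the data-flow style above concrete, here is a minimal Pig Latin sketch. It is not part of the tutorial proper, and the file `users.csv` with its name/age layout is hypothetical:\n\n```\nusers \u003d load \u0027users.csv\u0027 using PigStorage(\u0027,\u0027) as (name:chararray, age:int);\nadults \u003d filter users by age \u003e\u003d 18;\nby_age \u003d group adults by age;\ncounts \u003d foreach by_age generate group as age, COUNT(adults) as cnt;\ndump counts;\n```\n\nEach statement names one transformation in the flow, which is what makes such scripts easy to read and easy for Pig to parallelize.",
"user": "anonymous",
"config": {
"colWidth": 12.0,
"enabled": true,
"results": {},
"editorSetting": {
"language": "markdown",
"editOnDblClick": true
},
"editorMode": "ace/mode/markdown",
"editorHide": false,
"tableHide": false
},
"settings": {
"params": {},
"forms": {}
},
"apps": [],
"id": "20170122-125201_1000000001",
"status": "READY",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},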
{
"text": "%md\n\nThis pig tutorial use pig to do the same thing as spark tutorial. The default mode is mapreduce, you can also use other modes like local/tez_local/tez. For mapreduce mode, you need to have hadoop installed and export `HADOOP_CONF_DIR` in `zeppelin-env.sh`\n\nThe tutorial consists of 3 steps.\n\n* Use shell interpreter to download bank.csv and upload it to hdfs\n* use `%pig` to process the data\n* use `%pig.query` to query the data",
"user": "anonymous",
"dateUpdated": "Jan 22, 2017 12:48:55 PM",
"config": {
"colWidth": 12.0,
"enabled": true,
"results": {},
"editorSetting": {
"language": "markdown",
"editOnDblClick": true
},
"editorMode": "ace/mode/markdown",
"editorHide": true,
"tableHide": false
},
"settings": {
"params": {},
"forms": {}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "HTML",
"data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThis pig tutorial use pig to do the same thing as spark tutorial. The default mode is mapreduce, you can also use other modes like local/tez_local/tez. For mapreduce mode, you need to have hadoop installed and export \u003ccode\u003eHADOOP_CONF_DIR\u003c/code\u003e in \u003ccode\u003ezeppelin-env.sh\u003c/code\u003e\u003c/p\u003e\n\u003cp\u003eThe tutorial consists of 3 steps.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eUse shell interpreter to download bank.csv and upload it to hdfs\u003c/li\u003e\n \u003cli\u003euse \u003ccode\u003e%pig\u003c/code\u003e to process the data\u003c/li\u003e\n \u003cli\u003euse \u003ccode\u003e%pig.query\u003c/code\u003e to query the data\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e"
}
]
},
"apps": [],
"jobName": "paragraph_1483689316217_-629483391",
"id": "20170106-155516_1050601059",
"dateCreated": "Jan 6, 2017 3:55:16 PM",
"dateStarted": "Jan 22, 2017 12:48:55 PM",
"dateFinished": "Jan 22, 2017 12:48:55 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
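{
"text": "%md\n\nFor reference, the `HADOOP_CONF_DIR` export mentioned above goes into `conf/zeppelin-env.sh`. A minimal sketch, assuming a typical layout; the path below is an assumption, so point it at wherever your cluster\u0027s Hadoop configuration actually lives:\n\n```bash\n# conf/zeppelin-env.sh\nexport HADOOP_CONF_DIR\u003d/etc/hadoop/conf  # hypothetical path; adjust for your cluster\n```",
"user": "anonymous",
"config": {
"colWidth": 12.0,
"enabled": true,
"results": {},
"editorSetting": {
"language": "markdown",
"editOnDblClick": true
},
"editorMode": "ace/mode/markdown",
"editorHide": false,
"tableHide": false
},
"settings": {
"params": {},
"forms": {}
},
"apps": [],
"id": "20170122-125202_1000000002",
"status": "READY",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},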
{
"text": "%sh\n\nwget https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv\nhadoop fs -put bank.csv .\n",
"user": "anonymous",
"dateUpdated": "Jan 22, 2017 12:51:48 PM",
"config": {
"colWidth": 12.0,
"enabled": true,
"results": {},
"editorSetting": {
"language": "text",
"editOnDblClick": false
},
"editorMode": "ace/mode/text"
},
"settings": {
"params": {},
"forms": {}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "TEXT",
"data": "--2017-01-22 12:51:48-- https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv\nResolving s3.amazonaws.com... 52.216.80.227\nConnecting to s3.amazonaws.com|52.216.80.227|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 461474 (451K) [application/octet-stream]\nSaving to: \u0027bank.csv.3\u0027\n\n 0K .......... .......... .......... .......... .......... 11% 141K 3s\n 50K .......... .......... .......... .......... .......... 22% 243K 2s\n 100K .......... .......... .......... .......... .......... 33% 449K 1s\n 150K .......... .......... .......... .......... .......... 44% 413K 1s\n 200K .......... .......... .......... .......... .......... 55% 746K 1s\n 250K .......... .......... .......... .......... .......... 66% 588K 0s\n 300K .......... .......... .......... .......... .......... 77% 840K 0s\n 350K .......... .......... .......... .......... .......... 88% 795K 0s\n 400K .......... .......... .......... .......... .......... 99% 1.35M 0s\n 450K 100% 13.2K\u003d1.1s\n\n2017-01-22 12:51:50 (409 KB/s) - \u0027bank.csv.3\u0027 saved [461474/461474]\n\n17/01/22 12:51:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
}
]
},
"apps": [],
"jobName": "paragraph_1485058437578_-1906301827",
"id": "20170122-121357_640055590",
"dateCreated": "Jan 22, 2017 12:13:57 PM",
"dateStarted": "Jan 22, 2017 12:51:48 PM",
"dateFinished": "Jan 22, 2017 12:51:52 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pig\n\nbankText \u003d load \u0027bank.csv\u0027 using PigStorage(\u0027;\u0027);\nbank \u003d foreach bankText generate $0 as age, $1 as job, $2 as marital, $3 as education, $5 as balance; \nbank \u003d filter bank by age !\u003d \u0027\"age\"\u0027;\nbank \u003d foreach bank generate (int)age, REPLACE(job,\u0027\"\u0027,\u0027\u0027) as job, REPLACE(marital, \u0027\"\u0027, \u0027\u0027) as marital, (int)(REPLACE(balance, \u0027\"\u0027, \u0027\u0027)) as balance;\n\n-- The following statement is optional, it depends on whether your needs.\n-- store bank into \u0027clean_bank.csv\u0027 using PigStorage(\u0027;\u0027);\n\n\n",
"user": "anonymous",
"dateUpdated": "Feb 24, 2017 5:08:08 PM",
"config": {
"colWidth": 12.0,
"editorMode": "ace/mode/pig",
"results": {},
"enabled": true,
"editorSetting": {
"language": "pig",
"editOnDblClick": false
}
},
"settings": {
"params": {},
"forms": {}
},
"results": {
"code": "SUCCESS",
"msg": []
},
"apps": [],
"jobName": "paragraph_1483277250237_-466604517",
"id": "20161228-140640_1560978333",
"dateCreated": "Jan 1, 2017 9:27:30 PM",
"dateStarted": "Feb 24, 2017 5:08:08 PM",
"dateFinished": "Feb 24, 2017 5:08:11 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pig.query\n\nbank_data \u003d filter bank by age \u003c 30;\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1);\n\n",
"user": "anonymous",
"dateUpdated": "Feb 24, 2017 5:08:13 PM",
"config": {
"colWidth": 4.0,
"editorMode": "ace/mode/pig",
"results": {
"0": {
"graph": {
"mode": "multiBarChart",
"height": 300.0,
"optionOpen": false
},
"helium": {}
}
},
"enabled": true,
"editorSetting": {
"language": "pig",
"editOnDblClick": false
}
},
"settings": {
"params": {},
"forms": {}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "TABLE",
"data": "group\tcol_1\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n"
}
]
},
"apps": [],
"jobName": "paragraph_1483277250238_-465450270",
"id": "20161228-140730_1903342877",
"dateCreated": "Jan 1, 2017 9:27:30 PM",
"dateStarted": "Feb 24, 2017 5:08:13 PM",
"dateFinished": "Feb 24, 2017 5:08:26 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pig.query\n\nbank_data \u003d filter bank by age \u003c ${maxAge\u003d40};\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1) as count;",
"user": "anonymous",
"dateUpdated": "Feb 24, 2017 5:08:14 PM",
"config": {
"colWidth": 4.0,
"editorMode": "ace/mode/pig",
"results": {
"0": {
"graph": {
"mode": "pieChart",
"height": 300.0,
"optionOpen": false
},
"helium": {}
}
},
"enabled": true,
"editorSetting": {
"language": "pig",
"editOnDblClick": false
}
},
"settings": {
"params": {
"maxAge": "36"
},
"forms": {
"maxAge": {
"name": "maxAge",
"defaultValue": "40",
"hidden": false
}
}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "TABLE",
"data": "group\tcount\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n30\t150\n31\t199\n32\t224\n33\t186\n34\t231\n35\t180\n"
}
]
},
"apps": [],
"jobName": "paragraph_1483277250239_-465835019",
"id": "20161228-154918_1551591203",
"dateCreated": "Jan 1, 2017 9:27:30 PM",
"dateStarted": "Feb 24, 2017 5:08:14 PM",
"dateFinished": "Feb 24, 2017 5:08:29 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pig.query\n\nbank_data \u003d filter bank by marital\u003d\u003d\u0027${marital\u003dsingle,single|divorced|married}\u0027;\nb \u003d group bank_data by age;\nforeach b generate group, COUNT($1) as count;\n\n\n",
"user": "anonymous",
"dateUpdated": "Feb 24, 2017 5:08:15 PM",
"config": {
"colWidth": 4.0,
"editorMode": "ace/mode/pig",
"results": {
"0": {
"graph": {
"mode": "scatterChart",
"height": 300.0,
"optionOpen": false
},
"helium": {}
}
},
"enabled": true,
"editorSetting": {
"language": "pig",
"editOnDblClick": false
}
},
"settings": {
"params": {
"marital": "married"
},
"forms": {
"marital": {
"name": "marital",
"defaultValue": "single",
"options": [
{
"value": "single"
},
{
"value": "divorced"
},
{
"value": "married"
}
],
"hidden": false
}
}
},
"results": {
"code": "SUCCESS",
"msg": [
{
"type": "TABLE",
"data": "group\tcount\n23\t3\n24\t11\n25\t11\n26\t18\n27\t26\n28\t23\n29\t37\n30\t56\n31\t104\n32\t105\n33\t103\n34\t142\n35\t109\n36\t117\n37\t100\n38\t99\n39\t88\n40\t105\n41\t97\n42\t91\n43\t79\n44\t68\n45\t76\n46\t82\n47\t78\n48\t91\n49\t87\n50\t74\n51\t63\n52\t66\n53\t75\n54\t56\n55\t68\n56\t50\n57\t78\n58\t67\n59\t56\n60\t36\n61\t15\n62\t5\n63\t7\n64\t6\n65\t4\n66\t7\n67\t5\n68\t1\n69\t5\n70\t5\n71\t5\n72\t4\n73\t6\n74\t2\n75\t3\n76\t1\n77\t5\n78\t2\n79\t3\n80\t6\n81\t1\n83\t2\n86\t1\n87\t1\n"
}
]
},
"apps": [],
"jobName": "paragraph_1483277250240_-480070728",
"id": "20161228-142259_575675591",
"dateCreated": "Jan 1, 2017 9:27:30 PM",
"dateStarted": "Feb 24, 2017 5:08:27 PM",
"dateFinished": "Feb 24, 2017 5:08:31 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pig\n",
"dateUpdated": "Jan 1, 2017 9:27:30 PM",
"config": {},
"settings": {
"params": {},
"forms": {}
},
"apps": [],
"jobName": "paragraph_1483277250240_-480070728",
"id": "20161228-155036_1854903164",
"dateCreated": "Jan 1, 2017 9:27:30 PM",
"status": "READY",
"errorMessage": "",
"progressUpdateIntervalMs": 500
}
],
"name": "Using Pig for querying data",
"id": "2C57UKYWR",
"angularObjects": {
"2C3RWCVAG:shared_process": [],
"2C9KGCHDE:shared_process": [],
"2C8X2BS16:shared_process": []
},
"config": {},
"info": {}
}