| { |
| "paragraphs": [ |
| { |
| "title": "Introduction", |
| "text": "%md\n\nThis is a tutorial for Spark SQL in PySpark (based on Spark 2.x). First we need to clarify several concepts of Spark SQL\n\n* **SparkSession** - This is the entry point of Spark SQL, you need use `SparkSession` to create DataFrame/Dataset, register UDF, query table and etc.\n* **DataFrame** - There\u0027s no Dataset in PySpark, but only DataFrame. The DataFrame of PySpark is very similar with DataFrame concept of Pandas, but is distributed. \n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:49.926", |
| "progress": 0, |
| "config": { |
| "tableHide": false, |
| "editorSetting": { |
| "language": "text", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": false |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/text", |
| "fontSize": 9.0, |
| "editorHide": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "title": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "HTML", |
| "data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThis is a tutorial for Spark SQL in PySpark (based on Spark 2.x). First we need to clarify several concepts of Spark SQL\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSparkSession\u003c/strong\u003e - This is the entry point of Spark SQL, you need use \u003ccode\u003eSparkSession\u003c/code\u003e to create DataFrame/Dataset, register UDF, query table and etc.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDataFrame\u003c/strong\u003e - There\u0026rsquo;s no Dataset in PySpark, but only DataFrame. The DataFrame of PySpark is very similar with DataFrame concept of Pandas, but is distributed.\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003c/div\u003e" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700849_2035279674", |
| "id": "20180530-101118_380906698", |
| "dateCreated": "2020-01-07 17:01:40.849", |
| "dateStarted": "2021-07-26 04:33:49.929", |
| "dateFinished": "2021-07-26 04:33:49.938", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Create DataFrame", |
| "text": "%md\n\nThere\u0027re 2 ways to create DataFrame\n\n* Use SparkSession to create DataFrame directly. You can either create DataFrame from RDD, List type objects and etc.\n* Use DataFrameReader to create Dataset/DataFrame from many kinds of storages that are supported by spark, such as HDFS, jdbc and etc.", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:40.824", |
| "progress": 0, |
| "config": { |
| "tableHide": false, |
| "editorSetting": { |
| "language": "markdown", |
| "editOnDblClick": true, |
| "completionKey": "TAB", |
| "completionSupport": false |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/markdown", |
| "fontSize": 9.0, |
| "editorHide": true, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "HTML", |
| "data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThere\u0026rsquo;re 2 ways to create DataFrame\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eUse SparkSession to create DataFrame directly. You can either create DataFrame from RDD, List type objects and etc.\u003c/li\u003e\n\u003cli\u003eUse DataFrameReader to create Dataset/DataFrame from many kinds of storages that are supported by spark, such as HDFS, jdbc and etc.\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003c/div\u003e" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_-1934281275", |
| "id": "20180530-101515_948520659", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:34:40.824", |
| "dateFinished": "2021-07-26 04:34:40.832", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Prerequisites", |
| "text": "%md\n\n\n\n**It is strongly recommended to run the following %spark.conf paragraph first to make sure correct configuration is used.**", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:50.131", |
| "progress": 0, |
| "config": { |
| "tableHide": false, |
| "editorSetting": { |
| "language": "markdown", |
| "editOnDblClick": true, |
| "completionKey": "TAB", |
| "completionSupport": false |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/markdown", |
| "fontSize": 9.0, |
| "editorHide": true, |
| "title": false, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "HTML", |
| "data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003cp\u003e\u003cstrong\u003eIt is strongly recommended to run the following %spark.conf paragraph first to make sure correct configuration is used.\u003c/strong\u003e\u003c/p\u003e\n\n\u003c/div\u003e" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_-1709355244", |
| "id": "20180530-110023_1756702033", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:50.134", |
| "dateFinished": "2021-07-26 04:33:50.142", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Spark Configuration", |
| "text": "%spark.conf\n\n# It is strongly recommended to set SPARK_HOME explicitly instead of using the embedded spark of Zeppelin. As the function of embedded spark of Zeppelin is limited and can only run in local mode.\n# SPARK_HOME \u003cyour_spark_dist_path\u003e\n\n# Uncomment the following line if you want to use yarn-cluster mode (It is recommended to use yarn-cluster mode from Zeppelin 0.8, as the driver will run on the remote host of yarn cluster which can mitigate memory pressure of zeppelin server)\n# master yarn-cluster\n\n# Uncomment the following line if you want to use yarn-client mode (It is not recommended to use yarn-client after 0.8. Because it would launch the driver in the same host of zeppelin server which will increase memory pressure of zeppelin server)\n# master yarn-client\n\n# Uncomment the following line to enable HiveContext, and also put hive-site.xml under SPARK_CONF_DIR\n# zeppelin.spark.useHiveContext true\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:50.234", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "text", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/text", |
| "fontSize": 9.0, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "title": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_-1532509261", |
| "id": "20180530-110007_162886838", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:50.240", |
| "dateFinished": "2021-07-26 04:33:50.246", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Create Dataset/DataFrame via SparkSession", |
| "text": "%spark.pyspark\n\n# create DataFrame from python list. It can infer schema for you.\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\ndf1.printSchema()\ndf1.show()\n\n# create DataFrame from pandas dataframe\ndf2 \u003d spark.createDataFrame(df1.toPandas())\ndf2.printSchema()\ndf2.show()\n\n\n\n\n\n\n\n\n\n\n\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:50.339", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "root\n |-- id: long (nullable \u003d true)\n |-- name: string (nullable \u003d true)\n |-- age: long (nullable \u003d true)\n |-- country: string (nullable \u003d true)\n\n+---+-----+---+-------+\n| id| name|age|country|\n+---+-----+---+-------+\n| 1| andy| 20| USA|\n| 2| jeff| 23| China|\n| 3|james| 18| USA|\n+---+-----+---+-------+\n\nroot\n |-- id: long (nullable \u003d true)\n |-- name: string (nullable \u003d true)\n |-- age: long (nullable \u003d true)\n |-- country: string (nullable \u003d true)\n\n+---+-----+---+-------+\n| id| name|age|country|\n+---+-----+---+-------+\n| 1| andy| 20| USA|\n| 2| jeff| 23| China|\n| 3|james| 18| USA|\n+---+-----+---+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d61" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d62" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d63" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d64" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d65" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_1345292725", |
| "id": "20180530-101750_1491737301", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:50.341", |
| "dateFinished": "2021-07-26 04:33:51.158", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Create DataFrame via DataFrameReader", |
| "text": "%spark.pyspark\n\nSPARK_HOME \u003d os.getenv(\u0027SPARK_HOME\u0027)\n\n# Read data from json file\n# link for this people.json (https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json)\n# Use hdfs path if you are using hdfs\ndf1 \u003d spark.read.json(\"file://\" + SPARK_HOME + \"/examples/src/main/resources/people.json\")\ndf1.printSchema()\ndf1.show()\n\n# Read data from csv file. You can customize it via spark.read.options. E.g. In the following example, we customize the sep and header\ndf2 \u003d spark.read.options(sep\u003d\";\", header\u003dTrue).csv(\"file://\" + SPARK_HOME + \"/examples/src/main/resources/people.csv\")\ndf2.printSchema()\ndf2.show()\n\n# Specify schema for your csv file\nfrom pyspark.sql.types import StructType, StringType, IntegerType\n\nschema \u003d StructType().add(\"name\", StringType(), True) \\\n .add(\"age\", IntegerType(), True) \\\n .add(\"job\", StringType(), True)\n \ndf3 \u003d spark.read.options(sep\u003d\";\", header\u003dTrue) \\\n .schema(schema) \\\n .csv(\"file://\" + SPARK_HOME + \"/examples/src/main/resources/people.csv\") \ndf3.printSchema()\ndf3.show()\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:51.241", |
| "progress": 100, |
| "config": { |
| "lineNumbers": false, |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "root\n |-- age: long (nullable \u003d true)\n |-- name: string (nullable \u003d true)\n\n+----+-------+\n| age| name|\n+----+-------+\n|null|Michael|\n| 30| Andy|\n| 19| Justin|\n+----+-------+\n\nroot\n |-- name: string (nullable \u003d true)\n |-- age: string (nullable \u003d true)\n |-- job: string (nullable \u003d true)\n\n+-----+---+---------+\n| name|age| job|\n+-----+---+---------+\n|Jorge| 30|Developer|\n| Bob| 32|Developer|\n+-----+---+---------+\n\nroot\n |-- name: string (nullable \u003d true)\n |-- age: integer (nullable \u003d true)\n |-- job: string (nullable \u003d true)\n\n+-----+---+---------+\n| name|age| job|\n+-----+---+---------+\n|Jorge| 30|Developer|\n| Bob| 32|Developer|\n+-----+---+---------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d66" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d67" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d68" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d69" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d70" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_581443636", |
| "id": "20180530-101930_1495479697", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:51.244", |
| "dateFinished": "2021-07-26 04:33:52.120", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Add New Column", |
| "text": "%spark.pyspark\n\n# withColumn could be used to add new Column\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\n\ndf2 \u003d df1.withColumn(\"age2\", df1[\"age\"] + 1)\ndf2.show()\n\n# the new column could replace the existing the column if the new column name is the same as the old column\ndf3 \u003d df1.withColumn(\"age\", df1[\"age\"] + 1)\ndf3.show()\n\n# Besides using expression to create new column, you could also use udf to create new column\n# Use F.upper instead of upper, because the builtin udf of spark may conclifct with that of python, such as max\nimport pyspark.sql.functions as F\ndf4 \u003d df1.withColumn(\"name\", F.upper(df1[\"name\"]))\ndf4.show()", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:52.144", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+---+-----+---+-------+----+\n| id| name|age|country|age2|\n+---+-----+---+-------+----+\n| 1| andy| 20| USA| 21|\n| 2| jeff| 23| China| 24|\n| 3|james| 18| USA| 19|\n+---+-----+---+-------+----+\n\n+---+-----+---+-------+\n| id| name|age|country|\n+---+-----+---+-------+\n| 1| andy| 21| USA|\n| 2| jeff| 24| China|\n| 3|james| 19| USA|\n+---+-----+---+-------+\n\n+---+-----+---+-------+\n| id| name|age|country|\n+---+-----+---+-------+\n| 1| ANDY| 20| USA|\n| 2| JEFF| 23| China|\n| 3|JAMES| 18| USA|\n+---+-----+---+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d71" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d72" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d73" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d74" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d75" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d76" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_-775755394", |
| "id": "20180530-105113_693855403", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:52.146", |
| "dateFinished": "2021-07-26 04:33:52.914", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Remove Column", |
| "text": "%spark.pyspark\n\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\n# drop could be used to remove Column\ndf2 \u003d df1.drop(\"id\")\ndf2.show()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:52.947", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+-----+---+-------+\n| name|age|country|\n+-----+---+-------+\n| andy| 20| USA|\n| jeff| 23| China|\n|james| 18| USA|\n+-----+---+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d77" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d78" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_-886487025", |
| "id": "20180530-112045_1274721210", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:52.949", |
| "dateFinished": "2021-07-26 04:33:53.319", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Select Subset of Columns", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\n# select can accept a list of string of the column names\ndf2 \u003d df1.select(\"id\", \"name\")\ndf2.show()\n\n# select can also accept a list of Column. You can create column via $ or udf\nimport pyspark.sql.functions as F\ndf3 \u003d df1.select(df1[\"id\"], F.upper(df1[\"name\"]), df1[\"age\"] + 1)\ndf3.show()\n\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:53.349", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+---+-----+\n| id| name|\n+---+-----+\n| 1| andy|\n| 2| jeff|\n| 3|james|\n+---+-----+\n\n+---+-----------+---------+\n| id|upper(name)|(age + 1)|\n+---+-----------+---------+\n| 1| ANDY| 21|\n| 2| JEFF| 24|\n| 3| JAMES| 19|\n+---+-----------+---------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d79" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d80" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d81" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d82" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_2124268380", |
| "id": "20180530-113042_1154914545", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:53.352", |
| "dateFinished": "2021-07-26 04:33:53.931", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Filter Rows", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\n\n# filter accept a Column \ndf2 \u003d df1.filter(df1[\"age\"] \u003e\u003d 20)\ndf2.show()\n\n# To be noticed, you need to use \"\u0026\" instead of \"\u0026\u0026\" or \"AND\" \ndf3 \u003d df1.filter((df1[\"age\"] \u003e\u003d 20) \u0026 (df1[\"country\"] \u003d\u003d \"China\"))\ndf3.show()\n\n\n\n\n\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:53.952", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+---+----+---+-------+\n| id|name|age|country|\n+---+----+---+-------+\n| 1|andy| 20| USA|\n| 2|jeff| 23| China|\n+---+----+---+-------+\n\n+---+----+---+-------+\n| id|name|age|country|\n+---+----+---+-------+\n| 2|jeff| 23| China|\n+---+----+---+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d83" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d84" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d85" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d86" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_1501705200", |
| "id": "20180530-113407_58454283", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:53.954", |
| "dateFinished": "2021-07-26 04:33:54.575", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Create UDF", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]) \\\n .toDF(\"id\", \"name\", \"age\", \"country\")\n\n# Create udf create python lambda\nfrom pyspark.sql.functions import udf\nudf1 \u003d udf(lambda e: e.upper())\ndf2 \u003d df1.select(udf1(df1[\"name\"]))\ndf2.show()\n\n# UDF could also be used in filter, in this case the return type must be Boolean\n# We can also use annotation to create udf\nfrom pyspark.sql.types import *\n@udf(returnType\u003dBooleanType())\ndef udf2(e):\n if e \u003e\u003d 20:\n return True;\n else:\n return False\n\ndf3 \u003d df1.filter(udf2(df1[\"age\"]))\ndf3.show()\n\n# UDF could also accept more than 1 argument.\nudf3 \u003d udf(lambda e1, e2: e1 + \"_\" + e2)\ndf4 \u003d df1.select(udf3(df1[\"name\"], df1[\"country\"]).alias(\"name_country\"))\ndf4.show()\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:54.655", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+--------------+\n|\u003clambda\u003e(name)|\n+--------------+\n| ANDY|\n| JEFF|\n| JAMES|\n+--------------+\n\n+---+----+---+-------+\n| id|name|age|country|\n+---+----+---+-------+\n| 1|andy| 20| USA|\n| 2|jeff| 23| China|\n+---+----+---+-------+\n\n+------------+\n|name_country|\n+------------+\n| andy_USA|\n| jeff_China|\n| james_USA|\n+------------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d87" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d88" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d89" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d90" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d91" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d92" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_574730063", |
| "id": "20180530-113720_1986531680", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:54.658", |
| "dateFinished": "2021-07-26 04:33:55.479", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "GroupBy", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]) \\\n .toDF(\"id\", \"name\", \"age\", \"country\")\n\n# You can call agg function after groupBy directly, such as count/min/max/avg/sum\ndf2 \u003d df1.groupBy(\"country\").count()\ndf2.show()\n\n# Pass a Map if you want to do multiple aggregation\ndf3 \u003d df1.groupBy(\"country\").agg({\"age\": \"avg\", \"id\": \"count\"})\ndf3.show()\n\nimport pyspark.sql.functions as F\n# Or you can pass a list of agg function\ndf4 \u003d df1.groupBy(\"country\").agg(F.avg(df1[\"age\"]).alias(\"avg_age\"), F.count(df1[\"id\"]).alias(\"count\"))\ndf4.show()\n\n# You can not pass Map if you want to do multiple aggregation on the same column as the key of Map should be unique. So in this case\n# you have to pass a list of agg functions\ndf5 \u003d df1.groupBy(\"country\").agg(F.avg(df1[\"age\"]).alias(\"avg_age\"), F.max(df1[\"age\"]).alias(\"max_age\"))\ndf5.show()\n\n\n\n\n\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:55.558", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+-------+-----+\n|country|count|\n+-------+-----+\n| China| 1|\n| USA| 2|\n+-------+-----+\n\n+-------+---------+--------+\n|country|count(id)|avg(age)|\n+-------+---------+--------+\n| China| 1| 23.0|\n| USA| 2| 19.0|\n+-------+---------+--------+\n\n+-------+-------+-----+\n|country|avg_age|count|\n+-------+-------+-----+\n| China| 23.0| 1|\n| USA| 19.0| 2|\n+-------+-------+-----+\n\n+-------+-------+-------+\n|country|avg_age|max_age|\n+-------+-------+-------+\n| China| 23.0| 23|\n| USA| 19.0| 20|\n+-------+-------+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d93" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d94" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d95" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d96" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d97" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d98" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d99" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d100" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d101" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d102" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d103" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d104" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d105" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d106" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d107" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d108" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d109" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d110" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d111" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d112" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700850_1233271138", |
| "id": "20180530-114404_2076888937", |
| "dateCreated": "2020-01-07 17:01:40.850", |
| "dateStarted": "2021-07-26 04:33:55.560", |
| "dateFinished": "2021-07-26 04:33:59.547", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Join on Single Field", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, 1), (2, \"jeff\", 23, 2), (3, \"james\", 18, 3)]).toDF(\"id\", \"name\", \"age\", \"c_id\")\ndf1.show()\n\ndf2 \u003d spark.createDataFrame([(1, \"USA\"), (2, \"China\")]).toDF(\"c_id\", \"c_name\")\ndf2.show()\n\n# You can just specify the key name if join on the same key\ndf3 \u003d df1.join(df2, \"c_id\")\ndf3.show()\n\n# Or you can specify the join condition expclitly in case the key is different between tables\ndf4 \u003d df1.join(df2, df1[\"c_id\"] \u003d\u003d df2[\"c_id\"])\ndf4.show()\n\n# You can specify the join type afte the join condition, by default it is inner join\ndf5 \u003d df1.join(df2, df1[\"c_id\"] \u003d\u003d df2[\"c_id\"], \"left_outer\")\ndf5.show()", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:33:59.563", |
| "progress": 25, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+---+-----+---+----+\n| id| name|age|c_id|\n+---+-----+---+----+\n| 1| andy| 20| 1|\n| 2| jeff| 23| 2|\n| 3|james| 18| 3|\n+---+-----+---+----+\n\n+----+------+\n|c_id|c_name|\n+----+------+\n| 1| USA|\n| 2| China|\n+----+------+\n\n+----+---+----+---+------+\n|c_id| id|name|age|c_name|\n+----+---+----+---+------+\n| 1| 1|andy| 20| USA|\n| 2| 2|jeff| 23| China|\n+----+---+----+---+------+\n\n+---+----+---+----+----+------+\n| id|name|age|c_id|c_id|c_name|\n+---+----+---+----+----+------+\n| 1|andy| 20| 1| 1| USA|\n| 2|jeff| 23| 2| 2| China|\n+---+----+---+----+----+------+\n\n+---+-----+---+----+----+------+\n| id| name|age|c_id|c_id|c_name|\n+---+-----+---+----+----+------+\n| 1| andy| 20| 1| 1| USA|\n| 3|james| 18| 3|null| null|\n| 2| jeff| 23| 2| 2| China|\n+---+-----+---+----+----+------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d113" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d114" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d115" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d116" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d117" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d118" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d119" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d120" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d121" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d122" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d123" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d124" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d125" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d126" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d127" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d128" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d129" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d130" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d131" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_-770209064", |
| "id": "20180530-130126_1642948432", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:33:59.566", |
| "dateFinished": "2021-07-26 04:34:03.252", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Join on Multiple Fields", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(\"andy\", 20, 1, 1), (\"jeff\", 23, 1, 2), (\"james\", 12, 2, 2)]).toDF(\"name\", \"age\", \"key_1\", \"key_2\")\ndf1.show()\n\ndf2 \u003d spark.createDataFrame([(1, 1, \"USA\"), (2, 2, \"China\")]).toDF(\"key_1\", \"key_2\", \"country\")\ndf2.show()\n\n# Join on 2 fields: key_1, key_2\n\n# You can pass a list of field names if the join field names are the same in both tables\ndf3 \u003d df1.join(df2, [\"key_1\", \"key_2\"])\ndf3.show()\n\n# Or you can specify the join condition explicitly in case the join field names are different in the two tables\ndf4 \u003d df1.join(df2, (df1[\"key_1\"] \u003d\u003d df2[\"key_1\"]) \u0026 (df1[\"key_2\"] \u003d\u003d df2[\"key_2\"]))\ndf4.show()\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:03.269", |
| "progress": 27, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+-----+---+-----+-----+\n| name|age|key_1|key_2|\n+-----+---+-----+-----+\n| andy| 20| 1| 1|\n| jeff| 23| 1| 2|\n|james| 12| 2| 2|\n+-----+---+-----+-----+\n\n+-----+-----+-------+\n|key_1|key_2|country|\n+-----+-----+-------+\n| 1| 1| USA|\n| 2| 2| China|\n+-----+-----+-------+\n\n+-----+-----+-----+---+-------+\n|key_1|key_2| name|age|country|\n+-----+-----+-----+---+-------+\n| 1| 1| andy| 20| USA|\n| 2| 2|james| 12| China|\n+-----+-----+-----+---+-------+\n\n+-----+---+-----+-----+-----+-----+-------+\n| name|age|key_1|key_2|key_1|key_2|country|\n+-----+---+-----+-----+-----+-----+-------+\n| andy| 20| 1| 1| 1| 1| USA|\n|james| 12| 2| 2| 2| 2| China|\n+-----+---+-----+-----+-----+-----+-------+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d132" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d133" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d134" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d135" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d136" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d137" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d138" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d139" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d140" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d141" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d142" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d143" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d144" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d145" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_-177297320", |
| "id": "20180530-135600_354945835", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:03.273", |
| "dateFinished": "2021-07-26 04:34:05.760", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Use SQL directly", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]) \\\n    .toDF(\"id\", \"name\", \"age\", \"country\")\n# call createOrReplaceTempView first if you want to query this DataFrame via SQL\ndf1.createOrReplaceTempView(\"people\")\n# SparkSession.sql returns a DataFrame\ndf2 \u003d spark.sql(\"select name, age from people\")\ndf2.show()\n\n# You need to register a UDF before you can use it in SQL\nspark.udf.register(\"udf1\", lambda e : e.upper())\ndf3 \u003d spark.sql(\"select udf1(name), age from people\")\ndf3.show()", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:05.774", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TEXT", |
| "data": "+-----+---+\n| name|age|\n+-----+---+\n| andy| 20|\n| jeff| 23|\n|james| 18|\n+-----+---+\n\n+----------+---+\n|udf1(name)|age|\n+----------+---+\n| ANDY| 20|\n| JEFF| 23|\n| JAMES| 18|\n+----------+---+\n\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d146" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d147" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d148" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d149" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_1756979054", |
| "id": "20180530-132023_995737505", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:05.777", |
| "dateFinished": "2021-07-26 04:34:06.449", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Show Tables", |
| "text": "%spark.sql\n\nshow tables", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:06.478", |
| "progress": 0, |
| "config": { |
| "runOnSelectionChange": true, |
| "title": true, |
| "checkEmpty": true, |
| "colWidth": 12.0, |
| "fontSize": 9.0, |
| "enabled": true, |
| "results": { |
| "0": { |
| "graph": { |
| "mode": "table", |
| "height": 300.0, |
| "optionOpen": false, |
| "setting": { |
| "table": { |
| "tableGridState": {}, |
| "tableColumnTypeState": { |
| "names": { |
| "database": "string", |
| "tableName": "string", |
| "isTemporary": "string" |
| }, |
| "updated": false |
| }, |
| "tableOptionSpecHash": "[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]", |
| "tableOptionValue": { |
| "useFilter": false, |
| "showPagination": false, |
| "showAggregationFooter": false |
| }, |
| "updated": false, |
| "initialized": false |
| } |
| }, |
| "commonSetting": {} |
| } |
| } |
| }, |
| "editorSetting": { |
| "language": "sql", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "editorMode": "ace/mode/sql" |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TABLE", |
| "data": "database\ttableName\tisTemporary\n\tbank\ttrue\n\tpeople\ttrue\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578388432752_877455036", |
| "id": "paragraph_1578388432752_877455036", |
| "dateCreated": "2020-01-07 17:13:52.752", |
| "dateStarted": "2021-07-26 04:34:06.482", |
| "dateFinished": "2021-07-26 04:34:06.556", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Visualize DataFrame/Dataset", |
| "text": "%md\n\nThere are two approaches to visualizing a DataFrame/Dataset in Zeppelin\n\n* Use SparkSQLInterpreter via `%spark.sql`\n* Use ZeppelinContext via `z.show`\n\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:06.582", |
| "progress": 0, |
| "config": { |
| "tableHide": false, |
| "editorSetting": { |
| "language": "markdown", |
| "editOnDblClick": true, |
| "completionKey": "TAB", |
| "completionSupport": false |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/markdown", |
| "fontSize": 9.0, |
| "editorHide": true, |
| "title": false, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "HTML", |
| "data": "\u003cdiv class\u003d\"markdown-body\"\u003e\n\u003cp\u003eThere are two approaches to visualizing a DataFrame/Dataset in Zeppelin\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eUse SparkSQLInterpreter via \u003ccode\u003e%spark.sql\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eUse ZeppelinContext via \u003ccode\u003ez.show\u003c/code\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003c/div\u003e" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_1924561483", |
| "id": "20180530-132128_2114955642", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:06.587", |
| "dateFinished": "2021-07-26 04:34:06.602", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Visualize DataFrame/Dataset via z.show", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]).toDF(\"id\", \"name\", \"age\", \"country\")\ndf2 \u003d df1.groupBy(\"country\").count()\nz.show(df2)\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:06.687", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": { |
| "0": { |
| "graph": { |
| "mode": "multiBarChart", |
| "height": 300.0, |
| "optionOpen": false, |
| "setting": { |
| "table": { |
| "tableGridState": {}, |
| "tableColumnTypeState": { |
| "names": { |
| "country": "string", |
| "count": "string" |
| }, |
| "updated": false |
| }, |
| "tableOptionSpecHash": "[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]", |
| "tableOptionValue": { |
| "useFilter": false, |
| "showPagination": false, |
| "showAggregationFooter": false |
| }, |
| "updated": false, |
| "initialized": false |
| }, |
| "multiBarChart": { |
| "rotate": { |
| "degree": "-45" |
| }, |
| "xLabelStatus": "default" |
| } |
| }, |
| "commonSetting": {}, |
| "keys": [ |
| { |
| "name": "country", |
| "index": 0.0, |
| "aggr": "sum" |
| } |
| ], |
| "groups": [], |
| "values": [ |
| { |
| "name": "count", |
| "index": 1.0, |
| "aggr": "sum" |
| } |
| ] |
| }, |
| "helium": {} |
| }, |
| "1": { |
| "graph": { |
| "mode": "multiBarChart", |
| "height": 300.0, |
| "optionOpen": false, |
| "setting": { |
| "table": { |
| "tableGridState": {}, |
| "tableColumnTypeState": { |
| "names": { |
| "country": "string", |
| "count": "string" |
| }, |
| "updated": false |
| }, |
| "tableOptionSpecHash": "[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]", |
| "tableOptionValue": { |
| "useFilter": false, |
| "showPagination": false, |
| "showAggregationFooter": false |
| }, |
| "updated": false, |
| "initialized": false |
| }, |
| "multiBarChart": { |
| "rotate": { |
| "degree": "-45" |
| }, |
| "xLabelStatus": "default" |
| } |
| }, |
| "commonSetting": {}, |
| "keys": [ |
| { |
| "name": "country", |
| "index": 0.0, |
| "aggr": "sum" |
| } |
| ], |
| "groups": [], |
| "values": [ |
| { |
| "name": "count", |
| "index": 1.0, |
| "aggr": "sum" |
| } |
| ] |
| }, |
| "helium": {} |
| } |
| }, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TABLE", |
| "data": "country\tcount\nChina\t1\nUSA\t2\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d150" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d151" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d152" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d153" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d154" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_-809695439", |
| "id": "20180530-132634_1285621466", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:06.702", |
| "dateFinished": "2021-07-26 04:34:07.612", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "Visualize DataFrame/Dataset via %spark.sql", |
| "text": "%spark.pyspark\n\ndf1 \u003d spark.createDataFrame([(1, \"andy\", 20, \"USA\"), (2, \"jeff\", 23, \"China\"), (3, \"james\", 18, \"USA\")]) \\\n    .toDF(\"id\", \"name\", \"age\", \"country\")\n \n# register this DataFrame as a temp view first before querying it via %spark.sql\ndf1.createOrReplaceTempView(\"people\")", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:07.700", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "python", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/python", |
| "fontSize": 9.0, |
| "title": true, |
| "results": {}, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_381269276", |
| "id": "20180530-132657_668624333", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:07.704", |
| "dateFinished": "2021-07-26 04:34:07.979", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "", |
| "text": "%spark.sql\n\nselect country, count(1) as count from people group by country", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:08.003", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "sql", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 6.0, |
| "editorMode": "ace/mode/sql", |
| "fontSize": 9.0, |
| "results": { |
| "0": { |
| "graph": { |
| "mode": "multiBarChart", |
| "height": 300.0, |
| "optionOpen": false, |
| "setting": { |
| "table": { |
| "tableGridState": { |
| "columns": [ |
| { |
| "name": "country", |
| "visible": true, |
| "width": "*", |
| "sort": {}, |
| "filters": [ |
| {} |
| ], |
| "pinned": "" |
| }, |
| { |
| "name": "count", |
| "visible": true, |
| "width": "*", |
| "sort": {}, |
| "filters": [ |
| {} |
| ], |
| "pinned": "" |
| } |
| ], |
| "scrollFocus": {}, |
| "selection": [], |
| "grouping": { |
| "grouping": [], |
| "aggregations": [], |
| "rowExpandedStates": {} |
| }, |
| "treeView": {}, |
| "pagination": { |
| "paginationCurrentPage": 1.0, |
| "paginationPageSize": 250.0 |
| } |
| }, |
| "tableColumnTypeState": { |
| "names": { |
| "country": "string", |
| "count": "number" |
| }, |
| "updated": false |
| }, |
| "tableOptionSpecHash": "[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]", |
| "tableOptionValue": { |
| "useFilter": false, |
| "showPagination": false, |
| "showAggregationFooter": false |
| }, |
| "updated": false, |
| "initialized": false |
| }, |
| "multiBarChart": { |
| "rotate": { |
| "degree": "-45" |
| }, |
| "xLabelStatus": "default", |
| "stacked": false |
| }, |
| "stackedAreaChart": { |
| "rotate": { |
| "degree": "-45" |
| }, |
| "xLabelStatus": "default" |
| }, |
| "lineChart": { |
| "rotate": { |
| "degree": "-45" |
| }, |
| "xLabelStatus": "default" |
| } |
| }, |
| "commonSetting": {}, |
| "keys": [ |
| { |
| "name": "country", |
| "index": 0.0, |
| "aggr": "sum" |
| } |
| ], |
| "groups": [], |
| "values": [ |
| { |
| "name": "count", |
| "index": 1.0, |
| "aggr": "sum" |
| } |
| ] |
| }, |
| "helium": {} |
| } |
| }, |
| "enabled": true, |
| "runOnSelectionChange": true, |
| "title": false, |
| "checkEmpty": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [ |
| { |
| "type": "TABLE", |
| "data": "country\tcount\nChina\t1\nUSA\t2\n" |
| } |
| ] |
| }, |
| "apps": [], |
| "runtimeInfos": { |
| "jobUrl": { |
| "propertyName": "jobUrl", |
| "label": "SPARK JOB", |
| "tooltip": "View in Spark web UI", |
| "group": "spark", |
| "values": [ |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d155" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d156" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d157" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d158" |
| }, |
| { |
| "jobUrl": "http://3745c7ed824d:4040/jobs/job?id\u003d159" |
| } |
| ], |
| "interpreterSettingId": "spark" |
| } |
| }, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_601200360", |
| "id": "20180530-132823_944494152", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:08.006", |
| "dateFinished": "2021-07-26 04:34:08.652", |
| "status": "FINISHED" |
| }, |
| { |
| "title": "", |
| "text": "%spark.sql\n", |
| "user": "anonymous", |
| "dateUpdated": "2021-07-26 04:34:08.707", |
| "progress": 0, |
| "config": { |
| "editorSetting": { |
| "language": "sql", |
| "editOnDblClick": false, |
| "completionKey": "TAB", |
| "completionSupport": true |
| }, |
| "colWidth": 12.0, |
| "editorMode": "ace/mode/sql", |
| "fontSize": 9.0, |
| "results": {}, |
| "enabled": true |
| }, |
| "settings": { |
| "params": {}, |
| "forms": {} |
| }, |
| "results": { |
| "code": "SUCCESS", |
| "msg": [] |
| }, |
| "apps": [], |
| "runtimeInfos": {}, |
| "progressUpdateIntervalMs": 500, |
| "jobName": "paragraph_1578387700851_-2136199261", |
| "id": "20180530-132849_1305166760", |
| "dateCreated": "2020-01-07 17:01:40.851", |
| "dateStarted": "2021-07-26 04:34:08.710", |
| "dateFinished": "2021-07-26 04:34:08.716", |
| "status": "FINISHED" |
| } |
| ], |
| "name": "3. Spark SQL (PySpark)", |
| "id": "2EWM84JXA", |
| "defaultInterpreterGroup": "spark", |
| "version": "0.9.0-SNAPSHOT", |
| "noteParams": {}, |
| "noteForms": {}, |
| "angularObjects": {}, |
| "config": { |
| "isZeppelinNotebookCronEnable": false |
| }, |
| "info": { |
| "isRunning": false |
| } |
| } |