{"nbformat_minor": 0, "cells": [{"source": "# Flight Delay Prediction Demo Using SystemML", "cell_type": "markdown", "metadata": {}}, {"source": "This notebook is based on datascientistworkbench.com's tutorial notebook for predicting flight delay.", "cell_type": "markdown", "metadata": {}}, {"source": "## Loading SystemML ", "cell_type": "markdown", "metadata": {}}, {"source": "To use a released version, run \"%AddDeps org.apache.systemml systemml 0.9.0-incubating\". To use the nightly build, run \"%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar\".\n\nAlternatively, provide SystemML.jar and its dependencies on the command line when starting the notebook (for example: --packages com.databricks:spark-csv_2.10:1.4.0 --jars SystemML.jar).", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Using cached version of SystemML.jar\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Use Spark's CSV package to load the CSV file.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 2, "cell_type": "code", "source": "%AddDeps com.databricks spark-csv_2.10 1.4.0", "outputs": [{"output_type": "stream", "name": "stdout", "text": ":: loading settings :: url = jar:file:/usr/local/spark-kernel/lib/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n:: resolving dependencies :: com.ibm.spark#spark-kernel;working [not transitive]\n\tconfs: [default]\n\tfound com.databricks#spark-csv_2.10;1.4.0 in central\ndownloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.4.0/spark-csv_2.10-1.4.0.jar ...\n\t[SUCCESSFUL ] com.databricks#spark-csv_2.10;1.4.0!spark-csv_2.10.jar (68ms)\n:: resolution report :: resolve 642ms :: artifacts dl 72ms\n\t:: modules in use:\n\tcom.databricks#spark-csv_2.10;1.4.0 from central in 
[default]\n\t---------------------------------------------------------------------\n\t| | modules || artifacts |\n\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n\t---------------------------------------------------------------------\n\t| default | 1 | 1 | 1 | 0 || 1 | 1 |\n\t---------------------------------------------------------------------\n:: retrieving :: com.ibm.spark#spark-kernel\n\tconfs: [default]\n\t1 artifacts copied, 0 already retrieved (153kB/9ms)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Import Data", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"source": "Download the airline dataset from stat-computing.org if it has not already been downloaded.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 3, "cell_type": "code", "source": "import sys.process._\nimport java.net.URL\nimport java.io.File\nval url = \"http://stat-computing.org/dataexpo/2009/2007.csv.bz2\"\nval localFilePath = \"airline2007.csv.bz2\"\nif(!new java.io.File(localFilePath).exists) {\n new URL(url) #> new File(localFilePath) !!\n}", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Load the dataset into a DataFrame using the Spark CSV package.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 4, "cell_type": "code", "source": "import org.apache.spark.sql.SQLContext\nimport org.apache.spark.storage.StorageLevel\nval sqlContext = new SQLContext(sc)\nval fmt = sqlContext.read.format(\"com.databricks.spark.csv\")\nval opt = fmt.options(Map(\"header\"->\"true\", \"inferSchema\"->\"true\"))\nval airline = opt.load(localFilePath).na.replace( \"*\", Map(\"NA\" -> \"0.0\") )", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 5, "cell_type": "code", "source": "airline.printSchema", "outputs": [{"output_type": "stream", "name": "stdout", "text": "root\n |-- Year: integer (nullable = true)\n |-- Month: integer (nullable = true)\n |-- DayofMonth: integer (nullable = 
true)\n |-- DayOfWeek: integer (nullable = true)\n |-- DepTime: string (nullable = true)\n |-- CRSDepTime: integer (nullable = true)\n |-- ArrTime: string (nullable = true)\n |-- CRSArrTime: integer (nullable = true)\n |-- UniqueCarrier: string (nullable = true)\n |-- FlightNum: integer (nullable = true)\n |-- TailNum: string (nullable = true)\n |-- ActualElapsedTime: string (nullable = true)\n |-- CRSElapsedTime: string (nullable = true)\n |-- AirTime: string (nullable = true)\n |-- ArrDelay: string (nullable = true)\n |-- DepDelay: string (nullable = true)\n |-- Origin: string (nullable = true)\n |-- Dest: string (nullable = true)\n |-- Distance: integer (nullable = true)\n |-- TaxiIn: integer (nullable = true)\n |-- TaxiOut: integer (nullable = true)\n |-- Cancelled: integer (nullable = true)\n |-- CancellationCode: string (nullable = true)\n |-- Diverted: integer (nullable = true)\n |-- CarrierDelay: integer (nullable = true)\n |-- WeatherDelay: integer (nullable = true)\n |-- NASDelay: integer (nullable = true)\n |-- SecurityDelay: integer (nullable = true)\n |-- LateAircraftDelay: integer (nullable = true)\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Data Exploration\nWhich airports have the most delays?", "cell_type": "markdown", "metadata": {}}, {"execution_count": 6, "cell_type": "code", "source": "airline.registerTempTable(\"airline\")\nsqlContext.sql(\"\"\"SELECT Origin, count(*) conFlight, avg(DepDelay) delay\n FROM airline\n GROUP BY Origin\n ORDER BY delay DESC\"\"\").show", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+------+---------+------------------+\n|Origin|conFlight| delay|\n+------+---------+------------------+\n| PIR| 4| 45.5|\n| ACK| 314|45.296178343949045|\n| SOP| 195| 34.02051282051282|\n| HHH| 997| 22.58776328986961|\n| MCN| 992|22.496975806451612|\n| AKN| 235|21.123404255319148|\n| CEC| 1055|20.807582938388627|\n| GNV| 1927| 20.69797612869746|\n| EYW| 1052|20.224334600760457|\n| ACY| 
735|20.141496598639456|\n| SPI| 1745|19.545558739255014|\n| GST| 90|19.233333333333334|\n| EWR| 154113|18.800853918877706|\n| BRW| 726| 18.02754820936639|\n| AGS| 2286|17.728346456692915|\n| ORD| 375784|17.695756072637472|\n| TRI| 1207| 17.63628831814416|\n| SBN| 5128|17.505850234009362|\n| FAY| 2185| 17.48970251716247|\n| PHL| 104063|17.067776250924922|\n+------+---------+------------------+\nonly showing top 20 rows\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Modeling: Logistic Regression\n\nPredict whether flights departing from JFK are delayed by more than 15 minutes. The label is 1.0 for a delayed flight and 2.0 otherwise.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 8, "cell_type": "code", "source": "// Label each flight: 1.0 if the departure delay exceeds 15 minutes, 2.0 otherwise (unparseable values are also labeled 1.0)\nsqlContext.udf.register(\"checkDelay\", (depDelay:String) => try { if(depDelay.toDouble > 15) 1.0 else 2.0 } catch { case e:Exception => 1.0 })\nval tempSmallAirlineData = sqlContext.sql(\"SELECT *, checkDelay(DepDelay) label FROM airline WHERE Origin = 'JFK'\").persist(StorageLevel.MEMORY_AND_DISK)\n// Keep only destinations with more than 1000 flights from JFK\nval popularDest = tempSmallAirlineData.select(\"Dest\").map(y => (y.get(0).toString, 1)).reduceByKey(_ + _).filter(_._2 > 1000).collect.toMap\nsqlContext.udf.register(\"onlyUsePopularDest\", (x:String) => popularDest.contains(x))\ntempSmallAirlineData.registerTempTable(\"tempAirline\")\nval smallAirlineData = sqlContext.sql(\"SELECT * FROM tempAirline WHERE onlyUsePopularDest(Dest)\")\n\n// Split into 70% training and 30% test data\nval datasets = smallAirlineData.randomSplit(Array(0.7, 0.3))\nval trainDataset = datasets(0).cache\nval testDataset = datasets(1).cache\ntrainDataset.count\ntestDataset.count", "outputs": [{"execution_count": 8, "output_type": "execute_result", "data": {"text/plain": "34773"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Feature selection", "cell_type": "markdown", "metadata": {}}, {"source": "Encode the destination using one-hot encoding and include the columns Year, Month, DayofMonth, DayOfWeek, and Distance as features.", "cell_type": "markdown", "metadata": {}}, 
{"execution_count": 9, "cell_type": "code", "source": "import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}\n\nval indexer = new StringIndexer().setInputCol(\"Dest\").setOutputCol(\"DestIndex\") // .setHandleInvalid(\"skip\") // Only works on Spark 1.6 or later\nval encoder = new OneHotEncoder().setInputCol(\"DestIndex\").setOutputCol(\"DestVec\")\nval assembler = new VectorAssembler().setInputCols(Array(\"Year\",\"Month\",\"DayofMonth\",\"DayOfWeek\",\"Distance\",\"DestVec\")).setOutputCol(\"features\")", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Build the model: Use SystemML's MLPipeline wrapper. \n\nThis wrapper invokes MultiLogReg.dml (for training) and GLM-predict.dml (for prediction). These DML algorithms are available at https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms", "cell_type": "markdown", "metadata": {}}, {"execution_count": 10, "cell_type": "code", "source": "import org.apache.spark.ml.Pipeline\nimport org.apache.sysml.api.ml.LogisticRegression\n\nval lr = new LogisticRegression(\"log\", sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(0).setMaxOuterIter(100)\n\nval pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))\nval model = pipeline.fit(trainDataset)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 56433.27085246851, Gradient Norm = 4.469119635504498E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 9262.13484840509, Predicted = 8912.05664442707 (A/P: 1.0393), Trust Delta = 4.1513539310828525E-4\n -- New Objective = 47171.13600406342, Beta Change Norm = 3.9882828705797336E-4, Gradient Norm = 3491408.311614066\n \n-- Outer Iteration 2: Had 2 CG iterations\n -- Obj.Reduction: Actual = 107.11137476684962, Predicted = 105.31921188128369 (A/P: 1.017), 
Trust Delta = 4.1513539310828525E-4\n -- New Objective = 47064.02462929657, Beta Change Norm = 1.0302143846288746E-4, Gradient Norm = 84892.35372269012\nTermination / Convergence condition satisfied.\n"}], "metadata": {"scrolled": true, "collapsed": false, "trusted": true}}, {"source": "### Evaluate the model \n\nCompute the root-mean-square (RMS) error of the predictions on the test data.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 11, "cell_type": "code", "source": "val predictions = model.transform(testDataset.withColumnRenamed(\"label\", \"OriginalLabel\"))\npredictions.select(\"prediction\", \"OriginalLabel\").show\nsqlContext.udf.register(\"square\", (x:Double) => Math.pow(x, 2.0))", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+----------+-------------+\n|prediction|OriginalLabel|\n+----------+-------------+\n| 1.0| 2.0|\n| 1.0| 1.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 1.0|\n| 1.0| 2.0|\n| 1.0| 1.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 1.0|\n| 1.0| 2.0|\n| 1.0| 2.0|\n| 1.0| 1.0|\n| 1.0| 1.0|\n| 1.0| 1.0|\n+----------+-------------+\nonly showing top 20 rows\n\n"}, {"execution_count": 11, "output_type": "execute_result", "data": {"text/plain": "UserDefinedFunction(<function1>,DoubleType,List())"}, "metadata": {}}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 12, "cell_type": "code", "source": "predictions.registerTempTable(\"predictions\")\nsqlContext.sql(\"SELECT sqrt(avg(square(OriginalLabel - prediction))) FROM predictions\").show", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+------------------+\n| _c0|\n+------------------+\n|0.8557362892866146|\n+------------------+\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Perform k-fold cross-validation to tune the hyperparameters\n\nUse cross-validation to tune the regularization parameter of the logistic regression model.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 13, 
"cell_type": "code", "source": "import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nimport org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}\n\nval crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)\nval paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 1e-3, 1e-6)).build()\ncrossval.setEstimatorParamMaps(paramGrid)\ncrossval.setNumFolds(2) // Setting k = 2\nval cvmodel = crossval.fit(trainDataset)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28202.772482623055, Gradient Norm = 2.221087060254761E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4576.927438869821, Predicted = 4405.651264293149 (A/P: 1.0389), Trust Delta = 4.127578309122139E-4\n -- New Objective = 23625.845043753234, Beta Change Norm = 3.9671126297839183E-4, Gradient Norm = 1718538.331150294\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28202.772482623055, Gradient Norm = 2.221087060254761E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4576.927438878782, Predicted = 4405.651264300938 (A/P: 1.0389), Trust Delta = 4.127578309130283E-4\n -- New Objective = 23625.845043744273, Beta Change Norm = 3.967112629790933E-4, Gradient Norm = 1718538.3311583179\n \n-- Outer Iteration 2: Had 2 CG iterations\n -- Obj.Reduction: Actual = 52.06267761322306, Predicted = 51.207226997373795 (A/P: 1.0167), Trust Delta = 4.127578309130283E-4\n -- New Objective = 23573.78236613105, Beta Change Norm = 1.0195505438829344E-4, Gradient Norm = 41072.985998067124\n \n-- Outer Iteration 3: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.03776156834283029, Predicted = 0.037741389955733964 
(A/P: 1.0005), Trust Delta = 4.127578309130283E-4\n -- New Objective = 23573.744604562708, Beta Change Norm = 3.3257729178954336E-6, Gradient Norm = 3559.0088415221207\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28202.772482623055, Gradient Norm = 2.221087060254761E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4576.927438878873, Predicted = 4405.651264301018 (A/P: 1.0389), Trust Delta = 4.1275783091303654E-4\n -- New Objective = 23625.845043744182, Beta Change Norm = 3.9671126297910036E-4, Gradient Norm = 1718538.331158408\n \n-- Outer Iteration 2: Had 2 CG iterations\n -- Obj.Reduction: Actual = 52.062677613230335, Predicted = 51.20722699738286 (A/P: 1.0167), Trust Delta = 4.1275783091303654E-4\n -- New Objective = 23573.782366130952, Beta Change Norm = 1.0195505438831547E-4, Gradient Norm = 41072.98599806662\n \n-- Outer Iteration 3: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.03776156833919231, Predicted = 0.037741389955751575 (A/P: 1.0005), Trust Delta = 4.1275783091303654E-4\n -- New Objective = 23573.744604562613, Beta Change Norm = 3.3257729178972746E-6, Gradient Norm = 3559.008841523661\n \n-- Outer Iteration 4: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 1.3742707646742929, Predicted = 1.374282851981874 (A/P: 1.0), Trust Delta = 0.0016510313236521462\n -- New Objective = 23572.37033379794, Beta Change Norm = 4.1275783091303654E-4, Gradient Norm = 23218.782943544382\n \n-- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 5.475667862796399, Predicted = 5.475595423716493 (A/P: 1.0), Trust Delta = 0.006604125294608585\n -- New Objective = 23566.894665935142, Beta Change Norm = 0.0016510313236521464, Gradient Norm = 3400.306136071355\n \n-- Outer Iteration 6: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: 
Actual = 19.796611347293947, Predicted = 19.796668922654057 (A/P: 1.0), Trust Delta = 0.02641650117843434\n -- New Objective = 23547.09805458785, Beta Change Norm = 0.006604125294608585, Gradient Norm = 12384.979229404262\n \n-- Outer Iteration 7: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 48.9038754012945, Predicted = 48.86479486479853 (A/P: 1.0008), Trust Delta = 0.039975464358656405\n -- New Objective = 23498.194179186554, Beta Change Norm = 0.026416501178434335, Gradient Norm = 25887.667183269536\n \n-- Outer Iteration 8: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.007870123248721939, Predicted = 0.007868226951946769 (A/P: 1.0002), Trust Delta = 0.039975464358656405\n -- New Objective = 23498.186309063305, Beta Change Norm = 6.078745447586554E-7, Gradient Norm = 1345.8027775103888\n \n-- Outer Iteration 9: Had 5 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 25.04238552428069, Predicted = 25.024767443519863 (A/P: 1.0007), Trust Delta = 0.0405590959281579\n -- New Objective = 23473.143923539024, Beta Change Norm = 0.039975464358656405, Gradient Norm = 63769.52436782582\n \n-- Outer Iteration 10: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.04773861860303441, Predicted = 0.04771039962536379 (A/P: 1.0006), Trust Delta = 0.0405590959281579\n -- New Objective = 23473.09618492042, Beta Change Norm = 1.4963385754664812E-6, Gradient Norm = 720.8018323328566\n \n-- Outer Iteration 11: Had 5 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 8.123822556943196, Predicted = 8.128868676639112 (A/P: 0.9994), Trust Delta = 0.10966765508915642\n -- New Objective = 23464.972362363478, Beta Change Norm = 0.040559095928157894, Gradient Norm = 72691.91595482397\n \n-- Outer Iteration 12: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.06196295309564448, Predicted = 0.061921093377362 (A/P: 1.0007), Trust Delta = 0.10966765508915642\n -- New Objective = 23464.910399410383, Beta Change Norm = 
1.7036583109418734E-6, Gradient Norm = 482.30416635512506\n \n-- Outer Iteration 13: Had 6 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 17.71440401360087, Predicted = 17.616303961789683 (A/P: 1.0056), Trust Delta = 0.16941777360208057\n -- New Objective = 23447.19599539678, Beta Change Norm = 0.10966765508915642, Gradient Norm = 448422.2320019876\n \n-- Outer Iteration 14: Had 1 CG iterations\n -- Obj.Reduction: Actual = 2.386916461367946, Predicted = 2.397254649433668 (A/P: 0.9957), Trust Delta = 0.16941777360208057\n -- New Objective = 23444.809078935414, Beta Change Norm = 1.0691952710422448E-5, Gradient Norm = 2940.4721234861527\n \n-- Outer Iteration 15: Had 4 CG iterations\n -- Obj.Reduction: Actual = 4.294265273932979, Predicted = 4.301599925371988 (A/P: 0.9983), Trust Delta = 0.16941777360208057\n -- New Objective = 23440.51481366148, Beta Change Norm = 0.018008719957742635, Gradient Norm = 4590.1170762087395\n \n-- Outer Iteration 16: Had 1 CG iterations\n -- Obj.Reduction: Actual = 2.4845889129210263E-4, Predicted = 2.4844829761319425E-4 (A/P: 1.0), Trust Delta = 0.16941777360208057\n -- New Objective = 23440.51456520259, Beta Change Norm = 1.0825357762700158E-7, Gradient Norm = 280.5707172598387\n \n-- Outer Iteration 17: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 22.440803682489786, Predicted = 22.42170069553472 (A/P: 1.0009), Trust Delta = 0.2496076412979077\n -- New Objective = 23418.0737615201, Beta Change Norm = 0.16941777360208057, Gradient Norm = 37677.05806399844\n \n-- Outer Iteration 18: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.15241017882726737, Predicted = 0.15239595431754965 (A/P: 1.0001), Trust Delta = 0.2496076412979077\n -- New Objective = 23417.921351341272, Beta Change Norm = 8.477249180981066E-6, Gradient Norm = 707.427496995126\n \n-- Outer Iteration 19: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 36.817799356838805, Predicted = 36.84419020002096 (A/P: 
0.9993), Trust Delta = 0.3890684157185231\n -- New Objective = 23381.103551984434, Beta Change Norm = 0.2496076412979077, Gradient Norm = 181659.30511599063\n \n-- Outer Iteration 20: Had 2 CG iterations\n -- Obj.Reduction: Actual = 3.9036142495642707, Predicted = 3.907242243615839 (A/P: 0.9991), Trust Delta = 0.3890684157185231\n -- New Objective = 23377.19993773487, Beta Change Norm = 4.3252276508826854E-5, Gradient Norm = 4562.596683929567\n \n-- Outer Iteration 21: Had 1 CG iterations\n -- Obj.Reduction: Actual = 2.4621394186397083E-4, Predicted = 2.462032554160668E-4 (A/P: 1.0), Trust Delta = 0.3890684157185231\n -- New Objective = 23377.199691520927, Beta Change Norm = 1.0792242771895522E-7, Gradient Norm = 293.5155793389021\n \n-- Outer Iteration 22: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 32.60430984508639, Predicted = 32.63142558199526 (A/P: 0.9992), Trust Delta = 0.6911480264449816\n -- New Objective = 23344.59538167584, Beta Change Norm = 0.38906841571852313, Gradient Norm = 13358.735388646046\n \n-- Outer Iteration 23: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.0021210133490967564, Predicted = 0.002120754723733256 (A/P: 1.0001), Trust Delta = 0.6911480264449816\n -- New Objective = 23344.593260662492, Beta Change Norm = 3.175083062930857E-7, Gradient Norm = 969.5458081582332\n \n-- Outer Iteration 24: Had 6 CG iterations\n -- Obj.Reduction: Actual = 1.0072309033639613, Predicted = 1.0078398039430247 (A/P: 0.9994), Trust Delta = 0.6911480264449816\n -- New Objective = 23343.586029759128, Beta Change Norm = 0.008749259137025917, Gradient Norm = 1067.7896535923433\n \n-- Outer Iteration 25: Had 1 CG iterations\n -- Obj.Reduction: Actual = 1.3547600246965885E-5, Predicted = 1.3547465425594469E-5 (A/P: 1.0), Trust Delta = 0.6911480264449816\n -- New Objective = 23343.586016211528, Beta Change Norm = 2.5374783095185467E-8, Gradient Norm = 83.20291366858535\n \n-- Outer Iteration 26: Had 12 CG iterations\n -- 
Obj.Reduction: Actual = 15.302215361618437, Predicted = 15.310868474305936 (A/P: 0.9994), Trust Delta = 0.6911480264449816\n -- New Objective = 23328.28380084991, Beta Change Norm = 0.5120342239089952, Gradient Norm = 15756.152919911565\n \n-- Outer Iteration 27: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.0029535907960962504, Predicted = 0.002953150612459315 (A/P: 1.0001), Trust Delta = 0.6911480264449816\n -- New Objective = 23328.280847259113, Beta Change Norm = 3.74856810221399E-7, Gradient Norm = 933.6635694330404\n \n-- Outer Iteration 28: Had 2 CG iterations\n -- Obj.Reduction: Actual = 1.0478267358848825E-4, Predicted = 1.0478219919535331E-4 (A/P: 1.0), Trust Delta = 0.6911480264449816\n -- New Objective = 23328.28074247644, Beta Change Norm = 2.2480413822676833E-7, Gradient Norm = 5.538385572102319\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28230.498369845453, Gradient Norm = 2.248032584752783E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4685.514381090245, Predicted = 4506.656096079343 (A/P: 1.0397), Trust Delta = 4.1751229311831877E-4\n -- New Objective = 23544.983988755208, Beta Change Norm = 4.0094223959613487E-4, Gradient Norm = 1773112.5532909825\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28230.498369845453, Gradient Norm = 2.248032584752783E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4685.51438109942, Predicted = 4506.6560960873 (A/P: 1.0397), Trust Delta = 4.17512293119143E-4\n -- New Objective = 23544.983988746033, Beta Change Norm = 4.0094223959684285E-4, Gradient Norm = 1773112.553299248\n \n-- Outer Iteration 2: Had 2 CG iterations\n -- Obj.Reduction: Actual = 55.08478867724625, Predicted = 
54.14637164341834 (A/P: 1.0173), Trust Delta = 4.17512293119143E-4\n -- New Objective = 23489.899200068787, Beta Change Norm = 1.0409436207463608E-4, Gradient Norm = 43863.264421495034\n \n-- Outer Iteration 3: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.0425455416625482, Predicted = 0.0425210724118125 (A/P: 1.0006), Trust Delta = 4.17512293119143E-4\n -- New Objective = 23489.856654527124, Beta Change Norm = 3.4860035525762597E-6, Gradient Norm = 3473.0626928235138\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 28230.498369845453, Gradient Norm = 2.248032584752783E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 4685.514381099514, Predicted = 4506.65609608738 (A/P: 1.0397), Trust Delta = 4.175122931191516E-4\n -- New Objective = 23544.98398874594, Beta Change Norm = 4.0094223959685E-4, Gradient Norm = 1773112.5532993283\n \n-- Outer Iteration 2: Had 2 CG iterations\n -- Obj.Reduction: Actual = 55.08478867725353, Predicted = 54.14637164342853 (A/P: 1.0173), Trust Delta = 4.175122931191516E-4\n -- New Objective = 23489.899200068685, Beta Change Norm = 1.0409436207466114E-4, Gradient Norm = 43863.264421514185\n \n-- Outer Iteration 3: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.0425455416625482, Predicted = 0.04252107241182405 (A/P: 1.0006), Trust Delta = 4.175122931191516E-4\n -- New Objective = 23489.856654527022, Beta Change Norm = 3.486003552576232E-6, Gradient Norm = 3473.0626928274914\n \n-- Outer Iteration 4: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 1.3618165665211563, Predicted = 1.3618300786307123 (A/P: 1.0), Trust Delta = 0.0016700491724766064\n -- New Objective = 23488.4948379605, Beta Change Norm = 4.1751229311915155E-4, Gradient Norm = 22750.17168667339\n \n-- Outer Iteration 5: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 
5.399070530791505, Predicted = 5.398983505048864 (A/P: 1.0), Trust Delta = 0.006680196689906426\n -- New Objective = 23483.09576742971, Beta Change Norm = 0.0016700491724766064, Gradient Norm = 3277.243187563727\n \n-- Outer Iteration 6: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 19.04347611745834, Predicted = 19.043530665204045 (A/P: 1.0), Trust Delta = 0.026720786759625702\n -- New Objective = 23464.05229131225, Beta Change Norm = 0.006680196689906425, Gradient Norm = 12014.210859652962\n \n-- Outer Iteration 7: Had 3 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 41.1452816738456, Predicted = 41.09983187966176 (A/P: 1.0011), Trust Delta = 0.03287099410333282\n -- New Objective = 23422.907009638406, Beta Change Norm = 0.0267207867596257, Gradient Norm = 30568.57509207747\n \n-- Outer Iteration 8: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.011022901580872713, Predicted = 0.01101972206380871 (A/P: 1.0003), Trust Delta = 0.03287099410333282\n -- New Objective = 23422.895986736825, Beta Change Norm = 7.209836919526366E-7, Gradient Norm = 1251.6678613601161\n \n-- Outer Iteration 9: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 13.978709434930352, Predicted = 13.974847661855666 (A/P: 1.0003), Trust Delta = 0.033257209609599145\n -- New Objective = 23408.917277301895, Beta Change Norm = 0.03287099410333282, Gradient Norm = 15328.859090870203\n \n-- Outer Iteration 10: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.004639191432943335, Predicted = 0.004638318429644279 (A/P: 1.0002), Trust Delta = 0.033257209609599145\n -- New Objective = 23408.91263811046, Beta Change Norm = 1.0519781798129972E-6, Gradient Norm = 335.02440722968106\n \n-- Outer Iteration 11: Had 4 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 6.3662166226313275, Predicted = 6.366164181244294 (A/P: 1.0), Trust Delta = 0.06697443441569934\n -- New Objective = 23402.54642148783, Beta Change Norm = 
0.033257209609599145, Gradient Norm = 2307.51433331859\n \n-- Outer Iteration 12: Had 7 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 11.15761233725425, Predicted = 11.149031539741129 (A/P: 1.0008), Trust Delta = 0.10211243265236637\n -- New Objective = 23391.388809150576, Beta Change Norm = 0.06697443441569932, Gradient Norm = 71503.76594916714\n \n-- Outer Iteration 13: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.600488582651451, Predicted = 0.6001508149708464 (A/P: 1.0006), Trust Delta = 0.10211243265236637\n -- New Objective = 23390.788320567925, Beta Change Norm = 1.6834966454979097E-5, Gradient Norm = 840.347770623361\n \n-- Outer Iteration 14: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 19.757560698417365, Predicted = 19.765740859017424 (A/P: 0.9996), Trust Delta = 0.24398632984391763\n -- New Objective = 23371.030759869507, Beta Change Norm = 0.10211243265236637, Gradient Norm = 48752.608649999434\n \n-- Outer Iteration 15: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.2778570437403687, Predicted = 0.2779044747609064 (A/P: 0.9998), Trust Delta = 0.24398632984391763\n -- New Objective = 23370.752902825767, Beta Change Norm = 1.1465782794751552E-5, Gradient Norm = 490.74546662109907\n \n-- Outer Iteration 16: Had 7 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 35.87021488765458, Predicted = 35.87139479548606 (A/P: 1.0), Trust Delta = 0.5998608188063514\n -- New Objective = 23334.882687938112, Beta Change Norm = 0.24398632984391766, Gradient Norm = 114111.92221839691\n \n-- Outer Iteration 17: Had 2 CG iterations\n -- Obj.Reduction: Actual = 1.5378956803469919, Predicted = 1.5387644534721423 (A/P: 0.9994), Trust Delta = 0.5998608188063514\n -- New Objective = 23333.344792257765, Beta Change Norm = 2.7062912410241883E-5, Gradient Norm = 1827.5390228667288\n \n-- Outer Iteration 18: Had 8 CG iterations, trust bound REACHED\n -- Obj.Reduction: Actual = 55.357956099222065, Predicted = 
55.4569565918232 (A/P: 0.9982), Trust Delta = 0.8894009952541146\n -- New Objective = 23277.986836158543, Beta Change Norm = 0.5998608188063514, Gradient Norm = 30684.985380679016\n \n-- Outer Iteration 19: Had 2 CG iterations\n -- Obj.Reduction: Actual = 0.017656232350418577, Predicted = 0.017644837185100737 (A/P: 1.0006), Trust Delta = 0.8894009952541146\n -- New Objective = 23277.969179926193, Beta Change Norm = 1.984483688888249E-6, Gradient Norm = 137.4544897991739\n \n-- Outer Iteration 20: Had 10 CG iterations\n -- Obj.Reduction: Actual = 13.663528841007064, Predicted = 13.567360160458493 (A/P: 1.0071), Trust Delta = 0.8894009952541146\n -- New Objective = 23264.305651085186, Beta Change Norm = 0.4790943358344082, Gradient Norm = 15753.857353150117\n \n-- Outer Iteration 21: Had 1 CG iterations\n -- Obj.Reduction: Actual = 0.002973383649077732, Predicted = 0.002972929227391132 (A/P: 1.0002), Trust Delta = 0.8894009952541146\n -- New Objective = 23264.302677701537, Beta Change Norm = 3.774223875140864E-7, Gradient Norm = 1264.8256951027395\n \n-- Outer Iteration 22: Had 2 CG iterations\n -- Obj.Reduction: Actual = 1.9038948812521994E-4, Predicted = 1.9038853266221582E-4 (A/P: 1.0), Trust Delta = 0.8894009952541146\n -- New Objective = 23264.30248731205, Beta Change Norm = 3.019597152404477E-7, Gradient Norm = 10.843636813611397\nTermination / Convergence condition satisfied.\nBEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n-- Initially: Objective = 56433.27085246851, Gradient Norm = 4.469119635504498E7, Trust Delta = 0.001024586722033724\n-- Outer Iteration 1: Had 1 CG iterations\n -- Obj.Reduction: Actual = 9262.134848396847, Predicted = 8912.05664441991 (A/P: 1.0393), Trust Delta = 4.151353931079128E-4\n -- New Objective = 47171.13600407166, Beta Change Norm = 3.9882828705765304E-4, Gradient Norm = 3491408.3116066065\nTermination / Convergence condition satisfied.\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": 
"### Evaluate the cross-validated model", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "val cvpredictions = cvmodel.transform(testDataset.withColumnRenamed(\"label\", \"OriginalLabel\"))\ncvpredictions.registerTempTable(\"cvpredictions\")\nsqlContext.sql(\"SELECT sqrt(avg(square(OriginalLabel - prediction))) FROM cvpredictions\").show", "outputs": [{"output_type": "stream", "name": "stdout", "text": "+------------------+\n| _c0|\n+------------------+\n|0.8557362892866146|\n+------------------+\n\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Homework ;)\n\nRead http://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression and perform cross validation on other hyperparameters: for example: icpt, tol, maxOuterIter, maxInnerIter", "cell_type": "markdown", "metadata": {}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Scala 2.10.4 (Spark 1.5.2)", "name": "spark", "language": "scala"}, "language_info": {"name": "scala"}}} |