---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
Note that this migration guide describes the items specific to SparkR.
Many items of SQL migration can be applied when migrating SparkR to higher versions.
Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).
## Upgrading from SparkR 3.1 to 3.2

 - Previously, SparkR automatically downloaded and installed the Spark distribution into the user's cache directory when SparkR ran in a plain R shell or Rscript and the Spark distribution could not be found. Now, it asks users whether they want to download and install it. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.

## Upgrading from SparkR 2.4 to 3.0

 - The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, and `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, and `read.json` instead.

## Upgrading from SparkR 2.3 to 2.4

 - Previously, the size of the last layer was not validated in `spark.mlp`. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not cause an error previously, but now it does.

## Upgrading from SparkR 2.3 to 2.3.1 and above

 - In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, this has been fixed so that the `start` parameter of the `substr` method is 1-based. As an example, `substr(lit('abcdef'), 2, 4)` returned `abc` in SparkR 2.3.0, and returns `bcd` in SparkR 2.3.1 (a short sketch appears at the end of this guide).

## Upgrading from SparkR 2.2 to 2.3

 - The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
 - For `summary`, an option for the statistics to compute has been added. Its output differs from that of `describe`.

## Upgrading from SparkR 2.1 to 2.2

 - A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
 - The method `createExternalTable` has been deprecated and replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
 - By default, `derby.log` is now saved to `tempdir()`. This is created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
 - `spark.lda` was not setting the optimizer correctly. It has been corrected.
 - Several model summary outputs have been updated to report `coefficients` as a `matrix`. This includes `spark.logit`, `spark.kmeans`, and `spark.glm`. Model summary outputs for `spark.gaussianMixture` now include the log-likelihood as `loglik`.

## Upgrading from SparkR 2.0 to 2.1

 - `join` no longer performs a Cartesian product by default; use `crossJoin` instead (a short sketch appears at the end of this guide).

## Upgrading from SparkR 1.6 to 2.0

 - The method `table` has been removed and replaced by `tableToDF`.
 - The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
 - `SQLContext` and `HiveContext` have been deprecated and replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` to instantiate the SparkSession. Once that is done, the currently active SparkSession is used for SparkDataFrame operations (a short sketch appears at the end of this guide).
 - The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set environment variables for the executors, set Spark config properties with the prefix "spark.executorEnv.VAR_NAME", for example, "spark.executorEnv.PATH".
 - The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
 - The method `registerTempTable` has been deprecated and replaced by `createOrReplaceTempView`.
 - The method `dropTempTable` has been deprecated and replaced by `dropTempView`.
 - The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.

## Upgrading from SparkR 1.5 to 1.6

 - Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API.
 - SparkSQL converts `NA` in R to `null` and vice versa.
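
As an illustration of the 1-based `substr` behavior described in the 2.3.1 section above, a minimal sketch, assuming an active local SparkR session; the sample string is arbitrary:

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(s = "abcdef"))

# With the 1-based start index (SparkR 2.3.1 and later), characters 2 through 4
# of "abcdef" are returned, i.e. "bcd"; SparkR 2.3.0 returned "abc" here.
head(select(df, substr(df$s, 2, 4)))

sparkR.session.stop()
```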
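
For the `join` change noted in the 2.0 to 2.1 section, a minimal sketch of requesting an explicit Cartesian product with `crossJoin`; the two toy data frames are illustrative only:

```r
library(SparkR)
sparkR.session()

left  <- createDataFrame(data.frame(id = 1:3))
right <- createDataFrame(data.frame(code = c("a", "b")))

# join() without a join expression no longer falls back to a Cartesian product;
# request one explicitly with crossJoin().
product <- crossJoin(left, right)
count(product)  # 3 * 2 = 6 rows

sparkR.session.stop()
```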
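
And for the 1.6 to 2.0 section, a before/after sketch of moving from `sparkR.init()` with an explicit `sqlContext` to `sparkR.session()`; the `master` value and the `spark.executorEnv.PATH` setting shown are placeholders, not required settings:

```r
library(SparkR)

# SparkR 1.6 and earlier (deprecated):
#   sc <- sparkR.init(master = "local[*]")
#   sqlContext <- sparkRSQL.init(sc)
#   df <- createDataFrame(sqlContext, faithful)

# SparkR 2.0 and later: a single SparkSession, no sqlContext argument, and
# executor environment variables set via "spark.executorEnv.*" properties.
sparkR.session(master = "local[*]",
               sparkConfig = list(spark.executorEnv.PATH = "/usr/local/bin"))
df <- createDataFrame(faithful)
head(df)

sparkR.session.stop()
```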