---
layout: global
title: "Migration Guide: SparkR (R on Spark)"
displayTitle: "Migration Guide: SparkR (R on Spark)"
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0
---
Note that this migration guide describes the items specific to SparkR. Many items of SQL migration can be applied when migrating SparkR to higher versions. Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).
## Upgrading from SparkR 3.1 to 3.2

- Previously, SparkR automatically downloaded and installed the Spark distribution in the user's cache directory to complete SparkR operation when SparkR runs in a plain R shell or Rscript and the Spark distribution cannot be found. Now, it asks users whether to download and install it. To restore the previous behavior, set the `SPARKR_ASK_INSTALLATION` environment variable to `FALSE`.
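For example, to keep the pre-3.2 non-interactive behavior in automated jobs, the variable can be exported before invoking R (the script name below is hypothetical):

```shell
# Opt out of the interactive download/install prompt (restores pre-3.2 behavior).
export SPARKR_ASK_INSTALLATION=FALSE

# Rscript my_sparkr_job.R   # hypothetical SparkR script run non-interactively

echo "$SPARKR_ASK_INSTALLATION"
```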
## Upgrading from SparkR 2.4 to 3.0

- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile` and `jsonRDD` have been removed. Use `read.parquet`, `write.parquet`, `read.json` instead.

## Upgrading from SparkR 2.3 to 2.4

- Previously, the validity of the size of the last layer in `spark.mlp` was not checked. For example, if the training data only has two labels, a `layers` param like `c(1, 3)` did not previously cause an error; now it does.
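A minimal sketch of the new validation, assuming a running SparkSession (started with `sparkR.session()`) and a two-label subset of `iris`:

```r
library(SparkR)
sparkR.session()  # assumes a local Spark installation

# Keep only two labels: "versicolor" and "virginica".
two_class <- iris[iris$Species != "setosa", ]
df <- createDataFrame(two_class)

# 4 input features, 2 output labels: the last layer size must be 2.
model <- spark.mlp(df, Species ~ ., layers = c(4, 2))

# In SparkR 2.4+, a mismatched last layer such as layers = c(4, 3)
# raises an error instead of silently training.
```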
## Upgrading from SparkR 2.3 to 2.3.1 and above

- In SparkR 2.3.0 and earlier, the `start` parameter of the `substr` method was wrongly subtracted by one and treated as 0-based. This could lead to inconsistent substring results and also did not match the behaviour of `substr` in R. In version 2.3.1 and later, it has been fixed so the `start` parameter of the `substr` method is now 1-based. As an example, `substr(lit('abcdef'), 2, 4)` would result in `abc` in SparkR 2.3.0, and the result would be `bcd` in SparkR 2.3.1.
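The corrected behaviour matches base R, as in this sketch (assumes a running SparkSession and a SparkDataFrame `df` with a string column `s` containing `"abcdef"`):

```r
# Base R: substr is 1-based.
substr("abcdef", 2, 4)  # "bcd"

# SparkR >= 2.3.1 agrees with base R (df$s holds "abcdef"):
head(select(df, substr(df$s, 2, 4)))  # "bcd"
```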
## Upgrading from SparkR 2.2 to 2.3

- The `stringsAsFactors` parameter was previously ignored with `collect`, for example, in `collect(createDataFrame(iris), stringsAsFactors = TRUE)`. It has been corrected.
- In `summary`, an option for the statistics to compute has been added. Its output is changed from that of `describe`.
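For instance, the fixed `collect` now honours `stringsAsFactors`, and `summary` can be limited to selected statistics (a sketch, assuming a running SparkSession):

```r
df <- createDataFrame(iris)

# stringsAsFactors is respected in SparkR 2.3+:
rdf <- collect(df, stringsAsFactors = TRUE)
class(rdf$Species)  # "factor" rather than "character"

# summary now accepts which statistics to compute, e.g. only min and max.
summary(df, "min", "max")
```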
## Upgrading from SparkR 2.1 to 2.2

- A `numPartitions` parameter has been added to `createDataFrame` and `as.DataFrame`. When splitting the data, the partition position calculation has been made to match the one in Scala.
- The method `createExternalTable` has been deprecated to be replaced by `createTable`. Either method can be called to create an external or managed table. Additional catalog methods have also been added.
- By default, derby.log is now saved to `tempdir()`. This will be created when instantiating the SparkSession with `enableHiveSupport` set to `TRUE`.
- `spark.lda` was not setting the optimizer correctly. It has been corrected.
- Several model summary outputs are updated to have `coefficients` as `matrix`. This includes `spark.logit`, `spark.kmeans`, `spark.glm`. Model summary outputs for `spark.gaussianMixture` have added log-likelihood as `loglik`.
## Upgrading from SparkR 2.0 to 2.1

- `join` no longer performs a Cartesian Product by default; use `crossJoin` instead.

## Upgrading from SparkR 1.6 to 2.0

- The method `table` has been removed and replaced by `tableToDF`.
- The class `DataFrame` has been renamed to `SparkDataFrame` to avoid name conflicts.
- Spark's `SQLContext` and `HiveContext` have been deprecated to be replaced by `SparkSession`. Instead of `sparkR.init()`, call `sparkR.session()` in its place to instantiate the SparkSession. Once that is done, the currently active SparkSession will be used for SparkDataFrame operations.
- The parameter `sparkExecutorEnv` is not supported by `sparkR.session`. To set the environment for the executors, set Spark config properties with the prefix `"spark.executorEnv.VAR_NAME"`, for example, `"spark.executorEnv.PATH"`.
- The `sqlContext` parameter is no longer required for these functions: `createDataFrame`, `as.DataFrame`, `read.json`, `jsonFile`, `read.parquet`, `parquetFile`, `read.text`, `sql`, `tables`, `tableNames`, `cacheTable`, `uncacheTable`, `clearCache`, `dropTempTable`, `read.df`, `loadDF`, `createExternalTable`.
- The method `registerTempTable` has been deprecated to be replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated to be replaced by `dropTempView`.
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`.
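Putting the 2.0 changes together, a pre-2.0 script can be migrated roughly as follows (a sketch, assuming a local Spark installation; the view name and executor `PATH` value are illustrative):

```r
library(SparkR)

# Before 2.0 (deprecated):
#   sc <- sparkR.init()
#   sqlContext <- sparkRSQL.init(sc)
#   df <- createDataFrame(sqlContext, faithful)
#   registerTempTable(df, "faithful_tbl")

# Since 2.0: a single SparkSession entry point, no sqlContext argument.
# Executor environment goes through spark.executorEnv.* config properties.
sparkR.session(sparkConfig = list(spark.executorEnv.PATH = "/usr/local/bin"))

df <- createDataFrame(faithful)
createOrReplaceTempView(df, "faithful_tbl")  # replaces registerTempTable
head(sql("SELECT * FROM faithful_tbl WHERE waiting > 70"))
```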
## Upgrading from SparkR 1.5 to 1.6

- Before Spark 1.6.0, the default mode for writes was `append`. It was changed in Spark 1.6.0 to `error` to match the Scala API.
- SparkSQL converts `NA` in R to `null` and vice-versa.
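Since the default save mode is now `error`, writes to an existing path must opt in explicitly (a sketch, assuming a running SparkSession; the output path is hypothetical):

```r
df <- createDataFrame(faithful)

# Fails in Spark >= 1.6.0 if the path already exists (mode defaults to "error"):
# write.df(df, path = "/tmp/faithful_parquet", source = "parquet")

# Opt back in to the pre-1.6 behavior:
write.df(df, path = "/tmp/faithful_parquet", source = "parquet", mode = "append")
```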