<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>R: Random Forest Model for Regression and Classification</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" type="text/css" href="R.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head><body>
<table width="100%" summary="page for spark.randomForest {SparkR}"><tr><td>spark.randomForest {SparkR}</td><td align="right">R Documentation</td></tr></table>
<h2>Random Forest Model for Regression and Classification</h2>
<h3>Description</h3>
<p><code>spark.randomForest</code> fits a Random Forest Regression model or Classification model on
a SparkDataFrame. Users can call <code>summary</code> to get a summary of the fitted Random Forest
model, <code>predict</code> to make predictions on new data, and <code>write.ml</code>/<code>read.ml</code> to
save/load fitted models.
For more details, see
<a href="http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression">
Random Forest Regression</a> and
<a href="http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier">
Random Forest Classification</a>
</p>
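<p>A minimal sketch of this workflow is shown below; it assumes an active SparkR session
(for example, one started with <code>sparkR.session()</code>) and uses the <code>iris</code> data
purely for illustration.
</p>
<pre><code class="r"># assumes sparkR.session() has already been called
df &lt;- createDataFrame(iris)

# fit a classification model, inspect it, and score the training data
model &lt;- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, &quot;classification&quot;)
summary(model)
head(predict(model, df))
</code></pre>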
<h3>Usage</h3>
<pre>
spark.randomForest(data, formula, ...)

## S4 method for signature 'SparkDataFrame,formula'
spark.randomForest(data, formula,
  type = c("regression", "classification"), maxDepth = 5, maxBins = 32,
  numTrees = 20, impurity = NULL, featureSubsetStrategy = "auto",
  seed = NULL, subsamplingRate = 1, minInstancesPerNode = 1,
  minInfoGain = 0, checkpointInterval = 10, maxMemoryInMB = 256,
  cacheNodeIds = FALSE)

## S4 method for signature 'RandomForestRegressionModel'
predict(object, newData)

## S4 method for signature 'RandomForestClassificationModel'
predict(object, newData)

## S4 method for signature 'RandomForestRegressionModel,character'
write.ml(object, path, overwrite = FALSE)

## S4 method for signature 'RandomForestClassificationModel,character'
write.ml(object, path, overwrite = FALSE)

## S4 method for signature 'RandomForestRegressionModel'
summary(object)

## S4 method for signature 'RandomForestClassificationModel'
summary(object)

## S3 method for class 'summary.RandomForestRegressionModel'
print(x, ...)

## S3 method for class 'summary.RandomForestClassificationModel'
print(x, ...)
</pre>
<h3>Arguments</h3>
<table summary="R argblock">
<tr valign="top"><td><code>data</code></td>
<td>
<p>a SparkDataFrame for training.</p>
</td></tr>
<tr valign="top"><td><code>formula</code></td>
<td>
<p>a symbolic description of the model to be fitted. Currently only a few formula
operators are supported, including '~', ':', '+', and '-'.</p>
</td></tr>
<tr valign="top"><td><code>...</code></td>
<td>
<p>additional arguments passed to the method.</p>
</td></tr>
<tr valign="top"><td><code>type</code></td>
<td>
<p>type of model to fit, one of &quot;regression&quot; or &quot;classification&quot;.</p>
</td></tr>
<tr valign="top"><td><code>maxDepth</code></td>
<td>
<p>Maximum depth of the tree (&gt;= 0).</p>
</td></tr>
<tr valign="top"><td><code>maxBins</code></td>
<td>
<p>Maximum number of bins used for discretizing continuous features and for choosing
how to split on features at each node. More bins give higher granularity. Must be
&gt;= 2 and &gt;= number of categories in any categorical feature.</p>
</td></tr>
<tr valign="top"><td><code>numTrees</code></td>
<td>
<p>Number of trees to train (&gt;= 1).</p>
</td></tr>
<tr valign="top"><td><code>impurity</code></td>
<td>
<p>Criterion used for information gain calculation.
For regression, must be &quot;variance&quot;. For classification, must be one of
&quot;entropy&quot; or &quot;gini&quot;; the default is &quot;gini&quot;.</p>
</td></tr>
<tr valign="top"><td><code>featureSubsetStrategy</code></td>
<td>
<p>The number of features to consider for splits at each tree node.
Supported options: &quot;auto&quot;, &quot;all&quot;, &quot;onethird&quot;, &quot;sqrt&quot;, &quot;log2&quot;, (0.0-1.0], [1-n].</p>
</td></tr>
<tr valign="top"><td><code>seed</code></td>
<td>
<p>integer seed for random number generation.</p>
</td></tr>
<tr valign="top"><td><code>subsamplingRate</code></td>
<td>
<p>Fraction of the training data used for learning each decision tree, in
range (0, 1].</p>
</td></tr>
<tr valign="top"><td><code>minInstancesPerNode</code></td>
<td>
<p>Minimum number of instances each child must have after a split.</p>
</td></tr>
<tr valign="top"><td><code>minInfoGain</code></td>
<td>
<p>Minimum information gain for a split to be considered at a tree node.</p>
</td></tr>
<tr valign="top"><td><code>checkpointInterval</code></td>
<td>
<p>Checkpoint interval (&gt;= 1), or -1 to disable checkpointing.</p>
</td></tr>
<tr valign="top"><td><code>maxMemoryInMB</code></td>
<td>
<p>Maximum memory in MB allocated to histogram aggregation.</p>
</td></tr>
<tr valign="top"><td><code>cacheNodeIds</code></td>
<td>
<p>If FALSE, the algorithm passes trees to executors to match instances with
nodes. If TRUE, the algorithm caches node IDs for each instance, which can
speed up training of deeper trees. Users can set how often the cache is
checkpointed, or disable checkpointing, via checkpointInterval.</p>
</td></tr>
<tr valign="top"><td><code>object</code></td>
<td>
<p>A fitted Random Forest regression model or classification model.</p>
</td></tr>
<tr valign="top"><td><code>newData</code></td>
<td>
<p>a SparkDataFrame for testing.</p>
</td></tr>
<tr valign="top"><td><code>path</code></td>
<td>
<p>The directory where the model is saved.</p>
</td></tr>
<tr valign="top"><td><code>overwrite</code></td>
<td>
<p>Whether to overwrite the model if the output path already exists. The default is FALSE,
which means an exception is thrown if the output path already exists.</p>
</td></tr>
<tr valign="top"><td><code>x</code></td>
<td>
<p>summary object of a Random Forest regression or classification model
returned by <code>summary</code>.</p>
</td></tr>
</table>
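<p>The tuning arguments above are passed as named arguments to <code>spark.randomForest</code>.
The sketch below illustrates a regression fit with several of them set explicitly; the particular
values are arbitrary and for illustration only, and it assumes <code>df</code> is the SparkDataFrame
created from <code>longley</code> in the Examples section.
</p>
<pre><code class="r"># regression fit with explicit tuning parameters (illustrative values only)
model &lt;- spark.randomForest(df, Employed ~ .,
                            type = &quot;regression&quot;,
                            numTrees = 50,
                            maxDepth = 4,
                            maxBins = 16,
                            featureSubsetStrategy = &quot;onethird&quot;,
                            subsamplingRate = 0.8,
                            seed = 42)
</code></pre>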
<h3>Value</h3>
<p><code>spark.randomForest</code> returns a fitted Random Forest model.
</p>
<p><code>predict</code> returns a SparkDataFrame containing predicted labels in a column named
&quot;prediction&quot;.
</p>
<p><code>summary</code> returns summary information of the fitted model, which is a list.
The list of components includes <code>formula</code> (formula),
<code>numFeatures</code> (number of features), <code>features</code> (list of features),
<code>featureImportances</code> (feature importances), <code>numTrees</code> (number of trees),
and <code>treeWeights</code> (tree weights).
</p>
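<p>For illustration, the sketch below accesses a few components of the summary list and selects
the &quot;prediction&quot; column from the scored SparkDataFrame; it assumes <code>model</code> and
<code>df</code> were created as in the Examples section.
</p>
<pre><code class="r"># summary() returns a plain R list, so components can be accessed with $
s &lt;- summary(model)
s$numTrees            # number of trees in the ensemble
s$treeWeights         # per-tree weights
s$featureImportances  # feature importances as reported by the fitted model

# predictions are appended in a column named "prediction"
predictions &lt;- predict(model, df)
head(select(predictions, &quot;prediction&quot;))
</code></pre>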
<h3>Note</h3>
<p>spark.randomForest since 2.1.0
</p>
<p>predict(RandomForestRegressionModel) since 2.1.0
</p>
<p>predict(RandomForestClassificationModel) since 2.1.0
</p>
<p>write.ml(RandomForestRegressionModel, character) since 2.1.0
</p>
<p>write.ml(RandomForestClassificationModel, character) since 2.1.0
</p>
<p>summary(RandomForestRegressionModel) since 2.1.0
</p>
<p>summary(RandomForestClassificationModel) since 2.1.0
</p>
<p>print.summary.RandomForestRegressionModel since 2.1.0
</p>
<p>print.summary.RandomForestClassificationModel since 2.1.0
</p>
<h3>Examples</h3>
<pre><code class="r">## Not run:
##D # fit a Random Forest Regression Model
##D df &lt;- createDataFrame(longley)
##D model &lt;- spark.randomForest(df, Employed ~ ., type = &quot;regression&quot;, maxDepth = 5, maxBins = 16)
##D
##D # get the summary of the model
##D summary(model)
##D
##D # make predictions
##D predictions &lt;- predict(model, df)
##D
##D # save and load the model
##D path &lt;- &quot;path/to/model&quot;
##D write.ml(model, path)
##D savedModel &lt;- read.ml(path)
##D summary(savedModel)
##D
##D # fit a Random Forest Classification Model
##D df &lt;- createDataFrame(iris)
##D model &lt;- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, &quot;classification&quot;)
## End(Not run)
</code></pre>
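<p>As a follow-up to the classification example, the sketch below collects the scored data locally
and computes a simple training-set accuracy. It assumes the predicted label is returned as a string
in the &quot;prediction&quot; column, comparable to the original <code>Species</code> values.
</p>
<pre><code class="r"># score the classification model on its training data and collect locally
predictions &lt;- predict(model, df)
local &lt;- collect(select(predictions, &quot;Species&quot;, &quot;prediction&quot;))

# fraction of rows where the predicted label matches the observed label
mean(local$Species == local$prediction)
</code></pre>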
<hr><div align="center">[Package <em>SparkR</em> version 2.1.1 <a href="00Index.html">Index</a>]</div>
</body></html>