| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| <html><head><title>R: Random Forest Model for Regression and Classification</title> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <link rel="stylesheet" type="text/css" href="R.css"> |
| |
| <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css"> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script> |
| <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script> |
| <script>hljs.initHighlightingOnLoad();</script> |
| </head><body> |
| |
| <table width="100%" summary="page for spark.randomForest {SparkR}"><tr><td>spark.randomForest {SparkR}</td><td align="right">R Documentation</td></tr></table> |
| |
| <h2>Random Forest Model for Regression and Classification</h2> |
| |
| <h3>Description</h3> |
| |
| <p><code>spark.randomForest</code> fits a Random Forest Regression model or Classification model on |
| a SparkDataFrame. Users can call <code>summary</code> to get a summary of the fitted Random Forest |
| model, <code>predict</code> to make predictions on new data, and <code>write.ml</code>/<code>read.ml</code> to |
| save/load fitted models. |
| For more details, see |
| <a href="http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression"> |
| Random Forest Regression</a> and |
| <a href="http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier"> |
| Random Forest Classification</a>. |
| </p> |
| |
| |
| <h3>Usage</h3> |
| |
| <pre> |
| spark.randomForest(data, formula, ...) |
| |
| ## S4 method for signature 'SparkDataFrame,formula' |
| spark.randomForest(data, formula, |
| type = c("regression", "classification"), maxDepth = 5, maxBins = 32, |
| numTrees = 20, impurity = NULL, featureSubsetStrategy = "auto", |
| seed = NULL, subsamplingRate = 1, minInstancesPerNode = 1, |
| minInfoGain = 0, checkpointInterval = 10, maxMemoryInMB = 256, |
| cacheNodeIds = FALSE) |
| |
| ## S4 method for signature 'RandomForestRegressionModel' |
| predict(object, newData) |
| |
| ## S4 method for signature 'RandomForestClassificationModel' |
| predict(object, newData) |
| |
| ## S4 method for signature 'RandomForestRegressionModel,character' |
| write.ml(object, path, |
| overwrite = FALSE) |
| |
| ## S4 method for signature 'RandomForestClassificationModel,character' |
| write.ml(object, path, |
| overwrite = FALSE) |
| |
| ## S4 method for signature 'RandomForestRegressionModel' |
| summary(object) |
| |
| ## S4 method for signature 'RandomForestClassificationModel' |
| summary(object) |
| |
| ## S3 method for class 'summary.RandomForestRegressionModel' |
| print(x, ...) |
| |
| ## S3 method for class 'summary.RandomForestClassificationModel' |
| print(x, ...) |
| </pre> |
| |
| |
| <h3>Arguments</h3> |
| |
| <table summary="R argblock"> |
| <tr valign="top"><td><code>data</code></td> |
| <td> |
| <p>a SparkDataFrame for training.</p> |
| </td></tr> |
| <tr valign="top"><td><code>formula</code></td> |
| <td> |
| <p>a symbolic description of the model to be fitted. Currently only a few formula |
| operators are supported, including '~', ':', '+', and '-'.</p> |
| </td></tr> |
| <tr valign="top"><td><code>...</code></td> |
| <td> |
| <p>additional arguments passed to the method.</p> |
| </td></tr> |
| <tr valign="top"><td><code>type</code></td> |
| <td> |
| <p>type of model to fit, one of "regression" or "classification".</p> |
| </td></tr> |
| <tr valign="top"><td><code>maxDepth</code></td> |
| <td> |
| <p>Maximum depth of the tree (>= 0).</p> |
| </td></tr> |
| <tr valign="top"><td><code>maxBins</code></td> |
| <td> |
| <p>Maximum number of bins used for discretizing continuous features and for choosing |
| how to split on features at each node. More bins give higher granularity. Must be |
| >= 2 and >= number of categories in any categorical feature.</p> |
| </td></tr> |
| <tr valign="top"><td><code>numTrees</code></td> |
| <td> |
| <p>Number of trees to train (>= 1).</p> |
| </td></tr> |
| <tr valign="top"><td><code>impurity</code></td> |
| <td> |
| <p>Criterion used for information gain calculation. |
| For regression, must be "variance". For classification, must be either |
| "entropy" or "gini"; the default is "gini".</p> |
| </td></tr> |
| <tr valign="top"><td><code>featureSubsetStrategy</code></td> |
| <td> |
| <p>The number of features to consider for splits at each tree node. |
| Supported options: "auto", "all", "onethird", "sqrt", "log2", (0.0-1.0], [1-n].</p> |
| </td></tr> |
| <tr valign="top"><td><code>seed</code></td> |
| <td> |
| <p>integer seed for random number generation.</p> |
| </td></tr> |
| <tr valign="top"><td><code>subsamplingRate</code></td> |
| <td> |
| <p>Fraction of the training data used for learning each decision tree, in |
| range (0, 1].</p> |
| </td></tr> |
| <tr valign="top"><td><code>minInstancesPerNode</code></td> |
| <td> |
| <p>Minimum number of instances each child must have after split.</p> |
| </td></tr> |
| <tr valign="top"><td><code>minInfoGain</code></td> |
| <td> |
| <p>Minimum information gain for a split to be considered at a tree node.</p> |
| </td></tr> |
| <tr valign="top"><td><code>checkpointInterval</code></td> |
| <td> |
| <p>Checkpoint interval (>= 1), or -1 to disable checkpointing.</p> |
| </td></tr> |
| <tr valign="top"><td><code>maxMemoryInMB</code></td> |
| <td> |
| <p>Maximum memory in MB allocated to histogram aggregation.</p> |
| </td></tr> |
| <tr valign="top"><td><code>cacheNodeIds</code></td> |
| <td> |
| <p>If FALSE, the algorithm will pass trees to executors to match instances with |
| nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching |
| can speed up training of deeper trees. Users can set how often the cache is |
| checkpointed, or disable checkpointing, through checkpointInterval.</p> |
| </td></tr> |
| <tr valign="top"><td><code>object</code></td> |
| <td> |
| <p>A fitted Random Forest regression model or classification model.</p> |
| </td></tr> |
| <tr valign="top"><td><code>newData</code></td> |
| <td> |
| <p>a SparkDataFrame for testing.</p> |
| </td></tr> |
| <tr valign="top"><td><code>path</code></td> |
| <td> |
| <p>The directory where the model is saved.</p> |
| </td></tr> |
| <tr valign="top"><td><code>overwrite</code></td> |
| <td> |
| <p>Whether to overwrite the output path if it already exists. Default is FALSE, |
| which means an exception is thrown if the output path exists.</p> |
| </td></tr> |
| <tr valign="top"><td><code>x</code></td> |
| <td> |
| <p>summary object of Random Forest regression model or classification model |
| returned by <code>summary</code>.</p> |
| </td></tr> |
| </table> |
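| |
| <p>As an illustration of how several of these arguments combine, the following is a minimal |
| sketch rather than a recommended configuration. It assumes SparkR is attached and a session has |
| been started with <code>sparkR.session()</code>. The <code>iris</code> data and the values chosen |
| for <code>numTrees</code>, <code>maxDepth</code>, <code>subsamplingRate</code>, and |
| <code>seed</code> are arbitrary and only meant for demonstration.</p> |
| |
| <pre><code class="r">library(SparkR) |
| sparkR.session()  # assumption: a local Spark installation is available |
| |
| # createDataFrame() replaces "." in column names with "_", |
| # so Petal.Length becomes Petal_Length |
| df <- createDataFrame(iris) |
| |
| # classification forest; all values below are illustrative, not tuned |
| model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, |
|                             type = "classification", |
|                             numTrees = 50,            # must be >= 1 |
|                             maxDepth = 4,             # must be >= 0 |
|                             impurity = "entropy",     # or "gini" (the default) |
|                             featureSubsetStrategy = "sqrt", |
|                             subsamplingRate = 0.8,    # must be in (0, 1] |
|                             seed = 42) |
| |
| # number of trees actually trained |
| summary(model)$numTrees |
| </code></pre> |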
| |
| |
| <h3>Value</h3> |
| |
| <p><code>spark.randomForest</code> returns a fitted Random Forest model. |
| </p> |
| <p><code>predict</code> returns a SparkDataFrame containing the predicted labels in a column named |
| "prediction". |
| </p> |
| <p><code>summary</code> returns a list of summary information about the fitted model. |
| Its components include <code>formula</code> (formula), |
| <code>numFeatures</code> (number of features), <code>features</code> (list of features), |
| <code>featureImportances</code> (feature importances), <code>numTrees</code> (number of trees), |
| and <code>treeWeights</code> (tree weights). |
| </p> |
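| |
| <p>The return values described above might be inspected as in the following sketch. It assumes |
| SparkR is attached and a session has been started; the <code>longley</code> data and the variable |
| names <code>s</code> and <code>predictions</code> are chosen only for illustration.</p> |
| |
| <pre><code class="r">library(SparkR) |
| sparkR.session()  # assumption: a local Spark installation is available |
| |
| df <- createDataFrame(longley) |
| model <- spark.randomForest(df, Employed ~ ., type = "regression", maxBins = 16) |
| |
| # summary() returns a list; its components are accessed with $ |
| s <- summary(model) |
| s$numFeatures |
| s$features |
| s$featureImportances |
| s$numTrees |
| s$treeWeights |
| |
| # predict() returns a SparkDataFrame; predictions are in the "prediction" column |
| predictions <- predict(model, df) |
| head(select(predictions, "Employed", "prediction")) |
| </code></pre> |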
| |
| |
| <h3>Note</h3> |
| |
| <p>spark.randomForest since 2.1.0 |
| </p> |
| <p>predict(RandomForestRegressionModel) since 2.1.0 |
| </p> |
| <p>predict(RandomForestClassificationModel) since 2.1.0 |
| </p> |
| <p>write.ml(RandomForestRegressionModel, character) since 2.1.0 |
| </p> |
| <p>write.ml(RandomForestClassificationModel, character) since 2.1.0 |
| </p> |
| <p>summary(RandomForestRegressionModel) since 2.1.0 |
| </p> |
| <p>summary(RandomForestClassificationModel) since 2.1.0 |
| </p> |
| <p>print.summary.RandomForestRegressionModel since 2.1.0 |
| </p> |
| <p>print.summary.RandomForestClassificationModel since 2.1.0 |
| </p> |
| |
| |
| <h3>Examples</h3> |
| |
| <pre><code class="r">## Not run: |
| ##D # fit a Random Forest Regression Model |
| ##D df <- createDataFrame(longley) |
| ##D model <- spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16) |
| ##D |
| ##D # get the summary of the model |
| ##D summary(model) |
| ##D |
| ##D # make predictions |
| ##D predictions <- predict(model, df) |
| ##D |
| ##D # save and load the model |
| ##D path <- "path/to/model" |
| ##D write.ml(model, path) |
| ##D savedModel <- read.ml(path) |
| ##D summary(savedModel) |
| ##D |
| ##D # fit a Random Forest Classification Model |
| ##D df <- createDataFrame(iris) |
| ##D model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, type = "classification") |
| ## End(Not run) |
| </code></pre> |
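| |
| <p>The effect of <code>overwrite</code> when saving to an existing path can be sketched as |
| follows. This assumes SparkR is attached and a session has been started. The save location under |
| <code>tempdir()</code> is a hypothetical path picked for illustration, not a recommended place to |
| store models.</p> |
| |
| <pre><code class="r">library(SparkR) |
| sparkR.session()  # assumption: a local Spark installation is available |
| |
| df <- createDataFrame(iris) |
| model <- spark.randomForest(df, Species ~ Petal_Length + Petal_Width, |
|                             type = "classification") |
| |
| # hypothetical location; a temporary directory is used here for illustration |
| path <- file.path(tempdir(), "rf-classification-model") |
| |
| write.ml(model, path)                    # first save, path does not exist yet |
| write.ml(model, path, overwrite = TRUE)  # path now exists, so overwrite = TRUE is needed |
| |
| # reload the model and predict with the restored copy |
| savedModel <- read.ml(path) |
| predictions <- predict(savedModel, df) |
| head(select(predictions, "Species", "prediction")) |
| </code></pre> |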
| |
| |
| <hr><div align="center">[Package <em>SparkR</em> version 2.1.1 <a href="00Index.html">Index</a>]</div> |
| </body></html> |