blob: 50ecf0fdbdd799276ad00cdd51931c81d45c126d [file] [log] [blame]
<!DOCTYPE html><html><head><title>R: Generalized Linear Models</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.15.3/dist/katex.min.css">
<script type="text/javascript">
const macros = { "\\R": "\\textsf{R}", "\\code": "\\texttt"};
function processMathHTML() {
var l = document.getElementsByClassName('reqn');
for (let e of l) { katex.render(e.textContent, e, { throwOnError: false, macros }); }
return;
}</script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.3/dist/katex.min.js"
onload="processMathHTML();"></script>
<link rel="stylesheet" type="text/css" href="R.css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head><body><div class="container">
<table style="width: 100%;"><tr><td>spark.glm {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table>
<h2>Generalized Linear Models</h2>
<h3>Description</h3>
<p>Fits generalized linear model against a SparkDataFrame.
Users can call <code>summary</code> to print a summary of the fitted model, <code>predict</code> to make
predictions on new data, and <code>write.ml</code>/<code>read.ml</code> to save/load fitted models.
</p>
<h3>Usage</h3>
<pre><code class='language-R'>spark.glm(data, formula, ...)
## S4 method for signature 'SparkDataFrame,formula'
spark.glm(
data,
formula,
family = gaussian,
tol = 1e-06,
maxIter = 25,
weightCol = NULL,
regParam = 0,
var.power = 0,
link.power = 1 - var.power,
stringIndexerOrderType = c("frequencyDesc", "frequencyAsc", "alphabetDesc",
"alphabetAsc"),
offsetCol = NULL
)
## S4 method for signature 'GeneralizedLinearRegressionModel'
summary(object)
## S3 method for class 'summary.GeneralizedLinearRegressionModel'
print(x, ...)
## S4 method for signature 'GeneralizedLinearRegressionModel'
predict(object, newData)
## S4 method for signature 'GeneralizedLinearRegressionModel,character'
write.ml(object, path, overwrite = FALSE)
</code></pre>
<h3>Arguments</h3>
<table>
<tr style="vertical-align: top;"><td><code>data</code></td>
<td>
<p>a SparkDataFrame for training.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>formula</code></td>
<td>
<p>a symbolic description of the model to be fitted. Currently only a few formula
operators are supported, including '~', '.', ':', '+', '-', '*', and '^'.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>...</code></td>
<td>
<p>additional arguments passed to the method.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>family</code></td>
<td>
<p>a description of the error distribution and link function to be used in the model.
This can be a character string naming a family function, a family function or
the result of a call to a family function. Refer R family at
<a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html">https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html</a>.
Currently these families are supported: <code>binomial</code>, <code>gaussian</code>,
<code>Gamma</code>, <code>poisson</code> and <code>tweedie</code>.
</p>
<p>Note that there are two ways to specify the tweedie family.
</p>
<ul>
<li><p> Set <code>family = "tweedie"</code> and specify the var.power and link.power;
</p>
</li>
<li><p> When package <code>statmod</code> is loaded, the tweedie family is specified
using the family definition therein, i.e., <code>tweedie(var.power, link.power)</code>.
</p>
</li></ul>
</td></tr>
<tr style="vertical-align: top;"><td><code>tol</code></td>
<td>
<p>positive convergence tolerance of iterations.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>maxIter</code></td>
<td>
<p>integer giving the maximal number of IRLS iterations.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>weightCol</code></td>
<td>
<p>the weight column name. If this is not set or <code>NULL</code>, we treat all instance
weights as 1.0.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>regParam</code></td>
<td>
<p>regularization parameter for L2 regularization.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>var.power</code></td>
<td>
<p>the power in the variance function of the Tweedie distribution which provides
the relationship between the variance and mean of the distribution. Only
applicable to the Tweedie family.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>link.power</code></td>
<td>
<p>the index in the power link function. Only applicable to the Tweedie family.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>stringIndexerOrderType</code></td>
<td>
<p>how to order categories of a string feature column. This is used to
decide the base level of a string feature as the last category
after ordering is dropped when encoding strings. Supported options
are &quot;frequencyDesc&quot;, &quot;frequencyAsc&quot;, &quot;alphabetDesc&quot;, and
&quot;alphabetAsc&quot;. The default value is &quot;frequencyDesc&quot;. When the
ordering is set to &quot;alphabetDesc&quot;, this drops the same category
as R when encoding strings.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>offsetCol</code></td>
<td>
<p>the offset column name. If this is not set or empty, we treat all instance
offsets as 0.0. The feature specified as offset has a constant coefficient of
1.0.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>object</code></td>
<td>
<p>a fitted generalized linear model.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>x</code></td>
<td>
<p>summary object of fitted generalized linear model returned by <code>summary</code> function.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>newData</code></td>
<td>
<p>a SparkDataFrame for testing.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>path</code></td>
<td>
<p>the directory where the model is saved.</p>
</td></tr>
<tr style="vertical-align: top;"><td><code>overwrite</code></td>
<td>
<p>overwrites or not if the output path already exists. Default is FALSE
which means throw exception if the output path exists.</p>
</td></tr>
</table>
<h3>Value</h3>
<p><code>spark.glm</code> returns a fitted generalized linear model.
</p>
<p><code>summary</code> returns summary information of the fitted model, which is a list.
The list of components includes at least the <code>coefficients</code> (coefficients matrix,
which includes coefficients, standard error of coefficients, t value and p value),
<code>null.deviance</code> (null/residual degrees of freedom), <code>aic</code> (AIC)
and <code>iter</code> (number of iterations IRLS takes). If there are collinear columns in
the data, the coefficients matrix only provides coefficients.
</p>
<p><code>predict</code> returns a SparkDataFrame containing predicted labels in a column named
&quot;prediction&quot;.
</p>
<h3>Note</h3>
<p>spark.glm since 2.0.0
</p>
<p>summary(GeneralizedLinearRegressionModel) since 2.0.0
</p>
<p>print.summary.GeneralizedLinearRegressionModel since 2.0.0
</p>
<p>predict(GeneralizedLinearRegressionModel) since 1.5.0
</p>
<p>write.ml(GeneralizedLinearRegressionModel, character) since 2.0.0
</p>
<h3>See Also</h3>
<p><a href="../../SparkR/help/glm.html">glm</a>, <a href="../../SparkR/help/read.ml.html">read.ml</a>
</p>
<h3>Examples</h3>
<pre><code class="r">## Not run:
##D sparkR.session()
##D t &lt;- as.data.frame(Titanic, stringsAsFactors = FALSE)
##D df &lt;- createDataFrame(t)
##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;gaussian&quot;)
##D summary(model)
##D
##D # fitted values on training data
##D fitted &lt;- predict(model, df)
##D head(select(fitted, &quot;Freq&quot;, &quot;prediction&quot;))
##D
##D # save fitted model to input path
##D path &lt;- &quot;path/to/model&quot;
##D write.ml(model, path)
##D
##D # can also read back the saved model and print
##D savedModel &lt;- read.ml(path)
##D summary(savedModel)
##D
##D # note that the default string encoding is different from R&#39;s glm
##D model2 &lt;- glm(Freq ~ Sex + Age, family = &quot;gaussian&quot;, data = t)
##D summary(model2)
##D # use stringIndexerOrderType = &quot;alphabetDesc&quot; to force string encoding
##D # to be consistent with R
##D model3 &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;gaussian&quot;,
##D stringIndexerOrderType = &quot;alphabetDesc&quot;)
##D summary(model3)
##D
##D # fit tweedie model
##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;tweedie&quot;,
##D var.power = 1.2, link.power = 0)
##D summary(model)
##D
##D # use the tweedie family from statmod
##D library(statmod)
##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = tweedie(1.2, 0))
##D summary(model)
## End(Not run)
</code></pre>
<hr /><div style="text-align: center;">[Package <em>SparkR</em> version 3.2.2 <a href="00Index.html">Index</a>]</div>
</div>
</body></html>