
 <!DOCTYPE html><html><head><title>R: Generalized Linear Models</title>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
 <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.15.3/dist/katex.min.css">
 <script type="text/javascript">
 const macros = { "\\R": "\\textsf{R}", "\\code": "\\texttt"};
 function processMathHTML() {
     var l = document.getElementsByClassName('reqn');
     for (let e of l) { katex.render(e.textContent, e, { throwOnError: false, macros }); }
     return;
 }</script>
 <script defer src="https://cdn.jsdelivr.net/npm/katex@0.15.3/dist/katex.min.js"
     onload="processMathHTML();"></script>
 <link rel="stylesheet" type="text/css" href="R.css" />

 <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/styles/github.min.css">
 <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/highlight.min.js"></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.3/languages/r.min.js"></script>
 <script>hljs.initHighlightingOnLoad();</script>
 </head><body><div class="container">

 <table style="width: 100%;"><tr><td>spark.glm {SparkR}</td><td style="text-align: right;">R Documentation</td></tr></table>

 <h2>Generalized Linear Models</h2>

 <h3>Description</h3>

 <p>Fits a generalized linear model against a SparkDataFrame.
 Users can call <code>summary</code> to print a summary of the fitted model, <code>predict</code> to make
 predictions on new data, and <code>write.ml</code>/<code>read.ml</code> to save/load fitted models.
 </p>


 <h3>Usage</h3>

 <pre><code class='language-R'>spark.glm(data, formula, ...)

 ## S4 method for signature 'SparkDataFrame,formula'
 spark.glm(
   data,
   formula,
   family = gaussian,
   tol = 1e-06,
   maxIter = 25,
   weightCol = NULL,
   regParam = 0,
   var.power = 0,
   link.power = 1 - var.power,
   stringIndexerOrderType = c("frequencyDesc", "frequencyAsc", "alphabetDesc",
     "alphabetAsc"),
   offsetCol = NULL
 )

 ## S4 method for signature 'GeneralizedLinearRegressionModel'
 summary(object)

 ## S3 method for class 'summary.GeneralizedLinearRegressionModel'
 print(x, ...)

 ## S4 method for signature 'GeneralizedLinearRegressionModel'
 predict(object, newData)

 ## S4 method for signature 'GeneralizedLinearRegressionModel,character'
 write.ml(object, path, overwrite = FALSE)
 </code></pre>


 <h3>Arguments</h3>

 <table>
 <tr style="vertical-align: top;"><td><code>data</code></td>
 <td>
 <p>a SparkDataFrame for training.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>formula</code></td>
 <td>
 <p>a symbolic description of the model to be fitted. Currently only a few formula
 operators are supported, including '~', '.', ':', '+', '-', '*', and '^'.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>...</code></td>
 <td>
 <p>additional arguments passed to the method.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>family</code></td>
 <td>
 <p>a description of the error distribution and link function to be used in the model.
 This can be a character string naming a family function, a family function or
 the result of a call to a family function. Refer to the R family documentation at
 <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html">https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html</a>.
 Currently these families are supported: <code>binomial</code>, <code>gaussian</code>,
 <code>Gamma</code>, <code>poisson</code> and <code>tweedie</code>.
 </p>
 <p>Note that there are two ways to specify the tweedie family.
 </p>

 <ul>
 <li><p> Set <code>family = "tweedie"</code> and specify <code>var.power</code> and <code>link.power</code>;
 </p>
 </li>
 <li><p> When package <code>statmod</code> is loaded, the tweedie family is specified
 using the family definition therein, i.e., <code>tweedie(var.power, link.power)</code>.
 </p>
 </li></ul>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>tol</code></td>
 <td>
 <p>positive convergence tolerance of iterations.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>maxIter</code></td>
 <td>
 <p>integer giving the maximal number of IRLS iterations.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>weightCol</code></td>
 <td>
 <p>the weight column name. If this is not set or <code>NULL</code>, we treat all instance
 weights as 1.0.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>regParam</code></td>
 <td>
 <p>regularization parameter for L2 regularization.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>var.power</code></td>
 <td>
 <p>the power in the variance function of the Tweedie distribution which provides
 the relationship between the variance and mean of the distribution. Only
 applicable to the Tweedie family.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>link.power</code></td>
 <td>
 <p>the index in the power link function. Only applicable to the Tweedie family.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>stringIndexerOrderType</code></td>
 <td>
 <p>how to order the categories of a string feature column. The last category
 after ordering is dropped when encoding strings, so this setting determines
 the base level of a string feature. Supported options
 are &quot;frequencyDesc&quot;, &quot;frequencyAsc&quot;, &quot;alphabetDesc&quot;, and
 &quot;alphabetAsc&quot;. The default value is &quot;frequencyDesc&quot;. When the
 ordering is set to &quot;alphabetDesc&quot;, this drops the same category
 as R when encoding strings.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>offsetCol</code></td>
 <td>
 <p>the offset column name. If this is not set or is empty, all instance
 offsets are treated as 0.0. The feature specified as the offset has a constant
 coefficient of 1.0.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>object</code></td>
 <td>
 <p>a fitted generalized linear model.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>x</code></td>
 <td>
 <p>summary object of fitted generalized linear model returned by <code>summary</code> function.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>newData</code></td>
 <td>
 <p>a SparkDataFrame for testing.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>path</code></td>
 <td>
 <p>the directory where the model is saved.</p>
 </td></tr>
 <tr style="vertical-align: top;"><td><code>overwrite</code></td>
 <td>
 <p>whether to overwrite if the output path already exists. The default is FALSE,
 which means an exception is thrown if the output path exists.</p>
 </td></tr>
 </table>
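
 <p>The optional fitting arguments can be combined in a single call. The following is a
 minimal sketch rather than part of the packaged examples: the columns <code>w</code> and
 <code>logExposure</code> are hypothetical and are added to the Titanic data only for
 illustration, and the values of <code>maxIter</code> and <code>tol</code> are arbitrary.
 </p>

 <pre><code class="r">library(SparkR)
 sparkR.session()

 # Local Titanic data with two hypothetical helper columns (assumptions for illustration).
 t &lt;- as.data.frame(Titanic, stringsAsFactors = FALSE)
 t$w &lt;- 1.0            # instance weights; all 1.0 mirrors the behavior when weightCol = NULL
 t$logExposure &lt;- 0.0  # offsets; all 0.0 mirrors the behavior when offsetCol is not set
 df &lt;- createDataFrame(t)

 # Poisson regression that names the weight and offset columns explicitly.
 # Only Sex and Age enter the formula; w and logExposure are used as weight and offset.
 model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;poisson&quot;,
                    weightCol = &quot;w&quot;, offsetCol = &quot;logExposure&quot;,
                    maxIter = 50, tol = 1e-8)
 summary(model)
 </code></pre>

 <p>Because the helper columns hold the documented default values, this fit should match a
 call that omits <code>weightCol</code> and <code>offsetCol</code>; replace them with real
 weights or log-exposures as needed.
 </p>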


 <h3>Value</h3>

 <p><code>spark.glm</code> returns a fitted generalized linear model.
 </p>
 <p><code>summary</code> returns a list of summary information about the fitted model.
 The list includes at least the <code>coefficients</code> (coefficients matrix,
 which includes coefficients, standard error of coefficients, t value and p value),
 <code>null.deviance</code> (null/residual degrees of freedom), <code>aic</code> (AIC)
 and <code>iter</code> (number of iterations IRLS takes). If there are collinear columns in
 the data, the coefficients matrix provides only the coefficients.
 </p>
 <p><code>predict</code> returns a SparkDataFrame containing predicted labels in a column named
 &quot;prediction&quot;.
 </p>
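
 <p>As a rough illustration of the return values described above, the sketch below reads
 the documented components from the summary list and inspects the <code>prediction</code>
 column. The variable names <code>s</code> and <code>pred</code> are arbitrary, and the
 Titanic data is reused only as a convenient example.
 </p>

 <pre><code class="r">library(SparkR)
 sparkR.session()

 df &lt;- createDataFrame(as.data.frame(Titanic, stringsAsFactors = FALSE))
 model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;gaussian&quot;)

 # summary() returns a list; the components below are the ones named in this section.
 s &lt;- summary(model)
 s$coefficients    # coefficients matrix
 s$null.deviance
 s$aic             # AIC
 s$iter            # number of IRLS iterations

 # predict() returns a SparkDataFrame with a &quot;prediction&quot; column.
 pred &lt;- predict(model, df)
 head(select(pred, &quot;Freq&quot;, &quot;prediction&quot;))
 </code></pre>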


 <h3>Note</h3>

 <p>spark.glm since 2.0.0
 </p>
 <p>summary(GeneralizedLinearRegressionModel) since 2.0.0
 </p>
 <p>print.summary.GeneralizedLinearRegressionModel since 2.0.0
 </p>
 <p>predict(GeneralizedLinearRegressionModel) since 1.5.0
 </p>
 <p>write.ml(GeneralizedLinearRegressionModel, character) since 2.0.0
 </p>


 <h3>See Also</h3>

 <p><a href="../../SparkR/help/glm.html">glm</a>, <a href="../../SparkR/help/read.ml.html">read.ml</a>
 </p>


 <h3>Examples</h3>

 <pre><code class="r">## Not run:
 ##D sparkR.session()
 ##D t &lt;- as.data.frame(Titanic, stringsAsFactors = FALSE)
 ##D df &lt;- createDataFrame(t)
 ##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;gaussian&quot;)
 ##D summary(model)
 ##D
 ##D # fitted values on training data
 ##D fitted &lt;- predict(model, df)
 ##D head(select(fitted, &quot;Freq&quot;, &quot;prediction&quot;))
 ##D
 ##D # save fitted model to input path
 ##D path &lt;- &quot;path/to/model&quot;
 ##D write.ml(model, path)
 ##D
 ##D # can also read back the saved model and print
 ##D savedModel &lt;- read.ml(path)
 ##D summary(savedModel)
 ##D
 ##D # note that the default string encoding is different from R&#39;s glm
 ##D model2 &lt;- glm(Freq ~ Sex + Age, family = &quot;gaussian&quot;, data = t)
 ##D summary(model2)
 ##D # use stringIndexerOrderType = &quot;alphabetDesc&quot; to force string encoding
 ##D # to be consistent with R
 ##D model3 &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;gaussian&quot;,
 ##D                    stringIndexerOrderType = &quot;alphabetDesc&quot;)
 ##D summary(model3)
 ##D
 ##D # fit tweedie model
 ##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = &quot;tweedie&quot;,
 ##D                    var.power = 1.2, link.power = 0)
 ##D summary(model)
 ##D
 ##D # use the tweedie family from statmod
 ##D library(statmod)
 ##D model &lt;- spark.glm(df, Freq ~ Sex + Age, family = tweedie(1.2, 0))
 ##D summary(model)
 ## End(Not run)
 </code></pre>
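
 <p>The Tweedie examples above pass <code>link.power = 0</code>, which in the power-link
 convention corresponds to the log link, while <code>var.power</code> sets the variance
 function V(mu) = mu^var.power. The base R sketch below only illustrates these two
 formulas and does not call Spark; the helper names <code>power_link</code> and
 <code>tweedie_variance</code> are hypothetical and are not part of SparkR.
 </p>

 <pre><code class="r"># Power link g(mu) = mu^link.power, with the log link as the link.power = 0 case.
 # Hypothetical helper for illustration only.
 power_link &lt;- function(mu, link.power) {
   if (link.power == 0) log(mu) else mu^link.power
 }

 # Tweedie variance function V(mu) = mu^var.power (hypothetical helper).
 tweedie_variance &lt;- function(mu, var.power) mu^var.power

 mu &lt;- c(0.5, 1, 2)
 var.power &lt;- 1.2

 power_link(mu, link.power = 0)              # log link, same values as log(mu)
 power_link(mu, link.power = 1)              # identity link, returns mu unchanged
 power_link(mu, link.power = 1 - var.power)  # the default link.power for this var.power
 tweedie_variance(mu, var.power)             # variance implied at each mean
 </code></pre>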


 <hr /><div style="text-align: center;">[Package <em>SparkR</em> version 3.2.2 <a href="00Index.html">Index</a>]</div>
 </div>
 </body></html>