The standard scaler scales the given data set so that every feature has a user-specified mean and standard deviation. If the user does not provide a specific mean and standard deviation, the standard scaler transforms the features of the input data set to have mean 0 and standard deviation 1. Given a set of input data $x_1, x_2, \ldots, x_n$, with mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
and standard deviation:
$$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
The scaled data set $z_1, z_2,...,z_n$ will be:
$$z_{i}= \textit{std} \left(\frac{x_{i} - \bar{x}}{\sigma_{x}}\right) + \textit{mean}$$
where $\textit{std}$ and $\textit{mean}$ are the user specified values for the standard deviation and mean.
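The formula above can be sketched in plain Scala, without any Flink dependency. This is only an illustration of the arithmetic, not the library's implementation: the object name `StandardScalingSketch`, the function `scale`, and the parameters `targetMean` and `targetStd` are hypothetical, and the handling of a constant feature ($\sigma_{x} = 0$) is an assumption, since the text does not define that case.

```scala
// Plain-Scala sketch of the scaling formula; not the Flink ML implementation.
object StandardScalingSketch {

  /** Rescales `xs` to the requested mean and standard deviation.
    * `targetMean` and `targetStd` are illustrative names for the
    * user-specified mean and std in the formula.
    */
  def scale(
      xs: Seq[Double],
      targetMean: Double = 0.0,
      targetStd: Double = 1.0): Seq[Double] = {
    require(xs.nonEmpty, "input must not be empty")
    val n = xs.length.toDouble
    val mean = xs.sum / n
    // Population standard deviation (divides by n), matching the formula
    val sigma = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / n)
    // Assumption: a constant feature (sigma == 0) maps to the target mean
    if (sigma == 0.0) xs.map(_ => targetMean)
    else xs.map(x => targetStd * (x - mean) / sigma + targetMean)
  }

  def main(args: Array[String]): Unit = {
    // Rescale four values to mean 10.0 and standard deviation 2.0
    val scaled = scale(Seq(2.0, 4.0, 6.0, 8.0), targetMean = 10.0, targetStd = 2.0)
    println(scaled)
  }
}
```

Recomputing the mean and standard deviation of the returned values should give the requested targets up to floating-point error, which is a quick way to sanity-check any implementation of the formula.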
`StandardScaler` is a `Transformer`. As such, it supports the `fit` and `transform` operations.
`StandardScaler` is trained on all subtypes of `Vector` or `LabeledVector`:

* `fit[T <: Vector]: DataSet[T] => Unit`
* `fit: DataSet[LabeledVector] => Unit`
`StandardScaler` transforms all subtypes of `Vector` or `LabeledVector` into the respective type:

* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
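The standard scaler implementation can be controlled by two parameters: the mean of the scaled data set, set with `setMean` (default 0), and its standard deviation, set with `setStd` (default 1). For example: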
{% highlight scala %}
// Create standard scaler transformer
val scaler = StandardScaler()
  .setMean(10.0)
  .setStd(2.0)

// Obtain data set to be scaled
val dataSet: DataSet[Vector] = ...

// Learn the mean and standard deviation of the training data
scaler.fit(dataSet)

// Scale the provided data set to have mean=10.0 and std=2.0
val scaledDS = scaler.transform(dataSet)
{% endhighlight %}