SVM: Lower bound the default for n_components

JIRA: MADLIB-1384
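
The effect of the change is a floor on the default dimensionality of the
randomized feature space. A minimal sketch (not the MADlib source; the helper
name is illustrative) of the new default:

```python
# Sketch of the new n_components default: the dimensionality of the
# transformed feature space is lower-bounded at 100, so inputs with few
# features still get an expressive random feature map.

def default_n_components(n_features):
    """Return the default dimensionality of the transformed feature space."""
    return max(100, 2 * n_features)

print(default_n_components(4))    # 100: the lower bound applies
print(default_n_components(200))  # 400: 2 * n_features dominates
```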
diff --git a/src/ports/postgres/modules/svm/svm.py_in b/src/ports/postgres/modules/svm/svm.py_in
index b4f4f45..1532cb2 100644
--- a/src/ports/postgres/modules/svm/svm.py_in
+++ b/src/ports/postgres/modules/svm/svm.py_in
@@ -1330,7 +1330,7 @@
 def _extract_kernel_params(kernel_params='', n_features=10):
     params_default = {
         # common params
-        'n_components': 2 * n_features,
+        'n_components': max(100, 2 * n_features),
         'fit_intercept': False,
         'random_state': 1,
 
diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in
index cb6b69e..ba05e86 100644
--- a/src/ports/postgres/modules/svm/svm.sql_in
+++ b/src/ports/postgres/modules/svm/svm.sql_in
@@ -319,23 +319,22 @@
 is the intercept.
 </DD>
 <DT>n_components</DT>
-<DD>Default: 2*num_features. The dimensionality of the transformed feature space.
+<DD>Default: max(100, 2*num_features). The dimensionality of the transformed feature space.
 A larger value lowers the variance of the estimate of the kernel but requires
 more memory and takes longer to train.</DD>
 @note
-Setting the \e n_components kernel parameter properly is important
-to generate an accurate decision boundary.  This parameter
-is the dimensionality of the transformed feature space that arises
-from using the primal formulation.  We use primal in MADlib
-because we are implementing in a distributed system,
-compared to an R or other single node implementations
-that can use the dual formulation.  The primal approach
-implements an approximation of the kernel function using random
-feature maps, so in the case of a gaussian kernel, the
-dimensionality of the transformed feature space is not
-infinite (as in dual), but rather of size \e n_components.
-Try increasing \e n_components higher than the default if you are
-not getting an accurate decision boundary.
+Setting the \e n_components kernel parameter properly is important to
+generate an accurate decision boundary and can make the difference between a
+good model and a useless model. Try increasing the value of \e n_components
+if you are not getting an accurate decision boundary. This parameter arises
+from using the primal formulation, in which we map data into a relatively
+low-dimensional randomized feature space [2, 3]. The parameter
+\e n_components is the dimension of that feature space. We use the primal in
+MADlib to support scaling to large data sets, compared to R or other single
+node implementations that use the dual formulation and hence do not have this
+type of mapping, since the dimensionality of the transformed feature
+space in the dual is effectively infinite.
+
 <DT>random_state</DT>
 <DD>Default: 1. Seed used by a random number generator. </DD>
 </DL>
@@ -641,8 +640,7 @@
 -# Train using Gaussian kernel. This time we specify
 the initial step size and maximum number of iterations to run. As part of the
 kernel parameter, we choose 10 as the dimension of the space where we train
-SVM. A larger number will lead to a more powerful model but run the risk of
-overfitting. As a result, the model will be a 10 dimensional vector, instead
+SVM. As a result, the model will be a 10-dimensional vector, instead
 of 4 as in the case of linear model.
 <pre class="example">
 DROP TABLE IF EXISTS houses_svm_gaussian, houses_svm_gaussian_summary, houses_svm_gaussian_random;