| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Feature Extraction and Transformation - MLlib - Spark 1.1.1 Documentation</title> |
| <meta name="description" content=""> |
| |
| |
| |
| <link rel="stylesheet" href="css/bootstrap.min.css"> |
| <style> |
| body { |
| padding-top: 60px; |
| padding-bottom: 40px; |
| } |
| </style> |
| <meta name="viewport" content="width=device-width"> |
| <link rel="stylesheet" href="css/bootstrap-responsive.min.css"> |
| <link rel="stylesheet" href="css/main.css"> |
| |
| <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script> |
| |
| <link rel="stylesheet" href="css/pygments-default.css"> |
| |
| |
| <!-- Google analytics script --> |
| <script type="text/javascript"> |
| var _gaq = _gaq || []; |
| _gaq.push(['_setAccount', 'UA-32518208-1']); |
| _gaq.push(['_trackPageview']); |
| |
| (function() { |
| var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; |
| ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; |
| var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); |
| })(); |
| </script> |
| |
| |
| </head> |
| <body> |
| <!--[if lt IE 7]> |
| <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> |
| <![endif]--> |
| |
| <!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html --> |
| |
| <div class="navbar navbar-fixed-top" id="topbar"> |
| <div class="navbar-inner"> |
| <div class="container"> |
| <div class="brand"><a href="index.html"> |
| <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">1.1.1</span> |
| </div> |
| <ul class="nav"> |
| <!--TODO(andyk): Add class="active" attribute to li some how.--> |
| <li><a href="index.html">Overview</a></li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="quick-start.html">Quick Start</a></li> |
| <li><a href="programming-guide.html">Spark Programming Guide</a></li> |
| <li class="divider"></li> |
| <li><a href="streaming-programming-guide.html">Spark Streaming</a></li> |
| <li><a href="sql-programming-guide.html">Spark SQL</a></li> |
| <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li> |
| <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li> |
| <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="api/scala/index.html#org.apache.spark.package">Scaladoc</a></li> |
| <li><a href="api/java/index.html">Javadoc</a></li> |
| <li><a href="api/python/index.html">Python API</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="cluster-overview.html">Overview</a></li> |
| <li><a href="submitting-applications.html">Submitting Applications</a></li> |
| <li class="divider"></li> |
| <li><a href="ec2-scripts.html">Amazon EC2</a></li> |
| <li><a href="spark-standalone.html">Standalone Mode</a></li> |
| <li><a href="running-on-mesos.html">Mesos</a></li> |
| <li><a href="running-on-yarn.html">YARN</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a> |
| <ul class="dropdown-menu"> |
| <li><a href="configuration.html">Configuration</a></li> |
| <li><a href="monitoring.html">Monitoring</a></li> |
| <li><a href="tuning.html">Tuning Guide</a></li> |
| <li><a href="job-scheduling.html">Job Scheduling</a></li> |
| <li><a href="security.html">Security</a></li> |
| <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li> |
| <li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li> |
| <li class="divider"></li> |
| <li><a href="building-with-maven.html">Building Spark with Maven</a></li> |
| <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li> |
| </ul> |
| </li> |
| </ul> |
| <!--<p class="navbar-text pull-right"><span class="version-text">v1.1.1</span></p>--> |
| </div> |
| </div> |
| </div> |
| |
| <div class="container" id="content"> |
| |
| <h1 class="title"><a href="mllib-guide.html">MLlib</a> - Feature Extraction and Transformation</h1> |
| |
| |
| <ul id="markdown-toc"> |
| <li><a href="#tf-idf">TF-IDF</a></li> |
| <li><a href="#word2vec">Word2Vec</a> <ul> |
| <li><a href="#model">Model</a></li> |
| <li><a href="#example">Example</a></li> |
| </ul> |
| </li> |
| <li><a href="#standardscaler">StandardScaler</a> <ul> |
| <li><a href="#model-fitting">Model Fitting</a></li> |
| <li><a href="#example-1">Example</a></li> |
| </ul> |
| </li> |
| <li><a href="#normalizer">Normalizer</a> <ul> |
| <li><a href="#example-2">Example</a></li> |
| </ul> |
| </li> |
| </ul> |
| |
| <h2 id="tf-idf">TF-IDF</h2> |
| |
| <p><a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term frequency-inverse document frequency (TF-IDF)</a> is a feature |
| vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. |
| Denote a term by <code>$t$</code>, a document by <code>$d$</code>, and the corpus by <code>$D$</code>. |
| Term frequency <code>$TF(t, d)$</code> is the number of times that term <code>$t$</code> appears in document <code>$d$</code>, |
while document frequency <code>$DF(t, D)$</code> is the number of documents that contain term <code>$t$</code>.
If we use only term frequency to measure importance, it is easy to over-emphasize terms that
appear very often but carry little information about the document, e.g., “a”, “the”, and “of”.
A term that appears very often across the corpus carries little information about any
particular document.
| Inverse document frequency is a numerical measure of how much information a term provides: |
| <code>\[ |
| IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}, |
| \]</code> |
| where <code>$|D|$</code> is the total number of documents in the corpus. |
Since the logarithm is used, a term that appears in all documents has an IDF value of 0.
| Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. |
| The TF-IDF measure is simply the product of TF and IDF: |
| <code>\[ |
| TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). |
| \]</code> |
| There are several variants on the definition of term frequency and document frequency. |
| In MLlib, we separate TF and IDF to make them flexible.</p> |
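
<p>For example, in a corpus of <code>$|D| = 3$</code> documents, a term that appears in all three documents has
<code>$IDF = \log \frac{4}{4} = 0$</code>, while a term that appears in only one document has
<code>$IDF = \log \frac{4}{2} = \log 2$</code>; if the latter term appears twice in a document,
its score for that document is <code>$TFIDF = 2 \log 2$</code>.</p>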
| |
| <p>Our implementation of term frequency utilizes the |
| <a href="http://en.wikipedia.org/wiki/Feature_hashing">hashing trick</a>. |
| A raw feature is mapped into an index (term) by applying a hash function. |
| Then term frequencies are calculated based on the mapped indices. |
| This approach avoids the need to compute a global term-to-index map, |
| which can be expensive for a large corpus, but it suffers from potential hash collisions, |
| where different raw features may become the same term after hashing. |
| To reduce the chance of collision, we can increase the target feature dimension, i.e., |
| the number of buckets of the hash table. |
| The default feature dimension is <code>$2^{20} = 1,048,576$</code>.</p> |
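
<p>To make the mapping concrete, below is a minimal sketch of the hashing trick.
The helper <code>termIndex</code> is hypothetical and for illustration only; it is not the
<code>HashingTF</code> implementation.</p>

<div class="highlight"><pre><code class="scala">// Sketch: map a term to a non-negative bucket index by hashing.
// In Scala, hashCode % n can be negative, so shift negative results back.
def termIndex(term: Any, numFeatures: Int = 1048576): Int = {
  val raw = term.## % numFeatures
  if (raw &lt; 0) raw + numFeatures else raw
}

// Distinct terms may collide into the same index; a larger numFeatures
// makes collisions less likely.
</code></pre></div>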
| |
| <p><strong>Note:</strong> MLlib doesn’t provide tools for text segmentation. |
| We refer users to the <a href="http://nlp.stanford.edu/">Stanford NLP Group</a> and |
| <a href="https://github.com/scalanlp/chalk">scalanlp/chalk</a>.</p> |
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <p>TF and IDF are implemented in <a href="api/scala/index.html#org.apache.spark.mllib.feature.HashingTF">HashingTF</a> |
| and <a href="api/scala/index.html#org.apache.spark.mllib.feature.IDF">IDF</a>. |
| <code>HashingTF</code> takes an <code>RDD[Iterable[_]]</code> as the input. |
| Each record could be an iterable of strings or other types.</p> |
| |
| <div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.HashingTF</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vector</span> |
| |
| <span class="k">val</span> <span class="n">sc</span><span class="k">:</span> <span class="kt">SparkContext</span> <span class="o">=</span> <span class="o">...</span> |
| |
| <span class="c1">// Load documents (one per line).</span> |
| <span class="k">val</span> <span class="n">documents</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">String</span><span class="o">]]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"..."</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">toSeq</span><span class="o">)</span> |
| |
| <span class="k">val</span> <span class="n">hashingTF</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HashingTF</span><span class="o">()</span> |
| <span class="k">val</span> <span class="n">tf</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Vector</span><span class="o">]</span> <span class="k">=</span> <span class="n">hashingTF</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">documents</span><span class="o">)</span> |
| </code></pre></div> |
| |
<p>While applying <code>HashingTF</code> needs only a single pass over the data, applying <code>IDF</code> needs two passes:
the first to compute the IDF vector and the second to scale the term frequencies by IDF.</p>
| |
| <div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.IDF</span> |
| |
| <span class="c1">// ... continue from the previous example</span> |
| <span class="n">tf</span><span class="o">.</span><span class="n">cache</span><span class="o">()</span> |
| <span class="k">val</span> <span class="n">idf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">IDF</span><span class="o">().</span><span class="n">fit</span><span class="o">(</span><span class="n">tf</span><span class="o">)</span> |
| <span class="k">val</span> <span class="n">tfidf</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Vector</span><span class="o">]</span> <span class="k">=</span> <span class="n">idf</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">tf</span><span class="o">)</span> |
| </code></pre></div> |
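
<p>Here <code>tf</code> is cached before fitting <code>IDF</code> because those two passes would otherwise
recompute the hashed term frequencies from the raw text each time.</p>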
| |
| </div> |
| </div> |
| |
| <h2 id="word2vec">Word2Vec</h2> |
| |
<p><a href="https://code.google.com/p/word2vec/">Word2Vec</a> computes distributed vector representations of words.
The main advantage of distributed
representations is that similar words are close in the vector space, which makes generalization to
novel patterns easier and model estimation more robust. Distributed vector representations have been
shown to be useful in many natural language processing applications such as named entity
recognition, disambiguation, parsing, tagging and machine translation.</p>
| |
| <h3 id="model">Model</h3> |
| |
<p>In our implementation of Word2Vec, we use the skip-gram model. The training objective of skip-gram is
to learn word vector representations that are good at predicting a word’s context in the same sentence.
| Mathematically, given a sequence of training words <code>$w_1, w_2, \dots, w_T$</code>, the objective of the |
| skip-gram model is to maximize the average log-likelihood |
| <code>\[ |
| \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t) |
| \]</code> |
| where $k$ is the size of the training window. </p> |
| |
<p>In the skip-gram model, every word $w$ is associated with two vectors, $u_w$ and $v_w$, which are
the vector representations of $w$ as a word and as a context, respectively. The probability of correctly
| predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is |
| <code>\[ |
| p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})} |
| \]</code> |
| where $V$ is the vocabulary size. </p> |
| |
<p>The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$
is proportional to $V$, which can easily be on the order of millions. To speed up the training of Word2Vec,
we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to
$O(\log(V))$.</p>
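
<p>To make the savings concrete: with a vocabulary of <code>$V = 10^6$</code> words, the full softmax
denominator sums over <code>$10^6$</code> inner products per prediction, whereas a binary-tree
hierarchical softmax evaluates only about <code>$\log_2(10^6) \approx 20$</code> node probabilities.</p>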
| |
| <h3 id="example">Example</h3> |
| |
| <p>The example below demonstrates how to load a text file, parse it as an RDD of <code>Seq[String]</code>, |
| construct a <code>Word2Vec</code> instance and then fit a <code>Word2VecModel</code> with the input data. Finally, |
| we display the top 40 synonyms of the specified word. To run the example, first download |
| the <a href="http://mattmahoney.net/dc/text8.zip">text8</a> data and extract it to your preferred directory. |
Here we assume the extracted file is <code>text8</code> and is in the same directory where you run the Spark shell.</p>
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark._</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.rdd._</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.Word2Vec</span> |
| |
| <span class="k">val</span> <span class="n">input</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">"text8"</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">line</span> <span class="k">=></span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="o">(</span><span class="s">" "</span><span class="o">).</span><span class="n">toSeq</span><span class="o">)</span> |
| |
| <span class="k">val</span> <span class="n">word2vec</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Word2Vec</span><span class="o">()</span> |
| |
| <span class="k">val</span> <span class="n">model</span> <span class="k">=</span> <span class="n">word2vec</span><span class="o">.</span><span class="n">fit</span><span class="o">(</span><span class="n">input</span><span class="o">)</span> |
| |
| <span class="k">val</span> <span class="n">synonyms</span> <span class="k">=</span> <span class="n">model</span><span class="o">.</span><span class="n">findSynonyms</span><span class="o">(</span><span class="s">"china"</span><span class="o">,</span> <span class="mi">40</span><span class="o">)</span> |
| |
| <span class="k">for</span><span class="o">((</span><span class="n">synonym</span><span class="o">,</span> <span class="n">cosineSimilarity</span><span class="o">)</span> <span class="k"><-</span> <span class="n">synonyms</span><span class="o">)</span> <span class="o">{</span> |
| <span class="n">println</span><span class="o">(</span><span class="n">s</span><span class="s">"$synonym $cosineSimilarity"</span><span class="o">)</span> |
| <span class="o">}</span> |
| </code></pre></div> |
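
<p><code>Word2Vec</code> can be configured before fitting. As a sketch (the setters below are assumed
from the MLlib API; check the Scaladoc for the exact names and defaults in your version):</p>

<div class="highlight"><pre><code class="scala">// Sketch: tune Word2Vec before fitting (setter names assumed; verify
// against the Scaladoc for your Spark version).
val tunedWord2vec = new Word2Vec()
  .setVectorSize(200) // dimensionality of the learned word vectors
  .setSeed(42L)       // fix the seed for reproducible training
val tunedModel = tunedWord2vec.fit(input)
</code></pre></div>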
| |
| </div> |
| </div> |
| |
| <h2 id="standardscaler">StandardScaler</h2> |
| |
| <p>Standardizes features by scaling to unit variance and/or removing the mean using column summary |
| statistics on the samples in the training set. This is a very common pre-processing step.</p> |
| |
<p>For example, the RBF kernel of Support Vector Machines or L1- and L2-regularized linear models
typically work better when all features have unit variance and/or zero mean.</p>
| |
<p>Standardization can improve the convergence rate during the optimization process, and it also prevents
features with very large variances from exerting an overly large influence during model training.</p>
| |
| <h3 id="model-fitting">Model Fitting</h3> |
| |
| <p><a href="api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler"><code>StandardScaler</code></a> has the |
| following parameters in the constructor:</p> |
| |
| <ul> |
  <li><code>withMean</code> False by default. Centers the data to zero mean before scaling. It builds a dense
output, so it does not work on sparse input and will raise an exception in that case.</li>
| <li><code>withStd</code> True by default. Scales the data to unit variance.</li> |
| </ul> |
| |
<p>We provide a <a href="api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler"><code>fit</code></a> method in
<code>StandardScaler</code> which can take an input of <code>RDD[Vector]</code>, learn the summary statistics, and then
return a model which can transform the input dataset into features with unit variance and/or zero mean,
depending on how we configure the <code>StandardScaler</code>.</p>
| |
| <p>This model implements <a href="api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer"><code>VectorTransformer</code></a> |
| which can apply the standardization on a <code>Vector</code> to produce a transformed <code>Vector</code> or on |
| an <code>RDD[Vector]</code> to produce a transformed <code>RDD[Vector]</code>.</p> |
| |
<p>Note that if the variance of a feature is zero, the transformation returns a default value of <code>0.0</code>
in the <code>Vector</code> for that feature.</p>
| |
| <h3 id="example-1">Example</h3> |
| |
<p>The example below demonstrates how to load a dataset in LIBSVM format and standardize the features
so that the new features have unit variance and/or zero mean.</p>
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.StandardScaler</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> |
| |
| <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">)</span> |
| |
| <span class="k">val</span> <span class="n">scaler1</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">StandardScaler</span><span class="o">().</span><span class="n">fit</span><span class="o">(</span><span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">))</span> |
| <span class="k">val</span> <span class="n">scaler2</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">StandardScaler</span><span class="o">(</span><span class="n">withMean</span> <span class="k">=</span> <span class="kc">true</span><span class="o">,</span> <span class="n">withStd</span> <span class="k">=</span> <span class="kc">true</span><span class="o">).</span><span class="n">fit</span><span class="o">(</span><span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">))</span> |
| |
| <span class="c1">// data1 will be unit variance.</span> |
| <span class="k">val</span> <span class="n">data1</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">label</span><span class="o">,</span> <span class="n">scaler1</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">)))</span> |
| |
| <span class="c1">// Without converting the features into dense vectors, transformation with zero mean will raise</span> |
| <span class="c1">// exception on sparse vector.</span> |
| <span class="c1">// data2 will be unit variance and zero mean.</span> |
| <span class="k">val</span> <span class="n">data2</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">label</span><span class="o">,</span> <span class="n">scaler2</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="nc">Vectors</span><span class="o">.</span><span class="n">dense</span><span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">.</span><span class="n">toArray</span><span class="o">))))</span> |
| </code></pre></div> |
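
<p>Note that both scalers are fit on the training features only. To keep training and test data on the
same scale, apply the same fitted model to any held-out data instead of fitting a new scaler on it.</p>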
| |
| </div> |
| </div> |
| |
| <h2 id="normalizer">Normalizer</h2> |
| |
| <p>Normalizer scales individual samples to have unit $L^p$ norm. This is a common operation for text |
| classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors |
| is the cosine similarity of the vectors.</p> |
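
<p>To see why, write the cosine similarity of vectors <code>$x$</code> and <code>$y$</code> as
<code>\[
\cos(x, y) = \frac{x \cdot y}{\|x\|_2 \|y\|_2} = \frac{x}{\|x\|_2} \cdot \frac{y}{\|y\|_2},
\]</code>
which is exactly the dot product of the two <code>$L^2$</code>-normalized vectors.</p>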
| |
| <p><a href="api/scala/index.html#org.apache.spark.mllib.feature.Normalizer"><code>Normalizer</code></a> has the following |
| parameter in the constructor:</p> |
| |
| <ul> |
| <li><code>p</code> Normalization in $L^p$ space, $p = 2$ by default.</li> |
| </ul> |
| |
| <p><code>Normalizer</code> implements <a href="api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer"><code>VectorTransformer</code></a> |
| which can apply the normalization on a <code>Vector</code> to produce a transformed <code>Vector</code> or on |
| an <code>RDD[Vector]</code> to produce a transformed <code>RDD[Vector]</code>.</p> |
| |
<p>Note that if the input vector has zero norm, it is returned unchanged.</p>
| |
| <h3 id="example-2">Example</h3> |
| |
<p>The example below demonstrates how to load a dataset in LIBSVM format and normalize the features
with the $L^2$ norm and the $L^\infty$ norm.</p>
| |
| <div class="codetabs"> |
| <div data-lang="scala"> |
| |
| <div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.feature.Normalizer</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.linalg.Vectors</span> |
| <span class="k">import</span> <span class="nn">org.apache.spark.mllib.util.MLUtils</span> |
| |
| <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="nc">MLUtils</span><span class="o">.</span><span class="n">loadLibSVMFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">"data/mllib/sample_libsvm_data.txt"</span><span class="o">)</span> |
| |
| <span class="k">val</span> <span class="n">normalizer1</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Normalizer</span><span class="o">()</span> |
| <span class="k">val</span> <span class="n">normalizer2</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Normalizer</span><span class="o">(</span><span class="n">p</span> <span class="k">=</span> <span class="nc">Double</span><span class="o">.</span><span class="nc">PositiveInfinity</span><span class="o">)</span> |
| |
| <span class="c1">// Each sample in data1 will be normalized using $L^2$ norm.</span> |
| <span class="k">val</span> <span class="n">data1</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">label</span><span class="o">,</span> <span class="n">normalizer1</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">)))</span> |
| |
| <span class="c1">// Each sample in data2 will be normalized using $L^\infty$ norm.</span> |
| <span class="k">val</span> <span class="n">data2</span> <span class="k">=</span> <span class="n">data</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">x</span> <span class="k">=></span> <span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">label</span><span class="o">,</span> <span class="n">normalizer2</span><span class="o">.</span><span class="n">transform</span><span class="o">(</span><span class="n">x</span><span class="o">.</span><span class="n">features</span><span class="o">)))</span> |
| </code></pre></div> |
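
<p>Since <code>Normalizer</code> keeps no state from the data, it can be chained directly after other
transformers. As a sketch, reusing the <code>tfidf</code> RDD from the TF-IDF section above:</p>

<div class="highlight"><pre><code class="scala">// Sketch: L^2-normalize TF-IDF vectors so that dot products between them
// equal cosine similarities (assumes the tfidf RDD from the TF-IDF example).
val normalizedTfidf = normalizer1.transform(tfidf)
</code></pre></div>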
| |
| </div> |
| </div> |
| |
| |
| </div> <!-- /container --> |
| |
| <script src="js/vendor/jquery-1.8.0.min.js"></script> |
| <script src="js/vendor/bootstrap.min.js"></script> |
| <script src="js/main.js"></script> |
| |
| <!-- MathJax Section --> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| TeX: { equationNumbers: { autoNumber: "AMS" } } |
| }); |
| </script> |
| <script> |
| // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS. |
| // We could use "//cdn.mathjax...", but that won't support "file://". |
| (function(d, script) { |
| script = d.createElement('script'); |
| script.type = 'text/javascript'; |
| script.async = true; |
| script.onload = function(){ |
| MathJax.Hub.Config({ |
| tex2jax: { |
| inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], |
| displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], |
| processEscapes: true, |
| skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] |
| } |
| }); |
| }; |
| script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + |
| 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; |
| d.getElementsByTagName('head')[0].appendChild(script); |
| }(document)); |
| </script> |
| </body> |
| </html> |