| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <title> |
| MLlib | Apache Spark |
| |
| </title> |
| |
| |
| |
| |
| <meta name="description" content="MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R."> |
| |
| |
| <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css" rel="stylesheet" |
| integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC" crossorigin="anonymous"> |
| <link rel="preconnect" href="https://fonts.googleapis.com"> |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> |
| <link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,700;1,400;1,500;1,700&Courier+Prime:wght@400;700&display=swap" rel="stylesheet"> |
| <link href="/css/custom.css" rel="stylesheet"> |
| <!-- Code highlighter CSS --> |
| <link href="/css/pygments-default.css" rel="stylesheet"> |
| <link rel="icon" href="/favicon.ico" type="image/x-icon"> |
| |
| <!-- Matomo --> |
| <script> |
| var _paq = window._paq = window._paq || []; |
| /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ |
| _paq.push(["disableCookies"]); |
| _paq.push(['trackPageView']); |
| _paq.push(['enableLinkTracking']); |
| (function() { |
| var u="https://analytics.apache.org/"; |
| _paq.push(['setTrackerUrl', u+'matomo.php']); |
| _paq.push(['setSiteId', '40']); |
| var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; |
| g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); |
| })(); |
| </script> |
| <!-- End Matomo Code --> |
| </head> |
| <body class="global"> |
| <nav class="navbar navbar-expand-lg navbar-dark p-0 px-4" style="background: #1D6890;"> |
| <a class="navbar-brand" href="/"> |
| <img src="/images/spark-logo-rev.svg" alt="" width="141" height="72"> |
| </a> |
| <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarContent" |
| aria-controls="navbarContent" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| <div class="collapse navbar-collapse col-md-12 col-lg-auto pt-4" id="navbarContent"> |
| |
| <ul class="navbar-nav me-auto"> |
| <li class="nav-item"> |
| <a class="nav-link active" aria-current="page" href="/downloads.html">Download</a> |
| </li> |
| <li class="nav-item dropdown"> |
| <a class="nav-link dropdown-toggle" href="#" id="libraries" role="button" data-bs-toggle="dropdown" |
| aria-expanded="false"> |
| Libraries |
| </a> |
| <ul class="dropdown-menu" aria-labelledby="libraries"> |
| <li><a class="dropdown-item" href="/sql/">SQL and DataFrames</a></li> |
| <li><a class="dropdown-item" href="/streaming/">Spark Streaming</a></li> |
| <li><a class="dropdown-item" href="/mllib/">MLlib (machine learning)</a></li> |
| <li><a class="dropdown-item" href="/graphx/">GraphX (graph)</a></li> |
| <li> |
| <hr class="dropdown-divider"> |
| </li> |
| <li><a class="dropdown-item" href="/third-party-projects.html">Third-Party Projects</a></li> |
| </ul> |
| </li> |
| <li class="nav-item dropdown"> |
| <a class="nav-link dropdown-toggle" href="#" id="documentation" role="button" data-bs-toggle="dropdown" |
| aria-expanded="false"> |
| Documentation |
| </a> |
| <ul class="dropdown-menu" aria-labelledby="documentation"> |
| <li><a class="dropdown-item" href="/docs/latest/">Latest Release</a></li> |
| <li><a class="dropdown-item" href="/documentation.html">Older Versions and Other Resources</a></li> |
| <li><a class="dropdown-item" href="/faq.html">Frequently Asked Questions</a></li> |
| </ul> |
| </li> |
| <li class="nav-item"> |
| <a class="nav-link active" aria-current="page" href="/examples.html">Examples</a> |
| </li> |
| <li class="nav-item dropdown"> |
| <a class="nav-link dropdown-toggle" href="#" id="community" role="button" data-bs-toggle="dropdown" |
| aria-expanded="false"> |
| Community |
| </a> |
| <ul class="dropdown-menu" aria-labelledby="community"> |
| <li><a class="dropdown-item" href="/community.html">Mailing Lists & Resources</a></li> |
| <li><a class="dropdown-item" href="/contributing.html">Contributing to Spark</a></li> |
| <li><a class="dropdown-item" href="/improvement-proposals.html">Improvement Proposals (SPIP)</a> |
| </li> |
| <li><a class="dropdown-item" href="https://issues.apache.org/jira/browse/SPARK">Issue Tracker</a> |
| </li> |
| <li><a class="dropdown-item" href="/powered-by.html">Powered By</a></li> |
| <li><a class="dropdown-item" href="/committers.html">Project Committers</a></li> |
| <li><a class="dropdown-item" href="/history.html">Project History</a></li> |
| </ul> |
| </li> |
| <li class="nav-item dropdown"> |
| <a class="nav-link dropdown-toggle" href="#" id="developers" role="button" data-bs-toggle="dropdown" |
| aria-expanded="false"> |
| Developers |
| </a> |
| <ul class="dropdown-menu" aria-labelledby="developers"> |
| <li><a class="dropdown-item" href="/developer-tools.html">Useful Developer Tools</a></li> |
| <li><a class="dropdown-item" href="/versioning-policy.html">Versioning Policy</a></li> |
| <li><a class="dropdown-item" href="/release-process.html">Release Process</a></li> |
| <li><a class="dropdown-item" href="/security.html">Security</a></li> |
| </ul> |
| </li> |
| </ul> |
| <ul class="navbar-nav ml-auto"> |
| <li class="nav-item dropdown"> |
| <a class="nav-link dropdown-toggle" href="#" id="apacheFoundation" role="button" |
| data-bs-toggle="dropdown" aria-expanded="false"> |
| Apache Software Foundation |
| </a> |
| <ul class="dropdown-menu" aria-labelledby="apacheFoundation"> |
| <li><a class="dropdown-item" href="https://www.apache.org/">Apache Homepage</a></li> |
| <li><a class="dropdown-item" href="https://www.apache.org/licenses/">License</a></li> |
| <li><a class="dropdown-item" |
| href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> |
| <li><a class="dropdown-item" href="https://www.apache.org/foundation/thanks.html">Thanks</a></li> |
| <li><a class="dropdown-item" href="https://www.apache.org/security/">Security</a></li> |
| <li><a class="dropdown-item" href="https://www.apache.org/events/current-event">Event</a></li> |
| </ul> |
| </li> |
| </ul> |
| </div> |
| </nav> |
| |
| <div class="container"> |
| <div class="row mt-4"> |
| <div class="col-12 col-md-9"> |
| <div class="jumbotron"> |
| <b>MLlib</b> is Apache Spark's scalable machine learning library. |
| </div> |
| |
| <div class="row row-padded"> |
| <div class="col-md-7 col-sm-7"> |
| <h2>Ease of use</h2> |
| <p class="lead"> |
| Usable in Java, Scala, Python, and R. |
| </p> |
| <p> |
| MLlib fits into <a href="/">Spark</a>'s |
| APIs and interoperates with <a href="http://www.numpy.org">NumPy</a> |
| in Python (as of Spark 0.9) and with R libraries (as of Spark 1.5). |
| You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it |
| easy to plug into Hadoop workflows. |
| </p> |
| </div> |
| <div class="col-md-5 col-sm-5 col-padded-top col-center"> |
| |
| <div style="margin-top: 15px; text-align: left; display: inline-block;"> |
| <div class="code"> |
| from pyspark.ml.clustering import KMeans<br /> |
| <br /> |
| data = spark.read.format(<span class="string">"libsvm"</span>)\<br /> |
| .load(<span class="string">"hdfs://..."</span>)<br /> |
| <br /> |
| model = <span class="sparkop">KMeans</span>(k=10).fit(data) |
| </div> |
| <div class="caption">Calling MLlib in Python</div> |
| </div> |
| </div> |
| </div> |
| |
| <div class="row row-padded"> |
| <div class="col-md-7 col-sm-7"> |
| <h2>Performance</h2> |
| <p class="lead"> |
| High-quality algorithms, up to 100x faster than MapReduce. |
| </p> |
| <p> |
| Spark excels at iterative computation, enabling MLlib to run fast. |
| At the same time, we care about algorithmic performance: |
| MLlib contains high-quality algorithms that leverage iteration, and |
| can yield better results than the one-pass approximations sometimes used on MapReduce. |
| </p> |
| </div> |
| <div class="col-md-5 col-sm-5 col-padded-top col-center"> |
| <div style="width: 100%; max-width: 272px; display: inline-block; text-align: center;"> |
| <img src="/images/logistic-regression.png" style="width: 100%; max-width: 250px;" /> |
| <div class="caption" style="min-width: 272px;">Logistic regression in Hadoop and Spark</div> |
| </div> |
| </div> |
| </div> |
| |
| <div class="row row-padded" style="margin-bottom: 15px;"> |
| <div class="col-md-7 col-sm-7"> |
| <h2>Runs everywhere</h2> |
| <p class="lead"> |
| Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources. |
| </p> |
| <p> |
| You can run Spark using its <a href="/docs/latest/spark-standalone.html">standalone cluster mode</a>, |
| on <a href="https://github.com/amplab/spark-ec2">EC2</a>, |
| on <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a>, |
| on <a href="https://mesos.apache.org">Mesos</a>, or |
| on <a href="https://kubernetes.io/">Kubernetes</a>. |
| Access data in <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">HDFS</a>, |
| <a href="https://cassandra.apache.org">Apache Cassandra</a>, |
| <a href="https://hbase.apache.org">Apache HBase</a>, |
| <a href="https://hive.apache.org">Apache Hive</a>, |
| and hundreds of other data sources. |
| </p> |
| </div> |
| <div class="col-md-5 col-sm-5 col-padded-top col-center"> |
| <img src="/images/hadoop.jpg" style="width: 100%; max-width: 280px;" /> |
| </div> |
| </div> |
| |
| <div class="row"> |
| <div class="col-md-4 col-padded"> |
| <h3>Algorithms</h3> |
| <p> |
| MLlib contains many algorithms and utilities. |
| </p> |
| <p> |
| ML algorithms include: |
| </p> |
| <ul class="list-narrow"> |
| <li>Classification: logistic regression, naive Bayes,...</li> |
| <li>Regression: generalized linear regression, survival regression,...</li> |
| <li>Decision trees, random forests, and gradient-boosted trees</li> |
| <li>Recommendation: alternating least squares (ALS)</li> |
| <li>Clustering: K-means, Gaussian mixtures (GMMs),...</li> |
| <li>Topic modeling: latent Dirichlet allocation (LDA)</li> |
| <li>Frequent itemsets, association rules, and sequential pattern mining</li> |
| </ul> |
| <p> |
| ML workflow utilities include: |
| </p> |
| <ul class="list-narrow"> |
| <li>Feature transformations: standardization, normalization, hashing,...</li> |
| <li>ML Pipeline construction</li> |
| <li>Model evaluation and hyper-parameter tuning</li> |
| <li>ML persistence: saving and loading models and Pipelines</li> |
| </ul> |
| <p> |
| Other utilities include: |
| </p> |
| <ul class="list-narrow"> |
| <li>Distributed linear algebra: SVD, PCA,...</li> |
| <li>Statistics: summary statistics, hypothesis testing,...</li> |
| </ul> |
| <p>Refer to the <a href="/docs/latest/ml-guide.html">MLlib guide</a> for usage examples.</p> |
| </div> |
| |
| <div class="col-md-4 col-padded"> |
| <h3>Community</h3> |
| <p> |
| MLlib is developed as part of the Apache Spark project, so it is |
| tested and updated with each Spark release. |
| </p> |
| <p> |
| If you have questions about the library, ask on the |
| <a href="/community.html#mailing-lists">Spark mailing lists</a>. |
| </p> |
| <p> |
| MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an algorithm to MLlib, |
| read <a href="/contributing.html">how to |
| contribute to Spark</a> and send us a patch! |
| </p> |
| </div> |
| |
| <div class="col-md-4 col-padded"> |
| <h3>Getting started</h3> |
| <p> |
| To get started with MLlib: |
| </p> |
| <ul class="list-narrow"> |
| <li><a href="/downloads.html">Download Spark</a>. MLlib is included as a module.</li> |
| <li>Read the <a href="/docs/latest/ml-guide.html">MLlib guide</a>, which includes |
| various usage examples.</li> |
| <li>Learn how to <a href="/docs/latest/#launching-on-a-cluster">deploy</a> Spark on a cluster |
| if you'd like to run in distributed mode. You can also run locally on a multicore machine |
| without any setup. |
| </li> |
| </ul> |
| </div> |
| </div> |
| |
| <div class="row"> |
| <div class="col-sm-12 col-center"> |
| <a href="/downloads.html" class="btn btn-cta btn-lg btn-multiline"> |
| Download Apache Spark<br /><span class="small">Includes MLlib</span> |
| </a> |
| </div> |
| </div> |
| |
| </div> |
| <div class="col-12 col-md-3"> |
| <div class="news" style="margin-bottom: 20px;"> |
| <h5>Latest News</h5> |
| <ul class="list-unstyled"> |
| |
| <li><a href="/news/spark-3-5-1-released.html">Spark 3.5.1 released</a> |
| <span class="small">(Feb 23, 2024)</span></li> |
| |
| <li><a href="/news/spark-3-3-4-released.html">Spark 3.3.4 released</a> |
| <span class="small">(Dec 16, 2023)</span></li> |
| |
| <li><a href="/news/spark-3-4-2-released.html">Spark 3.4.2 released</a> |
| <span class="small">(Nov 30, 2023)</span></li> |
| |
| <li><a href="/news/spark-3-5-0-released.html">Spark 3.5.0 released</a> |
| <span class="small">(Sep 13, 2023)</span></li> |
| |
| </ul> |
| <p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p> |
| </div> |
| <div style="text-align:center; margin-bottom: 20px;"> |
| <a href="https://www.apache.org/events/current-event.html"> |
| <img src="https://www.apache.org/events/current-event-234x60.png" style="max-width: 100%;"/> |
| </a> |
| </div> |
| <div class="hidden-xs hidden-sm"> |
| <a href="/downloads.html" class="btn btn-cta btn-lg d-grid" style="margin-bottom: 30px;"> |
| Download Spark |
| </a> |
| <p style="font-size: 16px; font-weight: 500; color: #555;"> |
| Built-in Libraries: |
| </p> |
| <ul class="list-none"> |
| <li><a href="/sql/">SQL and DataFrames</a></li> |
| <li><a href="/streaming/">Spark Streaming</a></li> |
| <li><a href="/mllib/">MLlib (machine learning)</a></li> |
| <li><a href="/graphx/">GraphX (graph)</a></li> |
| </ul> |
| <a href="/third-party-projects.html">Third-Party Projects</a> |
| </div> |
| </div> |
| </div> |
| |
| |
| |
| <footer class="small"> |
| <hr> |
| Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered |
| trademarks or trademarks of The Apache Software Foundation in the United States and other countries. |
| See guidance on use of Apache Spark <a href="/trademarks.html">trademarks</a>. |
| All other marks mentioned may be trademarks or registered trademarks of their respective owners. |
| Copyright © 2018 The Apache Software Foundation, Licensed under the |
| <a href="https://www.apache.org/licenses/">Apache License, Version 2.0</a>. |
| </footer> |
| </div> |
| |
| <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/js/bootstrap.bundle.min.js" |
| integrity="sha384-MrcW6ZMFYlzcLA8Nl+NtUVF0sA7MsXsP1UyJoMp4YLEuNSfAP+JcXn/tWtIaxVXM" |
| crossorigin="anonymous"></script> |
| <script src="https://code.jquery.com/jquery.js"></script> |
| <script src="/js/lang-tabs.js"></script> |
| <script src="/js/downloads.js"></script> |
| </body> |
| </html> |