<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
# SystemML
**Documentation:** [SystemML Documentation](http://systemml.apache.org/documentation)<br/>
**Mailing List:** [Dev Mailing List](mailto:dev@systemml.apache.org)<br/>
**Build Status:** [![Build Status](https://sparktc.ibmcloud.com/jenkins/job/SystemML-DailyTest/badge/icon)](https://sparktc.ibmcloud.com/jenkins/job/SystemML-DailyTest)<br/>
**Issue Tracker:** [JIRA](https://issues.apache.org/jira/browse/SYSTEMML)<br/>
**Download:** [Download SystemML](http://systemml.apache.org/download.html)<br/>
**SystemML** is now an **Apache Top Level Project**! Please see the [**Apache SystemML**](http://systemml.apache.org/)
website for more information.
SystemML is a flexible, scalable machine learning system.
SystemML's distinguishing characteristics are:
1. **Algorithm customizability via R-like and Python-like languages**.
2. **Multiple execution modes**, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
3. **Automatic optimization** based on data and cluster characteristics to ensure both efficiency and scalability.
The latest version of SystemML supports: Java 8+, Scala 2.11+, Python 2.7/3.5+, Hadoop 2.6+, and Spark 2.1+.
## Algorithm Customizability
ML algorithms in SystemML are specified in a high-level, declarative machine learning (DML) language.
Algorithms can be expressed in either an R-like syntax or a Python-like syntax. DML includes
linear algebra primitives, statistical functions, and additional constructs.
This high-level language significantly increases the productivity of
data scientists as it provides (1) full flexibility in expressing custom
analytics and (2) data independence from the underlying input formats and
physical data representations.
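As a rough illustration of the R-like syntax, the sketch below writes a small DML script to a file. The file name `example.dml` and the computation itself are illustrative assumptions, not taken from the SystemML algorithm suite.

```shell
# Write a small DML script (R-like syntax) to example.dml.
# The file name and the computation are illustrative assumptions.
cat > example.dml <<'EOF'
# Generate a random 100 x 10 matrix and compute its Gram matrix.
X = rand(rows=100, cols=10, min=0, max=1)
G = t(X) %*% X
print("Sum of Gram matrix entries: " + sum(G))
EOF
```

The script only describes the linear algebra; it says nothing about how or where the matrix is stored, which is what lets the same file run unchanged in different execution modes.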
## Multiple Execution Modes
SystemML computations can be executed in several modes. In Standalone mode, SystemML
runs on a single machine, so data scientists can develop
algorithms locally without a distributed cluster. To scale up, the same algorithms can be distributed
across a cluster using Spark or Hadoop.
This flexibility allows the utilization of an organization's existing resources and expertise.
In addition, SystemML features a
[Spark MLContext API](http://apache.github.io/systemml/spark-mlcontext-programming-guide.html)
that allows for programmatic interaction via Scala, Python, and Java. SystemML also features an
embedded API (JMLC) for scoring models.
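To make the batch modes concrete, here is a minimal sketch that prints (rather than runs) a command line for each one. The helper `systemml_cmd` is hypothetical, `SystemML.jar` is a placeholder for the jar in your download, and `runStandaloneSystemML.sh` is assumed to be the launcher shipped with the standalone package; check the documentation for the exact options your version accepts.

```shell
# Hypothetical helper: print (not run) a batch command line for a given mode.
# Assumptions: SystemML.jar stands in for the jar in your download, and
# runStandaloneSystemML.sh is the launcher shipped with the standalone package.
systemml_cmd() {
  mode="$1"
  script="$2"
  case "$mode" in
    standalone) echo "./runStandaloneSystemML.sh $script" ;;
    spark)      echo "spark-submit SystemML.jar -f $script" ;;
    hadoop)     echo "hadoop jar SystemML.jar -f $script" ;;
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

systemml_cmd spark example.dml
```

The DML script passed as the second argument is the same in every case; only the launcher changes.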
## Automatic Optimization
Algorithms specified in DML are dynamically compiled and optimized based on data and cluster characteristics
using rule-based and cost-based optimization techniques. The optimizer automatically generates hybrid runtime
execution plans ranging from in-memory, single-node execution, to distributed computations on Spark or Hadoop.
This ensures both efficiency and scalability. Automatic optimization reduces or eliminates the need to hand-tune
distributed runtime execution plans and system configurations.
## ML Algorithms
SystemML features a suite of production-level examples that can be grouped into six broad categories:
Descriptive Statistics, Classification, Clustering, Regression, Matrix Factorization, and Survival Analysis.
Detailed descriptions of these algorithms can be found in the
[SystemML Algorithms Reference](http://apache.github.io/systemml/algorithms-reference.html). These algorithms are intended as production-level examples that can be modified or used as inspiration for new custom algorithms.
## Download & Setup
Before you get started on SystemML, make sure that your environment is set up and ready to go.
1. **If you’re on OS X, we recommend installing [Homebrew](http://brew.sh) if you haven’t already. For Linux users, the [Linuxbrew project](http://linuxbrew.sh/) is equivalent.**
OS X:
```
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
Linux:
```
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"
```
2. **Install Java (Java 8 is required).**
```
brew tap caskroom/cask
brew install Caskroom/cask/java
```
3. **Install Spark 2.1.**
```
brew tap homebrew/versions
brew install apache-spark21
```
4. **Download SystemML.**
Go to the [SystemML Downloads page](http://systemml.apache.org/download.html), download `systemml-0.15.0.zip` (it should be the second item listed), and unzip it to a location of your choice.
*The next step is optional, but it will make your life a lot easier.*
5. **[OPTIONAL] Set `SYSTEMML_HOME` in your bash profile.**
Add the following to `~/.bash_profile`, replacing `path/to/` with the location of the download from step 4.
```
export SYSTEMML_HOME=path/to/systemml-0.15.0
```
*Open a new terminal tab so that the change takes effect.*
6. **[OPTIONAL] Install Python 2 or Python 3 (to follow along with our Jupyter notebook examples).**
Python 2:
```
brew install python
pip install jupyter matplotlib numpy
```
Python 3:
```
brew install python3
pip3 install jupyter matplotlib numpy
```
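The setup steps above can be sanity-checked with a short script like the following. It is only a sketch: the helper `check_tool` is a hypothetical name, and the tool list assumes the Homebrew installs put `java`, `spark-shell`, and `python` on your `PATH`.

```shell
# Report whether a tool used in the setup steps is on the PATH.
# check_tool is a hypothetical helper name.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

for tool in java spark-shell python; do
  check_tool "$tool"
done

# SYSTEMML_HOME is optional, so only report its state.
echo "SYSTEMML_HOME: ${SYSTEMML_HOME:-not set}"
```

If you installed Python 3 only, replace `python` with `python3` in the loop.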
**Congrats! You can now use SystemML!**
## Next Steps!
To get started, please consult the
[SystemML Documentation](http://systemml.apache.org/documentation). We
recommend using the [Spark MLContext API](http://apache.github.io/systemml/spark-mlcontext-programming-guide.html)
to run SystemML from Scala or Python using `spark-shell`, `pyspark`, or `spark-submit`.
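As a starting point, the sketch below locates a SystemML jar and prints a `spark-shell` command that puts it on the classpath via `--jars`. The helper `find_systemml_jar` is hypothetical, and the assumption that the jar sits within two directory levels of `SYSTEMML_HOME` may not match your download; adjust the search path as needed.

```shell
# Hypothetical helper: print the first SystemML jar found under a directory.
# -iname matches the jar name regardless of case.
find_systemml_jar() {
  find "$1" -maxdepth 2 -iname 'systemml*.jar' 2>/dev/null | sort | head -n 1
}

# Fall back to the current directory if SYSTEMML_HOME is not set.
jar="$(find_systemml_jar "${SYSTEMML_HOME:-.}")"
if [ -n "$jar" ]; then
  # Print the command instead of running it, so it can be reviewed first.
  echo "spark-shell --jars $jar"
else
  echo "No SystemML jar found; check SYSTEMML_HOME" >&2
fi
```

The same `--jars` option is accepted by `pyspark` and `spark-submit`.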