{% include JB/setup %}
An Apache Spark distribution must be installed before installing Apache Toree. You can download a copy of Apache Spark here. Throughout the rest of this guide we will assume you have downloaded and extracted the Apache Spark distribution to `/usr/local/bin/apache-spark/`.
The quickest way to install Apache Toree is through the `toree` pip package.
```
pip install toree
```
This installs a Jupyter application called `toree`, which can be used to install and configure different Apache Toree kernels.
```
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
```
You can confirm the installation by verifying that the `apache_toree_scala` kernel is listed in the output of the following command:

```
jupyter kernelspec list
```
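If the installation succeeded, the output will include an entry for the Scala kernel. The listing below is an illustrative sketch; the exact paths and the other kernels shown depend on your environment:

```
Available kernels:
  apache_toree_scala    /usr/local/share/jupyter/kernels/apache_toree_scala
  python3               /usr/local/share/jupyter/kernels/python3
```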
The full set of install options can be viewed with `jupyter toree install --help`. Arguments that take values are actually convenience aliases to full Configurables, whose aliases are listed on the help line. For more information on full configurables, see `--help-all`.

```
--user
    Install to the per-user kernel registry
--debug
    set log level to logging.DEBUG (maximize logging output)
--replace
    Replace any existing kernel spec with this name.
--sys-prefix
    Install to Python's sys.prefix. Useful in conda/virtual environments.
--interpreters=<Unicode> (ToreeInstall.interpreters)
    Default: 'Scala'
    A comma separated list of the interpreters to install. The names of the
    interpreters are case sensitive.
--toree_opts=<Unicode> (ToreeInstall.toree_opts)
    Default: ''
    Specify command line arguments for Apache Toree.
--python_exec=<Unicode> (ToreeInstall.python_exec)
    Default: 'python'
    Specify the python executable. Defaults to "python"
--kernel_name=<Unicode> (ToreeInstall.kernel_name)
    Default: 'Apache Toree'
    Install the kernel spec with this name. This is also used as the base of
    the display name in jupyter.
--log-level=<Enum> (Application.log_level)
    Default: 30
    Choices: (0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL')
    Set the log level by value or name.
--config=<Unicode> (JupyterApp.config_file)
    Default: ''
    Full path of a config file.
--spark_home=<Unicode> (ToreeInstall.spark_home)
    Default: '/usr/local/spark'
    Specify where the spark files can be found.
--spark_opts=<Unicode> (ToreeInstall.spark_opts)
    Default: ''
    Specify command line arguments to proxy for spark config.
```
Toree is started using the `spark-submit` script. All configuration options from Spark are consistent with configuring a Spark Submit job. There are two ways of setting configuration options for Spark.
The first is at install time with the `--spark_opts` command line option.
```
jupyter toree install --spark_opts='--master=local[4]'
```
The second is at run time, through the `SPARK_OPTS` environment variable.
```
SPARK_OPTS='--master=local[4]' jupyter notebook
```
Note: There is an order of precedence to the configuration options: `SPARK_OPTS` will overwrite any values configured in `--spark_opts`.
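For example, the precedence rule means an environment value overrides whatever was baked in at install time. The master URLs below are illustrative values, not recommendations:

```shell
# Install-time default: run Spark locally with 4 cores
jupyter toree install --spark_opts='--master=local[4]'

# At run time, SPARK_OPTS takes precedence, so this notebook server
# submits the kernel with 8 cores instead of the 4 configured above
SPARK_OPTS='--master=local[8]' jupyter notebook
```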
There are some configuration options that are specific to Toree.
Option | Description |
---|---|
--default-interpreter | default interpreter for the kernel |
--default-repositories | comma separated list of additional repositories to resolve |
--default-repository-credentials | comma separated list of credential files to use |
-h, --help | display help information |
--interpreter-plugin | |
--ip | used to bind sockets |
--jar-dir | directory where user added jars are stored (MUST EXIST) |
--magic-url | path to a magic jar |
--max-interpreter-threads \<Integer\> | total number of worker threads to use to execute code |
--spark-context-initialization-timeout \<Long\> | number of milliseconds allowed for creation of the spark context; default is 100 milliseconds |
--alternate-sigint \<String\> | specifies the signal to use instead of SIGINT for interrupting a long-running cell; value does not include the SIG prefix; use of USR2 is recommended |
--nosparkcontext | kernel should not create a spark context |
-v, --version | display version information |
There are two ways of setting these configuration options.
The first is at install time with the `--toree_opts` command line option.
```
jupyter toree install --toree_opts='--nosparkcontext'
```
The second is at run time, through the `TOREE_OPTS` environment variable.
```
TOREE_OPTS='--nosparkcontext' jupyter notebook
```
Note: There is an order of precedence to the configuration options: `TOREE_OPTS` will overwrite any values configured in `--toree_opts`.
Apache Toree provides support for multiple languages. To enable this, install the configurations for these interpreters by passing a comma separated list to the `--interpreters` flag:
```
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL
```
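After installing multiple interpreters, `jupyter kernelspec list` should show one kernel per interpreter. The directory names below are an assumption based on the default kernel name of `Apache Toree`; if you set `--kernel_name`, yours will differ:

```
Available kernels:
  apache_toree_scala      /usr/local/share/jupyter/kernels/apache_toree_scala
  apache_toree_pyspark    /usr/local/share/jupyter/kernels/apache_toree_pyspark
  apache_toree_sparkr     /usr/local/share/jupyter/kernels/apache_toree_sparkr
  apache_toree_sql        /usr/local/share/jupyter/kernels/apache_toree_sql
```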
The available interpreters and their supported languages are:
Language | Spark Implementation | Value to provide to Apache Toree |
---|---|---|
Scala | Scala with Spark | Scala |
Python | Python with PySpark | PySpark |
R | R with SparkR | SparkR |
SQL | Spark SQL | SQL |
If the R package directory is not writable by Apache Toree, you will see an error similar to the following:
```
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("sparkr_bundle.tar.gz", repos = NULL, type = "source") :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Error in install.packages("sparkr_bundle.tar.gz", repos = NULL, type = "source") :
  unable to install packages
Execution halted
```
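This error means the R process launched by the kernel cannot write the SparkR bundle into the system library. The commands below are a possible workaround and an assumption on our part, not part of the original guide; adapt them to your environment:

```shell
# Option 1 (assumes sudo is available): run the R-related setup with
# permissions that can write to the system R library
sudo jupyter toree install --interpreters=SparkR --spark_home=/usr/local/bin/apache-spark/

# Option 2: point R at a per-user library that your account can write to,
# then start the notebook server in the same environment
export R_LIBS_USER="$HOME/R/library"
mkdir -p "$R_LIBS_USER"
```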