---
layout: page
title: "Install"
description: "This page will help you get started and will guide you through installing Apache Zeppelin and running it in the command line."
group: quickstart
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{% include JB/setup %}
# Zeppelin on Kubernetes
Zeppelin can run on clusters managed by [Kubernetes](https://kubernetes.io/). When Zeppelin runs in a Pod, it creates a separate Pod for each interpreter. The Spark interpreter is also automatically configured to use Spark on Kubernetes in client mode.
Key benefits are:
- Interpreter scale-out
- Automatic configuration of the Spark interpreter for Spark on Kubernetes
- Customizable Kubernetes YAML files
- Spark UI access
## Prerequisites
- Zeppelin >= 0.9.0 docker image
- Spark >= 2.4.0 docker image (if you use the Spark interpreter)
- A running Kubernetes cluster with access configured to it using [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- [Kubernetes DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/) configured in your cluster
- Enough CPU and memory in your Kubernetes cluster. We recommend 4 CPUs and 6 GB of memory to be able to start the Spark interpreter with a few executors.
- If you're using [minikube](https://kubernetes.io/docs/setup/minikube/), check your cluster capacity (`kubectl describe node`) and increase it if necessary:
```
$ minikube delete # otherwise configuration won't apply
$ minikube config set cpus <number>
$ minikube config set memory <number in MB>
$ minikube start
$ minikube config view
```
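As a small worked example, the recommended sizing can be translated into the values the `minikube config set` commands expect. This is only a sketch: it assumes the recommended 6 gigabytes of memory means 6 × 1024 MB, and it prints the commands instead of running them so you can review them first.

```sh
# Recommended sizing from the prerequisites above
CPUS=4
MEMORY_GB=6

# minikube takes memory in MB; 1 GB is treated as 1024 MB here (an assumption)
MEMORY_MB=$((MEMORY_GB * 1024))

# Print the commands rather than executing them
echo "minikube config set cpus ${CPUS}"
echo "minikube config set memory ${MEMORY_MB}"
```

Run the printed commands between `minikube delete` and `minikube start`, as in the sequence above.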
## Quickstart
Let's first clone the Zeppelin repository from GitHub:
```sh
git clone https://github.com/apache/zeppelin.git
cd zeppelin
# optionally check out the version or branch you want, e.g.
# git checkout tags/v0.10.1
# then check which version is set in "./pom.xml"
```
Now we build the `zeppelin-distribution` image. This may take some time. The image is used as the base for the images built in the next steps:
```sh
docker build -t zeppelin-distribution:latest -f ./Dockerfile .
```
Next, we will build our `zeppelin-server` image:
```sh
cd scripts/docker/zeppelin-server
# Looking at the "./pom.xml" we can see the version is 0.13.0-SNAPSHOT
# Let's set the correct version in our Dockerfile:
# vi Dockerfile
# ARG version="0.13.0-SNAPSHOT"
# Once you have saved the Dockerfile with the correct version, we can build our image:
docker build -t zeppelin-server:0.13.0-SNAPSHOT -f ./Dockerfile .
```
The last image we build is `zeppelin-interpreter`:
```sh
# we are still in scripts/docker/zeppelin-server, so move to the sibling directory
cd ../zeppelin-interpreter
docker build -t zeppelin-interpreter:0.13.0-SNAPSHOT -f ./Dockerfile .
```
So we should now have the following images:
```sh
# sudo if you are on Linux and Docker requires root
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
zeppelin-interpreter 0.13.0-SNAPSHOT 4f77fe989eed 3 minutes ago 622MB
zeppelin-server 0.13.0-SNAPSHOT 4f77fe989eed 3 minutes ago 622MB
zeppelin-distribution latest bd2fb4b321d2 40 minutes ago 1.27GB
```
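Reminder: before deploying, adjust the image names in `k8s/zeppelin-server.yaml` to match the images built above. The commands below assume you run them from the `k8s` directory of the repository.
Start Zeppelin on the Kubernetes cluster: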
```sh
kubectl apply -f zeppelin-server.yaml
```
Forward the Zeppelin server port,
```sh
kubectl port-forward zeppelin-server 8080:80
```
and browse [localhost:8080](http://localhost:8080).
Try running some paragraphs and check that each interpreter runs as a Pod (using `kubectl get pods`) instead of as a local process.
To shut down:
```sh
kubectl delete -f zeppelin-server.yaml
```
## Spark Interpreter
Build a Spark docker image to use the Spark interpreter.
Download a Spark binary distribution and run the following commands.
Spark 2.4.0 or later is required.
```
# if you're using minikube, set docker-env
$ eval $(minikube docker-env)
# build docker image
$ <spark-distribution>/bin/docker-image-tool.sh -m -t 2.4.0 build
```
Run `docker images` and check that the `spark:2.4.0` image has been created.
Then set `sparkContainerImage` in the `zeppelin-server-conf` ConfigMap in `zeppelin-server.yaml` to this image.
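To confirm which values the running cluster actually uses, you can inspect the ConfigMap after applying the file. The following is a minimal sketch: the function name `show_zeppelin_server_conf` and the `KUBECTL` override variable are hypothetical helpers, not part of Zeppelin, and the ConfigMap is assumed to live in your current namespace.

```sh
# Allow the kubectl binary to be overridden (hypothetical convenience variable)
KUBECTL="${KUBECTL:-kubectl}"

# Print the zeppelin-server-conf ConfigMap as YAML so the configured
# Spark image can be checked against the image you built
show_zeppelin_server_conf() {
  "$KUBECTL" get configmap zeppelin-server-conf -o yaml
}
```

Calling `show_zeppelin_server_conf` prints the ConfigMap as currently stored in the cluster, which makes it easy to spot a Spark image setting that was not updated.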
Create a note and configure the number of executors (default 1):
```
%spark.conf
spark.executor.instances 5
```
Then start your Spark interpreter:
```
%spark
sc.parallelize(1 to 100).count
...
```
As long as the `spark.master` property of the Spark interpreter starts with `k8s://` (the default is `k8s://https://kubernetes.default.svc` when Zeppelin is started using `zeppelin-server.yaml`), Spark executors are created automatically in your Kubernetes cluster.
The Spark UI is accessible by clicking `SPARK JOB` on the paragraph.
See [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) for more details.
## Build Zeppelin image manually
To build your own Zeppelin image, first build the Zeppelin project with the `-Pbuild-distr` flag.
```
$ ./mvnw package -DskipTests -Pbuild-distr <your flags>
```
The binary package is created under the `zeppelin-distribution/target` directory. Move the package file to the `scripts/docker/zeppelin/bin/` directory.
```
$ mv zeppelin-distribution/target/zeppelin-*-bin.tgz scripts/docker/zeppelin/bin/
```
By default, `scripts/docker/zeppelin/bin/Dockerfile` downloads the package from the internet. Modify the file so that it adds the package from the filesystem instead.
```
...
# Find the following section and comment it out
#RUN echo "$LOG_TAG Download Zeppelin binary" && \
# wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz "https://www.apache.org/dyn/closer.lua/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz?action=download" && \
# tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
# mv /zeppelin-${Z_VERSION}-bin-all ${ZEPPELIN_HOME}
# Add the following lines right after the commented-out section above
ADD zeppelin-${Z_VERSION}.tar.gz /
RUN ln -s /zeppelin-${Z_VERSION} /zeppelin
...
```
Then build the docker image.
```
# configure docker env, if you're using minikube
$ eval $(minikube docker-env)
# change directory
$ cd scripts/docker/zeppelin/bin/
# build image. Replace <tag>.
$ docker build -t <tag> .
```
Finally, set the custom image `<tag>` you just created as the `image` and as the `ZEPPELIN_K8S_CONTAINER_IMAGE` env variable of the `zeppelin-server` container spec in the `zeppelin-server.yaml` file.
Currently, a single docker image is used for both the Zeppelin server and the interpreter Pods. The resulting Pods are therefore:
| Pod | Number of instances | Image | Note |
| --- | --- | --- | --- |
| Zeppelin Server | 1 | Zeppelin docker image | User creates/deletes with kubectl command |
| Zeppelin Interpreters | n | Zeppelin docker image | Zeppelin Server creates/deletes |
| Spark executors | m | Spark docker image | Spark Interpreter creates/deletes |
Currently, the Zeppelin docker image is quite large. The Zeppelin project plans to provide lightweight images for each individual interpreter in the future.
## How it works
### Zeppelin on Kubernetes
`k8s/zeppelin-server.yaml` is provided to run the Zeppelin server with a few sidecars and configurations.
Once the Zeppelin server is started inside Kubernetes, it automatically configures itself to use `K8sStandardInterpreterLauncher`.
The launcher creates each interpreter in a Pod using the templates located under the `k8s/interpreter/` directory.
Templates in the directory are applied in alphabetical order. Templates are rendered by [jinjava](https://github.com/HubSpot/jinjava),
and all interpreter properties are accessible inside the templates.
### Spark on Kubernetes
When the interpreter group is `spark`, Zeppelin automatically sets the Spark configuration needed to use Spark on Kubernetes.
It uses client mode, so the Spark interpreter Pod acts as the Spark driver and Spark executors are launched in separate Pods.
This automatic configuration can be overridden by manually setting the `spark.master` property of the Spark interpreter.
### Accessing Spark UI (or Service running in interpreter Pod)
The Zeppelin server Pod has a reverse proxy as a sidecar, which splits traffic between the Zeppelin server and the Spark UI running in the other Pods.
It assumes that both `<your service domain>` and `*.<your service domain>` point to the nginx proxy address.
`<your service domain>` is directed to the Zeppelin server, and `*.<your service domain>` is directed to interpreter Pods.
`<port>-<interpreter pod svc name>.<your service domain>` is the convention for accessing any application running in an interpreter Pod.
For example, when your service domain name is `local.zeppelin-project.org`, the Spark interpreter Pod is running with the name `spark-axefeg`, and the Spark UI is running on port 4040,
```
4040-spark-axefeg.local.zeppelin-project.org
```
is the address to access the Spark UI.
The default service domain is `local.zeppelin-project.org:8080`. `local.zeppelin-project.org` and `*.local.zeppelin-project.org` are configured to resolve to `127.0.0.1`.
This allows you to access Zeppelin and the Spark UI with `kubectl port-forward zeppelin-server 8080:80`.
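The naming convention described above can be sketched as a small shell function. The function name `interpreter_app_host` is hypothetical and only illustrates how the three parts are joined. It is not something Zeppelin ships.

```sh
# Build the host name for an application running in an interpreter Pod,
# following <port>-<interpreter pod svc name>.<your service domain>
interpreter_app_host() {
  port="$1"
  svc_name="$2"
  service_domain="$3"
  printf '%s-%s.%s\n' "$port" "$svc_name" "$service_domain"
}

# The example from the text: Spark UI on port 4040 of Pod spark-axefeg
interpreter_app_host 4040 spark-axefeg local.zeppelin-project.org
# prints 4040-spark-axefeg.local.zeppelin-project.org
```

With the default service domain `local.zeppelin-project.org:8080`, the same call yields `4040-spark-axefeg.local.zeppelin-project.org:8080`, which is the address reachable through the port forward.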
If you want to use a custom domain:
1. Configure [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) in the Kubernetes cluster for the `http` port of the `zeppelin-server` service defined in `k8s/zeppelin-server.yaml`.
2. Configure DNS records so that your service domain and its wildcard subdomain point to the IP addresses of your Ingress.
3. Modify `serviceDomain` in the `zeppelin-server-conf` ConfigMap in the `k8s/zeppelin-server.yaml` file.
4. Apply the changes (e.g. `kubectl apply -f k8s/zeppelin-server.yaml`).
## Persist /notebook and /conf directory
Notebooks and configurations are not persisted by default. If you need them to persist, configure a volume and update `k8s/zeppelin-server.yaml`
to mount it for the /notebook and /conf directories.
## Customization
### Zeppelin Server Pod
Edit `k8s/zeppelin-server.yaml` and apply it.
### Interpreter Pod
Since interpreter Pods are created and deleted by the Zeppelin server using the templates under the `k8s/interpreter` directory,
customizing them takes the following steps:
1. Prepare a `k8s/interpreter` directory with your customizations (edit the existing YAML files or create new ones) in a Kubernetes volume.
2. Modify `k8s/zeppelin-server.yaml` to mount the prepared volume directory `k8s/interpreter` at `/zeppelin/k8s/interpreter/`.
3. Apply the modified `k8s/zeppelin-server.yaml`.
4. Run a paragraph. The interpreter is then created using the modified YAML files.
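One possible way to carry out step 1 is to package the customized template directory as a ConfigMap, which Kubernetes can mount as a volume. The sketch below makes several assumptions: the ConfigMap name `zeppelin-interpreter-templates`, the function name, and the `KUBECTL` override variable are all hypothetical, and a persistent volume would work just as well. The mount itself (step 2) still has to be added to `k8s/zeppelin-server.yaml` by hand.

```sh
# Allow the kubectl binary to be overridden (hypothetical convenience variable)
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical name for the ConfigMap that holds the customized templates
TEMPLATES_CONFIGMAP="zeppelin-interpreter-templates"

# Create a ConfigMap with one entry per file in k8s/interpreter/.
# Run this from the root of the Zeppelin repository.
create_interpreter_templates_configmap() {
  "$KUBECTL" create configmap "$TEMPLATES_CONFIGMAP" --from-file=k8s/interpreter/
}
```

After calling `create_interpreter_templates_configmap`, reference the ConfigMap as a volume in the Zeppelin server Pod spec and mount it at `/zeppelin/k8s/interpreter/`.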
The interpreter pod can also be customized through the interpreter settings. Here are some of the properties:
| Property Name | Default Value | Description |
| ----- | ----- | ----- |
| `zeppelin.k8s.interpreter.namespace` | `default` | Specify the namespace of the current interpreter. Users can set different namespaces for different interpreters. In order to minimize permissions, the interpreter pod can only be created in the `default` namespace by default. If users need to create an interpreter pod in other namespaces, they need to add the corresponding `rolebinding` in `k8s/zeppelin-server.yaml`.|
| `zeppelin.k8s.interpreter.serviceAccount` | `default` | The Kubernetes service account to use. |
| `zeppelin.k8s.interpreter.container.image` | `apache/zeppelin:<ZEPPELIN_VERSION>` | The interpreter image to use. |
| `zeppelin.k8s.interpreter.cores` | (optional) | The number of cpu cores to use. |
| `zeppelin.k8s.interpreter.memory` | (optional) | The memory to use, e.g., `1g`. |
| `zeppelin.k8s.interpreter.gpu.type` | (optional) | Set the type of gpu to request when the interpreter pod is required to schedule gpu resources, e.g., `nvidia.com/gpu`. |
| `zeppelin.k8s.interpreter.gpu.nums` | (optional) | The number of GPUs to use. |
| `zeppelin.k8s.interpreter.imagePullSecrets` | (optional) | Set the comma-separated list of Kubernetes secrets while pulling images, e.g., `mysecret1,mysecret2` |
| `zeppelin.k8s.interpreter.container.imagePullPolicy` | (optional) | Set the pull policy of the interpreter image, e.g., `Always` |
| `zeppelin.k8s.spark.container.imagePullPolicy` | (optional) | Set the pull policy of the spark image, e.g., `Always` |
| `zeppelin.spark.uiWebUrl` | `//{{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}}` | The URL for user to access Spark UI. The default value is a [jinjava](https://github.com/HubSpot/jinjava) template that contains three variables. |
| `zeppelin.k8s.spark.useIngress` | (optional) | If true, the [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) will be created when creating the spark interpreter. So users can access the Spark UI through Ingress. |
| `zeppelin.k8s.spark.ingress.host` | `{{PORT}}-{{SERVICE_NAME}}.{{SERVICE_DOMAIN}}` | If `zeppelin.k8s.spark.useIngress` is `true`, it configures the `host` value of the Ingress. The default value is a [jinjava](https://github.com/HubSpot/jinjava) template that contains three variables. Users can access the Spark UI through a customized `zeppelin.k8s.spark.ingress.host`. |
## Future work
- Smaller interpreter docker image.
- Blocking communication between interpreter Pods.
- The Spark interpreter Pod has a Role with CRUD permissions for any Pod/Service in the same namespace. This should be restricted to Spark executor Pods only.
- Per-note interpreter mode by default when Zeppelin is running on Kubernetes.
## Development
Instead of building the Zeppelin distribution package and docker image every time during development,
Zeppelin can run locally (for example inside your IDE in debug mode) and run interpreters using [K8sStandardInterpreterLauncher](https://github.com/apache/zeppelin/blob/master/zeppelin-plugins/launcher/k8s-standard/src/main/java/org/apache/zeppelin/interpreter/launcher/K8sStandardInterpreterLauncher.java) by configuring the following environment variables.
| Environment variable | Value | Description |
| ----- | ----- | ----- |
| `ZEPPELIN_RUN_MODE` | `k8s` | Make Zeppelin run interpreters on Kubernetes |
| `ZEPPELIN_K8S_PORTFORWARD` | `true` | Enable port forwarding from local Zeppelin instance to Interpreters running on Kubernetes |
| `ZEPPELIN_K8S_CONTAINER_IMAGE` | `<image>:<version>` | Zeppelin interpreter docker image to use |
| `ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE` | `<image>:<version>` | Spark docker image to use |
| `ZEPPELIN_K8S_NAMESPACE` | `<k8s namespace>` | Kubernetes namespace to use |
| `KUBERNETES_AUTH_TOKEN` | `<token>` | Kubernetes auth token to create resources |
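
For a local development session, the variables from the table could be set in the shell before starting Zeppelin. This is only a sketch: the image names reuse the examples built earlier on this page, and the `default` namespace is an assumption, so replace them with the values that apply to your cluster.

```sh
# Run interpreters on Kubernetes and forward ports from the local Zeppelin instance
export ZEPPELIN_RUN_MODE=k8s
export ZEPPELIN_K8S_PORTFORWARD=true

# Example image names taken from earlier on this page; adjust to your own images
export ZEPPELIN_K8S_CONTAINER_IMAGE="zeppelin-interpreter:0.13.0-SNAPSHOT"
export ZEPPELIN_K8S_SPARK_CONTAINER_IMAGE="spark:2.4.0"

# Assumed namespace; change it if your interpreters run elsewhere
export ZEPPELIN_K8S_NAMESPACE=default

# Supply your own token; it is deliberately not set here
# export KUBERNETES_AUTH_TOKEN=<token>
```

When running from an IDE, the same variables can be set in the run configuration instead of the shell.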