{% include JB/setup %}
Hadoop Submarine is the latest machine learning framework subproject of Hadoop, introduced in the Hadoop 3.1 release. It allows Hadoop to support TensorFlow, MXNet, Caffe, Spark, and other frameworks, providing a full-featured system for machine learning algorithm development, distributed model training, model management, and model publishing. Combined with Hadoop's intrinsic data storage and data processing capabilities, it enables data scientists to efficiently mine value from their data.
A deep learning algorithm project involves many stages: data acquisition, data processing, data cleaning, interactive visual programming and parameter tuning, algorithm testing, algorithm publishing, job scheduling, offline model training, online model serving, and more. Zeppelin is a web-based notebook that supports interactive data analysis. You can use SQL, Scala, Python, and other languages to create data-driven, interactive, collaborative documents.
You can use the more than twenty interpreters in Zeppelin (for example Spark, Hive, Cassandra, Elasticsearch, Kylin, HBase, etc.) to collect, clean, and extract features from the data in Hadoop, completing the data preprocessing required before machine learning model training.
By integrating Submarine into Zeppelin, we can use Zeppelin's data discovery, data analysis, data visualization, and collaboration capabilities to visualize the results of algorithm development and parameter tuning during machine learning model training.
The figure above shows, from the system architecture perspective, how Submarine works together with Zeppelin for machine learning algorithm development and model training.
After you install and deploy Hadoop 3.1+ and Zeppelin, Submarine creates a fully separate Zeppelin Submarine interpreter Docker container in YARN for each user. This container includes the development and runtime environment for TensorFlow. Zeppelin Server connects to the Zeppelin Submarine interpreter Docker container in YARN, which allows algorithm engineers to perform algorithm development and data visualization in a standalone TensorFlow environment in Zeppelin Notebook.
After the algorithm is developed, the algorithm engineer can submit it directly from Zeppelin to YARN for offline training, and follow the model training in real time through the TensorBoard instance that Submarine starts for each algorithm engineer.
Not only can you complete model training for your algorithm, you can also use the more than twenty interpreters in Zeppelin to handle the model's data preprocessing. For example, you can perform data extraction, filtering, and feature extraction through the Spark interpreter in the algorithm Note.
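For illustration, an extraction, filtering, and feature-extraction step of the kind described above might look like the following sketch. Plain Python stands in here for what you would normally write in a Spark paragraph, and the records, fields, and filtering rule are all invented for the example:

```python
# A minimal, self-contained sketch of a preprocessing step.
# The records and fields below are invented for illustration;
# in Zeppelin this work would typically run in a Spark paragraph.

raw_records = [
    {"user": "alice", "clicks": 12, "purchases": 2},
    {"user": "bob",   "clicks": 0,  "purchases": 0},
    {"user": "carol", "clicks": 7,  "purchases": 1},
]

# Filtering: drop records with no activity.
active = [r for r in raw_records if r["clicks"] > 0]

# Feature extraction: turn each record into a numeric feature vector.
def to_features(record):
    clicks = record["clicks"]
    return [float(clicks), record["purchases"] / clicks]

feature_matrix = [to_features(r) for r in active]
print(feature_matrix)
```

The resulting feature matrix is what a later model-training paragraph would consume.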
In the future, you will also be able to use Zeppelin's upcoming Workflow orchestration service: complete Spark and Hive data processing and TensorFlow model training in a single Note, organize the steps visually into a workflow, and schedule the jobs in a production environment.
The figure above shows, from the internal implementation perspective, how Submarine combines with Zeppelin for machine learning algorithm development and model training.
The algorithm engineer creates a TensorFlow notebook (left image) in Zeppelin by using the Submarine interpreter.
It is important to note that you need to complete the development of the entire algorithm in a Note.
You can use Spark for data preprocessing in some of the paragraphs in Note.
Use Python in the other paragraphs of the Note for TensorFlow algorithm development and debugging. Submarine creates a Zeppelin Submarine Interpreter Docker Container for you in YARN, which contains the following features and services:
Submarine interpreter Docker image: Submarine provides you with an image file that supports TensorFlow (both CPU and GPU versions), with commonly used Python algorithm libraries preinstalled. You can also install other development dependencies you need on top of the base image provided by Submarine.
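For instance, a hypothetical Dockerfile sketch for extending the base image (the image name `submarine-interpreter-base:cpu` and the package list are placeholders for illustration, not names actually published by Submarine):

```dockerfile
# Hypothetical sketch: "submarine-interpreter-base:cpu" is a placeholder,
# not an actual published image tag.
FROM submarine-interpreter-base:cpu

# Add the extra Python libraries your algorithm needs.
RUN pip install --no-cache-dir pandas scikit-learn
```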
When you have completed development of an algorithm module, you can create a new paragraph in the Note and type `%submarine dashboard`. Zeppelin will create a Submarine Dashboard. The machine learning algorithm written in this Note can be submitted to YARN as a job by selecting the JOB RUN command option in the control panel, which creates a TensorFlow model-training Docker container. The container contains the following sections:
Submarine TensorFlow Docker image: Submarine provides you with an image file that supports TensorFlow (both CPU and GPU versions), with commonly used Python algorithm libraries preinstalled. You can also install other development dependencies you need on top of the base image provided by Submarine.
After creating a Note with the Submarine interpreter in Zeppelin, you can add paragraphs to the Note as needed. Using the `%submarine.sh` identifier, you can use shell commands to perform various operations in the Submarine Interpreter Docker Container, for example:
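A sketch of such a shell paragraph (the commands below are ordinary examples of things you might inspect, not Submarine-specific operations):

```
%submarine.sh

# Inspect the Python environment inside the interpreter container
python --version
pip list

# Check the working directory and its contents
pwd
ls -l
```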
You can add one or more paragraphs to the Note and write the TensorFlow algorithm modules in Python using the `%submarine.python` identifier.
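For example, a minimal sketch of such a paragraph (the toy model is invented for illustration and uses the TensorFlow 1.x API matching the TensorFlow 1.10 images mentioned later in this document; it is not a Submarine API):

```
%submarine.python

import tensorflow as tf

# Invented minimal example: fit y = 2x with a single trainable variable.
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])

w = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(w * x - y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(w))  # w should approach 2.0
```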
After writing the TensorFlow algorithm with `%submarine.python`, you can add a paragraph to the Note, enter `%submarine dashboard`, and execute it. Zeppelin will create a Submarine Dashboard.
With the Submarine Dashboard you can perform all of Submarine's operational controls, for example:
- Usage: Displays Submarine's command descriptions to help developers locate problems.
- Refresh: Zeppelin will clear all of your input in the Dashboard.
- Tensorboard: You will be redirected to the TensorBoard web interface that Submarine creates for each user. With TensorBoard you can view the status of TensorFlow model training in real time.
Command:

- JOB RUN: Displays the parameter input interface for submitting a job.
- JOB STOP: You can choose to execute the JOB STOP command to stop a TensorFlow model training task that has been submitted and is running.
- TENSORBOARD START: You can choose to execute the TENSORBOARD START command to create your TensorBoard Docker container.
- TENSORBOARD STOP: You can choose to execute the TENSORBOARD STOP command to stop and destroy your TensorBoard Docker container.
JOB RUN execution.

The Zeppelin Submarine interpreter provides the following properties to customize the Submarine interpreter.
The Docker image files are stored in the `zeppelin/scripts/docker/submarine` directory.
- submarine interpreter CPU version
- submarine interpreter GPU version
- tensorflow 1.10 & hadoop 3.1.2 CPU version
- tensorflow 1.10 & hadoop 3.1.2 GPU version
0.1.0 (Zeppelin 0.9.0):
Hadoop Submarine Project: https://hadoop.apache.org/submarine

Youtube Submarine Channel: https://www.youtube.com/channel/UC4JBt8Y8VJ0BW0IM9YpdCyQ