[{"title":"Deploy Submarine On K8s","type":0,"sectionRef":"#","url":"docs/adminDocs/k8s/helm","content":"","keywords":""},{"title":"Deploy Submarine Using Helm Chart (Recommended)","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#deploy-submarine-using-helm-chart-recommended","content":"Submarine's Helm chart deploys the Submarine Server, the TF/PyTorch Operator, the Notebook controller and Traefik. We use the TF/PyTorch operator to run TF/PyTorch jobs, the notebook controller to manage Jupyter notebooks, and Traefik as a reverse proxy. "},{"title":"Install Helm","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#install-helm","content":"Helm v3 is the minimum requirement. See the installation guide: https://helm.sh/docs/intro/install/ "},{"title":"Install Submarine","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#install-submarine","content":"The Submarine Helm charts are released with the source code for now. Please go to http://submarine.apache.org/download.html to download the source. Install Helm charts from source code cd <PathTo>/submarine helm install submarine ./helm-charts/submarine Copy This installs Submarine in the \"default\" namespace. The images come from Docker Hub (apache/submarine). See ./helm-charts/submarine/values.yaml for more details. To use a different namespace, such as \"submarine\": kubectl create namespace submarine helm install submarine ./helm-charts/submarine -n submarine Copy Note that you may encounter the following error during installation: Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: podgroups.scheduling.incubator.k8s.io, existing_kind: apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition, new_kind: apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition Copy It might be caused by previously installed Submarine charts. 
Fix it by running: kubectl delete crd/tfjobs.kubeflow.org && kubectl delete crd/podgroups.scheduling.incubator.k8s.io && kubectl delete crd/pytorchjobs.kubeflow.org Copy Verify installation Once Submarine is installed, check it with the command below; you should see output similar to the following: kubectl get pods Copy NAME READY STATUS RESTARTS AGE notebook-controller-deployment-5db8b6cbf7-k65jm 1/1 Running 0 5s pytorch-operator-7ff5d96d59-gx7f5 1/1 Running 0 5s submarine-database-8d95d74f7-ntvqp 1/1 Running 0 5s submarine-server-b6cd4787b-7bvr7 1/1 Running 0 5s submarine-traefik-9bb6f8577-66sx6 1/1 Running 0 5s tf-job-operator-7844656dd-lfgmd 1/1 Running 0 5s Copy "},{"title":"Configure volume type","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#configure-volume-type","content":"Submarine supports several volume types, currently hostPath (default) and NFS. The volume type can be configured in ./helm-charts/submarine/values.yaml, or you can override the default values in values.yaml with the helm CLI. hostPath: With hostPath, data is stored directly on your node. Usage: Configure the settings in ./helm-charts/submarine/values.yaml. To enable hostPath storage, set .storage.type to host. To set the root path for your storage, set .storage.host.root to <any-path>. Example: # ./helm-charts/submarine/values.yaml storage: type: host host: root: /tmp Copy NFS (Network File System): NFS allows multiple clients to access a shared space. Prerequisite: A pre-existing NFS server. You have two options. 
Option 1: Create an NFS server. kubectl create -f ./dev-support/nfs-server/nfs-server.yaml Copy This creates an nfs-server pod in the Kubernetes cluster and exposes the NFS server IP at 10.96.0.2. Option 2: Use your own NFS server. Install NFS dependencies on your nodes: Ubuntu: apt-get install -y nfs-common Copy CentOS: yum install nfs-utils Copy Usage: Configure the settings in ./helm-charts/submarine/values.yaml. To enable NFS storage, set .storage.type to nfs. To set the IP of the NFS server, set .storage.nfs.ip to <any-ip>. Example: # ./helm-charts/submarine/values.yaml storage: type: nfs nfs: ip: 10.96.0.2 Copy "},{"title":"Access to Submarine Server","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#access-to-submarine-server","content":"By default, the Submarine server exposes port 8080 within the K8s cluster. As of Submarine v0.5, Traefik is used as the reverse proxy by default. If you don't want to use Traefik, set the value below to false in ./helm-charts/submarine/values.yaml. # Use Traefik by default traefik: enabled: true Copy To access the server from outside the cluster, we use the Traefik ingress controller and NodePort. Please refer to ./helm-charts/submarine/charts/traefik/values.yaml and the Traefik docs for more details if you want to customize the default values for Traefik. Notice: If you use kind to run a local Kubernetes cluster, please refer to the kind docs and set \"extraPortMappings\" when creating the k8s cluster. kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane extraPortMappings: - containerPort: 32080 hostPort: [the port you want to access] Copy # Use nodePort and Traefik ingress controller by default. # To access the submarine server, open the following URL in your browser. http://127.0.0.1:32080 Copy If minikube is installed, use the following command to find the URL to the Submarine server. 
$ minikube service submarine-traefik --url Copy "},{"title":"Uninstall Submarine","type":1,"pageTitle":"Deploy Submarine On K8s","url":"docs/adminDocs/k8s/helm#uninstall-submarine","content":"helm delete submarine Copy "},{"title":"Submarine Local Deployment","type":0,"sectionRef":"#","url":"docs/","content":"","keywords":""},{"title":"Prerequisite","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#prerequisite","content":"kubectl, helm (Helm v3 is the minimum requirement), minikube. "},{"title":"Deploy Kubernetes Cluster","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#deploy-kubernetes-cluster","content":"$ minikube start --vm-driver=docker --cpus 8 --memory 4096 --disk-size=20G --kubernetes-version v1.15.11 Copy "},{"title":"Install Submarine on Kubernetes","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#install-submarine-on-kubernetes","content":"$ git clone https://github.com/apache/submarine.git $ cd submarine $ helm install submarine ./helm-charts/submarine Copy NAME: submarine LAST DEPLOYED: Fri Jan 29 05:35:36 2021 NAMESPACE: default STATUS: deployed REVISION: 1 TEST SUITE: None Copy "},{"title":"Verify installation","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#verify-installation","content":"Once Submarine is installed, check it with the command below; you should see output similar to the following: $ kubectl get pods Copy NAME READY STATUS RESTARTS AGE notebook-controller-deployment-5db8b6cbf7-k65jm 1/1 Running 0 5s pytorch-operator-7ff5d96d59-gx7f5 1/1 Running 0 5s submarine-database-8d95d74f7-ntvqp 1/1 Running 0 5s submarine-server-b6cd4787b-7bvr7 1/1 Running 0 5s submarine-traefik-9bb6f8577-66sx6 1/1 Running 0 5s tf-job-operator-7844656dd-lfgmd 1/1 Running 0 5s Copy Warning: you may encounter the following error during installation: Error: rendered manifests contain a resource that already exists. 
Unable to continue with install: existing resource conflict: namespace: , name: podgroups.scheduling.incubator.k8s.io, existing_kind: apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition, new_kind: apiextensions.k8s.io/v1beta1, Kind=CustomResourceDefinition Copy It might be caused by previously installed Submarine charts. Fix it by running: $ kubectl delete crd/tfjobs.kubeflow.org && kubectl delete crd/podgroups.scheduling.incubator.k8s.io && kubectl delete crd/pytorchjobs.kubeflow.org Copy "},{"title":"Use Port Forwarding to Access Submarine in a Cluster","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#use-port-forwarding-to-access-submarine-in-a-cluster","content":"# Listen on port 32080 on all addresses, forwarding to port 80 in the pod $ kubectl port-forward --address 0.0.0.0 service/submarine-traefik 32080:80 Copy "},{"title":"Open Workbench in the browser.","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#open-workbench-in-the-browser","content":"Open http://127.0.0.1:32080. The default username and password are admin and admin.  "},{"title":"Uninstall Submarine","type":1,"pageTitle":"Submarine Local Deployment","url":"docs/#uninstall-submarine","content":"$ helm delete submarine Copy "},{"title":"Submarine on K8s","type":0,"sectionRef":"#","url":"docs/adminDocs/k8s/README","content":"","keywords":""},{"title":"Install Submarine","type":1,"pageTitle":"Submarine on K8s","url":"docs/adminDocs/k8s/README#install-submarine","content":""},{"title":"Setup Kubernetes","type":1,"pageTitle":"Submarine on K8s","url":"docs/adminDocs/k8s/README#setup-kubernetes","content":"Submarine can be deployed on any K8s environment as long as the K8s version matches. If you don't have a running K8s cluster, you can set one up using Docker Desktop, MiniKube, or kind (Kubernetes-in-Docker). In our experience, Docker Desktop is the easier choice. 
"},{"title":"Install Submarine Use Helm Charts","type":1,"pageTitle":"Submarine on K8s","url":"docs/adminDocs/k8s/README#install-submarine-use-helm-charts","content":"Once you have an up-and-running K8s cluster, you can follow the Submarine Helm Charts Guide to deploy Submarine services on the K8s cluster in minutes. "},{"title":"Setup a Kubernetes cluster using Kind","type":0,"sectionRef":"#","url":"docs/adminDocs/k8s/kind","content":"","keywords":""},{"title":"Create K8s cluster","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#create-k8s-cluster","content":"We recommend using kind to set up a Kubernetes cluster on a local machine. Run the following commands: kind create cluster --image kindest/node:v1.15.6 --name submarine kubectl create namespace submarine Copy "},{"title":"Kubernetes Dashboard (optional)","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#kubernetes-dashboard-optional","content":""},{"title":"Deploy","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#deploy","content":"To deploy Dashboard, execute the following command: kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta8/aio/deploy/recommended.yaml Copy "},{"title":"Create RBAC","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#create-rbac","content":"To grant the dashboard access to the cluster, run the following commands: kubectl create serviceaccount dashboard-admin-sa kubectl create clusterrolebinding dashboard-admin-sa --clusterrole=cluster-admin --serviceaccount=default:dashboard-admin-sa Copy "},{"title":"Get access token (optional)","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#get-access-token-optional","content":"If you want to use a token to log in to the dashboard, run the following commands to get it: kubectl get secrets # select the right 
dashboard-admin-sa-token to describe the secret kubectl describe secret dashboard-admin-sa-token-6nhkx Copy "},{"title":"Start dashboard service","type":1,"pageTitle":"Setup a Kubernetes cluster using Kind","url":"docs/adminDocs/k8s/kind#start-dashboard-service","content":"To start the dashboard service, run the following command: kubectl proxy Copy Now access Dashboard at: http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ "},{"title":"tensorflow","type":0,"sectionRef":"#","url":"docs/adminDocs/k8s/tensorflow","content":"","keywords":""},{"title":"TFJob","type":1,"pageTitle":"tensorflow","url":"docs/adminDocs/k8s/tensorflow#tfjob","content":"We support TensorFlow jobs on Kubernetes by using the tf-operator as a runtime. For more info about tf-operator see here. "},{"title":"Deploy tf-operator","type":1,"pageTitle":"tensorflow","url":"docs/adminDocs/k8s/tensorflow#deploy-tf-operator","content":"If you don't have the submarine namespace on your K8s cluster, create it first: kubectl create namespace submarine Then run the following commands: kubectl apply -f ./dev-support/k8s/tfjob/crd.yaml kubectl kustomize ./dev-support/k8s/tfjob/operator | kubectl apply -f - Copy Since K8s 1.14, kubectl also supports the management of Kubernetes objects using a kustomization file. 
For more info see kustomization. The default namespace is submarine. To change it, edit ./dev-support/k8s/tfjob/operator/kustomization.yaml and replace ${NAMESPACE} as shown below: apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: ${NAMESPACE} resources: - cluster-role-binding.yaml - cluster-role.yaml - deployment.yaml - service-account.yaml - service.yaml commonLabels: kustomize.component: tf-job-operator images: - name: apache/submarine newName: apache/submarine newTag: tf_operator-v1.1.0-g92389064 Copy "},{"title":"run-tensorflow-experiment-ui","type":0,"sectionRef":"#","url":"docs/adminDocs/k8s/run-tensorflow-experiment-ui","content":"","keywords":""},{"title":"Steps to run Tensorflow Experiment","type":1,"pageTitle":"run-tensorflow-experiment-ui","url":"docs/adminDocs/k8s/run-tensorflow-experiment-ui#steps-to-run-tensorflow-experiment","content":"Click + New Experiment on the \"Experiment\" page. Click Define your experiment. Give the experiment a name, like \"mnist-example\". Command: python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150 Image: apache/submarine:tf-mnist-with-summaries-1.0 Click Next Step. Choose Distributed Tensorflow. Click Add new spec twice to add two new specs (roles): one is Worker, the other is PS; leave the rest of the parameters unchanged. Click Next Step to review your parameters before submitting the job. It should look like the following: Name\tmnist-example-111 Command\tpython /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150 Image\tapache/submarine:tf-mnist-with-summaries-1.0 Environment Variables Ps\tcpu=1,nvidia.com/gpu=0,memory=1024M Worker\tcpu=1,nvidia.com/gpu=0,memory=1024M  Click Submit to submit the job. The new experiment then appears as running in the Experiment list, where you can get logs, etc. 
directly from the UI "},{"title":"Running Submarine on YARN","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/README","content":"","keywords":""},{"title":"Hadoop version","type":1,"pageTitle":"Running Submarine on YARN","url":"docs/adminDocs/yarn/README#hadoop-version","content":"Must: Apache Hadoop version newer than 2.7.3. Optional: To use the GPU-on-YARN feature with Submarine, make sure Hadoop is at least 2.10.0 (or 3.1.0), and follow Enable GPU on YARN 2.10.0+ to enable the GPU-on-YARN feature. To run training jobs in Docker containers, make sure Hadoop is at least 2.8.2, and follow Enable Docker on YARN 2.8.2+ to enable the Docker-on-YARN feature. "},{"title":"Submarine YARN Runtime Guide","type":1,"pageTitle":"Running Submarine on YARN","url":"docs/adminDocs/yarn/README#submarine-yarn-runtime-guide","content":"The YARN Runtime Guide explains how to use Submarine to run jobs on YARN, with or without Docker. "},{"title":"HowToRun","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/workbench/HowToRun","content":"","keywords":""},{"title":"Two versions of Submarine Workbench","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#two-versions-of-submarine-workbench","content":"Angular (default). Vue (this is the old version, and it will be replaced by the Angular version in the future). WARNING: Please open a new incognito window when you switch between versions of Submarine Workbench. "},{"title":"Launch the Submarine Workbench(Angular)","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#launch-the-submarine-workbenchangular","content":"Because Submarine Workbench depends on the Submarine database, you need to run the Submarine database Docker container first. 
docker run -it -p 3306:3306 -d --name submarine-database -e MYSQL_ROOT_PASSWORD=password apache/submarine:database-<REPLACE_VERSION> docker run -it -p 8080:8080 -d --link=submarine-database:submarine-database --name submarine-server apache/submarine:server-<REPLACE_VERSION> Copy The login page of Submarine Workbench is then available at http://127.0.0.1:8080. "},{"title":"Check the data in the submarine-database","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#check-the-data-in-the-submarine-database","content":"Step 1: Enter the submarine-database container docker exec -it submarine-database bash Copy Step 2: Enter the MySQL database mysql -uroot -ppassword Copy Step 3: List the data in the table -- list all databases show databases; -- choose a database use ${target_database}; -- list all tables show tables; -- list the data in the table select * from ${target_table}; Copy Run Submarine Workbench without docker "},{"title":"Run Submarine Workbench","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#run-submarine-workbench","content":"cd submarine ./bin/submarine-daemon.sh [start|stop|restart] Copy To start the workbench server for the first time, you need to download the MySQL JDBC jar and put it in workbench/lib. Alternatively, add the getMysqlJar parameter to download the MySQL jar automatically. cd submarine ./bin/submarine-daemon.sh start getMysqlJar Copy "},{"title":"submarine-env.sh","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#submarine-envsh","content":"submarine-env.sh is automatically executed each time the submarine-daemon.sh script is executed, so we can use submarine-env.sh to configure the submarine-daemon.sh script and the environment variables of the SubmarineServer process. Name\tVariable JAVA_HOME\tSet your java home path, default is java. SUBMARINE_JAVA_OPTS\tSet the JAVA OPTS parameter when the Submarine Workbench process starts. 
If you need to debug the Submarine Workbench process, you can set it to -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SUBMARINE_MEM\tSet the java memory parameter when the Submarine Workbench process starts. MYSQL_JAR_URL\tThe customized URL to download the MySQL JDBC jar. MYSQL_VERSION\tThe version of the MySQL JDBC jar to download. The default value is 5.1.39. It's used to generate the default value of MYSQL_JAR_URL. "},{"title":"submarine-site.xml","type":1,"pageTitle":"HowToRun","url":"docs/adminDocs/yarn/workbench/HowToRun#submarine-sitexml","content":"submarine-site.xml is the configuration file for the entire Submarine system. Name\tVariable submarine.server.addr\tSubmarine server address, default is 0.0.0.0 submarine.server.port\tSubmarine server port, default 8080 submarine.ssl\tShould SSL be used by the Submarine servers? Default false submarine.server.ssl.port\tServer SSL port (used when the ssl property is set to true), default 8483 submarine.ssl.client.auth\tShould client authentication be used for SSL connections? submarine.ssl.keystore.path\tPath to keystore relative to Submarine configuration directory submarine.ssl.keystore.type\tThe format of the given keystore (e.g. JKS or PKCS12) submarine.ssl.keystore.password\tKeystore password. Can be obfuscated by the Jetty Password tool submarine.ssl.key.manager.password\tKey Manager password. Defaults to keystore password. Can be obfuscated. submarine.ssl.truststore.path\tPath to truststore relative to Submarine configuration directory. Defaults to the keystore path submarine.ssl.truststore.type\tThe format of the given truststore (e.g. JKS or PKCS12). Defaults to the same type as the keystore type submarine.ssl.truststore.password\tTruststore password. Can be obfuscated by the Jetty Password tool. Defaults to the keystore password workbench.web.war\tSubmarine Workbench web war file path. 
"},{"title":"Test and Troubleshooting","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/TestAndTroubleshooting","content":"","keywords":""},{"title":"Test with a tensorflow job","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#test-with-a-tensorflow-job","content":"Distributed-shell + GPU + cgroup  ... \\ job run \\ --env DOCKER_JAVA_HOME=/opt/java \\ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf-gpu \\ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \\ --worker_docker_image tf-1.13.1-gpu:0.0.1 \\ --ps_docker_image tf-1.13.1-cpu:0.0.1 \\ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \\ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \\ --num_ps 0 \\ --ps_resources memory=4G,vcores=2,gpu=0 \\ --ps_launch_cmd \"python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0\" \\ --worker_resources memory=4G,vcores=2,gpu=1 --verbose \\ --num_workers 1 \\ --worker_launch_cmd \"python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1\" Copy "},{"title":"Issues:","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issues","content":""},{"title":"Issue 1: Fail to start nodemanager after system reboot","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issue-1-fail-to-start-nodemanager-after-system-reboot","content":"2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997) 2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED Copy Solution: Grant user yarn the access to /sys/fs/cgroup/cpu,cpuacct, which is the subfolder of cgroup mount destination. 
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct Copy If GPUs are used, access to the cgroup devices folder is needed as well: chown :yarn -R /sys/fs/cgroup/devices chmod g+rwx -R /sys/fs/cgroup/devices Copy "},{"title":"Issue 2: container-executor permission denied","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issue-2-container-executor-permission-denied","content":"2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command: java.io.IOException: Cannot run program \"/etc/yarn/sbin/Linux-amd64-64/container-executor\": error=13, Permission denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.hadoop.util.Shell.runCommand(Shell.java:938) at org.apache.hadoop.util.Shell.run(Shell.java:901) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) Copy Solution: The permissions of /etc/yarn/sbin/Linux-amd64-64/container-executor should be 6050. "},{"title":"Issue 3：How to get docker service log","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issue-3：how-to-get-docker-service-log","content":"Solution: we can get the Docker service log with the following command: journalctl -u docker Copy "},{"title":"Issue 4：docker can't remove containers with errors like device or resource busy","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issue-4：docker-cant-remove-containers-with-errors-like-device-or-resource-busy","content":"$ docker rm 0bfafa146431 Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy Copy Solution: to find which process leads to a device or resource 
busy, we can add a shell script named find-busy-mnt.sh: #!/usr/bin/env bash # A simple script to get information about mount points and pids and their # mount namespaces. if [ $# -ne 1 ];then echo \"Usage: $0 <devicemapper-device-id>\" exit 1 fi ID=$1 MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null` [ -z \"$MOUNTS\" ] && echo \"No pids found\" && exit 0 printf \"PID\\tNAME\\t\\tMNTNS\\n\" echo \"$MOUNTS\" | while read LINE; do PID=`echo $LINE | cut -d \":\" -f1 | cut -d \"/\" -f3` # Ignore self and thread-self if [ \"$PID\" == \"self\" ] || [ \"$PID\" == \"thread-self\" ]; then continue fi NAME=`ps -q $PID -o comm=` MNTNS=`readlink /proc/$PID/ns/mnt` printf \"%s\\t%s\\t\\t%s\\n\" \"$PID\" \"$NAME\" \"$MNTNS\" done Copy Then kill the process whose PID the script reports: $ chmod +x find-busy-mnt.sh $ ./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a # PID NAME MNTNS # 5007 ntpd mnt:[4026533598] $ kill -9 5007 Copy "},{"title":"Issue 5：Yarn failed to start containers","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/adminDocs/yarn/TestAndTroubleshooting#issue-5：yarn-failed-to-start-containers","content":"If the number of GPUs required by applications is larger than the number of GPUs in the cluster, some containers cannot be created. "},{"title":"setup-jupyter","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/workbench/notebook/setup-jupyter","content":"","keywords":""},{"title":"Experiment environment","type":1,"pageTitle":"setup-jupyter","url":"docs/adminDocs/yarn/workbench/notebook/setup-jupyter#experiment-environment","content":""},{"title":"Setup Kubernetes","type":1,"pageTitle":"setup-jupyter","url":"docs/adminDocs/yarn/workbench/notebook/setup-jupyter#setup-kubernetes","content":"We recommend using kind to set up a Kubernetes cluster on a local machine. You can use Extra mounts to mount a host path into the kind node and Extra port mappings to forward ports to the kind node. 
Please refer to the kind configuration docs for more details. You need to create a kind config file. The following is an example: kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane extraMounts: # mount /tmp/submarine on the host to /tmp/submarine on the node - hostPath: /tmp/submarine containerPath: /tmp/submarine extraPortMappings: - containerPort: 80 hostPort: 80 protocol: TCP # exposing additional ports to be used for NodePort services - containerPort: 30070 hostPort: 8888 protocol: TCP Copy Run the following commands: kind create cluster --image kindest/node:v1.15.6 --config <path-to-kind-config> --name k8s-submarine kubectl create namespace submarine Copy "},{"title":"Deploy Jupyter Notebook","type":1,"pageTitle":"setup-jupyter","url":"docs/adminDocs/yarn/workbench/notebook/setup-jupyter#deploy-jupyter-notebook","content":"Once you have a running Kubernetes cluster, you can write a YAML file to deploy a Jupyter notebook. In this example YAML, we use jupyter/minimal-notebook to run a single notebook on the kind node. kubectl apply -f jupyter.yaml --namespace submarine Copy Once the Jupyter notebook is running, you can access the notebook server from a browser on the local machine at http://localhost:8888. You can enter and store a password for your notebook server with: kubectl exec -it <jupyter-pod-name> --namespace submarine -- jupyter notebook password Copy After restarting the notebook server, you can log in to the Jupyter notebook with your new password. 
If you want to use JupyterLab: http://localhost:8888/lab Copy "},{"title":"README.zh-CN","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/workbench/README.zh-CN","content":"","keywords":""},{"title":"Register","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#register","content":"Every user who needs to use Submarine for machine learning algorithm development can log in to the web home page of Submarine Workbench. On the home page, click the registration link and fill in a user name, email address and password to complete registration. At this point the user's status is Waiting for Approval. After the administrator receives the registration request in Submarine Workbench, they set the user's operation permissions, organization and department, and allocated resources, and then set the user's status to Approved; only then can the user log in to Submarine Workbench. "},{"title":"Login","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#login","content":"Each Submarine user enters their user name and password on the Login page to log in to Home, the home page of Submarine Workbench. "},{"title":"Home","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#home","content":"On the Home page of Submarine Workbench, four charts at the top show the user's resource usage and task execution status. The Quick Start list shows links to the most frequently used features in Workbench so that users can get to work quickly. The Open Recent list shows the nine projects the user has used most recently so that you can get to work quickly. The What's New? list shows some of the latest features and project information released by Submarine so that you can follow the latest progress of the Submarine project. "},{"title":"Workspace","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#workspace","content":"Workspace consists mainly of five tab pages; the title of each tab page shows the total number of items it contains. "},{"title":"Project","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#project","content":"The Project page displays all of the projects the user has created as cards.  Each Project card consists of the following parts: Project type: Submarine currently supports six types of machine learning algorithm frameworks and development languages: Notebook, Python, R, Scala, Tensorflow and PyTorch, each identified by a corresponding icon on the project card. Project Tags: users can attach different tags to each Project to make searching and management easier. Github/Gitlab integration: Submarine Workbench is integrated with Github/Gitlab, and each Project supports the Watch, Star, Fork and Comment operations in Workbench. Watch: [TODO] Star: [TODO] Fork: [TODO] Comment: users can comment on the project. Edit: by double-clicking a project or clicking the Edit button, users can open the project in Notebook for algorithm development and other work. Download: by clicking the Download button, users can package the project and download it locally. Setting: edit project information, such as the project's name, description, sharing level and permissions. Delete: delete all files contained in the project.  Add New Project: on the project page, click the Add New 
Project button to display the project creation wizard; a new project can be created in just three steps. Step 1: in the Base Information step, fill in the project name and project description.  Visibility: sets the project's visibility level. Private: (default) makes the project private; none of the files in the project are made public, but the project's execution results can be made public individually in Notebook so that others can view the project's visual report. Team: makes the project a team project; select the team name in the team selection box, and other members of the team can access the project according to the configured permissions. Public: makes the project public; all users in Workbench can find and view the project through search. Permission: sets the project's access permissions; the permission settings appear only when the project's Visibility is set to Team or Public. Can View: when the project's Visibility is set to Team, other members of the team can only view the project's files. When the project's Visibility is set to Public, other members of Workbench can only view the project's files. Can Edit: when the project's Visibility is set to Team, other members of the team can view and edit the project's files. When the project's Visibility is set to Public, other members of Workbench can view and edit the project's files. Can Execute: when the project's Visibility is set to Team, other members of the team can view, edit and execute the project's files. When the project's Visibility is set to Public, other members of Workbench can view, edit and execute the project's files. Step 2: in the Initial Project step, Workbench provides four ways to initialize the project. Template: Workbench has several built-in project templates for different development languages and algorithm frameworks; you can choose any template to initialize your project and run it directly in Notebook without any modification, which is especially suitable for newcomers who want a quick first experience. Blank: creates a blank project; you can add project files manually in Notebook later. Upload: initializes your project by uploading a file in notebook format; the notebook format is compatible with the Jupyter Notebook and Zeppelin Notebook file formats. Git Repo: initializes the project by forking the contents of a repository in your Github/Gitlab account. Step 3: preview the files contained in the project.  Save: saves the project to Workspace. Open In Notebook: saves the project to Workspace and opens it in Notebook. "},{"title":"Release","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#release","content":"[TODO] "},{"title":"Training","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#training","content":"[TODO] "},{"title":"Team","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#team","content":"[TODO] "},{"title":"Shared","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#shared","content":"[TODO] "},{"title":"Interpreters","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#interpreters","content":"[TODO] "},{"title":"Job","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#job","content":"[TODO] 
"},{"title":"Data","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#data","content":"[TODO] "},{"title":"Model","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#model","content":"[TODO] "},{"title":"Manager","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#manager","content":""},{"title":"User","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#user","content":"[TODO] "},{"title":"Team","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#team-1","content":"[TODO] "},{"title":"Data Dict","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#data-dict","content":"[TODO] "},{"title":"Department","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#department","content":"[TODO] "},{"title":"How to run workbench","type":1,"pageTitle":"README.zh-CN","url":"docs/adminDocs/yarn/workbench/README.zh-CN#how-to-run-workbench","content":"How To Run Submarine Workbench Guide "},{"title":"README","type":0,"sectionRef":"#","url":"docs/adminDocs/yarn/workbench/README","content":"","keywords":""},{"title":"Register","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#register","content":"Everyone who needs to use Submarine for machine learning algorithm development can log in to Submarine Workbench's web home page. On the home page, click the registration link and fill in a user name, email address and password to register. At this point, the user's status is Waiting for Approval. After receiving the registration request in Submarine Workbench, the administrator sets the user's operation permissions according to the user's needs, sets the user's organization, allocates resources, and sets the user's status to Approved. The user can then log in to the Submarine Workbench. 
Different users have different permissions. "},{"title":"Login","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#login","content":"Each Submarine user logs in to the Home page of Submarine Workbench by entering their username and password on the Login page. "},{"title":"Home","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#home","content":"On the Submarine Workbench Home page, four charts at the top show the user's resource usage and task execution. The Quick Start list shows links to the most commonly used Workbench features so that users can get to work quickly. The Open Recent list shows the nine items that the user has used most recently, so you can resume work quickly. The What's New? list shows some of the latest features and project information released by Submarine to help you follow the latest developments in the Submarine project. "},{"title":"Workspace","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#workspace","content":"Workspace consists primarily of five tab pages, with the total number of items shown in each tab page's title. "},{"title":"Project","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#project","content":"On the Project page, all the projects created by the user are displayed as cards.  Each Project card consists of the following sections: Project Type: Submarine currently supports six types of machine learning algorithm frameworks and development languages (Notebook, Python, R, Scala, Tensorflow, and PyTorch), each identified by its icon on the project card. Project Tags: Users can tag each Project for easy searching and management. Github/Gitlab integration: Submarine Workbench is integrated with Github/Gitlab, and each Project supports Watch, Star, Fork, and Comment operations in Workbench. 
Watch: [TODO] Star: [TODO] Fork: [TODO] Comment: Users can comment on the project. Edit: Users can open a project in Notebook and do algorithm development by double-clicking the project or clicking the Edit button. Download: Clicking the Download button packages the project and downloads it locally. Setting: Edit project information such as the project name, description, visibility level, and permissions. Delete: Delete the project and all included files.  Add New Project# Clicking the Add New Project button on the project page opens a guide page for creating a project, and you can create a new project in just three steps. Step 1: Fill in the project name and project description in the Base Information step.  Visibility: Set the external visibility level of the project. Private: (Default) The project is private and none of its files are publicly displayed, but the execution results of the project can be made public individually in Notebook so that others can view the project's visual report. Team: The project is a team project; select the team name in the team selection box, and other members of the team can access the project according to the configured permissions. Public: The project is public, and all Workbench users can find and view it through search. Permission: Set the external access rights of the project. The permission settings appear only when the Visibility of the project is set to Team or Public. Can View: When the project's Visibility is set to Team, other members of the team can only view the project's files. When the project's Visibility is set to Public, other Workbench users can only view the project's files. Can Edit: When the project's Visibility is set to Team, other members of the team can view and edit the project's files. When the project's Visibility is set to Public, other Workbench users can view and edit the project's files. 
Can Execute: When the project's Visibility is set to Team, other members of the team can view, edit, and execute the project's files. When the project's Visibility is set to Public, other Workbench users can view, edit, and execute the project's files. Step 2: In the Initial Project step, Workbench provides four ways to initialize the project. Template: Workbench has built-in project templates for several development languages and algorithm frameworks. You can initialize your project from any template and run it directly in Notebook without modification, which makes it especially suitable for newcomers who want a quick trial. Blank: Create a blank project; you can manually add the project's files in Notebook later. Upload: Initialize your project by uploading a notebook-format file; the format is compatible with the Jupyter Notebook and Zeppelin Notebook file formats. Git Repo: Initialize the project by forking the contents of a repository in your Github/Gitlab account. Step 3: Preview the files included in the project.  Save: Save the project to Workspace. Open In Notebook: Save the project to Workspace and open it in Notebook. 
"},{"title":"Release","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#release","content":"[TODO] "},{"title":"Training","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#training","content":"[TODO] "},{"title":"Team","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#team","content":"[TODO] "},{"title":"Shared","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#shared","content":"[TODO] "},{"title":"Interpreters","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#interpreters","content":"[TODO] "},{"title":"Job","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#job","content":"[TODO] "},{"title":"Data","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#data","content":"[TODO] "},{"title":"Model","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#model","content":"[TODO] "},{"title":"Manager","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#manager","content":""},{"title":"User","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#user","content":"[TODO] "},{"title":"Team","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#team-1","content":"[TODO] "},{"title":"Data Dict","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#data-dict","content":"[TODO] "},{"title":"Department","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#department","content":"[TODO] "},{"title":"How to run workbench","type":1,"pageTitle":"README","url":"docs/adminDocs/yarn/workbench/README#how-to-run-workbench","content":"How To Run Submarine Workbench Guide "},{"title":"Environment REST API","type":0,"sectionRef":"#","url":"docs/api/environment","content":"","keywords":""},{"title":"Create Environment","type":1,"pageTitle":"Environment REST 
API","url":"docs/api/environment#create-environment","content":"POST /api/v1/environment Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\", \"anaconda=2020.02=py37_0\", \"anaconda-client=1.7.2=py37_0\", \"anaconda-navigator=1.9.12=py37_0\"], \"pipDependencies\" : [\"apache-submarine==0.5.0\", \"pyarrow==0.17.0\"] } } ' http://127.0.0.1:32080/api/v1/environment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"environmentId\": \"environment_1586156073228_0001\", \"environmentSpec\": { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\", \"anaconda=2020.02=py37_0\", \"anaconda-client=1.7.2=py37_0\", \"anaconda-navigator=1.9.12=py37_0\"], \"pipDependencies\" : [\"apache-submarine==0.5.0\", \"pyarrow==0.17.0\"] } } } } Copy "},{"title":"List environment","type":1,"pageTitle":"Environment REST API","url":"docs/api/environment#list-environment","content":"GET /api/v1/environment Example Request: curl -X GET http://127.0.0.1:32080/api/v1/environment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": [ { \"environmentId\": \"environment_1586156073228_0001\", \"environmentSpec\": { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\", \"anaconda=2020.02=py37_0\", \"anaconda-client=1.7.2=py37_0\", \"anaconda-navigator=1.9.12=py37_0\"], 
\"pipDependencies\" : [\"apache-submarine==0.5.0\", \"pyarrow==0.17.0\"] } } }, { \"environmentId\": \"environment_1586156073228_0002\", \"environmentSpec\": { \"name\": \"my-submarine-env-2\", \"dockerImage\" : \"continuumio/miniconda\", \"kernelSpec\" : { \"name\" : \"team_miniconda_python_3.7\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\"], \"pipDependencies\" : [] } } } ] } Copy "},{"title":"Get environment","type":1,"pageTitle":"Environment REST API","url":"docs/api/environment#get-environment","content":"GET /api/v1/environment/{name} Example Request: curl -X GET http://127.0.0.1:32080/api/v1/environment/my-submarine-env Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"environmentId\": \"environment_1586156073228_0001\", \"environmentSpec\": { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\", \"anaconda=2020.02=py37_0\", \"anaconda-client=1.7.2=py37_0\", \"anaconda-navigator=1.9.12=py37_0\"], \"pipDependencies\" : [\"apache-submarine==0.5.0\", \"pyarrow==0.17.0\"] } } } } Copy "},{"title":"Patch environment","type":1,"pageTitle":"Environment REST API","url":"docs/api/environment#patch-environment","content":"PATCH /api/v1/environment/{name} Example Request: curl -X PATCH -H \"Content-Type: application/json\" -d ' { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7_updated\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\"], \"pipDependencies\" : [] } } ' http://127.0.0.1:32080/api/v1/environment/my-submarine-env Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"success\": true, \"result\": { 
\"environmentId\": \"environment_1586156073228_0001\", \"environmentSpec\": { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7_updated\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\"], \"pipDependencies\" : [] } } } } Copy \"dockerImage\", \"name\" (of kernelSpec), \"channels\", \"condaDependencies\", \"pipDependencies\", etc. can be updated using this API. Updating the \"name\" of EnvironmentSpec is not supported. "},{"title":"Delete environment","type":1,"pageTitle":"Environment REST API","url":"docs/api/environment#delete-environment","content":"DELETE /api/v1/environment/{name} Example Request: curl -X DELETE http://127.0.0.1:32080/api/v1/environment/my-submarine-env Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"environmentId\": \"environment_1586156073228_0001\", \"environmentSpec\": { \"name\": \"my-submarine-env\", \"dockerImage\" : \"continuumio/anaconda3\", \"kernelSpec\" : { \"name\" : \"team_default_python_3.7_updated\", \"channels\" : [\"defaults\"], \"condaDependencies\" : [\"_ipyw_jlab_nb_ext_conf=0.1.0=py37_0\", \"alabaster=0.7.12=py37_0\"], \"pipDependencies\" : [] } } } } Copy "},{"title":"Experiment Template REST API","type":0,"sectionRef":"#","url":"docs/api/experiment-template","content":"","keywords":""},{"title":"Create experiment template","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#create-experiment-template","content":"POST /api/v1/template Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"name\": \"my-tf-mnist-template\", \"author\": \"author\", \"description\": \"This is a template to run tf-mnist\", \"parameters\": [{ \"name\": \"learning_rate\", \"value\": 0.1, \"required\": true, \"description\": \"This is learning_rate of training.\" }, { \"name\": \"batch_size\", \"value\": 150, \"required\": 
true, \"description\": \"This is batch_size of training.\" }, { \"name\": \"experiment_name\", \"value\": \"tf-mnist1\", \"required\": true, \"description\": \"the name of experiment.\" } ], \"experimentSpec\": { \"meta\": { \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate={{learning_rate}} --batch_size={{batch_size}}\", \"name\": \"{{experiment_name}}\", \"envVars\": { \"ENV1\": \"ENV1\" }, \"framework\": \"TensorFlow\", \"namespace\": \"default\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" } } } ' http://127.0.0.1:32080/api/v1/template Copy "},{"title":"List experiment template","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#list-experiment-template","content":"GET /api/v1/template Example Request: curl -X GET http://127.0.0.1:32080/api/v1/template Copy "},{"title":"Get experiment template","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#get-experiment-template","content":"GET /api/v1/template/{name} Example Request: curl -X GET http://127.0.0.1:32080/api/v1/template/my-tf-mnist-template Copy "},{"title":"Patch template","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#patch-template","content":"PATCH /api/v1/template/{name} Example Request: curl -X PATCH -H \"Content-Type: application/json\" -d ' { \"name\": \"my-tf-mnist-template\", \"author\": \"author-new\", \"description\": \"This is a template to run tf-mnist\", \"parameters\": [{ \"name\": \"learning_rate\", \"value\": 0.1, \"required\": true, \"description\": \"This is learning_rate of training.\" }, { \"name\": \"batch_size\", \"value\": 150, \"required\": true, \"description\": \"This is batch_size of training.\" }, { \"name\": \"experiment_name\", \"value\": 
\"tf-mnist1\", \"required\": true, \"description\": \"the name of experiment.\" } ], \"experimentSpec\": { \"meta\": { \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate={{learning_rate}} --batch_size={{batch_size}}\", \"name\": \"{{experiment_name}}\", \"envVars\": { \"ENV1\": \"ENV1\" }, \"framework\": \"TensorFlow\", \"namespace\": \"default\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" } } } ' http://127.0.0.1:32080/api/v1/template/my-tf-mnist-template Copy \"description\", \"parameters\", \"experimentSpec\", \"author\", etc. can be updated using this API. Updating the \"name\" of an experiment template is not supported. "},{"title":"Delete template","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#delete-template","content":"DELETE /api/v1/template/{name} Example Request: curl -X DELETE http://127.0.0.1:32080/api/v1/template/my-tf-mnist-template Copy "},{"title":"Use template to create a experiment","type":1,"pageTitle":"Experiment Template REST API","url":"docs/api/experiment-template#use-template-to-create-a-experiment","content":"POST /api/v1/experiment/{template_name} Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"name\": \"tf-mnist\", \"params\": { \"learning_rate\":\"0.01\", \"batch_size\":\"150\", \"experiment_name\":\"newexperiment1\" } } ' http://127.0.0.1:32080/api/v1/experiment/my-tf-mnist-template Copy "},{"title":"Notebook REST API","type":0,"sectionRef":"#","url":"docs/api/notebook","content":"","keywords":""},{"title":"Create a notebook instance","type":1,"pageTitle":"Notebook REST API","url":"docs/api/notebook#create-a-notebook-instance","content":"POST /api/v1/notebook Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": 
\"test-nb\", \"namespace\": \"default\", \"ownerId\": \"e9ca23d68d884d4ebb19d07889727dae\" }, \"environment\": { \"name\": \"notebook-env\" }, \"spec\": { \"envVars\": { \"TEST_ENV\": \"test\" }, \"resources\": \"cpu=1,memory=1.0Gi\" } } ' http://127.0.0.1:32080/api/v1/notebook Copy Example Response: { \"status\":\"OK\", \"code\":200, \"success\":true, \"message\":\"Create a notebook instance\", \"result\":{ \"notebookId\":\"notebook_1597931805405_0001\", \"name\":\"test-nb\", \"uid\":\"5a94c01d-6a92-4222-bc66-c610c277546d\", \"url\":\"/notebook/default/test-nb/\", \"status\":\"creating\", \"reason\":\"The notebook instance is creating\", \"createdTime\":\"2020-08-20T21:58:27.000+08:00\", \"deletedTime\":null, \"spec\":{ \"meta\":{ \"name\":\"test-nb\", \"namespace\":\"default\", \"ownerId\":\"e9ca23d68d884d4ebb19d07889727dae\" }, \"environment\":{ \"name\":\"notebook-env\", \"dockerImage\":\"apache/submarine:jupyter-notebook-0.5.0\", \"kernelSpec\":{ \"name\": \"team_default_python_3.7\", \"channels\": [ \"defaults\" ], \"dependencies\": [ \"\" ] }, \"description\":null, \"image\":null }, \"spec\":{ \"envVars\":{ \"TEST_ENV\":\"test\" }, \"resources\":\"cpu=1,memory=1.0Gi\" } } }, \"attributes\":{} } Copy "},{"title":"List notebook instances which belong to user","type":1,"pageTitle":"Notebook REST API","url":"docs/api/notebook#list-notebook-instances-which-belong-to-user","content":"GET /api/v1/notebook Example Request: curl -X GET http://127.0.0.1:32080/api/v1/notebook?id={user_id} Copy Example Response: { \"status\":\"OK\", \"code\":200, \"success\":true, \"message\":\"List all notebook instances\", \"result\":[ { \"notebookId\":\"notebook_1597931805405_0001\", \"name\":\"test-nb\", \"uid\":\"5a94c01d-6a92-4222-bc66-c610c277546d\", \"url\":\"/notebook/default/test-nb/\", \"status\": \"running\", \"reason\": \"The notebook instance is running\", \"createdTime\":\"2020-08-20T21:58:27.000+08:00\", \"deletedTime\":null, \"spec\":{ \"meta\":{ \"name\":\"test-nb\", 
\"namespace\":\"default\", \"ownerId\":\"e9ca23d68d884d4ebb19d07889727dae\" }, \"environment\":{ \"name\":\"notebook-env\", \"dockerImage\":\"apache/submarine:jupyter-notebook-0.5.0\", \"kernelSpec\":{ \"name\": \"team_default_python_3.7\", \"channels\": [ \"defaults\" ], \"dependencies\": [ \"\" ] }, \"description\":null, \"image\":null }, \"spec\":{ \"envVars\":{ \"TEST_ENV\":\"test\" }, \"resources\":\"cpu=1,memory=1.0Gi\" } } } ], \"attributes\":{} } Copy "},{"title":"Get the notebook instance","type":1,"pageTitle":"Notebook REST API","url":"docs/api/notebook#get-the-notebook-instance","content":"GET /api/v1/notebook/{id} Example Request: curl -X GET http://127.0.0.1:32080/api/v1/notebook/{id} Copy Example Response: { \"status\":\"OK\", \"code\":200, \"success\":true, \"message\":\"Get the notebook instance\", \"result\":{ \"notebookId\":\"notebook_1597931805405_0001\", \"name\":\"test-nb\", \"uid\":\"5a94c01d-6a92-4222-bc66-c610c277546d\", \"url\":\"/notebook/default/test-nb/\", \"status\":\"running\", \"reason\":\"The notebook instance is running\", \"createdTime\":\"2020-08-20T21:58:27.000+08:00\", \"deletedTime\":null, \"spec\":{ \"meta\":{ \"name\":\"test-nb\", \"namespace\":\"default\", \"ownerId\":\"e9ca23d68d884d4ebb19d07889727dae\" }, \"environment\":{ \"name\":\"notebook-env\", \"dockerImage\":\"apache/submarine:jupyter-notebook-0.5.0\", \"kernelSpec\":{ \"name\": \"team_default_python_3.7\", \"channels\": [ \"defaults\" ], \"dependencies\": [ \"\" ] }, \"description\":null, \"image\":null }, \"spec\":{ \"envVars\":{ \"TEST_ENV\":\"test\" }, \"resources\":\"cpu=1,memory=1.0Gi\" } } }, \"attributes\":{} } Copy "},{"title":"Delete the notebook instance","type":1,"pageTitle":"Notebook REST API","url":"docs/api/notebook#delete-the-notebook-instance","content":"DELETE /api/v1/notebook/{id} Example Request: curl -X DELETE http://127.0.0.1:32080/api/v1/notebook/{id} Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"success\": true, \"message\": 
\"Delete the notebook instance\", \"result\": { \"notebookId\": \"notebook_1597931805405_0001\", \"name\": \"test-nb\", \"uid\": \"5a94c01d-6a92-4222-bc66-c610c277546d\", \"url\": \"/notebook/default/test-nb/\", \"status\": \"terminating\", \"reason\": \"The notebook instance is terminating\", \"createdTime\": \"2020-08-22T14:03:19.000+08:00\", \"deletedTime\": \"2020-08-22T14:46:28+0800\", \"spec\": { \"meta\": { \"name\": \"test-nb\", \"namespace\": \"default\", \"ownerId\":\"e9ca23d68d884d4ebb19d07889727dae\" }, \"environment\": { \"name\": \"notebook-env\", \"dockerImage\": \"apache/submarine:jupyter-notebook-0.5.0\", \"kernelSpec\": { \"name\": \"team_default_python_3.7\", \"channels\": [ \"defaults\" ], \"dependencies\": [ \"\" ] }, \"description\": null, \"image\": null }, \"spec\": { \"envVars\": { \"TEST_ENV\": \"test\" }, \"resources\": \"cpu=1,memory=1.0Gi\" } } }, \"attributes\": {} } Copy "},{"title":"Experiment REST API","type":0,"sectionRef":"#","url":"docs/api/experiment","content":"","keywords":""},{"title":"Create Experiment (Using Anonymous/Embedded Environment)","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#create-experiment-using-anonymousembedded-environment","content":"POST /api/v1/experiment Example Request curl -X POST -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } ' http://127.0.0.1:32080/api/v1/experiment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1586156073228_0001\", 
\"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } } } Copy "},{"title":"Create Experiment (Using Pre-defined/Stored Environment)","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#create-experiment-using-pre-definedstored-environment","content":"POST /api/v1/experiment Example Request curl -X POST -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"name\": \"my-submarine-env\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } ' http://127.0.0.1:32080/api/v1/experiment Copy Above example assume environment \"my-submarine-env\" already exists in Submarine. 
Please refer to the Environment REST API reference for how to create, update, delete, and list environments. Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1586156073228_0001\", \"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"name\": \"my-submarine-env\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } } } Copy "},{"title":"List experiment","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#list-experiment","content":"GET /api/v1/experiment Example Request: curl -X GET http://127.0.0.1:32080/api/v1/experiment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": [ { \"experimentId\": \"experiment_1592057447228_0001\", \"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } }, { \"experimentId\": \"experiment_1592057447228_0002\", \"name\": \"mnist\", \"uid\": 
\"38e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:19:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"pytorch-mnist-json\", \"namespace\": \"default\", \"framework\": \"PyTorch\", \"cmd\": \"python /var/mnist.py --backend gloo\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:pytorch-dist-mnist-1.0\" }, \"spec\": { \"Master\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } } ] } Copy "},{"title":"Get experiment","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#get-experiment","content":"GET /api/v1/experiment/{id} Example Request: curl -X GET http://127.0.0.1:32080/api/v1/experiment/experiment_1592057447228_0001 Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1592057447228_0001\", \"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=2048M\" } } } } } Copy "},{"title":"Patch experiment","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#patch-experiment","content":"PATCH /api/v1/experiment/{id} Example Request: curl -X PATCH -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python 
/var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 2, \"resources\": \"cpu=1,memory=2048M\" } } } ' http://127.0.0.1:32080/api/v1/experiment/experiment_1592057447228_0001 Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"success\": true, \"result\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 2, \"resources\": \"cpu=1,memory=2048M\" } } } } Copy "},{"title":"Delete experiment","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#delete-experiment","content":"DELETE /api/v1/experiment/{id} Example Request: curl -X DELETE http://127.0.0.1:32080/api/v1/experiment/experiment_1592057447228_0001 Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1586156073228_0001\", \"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": 
\"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 2, \"resources\": \"cpu=1,memory=2048M\" } } } } } Copy "},{"title":"List experiment Log","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#list-experiment-log","content":"GET /api/v1/experiment/logs Example Request: curl -X GET http://127.0.0.1:32080/api/v1/experiment/logs Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"success\": null, \"message\": null, \"result\": [ { \"experimentId\": \"experiment_1589199154923_0001\", \"logContent\": [ { \"podName\": \"mnist-worker-0\", \"podLog\": null } ] }, { \"experimentId\": \"experiment_1589199154923_0002\", \"logContent\": [ { \"podName\": \"pytorch-dist-mnist-gloo-master-0\", \"podLog\": null }, { \"podName\": \"pytorch-dist-mnist-gloo-worker-0\", \"podLog\": null } ] } ], \"attributes\": {} } Copy "},{"title":"Get experiment Log","type":1,"pageTitle":"Experiment REST API","url":"docs/api/experiment#get-experiment-log","content":"GET /api/v1/experiment/logs/{id} Example Request: curl -X GET http://127.0.0.1:32080/api/v1/experiment/logs/experiment_1589199154923_0002 Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"success\": null, \"message\": null, \"result\": { \"experimentId\": \"experiment_1589199154923_0002\", \"logContent\": [ { \"podName\": \"pytorch-dist-mnist-gloo-master-0\", \"podLog\": \"Using distributed PyTorch with gloo backend\\nDownloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\\nProcessing...\\nDone!\\nTrain Epoch: 1 [0/60000 (0%)]\\tloss=2.3000\\nTrain Epoch: 1 [640/60000 (1%)]\\tloss=2.2135\\nTrain Epoch: 1 [1280/60000 (2%)]\\tloss=2.1704\\nTrain Epoch: 1 [1920/60000 (3%)]\\tloss=2.0766\\nTrain Epoch: 1 [2560/60000 (4%)]\\tloss=1.8679\\nTrain Epoch: 1 [3200/60000 
(5%)]\\tloss=1.4135\\nTrain Epoch: 1 [3840/60000 (6%)]\\tloss=1.0003\\nTrain Epoch: 1 [4480/60000 (7%)]\\tloss=0.7762\\nTrain Epoch: 1 [5120/60000 (9%)]\\tloss=0.4598\\nTrain Epoch: 1 [5760/60000 (10%)]\\tloss=0.4860\\nTrain Epoch: 1 [6400/60000 (11%)]\\tloss=0.4389\\nTrain Epoch: 1 [7040/60000 (12%)]\\tloss=0.4084\\nTrain Epoch: 1 [7680/60000 (13%)]\\tloss=0.4602\\nTrain Epoch: 1 [8320/60000 (14%)]\\tloss=0.4289\\nTrain Epoch: 1 [8960/60000 (15%)]\\tloss=0.3990\\nTrain Epoch: 1 [9600/60000 (16%)]\\tloss=0.3852\\n\" }, { \"podName\": \"pytorch-dist-mnist-gloo-worker-0\", \"podLog\": \"Using distributed PyTorch with gloo backend\\nDownloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\\nDownloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\\nProcessing...\\nDone!\\nTrain Epoch: 1 [0/60000 (0%)]\\tloss=2.3000\\nTrain Epoch: 1 [640/60000 (1%)]\\tloss=2.2135\\nTrain Epoch: 1 [1280/60000 (2%)]\\tloss=2.1704\\nTrain Epoch: 1 [1920/60000 (3%)]\\tloss=2.0766\\nTrain Epoch: 1 [2560/60000 (4%)]\\tloss=1.8679\\nTrain Epoch: 1 [3200/60000 (5%)]\\tloss=1.4135\\nTrain Epoch: 1 [3840/60000 (6%)]\\tloss=1.0003\\nTrain Epoch: 1 [4480/60000 (7%)]\\tloss=0.7762\\nTrain Epoch: 1 [5120/60000 (9%)]\\tloss=0.4598\\nTrain Epoch: 1 [5760/60000 (10%)]\\tloss=0.4860\\nTrain Epoch: 1 [6400/60000 (11%)]\\tloss=0.4389\\nTrain Epoch: 1 [7040/60000 (12%)]\\tloss=0.4084\\nTrain Epoch: 1 [7680/60000 (13%)]\\tloss=0.4602\\nTrain Epoch: 1 [8320/60000 (14%)]\\tloss=0.4289\\nTrain Epoch: 1 [8960/60000 (15%)]\\tloss=0.3990\\nTrain Epoch: 1 [9600/60000 (16%)]\\tloss=0.3852\\n\" } ] }, \"attributes\": {} } Copy "},{"title":"How To Contribute to Submarine","type":0,"sectionRef":"#","url":"docs/community/contributing","content":"","keywords":""},{"title":"Preface","type":1,"pageTitle":"How To Contribute to 
Submarine","url":"docs/community/contributing#preface","content":"Apache Submarine is software licensed under the Apache License 2.0. Contributing to Submarine means you agree to the Apache License 2.0. Please read the Code of Conduct carefully. The document How It Works can help you understand the Apache Software Foundation further. "},{"title":"Build Submarine","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#build-submarine","content":"Build From Code "},{"title":"Creating patches","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#creating-patches","content":"Submarine follows the Fork & Pull model. "},{"title":"Step1: Fork apache/submarine github repository (first time)","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step1-fork-apachesubmarine-github-repository-first-time","content":"Visit https://github.com/apache/submarine and click the Fork button to create a fork of the repository. "},{"title":"Step2: Clone the Submarine to your local machine","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step2-clone-the-submarine-to-your-local-machine","content":"# USERNAME – your GitHub user account name. git clone git@github.com:${USERNAME}/submarine.git # or: git clone https://github.com/${USERNAME}/submarine.git cd submarine # set upstream git remote add upstream git@github.com:apache/submarine.git # or: git remote add upstream https://github.com/apache/submarine.git # Don't push to the upstream master. 
git remote set-url --push upstream no_push # Check upstream/origin: # origin git@github.com:${USERNAME}/submarine.git (fetch) # origin git@github.com:${USERNAME}/submarine.git (push) # upstream git@github.com:apache/submarine.git (fetch) # upstream no_push (push) git remote -v Copy "},{"title":"Step3: Create a new Jira in Submarine project","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step3-create-a-new-jira-in-submarine-project","content":"New contributors need permission to create Jira issues. Please email kaihsun@apache.org with your Jira username. The email title should be \"[New Submarine Contributor]\". Check the Jira issue tracker for existing issues. Create a new Jira issue in the Submarine project. When the issue is created, a Jira number (e.g. SUBMARINE-748) will be assigned to the issue automatically. "},{"title":"Step4: Create a local branch for your contribution","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step4-create-a-local-branch-for-your-contribution","content":"cd submarine # Make your local master up-to-date git checkout master git fetch upstream git rebase upstream/master # Create a new branch for issue SUBMARINE-${jira_number} git checkout -b SUBMARINE-${jira_number} # Example: git checkout -b SUBMARINE-748 Copy "},{"title":"Step5: Develop & Create commits","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step5-develop--create-commits","content":"You can edit the code on the SUBMARINE-${jira_number} branch. (Coding Style: Code Convention) Create commits: git add ${edited files} git commit -m \"SUBMARINE-${jira_number}. ${Commit Message}\" # Example: git commit -m \"SUBMARINE-748. 
Update Contributing guide\" Copy "},{"title":"Step6: Syncing your local branch with upstream/master","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step6-syncing-your-local-branch-with-upstreammaster","content":"# On SUBMARINE-${jira_number} branch git fetch upstream git rebase upstream/master Copy Please do not use git pull to synchronize your local branch. By default git pull performs a merge, and the resulting merge commits make the commit history messy. "},{"title":"Step7: Push your local branch to your personal fork","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step7-push-your-local-branch-to-your-personal-fork","content":"git push origin SUBMARINE-${jira_number} Copy "},{"title":"Step8: Check Travis-ci status of your personal fork","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step8-check-travis-ci-status-of-your-personal-fork","content":"Visit https://travis-ci.com/github/${USERNAME}/submarine and make sure your new commits pass all integration tests before creating a pull request. "},{"title":"Step9: Create a pull request on github UI","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step9-create-a-pull-request-on-github-ui","content":"Visit your fork at https://github.com/${USERNAME}/submarine.git and click the Compare & Pull Request button to create a pull request. Pull Request template: filling in the pull request template thoroughly can speed up the review process. "},{"title":"Step10: Check Travis-ci status of your pull request in apache/submarine","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step10-check-travis-ci-status-of-your-pull-request-in-apachesubmarine","content":"Visit https://travis-ci.com/github/apache/submarine/pull_requests and make sure your pull request passes all integration tests. 
"},{"title":"Step11: The Review Process","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step11-the-review-process","content":"Anyone can be a reviewer and comment on the pull requests. A reviewer can indicate that a patch looks suitable for merging with a comment such as: \"Looks good\", \"LGTM\", \"+1\". (PS: LGTM = Looks Good To Me) At least one indication of suitability (e.g. \"LGTM\") from a committer is required before a pull request can be merged. A committer can then initiate lazy consensus (\"Merge if there is no more discussion\"), after which the code can be merged after a particular time (usually 24 hours) if there are no more reviews. Contributors can ping reviewers (including committers) by commenting 'Ready to review'. "},{"title":"Step12: Address review comments","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#step12-address-review-comments","content":"Push new commits to the SUBMARINE-${jira_number} branch. The pull request will update automatically. After you address all review comments, committers will merge the pull request. 
"},{"title":"Code convention","type":1,"pageTitle":"How To Contribute to Submarine","url":"docs/community/contributing#code-convention","content":"We follow the Google code style guides: Java style, Shell style. There are plugins to format and lint your code in your IDE (use dev-support/maven-config/checkstyle.xml as the rules): Checkstyle plugin for IntelliJ (Setting Guide), Checkstyle plugin for Eclipse (Setting Guide). "},{"title":"Guide for Apache Submarine Committers","type":0,"sectionRef":"#","url":"docs/community/HowToCommit","content":"","keywords":""},{"title":"New committers","type":1,"pageTitle":"Guide for Apache Submarine Committers","url":"docs/community/HowToCommit#new-committers","content":"New committers are encouraged to first read Apache's generic committer documentation: Apache New Committer Guide, Apache Committer FAQ. The first act of a new core committer is typically to add their name to the credits page. This requires changing the site source in https://github.com/apache/submarine-site/blob/master/community/member.md. Once done, update the Submarine website as described here (TL;DR: don't forget to regenerate the site with hugo, and commit the generated results, too). "},{"title":"Review","type":1,"pageTitle":"Guide for Apache Submarine Committers","url":"docs/community/HowToCommit#review","content":"Submarine committers should, as often as possible, attempt to review patches submitted by others. Ideally every submitted patch will get reviewed by a committer within a few days. If a committer reviews a patch they have not authored and believes it to be of sufficient quality, they can commit the patch; otherwise the patch should be cancelled with a clear explanation of why it was rejected. The list of submitted patches can be found on the GitHub Pull Requests page. Committers should scan the list from top to bottom, looking for patches that they feel qualified to review and possibly commit. 
For non-trivial changes, it is best to get another committer to review and approve your own patches before committing. "},{"title":"Reject","type":1,"pageTitle":"Guide for Apache Submarine Committers","url":"docs/community/HowToCommit#reject","content":"Patches that do not adhere to the Contribution Guidelines should be rejected. Committers should always be polite to contributors and try to instruct and encourage them to contribute better patches. If a committer wishes to improve an unacceptable patch, then it should first be rejected, and a new patch should be attached by the committer for review. "},{"title":"Commit individual patches","type":1,"pageTitle":"Guide for Apache Submarine Committers","url":"docs/community/HowToCommit#commit-individual-patches","content":"Submarine uses git for source code version control. The writable repo is at https://gitbox.apache.org/repos/asf/submarine.git It is strongly recommended to use the cicd script to merge PRs. See the instructions at https://github.com/apache/submarine/tree/master/dev-support/cicd "},{"title":"Adding Contributors role","type":1,"pageTitle":"Guide for Apache Submarine Committers","url":"docs/community/HowToCommit#adding-contributors-role","content":"There are three roles (Administrators, Committers, Contributors) in the project. Contributors who have the Contributors role can become assignees of issues in the project. Committers who have the Committers role can set arbitrary roles in addition to what the Contributors role allows. Committers who have the Administrators role can edit or delete all comments, or even delete issues, in addition to what the Committers role allows. How to set roles: Log in to ASF JIRA. Go to the project page (e.g. 
https://issues.apache.org/jira/browse/SUBMARINE ). Click the \"Administration\" tab. Click the \"Roles\" tab on the left side. Add the Administrators/Committers/Contributors role. "},{"title":"Apache Submarine Community","type":0,"sectionRef":"#","url":"docs/community/README","content":"","keywords":""},{"title":"Communicating","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#communicating","content":"You can reach out to the community members via any one of the following ways: Slack Developer: https://the-asf.slack.com/submarine-dev/ Slack User: https://the-asf.slack.com/submarine-user/ Zoom: https://cloudera.zoom.us/j/880548968 Sync Up: https://docs.google.com/document/d/16pUO3TP4SxSeLduG817GhVAjtiph9HYpRHo_JgduDvw/edit "},{"title":"Your First Contribution","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#your-first-contribution","content":"You can start by finding an existing issue with the https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE?filter=allopenissues label. These issues are well suited for new contributors. If a PR (Pull Request) that you submit to the Submarine GitHub projects is approved and merged, then you become a Submarine Contributor. If you want to work on a new idea of relatively small scope: Submit an issue describing your proposed change to the repo in question. The repo owners will respond to your issue promptly. Submit a pull request to Submarine containing a tested change. Contributions are welcomed and greatly appreciated. See CONTRIBUTING for details on submitting patches and the contribution workflow. "},{"title":"How Do I Become a Committer?","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#how-do-i-become-a-committer","content":"First of all, you need to get involved and be a Contributor. Based on your track record as a contributor, the PMC, which votes on committership per the Apache code, may invite you to be a committer (after we've called a vote). 
When that happens, if you accept, the following process kicks into place... Note that becoming a committer is not just about submitting some patches; it's also about helping out on the development and user Slack channels, helping with documentation and the issues. "},{"title":"How to commit","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#how-to-commit","content":"See How to commit for a helper doc for Submarine committers. "},{"title":"Communication","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#communication","content":"Communication within the Submarine community abides by Apache’s Code of Conduct. "},{"title":"Mailing lists","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#mailing-lists","content":"Get help using Apache Submarine or contribute to the project on our mailing lists: Users: subscribe, unsubscribe, archives; for usage questions, help, and announcements. Dev: subscribe, unsubscribe, archives; for people wanting to contribute to the project. Commits: subscribe, unsubscribe, archives; for commit messages and patches. "},{"title":"License","type":1,"pageTitle":"Apache Submarine Community","url":"docs/community/README#license","content":"Submarine source code is under the Apache 2.0 license. See the LICENSE file for details. "},{"title":"Environments Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/environments-implementation","content":"","keywords":""},{"title":"Overview","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#overview","content":"An environment profile (or environment for short) defines a set of libraries and, when Docker is being used, a Docker image in order to run an experiment or a notebook. A Docker image and/or VM image (such as VirtualBox/VMware images, Amazon Machine Images (AMI), or a custom Azure VM image) defines the base layer of the environment. 
Please note that a VM image is different from a VM instance type. On top of that, users can define a set of libraries (such as Python/R) to install; we call this the kernel. Example of Environment  +-------------------+ |+-----------------+| || Python=3.7 || || Tensorflow=2.0 || |+---Exp Dependency+| |+-----------------+| ||OS=Ubuntu16.04 || ||CUDA=10.2 || ||GPU_Driver=375.. || |+---Base Library--+| +-------------------+ Copy As you can see, there are base libraries, such as the OS, CUDA version, GPU driver, etc. They can be provided by specifying a VM image / Docker image. On top of that, users can bring their own dependencies, such as different versions of Python, Tensorflow, Pandas, etc. How do users use environments? Users can save different environment configs, which can also be shared across the platform. Environment profiles can be used to run a notebook (e.g. by choosing a different kernel from Jupyter), or an experiment. A predefined experiment library includes the environment to use, so users don't have to choose one.  +-------------------+ |+-----------------+| +------------+ || Python=3.7 || |User1 | || Tensorflow=2.0 || +------------+ |+---Kernel -------+| +------------+ |+-----------------+|<----+ |User2 | ||OS=Ubuntu16.04 || + +------------+ ||CUDA=10.2 || | +------------+ ||GPU_Driver=375.. || | |User3 | |+---Base Library--+| | +------------+ +-----Default-Env---+ | | | +-------------------+ | |+-----------------+| | || Python=3.3 || | || Tensorflow=2.0 || | |+---kernel--------+| | |+-----------------+| | ||OS=Ubuntu16.04 || | ||CUDA=10.3 ||<----+ ||GPU_Driver=375.. || |+---Base Library--+| +-----My-Customized-+ Copy There are two environments in the above graph, \"Default-Env\" and \"My-Customized\", which can have different combinations of libraries for different experiments/notebooks. Users can choose different environments for different experiments as they want. Environments can be added/listed/deleted/selected through CLI/SDK/UI. 
Implementation# "},{"title":"Environment API definition","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#environment-api-definition","content":"Let's look at the object definition used to define an environment. The environment API looks like:  name: \"my_submarine_env\", vm-image: \"...\", docker-image: \"...\", kernel: <object of kernel> description: \"this is the most common env used by team ABC\" Copy vm-image is optional if we don't need to launch a new VM (like running a training job on a cloud-remote machine). docker-image is required. kernel is optional if the kernel is already included in vm-image or docker-image. The name of the environment should be unique in the system, so users can reference it when creating a new experiment/notebook. "},{"title":"VM-image and Docker-image","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#vm-image-and-docker-image","content":"Docker images and VM images should be prepared by system admins / SREs; it is hard for data scientists to write an error-proof Dockerfile and to push/manage Docker images. This is one of the reasons we hide the Docker image inside \"environment\": we encourage users to customize their kernels if needed, but they don't have to touch a Dockerfile or build/push/manage new Docker images. As a project, we will document best practices and examples of Dockerfiles. The Dockerfile should include a proper ENTRYPOINT definition that points to our default script, so whether it is a notebook or an experiment, we will set up the kernel (see below) and other environment variables properly. 
"},{"title":"Kernel Implementation","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#kernel-implementation","content":"After investigating different alternatives (such as pipenv, venv, etc.), we decided to use Conda environments, which nicely replace Python virtual env and pip, and can also support other languages. More details can be found at: https://medium.com/@krishnaregmi/pipenv-vs-virtualenv-vs-conda-environment-3dde3f6869ed With Conda, users can easily add or remove dependencies of a Conda environment. Users can also easily export an environment to a YAML file. The YAML file produced by conda env export looks like: name: base channels: - defaults dependencies: - _ipyw_jlab_nb_ext_conf=0.1.0=py37_0 - alabaster=0.7.12=py37_0 - anaconda=2020.02=py37_0 - anaconda-client=1.7.2=py37_0 - anaconda-navigator=1.9.12=py37_0 - anaconda-project=0.8.4=py_0 - applaunchservices=0.2.1=py_0 Copy Including the Conda kernel, the environment object may look like: name: \"my_submarine_env\", vm-image: \"...\", docker-image: \"...\", kernel: name: team_default_python_3.7 channels: - defaults dependencies: - _ipyw_jlab_nb_ext_conf=0.1.0=py37_0 - alabaster=0.7.12=py37_0 - anaconda=2020.02=py37_0 - anaconda-client=1.7.2=py37_0 - anaconda-navigator=1.9.12=py37_0 Copy When launching a new experiment / notebook session using my_submarine_env, Submarine server will use the defined Docker image and Conda kernel to launch the container. "},{"title":"Storage of Environment","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#storage-of-environment","content":"A Submarine environment is just a simple text file, so it will be persisted in the Submarine metastore, which is ideally a database. The Docker image is stored inside a regular Docker registry, which will be handled outside of the system. 
Conda dependencies are stored in a Conda channel (where referenced packages are stored), which will be handled/set up separately. (Popular Conda channels are defaults and conda-forge.) For more detailed discussion about storage-related implementations, please refer to storage-implementation. "},{"title":"How to implement to make user can easily use Submarine environments?","type":1,"pageTitle":"Environments Implementation","url":"docs/designDocs/environments-implementation#how-to-implement-to-make-user-can-easily-use-submarine-environments","content":"We like simplicity, and we don't want to leak implementation complexity to users. To make that happen, we have to do some work to hide the complexity. There are two primary uses of environments: experiments and notebooks. For both of them, users should not have to do things like explicitly calling conda activate $env_name to activate environments. To make that happen, we can include the following parts in the Dockerfile: FROM ubuntu:18.04 <Include whatever base-libraries like CUDA, etc.> <Make sure conda (with our preferred version) is installed> <Make sure Jupyter (with our preferred version) is installed> # This is just a sample of Dockerfile, users can do more customizations if needed ENTRYPOINT [\"/submarine-bootstrap.sh\"] Copy When Submarine Server launches an experiment or notebook (this is an implementation detail of Submarine Server; users will not see it at all), it will invoke the following docker run command (or an equivalent, such as a K8s spec): docker run <submarine_docker_image> --kernel <kernel_name> -- .... python train.py --batch_size 5 (and other parameters) Copy Similarly, to launch a notebook: docker run <submarine_docker_image> --kernel <kernel_name> -- .... jupyter Copy The submarine-bootstrap.sh script is part of the Submarine repo and will handle the --kernel argument, invoking conda activate $kernel_name before anything else (like running the training job). 
"},{"title":"Architecture and Requirment","type":0,"sectionRef":"#","url":"docs/designDocs/architecture-and-requirements","content":"","keywords":""},{"title":"Terminology","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#terminology","content":"Term\tDescription User\tA single data-scientist/data-engineer. A user has a resource quota and credentials. Team\tA user belongs to one or more teams; teams have ACLs for sharing artifacts such as notebook content, models, etc. Admin\tAlso called SRE, who manages users' quotas, credentials, teams, and other components. "},{"title":"Background","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#background","content":"Everybody talks about machine learning today, and lots of companies are trying to leverage machine learning to push the business to the next level. Nowadays, as more and more developers and infrastructure software companies come to this field, machine learning becomes more and more achievable. In the last decade, the software industry has built many open source tools for machine learning to solve the pain points: It was not easy to build machine learning algorithms manually, such as logistic regression, GBDT, and many other algorithms. Answer to that: Industries have open sourced many algorithm libraries, tools, and even pre-trained models so that data scientists can directly reuse these building blocks and hook them up to their data without knowing intricate details inside these algorithms and models. It was not easy to achieve \"WYSIWYG, what you see is what you get\" from IDEs: not easy to get output, visualization, and troubleshooting experiences in the same place. Answer to that: The notebook concept was added to this picture; notebooks brought the experiences of interactive coding, sharing, visualization, and debugging under the same user interface. There are popular open-source notebooks like Apache Zeppelin/Jupyter. 
It was not easy to manage dependencies: an ML application that runs on one machine is hard to deploy on another machine because it has many library dependencies. Answer to that: Containerization has become a popular, standard way to package dependencies, making it easier to \"build once, run anywhere\". Fragmented tools and libraries were hard for ML engineers to learn. Experiences learned in one company were not naturally transferable to another company. Answer to that: A few dominant open-source frameworks reduced the overhead of learning too many different frameworks and concepts. A data scientist who learns a few libraries such as Tensorflow/PyTorch and a few high-level wrappers like Keras will be able to create machine learning applications from other open-source building blocks. Similarly, models built by one library (such as libsvm) were hard to integrate into a machine learning pipeline since there was no standard format. Answer to that: Industry has built successful open-source standard machine learning frameworks such as Tensorflow/PyTorch/Keras so their formats can be easily shared. There are also efforts to build an even more general model format such as ONNX. It was hard to build a data pipeline that flows/transforms data from a raw data source to whatever is required by ML applications. Answer to that: The open source big data industry plays an important role in providing, simplifying, and unifying processes and building blocks for data flows, transformations, etc. The machine learning industry is moving on the right track to solve major roadblocks. So what are the pain points now for companies which have machine learning needs? How can we help here? To answer this question, let's look at the machine learning workflow first. 
"},{"title":"Machine Learning Workflows & Pain points","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#machine-learning-workflows--pain-points","content":"1) From different data sources such as edge, clickstream, logs, etc. => Land to data lakes 2) From data lake, data transformation: => Data transformations: Cleanup, remove invalid rows/columns, select columns, sampling, split train/test data-set, join table, etc. => Data prepared for training. 3) From prepared data: => Training, model hyper-parameter tuning, cross-validation, etc. => Models saved to storage. 4) From saved models: => Model assurance, deployment, A/B testing, etc. => Model deployed for online serving or offline scoring. Copy Typically data scientists are responsible for items 2)-4), while 1) is typically handled by a different team (called the Data Engineering team in many companies; some Data Engineering teams are also responsible for part of the data transformation). "},{"title":"Pain #1 Complex workflow/steps from raw data to model, different tools needed by different steps, hard to make changes to workflow, and not error-proof","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#pain-1-complex-workflowsteps-from-raw-data-to-model-different-tools-needed-by-different-steps-hard-to-make-changes-to-workflow-and-not-error-proof","content":"It is a complex workflow from raw data to usable models. After talking to many different data scientists, we have learned that a typical procedure to train a new model and push it to production can take months to 1-2 years. This workflow also requires a wide skill set. For example, data transformation needs tools like Spark/Hive at large scale and tools like Pandas at small scale. Model training needs to switch between XGBoost, Tensorflow, Keras, and PyTorch. Building a data pipeline requires Apache Airflow or Oozie. 
Yes, there are great, standardized open-source tools built for many of these purposes. But what about changes that need to be made to a particular part of the data pipeline? What about adding a few columns to the training data for experiments? What about training models, and pushing models to validation and A/B testing before rolling out to production? All these steps require jumping between different tools and UIs, changes are very hard to make, and the procedures are not error-proof. "},{"title":"Pain #2 Dependencies of underlying resource management platform","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#pain-2-dependencies-of-underlying-resource-management-platform","content":"For the jobs/services required by a machine learning platform to run, we need an underlying resource management platform. There are several choices of resource management platform, and they have distinct advantages and disadvantages. For example, there are many machine learning platforms built on top of K8s. It is relatively easy to get K8s from a cloud vendor, and easy to orchestrate the services/daemons that machine learning requires on K8s. However, K8s doesn't offer good support for jobs like Spark/Flink/Hive. So if your company has Spark/Flink/Hive running on YARN, there are gaps and a significant amount of work to move the required jobs from YARN to K8s. Maintaining a separate K8s cluster also adds overhead to a Hadoop-based data infrastructure. Similarly, if your company's data pipelines are mostly built on top of cloud resources and SaaS offerings, asking you to install a separate YARN cluster to run a new machine learning platform doesn't make a lot of sense. 
"},{"title":"Pain #3 Data scientist are forced to interact with lower-level platform components","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#pain-3-data-scientist-are-forced-to-interact-with-lower-level-platform-components","content":"In addition to the above pain, we do see Data Scientists are forced to learn underlying platform knowledge to be able to build a real-world machine learning workflow. For most of the data scientists we talked with, they're experts of ML algorithms/libraries, feature engineering, etc. They're also most familiar with Python, R, and some of them understand Spark, Hive, etc. If they're asked to do interactions with lower-level components like fine-tuning a Spark job's performance; or troubleshooting job failed to launch because of resource constraints; or write a K8s/YARN job spec and mount volumes, set networks properly. They will scratch their heads and typically cannot perform these operations efficiently. "},{"title":"Pain #4 Comply with data security/governance requirements","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#pain-4-comply-with-data-securitygovernance-requirements","content":"TODO: Add more details. "},{"title":"Pain #5 No good way to reduce routine ML code development","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#pain-5-no-good-way-to-reduce-routine-ml-code-development","content":"After the data is prepared, the data scientist needs to do several routine tasks to build the ML pipeline. To get a sense of the existing the data set, it usually needs a split of the data set, the statistics of data set. These tasks have a common duplicate part of code, which reduces the efficiency of data scientists. An abstraction layer/framework to help the developer to boost ML pipeline development could be valuable. 
It would be better if developers only needed to fill in callback functions and could focus on their key logic. Submarine# "},{"title":"Overview","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#overview","content":""},{"title":"A little bit history","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#a-little-bit-history","content":"Initially, Submarine was built to solve the problems of running deep learning jobs like Tensorflow/PyTorch on Apache Hadoop YARN, allowing admins to monitor launched deep learning jobs and manage generated models. It was part of YARN initially, and its code resided under hadoop-yarn-applications. Later, the community decided to convert it into a subproject within Hadoop (a sibling project of YARN, HDFS, etc.) because we wanted to support other resource management platforms like K8s. Finally, we reconsidered Submarine's charter, and the Hadoop community voted that it was time to move Submarine to a separate Apache TLP. "},{"title":"Why Submarine?","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#why-submarine","content":"ONE PLATFORM Submarine is the ONE PLATFORM that allows data scientists to create end-to-end machine learning workflows. ONE PLATFORM means that data scientists and data engineers can finish their jobs on the same platform without frequently switching toolsets. From dataset exploration and data pipeline creation, through model training and tuning, to pushing models to production, all these steps can be completed within the ONE PLATFORM. Resource Management Independent It is also designed to be resource management independent: whether you have Apache Hadoop YARN, K8s, or just a container service, you will be able to run Submarine on top of it. 
"},{"title":"Requirements and non-requirements","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#requirements-and-non-requirements","content":""},{"title":"Notebook","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#notebook","content":"1) Users should be able to create, edit, delete a notebook. (P0) 2) Notebooks can be persisted to storage and can be recovered if failure happens. (P0) 3) Users can trace back to history versions of a notebook. (P1) 4) Notebooks can be shared with different users. (P1) 5) Users can define a list of parameters of a notebook (looks like parameters of the notebook's main function) to allow executing a notebook like a job. (P1) 6) Different users can collaborate on the same notebook at the same time. (P2) A running notebook instance is called notebook session (or session for short). "},{"title":"Experiment","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#experiment","content":"Experiments of Submarine is an offline task. It could be a shell command, a Python command, a Spark job, a SQL query, or even a workflow. The primary purposes of experiments under Submarine's context is to do training tasks, offline scoring, etc. However, experiment can be generalized to do other tasks as well. Major requirement of experiment: 1) Experiments can be submitted from UI/CLI/SDK. 2) Experiments can be monitored/managed from UI/CLI/SDK. 3) Experiments should not bind to one resource management platform (K8s/YARN). Type of experiments#  There're two types of experiments:Adhoc experiments: which includes a Python/R/notebook, or even an adhoc Tensorflow/PyTorch task, etc. Predefined experiment library: This is specialized experiments, which including developed libraries such as CTR, BERT, etc. Users are only required to specify a few parameters such as input, output, hyper parameters, etc. 
They do not have to worry about where the training script/dependencies are located. Adhoc experiment# Requirements: Allow running adhoc scripts. Allow model engineers and data scientists to run Tensorflow/Pytorch programs on YARN/K8s/Container-cloud. Allow jobs to easily access data/models in HDFS/s3, etc. Support running distributed Tensorflow/Pytorch jobs with simple configs. Support running user-specified Docker images. Support specifying GPU and other resources. Predefined experiment library# Here's an example of a predefined experiment library to train a deepfm model: { \"input\": { \"train_data\": [\"hdfs:///user/submarine/data/tr.libsvm\"], \"valid_data\": [\"hdfs:///user/submarine/data/va.libsvm\"], \"test_data\": [\"hdfs:///user/submarine/data/te.libsvm\"], \"type\": \"libsvm\" }, \"output\": { \"save_model_dir\": \"hdfs:///user/submarine/deepfm\", \"metric\": \"auc\" }, \"training\": { \"batch_size\" : 512, \"field_size\": 39, \"num_epochs\": 3, \"feature_size\": 117581, ... } } Predefined experiment libraries can be shared across users on the same platform; users can also add new or modified predefined experiment libraries via UI/REST API. We will also model AutoML and automatic hyper-parameter tuning as predefined experiment libraries. Pipeline# A pipeline is a special kind of experiment: A pipeline is a DAG of experiments, and can itself be treated as an experiment. Users can submit/terminate a pipeline. Pipelines can be created/submitted via UI/API. "},{"title":"Environment Profiles","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#environment-profiles","content":"An environment profile (or environment for short) defines the set of libraries and, when Docker is being used, the Docker image needed to run an experiment or a notebook. A Docker or VM image (such as an AMI: Amazon Machine Image) defines the base layer of the environment. On top of that, users can define a set of libraries (such as Python/R) to install. 
Users can save different environment configs, which can also be shared across the platform. Environment profiles can be used to run a notebook (e.g. by choosing a different kernel in Jupyter) or an experiment. A predefined experiment library specifies which environment to use, so users don't have to choose one. Environments can be added/listed/deleted/selected through CLI/SDK. "},{"title":"Model","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#model","content":"Model management# Model artifacts are generated by experiments or notebooks. A model consists of artifacts from one or multiple files. Users can choose to save, tag, and version a produced model. Once the model is saved, users can use it for online model serving or offline scoring. Model serving# After a model is saved, users can specify a serving script and a model, and create a web service to serve the model. We call the web service an \"endpoint\". Users can manage (add/stop) model serving endpoints via CLI/API/UI. "},{"title":"Metrics for training job and model","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#metrics-for-training-job-and-model","content":"Submarine-SDK provides tracking/metrics APIs, which allow developers to add tracking/metrics and view them from the Submarine Workbench UI. "},{"title":"Deployment","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#deployment","content":"Submarine Services (see the architecture overview below) should be easy to deploy on-prem / on-cloud. Since there are more and more public cloud offerings for compute/storage management, we need to support deploying Submarine compute-related workloads (such as notebook sessions, experiments, etc.) to cloud-managed clusters. This also means Submarine may need to take input parameters from customers and create/manage clusters if needed. 
It is also a common requirement to use a hybrid of on-prem/on-cloud clusters. "},{"title":"Security / Access Control / User Management / Quota Management","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#security--access-control--user-management--quota-management","content":"There are 4 kinds of objects that need access control: Assets belonging to the Submarine system, which include notebooks, experiments and results, models, predefined experiment libraries, and environment profiles. Data security (who owns what data, and what data can be accessed by each user). User credentials (such as LDAP). Other security, such as Git repo access, etc. Data security, user credentials, and other security will be delegated to 3rd-party libraries such as Apache Ranger, IAM roles, etc. Assets belonging to the Submarine system will be handled by Submarine itself. Here are the operations a Submarine admin can perform for the users / teams that access Submarine's assets. Operations for admins Admins use the \"User Management System\" to onboard new users, upload user credentials, assign resource quotas, etc. Admins can create new users and new teams, update user/team mappings, or remove users/teams. Admins can set resource quotas (if different from the system default) and permissions, and upload/update the necessary credentials (like a Kerberos keytab) of a user. A DE/DS can also be an admin if the DE/DS has admin access (like a privileged user). This is useful when a cluster is used exclusively by one user or shared only by a small team. The Resource Quota Management System helps admins manage the resource quotas of teams and organizations. Resources can be machine resources like CPU/Memory/Disk, etc. They can also include non-machine resources like $$-based budgets. 
"},{"title":"Dataset","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#dataset","content":"There's also need to tag dataset which will be used for training and shared across the platform by different users. Like mentioned above, access to the actual data will be handled by 3rd party system like Apache Ranger / Hive Metastore which is out of the Submarine's scope. "},{"title":"Architecture Overview","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#architecture-overview","content":""},{"title":"Architecture Diagram","type":1,"pageTitle":"Architecture and Requirment","url":"docs/designDocs/architecture-and-requirements#architecture-diagram","content":" +-----------------------------------------------------------------+ | Submarine UI / CLI / REST API / SDK | | Mini-Submarine | +-----------------------------------------------------------------+ +--------------------Submarine Server-----------------------------+ | +---------+ +---------+ +----------+ +----------+ +------------+| | |Data set | |Notebooks| |Experiment| |Models | |Servings || | +---------+ +---------+ +----------+ +----------+ +------------+| |-----------------------------------------------------------------| | | | +-----------------+ +-----------------+ +---------------------+ | | |Experiment | |Compute Resource | |Other Management | | | |Manager | | Manager | |Services | | | +-----------------+ +-----------------+ +---------------------+ | | Spark, template YARN/K8s/Docker | | TF, PyTorch, pipeline | | | + +-----------------+ + | |Submarine Meta | | | | Store | | | +-----------------+ | | | +-----------------------------------------------------------------+ (You can use http://stable.ascii-flow.appspot.com/#Draw to draw such diagrams) Copy Compute Resource Manager Helps to manage compute resources on-prem/on-cloud, this module can also handle cluster creation / management, etc. 
Experiment Manager Work with \"Compute Resource Manager\" to submit different kinds of workloads such as (distributed) Tensorflow / Pytorch, etc. Submarine SDK provides Java/Python/REST API to allow DS or other engineers to integrate into Submarine services. It also includes a mini-submarine component that launches Submarine components from a single Docker container (or a VM image). Details of Submarine Server design can be found at submarine-server-design. "},{"title":"Experiment Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/experiment-implementation","content":"","keywords":""},{"title":"Overview","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#overview","content":"This document talks about implementation of experiment, flows and design considerations. Experiment consists of following components, also interact with other Submarine or 3rd-party components, showing below:  +---------------------------------------+ +----------+ | Experiment Tasks | |Run | | | |Configs | | +----------------------------------+ | +----------+ | | Experiment Runnable Code | | +-----------------+ +----------+ | | | | |Output Artifacts | |Input Data| | | (Like train-job.py) | | |(Models, etc.) | | | | +----------------------------------+ | +-----------------+ | | | +----------------------------------+ | +----------+ | | Experiment Deps (Like Python) | | +-------------+ | +----------------------------------+ | |Logs/Metrics | | +----------------------------------+ | | | | | OS, Base Libaries (Like CUDA) | | +-------------+ | +----------------------------------+ | +---------------------------------------+ ^ | (Launch Task with resources) + +---------------------------------+ |Resource Manager (K8s/YARN/Cloud)| +---------------------------------+ Copy As showing in the above diagram, Submarine experiment consists of the following items: On the left side, there're input data and run configs. 
In the middle box are the experiment tasks; there can be multiple tasks when we run distributed training, pipelines, etc. Each task has main runnable code, such as train.py as the main entry point for training. The two boxes below it, experiment dependencies and OS/base libraries, are what we call the Submarine Environment Profile, or Environment for short. It defines the basic libraries needed to run the main experiment code. Experiment tasks are launched by a Resource Manager, such as K8s/YARN/Cloud, or just launched locally. There are resource constraints for each experiment task (e.g. how much memory, cores, GPU, disk, etc. can be used by the task). On the right side are the artifacts generated by experiments: Output artifacts: the main output of the experiment, which could be model(s), or output data when we do batch prediction. Logs/Metrics for further troubleshooting or for understanding the experiment's quality. For the rest of the design doc, we will talk about how we handle environment and code, and how we manage output/logs, etc. "},{"title":"API of Experiment","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#api-of-experiment","content":"This is not a full definition of an experiment; for more details, please refer to the experiment API. Here's just an example of an experiment object, which helps developers understand what is included in an experiment. experiment: name: \"abc\" type: \"script\" environment: \"team-default-ml-env\" code: sync_mode: s3 url: \"s3://bucket/training-job.tar.gz\" parameter: > python training.py --iteration 10 --input=s3://bucket/input --output=s3://bucket/output resource_constraint: res=\"mem=20gb, vcore=3, gpu=2\" timeout: \"30 mins\" This defines a \"script\" experiment named \"abc\"; the name can be used to track the experiment. The environment \"team-default-ml-env\" is specified to make sure the dependencies of the job can be downloaded properly before the job is executed. 
code defines where the experiment code will be downloaded from; we will support several sync_mode values such as s3 (or abfs/hdfs), git, etc. Different types of experiments will have different specs. For example, a distributed Tensorflow spec may look like: experiment: name: \"abc-distributed-tf\" type: \"distributed-tf\" ps: environment: \"team-default-ml-cpu\" resource_constraint: res=\"mem=20gb, vcore=3, gpu=0\" worker: environment: \"team-default-ml-gpu\" resource_constraint: res=\"mem=20gb, vcore=3, gpu=2\" code: sync_mode: git url: \"https://foo.com/training-job.git\" parameter: > python /code/training-job/training.py --iteration 10 --input=s3://bucket/input --output=s3://bucket/output tensorboard: enabled timeout: \"30 mins\" Since the two roles use different Docker images, one with GPU and one without, we can specify a different environment and resource constraint for each. "},{"title":"Manage environments for experiment","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#manage-environments-for-experiment","content":"Please refer to environment-implementation.md for more details "},{"title":"Manage storages for experiment","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#manage-storages-for-experiment","content":"There are different types of storage, such as logs, metrics, and dependencies (environments). Please refer to storage-implementations for more details. This also covers how to manage the code of an experiment. 
"},{"title":"Manage Pre-defined experiment libraries","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#manage-pre-defined-experiment-libraries","content":""},{"title":"Flow: Submit an experiment","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#flow-submit-an-experiment","content":""},{"title":"Submit via SDK Flows.","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#submit-via-sdk-flows","content":"To better understand experiment implementation, It will be good to understand what is the steps of experiment submission. Please note that below code is just pseudo code, not official APIs. "},{"title":"Specify what environment to use","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#specify-what-environment-to-use","content":"Before submit the environment, you have to choose what environment to choose. Environment defines dependencies, etc. of an experiment or a notebook. might looks like below: conda_environment = \"\"\" name: conda-env channels: - defaults dependencies: - asn1crypto=1.3.0=py37_0 - blas=1.0=mkl - ca-certificates=2020.1.1=0 - certifi=2020.4.5.1=py37_0 - cffi=1.14.0=py37hb5b8e2f_0 - chardet=3.0.4=py37_1003 prefix: /opt/anaconda3/envs/conda-env \"\"\" # This environment can be different from notebook's own environment environment = create_environment { DockerImage = \"ubuntu:16\", CondaEnvironment = conda_environment } Copy To better understand how environment works, please refer to environment-implementation. 
"},{"title":"Create experiment, specify where's training code located, and parameters.","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#create-experiment-specify-wheres-training-code-located-and-parameters","content":"For ad-hoc experiment (code located at S3), assume training code is part of the training-job.tar.gz and main class is train.py. When the job is launched, whatever specified in the localize_artifacts will be downloaded. experiment = create_experiment { Environment = environment, ExperimentConfig = { type = \"adhoc\", localize_artifacts = [ \"s3://bucket/training-job.tar.gz\" ], name = \"abc\", parameter = \"python training.py --iteration 10 --input=\"s3://bucket/input output=\"s3://bucket/output\", } } experiment.run() experiment.wait_for_finish(print_output=True) Copy Run notebook file in offline mode# It is possible we want to run a notebook file in offline mode, to do that, here's code to use to run a notebook code experiment = create_experiment { Environment = environment, ExperimentConfig = { type = \"adhoc\", localize_artifacts = [ \"s3://bucket/folder/notebook-123.ipynb\" ], name = \"abc\", parameter = \"runipy training.ipynb --iteration 10 --input=\"s3://bucket/input output=\"s3://bucket/output\", } } experiment.run() experiment.wait_for_finish(print_output=True) Copy Run pre-defined experiment library# experiment = create_experiment { # Here you can use default environment of library Environment = environment, ExperimentConfig = { type = \"template\", name = \"abc\", # A unique name of template template = \"deepfm_ctr\", # yaml file defined what is the parameters need to be specified. parameter = { Input: \"S3://.../input\", Output: \"S3://.../output\" Training: { \"batch_size\": 512, \"l2_reg\": 0.01, ... } } } } experiment.run() experiment.wait_for_finish(print_output=True) Copy "},{"title":"Summarize: Experiment v.s. 
Notebook session","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#summarize-experiment-vs-notebook-session","content":"There's a common misunderstanding about the differences between running an experiment and running a task from a notebook session. We will talk about the differences and commonalities: Differences \tExperiment\tNotebook SessionRun mode\tOffline\tInteractive Output Artifacts (a.k.a model)\tPersisted in a shared storage (like S3/NFS)\tLocal in the notebook session container, could be ephemeral Run history (meta, logs, metrics)\tMeta/logs/metrics can be traced from experiment UI (or corresponding API)\tNo run history can be traced from Submarine UI/API. Can view the current running paragraph's log/metrics, etc. What to run?\tCode from Docker image or shared storage (like Tarball on S3, Github, etc.)\tLocal in the notebook's paragraph Commonalities \tExperiment & Notebook SessionEnvironment\tThey can share the same Environment configuration "},{"title":"Experiment-related modules inside Submarine-server","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#experiment-related-modules-inside-submarine-server","content":"(Please refer to architecture of submarine server for more details) "},{"title":"Experiment Manager","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#experiment-manager","content":"The experiment manager receives experiment requests, persists the experiment metadata in a database (e.g. MySQL), and invokes the subsequent modules to submit the experiment and monitor its execution. 
"},{"title":"Compute Cluster Manager","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#compute-cluster-manager","content":"After experiment accepted by experiment manager, based on which cluster the experiment intended to run (like mentioned in the previous sections, Submarine supports to manage multiple compute clusters), compute cluster manager will returns credentials to access the compute cluster. It will also be responsible to create a new compute cluster if needed. For most of the on-prem use cases, there's only one cluster involved, for such cases, ComputeClusterManager returns credentials to access local cluster if needed. "},{"title":"Experiment Submitter","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#experiment-submitter","content":"Experiment Submitter handles different kinds of experiments to run (e.g. ad-hoc script, distributed TF, MPI, pre-defined templates, Pipeline, AutoML, etc.). And such experiments can be managed by different resource management systems (e.g. K8s, YARN, container cloud, etc.) To meet the requirements to support variant kinds of experiments and resource managers, we choose to use plug-in modules to support different submitters (which requires jars to submarine-server’s classpath). To avoid jars and dependencies of plugins break the submarine-server, the plug-ins manager, or both. To solve this issue, we can instantiate submitter plug-ins using a classloader that is different from the system classloader. Submitter Plug-ins# Each plug-in uses a separate module under the server-submitter module. As the default implements, we provide for YARN and K8s. For YARN cluster, we provide the submitter-yarn and submitter-yarnservice plug-ins. The submitter-yarn plug-in used the TonY as the runtime to run the training job, and the submitter-yarnservice plug-in direct use the YARN Service which supports Hadoop v3.1 above. 
The submitter-k8s plug-in is used to submit the job to a Kubernetes cluster and uses an operator as the runtime. The submitter-k8s plug-in implements the operations on CRD objects and provides the Java interface. In the beginning, we use the tf-operator for TensorFlow. If Submarine wants to support another resource management system in the future, such as submarine-docker-cluster (submarine uses the Raft algorithm to create a docker cluster on the docker runtime environment on multiple servers, providing the most lightweight resource scheduling system for small-scale users), we should create a new plug-in module named submitter-docker under the server-submitter module. "},{"title":"Experiment Monitor","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#experiment-monitor","content":"The monitor tracks the experiment life cycle and records the main events and key info at runtime. As the experiment run progresses, metrics are needed to evaluate whether the execution is succeeding or failing. To adapt to the different cluster resource management systems, we need a generic metric info structure, and each submitter plug-in should inherit and complete it by itself. "},{"title":"Invoke flows of experiment-related components","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#invoke-flows-of-experiment-related-components","content":" +-----------------+ +----------------+ +----------------+ +-----------------+ |Experiments | |Compute Cluster | |Experiment | | Experiment | |Mgr | |Mgr | |Submitter | | Monitor | +-----------------+ +----------------+ +----------------+ +-----------------+ + + + + User | | | | Submit |+------------------------------------->+ + Xperiment| Use submitter.validate(spec) | | | to validate spec and create | | | experiment object (state- | | | machine). 
| | | | | | The experiment manager will | | | persist meta-data to Database| | | | | | | | + + |+-----------------> + | | | Submit Experiments| | | | To ComputeCluster| | | | Mgr, get existing|+---------------->| | | cluster, or | Use Submitter | | | create a new one.| to submit |+---------------> | | | Different kinds | Once job is | | | of experiments | submitted, use |+----+ | | to k8s/yarn, etc| monitor to get | | | | | status updates | | | | | | | Monitor | | | | | Xperiment | | | | | status | | | | | |<--------------------------------------------------------+| | | | | | | | Update Status back to Experiment | | | | Manager | |<----+ | | | | | | | | | | | | v v v v Copy TODO: add more details about template, environment, etc. "},{"title":"Common modules of experiment/notebook-session/model-serving","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#common-modules-of-experimentnotebook-sessionmodel-serving","content":"Experiment/notebook-session/model-serving share a lot of commonalities, all of them are: Some workloads running on YARN/K8s.Need persist meta data to DB. Need monitor task/service running status from resource management system.  We need to make their implementation are loose-coupled, but at the same time, share some building blocks as much as possible (e.g. submit PodSpecs to K8s, monitor status, get logs, etc.) to reduce duplications. "},{"title":"Support Predefined-experiment-templates","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#support-predefined-experiment-templates","content":"Predefined Experiment Template is just a way to save data-scientists time to repeatedly entering parameters which is not error-proof and user experience is also bad. 
"},{"title":"Predefined-experiment-template API to run experiment","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#predefined-experiment-template-api-to-run-experiment","content":"Predefined experiment template consists a list of parameters, each of the parameter has 4 properties: Key\tRequired\tDefault Value\tDescriptionName of the key\ttrue/false\tWhen required = false, a default value can be provided by the template\tDescription of the parameter For the example of deepfm CTR training experiment mentioned in the architecture-and-requirements.md { \"input\": { \"train_data\": [\"hdfs:///user/submarine/data/tr.libsvm\"], \"valid_data\": [\"hdfs:///user/submarine/data/va.libsvm\"], \"test_data\": [\"hdfs:///user/submarine/data/te.libsvm\"], \"type\": \"libsvm\" }, \"output\": { \"save_model_dir\": \"hdfs:///user/submarine/deepfm\", \"metric\": \"auc\" }, \"training\": { \"batch_size\" : 512, \"field_size\": 39, \"num_epochs\": 3, \"feature_size\": 117581, ... } } Copy The template will be (in yaml format): # deepfm.ctr template name: deepfm.ctr author: description: > This is a template to run CTR training using deepfm algorithm, by default it runs single node TF job, you can also overwrite training parameters to use distributed training. parameters: - name: input.train_data required: true description: > train data is expected in SVM format, and can be stored in HDFS/S3 ... - name: training.batch_size required: false default: 32 description: This is batch size of training Copy The batch format can be used in UI/API. "},{"title":"Handle Predefined-experiment-template from server side","type":1,"pageTitle":"Experiment Implementation","url":"docs/designDocs/experiment-implementation#handle-predefined-experiment-template-from-server-side","content":"Please note that, the conversion of predefined-experiment-template will be always handled by server. 
The invoke flow looks like:  +------------Submarine Server -----------------------+ +--------------+ | +-----------------+ | |Client |+------->|Experimment Mgr | | | | | | | | +--------------+ | +-----------------+ | | + | Submit | +-------v---------+ Get Experiment Template | Template | |Experiment |<-----+From pre-registered | Parameters | |Template Registry| Templates | to Submarine | +-------+---------+ | Server | | | | +-------v---------+ +-----------------+ | | |Deepfm CTR Templ-| |Experiment- | | | |ate Handler +------>|Tensorflow | | | +-----------------+ +--------+--------+ | | | | | | | | +--------v--------+ | | |Experiment | | | |Submitter | | | +--------+--------+ | | | | | | | | +--------v--------+ | | | | | | | ...... | | | +-----------------+ | | | +----------------------------------------------------+ Copy Basically, from Client, it submitted template parameters to Submarine Server, inside submarine server, it finds the corresponding template handler based on the name. And the template handler converts input parameters to an actual experiment, such as a distributed TF experiment. After that, it goes the similar route to validate experiment spec, compute cluster manager, etc. to get the experiment submitted and monitored. 
A predefined experiment template can create any kind of experiment; for example, it could be a pipeline:  +-----------------+ +------------------+ |Template XYZ | | XYZ Template | | |+---------------> | Handler | +-----------------+ +------------------+ + | | | | v +--------------------+ +------------------+ | +-----------------+| | Predefined | | | Split Train/ ||<----+| Pipeline | | | Test data || +------------------+ | +-------+---------+| | | | | +-------v---------+| | | Spark Job ETL || | | || | +-------+---------+| | | | | +-------v---------+| | | Train using || | | XGBoost || | +-------+---------+| | | | | +-------v---------+| | | Validate Train || | | Results || | +-----------------+| | | +--------------------+ Copy Templates can also be chained to reuse other template handlers:  +-----------------+ +------------------+ |Template XYZ | | XYZ Template | | |+---------------> | Handler | +-----------------+ +------------------+ + | v +------------------+ +------------------+ |Distributed | | ABC Template | |TF Experiment |<----+| Handler | +------------------+ +------------------+ Copy Template Handler is a callable class inside Submarine Server with a standard interface defined as follows: interface ExperimentTemplateHandler { ExperimentSpec createExperiment(TemplatedExperimentParameters param); } Copy Users should not have to write code when they want to add a new template; we should provide several standard template handlers that cover most template handling. Experiment templates can be registered/updated/deleted via Submarine Server's REST API, which needs to be discussed separately in the doc. (TODO) "},{"title":"Implementation Notes","type":0,"sectionRef":"#","url":"docs/designDocs/implementation-notes","content":"Before digging into implementation details, you should read architecture-and-requirements first to understand the overall requirements and architecture. 
Here are the subtopics of Submarine implementation: Submarine Storage: How to store metadata, logs, metrics, etc. of Submarine. Submarine Environment: How environments are created, managed, and stored in Submarine. Submarine Experiment: How experiments are managed and stored, and how the predefined experiment template works. Submarine Notebook: How notebook sessions are managed and stored. Submarine Server: How Submarine server is designed, its architecture, implementation notes, etc. Work-in-progress designs: The designs below are still work in progress; we will move them to the section above once design and review are finished: Submarine HA Design: How Submarine HA can be achieved, using RAFT, etc. Submarine services deployment module: How to deploy Submarine services to k8s, YARN or cloud. ","keywords":""},{"title":"Notebook Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/notebook-implementation","content":"","keywords":""},{"title":"Overview","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#overview","content":""},{"title":"User's interaction","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#users-interaction","content":"Users can start N (N >= 0) notebook sessions; a notebook session is a running notebook instance. A notebook session can be launched from Submarine UI (P0) or Submarine CLI (P2). When launching a notebook session, users can choose the T-shirt size of the session (how much mem/cpu/gpu resources, or a resource profile such as small, medium, large, etc.). (P0) Users can also choose an environment for the notebook; for more details, please refer to environment implementation. (P0) When starting a notebook, users can choose what code to initialize, similar to an experiment. (P1) Optionally, users can choose to attach a persistent volume to a notebook session. 
(P2) Users can get a list of the notebook sessions that belong to them, and connect to a notebook session. Users can choose to terminate a running notebook session. "},{"title":"Admin's interaction","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#admins-interaction","content":"The number of concurrent notebook sessions each user can launch is determined by the user's resource quota limits and by the maximum number of concurrent notebook sessions allowed per user. (P2) "},{"title":"Relationship with other components","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#relationship-with-other-components","content":""},{"title":"Metadata store","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#metadata-store","content":"Metadata of running notebook sessions needs to be persisted in Submarine's metadata store (Database). "},{"title":"Submarine Server","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#submarine-server","content":" +--------------+ +--------Submarine Server--------------------+ |Submarine UI | | +-------------------+ | | |+---> Submarine | | | Notebook | | | Notebook REST API| | +--------------+ | | | | | +--------+----------+ +--------------+ | | | +->|Metastore | | | +--------v----------+ | |DB | | | | Submarine +--+ +--------------+ | | | Notebook Mgr | | | | | | | | | | | +--------+----------+ | | | | +----------|---------------------------------+ | +--------------+ +--------v---------+ | Notebook Session | | | | instance | | | +------------------+ Copy Once a user launches a notebook session from Submarine UI, the Submarine notebook manager inside Submarine Server persists the notebook session's metadata and launches a new notebook session instance. 
"},{"title":"Resource manager","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#resource-manager","content":"When using K8s as the resource manager, a Submarine notebook session runs as a new POD. "},{"title":"Storage","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#storage","content":"There are several different types of storage requirements for Submarine notebooks. For storage of code, environments, etc., please refer to storage implementation, section \"Localization of experiment/notebook/model-serving code\". When a volume (such as the user's home folder) needs to be attached to a Submarine notebook session, please refer to storage implementation, section \"Attachable volume\". "},{"title":"Environment","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#environment","content":"Submarine notebook's environment should be used to run experiments, model serving, etc. Please check environment implementation. (For notebook specifics, please check \"How to implement to make user can easily use Submarine environments\") Please note that a notebook's environment should include the right version of the notebook libraries, and admins should follow the guidance to build the correct Docker image and Conda libraries to run the notebook correctly. "},{"title":"Submarine SDK (For Experiment, etc.)","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#submarine-sdk-for-experiment-etc","content":"Users can run new experiments, access metrics information, or do model operations using Submarine SDK. Submarine SDK is a Python library that talks to Submarine Server; it needs the Submarine Server endpoint as well as user credentials. To ensure a better experience, we recommend always installing the proper version of Submarine SDK in the environment, so that users can use Submarine SDK directly from the command line. 
(The Submarine community can provide a sample Dockerfile or Conda environment that has the correct base libraries installed for Submarine SDK.) The Submarine Server IP will be configured automatically by Submarine Server and added as an environment variable when the Submarine notebook session is launched. "},{"title":"Security","type":1,"pageTitle":"Notebook Implementation","url":"docs/designDocs/notebook-implementation#security","content":"Please refer to Security Implementation. Once a user has access to a running notebook session, the user can also access the resources of the notebook, submit new experiments, and access data. This is very dangerous, so we have to protect it. A simple solution is to use token-based authentication: https://jupyter-notebook.readthedocs.io/en/stable/security.html. A more common way is to use solutions like KNOX to support SSO. We need to expand this section with more details. (TODO) "},{"title":"Storage Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/storage-implementation","content":"","keywords":""},{"title":"ML-related objects and their storages","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#ml-related-objects-and-their-storages","content":"First, let's look at what users interact with most of the time: Notebook, Experiment, Model Servings  +---------+ +------------+ |Logs |<--+|Notebook | +----------+ +---------+ +------------+ +----------------+ |Trackings | <-+|Experiment |<--+>|Model Artifacts | +----------+ +-----------------+ +------------+ +----------------+ +----------+<---+|ML-related Metric|<--+Servings | |tf.events | +-----------------+ +------------+ +----------+ ^ +-----------------+ + | Environments | +----------------------+ | | +-----------------+ | Submarine Metastore | | Dependencies | |Code | +----------------------+ | | +-----------------+ |Experiment Meta | | Docker Images | +----------------------+ +-----------------+ |Model Store Meta | +----------------------+ 
|Model Serving Meta | +----------------------+ |Notebook meta | +----------------------+ |Experiment Templates | +----------------------+ |Environments Meta | +----------------------+ Copy All notebook sessions / experiments / model-serving instances interact, more or less, with the following storage objects: Logs of these tasks, for troubleshooting. ML-related metrics such as loss, epoch, etc. (in contrast to system metrics such as CPU/memory usage). There are different types of ML-related metrics: TensorFlow/PyTorch workloads can use tf.events and get visualizations on TensorBoard, while non-TF/PyTorch workloads can use tracking APIs (such as Submarine tracking, MLflow tracking, etc.) to output customized tracking results. Training jobs of an experiment typically generate model artifacts (files) that need to be persisted, and both notebooks and model serving need to load model artifacts from persistent storage. There are various kinds of meta information, such as experiment meta, model registry, model serving, notebook, experiment, environment, etc. We need to be able to read this meta information back. We also have code for experiments (like training/batch-prediction), notebooks (ipynb), and model servings. And notebooks/experiments/model-serving depend on environments (dependencies such as pip packages, and Docker images). 
"},{"title":"Implementation considerations for ML-related objects","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#implementation-considerations-for-ml-related-objects","content":"Object Type\tCharacteristics\tWhere to store Metrics: tf.events\tTime series data with k/v, appendable to file\tLocal/EBS, HDFS, Cloud Blob Storage Metrics: other tracking metrics\tTime series data with k/v, appendable to file\tLocal, HDFS, Cloud Blob Storage, Database Logs\tLarge volume; the number of files is potentially huge.\tLocal (temporary), HDFS (need aggregation), Cloud Blob Storage Submarine Metastore\tCRUD operations for small meta data.\tDatabase Model Artifacts\tSize varies per model (from KBs to GBs); the number of files is potentially huge.\tHDFS, Cloud Blob Storage Code\tNeeds version control. (See the detailed discussions below for code storage and localization)\tTarball on HDFS/Cloud Blob Storage, or Git Environment (Dependencies, Docker Image) Public/private environment repo (like Conda channel), Docker registry. "},{"title":"Detailed discussions","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#detailed-discussions","content":"Store code for experiment/notebook/model-serving# There are the following ways to get experiment code: 1) Code is part of a Git repo (recommended): This is our recommended approach. Once code is in Git, it is under version control, every change is tracked, and it is much easier for users to trace back which change triggered a new bug, etc. 2) Code is part of the Docker image: This is an anti-pattern and we do NOT recommend it. A Docker image can include ANYTHING, like dependencies, the code you will execute, or even data, but this doesn't mean you should do it. We recommend using Docker images ONLY for libraries/dependencies. 
Making code part of the Docker image makes it hard to edit the code (if you want to update a value in your Python file, you have to recreate the Docker image, push it and rerun it). 3) Code is part of S3/HDFS/ABFS: Users may want to store their training code as a tarball on shared storage. Submarine needs to download the code from remote storage to the launched container before running it. Localization of experiment/notebook/model-serving code# To keep the user experience the same across different environments, we will localize code to the same folder after the container is launched, preferably /code. For example, suppose there's a Git repo that needs to be synced for an experiment/notebook/model-serving (example above): experiment: #Or notebook, model-serving name: \"abc\" environment: \"team-default-ml-env\" ... (other fields) code: sync_mode: git url: \"https://foo.com/training-job.git\" Copy After localization, training-job/ will be placed under /code. When running on K8s, we can use K8s's initContainer and emptyDir to do this for us. K8s POD spec (generated by Submarine server instead of the user; users should NEVER edit the K8s spec, that's too unfriendly to data scientists): apiVersion: v1 kind: Pod metadata: name: experiment-abc spec: containers: - name: experiment-task image: training-job volumeMounts: - name: code-dir mountPath: /code initContainers: - name: git-localize image: git-sync command: \"git clone .. /code/\" volumeMounts: - name: code-dir mountPath: /code volumes: - name: code-dir emptyDir: {} Copy The above K8s spec creates a code-dir volume and mounts it at /code in the launched containers. The initContainer git-localize uses https://github.com/kubernetes/git-sync to do the sync. 
(If other storage such as S3 is used, we can use a similar initContainer approach to download the contents.) "},{"title":"System-related metrics/logs and their storages","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#system-related-metricslogs-and-their-storages","content":"Other than ML-related objects, we have system-related objects, including: Daemon logs (like logs of Submarine server). Logs of other dependency components (like Kubernetes logs when running on K8s). System metrics (physical resource usage by daemons, launched training containers, etc.).  All of this information should be handled by 3rd party systems, such as Grafana, Prometheus, etc. System admins are responsible for setting up this infrastructure and the dashboards. Users of Submarine should NOT interact with system-related metrics/logs; that is the system admin's responsibility. "},{"title":"Attachable Volumes","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#attachable-volumes","content":"Users may need an attachable volume for their experiment / notebook. This is especially useful for notebook storage, since the contents of a notebook can be saved automatically, and the volume can be used as the user's home folder. The downside of an attachable volume is that it is not versioned: even though notebooks are mainly used for ad hoc exploration, an unversioned notebook file can lead to maintenance issues in the future. Since this is a common requirement, we can consider supporting attachable volumes in Submarine in the long run, but with relatively lower priority. "},{"title":"In-scope / Out-of-scope","type":1,"pageTitle":"Storage Implementation","url":"docs/designDocs/storage-implementation#in-scope--out-of-scope","content":"Describe what the Submarine project should own and what it should NOT own. 
"},{"title":"Submarine Server Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/submarine-server/architecture","content":"","keywords":""},{"title":"Architecture Overview","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#architecture-overview","content":" +---------------Submarine Server ---+ | | | +------------+ +------------+ | | |Web Svc/Prxy| |Backend Svc | | +--Submarine Asset + | +------------+ +------------+ | |Project/Notebook | | ^ ^ | |Model/Metrics | +---|---------|---------------------+ |Libraries/Dataset | | | +------------------+ | | | +--|-Compute Cluster 1---+ +--Image Registry--+ + | | | | User's Images | User / | + | | | Admin | User Notebook Instance | +------------------+ | Experiment Runs | +------------------------+ +-Data Storage-----+ | S3/HDFS, etc. | +----Compute Cluster 2---+ | | +------------------+ ... Copy Here's a diagram illustrating Submarine's deployment. Submarine Server consists of a web service/proxy and backend services. They act as the \"control plane\" of Submarine, and users interact with these services. Submarine server could use a microservice architecture and can be deployed to one of the compute clusters (see below; this is useful when there is only one cluster). Multiple compute clusters can be used by the Submarine service. Users' running notebook instances, jobs, etc. are placed on one of the compute clusters according to the user's preference or defined policies. Submarine's assets, including project/notebook(content)/models/metrics/dataset-meta, etc., can be stored inside Submarine's own database. Datasets can be stored in various locations such as S3/HDFS. Users can push container (such as Docker) images to a preconfigured registry in Submarine, so the Submarine service knows how to pull the required container images. Image Registry/Data-Storage, etc. 
are outside of Submarine server's scope and should be managed by 3rd party applications. "},{"title":"Submarine Server and its APIs","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#submarine-server-and-its-apis","content":"Submarine server is designed to allow data scientists to access notebooks, submit/manage jobs, manage models, create model training workflows, access datasets, etc. Submarine Server exposes a UI and a REST API. Users can also use the CLI / SDK to manage assets inside Submarine Server.  +----------+ | CLI |+---+ +----------+ v +----------------+ +--------------+ | Submarine | +----------+ | REST API | | | | SDK |+>| |+> Server | +----------+ +--------------+ | | ^ +----------------+ +----------+ | | UI |+---+ +----------+ Copy The REST API is used by the other 3 approaches (CLI/SDK/UI). The REST API service handles HTTP requests and is responsible for authentication. It acts as the caller of the JobManager component. The REST component defines the generic job spec, which describes the detailed info about a job. For more details, refer to here. (Please note that we're converting the REST endpoint description from the Java-based REST API to a swagger definition; once that is done, we should replace the link with the swagger definition spec.) 
"},{"title":"Proposal","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#proposal","content":" +---------------------+ +-----------+ | +--------+ +----+ | | | | |runtime1+-->+job1| | | workbench +---+ +----------------------------------+ | +--------+ +----+ | | | | | +------+ +---------------------+ | +-->+ +--------+ +----+ | +-----------+ | | | | | +------+ +-------+ | | | | |runtime2+-->+job2| | | | | | | | YARN | | K8s | | | | | +--------+ +----+ | +-----------+ | | | | | +------+ +-------+ | | | | YARN Cluster | | | | | | | | submitter | | | +---------------------+ | CLI +------>+ | REST | +---------------------+ +---+ | | | | | | +---------------------+ | | +---------------------+ +-----------+ | | | | | +-------+ +-------+ | | | | +--------+ +----+ | | | | | | |PlugMgr| |monitor| | | | | | +-->+job1| | +-----------+ | | | | | +-------+ +-------+ | | | | | | +----+ | | | | | | | | JobManager | | +-->+ |operator| +----+ | | SDK +---+ | +------+ +---------------------+ | | | +-->+job2| | | | +----------------------------------+ | +--------+ +----+ | +-----------+ | K8s Cluster | client server +---------------------+ Copy We propose to split the original core module in the old layout into two modules, CLI and server as shown in FIG. The submarine-client calls the REST APIs to submit and retrieve the job info. The submarine-server provides the REST service, job management, submitting the job to cluster, and running job in different clusters through the corresponding runtime. 
"},{"title":"Submarine Server Components","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#submarine-server-components","content":" +----------------------Submarine Server--------------------------------+ | +-----------------+ +------------------+ +--------------------+ | | | Experiment | |Notebook Session | |Environment Mgr | | | | Mgr | |Mgr | | | | | +-----------------+ +------------------+ +--------------------+ | | | | +-----------------+ +------------------+ +--------------------+ | | | Model Registry | |Model Serving Mgr | |Compute Cluster Mgr | | | | | | | | | | | +-----------------+ +------------------+ +--------------------+ | | | | +-----------------+ +------------------+ +--------------------+ | | | DataSet Mgr | |User/Team | |Metadata Mgr | | | | | |Permission Mgr | | | | | +-----------------+ +------------------+ +--------------------+ | +----------------------------------------------------------------------+ Copy "},{"title":"Experiment Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#experiment-manager","content":"TODO "},{"title":"Notebook Sessions Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#notebook-sessions-manager","content":"TODO "},{"title":"Environment Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#environment-manager","content":"TODO "},{"title":"Model Registry","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#model-registry","content":"TODO "},{"title":"Model Serving Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#model-serving-manager","content":"TODO "},{"title":"Compute Cluster Manager","type":1,"pageTitle":"Submarine Server 
Implementation","url":"docs/designDocs/submarine-server/architecture#compute-cluster-manager","content":"TODO "},{"title":"Dataset Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#dataset-manager","content":"TODO "},{"title":"User/team permissions manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#userteam-permissions-manager","content":"TODO "},{"title":"Metadata Manager","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#metadata-manager","content":"TODO "},{"title":"Components/services outside of Submarine Server's scope","type":1,"pageTitle":"Submarine Server Implementation","url":"docs/designDocs/submarine-server/architecture#componentsservices-outside-of-submarine-servers-scope","content":"TODO: Describe which components are out of scope and should be handled and managed outside of Submarine server. Candidates are: Identity management, data storage, metastore storage, etc. "},{"title":"Generic Expeiment Spec","type":0,"sectionRef":"#","url":"docs/designDocs/submarine-server/experimentSpec","content":"","keywords":""},{"title":"Motivation","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#motivation","content":"As a machine learning platform, Submarine should support multiple machine learning frameworks, such as TensorFlow, PyTorch, etc. But different frameworks have different distributed components for a training experiment. So we designed a generic experiment spec to abstract the training experiment across different frameworks. 
In this way, the submarine-server can hide the complexity of underlying infrastructure differences and provide a cleaner interface to manage experiments. "},{"title":"Proposal","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#proposal","content":"Considering the TensorFlow and PyTorch frameworks, we propose one spec which consists of a library spec, a submitter spec, task specs, etc. For example: name: \"mnist\" librarySpec: name: \"TensorFlow\" version: \"2.1.0\" image: \"apache/submarine:tf-mnist-with-summaries-1.0\" cmd: \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\" envVars: ENV_1: \"ENV1\" submitterSpec: type: \"k8s\" namespace: \"submarine\" taskSpecs: Ps: name: tensorflow replicas: 2 resources: \"cpu=4,memory=2048M,nvidia.com/gpu=1\" Worker: name: tensorflow replicas: 2 resources: \"cpu=4,memory=2048M,nvidia.com/gpu=1\" Copy "},{"title":"Library Spec","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#library-spec","content":"The library spec describes the machine learning framework. All the fields are listed below: field\ttype\toptional\tdescription name\tstring\tNO\tMachine learning framework name. Only \"tensorflow\" and \"pytorch\" are supported. The value is case-insensitive. version\tstring\tNO\tThe version of the ML framework, such as 2.1.0 image\tstring\tNO\tThe shared (default) image used by each task that does not specify its own, such as apache/submarine cmd\tstring\tYES\tThe shared (default) entry cmd used by each task that does not specify its own. envVars\tkey/value\tYES\tThe shared (default) env vars used by each task that does not specify its own. 
"},{"title":"Submitter Spec","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#submitter-spec","content":"It describes the submitter specified by the user, such as yarn, yarnservice or k8s. 
All the fields are listed below: field\ttype\toptional\tdescription type\tstring\tNO\tThe submitter type; only k8s is supported now. configPath\tstring\tYES\tThe config path of the specified resource manager. You can set it in submarine-site.xml if you run submarine-server locally. namespace\tstring\tNO\tKnown as queue in Apache Hadoop YARN and namespace in Kubernetes. kind\tstring\tYES\tUsed by the k8s submitter; supports TFJob and PyTorchJob. apiVersion\tstring\tYES\tMust pair with the kind; for example, the TFJob's API version is kubeflow.org/v1 "},{"title":"Task Spec","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#task-spec","content":"It describes the tasks that make up the experiment, so it must be specified when submitting the experiment. All tasks should be put into a key/value collection, such as: taskSpecs: Ps: name: tensorflow replicas: 2 resources: \"cpu=4,memory=2048M,nvidia.com/gpu=1\" Worker: name: tensorflow replicas: 2 resources: \"cpu=4,memory=2048M,nvidia.com/gpu=1\" Copy All the fields are listed below: field\ttype\toptional\tdescription name\tstring\tYES\tThe experiment name; if not specified, the library name is used. image\tstring\tYES\tThe experiment docker image cmd\tstring\tYES\tThe entry command for running the task envVars\tkey/value\tYES\tThe environment variables for the task resources\tstring\tNO\tThe resource limit for the task. 
Format: cpu=%s,memory=%s,nvidia.com/gpu=%s "},{"title":"Implements","type":1,"pageTitle":"Generic Expeiment Spec","url":"docs/designDocs/submarine-server/experimentSpec#implements","content":"For more info see SUBMARINE-321 "},{"title":"Security Implementation","type":0,"sectionRef":"#","url":"docs/designDocs/wip-designs/security-implementation","content":"","keywords":""},{"title":"Handle User's Credential","type":1,"pageTitle":"Security Implementation","url":"docs/designDocs/wip-designs/security-implementation#handle-users-credential","content":"User credentials include Kerberos keytabs, Docker registry credentials, GitHub ssh-keys, etc. User credentials must be stored securely, for example, via Keycloak or K8s Secrets. (More details TODO) "},{"title":"Submarine Launcher","type":0,"sectionRef":"#","url":"docs/designDocs/wip-designs/submarine-launcher","content":"","keywords":""},{"title":"Introduction","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#introduction","content":"Submarine is built and run as a cloud-native application, taking full advantage of the cloud computing model. Cloud-native applications are characterized by rapid and frequent build, release, and deployment. Combined with the features of cloud computing, they are decoupled from the underlying hardware and operating system, can easily meet the requirements of scalability, availability, and portability, and are more economical. 
In the enterprise data center, Submarine can support three resource scheduling systems: k8s, YARN, and docker. In the public cloud environment, Submarine can support the cloud services of GCE/AWS/Azure. "},{"title":"Requirement","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#requirement","content":""},{"title":"Cloud-Native Service","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#cloud-native-service","content":"The Submarine server is a long-running service in daemon mode. It mainly provides algorithm engineers with online front-end functions such as algorithm development, algorithm debugging, data processing, and workflow scheduling. It is also used for back-end functions such as scheduling and execution of jobs, tracking of job status, and so on. Rolling upgrades let us provide better system stability. For example, we can upgrade or restart the workbench server without affecting the normal operation of submitted jobs. System resources can also be fully utilized. For example, when the number of current developers or job tasks increases, the number of Submarine server instances can be adjusted dynamically. In addition, Submarine will provide each user with a completely independent workspace container. This workspace container comes with the development tools and libraries commonly used by algorithm engineers, including their runtime environment, already deployed. Algorithm engineers can work in our prepared workspaces without any extra work. Each user's workspace can also be run through a cloud service. 
"},{"title":"Service discovery","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#service-discovery","content":"With the cluster function of Submarine, each service only needs to run in a container, and it will automatically register itself in the Submarine cluster center. Submarine cluster management will automatically maintain the relationships between services, and between services and users. "},{"title":"Design","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#design","content":" "},{"title":"Launcher","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher","content":"The Submarine launcher module defines the complete interface. By using this interface, you can run the Submarine server and workspaces on k8s / yarn / docker / AWS / GCE / Azure. "},{"title":"Launcher On Docker","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-docker","content":"To allow small and medium-sized users without k8s/yarn to use Submarine, we support running the Submarine system in docker mode. Users only need to provide several servers with a docker runtime environment. The Submarine system can automatically group these servers into a cluster, manage all the hardware resources of the cluster, and run service or workspace containers in this cluster through scheduling algorithms. 
"},{"title":"Launcher On Kubernetes","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-kubernetes","content":"submarine operator "},{"title":"Launcher On Yarn","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-yarn","content":"[TODO] "},{"title":"Launcher On AWS","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-aws","content":"[TODO] "},{"title":"Launcher On GCP","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-gcp","content":"[TODO] "},{"title":"Launcher On Azure","type":1,"pageTitle":"Submarine Launcher","url":"docs/designDocs/wip-designs/submarine-launcher#launcher-on-azure","content":"[TODO] "},{"title":"Cluster Server Design - High-Availability","type":0,"sectionRef":"#","url":"docs/designDocs/wip-designs/submarine-clusterServer","content":"","keywords":""},{"title":"Below is existing proposal:","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#below-is-existing-proposal","content":""},{"title":"Introduction","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#introduction","content":"The Submarine system contains a total of two daemon services, Submarine Server and Workbench Server. Submarine Server mainly provides job submission, job scheduling, job status monitoring, and model online service for Submarine. Workbench Server is mainly for algorithm users to provide algorithm development, Python/Spark interpreter operation, and other services through Notebook. The goal of the Submarine project is to provide high availability and high-reliability services for big data processing, algorithm development, job scheduling, model online services, model batch, and incremental updates. 
In addition to the high availability of the big data and machine learning frameworks, the high availability of Submarine Server and Workbench Server themselves is a key consideration. "},{"title":"Requirement","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#requirement","content":""},{"title":"Cluster Metadata Center","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#cluster-metadata-center","content":"Multiple Submarine (or Workbench) Server processes create a Submarine cluster through the RAFT algorithm library. The cluster internally maintains a metadata center, and all servers can operate on the metadata. The RAFT algorithm ensures that when multiple processes modify the same data at the same time, the modification does not cause problems such as mutual overwriting or dirty data. The metadata center stores data as key-value pairs. It can store a variety of data, but note that it is only suitable for small amounts of data and cannot be used as a replacement for data storage. "},{"title":"Service discovery","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#service-discovery","content":"By storing the information of a service or process in the metadata center, we can easily find that information from anywhere, for example, the IP address and port of the process in which the Python interpreter runs. Because this information is stored in the metadata, other services can easily find a process through its process ID and connect to it, which provides service discovery. 
"},{"title":"Cluster event","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#cluster-event","content":"In the entire Submarine cluster, the servers can communicate with each other and other child processes to send cluster events to each other. The service or process processes the corresponding programs according to the cluster events. For example, the Workbench Server can be managed to Python. The interpreter process sends a shutdown event that controls the operation of the services and individual subprocesses throughout the cluster. Cluster events support both broadcast and separate delivery capabilities. "},{"title":"Independence","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#independence","content":"We implement Submarine's clustering capabilities through the RAFT algorithm library, without relying on any external services (e.g. Zookeeper, Etcd, etc.) "},{"title":"Disadvantages","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#disadvantages","content":"Because the RAFT algorithm requires more than half of the servers available to ensure the normality of the RAFT algorithm, if we need to turn on the clustering capabilities of Submarine (Workbench) Server, when more than half of the servers are unavailable, some programs may appear abnormal. Of course, we also detected this in the system, downgrading the system or refusing to provide service status. 
"},{"title":"System design","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#system-design","content":""},{"title":"Universal design","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#universal-design","content":"Modular design, Submarine (Workbench) Server exists in the Submarine system, these two services need to provide clustering capabilities, so we abstract the cluster function into a separate module for development so that Submarine (Workbench) Server can reuse the cluster function module. "},{"title":"ClusterConfigure","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#clusterconfigure","content":"Add a submarine.server.addr and workbench.server.addr configuration items in submarine-site.xml, submarine.server.addr=ip1, ip2, ip3, through the IP list, the RAFT algorithm module in the server process can Cluster with other server processes. "},{"title":"ClusterServer","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#clusterserver","content":"The ClusterServer module encapsulates the RAFT algorithm module, which can create a service cluster and read and write metadata based on the two configuration items submarine.server.addr or workbench.server.addr. The cluster management service runs in each submarine server; The cluster management service establishes a cluster by using the atomix RaftServer class of the Raft algorithm library, maintains the ClusterStateMachine, and manages the service state metadata of each submarine server through the PutCommand, GetQuery, and DeleteCommand operation commands. 
"},{"title":"ClusterClient","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#clusterclient","content":"The ClusterClient module encapsulates the RAFT algorithm client module, which can communicate with the cluster according to the two configuration items submarine.server.addr or workbench.server.addr, read and write metadata, and write the IP and port information of the client process. Into the cluster's metadata center. The cluster management client runs in each submarine server and submarine Interpreter process; The cluster management client manages the submarine server and submarine Interpreter process state (metadata information) in the ClusterStateMachine by using the atomix RaftClient class of the Raft library to connect to the atomix RaftServer. When the submarine server and Submarine Interpreter processes are started, they are added to the ClusterStateMachine and are removed from the ClusterStateMachine when the Submarine Server and Submarine Interpreter processes are closed. 
"},{"title":"ClusterMetadata","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#clustermetadata","content":"Metadata stores metadata information in a KV key-value pair。 ServerMeta：key='host:port'，value= {SERVER_HOST=...，SERVER_PORT=...，...} Name\tDescriptionSUBMARINE_SERVER_HOST\tSubmarine server IP SUBMARINE_SERVER_PORT\tSubmarine server port WORKBENCH_SERVER_HOST\tSubmarine workbench server IP WORKBENCH_SERVER_PORT\tSubmarine workbench server port InterpreterMeta：key=InterpreterGroupId，value={INTP_TSERVER_HOST=...，...} Name\tDescriptionINTP_TSERVER_HOST\tSubmarine Interpreter Thrift IP INTP_TSERVER_PORT\tSubmarine Interpreter Thrift port INTP_START_TIME\tSubmarine Interpreter start time HEARTBEAT\tSubmarine Interpreter heartbeat time "},{"title":"Network fault tolerance","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#network-fault-tolerance","content":"In a distributed environment, there may be network anomalies, network delays, or service exceptions. After submitting metadata to the cluster, check whether the submission is successful. After the submission fails, save the metadata in the local message queue. A separate commit thread to retry; "},{"title":"Cluster monitoring","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#cluster-monitoring","content":"The cluster needs to monitor whether the Submarine Server and Submarine-Interpreter processes are working properly. The Submarine Server and Submarine Interpreter processes periodically send heartbeats to update their own timestamps in the cluster metadata. The Submarine Server with Leader identity periodically checks the timestamps of the Submarine Server and Submarine Interpreter processes to clear the timeout services and processes. 
The cluster monitoring module runs in each Submarine Server and Submarine Interpreter process and periodically sends heartbeat data of the service or process to the cluster. When the cluster monitoring module runs in Submarine Server, it sends the heartbeat to the cluster's ClusterStateMachine. If the cluster does not receive heartbeat information for a long time, the service or process is considered abnormal and unavailable. Resource usage statistics strategy: to smooth out momentary peaks and troughs in server load, cluster monitoring reports the average resource usage over the most recent period, so that server resources are used as reasonably and effectively as possible. When the cluster monitoring module runs in the Submarine Server, it checks the heartbeat data of each Submarine Server and Submarine Interpreter process. If a heartbeat times out, it considers the service or process abnormal and unavailable and removes it from the cluster. "},{"title":"Atomix Raft algorithm library","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#atomix-raft-algorithm-library","content":"To reduce the deployment complexity of distributed mode, Submarine Server does not use Zookeeper to build a distributed cluster. Instead, multiple Submarine Servers are grouped into a distributed cluster by using the Raft algorithm inside Submarine Server. The Raft algorithm is provided by the Atomix library, which has passed Jepsen consistency verification. "},{"title":"Synchronize workbench notes","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#synchronize-workbench-notes","content":"In cluster mode, a user can create, modify, or delete a note on any of the servers. Every such change must be sent to all servers in the cluster so that they synchronize the Notebook update. 
Failure to do so will leave the user unable to continue working after switching to another server. "},{"title":"Listen for note update events","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#listen-for-note-update-events","content":"Listen for the NEW_NOTE, DEL_NOTE, REMOVE_NOTE_TO_TRASH ... events of the notebook in the NotebookServer#onMessage() function. "},{"title":"Broadcast note update event","type":1,"pageTitle":"Cluster Server Design - High-Availability","url":"docs/designDocs/wip-designs/submarine-clusterServer#broadcast-note-update-event","content":"The note is refreshed by sending the event to all Submarine servers in the cluster via the messaging service. "},{"title":"How to Build Submarine","type":0,"sectionRef":"#","url":"docs/devDocs/BuildFromCode","content":"","keywords":""},{"title":"Prerequisites","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#prerequisites","content":"JDK 1.8; Maven 3.3 or later (3.6.2 is known to fail, see SUBMARINE-273); Docker "},{"title":"Quick Start","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#quick-start","content":""},{"title":"Build Your Custom Submarine Docker Images","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#build-your-custom-submarine-docker-images","content":"Submarine provides default Docker images in the release artifacts, but sometimes you may want to modify them. You can rebuild the Docker images after you make changes. Note that the images you build must be accessible from k8s. Usually this means renaming them and pushing them to a suitable Docker registry. 
mvn clean package -DskipTests Build the Submarine server image: ./dev-support/docker-images/submarine/build.sh Build the Submarine database image: ./dev-support/docker-images/database/build.sh "},{"title":"Building source code / binary distribution","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#building-source-code--binary-distribution","content":"Check releases for licenses: mvn clean org.apache.rat:apache-rat-plugin:check Create a binary distribution with the default Hadoop version: mvn clean package -DskipTests Create a binary distribution with Hadoop 2.9.x: mvn clean package -DskipTests -Phadoop-2.9 Create a binary distribution with Hadoop 2.10.x: mvn clean package -DskipTests -Phadoop-2.10 Create a binary distribution with Hadoop 3.1.x: mvn clean package -DskipTests -Phadoop-3.1 Create a binary distribution with Hadoop 3.2.x: mvn clean package -DskipTests -Phadoop-3.2 Create a source code distribution: mvn clean package -DskipTests -Psrc "},{"title":"Building source code / binary distribution with Maven Wrapper","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#building-source-code--binary-distribution-with-maven-wrapper","content":"Maven Wrapper (optional): Maven Wrapper helps you avoid problems caused by the installed Maven version. # Set up Maven Wrapper (Maven 3.6.1) mvn -N io.takari:maven:0.7.7:wrapper -Dmaven=3.6.1 # Check Maven Wrapper ./mvnw -version # Replace 'mvn' with 'mvnw'. Example: ./mvnw clean package -DskipTests "},{"title":"TonY code modification","type":1,"pageTitle":"How to Build Submarine","url":"docs/devDocs/BuildFromCode#tony-code-modification","content":"If you need to modify the TonY project, please open a PR against the TonY repository. 
"},{"title":"How to Run Integration Test","type":0,"sectionRef":"#","url":"docs/devDocs/IntegrationTest","content":"","keywords":""},{"title":"Introduction","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#introduction","content":"Now, Apache Submarine supports two kinds of integration test: test-e2e and test-k8s. These two modules can be found in the submarine/submarine-test directory. Currently, there are some differences between test-e2e and test-k8s in operation mode. To elaborate, test-e2e needs to deploy Apache Submarine locally, while test-k8s deploys Apache Submarine via k8s. These two test modules can be applied to different test scenarios. (In the future, these two test modules may be combined or adjusted) "},{"title":"k8s test","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#k8s-test","content":"k8s test: When the user submits the code to his/her repository or the apache/submarine git repository, the travis test task will automatically start. test-k8s runs test cases in travis. It will first create a k8s cluster by using the kind tool in travis, and then compile and package the submarine project in submarine-dist directory to build a docker image. Then use this latest code to build a docker image and deploy a submarine system in k8s. Then run test case in the test-k8s/.. directory. 
"},{"title":"Run k8s test in locally","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#run-k8s-test-in-locally","content":"Executing the following command will perform the following actions: mvn -Phadoop-2.9 clean package install -DskipTests verify -DskipRat -am -pl submarine-test/test-k8s Copy The submarine project will be compiled and packaged to generate submarine-dist/target/submarine-<version>.tar.gz Call the submarine-cloud/hack/integration-test.sh script Call the build.sh script under submarine/dev-support/docker-images/ to generate the latest submarine, database and operator docker images.Call submarine-cloud/hack/kind-cluster-build.sh to create a k8s clusterCall submarine-cloud/hack/deploy-submarine.sh to deploy the submarine system in the k8s cluster using the latest submarine, database and operator docker images.Call the test cases in submarine-test/test-k8s/ for testing. "},{"title":"Run k8s test in travis","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#run-k8s-test-in-travis","content":"Each time a code is submitted, travis is automatically triggered for testing. 
"},{"title":"E2E test","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#e2e-test","content":""},{"title":"E2E tests can be executed both locally and in Travis (For workbench developer)","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#e2e-tests-can-be-executed-both-locally-and-in-travis-for-workbench-developer","content":"Run E2E tests locally: Step1: Follow HowToRun.md to launch the submarine-server and database.Step2: Run workbench (Angular version) locally cd submarine/submarine-workbench/workbench-web npm start // Check 127.0.0.1:4200 Copy Step3: Modify the port from 8080 to 4200 WebDriverManager.java: url = \"http://localhost:8080\"; --> url = \"http://localhost:4200\";Your Unit test case: 8080 --> 4200 Step4: Comment the headless option ChromeWebDriverProvider.java: chromeOptions.addArguments(\"--headless\"); --> //chromeOptions.addArguments(\"--headless\");With the headless option, the selenium will be executed in background. Step5: Run E2E test cases (Please check the following section Run the existing tests) Run E2E tests in Travis: Step1: Make sure that the port must be 8080 rather than in WebDriverManager.java and all test cases.Step2: Make sure that the headless option is not commented in ChromeWebDriverProvider.java.Step3: If you push the commit to Github, the Travis CI will execute automatically and you can check it in https://travis-ci.org/${your_github_account}/${your_repo_name}. "},{"title":"Run the existing tests.","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#run-the-existing-tests","content":"Move to the working directory.# cd submarine/submarine-test/test-e2e Copy Compile & Run.# Following command will compile all files and run all files ending with \"IT\" in the directory. 
For Linux: mvn verify For macOS: mvn clean install -U Run a specific test case: mvn -Dtest=${your_test_case_file_name} test //ex: mvn -Dtest=loginIT test Result# If all of the tests succeed, it will show: BUILD SUCCESS Otherwise, it will show: BUILD FAILURE "},{"title":"Add your own integration test","type":1,"pageTitle":"How to Run Integration Test","url":"docs/devDocs/IntegrationTest#add-your-own-integration-test","content":"Create a new file ending with \"IT\" under \"submarine/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/\". Your public class should preferably extend AbstractSubmarineIT. The class AbstractSubmarineIT contains some commonly used functions. WebElement pollingWait(final By locator, final long timeWait); // Find an element on the website. void clickAndWait(final By locator); // Click an element and wait for 1 second. void sleep(long millis, boolean logOutput); // Sleep for a period of time. There are also some commonly used functions outside AbstractSubmarineIT.java. // In WebDriverManager.java: public static WebDriver getWebDriver(); // Returns a Firefox WebDriver that has been pointed at your workbench website. Add JUnit annotations before your test functions, e.g., @BeforeClass, @Test, and @AfterClass. You can refer to loginIT.java. Use the commands mentioned above to compile and run your test and check that it works as you expect. "},{"title":"Development Guide","type":0,"sectionRef":"#","url":"docs/devDocs/Development","content":"","keywords":""},{"title":"Overview","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#overview","content":"From Getting Started/Submarine Local Deployment, you already know that Submarine is installed and uninstalled by Helm. 
As you can see from kubectl get pods, there are six major components in Submarine: notebook-controller, pytorch-operator, submarine-database, submarine-server, submarine-traefik and tf-job-operator. They are launched as pods in Kubernetes from the corresponding Docker images. Some of the components are borrowed from other projects (Kubeflow, Traefik): notebook-controller, pytorch-operator, submarine-traefik and tf-job-operator. The rest, submarine-database and submarine-server, are built by ourselves. The purposes of the components are as follows: tf-job-operator: manages the operation of TensorFlow jobs; pytorch-operator: manages the operation of PyTorch jobs; notebook-controller: manages the operation of notebook instances; submarine-traefik: manages the ingress service; submarine-database: stores metadata in a MySQL database; submarine-server: handles API requests, submits jobs to the container orchestrator, and connects to the database. In this document, we only focus on the last two components. You can learn how to develop the server, database, and workbench here. "},{"title":"Video","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#video","content":"From This Video, you will learn how to configure Submarine and how to contribute to it via GitHub. "},{"title":"Develop server","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#develop-server","content":""},{"title":"Prerequisites","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#prerequisites","content":"JDK 1.8; Maven 3.3 or later (3.6.2 is known to fail, see SUBMARINE-273); Docker "},{"title":"Setting up checkstyle in IDE","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#setting-up-checkstyle-in-ide","content":"The Checkstyle plugin can help detect violations directly from the IDE. 
Install the Checkstyle+IDEA plugin from Preference -> Plugins. Open Preference -> Tools -> Checkstyle -> Set Checkstyle version: Checkstyle version: 8.0; Add (+) a new Configuration File; Description: Submarine; Use a local checkstyle file: ${SUBMARINE_HOME}/dev-support/maven-config/checkstyle.xml. Open the Checkstyle Tool Window, select the Submarine rule and execute the check. "},{"title":"Testing","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#testing","content":"Unit Test: For each class, there is a corresponding test class. For example, SubmarineServerTest is used for testing SubmarineServer. Whenever you add a function to a class, you must write a unit test for it. Integration Test: See IntegrationTest.md "},{"title":"Build from source","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#build-from-source","content":"Before building: We assume the developer uses minikube as a local Kubernetes cluster. Make sure you have installed the submarine helm-chart in the cluster. Package the Submarine server into a new jar file: mvn package -DskipTests Build the new server Docker image in minikube: # switch to minikube docker daemon to build image directly in minikube eval $(minikube docker-env) # run docker build ./dev-support/docker-images/submarine/build.sh # exit minikube docker daemon eval $(minikube docker-env -u) Update the server pod: helm upgrade --set submarine.server.dev=true submarine ./helm-charts/submarine Setting submarine.server.dev to true lets the server pod be launched with the new Docker image. "},{"title":"Develop workbench","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#develop-workbench","content":"Deploy Submarine: Follow Getting Started/Submarine Local Deployment, and make sure you can connect to http://localhost:32080 in the browser. 
Install the dependencies: cd submarine-workbench/workbench-web npm install Run the workbench with the proxy server: npm run start Requests sent to http://localhost:4200 will be redirected to http://localhost:32080. Open http://localhost:4200 in a browser to see workbench changes in real time. "},{"title":"Develop database","type":1,"pageTitle":"Development Guide","url":"docs/devDocs/Development#develop-database","content":"Build the Docker image: # switch to minikube docker daemon to build image directly in minikube eval $(minikube docker-env) # run docker build ./dev-support/docker-images/database/build.sh # exit minikube docker daemon eval $(minikube docker-env -u) Deploy new pods in the cluster: helm upgrade --set submarine.database.dev=true submarine ./helm-charts/submarine "},{"title":"Download Apache Submarine","type":0,"sectionRef":"#","url":"docs/download","content":"","keywords":""},{"title":"Verify the integrity of the files","type":1,"pageTitle":"Download Apache Submarine","url":"docs/download#verify-the-integrity-of-the-files","content":"It is essential that you verify the integrity of the downloaded files using the PGP signatures or MD5 checksums. The signature should be verified against the KEYS file. 
gpg --import KEYS gpg --verify submarine-dist-X.Y.Z-src.tar.gz.asc Copy "},{"title":"Old releases","type":1,"pageTitle":"Download Apache Submarine","url":"docs/download#old-releases","content":"Apache Submarine 0.4.0 released on Jul 05, 2020 (release notes) (git tag) Binary package with submarine:submarine-dist-0.4.0-hadoop-2.9.tar.gz (550 MB,checksum,signature)Source:submarine-dist-0.4.0-src.tar.gz (6 MB,checksum,signature)Docker images:mini-submarine (guide) Apache Submarine 0.3.0 released on Feb 01, 2020 (release notes) (git tag) Binary package with submarine:submarine-dist-0.3.0-hadoop-2.9.tar.gz (550 MB,checksum,signature)Source:submarine-dist-0.3.0-src.tar.gz (6 MB,checksum,signature)Docker images:mini-submarine (guide) Apache Submarine 0.2.0 released on Jul 2, 2019 Binary package with submarine:hadoop-submarine-0.2.0.tar.gz (111 MB,checksum,signature,Announcement) Source:hadoop-submarine-0.2.0-src.tar.gz (1.4 MB,checksum,signature) Apache Submarine 0.1.0 released on Jan 16, 2019 Binary package with submarine:submarine-0.2.0-bin-all.tgz (97 MB,checksum,signature,Announcement) Source:submarine-hadoop-3.2.0-src.tar.gz (1.1 MB,checksum,signature) "},{"title":"Project Architecture","type":0,"sectionRef":"#","url":"docs/devDocs/README","content":"","keywords":""},{"title":"1. Introduction","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#1-introduction","content":"This document mainly describes the structure of each module of the Submarine project, the development and test description of each module. "},{"title":"2. Submarine Project Structure","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#2-submarine-project-structure","content":""},{"title":"2.1. submarine-client","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#21-submarine-client","content":"Provide the CLI interface for submarine user. (Currently only support YARN service) "},{"title":"2.2. 
submarine-cloud","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#22-submarine-cloud","content":"Defines the Submarine operator. (Work in progress) "},{"title":"2.3. submarine-commons","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#23-submarine-commons","content":"Defines utility functions used in multiple packages, mainly related to Hadoop. "},{"title":"2.4. submarine-dist","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#24-submarine-dist","content":"Stores the pre-release files. "},{"title":"2.5. submarine-sdk","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#25-submarine-sdk","content":"Provides the Python SDK for Submarine users. "},{"title":"2.6. submarine-security","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#26-submarine-security","content":"Provides authorization for Apache Spark when talking to Ranger Admin. "},{"title":"2.7. submarine-server","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#27-submarine-server","content":"Includes the core server, RESTful API, and k8s/YARN submitters. "},{"title":"2.8. submarine-test","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#28-submarine-test","content":"Provides end-to-end and k8s tests for Submarine. "},{"title":"2.9. submarine-workbench","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#29-submarine-workbench","content":"workbench-server: a Jetty-based web server. It provides a RESTful interface and a WebSocket interface. The RESTful interface gives workbench-web management capabilities for database entities such as project, department, user, and role. workbench-web: a web front-end service based on the Angular.js framework. With workbench-web, users can manage Submarine projects, departments, users, and roles through a browser. They can also use the notebook to develop machine learning algorithms, release models, and perform other lifecycle management. 
"},{"title":"2.10 dev-support","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#210-dev-support","content":"mini-submarine: by using the docker image provided by Submarine, you can experience all the functions of Submarine in a single docker environment, while mini-submarine also provides developers with a development and testing environment, Avoid the hassle of installing and deploying the runtime environment.submarine-installer: submarine-installer is our submarine runtime environment installation tool for yarn-3.1+ and above.By using submarine-installer, it is easy to install and deploy system services such asdocker, nvidia-docker, nvidia driver, ETCD, Calico network etc. required by yarn-3.1+. "},{"title":"3. Submarine Workbench Development Guide","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#3-submarine-workbench-development-guide","content":"Submarine workbench consists of three modules: workbench-server, workbench-web and database. First, you need to clone the entire Submarine project: git clone https://github.com/apache/submarine.git Copy "},{"title":"3.1 Database","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#31-database","content":"Submarine selects mysql as the workbench database, and stores the table structure and information of all the data in workbench in mysql. Please browse the project's Submarine Database Guide documentation and follow the instructions to install a mysql database via docker in your development and test environment. "},{"title":"3.2 Workbench-web","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#32-workbench-web","content":"Install dependencies You only need to execute the following command when you run workbench-web for the first time, so you can follow the depend. environment of node.js in the workbench-web directory. 
cd submarine-workbench/workbench-web yarn install Copy The node.js dependency library will be installed in the node_modules directory. node_modules does not need to be uploaded to the git repository. We have excluded it in the submarine/.gitignore file. You can clean this by manually deleting the directory or executing mvn clean. Compiles and hot-reloads for development yarn run build Copy By executing the above command, workbench-web will publish the web page to the workbench-web/dist directory. Later we will also add the ability to package workbench-web as a WAR file, so that a single workbench-web.war file is enough to release submarine workbench. Lints and fixes files When you write Angular and js files in workbench-web in IDEA, IDEA can't format these files well, so you need to execute the following command to format them and avoid warnings during the yarn build. yarn run lint Copy In fact, you must execute this command when you develop and submit any workbench-web feature. Otherwise, your change is likely to fail the code specification check we set up in Travis. "},{"title":"3.3 Workbench-server","type":1,"pageTitle":"Project Architecture","url":"docs/devDocs/README#33-workbench-server","content":"Workbench-server has a built-in jetty server service, so you don't need to install any web container service. You can start submarine workbench by launching workbench-server directly in IDEA. Run / Debug : In IDEA, add a Run/Debug Configuration, Main Class: select org.apache.submarine.server.SubmarineServer, Use classpath of module: select submarine-server-core. You can then run or debug submarine-workbench at http://127.0.0.1:8080. Note: Because workbench-web/dist is the webapp default directory of the workbench-server Jetty Server, the workbench-server will automatically load the workbench-web/dist directory after startup. 
The webapp directory used by workbench-server (workbench-web/dist) is configured via workbench-site.xml, but we do not recommend modifying it. The IP address (127.0.0.1) and port (8080) through which submarine-workbench is accessible locally are also configured via workbench-site.xml, and we do not recommend modifying them either. When you modify the Angular or js code of workbench-web, you need to execute the yarn run build command in the workbench-web directory to update the dist directory, so that you can see the effect of your changes in the workbench. "},{"title":"WriteDockerfileKaldi","type":0,"sectionRef":"#","url":"docs/ecosystem/kaldi/WriteDockerfileKaldi","content":"","keywords":""},{"title":"Creating Docker Images for Running Kaldi on YARN","type":1,"pageTitle":"WriteDockerfileKaldi","url":"docs/ecosystem/kaldi/WriteDockerfileKaldi#creating-docker-images-for-running-kaldi-on-yarn","content":""},{"title":"How to create docker images to run Kaldi on YARN","type":1,"pageTitle":"WriteDockerfileKaldi","url":"docs/ecosystem/kaldi/WriteDockerfileKaldi#how-to-create-docker-images-to-run-kaldi-on-yarn","content":"A Dockerfile to run Kaldi on YARN needs two parts: Base libraries which Kaldi depends on 1) OS base image, for example nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 2) Libraries and packages Kaldi depends on, for example python, g++, make. GPU support also needs cuda, cudnn, etc. 3) Kaldi compilation. 
Libraries to access HDFS 1) JDK 2) Hadoop Here's an example of a base image (with GPU support) to install Kaldi: FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 RUN apt-get clean && \\ apt-get update && \\ apt-get install -y --no-install-recommends \\ sudo \\ openjdk-8-jdk \\ iputils-ping \\ g++ \\ make \\ automake \\ autoconf \\ bzip2 \\ unzip \\ wget \\ sox \\ libtool \\ git \\ subversion \\ python2.7 \\ python3 \\ zlib1g-dev \\ ca-certificates \\ patch \\ ffmpeg \\ vim && \\ rm -rf /var/lib/apt/lists/* && \\ ln -s /usr/bin/python2.7 /usr/bin/python RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \\ cd /opt/kaldi && \\ cd /opt/kaldi/tools && \\ ./extras/install_mkl.sh && \\ make -j $(nproc) && \\ cd /opt/kaldi/src && \\ ./configure --shared --use-cuda && \\ make depend -j $(nproc) && \\ make -j $(nproc) Copy On top of the above image, add files and install packages to access HDFS RUN apt-get update && apt-get install -y openjdk-8-jdk wget # Install hadoop ENV HADOOP_VERSION=\"3.2.1\" ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \\ tar zxf hadoop-${HADOOP_VERSION}.tar.gz && \\ ln -s hadoop-${HADOOP_VERSION} hadoop-current && \\ rm hadoop-${HADOOP_VERSION}.tar.gz Copy Build and push to your own docker registry: Use docker build ... and docker push ... to finish this step. "},{"title":"Use examples to build your own Kaldi docker images","type":1,"pageTitle":"WriteDockerfileKaldi","url":"docs/ecosystem/kaldi/WriteDockerfileKaldi#use-examples-to-build-your-own-kaldi-docker-images","content":"We provide the following examples for you to build Kaldi docker images. For latest Kaldi: base/ubuntu-18.04/Dockerfile.gpu.kaldi_latest: latest Kaldi that supports GPU, prebuilt against CUDA 10, with models. 
"},{"title":"Build Docker images","type":1,"pageTitle":"WriteDockerfileKaldi","url":"docs/ecosystem/kaldi/WriteDockerfileKaldi#build-docker-images","content":"Manually build Docker image:# Under the docker/ directory, CLUSTER_NAME can be modified in build-all.sh to have installation permissions. Then run build-all.sh to build the Docker images. It will build the following image: kaldi-latest-gpu-base:0.0.1, the base Docker image, which includes Hadoop, Kaldi, GPU base libraries, and the thchs30 model. Use prebuilt images# (No liability) You can also use prebuilt images for convenience in the docker hub: hadoopsubmarine/kaldi-latest-gpu-base:0.0.1 "},{"title":"Notebook Tutorial","type":0,"sectionRef":"#","url":"docs/gettingStarted/notebook","content":"","keywords":""},{"title":"Working with notebooks","type":1,"pageTitle":"Notebook Tutorial","url":"docs/gettingStarted/notebook#working-with-notebooks","content":"We recommend using the Web UI to manage notebooks. "},{"title":"Notebooks Web UI","type":1,"pageTitle":"Notebook Tutorial","url":"docs/gettingStarted/notebook#notebooks-web-ui","content":"Notebooks can be started from the Web UI. You can click the “Notebook” tab in the left-hand panel to manage your notebooks. To create a new notebook server, click “New Notebook”. You should see a form for entering details of your new notebook server. Notebook Name : Name of the notebook server. It should follow the rules below. Contain at most 63 characters. Contain only lowercase alphanumeric characters or '-'. Start with an alphabetic character. End with an alphanumeric character. Environment : It defines a set of libraries and docker image. CPU and Memory. GPU (optional). EnvVar (optional) : Injects environment variables into the notebook. If you’re not sure which environment you need, please choose the environment “notebook-env” for the new notebook. You should see your new notebook server. Click the name of your notebook server to connect to it. 
"},{"title":"Experiment with your notebook","type":1,"pageTitle":"Notebook Tutorial","url":"docs/gettingStarted/notebook#experiment-with-your-notebook","content":"The environment “notebook-env” includes the Submarine Python SDK, which can talk to Submarine Server to create experiments, as in the example below: from __future__ import print_function import submarine from submarine.experiment.models.environment_spec import EnvironmentSpec from submarine.experiment.models.experiment_spec import ExperimentSpec from submarine.experiment.models.experiment_task_spec import ExperimentTaskSpec from submarine.experiment.models.experiment_meta import ExperimentMeta from submarine.experiment.models.code_spec import CodeSpec # Create Submarine Client submarine_client = submarine.ExperimentClient() # Define TensorFlow experiment spec environment = EnvironmentSpec(image='apache/submarine:tf-dist-mnist-test-1.0') experiment_meta = ExperimentMeta(name='mnist-dist', namespace='default', framework='Tensorflow', cmd='python /var/tf_dist_mnist/dist_mnist.py --train_steps=100', env_vars={'ENV1': 'ENV1'}) worker_spec = ExperimentTaskSpec(resources='cpu=1,memory=1024M', replicas=1) ps_spec = ExperimentTaskSpec(resources='cpu=1,memory=1024M', replicas=1) code_spec = CodeSpec(sync_mode='git', url='https://github.com/apache/submarine.git') experiment_spec = ExperimentSpec(meta=experiment_meta, environment=environment, code=code_spec, spec={'Ps' : ps_spec,'Worker': worker_spec}) # Create experiment experiment = submarine_client.create_experiment(experiment_spec=experiment_spec) Copy You can create a new notebook, paste the above code and run it. Or, you can find the notebook submarine_experiment_sdk.ipynb inside the launched notebook session. You can open it and try it out. After the experiment is submitted to Submarine server, you can find the experiment jobs on the UI. 
"},{"title":"RunningDistributedThchs30KaldiJobs","type":0,"sectionRef":"#","url":"docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs","content":"","keywords":""},{"title":"Prepare data for training","type":1,"pageTitle":"RunningDistributedThchs30KaldiJobs","url":"docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs#prepare-data-for-training","content":"Thchs30 is a common benchmark in machine learning for speech data and transcripts. The example below is based on the Thchs30 dataset. 1) Download the tgz files: THCHS30_PATH=/data/hdfs1/nfs/aisearch/kaldi/thchs30 mkdir -p $THCHS30_PATH/data && cd $THCHS30_PATH/data wget http://www.openslr.org/resources/18/data_thchs30.tgz wget http://www.openslr.org/resources/18/test-noise.tgz wget http://www.openslr.org/resources/18/resource.tgz Copy 2) Checkout https://github.com/apache/submarine.git: git clone https://github.com/apache/submarine.git Copy 3) Go to submarine/docker/ecosystem/ cp -r ./kaldi/sge $THCHS30_PATH/sge Copy 4) Optional: modify /opt/kaldi/egs/thchs30/s5/cmd.sh in the container. This queue is used by default: export train_cmd=\"queue.pl -q all.q\" Copy Warning: Please note that YARN service doesn't allow multiple services with the same name, so please run the following command yarn application -destroy <service-name> Copy to delete services if you want to reuse the same service name. 
"},{"title":"Prepare Docker images","type":1,"pageTitle":"RunningDistributedThchs30KaldiJobs","url":"docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs#prepare-docker-images","content":"Refer to Write Dockerfile to build a Docker image or use prebuilt one: hadoopsubmarine/kaldi-latest-gpu-base:0.0.1 "},{"title":"Run Kaldi jobs","type":1,"pageTitle":"RunningDistributedThchs30KaldiJobs","url":"docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs#run-kaldi-jobs","content":""},{"title":"Run distributed training","type":1,"pageTitle":"RunningDistributedThchs30KaldiJobs","url":"docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs#run-distributed-training","content":"# Change the variables according to your needs SUBMARINE_VERSION=3.3.0-SNAPSHOT WORKER_NUM=2 SGE_CFG_PATH=/cfg THCHS30_PATH=/data/hdfs1/nfs/aisearch/kaldi/thchs30 DOCKER_HADOOP_HDFS_HOME=/app/${SUBMARINE_VERSION} # Dependent on registrydns, you must fill in < your RegistryDNSIP> in resolv.conf yarn jar /usr/local/matrix/share/hadoop/yarn/${SUBMARINE_VERSION}.jar \\ job run --name kaldi-thchs30-distributed \\ --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \\ --env DOCKER_HADOOP_HDFS_HOME=$DOCKER_HADOOP_HDFS_HOME \\ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \\ --env PYTHONUNBUFFERED=\"0\" \\ --env TZ=\"Asia/Shanghai\" \\ --env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=${THCHS30_PATH}/sge/resolv.conf:/etc/resolv.conf,\\ ${THCHS30_PATH}/sge/passwd:/etc/passwd:rw,\\ ${THCHS30_PATH}/sge/group:/etc/group:rw,\\ ${THCHS30_PATH}/sge:$SGE_CFG_PATH,\\ ${THCHS30_PATH}/data:/opt/kaldi/egs/thchs30,\\ ${THCHS30_PATH}/mul/s5:/opt/kaldi/egs/mul-thchs30/s5 \\ --input_path /opt/kaldi/egs/thchs30/data \\ --docker_image hadoopsubmarine/kaldi-latest-gpu-base:0.0.1 \\ --num_workers $WORKER_NUM \\ --worker_resources memory=64G,vcores=32,gpu=1 \\ --worker_launch_cmd \"sudo mkdir -p /opt/kaldi/egs/mul-thchs30/s5 && \\ sudo cp /opt/kaldi/egs/thchs30/s5/* /opt/kaldi/egs/mul-thchs30/s5 -r && 
\\ cluster_user=`whoami` domain_suffix=\"ml.com\" && \\ cd /cfg && bash sge_run.sh $WORKER_NUM $SGE_CFG_PATH && \\ if [ $(echo $HOST_NAME |grep \"^master-\") ]; then sleep 2m && cd /opt/kaldi/egs/mul-thchs30/s5 && ./run.sh; fi\" \\ --verbose Copy Explanations: >1 num_workers indicates it is a distributed training. Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU. We don't need a parameter server here. For the meaning of the individual parameters, see the QuickStart page! Outputs of distributed training Sample output of master: ... Reading package lists... Building dependency tree... Reading state information... The following additional packages will be installed: bsd-mailx cpio gridengine-common ifupdown iproute2 isc-dhcp-client isc-dhcp-common libatm1 libdns-export162 libisc-export160 liblockfile-bin liblockfile1 libmnl0 libxmuu1 libxtables11 ncurses-term netbase openssh-client openssh-server openssh-sftp-server postfix python3-chardet python3-pkg-resources python3-requests python3-six python3-urllib3 ssh-import-id ssl-cert tcsh xauth Suggested packages: libarchive1 gridengine-qmon ppp rdnssd iproute2-doc resolvconf avahi-autoipd isc-dhcp-client-ddns apparmor ssh-askpass libpam-ssh keychain monkeysphere rssh molly-guard ufw procmail postfix-mysql postfix-pgsql postfix-ldap postfix-pcre sasl2-bin libsasl2-modules dovecot-common postfix-cdb postfix-doc python3-setuptools python3-ndg-httpsclient python3-openssl python3-pyasn1 openssl-blacklist The following NEW packages will be installed: bsd-mailx cpio gridengine-client gridengine-common gridengine-exec gridengine-master ifupdown iproute2 isc-dhcp-client isc-dhcp-common libatm1 libdns-export162 libisc-export160 liblockfile-bin liblockfile1 libmnl0 libxmuu1 libxtables11 ncurses-term netbase openssh-client openssh-server openssh-sftp-server postfix python3-chardet python3-pkg-resources python3-requests python3-six python3-urllib3 
ssh-import-id ssl-cert tcsh xauth 0 upgraded, 33 newly installed, 0 to remove and 30 not upgraded. Need to get 12.1 MB of archives. After this operation, 65.8 MB of additional disk space will be used. Get:1 http://archive.ubuntu.com/ubuntu xenial/main amd64 libatm1 amd64 1:2.5.1-1.5 [24.2 kB] Get:2 http://archive.ubuntu.com/ubuntu xenial/main amd64 libmnl0 amd64 1.0.3-5 [12.0 kB] Get:3 http://archive.ubuntu.com/ubuntu xenial/main amd64 liblockfile-bin amd64 1.09-6ubuntu1 [10.8 kB] Get:4 http://archive.ubuntu.com/ubuntu xenial/main amd64 liblockfile1 amd64 1.09-6ubuntu1 [8056 B] Get:5 http://archive.ubuntu.com/ubuntu xenial/main amd64 cpio amd64 2.11+dfsg-5ubuntu1 [74.8 kB] Get:6 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 iproute2 amd64 4.3.0-1ubuntu3.16.04.5 [523 kB] Get:7 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 ifupdown amd64 0.8.10ubuntu1.4 [54.9 kB] Get:8 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libisc-export160 amd64 1:9.10.3.dfsg.P4-8ubuntu1.15 [153 kB] Get:9 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libdns-export162 amd64 1:9.10.3.dfsg.P4-8ubuntu1.15 [665 kB] Get:10 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 isc-dhcp-client amd64 4.3.3-5ubuntu12.10 [224 kB] Get:11 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 isc-dhcp-common amd64 4.3.3-5ubuntu12.10 [105 kB] Get:12 http://archive.ubuntu.com/ubuntu xenial/main amd64 libxtables11 amd64 1.6.0-2ubuntu3 [27.2 kB] Get:13 http://archive.ubuntu.com/ubuntu xenial/main amd64 netbase all 5.3 [12.9 kB] Get:14 http://archive.ubuntu.com/ubuntu xenial/main amd64 libxmuu1 amd64 2:1.1.2-2 [9674 B] Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-client amd64 1:7.2p2-4ubuntu2.8 [590 kB] Get:16 http://archive.ubuntu.com/ubuntu xenial/main amd64 xauth amd64 1:1.0.9-1ubuntu2 [22.7 kB] Get:17 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssl-cert all 1.0.37 [16.9 kB] Get:18 
http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 postfix amd64 3.1.0-3ubuntu0.3 [1152 kB] Get:19 http://archive.ubuntu.com/ubuntu xenial/main amd64 bsd-mailx amd64 8.1.2-0.20160123cvs-2 [63.7 kB] Get:20 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-common all 6.2u5-7.4 [156 kB] Get:21 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-client amd64 6.2u5-7.4 [3394 kB] Get:22 http://archive.ubuntu.com/ubuntu xenial/universe amd64 tcsh amd64 6.18.01-5 [410 kB] Get:23 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-exec amd64 6.2u5-7.4 [990 kB] Get:24 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-master amd64 6.2u5-7.4 [2429 kB] Get:25 http://archive.ubuntu.com/ubuntu xenial/main amd64 ncurses-term all 6.0+20160213-1ubuntu1 [249 kB] Get:26 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-sftp-server amd64 1:7.2p2-4ubuntu2.8 [38.9 kB] Get:27 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-server amd64 1:7.2p2-4ubuntu2.8 [335 kB] Get:28 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-pkg-resources all 20.7.0-1 [79.0 kB] Get:29 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-chardet all 2.3.0-2 [96.2 kB] Get:30 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-six all 1.10.0-3 [11.0 kB] Get:31 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 python3-urllib3 all 1.13.1-2ubuntu0.16.04.3 [58.5 kB] Get:32 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 python3-requests all 2.9.1-3ubuntu0.1 [55.8 kB] Get:33 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssh-import-id all 5.5-0ubuntu1 [10.2 kB] Fetched 12.1 MB in 0s (15.0 MB/s) Selecting previously unselected package libatm1:amd64. (Reading database ... (Reading database ... 5% (Reading database ... 10% (Reading database ... 15% (Reading database ... 20% (Reading database ... 25% (Reading database ... 30% (Reading database ... 35% (Reading database ... 
40% (Reading database ... 45% (Reading database ... 50% (Reading database ... 55% (Reading database ... 60% (Reading database ... 65% (Reading database ... 70% (Reading database ... 75% (Reading database ... 80% (Reading database ... 85% (Reading database ... 90% (Reading database ... 95% (Reading database ... 100% (Reading database ... 21398 files and directories currently installed.) Preparing to unpack .../libatm1_1%3a2.5.1-1.5_amd64.deb ... Unpacking libatm1:amd64 (1:2.5.1-1.5) ... Selecting previously unselected package libmnl0:amd64. Preparing to unpack .../libmnl0_1.0.3-5_amd64.deb ... Unpacking libmnl0:amd64 (1.0.3-5) ... Selecting previously unselected package liblockfile-bin. Preparing to unpack .../liblockfile-bin_1.09-6ubuntu1_amd64.deb ... Unpacking liblockfile-bin (1.09-6ubuntu1) ... Selecting previously unselected package liblockfile1:amd64. Preparing to unpack .../liblockfile1_1.09-6ubuntu1_amd64.deb ... Unpacking liblockfile1:amd64 (1.09-6ubuntu1) ... Selecting previously unselected package cpio. Preparing to unpack .../cpio_2.11+dfsg-5ubuntu1_amd64.deb ... Unpacking cpio (2.11+dfsg-5ubuntu1) ... Selecting previously unselected package iproute2. Preparing to unpack .../iproute2_4.3.0-1ubuntu3.16.04.5_amd64.deb ... Unpacking iproute2 (4.3.0-1ubuntu3.16.04.5) ... Selecting previously unselected package ifupdown. Preparing to unpack .../ifupdown_0.8.10ubuntu1.4_amd64.deb ... Unpacking ifupdown (0.8.10ubuntu1.4) ... Selecting previously unselected package libisc-export160. Preparing to unpack .../libisc-export160_1%3a9.10.3.dfsg.P4-8ubuntu1.15_amd64.deb ... Unpacking libisc-export160 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Selecting previously unselected package libdns-export162. Preparing to unpack .../libdns-export162_1%3a9.10.3.dfsg.P4-8ubuntu1.15_amd64.deb ... Unpacking libdns-export162 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Selecting previously unselected package isc-dhcp-client. Preparing to unpack .../isc-dhcp-client_4.3.3-5ubuntu12.10_amd64.deb ... 
Unpacking isc-dhcp-client (4.3.3-5ubuntu12.10) ... Selecting previously unselected package isc-dhcp-common. Preparing to unpack .../isc-dhcp-common_4.3.3-5ubuntu12.10_amd64.deb ... Unpacking isc-dhcp-common (4.3.3-5ubuntu12.10) ... Selecting previously unselected package libxtables11:amd64. Preparing to unpack .../libxtables11_1.6.0-2ubuntu3_amd64.deb ... Unpacking libxtables11:amd64 (1.6.0-2ubuntu3) ... Selecting previously unselected package netbase. Preparing to unpack .../archives/netbase_5.3_all.deb ... Unpacking netbase (5.3) ... Selecting previously unselected package libxmuu1:amd64. Preparing to unpack .../libxmuu1_2%3a1.1.2-2_amd64.deb ... Unpacking libxmuu1:amd64 (2:1.1.2-2) ... Selecting previously unselected package openssh-client. Preparing to unpack .../openssh-client_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-client (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package xauth. Preparing to unpack .../xauth_1%3a1.0.9-1ubuntu2_amd64.deb ... Unpacking xauth (1:1.0.9-1ubuntu2) ... Selecting previously unselected package ssl-cert. Preparing to unpack .../ssl-cert_1.0.37_all.deb ... Unpacking ssl-cert (1.0.37) ... Selecting previously unselected package postfix. Preparing to unpack .../postfix_3.1.0-3ubuntu0.3_amd64.deb ... Unpacking postfix (3.1.0-3ubuntu0.3) ... Selecting previously unselected package bsd-mailx. Preparing to unpack .../bsd-mailx_8.1.2-0.20160123cvs-2_amd64.deb ... Unpacking bsd-mailx (8.1.2-0.20160123cvs-2) ... Selecting previously unselected package gridengine-common. Preparing to unpack .../gridengine-common_6.2u5-7.4_all.deb ... Unpacking gridengine-common (6.2u5-7.4) ... Selecting previously unselected package gridengine-client. Preparing to unpack .../gridengine-client_6.2u5-7.4_amd64.deb ... Unpacking gridengine-client (6.2u5-7.4) ... Selecting previously unselected package tcsh. Preparing to unpack .../tcsh_6.18.01-5_amd64.deb ... Unpacking tcsh (6.18.01-5) ... 
Selecting previously unselected package gridengine-exec. Preparing to unpack .../gridengine-exec_6.2u5-7.4_amd64.deb ... Unpacking gridengine-exec (6.2u5-7.4) ... Selecting previously unselected package gridengine-master. Preparing to unpack .../gridengine-master_6.2u5-7.4_amd64.deb ... Unpacking gridengine-master (6.2u5-7.4) ... Selecting previously unselected package ncurses-term. Preparing to unpack .../ncurses-term_6.0+20160213-1ubuntu1_all.deb ... Unpacking ncurses-term (6.0+20160213-1ubuntu1) ... Selecting previously unselected package openssh-sftp-server. Preparing to unpack .../openssh-sftp-server_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-sftp-server (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package openssh-server. Preparing to unpack .../openssh-server_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-server (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package python3-pkg-resources. Preparing to unpack .../python3-pkg-resources_20.7.0-1_all.deb ... Unpacking python3-pkg-resources (20.7.0-1) ... Selecting previously unselected package python3-chardet. Preparing to unpack .../python3-chardet_2.3.0-2_all.deb ... Unpacking python3-chardet (2.3.0-2) ... Selecting previously unselected package python3-six. Preparing to unpack .../python3-six_1.10.0-3_all.deb ... Unpacking python3-six (1.10.0-3) ... Selecting previously unselected package python3-urllib3. Preparing to unpack .../python3-urllib3_1.13.1-2ubuntu0.16.04.3_all.deb ... Unpacking python3-urllib3 (1.13.1-2ubuntu0.16.04.3) ... Selecting previously unselected package python3-requests. Preparing to unpack .../python3-requests_2.9.1-3ubuntu0.1_all.deb ... Unpacking python3-requests (2.9.1-3ubuntu0.1) ... Selecting previously unselected package ssh-import-id. Preparing to unpack .../ssh-import-id_5.5-0ubuntu1_all.deb ... Unpacking ssh-import-id (5.5-0ubuntu1) ... Processing triggers for systemd (229-4ubuntu21.22) ... 
Processing triggers for libc-bin (2.23-0ubuntu11) ... Setting up libatm1:amd64 (1:2.5.1-1.5) ... Setting up libmnl0:amd64 (1.0.3-5) ... Setting up liblockfile-bin (1.09-6ubuntu1) ... Setting up liblockfile1:amd64 (1.09-6ubuntu1) ... Setting up cpio (2.11+dfsg-5ubuntu1) ... update-alternatives: using /bin/mt-gnu to provide /bin/mt (mt) in auto mode Setting up iproute2 (4.3.0-1ubuntu3.16.04.5) ... Setting up ifupdown (0.8.10ubuntu1.4) ... Creating /etc/network/interfaces. Setting up libisc-export160 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Setting up libdns-export162 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Setting up isc-dhcp-client (4.3.3-5ubuntu12.10) ... Setting up isc-dhcp-common (4.3.3-5ubuntu12.10) ... Setting up libxtables11:amd64 (1.6.0-2ubuntu3) ... Setting up netbase (5.3) ... Setting up libxmuu1:amd64 (2:1.1.2-2) ... Setting up openssh-client (1:7.2p2-4ubuntu2.8) ... Setting up xauth (1:1.0.9-1ubuntu2) ... Setting up ssl-cert (1.0.37) ... Setting up postfix (3.1.0-3ubuntu0.3) ... Creating /etc/postfix/dynamicmaps.cf setting myhostname: master-0.XXX setting alias maps setting alias database changing /etc/mailname to master-0.XXX setting myorigin setting destinations: $myhostname, master-0.XXX, localhost.XXX, , localhost setting relayhost: setting mynetworks: 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 setting mailbox_size_limit: 0 setting recipient_delimiter: + setting inet_interfaces: all setting inet_protocols: all /etc/aliases does not exist, creating it. WARNING: /etc/aliases exists, but does not have a root alias. Postfix is now set up with a default configuration. If you need to make changes, edit /etc/postfix/main.cf (and others) as needed. To view Postfix configuration values, see postconf(1). After modifying main.cf, be sure to run '/etc/init.d/postfix reload'. Running newaliases invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of restart. Setting up bsd-mailx (8.1.2-0.20160123cvs-2) ... 
update-alternatives: using /usr/bin/bsd-mailx to provide /usr/bin/mailx (mailx) in auto mode Setting up gridengine-common (6.2u5-7.4) ... Creating config file /etc/default/gridengine with new version Setting up gridengine-client (6.2u5-7.4) ... Setting up tcsh (6.18.01-5) ... update-alternatives: using /bin/tcsh to provide /bin/csh (csh) in auto mode Setting up gridengine-exec (6.2u5-7.4) ... invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of start. Setting up gridengine-master (6.2u5-7.4) ... su: Authentication failure (Ignored) Initializing cluster with the following parameters: => SGE_ROOT: /var/lib/gridengine => SGE_CELL: default => Spool directory: /var/spool/gridengine/spooldb => Initial manager user: sgeadmin Initializing spool (/var/spool/gridengine/spooldb) Initializing global configuration based on /usr/share/gridengine/default-configuration Initializing complexes based on /usr/share/gridengine/centry Initializing usersets based on /usr/share/gridengine/usersets Adding user sgeadmin as a manager Cluster creation complete invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of start. Setting up ncurses-term (6.0+20160213-1ubuntu1) ... Setting up openssh-sftp-server (1:7.2p2-4ubuntu2.8) ... Setting up openssh-server (1:7.2p2-4ubuntu2.8) ... Creating SSH2 RSA key; this may take some time ... 2048 SHA256:hfQpES1aS4cjF8AOCIParZR6342vdwutoyITru0wtuE root@master-0.XXX (RSA) Creating SSH2 DSA key; this may take some time ... 1024 SHA256:gOsPMVgwXBHJzixN/gtJAG+hVCHqw8t7Fhy4nsx8od0 root@master-0.XXX (DSA) Creating SSH2 ECDSA key; this may take some time ... 256 SHA256:3D5SNniUb4z+/BuqXheFgG+DfjsxXqTT/zwWAqdX4jM root@master-0.XXX (ECDSA) Creating SSH2 ED25519 key; this may take some time ... 256 SHA256:SwyeV9iSqOW4TKLi4Wvc0zD8lWtupHCJpDu8oWBwbfU root@master-0.XXX (ED25519) invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of start. 
Setting up python3-pkg-resources (20.7.0-1) ... Setting up python3-chardet (2.3.0-2) ... Setting up python3-six (1.10.0-3) ... Setting up python3-urllib3 (1.13.1-2ubuntu0.16.04.3) ... Setting up python3-requests (2.9.1-3ubuntu0.1) ... Setting up ssh-import-id (5.5-0ubuntu1) ... Processing triggers for libc-bin (2.23-0ubuntu11) ... Processing triggers for systemd (229-4ubuntu21.22) ... Reading package lists... Building dependency tree... Reading state information... 0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded. Copy cat $SGE_CFG_PATH/setcfg.log finish master add worker node worker-0.XXX Copy Sample output of worker: please wait Reading package lists... Building dependency tree... Reading state information... The following additional packages will be installed: bsd-mailx cpio gridengine-common ifupdown iproute2 isc-dhcp-client isc-dhcp-common libatm1 libdns-export162 libisc-export160 liblockfile-bin liblockfile1 libmnl0 libxmuu1 libxtables11 ncurses-term netbase openssh-client openssh-server openssh-sftp-server postfix python3-chardet python3-pkg-resources python3-requests python3-six python3-urllib3 ssh-import-id ssl-cert tcsh xauth Suggested packages: libarchive1 gridengine-qmon ppp rdnssd iproute2-doc resolvconf avahi-autoipd isc-dhcp-client-ddns apparmor ssh-askpass libpam-ssh keychain monkeysphere rssh molly-guard ufw procmail postfix-mysql postfix-pgsql postfix-ldap postfix-pcre sasl2-bin libsasl2-modules dovecot-common postfix-cdb postfix-doc python3-setuptools python3-ndg-httpsclient python3-openssl python3-pyasn1 openssl-blacklist The following NEW packages will be installed: bsd-mailx cpio gridengine-client gridengine-common gridengine-exec ifupdown iproute2 isc-dhcp-client isc-dhcp-common libatm1 libdns-export162 libisc-export160 liblockfile-bin liblockfile1 libmnl0 libxmuu1 libxtables11 ncurses-term netbase openssh-client openssh-server openssh-sftp-server postfix python3-chardet python3-pkg-resources python3-requests python3-six 
python3-urllib3 ssh-import-id ssl-cert tcsh xauth 0 upgraded, 32 newly installed, 0 to remove and 30 not upgraded. Need to get 9633 kB of archives. After this operation, 51.2 MB of additional disk space will be used. Get:1 http://archive.ubuntu.com/ubuntu xenial/main amd64 libatm1 amd64 1:2.5.1-1.5 [24.2 kB] Get:2 http://archive.ubuntu.com/ubuntu xenial/main amd64 libmnl0 amd64 1.0.3-5 [12.0 kB] Get:3 http://archive.ubuntu.com/ubuntu xenial/main amd64 liblockfile-bin amd64 1.09-6ubuntu1 [10.8 kB] Get:4 http://archive.ubuntu.com/ubuntu xenial/main amd64 liblockfile1 amd64 1.09-6ubuntu1 [8056 B] Get:5 http://archive.ubuntu.com/ubuntu xenial/main amd64 cpio amd64 2.11+dfsg-5ubuntu1 [74.8 kB] Get:6 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 iproute2 amd64 4.3.0-1ubuntu3.16.04.5 [523 kB] Get:7 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 ifupdown amd64 0.8.10ubuntu1.4 [54.9 kB] Get:8 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libisc-export160 amd64 1:9.10.3.dfsg.P4-8ubuntu1.15 [153 kB] Get:9 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 libdns-export162 amd64 1:9.10.3.dfsg.P4-8ubuntu1.15 [665 kB] Get:10 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 isc-dhcp-client amd64 4.3.3-5ubuntu12.10 [224 kB] Get:11 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 isc-dhcp-common amd64 4.3.3-5ubuntu12.10 [105 kB] Get:12 http://archive.ubuntu.com/ubuntu xenial/main amd64 libxtables11 amd64 1.6.0-2ubuntu3 [27.2 kB] Get:13 http://archive.ubuntu.com/ubuntu xenial/main amd64 netbase all 5.3 [12.9 kB] Get:14 http://archive.ubuntu.com/ubuntu xenial/main amd64 libxmuu1 amd64 2:1.1.2-2 [9674 B] Get:15 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-client amd64 1:7.2p2-4ubuntu2.8 [590 kB] Get:16 http://archive.ubuntu.com/ubuntu xenial/main amd64 xauth amd64 1:1.0.9-1ubuntu2 [22.7 kB] Get:17 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssl-cert all 1.0.37 [16.9 kB] Get:18 
http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 postfix amd64 3.1.0-3ubuntu0.3 [1152 kB] Get:19 http://archive.ubuntu.com/ubuntu xenial/main amd64 bsd-mailx amd64 8.1.2-0.20160123cvs-2 [63.7 kB] Get:20 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-common all 6.2u5-7.4 [156 kB] Get:21 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-client amd64 6.2u5-7.4 [3394 kB] Get:22 http://archive.ubuntu.com/ubuntu xenial/universe amd64 tcsh amd64 6.18.01-5 [410 kB] Get:23 http://archive.ubuntu.com/ubuntu xenial/universe amd64 gridengine-exec amd64 6.2u5-7.4 [990 kB] Get:24 http://archive.ubuntu.com/ubuntu xenial/main amd64 ncurses-term all 6.0+20160213-1ubuntu1 [249 kB] Get:25 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-sftp-server amd64 1:7.2p2-4ubuntu2.8 [38.9 kB] Get:26 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 openssh-server amd64 1:7.2p2-4ubuntu2.8 [335 kB] Get:27 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-pkg-resources all 20.7.0-1 [79.0 kB] Get:28 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-chardet all 2.3.0-2 [96.2 kB] Get:29 http://archive.ubuntu.com/ubuntu xenial/main amd64 python3-six all 1.10.0-3 [11.0 kB] Get:30 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 python3-urllib3 all 1.13.1-2ubuntu0.16.04.3 [58.5 kB] Get:31 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 python3-requests all 2.9.1-3ubuntu0.1 [55.8 kB] Get:32 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssh-import-id all 5.5-0ubuntu1 [10.2 kB] Fetched 9633 kB in 2s (4496 kB/s) Selecting previously unselected package libatm1:amd64. (Reading database ... (Reading database ... 5% (Reading database ... 10% (Reading database ... 15% (Reading database ... 20% (Reading database ... 25% (Reading database ... 30% (Reading database ... 35% (Reading database ... 40% (Reading database ... 45% (Reading database ... 50% (Reading database ... 55% (Reading database ... 
60% (Reading database ... 65% (Reading database ... 70% (Reading database ... 75% (Reading database ... 80% (Reading database ... 85% (Reading database ... 90% (Reading database ... 95% (Reading database ... 100% (Reading database ... 21398 files and directories currently installed.) Preparing to unpack .../libatm1_1%3a2.5.1-1.5_amd64.deb ... Unpacking libatm1:amd64 (1:2.5.1-1.5) ... Selecting previously unselected package libmnl0:amd64. Preparing to unpack .../libmnl0_1.0.3-5_amd64.deb ... Unpacking libmnl0:amd64 (1.0.3-5) ... Selecting previously unselected package liblockfile-bin. Preparing to unpack .../liblockfile-bin_1.09-6ubuntu1_amd64.deb ... Unpacking liblockfile-bin (1.09-6ubuntu1) ... Selecting previously unselected package liblockfile1:amd64. Preparing to unpack .../liblockfile1_1.09-6ubuntu1_amd64.deb ... Unpacking liblockfile1:amd64 (1.09-6ubuntu1) ... Selecting previously unselected package cpio. Preparing to unpack .../cpio_2.11+dfsg-5ubuntu1_amd64.deb ... Unpacking cpio (2.11+dfsg-5ubuntu1) ... Selecting previously unselected package iproute2. Preparing to unpack .../iproute2_4.3.0-1ubuntu3.16.04.5_amd64.deb ... Unpacking iproute2 (4.3.0-1ubuntu3.16.04.5) ... Selecting previously unselected package ifupdown. Preparing to unpack .../ifupdown_0.8.10ubuntu1.4_amd64.deb ... Unpacking ifupdown (0.8.10ubuntu1.4) ... Selecting previously unselected package libisc-export160. Preparing to unpack .../libisc-export160_1%3a9.10.3.dfsg.P4-8ubuntu1.15_amd64.deb ... Unpacking libisc-export160 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Selecting previously unselected package libdns-export162. Preparing to unpack .../libdns-export162_1%3a9.10.3.dfsg.P4-8ubuntu1.15_amd64.deb ... Unpacking libdns-export162 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Selecting previously unselected package isc-dhcp-client. Preparing to unpack .../isc-dhcp-client_4.3.3-5ubuntu12.10_amd64.deb ... Unpacking isc-dhcp-client (4.3.3-5ubuntu12.10) ... Selecting previously unselected package isc-dhcp-common. 
Preparing to unpack .../isc-dhcp-common_4.3.3-5ubuntu12.10_amd64.deb ... Unpacking isc-dhcp-common (4.3.3-5ubuntu12.10) ... Selecting previously unselected package libxtables11:amd64. Preparing to unpack .../libxtables11_1.6.0-2ubuntu3_amd64.deb ... Unpacking libxtables11:amd64 (1.6.0-2ubuntu3) ... Selecting previously unselected package netbase. Preparing to unpack .../archives/netbase_5.3_all.deb ... Unpacking netbase (5.3) ... Selecting previously unselected package libxmuu1:amd64. Preparing to unpack .../libxmuu1_2%3a1.1.2-2_amd64.deb ... Unpacking libxmuu1:amd64 (2:1.1.2-2) ... Selecting previously unselected package openssh-client. Preparing to unpack .../openssh-client_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-client (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package xauth. Preparing to unpack .../xauth_1%3a1.0.9-1ubuntu2_amd64.deb ... Unpacking xauth (1:1.0.9-1ubuntu2) ... Selecting previously unselected package ssl-cert. Preparing to unpack .../ssl-cert_1.0.37_all.deb ... Unpacking ssl-cert (1.0.37) ... Selecting previously unselected package postfix. Preparing to unpack .../postfix_3.1.0-3ubuntu0.3_amd64.deb ... Unpacking postfix (3.1.0-3ubuntu0.3) ... Selecting previously unselected package bsd-mailx. Preparing to unpack .../bsd-mailx_8.1.2-0.20160123cvs-2_amd64.deb ... Unpacking bsd-mailx (8.1.2-0.20160123cvs-2) ... Selecting previously unselected package gridengine-common. Preparing to unpack .../gridengine-common_6.2u5-7.4_all.deb ... Unpacking gridengine-common (6.2u5-7.4) ... Selecting previously unselected package gridengine-client. Preparing to unpack .../gridengine-client_6.2u5-7.4_amd64.deb ... Unpacking gridengine-client (6.2u5-7.4) ... Selecting previously unselected package tcsh. Preparing to unpack .../tcsh_6.18.01-5_amd64.deb ... Unpacking tcsh (6.18.01-5) ... Selecting previously unselected package gridengine-exec. Preparing to unpack .../gridengine-exec_6.2u5-7.4_amd64.deb ... 
Unpacking gridengine-exec (6.2u5-7.4) ... Selecting previously unselected package ncurses-term. Preparing to unpack .../ncurses-term_6.0+20160213-1ubuntu1_all.deb ... Unpacking ncurses-term (6.0+20160213-1ubuntu1) ... Selecting previously unselected package openssh-sftp-server. Preparing to unpack .../openssh-sftp-server_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-sftp-server (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package openssh-server. Preparing to unpack .../openssh-server_1%3a7.2p2-4ubuntu2.8_amd64.deb ... Unpacking openssh-server (1:7.2p2-4ubuntu2.8) ... Selecting previously unselected package python3-pkg-resources. Preparing to unpack .../python3-pkg-resources_20.7.0-1_all.deb ... Unpacking python3-pkg-resources (20.7.0-1) ... Selecting previously unselected package python3-chardet. Preparing to unpack .../python3-chardet_2.3.0-2_all.deb ... Unpacking python3-chardet (2.3.0-2) ... Selecting previously unselected package python3-six. Preparing to unpack .../python3-six_1.10.0-3_all.deb ... Unpacking python3-six (1.10.0-3) ... Selecting previously unselected package python3-urllib3. Preparing to unpack .../python3-urllib3_1.13.1-2ubuntu0.16.04.3_all.deb ... Unpacking python3-urllib3 (1.13.1-2ubuntu0.16.04.3) ... Selecting previously unselected package python3-requests. Preparing to unpack .../python3-requests_2.9.1-3ubuntu0.1_all.deb ... Unpacking python3-requests (2.9.1-3ubuntu0.1) ... Selecting previously unselected package ssh-import-id. Preparing to unpack .../ssh-import-id_5.5-0ubuntu1_all.deb ... Unpacking ssh-import-id (5.5-0ubuntu1) ... Processing triggers for systemd (229-4ubuntu21.22) ... Processing triggers for libc-bin (2.23-0ubuntu11) ... Setting up libatm1:amd64 (1:2.5.1-1.5) ... Setting up libmnl0:amd64 (1.0.3-5) ... Setting up liblockfile-bin (1.09-6ubuntu1) ... Setting up liblockfile1:amd64 (1.09-6ubuntu1) ... Setting up cpio (2.11+dfsg-5ubuntu1) ... 
update-alternatives: using /bin/mt-gnu to provide /bin/mt (mt) in auto mode Setting up iproute2 (4.3.0-1ubuntu3.16.04.5) ... Setting up ifupdown (0.8.10ubuntu1.4) ... Creating /etc/network/interfaces. Setting up libisc-export160 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Setting up libdns-export162 (1:9.10.3.dfsg.P4-8ubuntu1.15) ... Setting up isc-dhcp-client (4.3.3-5ubuntu12.10) ... Setting up isc-dhcp-common (4.3.3-5ubuntu12.10) ... Setting up libxtables11:amd64 (1.6.0-2ubuntu3) ... Setting up netbase (5.3) ... Setting up libxmuu1:amd64 (2:1.1.2-2) ... Setting up openssh-client (1:7.2p2-4ubuntu2.8) ... Setting up xauth (1:1.0.9-1ubuntu2) ... Setting up ssl-cert (1.0.37) ... Setting up postfix (3.1.0-3ubuntu0.3) ... Creating /etc/postfix/dynamicmaps.cf setting myhostname: worker-0.XXX setting alias maps setting alias database changing /etc/mailname to worker-0.XXX setting myorigin setting destinations: $myhostname, worker-0.XXX, localhost.XXX, , localhost setting relayhost: setting mynetworks: 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 setting mailbox_size_limit: 0 setting recipient_delimiter: + setting inet_interfaces: all setting inet_protocols: all /etc/aliases does not exist, creating it. WARNING: /etc/aliases exists, but does not have a root alias. Postfix is now set up with a default configuration. If you need to make changes, edit /etc/postfix/main.cf (and others) as needed. To view Postfix configuration values, see postconf(1). After modifying main.cf, be sure to run '/etc/init.d/postfix reload'. Running newaliases invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of restart. Setting up bsd-mailx (8.1.2-0.20160123cvs-2) ... update-alternatives: using /usr/bin/bsd-mailx to provide /usr/bin/mailx (mailx) in auto mode Setting up gridengine-common (6.2u5-7.4) ... Creating config file /etc/default/gridengine with new version Setting up gridengine-client (6.2u5-7.4) ... Setting up tcsh (6.18.01-5) ... 
update-alternatives: using /bin/tcsh to provide /bin/csh (csh) in auto mode Setting up gridengine-exec (6.2u5-7.4) ... invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of start. Setting up ncurses-term (6.0+20160213-1ubuntu1) ... Setting up openssh-sftp-server (1:7.2p2-4ubuntu2.8) ... Setting up openssh-server (1:7.2p2-4ubuntu2.8) ... Creating SSH2 RSA key; this may take some time ... 2048 SHA256:ok/TxzwtF5W8I55sDxrt4Agy4fuWn39BiSovvDObhVE root@worker-0.XXX (RSA) Creating SSH2 DSA key; this may take some time ... 1024 SHA256:4y48kVYt3mS3q1KgZzEoYMnS/2d/tA8TJUK5uNSaxZY root@worker-0.XXX (DSA) Creating SSH2 ECDSA key; this may take some time ... 256 SHA256:4D7zm4cD2IbDnHoXnzcIo3FISbvOW8eOstGBNf1/bvo root@worker-0.XXX (ECDSA) Creating SSH2 ED25519 key; this may take some time ... 256 SHA256:/HrA3xiZiH5CZkXwtcfE6GwcMM+hEhZzTdFHxj4PzDg root@worker-0.XXX (ED25519) invoke-rc.d: could not determine current runlevel invoke-rc.d: policy-rc.d denied execution of start. Setting up python3-pkg-resources (20.7.0-1) ... Setting up python3-chardet (2.3.0-2) ... Setting up python3-six (1.10.0-3) ... Setting up python3-urllib3 (1.13.1-2ubuntu0.16.04.3) ... Setting up python3-requests (2.9.1-3ubuntu0.1) ... Setting up ssh-import-id (5.5-0ubuntu1) ... Processing triggers for libc-bin (2.23-0ubuntu11) ... Processing triggers for systemd (229-4ubuntu21.22) ... Reading package lists... Building dependency tree... Reading state information... 0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded. Copy cat $SGE_CFG_PATH/setcfg.log please wait Start SGE for worker is finished done for worker-0.XXX worker. 
Copy Sample output of sge:  "},{"title":"Submarine Python SDK","type":0,"sectionRef":"#","url":"docs/gettingStarted/python-sdk","content":"","keywords":""},{"title":"Prepare Python Environment to run Submarine SDK","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#prepare-python-environment-to-run-submarine-sdk","content":"Submarine SDK requires Python 3.7+. To avoid disturbing an existing Python installation, use a fresh environment created with Anaconda or Python virtualenv. A sample Python virtual environment can be set up like this: wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz tar xf virtualenv-16.0.0.tar.gz # Make sure to install using Python 3 python3 virtualenv-16.0.0/virtualenv.py venv . venv/bin/activate Copy "},{"title":"Install Submarine SDK","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#install-submarine-sdk","content":""},{"title":"Install SDK from pypi.org (recommended)","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#install-sdk-from-pypiorg-recommended","content":"Submarine has provided a Python SDK since 0.4.0. Replace <REPLACE_VERSION> with the version you need. More detail: https://pypi.org/project/apache-submarine/ # Install latest stable version pip install apache-submarine # Install specific version pip install apache-submarine==<REPLACE_VERSION> Copy "},{"title":"Install SDK from source code","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#install-sdk-from-source-code","content":"First clone the code from GitHub, or go to http://submarine.apache.org/download.html to download the released source code. git clone https://github.com/apache/submarine.git # (optional) check out a specific branch or release git checkout <correct release tag/branch> cd submarine/submarine-sdk/pysubmarine pip install . 
Copy "},{"title":"Manage Submarine Experiment","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#manage-submarine-experiment","content":"Assuming you've installed Submarine on K8s and forwarded the traefik service to localhost, you can now open a Python shell, a Jupyter notebook, or any other tool with the Submarine SDK installed. Follow the SDK experiment example to run an experiment. "},{"title":"Training a DeepFM model","type":1,"pageTitle":"Submarine Python SDK","url":"docs/gettingStarted/python-sdk#training-a-deepfm-model","content":"Submarine also lets users train an easy-to-use CTR model with a few lines of code and a configuration file, so they don’t need to reimplement the model themselves. In addition, they can train the model both locally and on distributed systems, such as Hadoop or Kubernetes. Follow the SDK DeepFM example to try the model. "},{"title":"Apache Submarine Release 0.3.0","type":0,"sectionRef":"#","url":"docs/releases/submarine-release-0.3.0","content":"The Apache Submarine community is pleased to announce the availability of the 0.3.0 release. The community put significant effort into improving Apache Submarine since the last release. It includes 196 patches for improvements and bug fixes. The highlighted features are as follows: Mini-submarine (YARN); Basic Tensorflow job submission to k8s through submarine-server RESTful API; Job submission on YARN through submarine-server RPC protocol. We encourage you to download the latest release. Feedback through the mailing lists is very welcome. You can visit the issue tracker for the full list of resolved issues.","keywords":""},{"title":"Apache Submarine Release 0.2.0","type":0,"sectionRef":"#","url":"docs/releases/submarine-release-0.2.0","content":"The Apache Submarine community is pleased to announce the availability of the 0.2.0 release. The community put significant effort into improving Apache Submarine since the last release. It includes 46 patches for improvements and bug fixes. 
We encourage you to download the latest release. Feedback through the mailing lists is very welcome. You can visit the issue tracker for the full list of resolved issues.","keywords":""},{"title":"Apache Submarine Release 0.4.0","type":0,"sectionRef":"#","url":"docs/releases/submarine-release-0.4.0","content":"The Apache Submarine Community is pleased to announce the availability of the 0.4.0 release. The community put significant effort into improving Apache Submarine since the last release. It includes 175 patches for improvements and bug fixes. The highlighted features are as follows: Submarine Experiments: Refactored the Job into an experiment and redefined the experiment spec; Submarine Helm Charts: Provides one command to install Submarine into the Kubernetes cluster; PySubmarine: Submarine Python SDK. We encourage you to download the latest release. Feedback through the mailing lists is very welcome. You can visit the issue tracker for the full list of resolved issues.","keywords":""},{"title":"how-to-use-tensorboard","type":0,"sectionRef":"#","url":"docs/userDocs/k8s/how-to-use-tensorboard","content":"","keywords":""},{"title":"Write to LogDirs by the environment variable","type":1,"pageTitle":"how-to-use-tensorboard","url":"docs/userDocs/k8s/how-to-use-tensorboard#write-to-logdirs-by-the-environment-variable","content":""},{"title":"Environment variable","type":1,"pageTitle":"how-to-use-tensorboard","url":"docs/userDocs/k8s/how-to-use-tensorboard#environment-variable","content":"SUBMARINE_TENSORBOARD_LOG_DIR: Exists in every experiment container. You just need to direct your logs to $(SUBMARINE_TENSORBOARD_LOG_DIR) (NOTICE: it is () not {}), and you can inspect the process on the tensorboard webpage. 
"},{"title":"Example","type":1,"pageTitle":"how-to-use-tensorboard","url":"docs/userDocs/k8s/how-to-use-tensorboard#example","content":"{ \"meta\": { \"name\": \"tensorflow-tensorboard-dist-mnist\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=$(SUBMARINE_TENSORBOARD_LOG_DIR) --learning_rate=0.01 --batch_size=20\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=512M\" } } } Copy "},{"title":"Connect to the tensorboard webpage","type":1,"pageTitle":"how-to-use-tensorboard","url":"docs/userDocs/k8s/how-to-use-tensorboard#connect-to-the-tensorboard-webpaage","content":"Open the experiment page in the workbench, and click the TensorBoard button.  Inspect the process on the tensorboard page.  "},{"title":"Apache Submarine Release 0.5.0","type":0,"sectionRef":"#","url":"docs/releases/submarine-release-0.5.0","content":"The Apache Submarine Community is pleased to announce the availability of the 0.5.0 release. The community put significant effort into improving Apache Submarine since the last release. It includes 99 patches for improvements and bug fixes. The highlighted features are as follows: Submarine Experiments: Redefined the experiment spec; code can be synced from Git over HTTP or SSH; Predefined experiment template: Register an experiment template and submit the related parameters to run an experiment using the REST API; Environment profile: Users can easily manage their docker image and conda environment; Jupyter Notebook: Spawn a Jupyter notebook using the REST API, and execute ML code on K8s, or submit an experiment to the Submarine server; Submarine Workbench UI: CRUD Experiment, Environment, Notebook through the UI; Disable interpreter module. We encourage you to download the latest release. Feedback through the mailing lists is very welcome. 
You can visit the issue tracker for the full list of resolved issues.","keywords":""},{"title":"Run Experiment Template Guide (REST)","type":0,"sectionRef":"#","url":"docs/userDocs/k8s/run-experiment-template-rest","content":"","keywords":""},{"title":"Experiment Template Spec","type":1,"pageTitle":"Run Experiment Template Guide (REST)","url":"docs/userDocs/k8s/run-experiment-template-rest#experiment-template-spec","content":"The experiment is represented in JSON or YAML format. "},{"title":"Use existing experiment template to create an experiment","type":1,"pageTitle":"Run Experiment Template Guide (REST)","url":"docs/userDocs/k8s/run-experiment-template-rest#use-existing-experiment-template-to-create-a-experiment","content":"POST /api/v1/experiment/{template-name} Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"params\": { \"learning_rate\":\"0.01\", \"batch_size\":\"150\", \"experiment_name\":\"newexperiment1\" } } ' http://127.0.0.1:32080/api/v1/experiment/tf-mnist Copy Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"params\": { \"experiment_name\":\"new-pytorch-mnist\" } } ' http://127.0.0.1:32080/api/v1/experiment/pytorch-mnist Copy To register an experiment template, and for more information, see the Experiment Template API Reference. "},{"title":"Run TensorFlow Experiment Guide (REST)","type":0,"sectionRef":"#","url":"docs/userDocs/k8s/run-tensorflow-experiment-rest","content":"","keywords":""},{"title":"Experiment Spec","type":1,"pageTitle":"Run TensorFlow Experiment Guide (REST)","url":"docs/userDocs/k8s/run-tensorflow-experiment-rest#experiment-spec","content":"The experiment is represented in JSON or YAML format. 
YAML Format: meta: name: \"tf-mnist-yaml\" namespace: \"default\" framework: \"TensorFlow\" cmd: \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\" envVars: ENV_1: \"ENV1\" environment: image: \"apache/submarine:tf-mnist-with-summaries-1.0\" spec: Ps: replicas: 1 resources: \"cpu=1,memory=1024M\" Worker: replicas: 1 resources: \"cpu=1,memory=1024M\" Copy JSON Format: { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } Copy "},{"title":"Create Experiment by REST API","type":1,"pageTitle":"Run TensorFlow Experiment Guide (REST)","url":"docs/userDocs/k8s/run-tensorflow-experiment-rest#create-experiment-by-rest-api","content":"POST /api/v1/experiment Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } ' http://127.0.0.1:32080/api/v1/experiment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1592057447228_0001\", \"name\": \"tf-mnist-json\", \"uid\": \"28e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": 
\"Accepted\", \"acceptedTime\": \"2020-06-13T22:59:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"tf-mnist-json\", \"namespace\": \"default\", \"framework\": \"TensorFlow\", \"cmd\": \"python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/log --learning_rate=0.01 --batch_size=150\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:tf-mnist-with-summaries-1.0\" }, \"spec\": { \"Ps\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } } } Copy For more information, see the Experiment API Reference. "},{"title":"Run PyTorch Experiment Guide (REST)","type":0,"sectionRef":"#","url":"docs/userDocs/k8s/run-pytorch-experiment-rest","content":"","keywords":""},{"title":"Experiment Spec","type":1,"pageTitle":"Run PyTorch Experiment Guide (REST)","url":"docs/userDocs/k8s/run-pytorch-experiment-rest#experiment-spec","content":"The experiment is represented in JSON or YAML format. 
YAML Format: meta: name: pytorch-mnist-yaml namespace: default framework: PyTorch cmd: python /var/mnist.py --backend gloo envVars: ENV_1: ENV1 environment: image: apache/submarine:pytorch-dist-mnist-1.0 spec: Master: replicas: 1 resources: cpu=1,memory=1024M Worker: replicas: 1 resources: cpu=1,memory=1024M Copy JSON Format: { \"meta\": { \"name\": \"pytorch-mnist-json\", \"namespace\": \"default\", \"framework\": \"PyTorch\", \"cmd\": \"python /var/mnist.py --backend gloo\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:pytorch-dist-mnist-1.0\" }, \"spec\": { \"Master\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } Copy "},{"title":"Create Experiment by REST API","type":1,"pageTitle":"Run PyTorch Experiment Guide (REST)","url":"docs/userDocs/k8s/run-pytorch-experiment-rest#create-experiment-by-rest-api","content":"POST /api/v1/experiment Example Request: curl -X POST -H \"Content-Type: application/json\" -d ' { \"meta\": { \"name\": \"pytorch-mnist-json\", \"namespace\": \"default\", \"framework\": \"PyTorch\", \"cmd\": \"python /var/mnist.py --backend gloo\", \"envVars\": { \"ENV_1\": \"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:pytorch-dist-mnist-1.0\" }, \"spec\": { \"Master\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } ' http://127.0.0.1:32080/api/v1/experiment Copy Example Response: { \"status\": \"OK\", \"code\": 200, \"result\": { \"experimentId\": \"experiment_1592057447228_0002\", \"name\": \"mnist\", \"uid\": \"38e39dcd-77d4-11ea-8dbb-0242ac110003\", \"status\": \"Accepted\", \"acceptedTime\": \"2020-06-13T22:19:29.000+08:00\", \"spec\": { \"meta\": { \"name\": \"pytorch-mnist-json\", \"namespace\": \"default\", \"framework\": \"PyTorch\", \"cmd\": \"python /var/mnist.py --backend gloo\", \"envVars\": { \"ENV_1\": 
\"ENV1\" } }, \"environment\": { \"image\": \"apache/submarine:pytorch-dist-mnist-1.0\" }, \"spec\": { \"Master\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" }, \"Worker\": { \"replicas\": 1, \"resources\": \"cpu=1,memory=1024M\" } } } } } Copy For more information, see the Experiment API Reference. "},{"title":"Python SDK Development","type":0,"sectionRef":"#","url":"docs/userDocs/submarine-sdk/pysubmarine/development","content":"","keywords":""},{"title":"Prerequisites","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#prerequisites","content":"To develop and test changes, we recommend installing pysubmarine in its own conda environment by running the following: conda create --name submarine-dev python=3.6 conda activate submarine-dev # lint-requirements.txt and test-requirements.txt are in ./submarine-sdk/pysubmarine/github-actions pip install -r lint-requirements.txt pip install -r test-requirements.txt # Installs pysubmarine from current checkout pip install ./submarine-sdk/pysubmarine Copy "},{"title":"PySubmarine Docker","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#pysubmarine-docker","content":"We also use Docker to provide build environments for CI, for development, and for generating the Python SDK from Swagger. 
./run-pysubmarine-ci.sh Copy The script does the following things: Starts an interactive bash session; Mounts the submarine directory to /workspace and sets it as home; Switches to the same user that calls run-pysubmarine-ci.sh. "},{"title":"Coding Style","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#coding-style","content":"Use yapf to format Python code. The yapf style is configured in the .style.yapf file. To autoformat code: ./submarine-sdk/pysubmarine/github-actions/auto-format.sh Copy Verify that the linter passes before submitting a pull request by running: ./submarine-sdk/pysubmarine/github-actions/lint.sh Copy "},{"title":"Unit Testing","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#unit-testing","content":"We are using pytest to develop our unit test suite. After building the project (see below) you can run its unit tests like so: cd submarine-sdk/pysubmarine Copy Run unit tests: pytest --cov=submarine -vs -m \"not e2e\" Copy Run integration tests: pytest --cov=submarine -vs -m \"e2e\" Copy Before running the integration tests locally, make sure the Submarine server is running. "},{"title":"Generate python SDK from swagger","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#generate-python-sdk-from-swagger","content":"We use open-api generator to generate the pysubmarine client API that is used to communicate with the Submarine server. If you change any of the files below, please run ./dev-support/pysubmarine/gen-sdk.sh to generate the latest version of the SDK. Bootstrap.java, ExperimentRestApi.java "},{"title":"Upload package to PyPi","type":1,"pageTitle":"Python SDK Development","url":"docs/userDocs/submarine-sdk/pysubmarine/development#upload-package-to-pypi","content":"For Apache Submarine committers and PMC members doing a new release. 
Change the version from 0.x.x-SNAPSHOT to 0.x.x in setup.py. Install Python packages: cd submarine-sdk/pysubmarine pip install -r github-actions/pypi-requirements.txt Copy Compile your package. This creates build, dist, and project.egg.info in your local directory: python setup.py bdist_wheel Copy Upload the Python package to TestPyPI for testing: python -m twine upload --repository testpypi dist/* Copy Upload the Python package to PyPI: python -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/* Copy "},{"title":"Submarine-SDK","type":0,"sectionRef":"#","url":"docs/userDocs/submarine-sdk/README","content":"","keywords":""},{"title":"Summary","type":1,"pageTitle":"Submarine-SDK","url":"docs/userDocs/submarine-sdk/README#summary","content":"Supports the Python, Scala, and R languages for algorithm development. Supports tracking/metrics APIs, which allow developers to add tracking/metrics and view them from the Submarine Workbench UI. "},{"title":"PySubmarine Tracking","type":0,"sectionRef":"#","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking","content":"","keywords":""},{"title":"Quickstart","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#quickstart","content":"Start mini-submarine. Start the MySQL server in mini-submarine. Uncomment the log_param and log_metric calls in mnist_distributed.py. Start a Submarine experiment (e.g., run_submarine_mnist_tony.sh). "},{"title":"Functions","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#functions","content":""},{"title":"submarine.get_tracking_uri()","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#submarineget_tracking_uri","content":"Returns the tracking URI. "},{"title":"submarine.set_tracking_uri(URI)","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#submarineset_tracking_uriuri","content":"Sets the tracking URI. 
You can also set the SUBMARINE_TRACKING_URI environment variable to have Submarine find a URI from there. The URI should be a database connection string. Parameters URI - Submarine records data to a MySQL server. The database URL is expected in the format <dialect>+<driver>://<username>:<password>@<host>:<port>/<database>. By default it's mysql+pymysql://submarine:password@localhost:3306/submarine. More detail: SQLAlchemy docs "},{"title":"submarine.log_metric(key, value, step=0)","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#submarinelog_metrickey-value-step0","content":"Logs a single key-value metric. The value must always be a number. Parameters key - Metric name (string). value - Metric value (float). step - A single integer step at which to log the specified metric; by default it's 0. "},{"title":"submarine.log_param(key, value)","type":1,"pageTitle":"PySubmarine Tracking","url":"docs/userDocs/submarine-sdk/pysubmarine/tracking#submarinelog_paramkey-value","content":"Logs a single key-value parameter. The key and value are both strings. Parameters key - Parameter name (string). value - Parameter value (string). "},{"title":"Building Submarine Spark Security Plugin","type":0,"sectionRef":"#","url":"docs/userDocs/submarine-security/spark-security/build-submarine-spark-security-plugin","content":"Submarine Spark Security Plugin is built using Apache Maven. To build it, cd to the root directory of the Submarine project and run: mvn clean package -Dmaven.javadoc.skip=true -DskipTests -pl :submarine-spark-security Copy By default, Submarine Spark Security Plugin is built against Apache Spark 2.3.x and Apache Ranger 1.1.0, which may be incompatible with other Apache Spark or Apache Ranger releases. 
Currently, the available profiles are: Spark: -Pspark-2.3, -Pspark-2.4, -Pspark-3.0 Ranger: -Pranger-1.2, -Pranger-2.0","keywords":""},{"title":"Submarine Spark Security Plugin","type":0,"sectionRef":"#","url":"docs/userDocs/submarine-security/spark-security/README","content":"","keywords":""},{"title":"Build","type":1,"pageTitle":"Submarine Spark Security Plugin","url":"docs/userDocs/submarine-security/spark-security/README#build","content":"Please refer to the online documentation - Building Submarine Spark Security Plugin "},{"title":"Quick Start","type":1,"pageTitle":"Submarine Spark Security Plugin","url":"docs/userDocs/submarine-security/spark-security/README#quick-start","content":"Three steps to integrate Apache Spark and Apache Ranger. "},{"title":"Installation","type":1,"pageTitle":"Submarine Spark Security Plugin","url":"docs/userDocs/submarine-security/spark-security/README#installation","content":"Place the submarine-spark-security-<version>.jar into $SPARK_HOME/jars. "},{"title":"Configurations","type":1,"pageTitle":"Submarine Spark Security Plugin","url":"docs/userDocs/submarine-security/spark-security/README#configurations","content":"Settings for Apache Ranger# Create ranger-spark-security.xml in $SPARK_HOME/conf and add the following configurations for pointing to the right Apache Ranger admin server. 
<configuration> <property> <name>ranger.plugin.spark.policy.rest.url</name> <value>ranger admin address like http://ranger-admin.org:6080</value> </property> <property> <name>ranger.plugin.spark.service.name</name> <value>a ranger hive service name</value> </property> <property> <name>ranger.plugin.spark.policy.cache.dir</name> <value>./a ranger hive service name/policycache</value> </property> <property> <name>ranger.plugin.spark.policy.pollIntervalMs</name> <value>5000</value> </property> <property> <name>ranger.plugin.spark.policy.source.impl</name> <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value> </property> </configuration> Copy Create ranger-spark-audit.xml in $SPARK_HOME/conf and add the following configurations to enable/disable auditing. <configuration> <property> <name>xasecure.audit.is.enabled</name> <value>true</value> </property> <property> <name>xasecure.audit.destination.db</name> <value>false</value> </property> <property> <name>xasecure.audit.destination.db.jdbc.driver</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>xasecure.audit.destination.db.jdbc.url</name> <value>jdbc:mysql://10.171.161.78/ranger</value> </property> <property> <name>xasecure.audit.destination.db.password</name> <value>rangeradmin</value> </property> <property> <name>xasecure.audit.destination.db.user</name> <value>rangeradmin</value> </property> </configuration> Copy Settings for Apache Spark# You can configure spark.sql.extensions with the *Extension we provided. For example, spark.sql.extensions=org.apache.submarine.spark.security.api.RangerSparkAuthzExtension Currently, you can set the following options to spark.sql.extensions to choose authorization w/ or w/o extra functions. 
option\tauthorization\trow filtering\tdata maskingorg.apache.submarine.spark.security.api.RangerSparkAuthzExtension\t√\t×\t× org.apache.submarine.spark.security.api.RangerSparkSQLExtension\t√\t√\t√ "},{"title":"Write Dockerfiles for Submarine","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/Dockerfiles","content":"How to write Dockerfile for Submarine TensorFlow jobs How to write Dockerfile for Submarine PyTorch jobs How to write Dockerfile for Submarine MXNet jobs","keywords":""},{"title":"Test and Troubleshooting","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/TestAndTroubleshooting","content":"","keywords":""},{"title":"Test with a tensorflow job","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#test-with-a-tensorflow-job","content":"Distributed-shell + GPU + cgroup  ... \\ job run \\ --env DOCKER_JAVA_HOME=/opt/java \\ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf-gpu \\ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \\ --worker_docker_image tf-1.13.1-gpu:0.0.1 \\ --ps_docker_image tf-1.13.1-cpu:0.0.1 \\ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \\ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \\ --num_ps 0 \\ --ps_resources memory=4G,vcores=2,gpu=0 \\ --ps_launch_cmd \"python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0\" \\ --worker_resources memory=4G,vcores=2,gpu=1 --verbose \\ --num_workers 1 \\ --worker_launch_cmd \"python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1\" Copy "},{"title":"Issues:","type":1,"pageTitle":"Test and 
Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issues","content":""},{"title":"Issue 1: Fail to start nodemanager after system reboot","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issue-1-fail-to-start-nodemanager-after-system-reboot","content":"2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997) 2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: 
Service NodeManager failed in state INITED Copy Solution: Grant the yarn user access to /sys/fs/cgroup/cpu,cpuacct, which is a subfolder of the cgroup mount destination. chown :yarn -R /sys/fs/cgroup/cpu,cpuacct chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct Copy If GPUs are used, access to the cgroup devices folder is needed as well chown :yarn -R /sys/fs/cgroup/devices chmod g+rwx -R /sys/fs/cgroup/devices Copy "},{"title":"Issue 2: container-executor permission denied","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issue-2-container-executor-permission-denied","content":"2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command: java.io.IOException: Cannot run program \"/etc/yarn/sbin/Linux-amd64-64/container-executor\": error=13, Permission denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.hadoop.util.Shell.runCommand(Shell.java:938) at org.apache.hadoop.util.Shell.run(Shell.java:901) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) Copy Solution: The permission of /etc/yarn/sbin/Linux-amd64-64/container-executor should be 6050 "},{"title":"Issue 3: How to get the docker service log","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issue-3：how-to-get-docker-service-log","content":"Solution: we can get the docker log with the following command journalctl -u docker Copy "},{"title":"Issue 4: docker can't remove containers with errors like device or resource busy","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issue-4：docker-cant-remove-containers-with-errors-like-device-or-resource-busy","content":"$ docker rm 0bfafa146431 Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove 
/app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy Copy Solution: to find which process is keeping the device or resource busy, we can add a shell script named find-busy-mnt.sh #!/usr/bin/env bash # A simple script to get information about mount points and pids and their # mount namespaces. if [ $# -ne 1 ];then echo \"Usage: $0 <devicemapper-device-id>\" exit 1 fi ID=$1 MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null` [ -z \"$MOUNTS\" ] && echo \"No pids found\" && exit 0 printf \"PID\\tNAME\\t\\tMNTNS\\n\" echo \"$MOUNTS\" | while read LINE; do PID=`echo $LINE | cut -d \":\" -f1 | cut -d \"/\" -f3` # Ignore self and thread-self if [ \"$PID\" == \"self\" ] || [ \"$PID\" == \"thread-self\" ]; then continue fi NAME=`ps -q $PID -o comm=` MNTNS=`readlink /proc/$PID/ns/mnt` printf \"%s\\t%s\\t\\t%s\\n\" \"$PID\" \"$NAME\" \"$MNTNS\" done Copy Kill the process using the PID reported by the script $ chmod +x find-busy-mnt.sh $ ./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a # PID NAME MNTNS # 5007 ntpd mnt:[4026533598] $ kill -9 5007 Copy "},{"title":"Issue 5: Yarn failed to start containers","type":1,"pageTitle":"Test and Troubleshooting","url":"docs/userDocs/yarn/TestAndTroubleshooting#issue-5：yarn-failed-to-start-containers","content":"If the number of GPUs required by applications is larger than the number of GPUs in the cluster, some containers cannot be created. 
"},{"title":"Docker Images for MXNet","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/WriteDockerfileMX","content":"","keywords":""},{"title":"How to create docker images to run MXNet on YARN","type":1,"pageTitle":"Docker Images for MXNet","url":"docs/userDocs/yarn/WriteDockerfileMX#how-to-create-docker-images-to-run-mxnet-on-yarn","content":"Dockerfile to run MXNet on YARN needs two parts: Base libraries which MXNet depends on 1) OS base image, for example ubuntu:18.04 2) MXNet dependent libraries and packages. \\ For example python, scipy. For GPU support, you also need cuda, cudnn, etc. 3) MXNet package. Libraries to access HDFS 1) JDK 2) Hadoop Here's an example of a base image (without GPU support) to install MXNet: FROM ubuntu:18.04 # Install some development tools and packages # MXNet 1.6 is going to be the last MXNet release to support Python2 RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata git \\ wget zip python3 python3-pip python3-distutils libgomp1 libopenblas-dev libopencv-dev # Install latest MXNet using pip (without GPU support) RUN pip3 install mxnet RUN echo \"Install python related packages\" && \\ pip3 install --user graphviz==0.8.4 ipykernel jupyter matplotlib numpy pandas scipy sklearn && \\ python3 -m ipykernel.kernelspec Copy On top of above image, add files, install packages to access HDFS ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 RUN apt-get update && apt-get install -y openjdk-8-jdk wget # Install hadoop ENV HADOOP_VERSION=\"3.1.2\" RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz # If you are in mainland China, you can use the following command. 
# RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current RUN rm hadoop-${HADOOP_VERSION}.tar.gz Copy Build and push to your own docker registry: Use docker build ... and docker push ... to finish this step. "},{"title":"Use examples to build your own MXNet docker images","type":1,"pageTitle":"Docker Images for MXNet","url":"docs/userDocs/yarn/WriteDockerfileMX#use-examples-to-build-your-own-mxnet-docker-images","content":"We provide some example Dockerfiles for you to build your own MXNet docker images. For latest MXNet docker/mxnet/base/ubuntu-18.04/Dockerfile.cpu.mxnet_latest: Latest MXNet that supports CPU. docker/mxnet/base/ubuntu-18.04/Dockerfile.gpu.mxnet_latest: Latest MXNet that supports GPU, which is prebuilt for CUDA 10. Build Docker images# "},{"title":"Manually build Docker image:","type":1,"pageTitle":"Docker Images for MXNet","url":"docs/userDocs/yarn/WriteDockerfileMX#manually-build-docker-image","content":"Under the docker/mxnet directory, run build-all.sh to build all Docker images. This command will build the following Docker images: mxnet-latest-cpu-base:0.0.1 for the base Docker image, which includes Hadoop and MXNet. mxnet-latest-gpu-base:0.0.1 for the base Docker image, which includes Hadoop, MXNet, and GPU base libraries. "},{"title":"Docker Images for PyTorch","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/WriteDockerfilePT","content":"","keywords":""},{"title":"How to create docker images to run PyTorch on YARN","type":1,"pageTitle":"Docker Images for PyTorch","url":"docs/userDocs/yarn/WriteDockerfilePT#how-to-create-docker-images-to-run-pytorch-on-yarn","content":"A Dockerfile to run PyTorch on YARN needs two parts: Base libraries which PyTorch depends on 1) OS base image, for example ubuntu:18.04 2) PyTorch dependent libraries and packages. For example python, scipy. 
For GPU support, you also need cuda, cudnn, etc. 3) PyTorch package. Libraries to access HDFS 1) JDK 2) Hadoop Here's an example of a base image (with GPU support) to install PyTorch: FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 ARG PYTHON_VERSION=3.6 RUN apt-get update && apt-get install -y --no-install-recommends \\ build-essential \\ cmake \\ git \\ curl \\ vim \\ ca-certificates \\ libjpeg-dev \\ libpng-dev \\ wget &&\\ rm -rf /var/lib/apt/lists/* RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \\ chmod +x ~/miniconda.sh && \\ ~/miniconda.sh -b -p /opt/conda && \\ rm ~/miniconda.sh && \\ /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \\ /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \\ /opt/conda/bin/conda clean -ya ENV PATH /opt/conda/bin:$PATH RUN pip install ninja # This must be done before pip so that requirements.txt is available WORKDIR /opt/pytorch RUN git clone https://github.com/pytorch/pytorch.git WORKDIR pytorch RUN git submodule update --init RUN TORCH_CUDA_ARCH_LIST=\"3.5 5.2 6.0 6.1 7.0+PTX\" TORCH_NVCC_FLAGS=\"-Xfatbin -compress-all\" \\ CMAKE_PREFIX_PATH=\"$(dirname $(which conda))/../\" \\ pip install -v . WORKDIR /opt/pytorch RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v . Copy On top of the above image, add files and install packages to access HDFS RUN apt-get update && apt-get install -y openjdk-8-jdk wget # Install hadoop ENV HADOOP_VERSION=\"2.9.2\" RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current RUN rm hadoop-${HADOOP_VERSION}.tar.gz Copy Build and push to your own docker registry: Use docker build ... and docker push ... to finish this step. 
"},{"title":"Use examples to build your own PyTorch docker images","type":1,"pageTitle":"Docker Images for PyTorch","url":"docs/userDocs/yarn/WriteDockerfilePT#use-examples-to-build-your-own-pytorch-docker-images","content":"We provide some example Dockerfiles for you to build your own PyTorch docker images. For latest PyTorch docker/pytorch/base/ubuntu-18.04/Dockerfile.gpu.pytorch_latest: Latest PyTorch that supports GPU, which is prebuilt for CUDA 10. docker/pytorch/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.pytorch_latest: Latest PyTorch that supports GPU, which is prebuilt for CUDA 10, with models. "},{"title":"Build Docker images","type":1,"pageTitle":"Docker Images for PyTorch","url":"docs/userDocs/yarn/WriteDockerfilePT#build-docker-images","content":""},{"title":"Manually build Docker image:","type":1,"pageTitle":"Docker Images for PyTorch","url":"docs/userDocs/yarn/WriteDockerfilePT#manually-build-docker-image","content":"Under the docker/pytorch directory, run build-all.sh to build all Docker images. This command will build the following Docker images: pytorch-latest-gpu-base:0.0.1 for the base Docker image, which includes Hadoop, PyTorch, and GPU base libraries. pytorch-latest-gpu:0.0.1, which includes the cifar10 model as well "},{"title":"Use prebuilt images","type":1,"pageTitle":"Docker Images for PyTorch","url":"docs/userDocs/yarn/WriteDockerfilePT#use-prebuilt-images","content":"(No liability) You can also use prebuilt images for convenience: hadoopsubmarine/pytorch-latest-gpu-base:0.0.1 "},{"title":"README","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README","content":"","keywords":""},{"title":"Prerequisite","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#prerequisite","content":"Install TensorFlow version 1.2.1 or later. 
Download the CIFAR-10 dataset and generate TFRecord files using the provided script. The script and associated command below will download the CIFAR-10 dataset and then generate a TFRecord for the training, validation, and evaluation datasets. python generate_cifar10_tfrecords.py --data-dir=${PWD}/cifar-10-data Copy After running the command above, you should see the following files in the --data-dir (ls -R cifar-10-data): train.tfrecords validation.tfrecords eval.tfrecords "},{"title":"Training on a single machine with GPUs or CPU","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#training-on-a-single-machine-with-gpus-or-cpu","content":"Run the training on CPU only. After training, it runs the evaluation. python cifar10_main.py --data-dir=${PWD}/cifar-10-data \\ --job-dir=/tmp/cifar10 \\ --num-gpus=0 \\ --train-steps=1000 Copy Run the model on 2 GPUs using CPU as parameter server. After training, it runs the evaluation. python cifar10_main.py --data-dir=${PWD}/cifar-10-data \\ --job-dir=/tmp/cifar10 \\ --num-gpus=2 \\ --train-steps=1000 Copy Run the model on 2 GPUs using GPU as parameter server. It will run an experiment, which in a local setting basically means it will stop training a couple of times to perform evaluation. python cifar10_main.py --data-dir=${PWD}/cifar-10-data \\ --job-dir=/tmp/cifar10 \\ --variable-strategy GPU \\ --num-gpus=2 Copy There are more command line flags to play with; run python cifar10_main.py --help for details. 
"},{"title":"Run distributed training","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#run-distributed-training","content":""},{"title":"(Optional) Running on Google Cloud Machine Learning Engine","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#optional-running-on-google-cloud-machine-learning-engine","content":"This example can be run on Google Cloud Machine Learning Engine (ML Engine), which will configure the environment and take care of running workers, parameter servers, and masters in a fault-tolerant way. To install the command line tool, and set up a project and billing, see the quickstart here. You'll also need a Google Cloud Storage bucket for the data. If you followed the instructions above, you can just run: MY_BUCKET=gs://<my-bucket-name> gsutil cp -r ${PWD}/cifar-10-data $MY_BUCKET/ Copy Then run the following command from the tutorials/image directory of this repository (the parent directory of this README): gcloud ml-engine jobs submit training cifarmultigpu \\ --runtime-version 1.2 \\ --job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \\ --config cifar10_estimator/cmle_config.yaml \\ --package-path cifar10_estimator/ \\ --module-name cifar10_estimator.cifar10_main \\ -- \\ --data-dir=$MY_BUCKET/cifar-10-data \\ --num-gpus=4 \\ --train-steps=1000 Copy "},{"title":"Set TF_CONFIG","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#set-tf_config","content":"Assuming that you already have multiple hosts configured, all you need is a TF_CONFIG environment variable on each host. You can set up the hosts manually or check tensorflow/ecosystem for instructions about how to set up a Cluster. 
The TF_CONFIG will be used by the RunConfig to know the existing hosts and their task: master, ps or worker. Here's an example of TF_CONFIG. cluster = {'master': ['master-ip:8000'], 'ps': ['ps-ip:8000'], 'worker': ['worker-ip:8000']} TF_CONFIG = json.dumps( {'cluster': cluster, 'task': {'type': 'master', 'index': 0}, 'model_dir': 'gs://<bucket_path>/<dir_path>', 'environment': 'cloud' }) Copy Cluster A cluster spec, which is basically a dictionary that describes all of the tasks in the cluster. More about it here. In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker. ps: saves the parameters among all workers. All workers can read/write/update the parameters for the model via ps. As some models are extremely large, the parameters are shared among the ps (each ps stores a subset). worker: does the training. master: basically a special worker; it does training, but also restores and saves checkpoints and does evaluation. Task The Task defines the role of the current node. In this example the node is the master at index 0 in the cluster spec; the task will be different for each node. An example of the TF_CONFIG for a worker would be: cluster = {'master': ['master-ip:8000'], 'ps': ['ps-ip:8000'], 'worker': ['worker-ip:8000']} TF_CONFIG = json.dumps( {'cluster': cluster, 'task': {'type': 'worker', 'index': 0}, 'model_dir': 'gs://<bucket_path>/<dir_path>', 'environment': 'cloud' }) Copy Model_dir This is the path where the master will save the checkpoints, graph and TensorBoard files. For a multi-host environment you may want to use a Distributed File System; Google Storage and DFS are supported. Environment By default the environment is local; for a distributed setting we need to change it to cloud. 
"},{"title":"Running script","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#running-script","content":"Once you have a TF_CONFIG configured properly on each host you're ready to run on distributed settings. Master# Run this on master: Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps. It will run evaluation a couple of times during training. The num_workers argument is used only to update the learning rate correctly. Make sure the model_dir is the same as defined on the TF_CONFIG. python cifar10_main.py --data-dir=gs://path/cifar-10-data \\ --job-dir=gs://path/model_dir/ \\ --num-gpus=4 \\ --train-steps=40000 \\ --sync \\ --num-workers=2 Copy Output: INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/ INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'master', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd16fb2be10>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1 gpu_options { } allow_soft_placement: true , '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_evaluation_master': '', '_master': u'grpc://master-ip:8000'} ... 
2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate (GHz) 0.8235 pciBusID 0000:00:04.0 Total memory: 11.17GiB Free memory: 11.09GiB 2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate (GHz) 0.8235 pciBusID 0000:00:05.0 Total memory: 11.17GiB Free memory: 11.10GiB ... 2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000 INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16) INFO:tensorflow:image after unit 
resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64) INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11) INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=1; total_num_replicas=1 INFO:tensorflow:Create CheckpointSaverHook. INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-0 2017-08-01 19:59:37.560775: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 156fcb55fe6648d6 with config: intra_op_parallelism_threads: 1 gpu_options { per_process_gpu_memory_fraction: 1 } allow_soft_placement: true INFO:tensorflow:Saving checkpoints for 1 into gs://path/model_dir/model.ckpt. 
INFO:tensorflow:loss = 1.20682, step = 1 INFO:tensorflow:loss = 1.20682, learning_rate = 0.1 INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 
64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64) INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11) INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2 INFO:tensorflow:Starting evaluation at 2017-08-01-20:00:14 2017-08-01 20:00:15.745881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0) 2017-08-01 20:00:15.745949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0) 2017-08-01 20:00:15.745958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0) 2017-08-01 20:00:15.745964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0) 2017-08-01 20:00:15.745969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:08.0) 2017-08-01 20:00:15.745975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:09.0) 2017-08-01 20:00:15.745987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:0a.0) 2017-08-01 20:00:15.745997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:0b.0) INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-10023 INFO:tensorflow:Evaluation [1/100] INFO:tensorflow:Evaluation [2/100] INFO:tensorflow:Evaluation [3/100] 
INFO:tensorflow:Evaluation [4/100] INFO:tensorflow:Evaluation [5/100] INFO:tensorflow:Evaluation [6/100] INFO:tensorflow:Evaluation [7/100] INFO:tensorflow:Evaluation [8/100] INFO:tensorflow:Evaluation [9/100] INFO:tensorflow:Evaluation [10/100] INFO:tensorflow:Evaluation [11/100] INFO:tensorflow:Evaluation [12/100] INFO:tensorflow:Evaluation [13/100] ... INFO:tensorflow:Evaluation [100/100] INFO:tensorflow:Finished evaluation at 2017-08-01-20:00:31 INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step = 1, loss = 630.425 Copy Worker# Run this on worker: Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for 40000 steps. It will run evaluation a couple of times during training. Make sure the model_dir is the same as defined on the TF_CONFIG. python cifar10_main.py --data-dir=gs://path/cifar-10-data \\ --job-dir=gs://path/model_dir/ \\ --num-gpus=4 \\ --train-steps=40000 \\ --sync Copy Output: INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/ INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'worker', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6918438e10>, '_model_dir': 'gs://<path>/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1 gpu_options { } allow_soft_placement: true , '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } ... 
2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate (GHz) 0.8235 pciBusID 0000:00:04.0 Total memory: 11.17GiB Free memory: 11.09GiB 2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate (GHz) 0.8235 pciBusID 0000:00:05.0 Total memory: 11.17GiB Free memory: 11.10GiB ... 2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000 INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16) INFO:tensorflow:image after unit 
resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8) INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64) INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11) INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2 INFO:tensorflow:Create CheckpointSaverHook. 
2017-07-31 22:38:04.629150: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:master/replica:0/task:0 2017-07-31 22:38:09.263492: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session cc58f93b1e259b0c with config: intra_op_parallelism_threads: 1 gpu_options { per_process_gpu_memory_fraction: 1 } allow_soft_placement: true INFO:tensorflow:loss = 5.82382, step = 0 INFO:tensorflow:loss = 5.82382, learning_rate = 0.8 INFO:tensorflow:Average examples/sec: 1116.92 (1116.92), step = 10 INFO:tensorflow:Average examples/sec: 1233.73 (1377.83), step = 20 INFO:tensorflow:Average examples/sec: 1485.43 (2509.3), step = 30 INFO:tensorflow:Average examples/sec: 1680.27 (2770.39), step = 40 INFO:tensorflow:Average examples/sec: 1825.38 (2788.78), step = 50 INFO:tensorflow:Average examples/sec: 1929.32 (2697.27), step = 60 INFO:tensorflow:Average examples/sec: 2015.17 (2749.05), step = 70 INFO:tensorflow:loss = 37.6272, step = 79 (19.554 sec) INFO:tensorflow:loss = 37.6272, learning_rate = 0.8 (19.554 sec) INFO:tensorflow:Average examples/sec: 2074.92 (2618.36), step = 80 INFO:tensorflow:Average examples/sec: 2132.71 (2744.13), step = 90 INFO:tensorflow:Average examples/sec: 2183.38 (2777.21), step = 100 INFO:tensorflow:Average examples/sec: 2224.4 (2739.03), step = 110 INFO:tensorflow:Average examples/sec: 2240.28 (2431.26), step = 120 INFO:tensorflow:Average examples/sec: 2272.12 (2739.32), step = 130 INFO:tensorflow:Average examples/sec: 2300.68 (2750.03), step = 140 INFO:tensorflow:Average examples/sec: 2325.81 (2745.63), step = 150 INFO:tensorflow:Average examples/sec: 2347.14 (2721.53), step = 160 INFO:tensorflow:Average examples/sec: 2367.74 (2754.54), step = 170 INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec) ... 
Copy PS# Run this on ps: The ps will not do training so most of the arguments won't affect the execution python cifar10_main.py --job-dir=gs://path/model_dir/ Copy Output: INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/ INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'ps', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f48f1addf90>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1 gpu_options { } allow_soft_placement: true , '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_evaluation_master': '', '_master': u'grpc://master-ip:8000'} 2017-07-31 22:54:58.928088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-ip:8000} 2017-07-31 22:54:58.928153: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000} 2017-07-31 22:54:58.928160: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-ip:8000} 2017-07-31 22:54:58.929873: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000 Copy "},{"title":"Visualizing results with TensorBoard","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#visualizing-results-with-tensorboard","content":"When using Estimators you can also visualize your data in TensorBoard, with no changes in your code. 
You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it. You'll see something similar to this if you \"point\" TensorBoard to the job dir parameter you used to train or evaluate your model. Check TensorBoard during training or after it. Just point TensorBoard to the model_dir you chose in the previous step. tensorboard --logdir=\"<job dir>\" Copy "},{"title":"Warnings","type":1,"pageTitle":"README","url":"docs/userDocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README#warnings","content":"When running cifar10_main.py with the --sync argument you may see an error similar to: File \"cifar10_main.py\", line 538, in <module> tf.app.run() File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py\", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File \"cifar10_main.py\", line 518, in main hooks), run_config=config) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py\", line 210, in run return _execute_schedule(experiment, schedule) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py\", line 47, in _execute_schedule return task() File \"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py\", line 501, in train_and_evaluate hooks=self._eval_hooks) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py\", line 681, in _call_evaluate hooks=hooks) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py\", line 292, in evaluate name=name) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py\", line 638, in _evaluate_model features, labels, model_fn_lib.ModeKeys.EVAL) File 
\"/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py\", line 545, in _call_model_fn features=features, labels=labels, **kwargs) File \"cifar10_main.py\", line 331, in _resnet_model_fn gradvars, global_step=tf.train.get_global_step()) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py\", line 252, in apply_gradients variables.global_variables()) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py\", line 170, in wrapped return _add_should_use_warning(fn(*args, **kwargs)) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py\", line 139, in _add_should_use_warning wrapped = TFShouldUseWarningWrapper(x) File \"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py\", line 96, in __init__ stack = [s.strip() for s in traceback.format_stack()] Copy This should not affect your training, and should be fixed on the next releases. "},{"title":"Docker Images for TensorFlow","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/WriteDockerfileTF","content":"","keywords":""},{"title":"How to create docker images to run Tensorflow on YARN","type":1,"pageTitle":"Docker Images for TensorFlow","url":"docs/userDocs/yarn/WriteDockerfileTF#how-to-create-docker-images-to-run-tensorflow-on-yarn","content":"Dockerfile to run Tensorflow on YARN need two part: Base libraries which Tensorflow depends on 1) OS base image, for example ubuntu:18.04 2) Tensorflow depended libraries and packages. For example python, scipy. For GPU support, need cuda, cudnn, etc. 3) Tensorflow package. 
Libraries to access HDFS 1) JDK 2) Hadoop Here's an example of a base image (w/o GPU support) to install Tensorflow: FROM ubuntu:18.04 # Pick up some TF dependencies RUN apt-get update && apt-get install -y --no-install-recommends \\ build-essential \\ curl \\ libfreetype6-dev \\ libpng-dev \\ libzmq3-dev \\ pkg-config \\ python \\ python-dev \\ rsync \\ software-properties-common \\ unzip \\ && \\ apt-get clean && \\ rm -rf /var/lib/apt/lists/* RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean RUN curl -O https://bootstrap.pypa.io/get-pip.py && \\ python get-pip.py && \\ rm get-pip.py RUN pip --no-cache-dir install \\ Pillow \\ h5py \\ ipykernel \\ jupyter \\ matplotlib \\ numpy \\ pandas \\ scipy \\ sklearn \\ && \\ python -m ipykernel.kernelspec RUN pip --no-cache-dir install \\ http://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.13.1-cp27-none-linux_x86_64.whl Copy On top of the above image, add files and install packages to access HDFS: RUN apt-get update && apt-get install -y openjdk-8-jdk wget # Install hadoop ENV HADOOP_VERSION=\"2.9.2\" RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current RUN rm hadoop-${HADOOP_VERSION}.tar.gz Copy Build and push to your own docker registry: Use docker build ... and docker push ... to finish this step. "},{"title":"Use examples to build your own Tensorflow docker images","type":1,"pageTitle":"Docker Images for TensorFlow","url":"docs/userDocs/yarn/WriteDockerfileTF#use-examples-to-build-your-own-tensorflow-docker-images","content":"We provide the following examples for you to build Tensorflow docker images. 
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x): docker/tensorflow/base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1: Tensorflow 1.13.1 supports CPU only. docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1: Tensorflow 1.13.1 supports CPU only, and includes models. docker/tensorflow/base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10. docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models. "},{"title":"Build Docker images","type":1,"pageTitle":"Docker Images for TensorFlow","url":"docs/userDocs/yarn/WriteDockerfileTF#build-docker-images","content":""},{"title":"Manually build Docker image:","type":1,"pageTitle":"Docker Images for TensorFlow","url":"docs/userDocs/yarn/WriteDockerfileTF#manually-build-docker-image","content":"Under the docker/ directory, run build-all.sh to build Docker images. It will build the following images: tf-1.13.1-gpu-base:0.0.1 for base Docker image which includes Hadoop, Tensorflow, GPU base libraries. tf-1.13.1-gpu-base:0.0.1 for base Docker image which includes Hadoop, Tensorflow. tf-1.13.1-gpu:0.0.1 which includes cifar10 model. tf-1.13.1-cpu:0.0.1 which includes cifar10 model (cpu only). 
"},{"title":"Use prebuilt images","type":1,"pageTitle":"Docker Images for TensorFlow","url":"docs/userDocs/yarn/WriteDockerfileTF#use-prebuilt-images","content":"(No liability) You can also use prebuilt images for convenience: hadoopsubmarine/tf-1.13.1-gpu:0.0.1hadoopsubmarine/tf-1.13.1-cpu:0.0.1 "},{"title":"YARN Runtime Quick Start Guide","type":0,"sectionRef":"#","url":"docs/userDocs/yarn/YARNRuntimeGuide","content":"","keywords":""},{"title":"Prerequisite","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#prerequisite","content":"Check out the Running Submarine on YARN "},{"title":"Build your own Docker image","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#build-your-own-docker-image","content":"When you follow the documents below, and want to build your own Docker image for Tensorflow/PyTorch/MXNet? Please check out Build your Docker image for more details. "},{"title":"Launch TensorFlow Application:","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#launch-tensorflow-application","content":""},{"title":"Without Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#without-docker","content":"You need: Build a Python virtual environment with TensorFlow 1.13.1 installedA cluster with Hadoop 2.9 or above. "},{"title":"Building a Python virtual environment with TensorFlow","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#building-a-python-virtual-environment-with-tensorflow","content":"TonY requires a Python virtual environment zip with TensorFlow and any needed Python libraries already installed. 
wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz tar xf virtualenv-16.0.0.tar.gz # Make sure to install using Python 3, as TensorFlow only provides Python 3 artifacts python virtualenv-16.0.0/virtualenv.py venv . venv/bin/activate pip install tensorflow==1.13.1 zip -r myvenv.zip venv deactivate Copy The above commands will produce a myvenv.zip, which is used in the example below. There's no need to copy it to other nodes, and it is not needed when using Docker to run the job. Note: If you require a version of TensorFlow and TensorBoard prior to 1.13.1, take a look at this issue. "},{"title":"Get the training examples","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#get-the-training-examples","content":"Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-tensorflow SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name tf-job-001 \\ --framework tensorflow \\ --verbose \\ --input_path \"\" \\ --num_workers 2 \\ --worker_resources memory=1G,vcores=1 \\ --num_ps 1 \\ --ps_resources memory=1G,vcores=1 \\ --worker_launch_cmd \"myvenv.zip/venv/bin/python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode\" \\ --ps_launch_cmd \"myvenv.zip/venv/bin/python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode\" \\ --insecure \\ --conf tony.containers.resources=path-to/myvenv.zip#archive,path-to/mnist_distributed.py,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar Copy You should then be able to see links and status of the jobs from the command line: 2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: 
worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi 2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED 2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED 2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED Copy "},{"title":"With Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#with-docker","content":"SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name tf-job-001 \\ --framework tensorflow \\ 
--docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.1 \\ --input_path hdfs://pi-aw:9000/dataset/cifar-10-data \\ --worker_resources memory=3G,vcores=2 \\ --worker_launch_cmd \"export CLASSPATH=\\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync\" \\ --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \\ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \\ --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \\ --env HADOOP_HOME=/hadoop-3.1.0 \\ --env HADOOP_YARN_HOME=/hadoop-3.1.0 \\ --env HADOOP_COMMON_HOME=/hadoop-3.1.0 \\ --env HADOOP_HDFS_HOME=/hadoop-3.1.0 \\ --env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \\ --conf tony.containers.resources=path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar Copy Notes:# 1) DOCKER_JAVA_HOME points to JAVA_HOME inside the Docker image. 2) DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside the Docker image. We removed the TonY submodule after applying SUBMARINE-371 and now use the TonY dependency directly. After Submarine v0.2.0, an uber jar submarine-all-${SUBMARINE_VERSION}-hadoop-${HADOOP_VERSION}.jar is released together with submarine-core-${SUBMARINE_VERSION}.jar, submarine-yarnservice-runtime-${SUBMARINE_VERSION}.jar and submarine-tony-runtime-${SUBMARINE_VERSION}.jar.  "},{"title":"Launch PyTorch Application:","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#launch-pytorch-application","content":""},{"title":"Without Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#without-docker-1","content":"You need: A Python virtual environment with PyTorch 0.4.0+ installed. A cluster with Hadoop 2.9 or above. 
"},{"title":"Building a Python virtual environment with PyTorch","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#building-a-python-virtual-environment-with-pytorch","content":"TonY requires a Python virtual environment zip with PyTorch and any needed Python libraries already installed. wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz tar xf virtualenv-16.0.0.tar.gz python virtualenv-16.0.0/virtualenv.py venv . venv/bin/activate pip install pytorch==0.4.0 zip -r myvenv.zip venv deactivate Copy "},{"title":"Get the training examples","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#get-the-training-examples-1","content":"Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-pytorch SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name PyTorch-job-001 \\ --framework pytorch --num_workers 2 \\ --worker_resources memory=3G,vcores=2 \\ --num_ps 2 \\ --ps_resources memory=3G,vcores=2 \\ --worker_launch_cmd \"myvenv.zip/venv/bin/python mnist_distributed.py\" \\ --ps_launch_cmd \"myvenv.zip/venv/bin/python mnist_distributed.py\" \\ --insecure \\ --conf tony.containers.resources=path-to/myvenv.zip#archive,path-to/mnist_distributed.py, \\ path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar Copy You should then be able to see links and status of the jobs from command line: 2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: 
[TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi 2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi 2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED 2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED 2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED Copy "},{"title":"With Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#with-docker-1","content":"SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name PyTorch-job-001 \\ --framework pytorch --docker_image pytorch-latest-gpu:0.0.1 \\ --input_path \"\" \\ --num_workers 1 \\ --worker_resources memory=3G,vcores=2 \\ --worker_launch_cmd \"cd /test/ && python cifar10_tutorial.py\" \\ 
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \\ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \\ --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \\ --env HADOOP_HOME=/hadoop-3.1.2 \\ --env HADOOP_YARN_HOME=/hadoop-3.1.2 \\ --env HADOOP_COMMON_HOME=/hadoop-3.1.2 \\ --env HADOOP_HDFS_HOME=/hadoop-3.1.2 \\ --env HADOOP_CONF_DIR=/hadoop-3.1.2/etc/hadoop \\ --conf tony.containers.resources=path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar Copy "},{"title":"Launch MXNet Application:","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#launch-mxnet-application","content":""},{"title":"Without Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#without-docker-2","content":"You need: Build a Python virtual environment with MXNet installedA cluster with Hadoop 2.9 or above. "},{"title":"Building a Python virtual environment with MXNet","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#building-a-python-virtual-environment-with-mxnet","content":"TonY requires a Python virtual environment zip with MXNet and any needed Python libraries already installed. wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz tar xf virtualenv-16.0.0.tar.gz python virtualenv-16.0.0/virtualenv.py venv . 
venv/bin/activate pip install mxnet==1.5.1 zip -r myvenv.zip venv deactivate Copy "},{"title":"Get the training examples","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#get-the-training-examples-2","content":"Get image_classification.py from this link SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name MXNet-job-001 \\ --framework mxnet --input_path \"\" \\ --num_workers 2 \\ --worker_resources memory=3G,vcores=2 \\ --worker_launch_cmd \"myvenv.zip/venv/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --num_ps 2 \\ --ps_resources memory=3G,vcores=2 \\ --ps_launch_cmd \"myvenv.zip/venv/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --num_schedulers=1 \\ --scheduler_resources memory=1G,vcores=1 \\ --scheduler_launch_cmd=\"myvenv.zip/venv/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --insecure \\ --conf tony.containers.resources=path-to/myvenv.zip#archive,path-to/image_classification.py,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar Copy You should then be able to see links and status of the jobs from the command line: 2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi status: RUNNING 2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi status: RUNNING 2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: 
http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi status: RUNNING 2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi status: RUNNING 2020-04-16 20:23:43,834 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: RUNNING 2020-04-16 20:23:43,839 INFO tony.TonyClient: Logs for scheduler 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi 2020-04-16 20:23:43,839 INFO tony.TonyClient: Logs for server 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi 2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for server 1 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi 2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi 2020-04-16 20:23:43,840 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi 2020-04-16 21:02:09,723 INFO tony.TonyClient: Task status updated: [TaskInfo] name: scheduler, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000002/pi status: SUCCEEDED 2020-04-16 21:02:09,736 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000005/pi status: SUCCEEDED 2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 1, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000004/pi status: SUCCEEDED 2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: worker, index: 1, url: 
http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000006/pi status: SUCCEEDED 2020-04-16 21:02:09,737 INFO tony.TonyClient: Task status updated: [TaskInfo] name: server, index: 0, url: http://pi-aw:8042/node/containerlogs/container_1587037749540_0005_01_000003/pi status: SUCCEEDED "},{"title":"With Docker","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#with-docker-2","content":"You can refer to this sample Dockerfile to build your own Docker image. SUBMARINE_VERSION=<REPLACE_VERSION> SUBMARINE_HADOOP_VERSION=3.1 CLASSPATH=$(hadoop classpath --glob):path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar \\ java org.apache.submarine.client.cli.Cli job run --name MXNet-job-001 \\ --framework mxnet --docker_image <your_docker_image> \\ --input_path \"\" \\ --num_schedulers 1 \\ --scheduler_resources memory=1G,vcores=1 \\ --scheduler_launch_cmd \"/usr/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --num_workers 2 \\ --worker_resources memory=2G,vcores=1 \\ --worker_launch_cmd \"/usr/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --num_ps 2 \\ --ps_resources memory=2G,vcores=1 \\ --ps_launch_cmd \"/usr/bin/python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync\" \\ --verbose \\ --insecure \\ --conf tony.containers.resources=path-to/image_classification.py,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar "},{"title":"Use YARN Service to run Submarine: Deprecated","type":1,"pageTitle":"YARN Runtime Quick Start Guide","url":"docs/userDocs/yarn/YARNRuntimeGuide#use-yarn-service-to-run-submarine-deprecated","content":"Historically, Submarine supported using YARN Service to submit deep learning jobs. 
This is no longer supported because YARN Service is not actively developed by the community, and extra dependencies such as RegistryDNS/ATS-v2 cause many setup issues. For now, you can still use YARN Service to run Submarine, but the code will be removed in a future release. Only TonY will be supported for running Submarine on YARN. "}]