
TensorFlow training with YuniKorn

This doc briefly introduces how to use training-operator to train TensorFlow models with the YuniKorn scheduler on K8s. Please read this guide for more information.

Setup

  1. Set up the YuniKorn scheduler on your K8s cluster; please refer to this doc.
  2. Install training-operator, which makes it easy to run distributed or non-distributed ML jobs on K8s. You can install it with the following command.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
  3. Build a Docker image with the following command.
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .

Run a TensorFlow job

You need to create a TFJob and configure it to use the YuniKorn scheduler.

kubectl create -f tf-job-mnist.yaml
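
For orientation, here is a minimal sketch of what a TFJob configured for YuniKorn might look like. It is not the contents of tf-job-mnist.yaml. The file name tf-job-sketch.yaml, the job name, the applicationId value, the queue root.sandbox, and the replica counts are all illustrative assumptions. The parts that matter for scheduling are schedulerName: yunikorn in each pod spec and the applicationId and queue labels on each pod template.

```shell
# Write an illustrative TFJob manifest to a separate file so the
# repository's tf-job-mnist.yaml is left untouched.
cat > tf-job-sketch.yaml <<'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-mnist-sketch                       # hypothetical job name
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1                             # illustrative count
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: tf-mnist-sketch-001  # hypothetical; groups pods into one YuniKorn application
            queue: root.sandbox                 # assumed queue name
        spec:
          schedulerName: yunikorn             # hand these pods to YuniKorn
          containers:
            - name: tensorflow                # training-operator expects this container name
              image: kubeflow/tf-dist-mnist-test:1.0
    Worker:
      replicas: 2                             # illustrative count
      restartPolicy: Never
      template:
        metadata:
          labels:
            applicationId: tf-mnist-sketch-001
            queue: root.sandbox
        spec:
          schedulerName: yunikorn
          containers:
            - name: tensorflow
              image: kubeflow/tf-dist-mnist-test:1.0
EOF

# Submit it the same way as the provided manifest (requires a cluster):
# kubectl create -f tf-job-sketch.yaml
```

Using the same applicationId on the PS and Worker pods lets YuniKorn treat them as a single application rather than as unrelated pods.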

Monitor your job

You can view the job info in the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the doc here.