This document gives a quick overview of the basic usage of the Submarine platform. You can complete each step of the ML model lifecycle on the platform without wrestling with environment setup problems.
Start minikube
# You can go to https://minikube.sigs.k8s.io/docs/start/ and follow the tutorial to install minikube.
# Then you can start kubernetes with minikube:
minikube start --vm-driver=docker --cpus 8 --memory 8192 --kubernetes-version v1.24.12

# The version of k8s can be adjusted to the range of your current minikube.
# For example, minikube v1.28.0 can provide versions from v1.25.0 to v1.25.3 in k8s 1.25

# Or if you want to support Pod Security Policy (https://minikube.sigs.k8s.io/docs/tutorials/using_psp) in k8s 1.21 or 1.22, you can use the following command to start cluster
minikube start --extra-config=apiserver.enable-admission-plugins=PodSecurityPolicy --addons=pod-security-policy --vm-driver=docker --cpus 8 --memory 8192 --kubernetes-version v1.21.2
Install Istio. There are two ways to install it: with istioctl or with Helm.
# You can go to the https://github.com/istio/istio/releases/ to download the istioctl for your k8s version
# e.g. we can execute the following command to download the istio version adapted to k8s 1.24.12
# wget https://github.com/istio/istio/releases/download/1.17.1/istio-1.17.1-linux-amd64.tar.gz
istioctl install -y

# Alternatively, you can use istio's helm to install
# This is the link: https://istio.io/latest/docs/setup/install/helm/

## Add istio repo
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

## Create istio-system namespace
kubectl create namespace istio-system

## Install istio resources
helm install istio-base istio/base -n istio-system
helm install istiod istio/istiod -n istio-system
helm install istio-ingressgateway istio/gateway -n istio-system
git clone https://github.com/apache/submarine.git
cd submarine
# create namespace for submarine, training, notebook and seldon-core operators
kubectl create namespace submarine
kubectl label namespace submarine istio-injection=enabled

# create namespace for deploying submarine-server
kubectl create namespace submarine-user-test
kubectl label namespace submarine-user-test istio-injection=enabled

# After k8s 1.25, we can turn on PSA (Pod Security Admission) labels for namespace.
# We use a common PSA enforcement level. If you want to use a more detailed configuration, you can refer to
# https://kubernetes.io/docs/concepts/security/pod-security-admission/#pod-security-admission-labels-for-namespaces
kubectl label namespace submarine-user-test 'pod-security.kubernetes.io/enforce=privileged'
# Update helm dependency.
helm dependency update ./helm-charts/submarine

# Install submarine operator in namespace submarine.
helm install submarine ./helm-charts/submarine --set seldon-core-operator.istio.gateway=submarine/seldon-gateway -n submarine
kubectl apply -f submarine-cloud-v3/config/samples/_v1_submarine.yaml -n submarine-user-test
$ kubectl get pods -n submarine
NAME                                              READY   STATUS    RESTARTS   AGE
notebook-controller-deployment-66d85984bf-x562z   1/1     Running   0          7h7m
training-operator-6dcd5b9c64-nxwr2                1/1     Running   0          7h7m
submarine-operator-9cb7bc84d-brddz                1/1     Running   0          7h7m

$ kubectl get pods -n submarine-user-test
NAME                                     READY   STATUS    RESTARTS   AGE
submarine-database-0                     1/1     Running   0          7h6m
submarine-minio-686b8777ff-zg4d2         2/2     Running   0          7h6m
submarine-mlflow-68c5559dcb-lkq4g        2/2     Running   0          7h6m
submarine-server-7c6d7bcfd8-5p42w        2/2     Running   0          9m33s
submarine-tensorboard-57c5b64778-t4lww   2/2     Running   0          7h6m
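If you would rather script this readiness check than read the table by eye, a minimal sketch might look like the following. It assumes the default `kubectl get pods` column order shown above (NAME, READY, STATUS, ...). The helper name `not_ready_pods` is hypothetical and is not part of Submarine or kubectl.

```python
from typing import List


def not_ready_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running with all containers ready.

    Expects the plain-text table printed by `kubectl get pods`.
    """
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    pending = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total_count = ready.split("/")
        if status != "Running" or ready_count != total_count:
            pending.append(name)
    return pending


# Rows copied from the submarine-user-test listing above.
sample = """\
NAME                                     READY   STATUS    RESTARTS   AGE
submarine-database-0                     1/1     Running   0          7h6m
submarine-minio-686b8777ff-zg4d2         2/2     Running   0          7h6m
submarine-server-7c6d7bcfd8-5p42w        2/2     Running   0          9m33s
"""
print(not_ready_pods(sample))  # prints [] because every pod is ready
```

An empty list means it is safe to move on to the port-forward step. Note that this simple check also flags pods in any status other than Running.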
kubectl port-forward --address 0.0.0.0 -n istio-system service/istio-ingressgateway 32080:80
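Open http://localhost:32080 in a browser. Because the port-forward listens on 0.0.0.0, the workbench is also reachable from other machines through this host's IP address.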
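If the page does not load, it can help to check whether anything is listening on the forwarded port before debugging further. The sketch below uses only the Python standard library. The function name `is_port_open` is a hypothetical helper, and 32080 is simply the local port chosen in the port-forward command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 32080 is the local port used by the port-forward command.
    print(is_port_open("localhost", 32080))
```

A result of False usually means the `kubectl port-forward` process is not running. A result of True only shows that the port accepts connections, not that the workbench itself is healthy.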
The code for this example is in ./dev-support/examples/quickstart. train.py is our training script, and build.sh is the script that builds the Docker image.
Take a simple MNIST TensorFlow script as an example. We choose MultiWorkerMirroredStrategy as our distributed strategy.

"""
./dev-support/examples/quickstart/train.py
Reference:
https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
"""

import tensorflow as tf
import tensorflow_datasets as tfds
from packaging.version import Version
from tensorflow.keras import layers, models

import submarine


def make_datasets_unbatched():
    BUFFER_SIZE = 10000

    # Scaling MNIST data from (0, 255] to (0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    # If we use tensorflow_datasets > 3.1.0, we need to disable GCS
    # https://github.com/tensorflow/datasets/issues/2761#issuecomment-1187413141
    if Version(tfds.__version__) > Version("3.1.0"):
        tfds.core.utils.gcs_utils._is_gcs_disabled = True
    datasets, _ = tfds.load(name="mnist", with_info=True, as_supervised=True)

    return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)


def build_and_compile_cnn_model():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation="relu"))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(10, activation="softmax"))

    model.summary()

    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return model


def main():
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=tf.distribute.experimental.CollectiveCommunication.AUTO
    )

    BATCH_SIZE_PER_REPLICA = 4
    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    with strategy.scope():
        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
        options = tf.data.Options()
        options.experimental_distribute.auto_shard_policy = (
            tf.data.experimental.AutoShardPolicy.DATA
        )
        ds_train = ds_train.with_options(options)
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model()

    class MyCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # monitor the loss and accuracy
            print(logs)
            submarine.log_metric("loss", logs["loss"], epoch)
            submarine.log_metric("accuracy", logs["accuracy"], epoch)

    multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
    # save model
    submarine.save_model(multi_worker_model, "tensorflow")


if __name__ == "__main__":
    main()
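MultiWorkerMirroredStrategy finds its peer workers through the TF_CONFIG environment variable, which the training operator normally sets on each worker pod, so train.py never builds it by hand. Purely to illustrate its shape, the sketch below assembles the JSON for a three-worker job. The function name `make_tf_config`, the host names, and the port are hypothetical placeholders, not values that Submarine generates.

```python
import json


def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG JSON string for one worker in a multi-worker job."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must refer to one of worker_hosts")
    return json.dumps(
        {
            # Every worker sees the same cluster description...
            "cluster": {"worker": list(worker_hosts)},
            # ...and a task entry that identifies itself within that cluster.
            "task": {"type": "worker", "index": task_index},
        }
    )


# Hypothetical addresses for a three-worker job.
hosts = [f"mnist-worker-{i}.example:2222" for i in range(3)]
print(make_tf_config(hosts, task_index=0))
```

Each worker receives the same `cluster` section and a different `task.index`, which is how the strategy knows how many replicas take part in training.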
Build a Docker image that contains the dependencies the training script needs.
eval $(minikube docker-env)
./dev-support/examples/quickstart/build.sh
Open submarine workbench and click + New Experiment
Choose Define your experiment
Fill the form accordingly. Here we set 3 workers.
We have also published this image on Docker Hub, so you can run the experiment directly by choosing quickstart in From predefined experiment library.
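To see what three workers mean for the training script, recall that train.py uses 4 samples per replica, 70 steps per epoch, and 10 epochs. The arithmetic below is a sketch that assumes each worker contributes exactly one replica. With several devices per worker, `strategy.num_replicas_in_sync` would be larger.

```python
BATCH_SIZE_PER_REPLICA = 4  # from train.py
STEPS_PER_EPOCH = 70        # from train.py
EPOCHS = 10                 # from train.py


def global_batch_size(num_replicas: int) -> int:
    """Batch size summed over all replicas, as computed in train.py."""
    return BATCH_SIZE_PER_REPLICA * num_replicas


def samples_seen(num_replicas: int) -> int:
    """Total training samples consumed over the whole run."""
    return global_batch_size(num_replicas) * STEPS_PER_EPOCH * EPOCHS


print(global_batch_size(3))  # 12
print(samples_seen(3))       # 8400
```

Because the step count is fixed, adding workers increases the number of samples processed per step rather than shortening the run.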
The training script uses submarine from submarine-sdk to record the metrics. To see the result, click the experiment named mnist-example in the workbench.

A served model is exposed through a VirtualService and Endpoint that use Istio for external port forwarding or ingress.

## get VirtualService with your model name
kubectl describe VirtualService -n submarine-user-test -l model-name=tf-mnist

Name:         submarine-model-1-2508dd65692740b18ff5c6c6c162b863
Namespace:    submarine-user-test
Labels:       model-id=2508dd65692740b18ff5c6c6c162b863
              model-name=tf-mnist
              model-version=1
Annotations:  <none>
API Version:  networking.istio.io/v1beta1
Kind:         VirtualService
Metadata:
  Creation Timestamp:  2022-09-18T05:26:38Z
  Generation:          1
  Managed Fields:
    ...
Spec:
  Gateways:
    submarine/seldon-gateway
  Hosts:
    *
  Http:
    Match:
      Uri:
        Prefix:  /seldon/submarine-user-test/1/1/
    Rewrite:
      Uri:  /
    Route:
      Destination:
        Host:  submarine-model-1-2508dd65692740b18ff5c6c6c162b863
        Port:
          Number:  8000
Events:  <none>
To confirm that the serving endpoint is available, open its Swagger page. In our example, the Swagger address is: http://localhost:32080/seldon/submarine-user-test/1/1/api/v1.0/doc/
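The Swagger address is the gateway address, followed by the Prefix from the VirtualService, followed by the documentation path. As a small sketch of that composition, the helper below joins the three parts. The name `swagger_url` is hypothetical, and the `api/v1.0/doc/` suffix is taken from the example address above rather than from any Submarine API.

```python
def swagger_url(gateway_base: str, prefix: str) -> str:
    """Join the gateway address, the VirtualService prefix, and the Swagger doc path."""
    return gateway_base.rstrip("/") + "/" + prefix.strip("/") + "/api/v1.0/doc/"


# Gateway address from the port-forward step, prefix from the VirtualService output.
print(swagger_url("http://localhost:32080", "/seldon/submarine-user-test/1/1/"))
# http://localhost:32080/seldon/submarine-user-test/1/1/api/v1.0/doc/
```

For another model, substitute the Prefix value reported by `kubectl describe VirtualService` for that model.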
More details can be found in the official seldon documentation: https://docs.seldon.io/projects/seldon-core/en/latest/workflow/serving.html#generated-documentation-swagger-ui