First let's look at what user will interact for most of the time:
+---------+ +------------+ |Logs |<--+|Notebook | +----------+ +---------+ +------------+ +----------------+ |Trackings | <-+|Experiment |<--+>|Model Artifacts | +----------+ +-----------------+ +------------+ +----------------+ +----------+<---+|ML-related Metric|<--+Servings | |tf.events | +-----------------+ +------------+ +----------+ ^ +-----------------+ + | Environments | +----------------------+ | | +-----------------+ | Submarine Metastore | | Dependencies | |Code | +----------------------+ | | +-----------------+ |Experiment Meta | | Docker Images | +----------------------+ +-----------------+ |Model Store Meta | +----------------------+ |Model Serving Meta | +----------------------+ |Notebook meta | +----------------------+ |Experiment Templates | +----------------------+ |Environments Meta | +----------------------+
First of all, all the notebook-sessions / experiments / model-serving instances) are more or less interact with following storage objects:
Object Type | Characteristics | Where to store |
---|---|---|
Metrics: tf.events | Time series data with k/v, appendable to file | Local/EBS, HDFS, Cloud Blob Storage |
Metrics: other tracking metrics | Time series data with k/v, appendable to file | Local, HDFS, Cloud Blob Storage, Database |
Logs | Large volumes, #files are potentially huge. | Local (temporary), HDFS (need aggregation), Cloud Blob Storage |
Submarine Metastore | CRUD operations for small meta data. | Database |
Model Artifacts | Size varies for model (from KBs to GBs). #files are potentially huge. | HDFS, Cloud Blob Storage |
Code | Need version control. (Please find detailed discussions below for code storage and localization) | Tarball on HDFS/Cloud Blog Storage, or Git |
Environment (Dependencies, Docker Image) | Public/private environment repo (like Conda channel), Docker registry. |
There're following ways to get experiment code:
1) Code is part of Git repo: (Recommended)
This is our recommended approach, once code is part of Git, it will be stored in version control, any change will be tracked, and much easier for users to trace back what change triggered a new bug, etc.
2) Code is part of Docker image:
This is an anti-pattern and we will NOT recommend you to use it, Docker image can be used to include ANYTHING, like dependencies, the code you will execute, or even data. But this doesn't mean you should do it. We recommend to use Docker image ONLY for libraries/dependencies.
Making code to be part of Docker image makes hard to edit code (if you want to update a value in your Python file, you will have to recreate the Docker image, push it and rerun it).
3) Code is part of S3/HDFS/ABFS:
User may want to store their training code to a tarball on a shared storage. Submarine need to download code from remote storage to the launched container before running the code.
To make user experiences keeps same across different environment, we will localize code to a same folder after the container is launched, preferably /code
For example, there's a git repo need to be synced up for an experiment/notebook/model-serving (example above):
experiment: #Or notebook, model-serving name: "abc", environment: "team-default-ml-env" ... (other fields) code: sync_mode: git url: "https://foo.com/training-job.git"
After localize, training-job/
will be placed under /code
When we running on K8s environment, we can use K8s‘s initContainer and emptyDir to do these things for us. K8s POD spec (generated by Submarine server instead of user, user should NEVER edit K8s spec, that’s too unfriendly to data-scientists):
apiVersion: v1 kind: Pod metadata: name: experiment-abc spec: containers: - name: experiment-task image: training-job volumeMounts: - name: code-dir mountPath: /code initContainers: - name: git-localize image: git-sync command: "git clone .. /code/" volumeMounts: - name: code-dir mountPath: /code volumes: - name: code-dir emptyDir: {}
The above K8s spec create a code-dir and mount it to /code
to launched containers. The initContainer git-localize
uses https://github.com/kubernetes/git-sync
to do the sync up. (If other storages are used such as s3, we can use similar initContainer approach to download contents)
Other than ML-related objects, we have system-related objects, including:
All these information should be handled by 3rd party system, such as Grafana, Prometheus, etc. And system admins are responsible to setup these infrastructures, dashboard. Users of submarine should NOT interact with system related metrics/logs. It is system admin's responsibility.
It is possible user has needs to have an attachable volume for their experiment / notebook, this is especially useful for notebook storage, since contents of notebook can be automatically saved, and it can be used as user's home folder.
Downside of attachable volume is, it is not versioned, even notebook is mainly used for adhoc exploring tasks, an unversioned notebook file can lead to maintenance issues in the future.
Since this is a common requirement, we can consider to support attachable volumes in Submarine in a long run, but with relatively lower priority.
Describe what Submarine project should own and what Submarine project should NOT own.