[docs] Add public docs for Cgroups v2. Currently there is no official documentation outlining the changes we have been making to support Cgroups v2. We add a main document outlining how Mesos interacts with Cgroups v2, and update some documents on the changes that were made, such as the device isolator document. Review: https://reviews.apache.org/r/75191/
diff --git a/docs/cgroups2-support.md b/docs/cgroups2-support.md new file mode 100644 index 0000000..692eab2 --- /dev/null +++ b/docs/cgroups2-support.md
@@ -0,0 +1,90 @@ +--- +title: Apache Mesos - Cgroups v2 Support +layout: documentation +--- + +# Using Mesos on systems with Cgroups2 enabled + +As part of the move towards Cgroups2, the Cgroups isolator has been updated to +support the updated interface. Changes are outlined below, and it is recommended +to read the [Cgroups v2](https://docs.kernel.org/admin-guide/cgroup-v2.html) +documentation for a deeper understanding. + +### Requirements + +The `cgroup2` filesystem must be mounted at `/sys/fs/cgroup`. This allows Mesos +to pick the Cgroups2 Isolator when creating the Mesos Containerizer. + +### Cgroup Names + +A cgroup called “CGROUP_NAME” has the path `/sys/fs/cgroup/$CGROUP_NAME`. This +applies to all cgroups. A cgroup's name is the cgroup's path relative to +`/sys/fs/cgroup`, where the cgroup2 filesystem is mounted. + +`flags.cgroups_root` (default: "mesos"): Root cgroup name. + +The client has control over the name of the root cgroup subtree under +`/sys/fs/cgroup` that Mesos manages. The default name is “mesos”. + +### Process Cgroup + +Every process Mesos manages will have a cgroup, and a leaf cgroup under it which +contains the pids. This is done to adhere to the [No Internal Process Constraint](https://docs.kernel.org/admin-guide/cgroup-v2.html#no-internal-process-constraint) +imposed by Cgroups v2. + +### Container + +When the cgroups v2 isolator is `prepare`d for a new container, cgroups are +created for the new container. When the cgroups v2 isolator `isolate`s, the new +container is moved into its leaf cgroup. + +Container Non-leaf Cgroup: `<flags.cgroups_root>/<containerId>` + +Container Leaf Cgroup: `<flags.cgroups_root>/<containerId>/leaf` + +### Nested Containers + +The Cgroups v2 isolator supports nested containers. + +Unlike Cgroups v1, we now create cgroups for all containers, even if they +indicated they do not want their own resource isolation. This is to make it +easier to keep track of a container’s processes. 
+ +If a container does not wish to have its own resource isolation, it can pass in +a flag `share_cgroups` and the isolator will not update any controllers for it. + +### Systemd Integration + +We currently do not have systemd integration. This section should be updated +with our approach if systemd support is implemented. + +### Linux Launcher & Cgroups v2 Isolator + +On Linux systems that support cgroups v2, the Mesos Containerizer will use the [Linux Launcher](https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/linux_launcher.cpp) and the [Cgroups v2 Isolator](https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/cgroups2/cgroups2.cpp). + +It’s recommended to review the code to gain a complete understanding of these steps. + +Operations on startup: + +- Linux Launcher `recover`: Parses the cgroups subtree rooted at +`flags.cgroups_root` to obtain container ids. Compares the persisted state to +the recovered containers to determine which containers are orphans. +- Cgroups v2 Isolator `recover`: Creates internal state to track recovered +containers. Calls `recover` on all of the controllers that are used by each of +the recovered containers. + +Operations when a new container is started: + +- Cgroups v2 Isolator `prepare`: Creates cgroups for the new container and adds +the container to the isolator's internal state. Configures namespace creation flags +and mount setups; does not create mounts or namespaces. Calls `prepare` on all +of the controllers that are used by the new container. +- Linux Launcher `fork`: Forks the Mesos Agent process to create the new +container's process. Also moves the child processes into the container's leaf +cgroup. Creates mounts and namespaces. +- Cgroups v2 Isolator `watch`: Calls `watch` on each of the controllers that +are used by the container. When a resource-watch promise is resolved, a handler +is invoked. 
+- Cgroups v2 Isolator `isolate`: Calls `isolate` on each of the controllers that +are used by the container. Then moves the container process into the container's +leaf cgroup; at this point the container is isolated. \ No newline at end of file
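To illustrate the Requirements, Cgroup Names, and Container sections of `docs/cgroups2-support.md` above, here is a minimal Python sketch. It is not Mesos code and is not part of the patch: the helper names `is_cgroup2_mounted` and `container_cgroups` are hypothetical, and the mount check assumes the usual `/proc/self/mounts` field order (device, mount point, filesystem type, options).

```python
from pathlib import PurePosixPath

# Mount point required by the Requirements section of the new document.
CGROUP2_MOUNT = PurePosixPath("/sys/fs/cgroup")


def is_cgroup2_mounted(mounts_text, mount_point=str(CGROUP2_MOUNT)):
    """Return True if a cgroup2 filesystem is mounted at mount_point.

    mounts_text is the text of a mounts table such as /proc/self/mounts.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, ...
        if len(fields) >= 3 and fields[1] == mount_point and fields[2] == "cgroup2":
            return True
    return False


def container_cgroups(container_id, cgroups_root="mesos"):
    """Return (non_leaf_path, leaf_path) for a top-level container.

    Follows the layout in the Container section:
      <flags.cgroups_root>/<containerId>        (non-leaf cgroup)
      <flags.cgroups_root>/<containerId>/leaf   (leaf cgroup holding the pids)
    """
    non_leaf = CGROUP2_MOUNT / cgroups_root / container_id
    return str(non_leaf), str(non_leaf / "leaf")
```

On a live host, `is_cgroup2_mounted(open("/proc/self/mounts").read())` would perform the check; the functions take plain strings so the logic can also be exercised without a cgroup2 mount.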
diff --git a/docs/isolators/cgroups-devices.md b/docs/isolators/cgroups-devices.md index 0807bd7..f8c4ec8 100644 --- a/docs/isolators/cgroups-devices.md +++ b/docs/isolators/cgroups-devices.md
@@ -12,6 +12,13 @@ `cgroups/devices` isolator, append `cgroups/devices` to the `--isolation` flag when starting the Mesos agent. +## Changes for Cgroups2 Support + +In Cgroups2, we create eBPF programs to keep track of which devices +are allowed or denied access. This is because cgroups2 no longer offers +interface files for device access controls. Our default whitelisted devices list +remains unchanged for cgroups2. + ## Default whitelisted devices The following devices are, by default, whitelisted for each container, if you
diff --git a/docs/mesos-containerizer.md b/docs/mesos-containerizer.md index 9a7b25d..6c2c5ba 100644 --- a/docs/mesos-containerizer.md +++ b/docs/mesos-containerizer.md
@@ -77,3 +77,8 @@ [Service] Delegate=true ``` + +## Cgroups2 Integration + +The changes made to support the new requirements of Cgroups v2 are +documented in the [Cgroups2 Support](cgroups2-support.md) documentation.