Containers

Distributed HPC Applications with Unprivileged Containers

UD2.208 (Decroly)
We will present the challenges in doing distributed deep learning training at scale on shared heterogeneous infrastructure. At NVIDIA, we use containers extensively in our GPU clusters for both HPC and deep learning applications. We love containers for how they simplify software packaging and enable reproducibility without sacrificing performance. Docker is a popular tool for running application containers on Linux, and while it is possible to enable container workflows for users by granting them access to the Docker daemon, the security impact needs to be carefully considered, especially in a shared environment. Relying on Docker for the container runtime also requires a large amount of complicated boilerplate code to start multi-node jobs using the Message Passing Interface (MPI) for communication. In this presentation, we will introduce a new lightweight container runtime inspired by LXC and an associated plugin for the Slurm Workload Manager. Together, these two open-source projects enable a more secure architecture for our clusters, while also enabling a smoother user experience with containers on multi-node clusters.
There are many container runtimes available, but none met all of our needs for running distributed applications with no performance overhead and no privileged helper tools. For our use case, we built a simple container runtime called enroot. It is a tool to turn traditional container images into lightweight unprivileged sandboxes: a modern chroot. One key feature is that enroot remaps all UIDs inside the container to a single UID on the host. So, unlike runtimes which rely on /etc/subuid and /etc/subgid, with enroot there is no risk of overlapping UID ranges on a node, and no need to synchronize ranges across the cluster. It is also trivial to remap to UID 0 inside the container, which enables users to safely run apt-get install to add their own packages. Enroot is also configured to automatically mount drivers and devices for accelerators from NVIDIA and Mellanox using enroot's flexible plugin system. Finally, enroot is highly optimized to download and unpack large Docker images, which is particularly useful for images containing large applications.

We also created a new plugin for the Slurm Workload Manager which adds command-line flags for job submission. When the “--container-image” flag is set, our plugin imports a container image, unpacks it on the local filesystem, creates namespaces for the container, and then attaches the current job to these new namespaces. Therefore, tasks transparently land inside the container with minimal friction. Users can even make use of the PMI2 or PMIx APIs to coordinate workloads inside the containers without needing to invoke mpirun, further streamlining the user experience. Currently, the plugin works with two different tools: enroot and LXC. It could be extended to other container runtimes in the future.
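The contrast between a single-UID mapping and /etc/subuid-style ranges can be sketched with the Linux user-namespace map format, where each line of /proc/<pid>/uid_map reads "ID-inside ID-outside length". This is a minimal illustration only, not enroot's code: the helper names below are hypothetical, and the sketch only builds and inspects map text without creating any namespace.

```python
import os


def single_id_map(inside_id: int, outside_id: int) -> str:
    """Build one uid_map/gid_map line that maps exactly one ID (length 1)."""
    return f"{inside_id} {outside_id} 1\n"


def parse_id_map(text: str) -> list:
    """Parse uid_map-style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def host_ids_used(entries) -> set:
    """Return the set of host IDs consumed by a list of map entries."""
    ids = set()
    for _, outside, length in entries:
        ids.update(range(outside, outside + length))
    return ids


if __name__ == "__main__":
    # Hypothetical example: map UID 0 inside the sandbox to the calling user.
    print(single_id_map(0, os.getuid()), end="")
```

A single-entry map consumes only the caller's own host UID, so two users on the same node cannot collide. A range-based map (for example 65536 IDs starting at some offset) consumes a whole block of host IDs, which is why such ranges have to be allocated without overlap.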

Additional information

Type devroom

More sessions

2/1/20
Containers
Sascha Grunert
UD2.208 (Decroly)
Podman is the container management tool of your choice when it comes to boosting day-to-day development tasks around containers. The journey of Podman started as a drop-in replacement for Docker, but nowadays it’s even more than just that. For example, Podman is capable of managing pods and running containers without being root, and it supports fine-grained configuration options.
2/1/20
Containers
Akihiro Suda
UD2.208 (Decroly)
The biggest problem of the OCI Image Spec is that a container cannot be started until all the tarball layers are downloaded, even though more than 90% of the tarball contents are often unneeded for the actual workload. This session will show state-of-the-art alternative image formats, which allow runtime implementations to start a container without waiting for all its image contents to be locally available. In particular, this session will focus on CRFS/stargz and its implementation status in ...
2/1/20
Containers
Daniel Borkmann
UD2.208 (Decroly)
BPF, as a foundational technology in the Linux kernel, provides a powerful tool for systems developers and users to dynamically reprogram and customize the kernel to solve real-world problems without having to be a kernel expert. Thanks to BPF, we have reached the point where we no longer have to carry legacy accumulated over decades of development grounded in a more traditional networking environment that is typically far more static than your average Kubernetes ...
2/1/20
Containers
Ralf Haferkamp
UD2.208 (Decroly)
Kata Containers provide a secure container runtime offering an experience close to that of native containers, while providing stronger workload isolation and host infrastructure security by using hardware virtualization technology. This is particularly useful when containers are used to host and run third-party applications. In this presentation, after a short intro to Kata, we will demonstrate how easy it is to install and use on openSUSE. We will show it in action both as part of a podman ...
2/1/20
Containers
Laurent Bernaille
UD2.208 (Decroly)
Kube-proxy enables access to Kubernetes services (virtual IPs backed by pods) by configuring client-side load-balancing on nodes. The first implementation relied on a userspace proxy which was not very performant. The second implementation used iptables and is still the one used in most Kubernetes clusters. Recently, the community introduced an alternative based on IPVS. This talk will start with a description of the different modes and how they work. It will then focus on the IPVS ...
2/1/20
Containers
Adrian Reber
UD2.208 (Decroly)
The difficult task of checkpointing and restoring a process is used in many container runtimes to implement container live migration. This talk will give details on how CRIU is able to checkpoint and restore processes, how it is integrated into different container runtimes, and which optimizations CRIU offers to decrease the downtime during container migration. In this talk I want to provide details on how CRIU checkpoints and restores a process. Starting from ptrace() to pause the process, how parasite code ...
2/1/20
Containers
Christian Brauner
UD2.208 (Decroly)
Recently the kernel landed seccomp support for SECCOMP_RET_USER_NOTIF which enables a process (supervisee) to retrieve a fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (supervisor). The supervisor will then be able to receive seccomp messages about the syscalls having been performed by the supervisee. We have integrated this feature into userspace and currently make heavy use of this to intercept mknod(), mount(), and other syscalls in user ...