Seccomp is a Linux kernel feature that filters the system calls a process is allowed to execute. Containers commonly use it to improve isolation between the container and the host. Both the runc container runtime and Kubernetes let users define a seccomp policy, via the OCI Runtime Specification and the PodSpec respectively.
Seccomp recently gained a new feature called Seccomp Notify, introduced in Linux 5.0 and improved in Linux 5.9. It allows a seccomp policy not only to make an immediate decision on whether to allow or deny a system call, but also to defer that decision to an external process, which I call the Seccomp Agent. The Seccomp Agent can decide to block the system call, let it continue, or, to some extent, execute the system call on behalf of the container. This enables new use cases such as running privileged workloads in a safer way and some unprivileged container build setups.
In this talk, I will present the Seccomp Notify feature and the architecture in runc that makes use of it. I will describe the current status of this feature in Kubernetes. I will demonstrate a couple of use cases in Kubernetes and show how easy it is to build your own seccomp agent in Golang to support new use cases. The audience can expect mentions of pidfd_getfd, the addfd ioctl, and more.