Stories by Akihiro Suda on Medium

Improvements to Rootless mode in Docker v29.5

Akihiro Suda — Fri, 15 May 2026 20:31:33 GMT

Rootless mode, which enables running the Docker daemon without root privileges, has been significantly improved in Docker v29.5 (May 15, 2026):

Faster image pulling and pushing
Support for docker run --net=host
Support for localhost registries
Source IP propagation without the legacy slirp4netns dependency

What is Rootless mode?

Rootless mode means running the entire Docker daemon (not just containers) as a non-root user, to protect the host from potential Docker vulnerabilities and misconfigurations. Even if an attacker escapes from a container, they can access only the files and processes available to the non-root daemon user.

Rootless mode itself is not new; it was originally implemented in 2018 and has been part of Docker since v19.03 (2019).

https://www.slideshare.net/Docker/dcsf19-hardening-docker-daemon-with-rootless-mode/3

Making Docker more secure with Rootless mode [DockerCon presentation report]

Getting started

To get started with Rootless Docker, install the docker-ce-rootless-extras package and run dockerd-rootless-setuptool.sh install as a non-root user.

# See https://docs.docker.com/engine/install for how to configure apt
sudo apt install docker-ce-rootless-extras

dockerd-rootless-setuptool.sh install

Improvements in Docker v29.5

Rootless Docker had been notorious for its limitations in networking, as the entire daemon was encapsulated in a network namespace (NetNS) associated with a user-mode TCP/IP stack such as slirp4netns.

Such limitations included:

Poor throughput of image pulling and pushing (typically less than 10 Gbps)
Lack of support for docker run --net=host
Lack of support for localhost registries (docker pull localhost:PORT/IMAGE)

These limitations have been resolved in Docker v29.5 by moving the daemon out of the NetNS associated with the user-mode TCP/IP stack.

NetNS for User-mode TCP/IP is now detached from dockerd

Notably, support for docker run --net=host should be highly useful, as it allows containers to bypass the overhead of the user-mode TCP/IP stack entirely. Note that --net=host still carries the security risk of exposing abstract UNIX sockets to the containers; however, this is less likely to be catastrophic in rootless mode. The concern can be further alleviated by also specifying --user=SUBUSER.

[CVE-2020-15257] Don’t use --net=host. Don’t use spec.hostNetwork.

Elimination of slirp4netns dependency

This release also replaces slirp4netns with gvisor-tap-vsock in the default setup, as slirp4netns is based on very old and potentially unsafe C code dating back to the 1990s. In contrast, gvisor-tap-vsock is written in pure Go and is expected to have fewer potential vulnerabilities, although it is still not completely free from unsafe code.

In prior releases of Rootless Docker, users often set the environment variable DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=slirp4netns to switch the port driver of RootlessKit from builtin to slirp4netns, in order to enable source IP propagation in port forwarding (docker run -p). Otherwise, the source IP address of incoming TCP packets was always replaced with the IP of Docker’s bridge interface (typically 172.17.0.1).

In Docker v29.5, users no longer need to specify DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER; however, they have to disable userland-proxy to enable source IP propagation:

mkdir -p ~/.config/docker
echo '{"userland-proxy": false}' >~/.config/docker/daemon.json
systemctl --user restart docker

The userland-proxy is planned to be disabled by default in a future release of Docker:

Disable Userland proxy by default · Issue #14856 · moby/moby

Also, depending on the host configuration, users may need to load the br_netfilter kernel module:

sudo tee /etc/modules-load.d/docker.conf <<EOF >/dev/null
br_netfilter
EOF
sudo systemctl restart systemd-modules-load.service

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities in the fields of containers, etc. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

We at NTT are looking for people to work with us in open source communities in fields such as containers. Please see our recruitment page (in Japanese): <https://www.rd.ntt/sic/recruit/>

Improvements to Rootless mode in Docker v29.5 was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ubuntu 26.04 can install APT packages from GitHub Container Registry

Akihiro Suda — Thu, 23 Apr 2026 12:58:50 GMT

With the release of version 26.04 planned today, Ubuntu now supports installing APT packages hosted on OCI-compliant container image registries such as GitHub Container Registry (ghcr.io), via the apt-transport-oci plugin I wrote 5 years ago.

This means third-party package maintainers no longer need to maintain their own web servers for hosting apt packages. They still have to maintain their container image registry, but it is already offered by GitHub for free.

Note: "OCI" in this article refers to the “Open Container Initiative”, not to “Oracle Cloud Infrastructure”.

Example (for package consumers)

The following example installs the hello-apt-transport-oci package from the oci://ghcr.io/akihirosuda/apt-transport-oci-examples:latest image, which is built from the GitHub repository https://github.com/AkihiroSuda/apt-transport-oci-examples .

First, install the apt-transport-oci plugin:

sudo apt install apt-transport-oci

Then create /etc/apt/sources.list.d/oci.sources with the following content:

Types: deb
URIs: oci://ghcr.io/akihirosuda/apt-transport-oci-examples:latest
Suites: stable
Components: main
Signed-By: /etc/apt/keyrings/apt-transport-oci-examples.gpg

Download the GPG key:

curl -fsSL https://raw.githubusercontent.com/AkihiroSuda/apt-transport-oci-examples/refs/heads/master/apt-transport-oci-examples.gpg \
  | sudo gpg --dearmor -o /etc/apt/keyrings/apt-transport-oci-examples.gpg

Confirm the key fingerprint:

$ gpg --show-keys --with-fingerprint /etc/apt/keyrings/apt-transport-oci-examples.gpg 
pub   ed25519 2026-04-21 [SC] [expires: 2029-04-20]
      E26B 12C8 C96A 4E4B CDF5  517A 3EB0 4A34 581C DAF6
uid                      Akihiro Suda, on behalf of apt-transport-oci-examples 
sub   cv25519 2026-04-21 [E]

Update the apt cache and install the hello-apt-transport-oci package:

sudo apt update
sudo apt install hello-apt-transport-oci

Confirm it works:

$ hello-apt-transport-oci
Hello, apt-transport-oci

Example (for package maintainers)

Packaging dpkg

A .deb package can be created using the traditional dpkg-deb command, given the file tree to be packaged and the DEBIAN/control metadata file, as follows:

$ tree hello-apt-transport-oci/
hello-apt-transport-oci/
├── DEBIAN
│   └── control
└── usr
    └── bin
        └── hello-apt-transport-oci

4 directories, 2 files

$ cat hello-apt-transport-oci/DEBIAN/control 
Package: hello-apt-transport-oci
Version: 0.1
Architecture: all
Maintainer: example@example.com
Description: hello apt-transport-oci

$ dpkg-deb --build --root-owner-group hello-apt-transport-oci hello-apt-transport-oci_0.1_all.deb
dpkg-deb: building package 'hello-apt-transport-oci' in 'hello-apt-transport-oci_0.1_all.deb'.

There are also several package building tools. Notably, Dalec is useful for Docker users, as it is implemented as a custom syntax for Dockerfiles: https://project-dalec.github.io/dalec/quickstart

Creating an APT tree

aptly is convenient for creating the APT repository tree:

sudo apt install aptly

aptly repo create hello-apt-transport-oci
aptly repo add hello-apt-transport-oci hello-apt-transport-oci_0.1_all.deb
aptly publish repo -distribution=stable -architectures=all,amd64,arm64 hello-apt-transport-oci

The repository data will be published locally under ~/.aptly/public:

$ tree ~/.aptly/public
/home/USER/.aptly/public
├── dists
│   └── stable
│       ├── Contents-all.gz
│       ├── Contents-amd64.gz
│       ├── Contents-arm64.gz
│       ├── InRelease
│       ├── main
│       │   ├── binary-all
│       │   │   ├── Packages
│       │   │   ├── Packages.bz2
│       │   │   ├── Packages.gz
│       │   │   └── Release
│       │   ├── binary-amd64
│       │   │   ├── Packages
│       │   │   ├── Packages.bz2
│       │   │   ├── Packages.gz
│       │   │   └── Release
│       │   ├── binary-arm64
│       │   │   ├── Packages
│       │   │   ├── Packages.bz2
│       │   │   ├── Packages.gz
│       │   │   └── Release
│       │   ├── Contents-all.gz
│       │   ├── Contents-amd64.gz
│       │   └── Contents-arm64.gz
│       ├── Release
│       └── Release.gpg
└── pool
    └── main
        └── h
            └── hello-apt-transport-oci
                └── hello-apt-transport-oci_0.1_all.deb

11 directories, 22 files

Pushing to a registry

Use ORAS (OCI Registry As Storage) to push an APT tree to a registry:

sudo apt install oras

cd ~/.aptly/public
find . -type f -printf "%p:application/octet-stream\n" \
  | xargs oras push ghcr.io/USERNAME/hello-apt-transport:latest

See https://github.com/AkihiroSuda/apt-transport-oci-examples for further details.

FAQ: Do I need to use containers?

No. The topic is about using the distribution protocol that has been used for containers, not about using containers themselves.

FAQ: Why use container image registries?

Because GitHub Container Registry is free of charge for both storage and bandwidth:

Billing for container image storage: Container image storage and bandwidth for the Container registry is currently free.

https://docs.github.com/en/billing/concepts/product-billing/github-packages#free-use-of-github-packages

It still has a few limitations, but those limitations are very loose and almost negligible:

・The Container registry has a 10 GB size limit for each layer.
・The Container registry has a 10 minute timeout limit for uploads.

https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry#troubleshooting

For these reasons, Homebrew has also been using GitHub Container Registry since 2021.

Why not use GitHub Packages?

Because GitHub Packages does not support APT. The service has been focusing on language package managers such as NPM and Maven so far.

Why not use GitHub Pages?

Because GitHub Pages is intended to be used for Web pages, not for serving arbitrary HTTP(S) content such as APT packages. For that reason, it comes with relatively tight usage limits:

・Published GitHub Pages sites may be no larger than 1 GB.
[...]
・GitHub Pages sites have a soft bandwidth limit of 100 GB per month.

https://docs.github.com/en/pages/getting-started-with-github-pages/github-pages-limits#usage-limits

FAQ: What about DNF?

I didn't support DNF when I wrote the apt-transport-oci plugin in 2021, because DNF at that time didn't appear to have a plugin system that is as flexible as APT.

However, the situation has changed with the release of DNF5 (Fedora 41). Luiz Carvalho at Red Hat recently implemented an experimental DNF5 plugin that enables OCI transport in the same way as apt-transport-oci: https://github.com/lcarva/libdnf5-oci-plugin . I hope that libdnf5-oci-plugin will eventually be included in Fedora, RHEL, and similar distributions.

Ubuntu 26.04 can install APT packages from GitHub Container Registry was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

gomodjail: library sandboxing for Go modules

Akihiro Suda — Tue, 27 Jan 2026 00:38:02 GMT

This article introduces gomodjail, an experimental tool that “jails” Go modules by applying syscall restrictions using seccomp and symbol tables, in order to mitigate potential supply chain attacks and other vulnerabilities.

In other words, gomodjail provides a “container” engine for Go modules but with finer granularity than Docker containers, FreeBSD jails, etc.

gomodjail focuses on simplicity; a security policy for gomodjail can be applied just by adding a // gomodjail:confined comment to the go.mod file of the target program and running it with the gomodjail run command.

gomodjail focuses on simplicity

Background: open source is under attack

Software is practically never written from scratch; it is always assembled from an enormous number of library dependencies, often including open-source ones. This supply chain is under attack:

xz/liblzma backdoor incident (2024): A backdoor was injected into xz/liblzma by its maintainer (not the original author), who had been making harmless contributions to the project for more than two years. This incident proved that even maintainers of widely adopted libraries cannot be blindly trusted.
Massive campaign of fake Go modules (circa 2025-): In spring 2025, hundreds of fake Go modules were published on GitHub. The repositories impersonated genuine ones but contained malicious code. In some cases, their GitHub star counts even exceeded those of the genuine repositories.
Slopsquatting (circa 2024-): AI coding agents may hallucinate and inject malicious dependencies with plausible package names. Even when an LLM itself doesn’t hallucinate, it can still be deceived by fake sites on the Internet. The chat session below shows Microsoft Copilot being deceived into suggesting a download of a malicious copy of 7-zip from a fake “official” site.

Microsoft Copilot being deceived into suggesting a download of 7-zip from a fake “official” site. The chat session is from https://x.com/longer_n/status/2014335971505123760

Library sandboxing

Library sandboxing means confining capabilities of a library so that it cannot perform specific operations, such as reading/writing arbitrary files and executing arbitrary shell commands. This is similar to containers such as Docker and FreeBSD jails, but it differs from containers in that the confinement applies at the granularity of a library, not an OS process.

For example, Firefox adopted RLBox in 2021 to wrap C library calls in a WebAssembly sandbox. However, library sandboxing hasn’t seen wide adoption, perhaps due to its complexity: sandboxing a library typically takes days, not minutes.

“On average, sandboxing a library takes only a few days”
— https://www.usenix.org/system/files/sec20_slides_narayan.pdf

Introducing gomodjail

gomodjail is a library sandboxing tool for Go, focusing on simplicity.

Take a look at examples/victim/main.go :

package main

import (
 "fmt"

 p "github.com/AkihiroSuda/gomodjail/examples/poisoned"
)

func main() {
 const x, y = 42, 43
 fmt.Printf("%d + %d = %d\n", x, y, p.Add(x, y))
}

The code is expected to just print 42 + 43 = 85 without any side effects. However, the Add(x, y) function here is poisoned to execute a “malicious” command:

$ go build
$ ./victim
*** ARBITRARY SHELL CODE EXECUTION ***

This 'vi' command was executed by the 'github.com/AkihiroSuda/gomodjail/examples/poisoned' module.

This example is harmless, of course, but suppose that this was a malicious code.

Type ':q!' to leave this screen.

gomodjail can confine this poisoned module so that it cannot execute such commands. It can be applied in just the following two steps:

Step 1: Make sure that go.mod has the comment directive //gomodjail:confined

require github.com/AkihiroSuda/gomodjail/examples/poisoned v0.0.0-00010101000000-000000000000 // gomodjail:confined

Step 2: Run the program with the gomodjail run command:

$ gomodjail run --go-mod=go.mod -- ./victim
level=WARN msg=***Blocked*** syscall=pidfd_open module=github.com/AkihiroSuda/gomodjail/examples/poisoned

How it works

gomodjail hooks dangerous syscalls such as open() and execve() using seccomp on Linux, or DYLD_INSERT_LIBRARIES on macOS. When a hooked syscall is executed, gomodjail unwinds the call stack to identify the Go module that invoked the syscall. If the module belongs to a blocklist, gomodjail blocks the syscall and injects EPERM as errno.

This run-time approach comes with several caveats; notably, it is not applicable to modules that import unsafe, reflect, C, etc., since such modules may alter the call stack. A future version of gomodjail may incorporate a compile-time approach to mitigate these caveats.
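
The attribution step (mapping a stack frame back to the module that owns it) can be illustrated in-process with Go's standard runtime package. This is only a minimal sketch of the idea, not gomodjail's implementation, which relies on seccomp and symbol tables as described above. The helper names packageOf, confinedModule, and firstConfinedCaller are hypothetical, and the matching is a simple prefix check on the package import path:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// packageOf extracts the package import path from a fully qualified
// function name such as "github.com/foo/bar/pkg.(*T).Method".
func packageOf(funcName string) string {
	// The package path ends at the first dot after the last slash.
	slash := strings.LastIndex(funcName, "/")
	dot := strings.Index(funcName[slash+1:], ".")
	if dot < 0 {
		return funcName
	}
	return funcName[:slash+1+dot]
}

// confinedModule reports which confined module, if any, owns the package.
func confinedModule(pkg string, confined []string) (string, bool) {
	for _, m := range confined {
		if pkg == m || strings.HasPrefix(pkg, m+"/") {
			return m, true
		}
	}
	return "", false
}

// firstConfinedCaller walks the call stack of the current goroutine and
// returns the first confined module found among the callers.
func firstConfinedCaller(confined []string) (string, bool) {
	pcs := make([]uintptr, 64)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and this function
	frames := runtime.CallersFrames(pcs[:n])
	for {
		frame, more := frames.Next()
		if m, ok := confinedModule(packageOf(frame.Function), confined); ok {
			return m, true
		}
		if !more {
			return "", false
		}
	}
}

func main() {
	confined := []string{"github.com/AkihiroSuda/gomodjail/examples/poisoned"}

	// A function name as it would appear in a stack frame of the example above.
	name := "github.com/AkihiroSuda/gomodjail/examples/poisoned.Add"
	fmt.Println(confinedModule(packageOf(name), confined))

	// The stack of this program contains only main and runtime frames.
	fmt.Println(firstConfinedCaller(confined))
}
```

A stack walk of this kind also hints at why the caveats above exist: the decision is only as trustworthy as the call stack it is derived from.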

Meet me at FOSDEM 2026 for further details

I’ll talk about gomodjail at FOSDEM:

Title: “gomodjail: library sandboxing for Go modules”
DevRoom: Go (UB5.132, Building U)
Date: February 1, 2026 (Sunday)
Time: 12:00–12:30

Feel free to visit my session for further details about the project.

Besides that, I’ll also have a session titled “Lima v2.0: expanding the focus to hardening AI” on Saturday (Jan 31, 15:30–16:00). Lima has been an adopter of gomodjail, although gomodjail is not going to be the main topic in the session on Saturday.

gomodjail: library sandboxing for Go modules was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

Alcoholless: A Lightweight Security Sandbox for macOS Programs (Homebrew, AI Agents, etc.)

Akihiro Suda — Tue, 22 Jul 2025 07:15:14 GMT

This article introduces Alcoholless: a lightweight security sandbox for macOS programs. While Alcoholless was originally made to secure Homebrew, it can be used for almost any CLI program on macOS. Notably, Alcoholless is useful for allowing an AI agent to run shell commands with less risk of breaking the host operating system.

AI may hallucinate, or may be deceived by a web search result, to run a malicious command

Software supply chain under attack

Homebrew is the most popular open source package manager on macOS, with more than 7700 package formulae. The large number of packages is both a strength and a weakness: while having many packages is certainly convenient, it raises doubts about whether the source code of all of them has been comprehensively reviewed.

For example, the notorious xz/liblzma backdoor incident (CVE-2024-3094) showed that even well-known packages can be compromised, although, by chance, Homebrew was not affected in this case.

This kind of supply chain attack may actually happen with any package manager; this month (July 2025) saw a very sophisticated phishing campaign that compromised several packages on npm.

AI agents may make mistakes

For better or worse, it is becoming common practice to allow AI agents to run arbitrary shell commands. This practice is extremely dangerous of course; an AI agent may hallucinate, or may be deceived by a web search result, to install malware with plausible package names:

pip install

To alleviate the risk of running such arbitrary commands, AI agents such as OpenAI Codex CLI and Google Gemini CLI utilize Apple’s sandbox-exec command on macOS.

sandbox-exec -f PROFILE COMMAND [ARGS]

sandbox-exec supports limiting file access with a profile like:

(allow file-read*)
(deny file-write*)
(allow file-write* (literal "/dev/null"))

However, sandbox-exec seems to have been deprecated since circa 2016.

$ man sandbox-exec
[…]
DESCRIPTION
 The sandbox-exec command is DEPRECATED.
 Developers who wish to sandbox an app should instead adopt
 the App Sandbox feature described in the App Sandbox Design Guide.

In the manual page, Apple recommends using “App Sandbox” instead; however, App Sandbox does not actually provide a direct replacement for the sandbox-exec command.

Introducing Alcoholless

Alcoholless provides a simple CLI to run shell commands with reduced security risk:

cd ~/SOME_DIRECTORY
alcless brew install xz
alcless xz SOME_FILE

In the example above, xz runs as a separate user with access to a copy of the current directory. Changed files are synced back to the current directory when the command exits.

How it works

Alcoholless just utilizes commands from the 1990s (su, sudo, rsync) and the macOS equivalent of useradd to implement container-like environments, without extending the XNU kernel to support Linux-style container syscalls. A fun fact is that both su and sudo have to be used, because sudo by itself can’t fully switch the user context on macOS, due to Mach quirks that POSIX does not account for.

Alcoholless could harden security further by utilizing Apple’s Virtualization.framework; however, it does not do so, as the framework apparently does not provide a way to automate the initialization steps of a macOS VM (accept EULA, skip enabling iCloud, set up SSH, etc.).

This barrier is annoying, but avoiding virtualization also brings several benefits:

No performance overhead
Minimal disk consumption
Direct access to the host hardware (GPU, etc.)
Works fine on GitHub Actions (no nested virtualization)

Getting started

Alcoholless can be installed from the source code as follows:

brew install go
git clone https://github.com/AkihiroSuda/alcless.git
cd alcless
git checkout v0.1.1
make
sudo make install

Alternatively, you can also download binary packages from <https://github.com/AkihiroSuda/alcless/releases>.

For the first run, the alclessctl create default command has to be executed. You’ll be asked to type the password to create the new user account alcless_${USER}_default :

$ alclessctl create default
7:41PM INF Creating an instance instance=default instUser=alcless_user_default
⚠️  The following commands will be executed:
sudo sysadminctl -addUser alcless_user_default -password -
sudo chmod go-rx /Users/alcless_user_default
sudo sh -c 'echo '"'"'suda ALL=(root) NOPASSWD: /usr/bin/su - alcless_user_default -c *'"'"' >'"'"'/etc/sudoers.d/alcless_user_default'"'"''
❓ Press return to continue, or Ctrl-C to abort
[RETURN]
CONTINUE
7:42PM INF Running command cmd="sudo sysadminctl -addUser alcless_user_default -password -"
2025-07-21 19:42:06.758 sysadminctl[37537:5738895] ----------------------------
2025-07-21 19:42:06.758 sysadminctl[37537:5738895] No clear text password or interactive option was specified (adduser, change/reset password will not allow user to use FDE) !
2025-07-21 19:42:06.758 sysadminctl[37537:5738895] ----------------------------
User password: [PASSWORD]
[...]

You may also have to verify that your home directory has restrictive permissions:

$ ls -ld ~
drwxr-x---+ 55 user staff  1760  7 22 05:19 /Users/user/

$ chmod 700 ~

$ ls -ld ~
drwx------+ 55 user staff  1760  7 22 05:19 /Users/user/

Basic usage

After alclessctl create default completes, cd to the directory that is to be synced to the alcless_${USER}_default environment:

mkdir -p ~/tmp
cd ~/tmp
# Create some content in the current project directory
echo foo >foo

Then you can install and run a Homebrew package such as xz :

$ alcless brew install xz
[...]

$ alcless xz foo
7:44PM INF ➡️Syncing the files src=/Users/user/tmp/ dst=default:/Users/alcless_user_default/Users/user/tmp
7:44PM INF ⬅️Syncing the files back (dry run) src=default:/Users/alcless_user_default/Users/user/tmp/ dst=/Users/user/tmp
*deleting foo
.d..t.... ./
>f+++++++ foo.xz
7:44PM INF ⬅️Syncing the files back src=default:/Users/alcless_user_default/Users/user/tmp/ dst=/Users/user/tmp
⚠️  The following commands will be executed:
rsync -rai --delete -e '/usr/local/bin/alclessctl shell --workdir=/ --plain' default:/Users/alcless_user_default/Users/user/tmp/ /Users/user/tmp
❓ Press return to continue, or Ctrl-C to abort
[RETURN]
CONTINUE
*deleting foo
.d..t.... ./
>f+++++++ foo.xz

alcless syncs the current directory /Users/${USER}/tmp to /Users/alcless_${USER}_default/Users/${USER}/tmp , executes the specified command using the user credential of alcless_${USER}_default, and syncs back the directory with a confirmation prompt.
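
The path mapping described above can be written down as a small function. This Go sketch is only an illustration of the naming scheme visible in the log output; sandboxUser and sandboxDir are hypothetical helpers, and generalizing the pattern to instance names other than default is an assumption:

```go
package main

import (
	"fmt"
	"path"
)

// sandboxUser returns the account name used for an instance, following the
// alcless_${USER}_default pattern shown above (generalized to any instance
// name, which is an assumption).
func sandboxUser(user, instance string) string {
	return "alcless_" + user + "_" + instance
}

// sandboxDir returns the directory under the sandbox user's home that
// mirrors the absolute host directory hostDir.
func sandboxDir(user, instance, hostDir string) string {
	return path.Join("/Users", sandboxUser(user, instance), hostDir)
}

func main() {
	// Prints /Users/alcless_user_default/Users/user/tmp
	fmt.Println(sandboxDir("user", "default", "/Users/user/tmp"))
}
```

Keeping the full host path under the sandbox user's home means that two different host directories never collide inside the sandbox.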

Usage with Gemini

Gemini CLI can be installed in Alcoholless as follows:

alcless brew install gemini-cli

This command may take around 10 minutes due to recompilation of several dependency packages such as node. This recompilation is needed as the Homebrew prefix (/Users/alcless_${USER}_default/homebrew) differs from the standard installation of Homebrew (/opt/homebrew).

Then add your GEMINI_API_KEY to .zshenv inside Alcoholless:

alcless sh -c 'vi ~/.zshenv'

Now you can run Gemini CLI and let it run arbitrary shell commands inside Alcoholless:

$ alcless gemini

╭─────────────────────────────────────────────────────────────╮
│  > Install a Python package that shows the current weather  │
╰─────────────────────────────────────────────────────────────╯

 ╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ ✔  GoogleSearch Searching the web for: "python package current weather"                              │
 │                                                                                                      │
 │    Search results for "python package current weather" returned.                                     │
 ╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I will install the python-weather package, which allows you to get the current weather for a location.
 ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ ?  Shell pip install python-weather (Install the `python-weather` package using pip.) ←                 │
 │                                                                                                         │
 │   pip install python-weather                                                                            │
 │                                                                                                         │
 │ Allow execution?                                                                                        │
 │                                                                                                         │
 │ ● 1. Yes, allow once                                                                                    │
 │   2. Yes, allow always "pip ..."                                                                        │
 │   3. No (esc)                                                                                           │
 │                                                                                                         │
 ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

It should be noted that Alcoholless is not a panacea; it is still highly recommended to review AI-generated commands before execution.
While the AI is unlikely to be able to steal or tamper with a file outside the working directory (unless macOS has a bug in its user isolation), it might still be able to mount other attacks such as cryptomining or denial-of-service.

Alcoholless: A Lightweight Security Sandbox for macOS Programs (Homebrew, AI Agents, etc.) was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

containerd v2.1, nerdctl v2.1, and Lima v1.1

Akihiro Suda — Thu, 12 Jun 2025 10:59:02 GMT

This post highlights the updates in containerd v2.1, nerdctl (contaiNERD CTL) v2.1, and Lima v1.1, all released last month.

See also my previous post on containerd v2.0, nerdctl v2.0, and Lima v1.0 (Nov 2024):

containerd v2.0, nerdctl v2.0, and Lima v1.0

containerd v2.1

containerd is the industry’s standard container runtime used by Docker and several Kubernetes-based products such as Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE).

The internals and the latest trends of container runtimes (2023)

containerd v2.1 introduces several improvements, particularly in filesystem support:

Support for EROFS (Enhanced Read-Only File System). Efficient for images with many layers.
Mounting a container image as a Kubernetes volume. Allows separating the application code image from the data images (e.g., AI models).
Writable cgroupfs (/sys/fs/cgroup) without root privileges. Enables containers to control computation resources (CPU time, memory limits, etc.) for their descendant processes by themselves.

Aside from the filesystem enhancements, this release also improves support for UserNS-Remap mode by allowing non-contiguous UID mapping ranges (e.g., uidmap=0:666:1000,1000:6666:64536).
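
To make the example mapping concrete, the following Go sketch translates container UIDs to host UIDs. It assumes that each triple is ordered as container-uid:host-uid:length, which is the usual order for user namespace mappings but is not spelled out above; the type idMap and the helpers parseIDMaps and toHost are hypothetical names, not containerd APIs:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is one contiguous range of a user-namespace ID mapping.
type idMap struct {
	containerID, hostID, size uint32
}

// parseIDMaps parses a comma-separated list of "container:host:size" triples.
func parseIDMaps(s string) ([]idMap, error) {
	var maps []idMap
	for _, entry := range strings.Split(s, ",") {
		parts := strings.Split(entry, ":")
		if len(parts) != 3 {
			return nil, fmt.Errorf("invalid mapping %q", entry)
		}
		var nums [3]uint32
		for i, p := range parts {
			n, err := strconv.ParseUint(p, 10, 32)
			if err != nil {
				return nil, fmt.Errorf("invalid mapping %q: %w", entry, err)
			}
			nums[i] = uint32(n)
		}
		maps = append(maps, idMap{nums[0], nums[1], nums[2]})
	}
	return maps, nil
}

// toHost translates a container ID to a host ID. The second return value is
// false when the ID is not covered by any range.
func toHost(maps []idMap, id uint32) (uint32, bool) {
	for _, m := range maps {
		if id >= m.containerID && id-m.containerID < m.size {
			return m.hostID + (id - m.containerID), true
		}
	}
	return 0, false
}

func main() {
	maps, err := parseIDMaps("0:666:1000,1000:6666:64536")
	if err != nil {
		panic(err)
	}
	for _, id := range []uint32{0, 999, 1000, 65535, 65536} {
		host, ok := toHost(maps, id)
		fmt.Printf("container %d -> host %d (mapped: %v)\n", id, host, ok)
	}
}
```

Under that assumption, container UIDs 0 to 999 map to host UIDs 666 to 1665, and container UIDs 1000 to 65535 map to host UIDs 6666 to 71201, so the two host ranges do not need to be adjacent.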

nerdctl v2.1

nerdctl (contaiNERD CTL) is a Docker-like command line interface tool for containerd.

nerdctl v2.1 adds support for UserNS-Remap mode, which balances security and performance between Rootless and Rootful modes:

Rootless: executes everything as a non-root user. Network performance is limited by default (but can be accelerated via the experimental bypass4netns).
UserNS-Remap: executes containers as a non-root user, but containerd itself still runs as root.
Rootful: executes everything as the root user.

Rootless vs. UserNS-Remap vs. Rootful

UserNS-Remap mode has been supported in the containerd daemon for a long time; however, it was not supported in nerdctl until now.

nerdctl v2.1 also brings experimental support for gomodjail: Jail for Go Modules. gomodjail imposes syscall restrictions on a specific set of Go modules so as to mitigate potential vulnerabilities and supply chain attacks (some caveats apply).

In the go.mod snippet below, most of the dependency modules (e.g., github.com/Masterminds/semver/v3) are confined so that they cannot execute commands or open new files.

//gomodjail:confined
module github.com/containerd/nerdctl/v2

require (
    github.com/Masterminds/semver/v3 v3.3.1
    ...
    golang.org/x/sys v0.31.0 //gomodjail:unconfined
    ...
)
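
One way to read the snippet above is that the directive placed before the module line sets a file-wide default, and a directive on an individual require line overrides it. The following Go sketch resolves policies under that reading. It is a deliberately simplified line scanner with hypothetical helper names (directive, policies), not the parser used by gomodjail, and it assumes that the file-wide directive appears before the require entries:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

const directivePrefix = "gomodjail:"

// directive returns the policy named in a "//gomodjail:<policy>" comment
// on the line, if there is one.
func directive(line string) (string, bool) {
	_, comment, found := strings.Cut(line, "//")
	if !found {
		return "", false
	}
	comment = strings.TrimSpace(comment)
	if !strings.HasPrefix(comment, directivePrefix) {
		return "", false
	}
	return strings.TrimPrefix(comment, directivePrefix), true
}

// policies maps each required module path to its effective policy.
func policies(goMod string) map[string]string {
	result := map[string]string{}
	defaultPolicy := "unconfined" // assumed default when no directive is given
	inRequire := false
	sc := bufio.NewScanner(strings.NewReader(goMod))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		policy, hasDirective := directive(line)
		code, _, _ := strings.Cut(line, "//")
		fields := strings.Fields(code)
		switch {
		case len(fields) == 0:
			// Comment-only line: a directive here becomes the file-wide default.
			if hasDirective {
				defaultPolicy = policy
			}
		case fields[0] == "require" && len(fields) >= 2 && fields[1] == "(":
			inRequire = true
		case fields[0] == ")":
			inRequire = false
		default:
			if fields[0] == "require" {
				fields = fields[1:] // single-line form: require PATH VERSION
			} else if !inRequire {
				continue // module, go, replace, etc.
			}
			if len(fields) < 2 {
				continue // not a "PATH VERSION" pair, e.g. an elided line
			}
			if !hasDirective {
				policy = defaultPolicy
			}
			result[fields[0]] = policy
		}
	}
	return result
}

func main() {
	goMod := `//gomodjail:confined
module github.com/containerd/nerdctl/v2

require (
    github.com/Masterminds/semver/v3 v3.3.1
    golang.org/x/sys v0.31.0 //gomodjail:unconfined
)
`
	p := policies(goMod)
	fmt.Println("semver:", p["github.com/Masterminds/semver/v3"])
	fmt.Println("x/sys: ", p["golang.org/x/sys"])
}
```

With this reading, semver inherits the file-wide confined policy while golang.org/x/sys is explicitly left unconfined.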

gomodjail is enabled in the nerdctl.gomodjail binary included in the nerdctl-full distribution. Usage is identical to nerdctl :

nerdctl.gomodjail run hello-world

Lima v1.1

Lima is a command line utility for creating Linux virtual machines. Lima was originally made with an opinionated focus on running containerd and nerdctl on desktop operating systems; however, its scope has since been extended to support non-container workloads as well. In that sense, Lima is more comparable to WSL2 and Vagrant than to Docker Desktop and Apple’s Containerization (to appear in macOS 26).

Lima v1.1 adds support for inheritance and composition of template files. With the new syntax (base), a template with a custom provision command can be written as follows:

base:
- template://_images/ubuntu-lts
- template://_default/mounts

provision:
- mode: system
  script: |
    #!/bin/bash
    set -eux
    apt-get install -y build-essential

In previous versions, you had to duplicate the entire content of the ubuntu-lts template in your own template.

Other notable updates in Lima v1.1 include:

New port forwarder implementation by default. Faster and supports both TCP and UDP.
Support for DragonFly BSD hosts
Support for S390X and PPC64LE guests
The lima package is now split into lima and lima-additional-guestagents. The latter is needed only for running a guest with a non-native architecture (e.g., Intel on ARM).

Visit the containerd maintainers at KubeCon Japan

Some containerd maintainers, including myself, will be presenting at KubeCon Japan 2025:

Tuesday, June 17, 2025, 16:30–17:00 JST

containerd: Project Update and Deep Dive — Akihiro Suda & Kohei Tokunaga (NTT), Kirtana Ashok (Microsoft), Akhil Mohan (VMware by Broadcom)

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities in the fields of containers, etc. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

The Japanese version of the recruiting page is available at <https://www.rd.ntt/sic/recruit/>.

containerd v2.1, nerdctl v2.1, and Lima v1.1 was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why You Should Contribute to Open Source Software

Akihiro Suda — Sun, 30 Mar 2025 23:31:37 GMT

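I am Akihiro Suda from NTT. In September 2024, I gave an invited talk titled “Why You Should Contribute to Open Source Software” at the 57th 情報科学若手の会 (a meeting of young researchers in information science). This post reorganizes the content of that talk as a blog article.

Slides (PDF)

Why should you contribute to OSS?

In short, mainly for the sustainability of OSS.

OSS tends to be consumed without paying anything in return, as if it were a “free lunch”. Unlike the actual free lunch 🍕🍣 served at meetups, eating as much as you like does not inconvenience anyone else. However, unless somebody cares about the people who serve this “free lunch”, the following problems arise:

The “free lunch” stops being served (OSS development stagnates)
A poisoned “free lunch” gets served (malware is injected into OSS)

The former is the lesser problem; the latter is a particular threat to businesses and society. As discussed later, backdoors have actually been injected even into widely used OSS such as xz and liblzma.

In the end, OSS is not a “free lunch” (there is no such thing as a free lunch). Using it safely and sustainably requires paying a price, namely contributions.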

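What is OSS?

Let us first look at what OSS actually is. Contrary to a common misunderstanding, OSS does not mean “software that can be used free of charge” (freeware).

The Open Source Initiative, a non-profit public benefit corporation, lists ten criteria as the definition of OSS (Open Source Definition, v1.9). The main ones are:

Free redistribution
Distribution in source code form
Freedom to create and distribute derived works
No restriction on the purpose of use

Software that is free of charge is still not OSS if its source code is not published or if its use is restricted to particular purposes.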

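OSS and Free Software

A concept very similar to OSS is Free Software, advocated by Richard M. Stallman and others. Free Software does not mean “software that can be used free of charge” (freeware) either. In Japanese, Free Software is translated as 「自由(な)ソフトウェア」 (software with freedom). Note that the Japanese word 「フリーソフト」 usually refers not to Free Software but simply to software that costs nothing.

OSS and Free Software are often hard to tell apart in practice, but they are based on different motivations. While Free Software is positioned as a “movement for freedom and justice”, OSS emphasizes practical benefit over ideals.

Despite the philosophical difference, the two are collectively called FOSS (Free and Open Source Software), or FLOSS (Free/Libre and Open Source Software) to emphasize the “libre” aspect.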

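Major OSS licenses

OSS comes under a wide variety of licenses. Software under different licenses can sometimes be mixed and sometimes cannot. For example, mixing MIT License code into software under Apache License v2.0 is not a problem, but mixing in GPL code makes it impossible to continue distributing the software under Apache License v2.0. In recent years, with the spread of AI coding assistants such as GitHub Copilot, there is also growing concern about code under other licenses being mixed in unintentionally.

There are also many licenses that resemble OSS licenses but are not, for example because they restrict the purpose of use, and their confusion with OSS is regarded as a problem.

Although this article recommends contributing to OSS, you must not contribute to OSS until you have acquired a minimum knowledge of licenses. Copying and pasting code in disregard of licenses can make it impossible to continue distributing the software, and so does more harm than good.

The following introduces some major OSS licenses, touching only on their distinctive points. Be sure to consult the original license text before relying on any of them.

MIT License: Known for its permissiveness. It also allows the creation of proprietary derivatives. It is adopted by many projects such as Ruby on Rails and Node.js, and is said to be the most widely used license on GitHub (44.69% in 2021). Similar licenses include the X11 License, the BSD License (2-Clause), and the ISC License.
Apache License v2.0: Similar to the MIT License, but differs in that it explicitly grants a patent license. If a user initiates patent litigation, the patent license granted to that user terminates. It is adopted by many projects such as Apache HTTP Server, Docker Engine (Moby), and Kubernetes.
GPL (GNU General Public License) v2: Characterized by the obligation to provide the source code to users who receive binaries. Adopted by Linux, git, and others.
GPL v3: The successor to GPL v2. It adds clauses such as the prohibition of obstructing the installation of modified software (“Tivoization”). Many GNU projects such as GCC and Emacs have moved from GPL v2 to v3, but Linux has no plan to move to v3, because it has no intention of prohibiting “Tivoization”.
AGPL (Affero GPL) v3: Requires the source code to be provided even to SaaS users who have not received binaries. Adopted by Mastodon, OnlyOffice, and others.
LGPL (Lesser GPL) v2.1 and v3: Unlike the GPL, a library under the LGPL does not “infect” its caller with the license when it is called as a dynamic library. Adopted by glibc, glib, and others.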

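Non-OSS licenses that resemble OSS licenses

Here are some non-OSS licenses that are often confused with OSS licenses.

BUSL (Business Source License) v1.1: Unlike OSS licenses, it restricts commercial use. However, another license (usually an OSS license) applies after a certain period. Originally created for MariaDB MaxScale, it has also been adopted by Terraform (since August 2023) and Vagrant (likewise).
SSPL (Server Side Public License) v1.0: Similar to AGPL v3, but requires disclosure of the source code not only of the software itself but of the entire service being offered. Originally created for MongoDB (since October 2018), it has also been adopted by Elasticsearch (since February 2021) and Redis (since March 2024). Elasticsearch and Redis offer other licenses alongside it.
Llama 3 Community License: Imposes restrictions on the number of users and on the purposes of use. Adopted by Llama 3.

Note that proprietary software under these non-OSS licenses sometimes calls itself OSS, or is confused with OSS by the media.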

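A brief history of OSS

To understand what OSS is, it is necessary to grasp the history leading up to its definition.

It is not easy to say when the history of OSS began, but the term “open source” in the OSS sense cannot be found before 1998. An older usage, “Caldera Announces Open-Source Code Model for DOS” (1996), does exist, but its meaning differs from the usage since 1998.

That said, software distribution practices resembling OSS can be seen as early as the dawn of electronic computers in the mid-20th century. In the first place, copyright on software itself was not established until the mid-1960s (in the case of the United States). After that, software increasingly came to be distributed as proprietary copyrighted work.

1974: UNIX began to be provided outside AT&T (American Telephone and Telegraph Company). UNIX later became a widely used OS, but at this point it seems to have been provided to only a handful of recipients. UNIX at the time was free of charge, but its use was not free: it was limited to academic and educational purposes. It became a paid product the following year. UNIX eventually split into AT&T's System V and the Berkeley Software Distribution (BSD) of the University of California, Berkeley, but BSD also contained copyrighted work of AT&T and its affiliates.
1983: Richard M. Stallman founded the GNU (GNU’s Not Unix) project under the slogan “Free Unix!”. It aimed to develop a complete UNIX-compatible OS including a kernel, but in practice only the userland components such as bash became popular, and they are still widely used today. Kernel development also started around 1985 but failed once. GNU Hurd, whose development began around 1990 on top of the Mach 3 microkernel, has not become popular but is still under development.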

“I consider that the golden rule requires that if I like a program I must share it with other people who like it.”
— Richard M. Stallman (September 27, 1983)

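1988: 4.3BSD Net/1 was released, aiming to be the first free BSD (not FreeBSD). It was believed that removing the AT&T-related copyrighted work had made free redistribution possible, but the removal was insufficient, which led to a lawsuit in 1992. 4.4BSD Lite (1994) made free redistribution possible once again. The lineage of Net/1 runs through Net/2 (1991) and 386BSD (1992) to NetBSD (1993) and FreeBSD (1993). NetBSD and FreeBSD were later rewritten on the basis of the 4.4BSD Lite code.
1991: Linux v0.01 was released. It initially strictly prohibited redistribution for a fee, so although it cost nothing, it was not “free” software. With the adoption of GPL v2 in Linux v0.12 (1992), it finally became “free” software. Linux was not the first free UNIX-compatible OS, but it reaped the benefit and gained ground while GNU was struggling technically and BSD was struggling legally. Note that Linux at this point was being developed as Linus Torvalds's hobby and was not expected to become a large-scale OS.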

“just a hobby, won’t be big and professional like gnu”
— Linus Torvalds (August 25, 1991)

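1997: In his talk “The Cathedral and the Bazaar”, Eric S. Raymond compared the closed “cathedral” development model (GNU Emacs, etc.) with the open “bazaar” model (Linux, etc.) and pointed out the advantages of the latter. Although the two are often confused, “cathedral” does not mean proprietary software, nor does “bazaar” mean OSS.
1998: Netscape, the developer of the then widely used web browser Netscape Communicator, announced that it would release the source code of its next product. Most of the source code released at that time was later discarded, but after many twists and turns it led to today's Mozilla Firefox. The release of the Netscape source code is said to have been influenced by “The Cathedral and the Bazaar” mentioned above. At this point the wording used was “free source distribution with a license which allows source code modification and redistribution”, and it was not yet called “open source”, but the term “open source” was proposed soon afterwards by Christine Peterson and others. The term “free software” had already existed for a long time, but its tendency to be interpreted as software that costs nothing rather than software with freedom was apparently seen as a problem.
1999: SourceForge was launched as a hosting service for OSS. The OSS community became more active thereafter. Around this time, commercial adoption of major OSS such as Linux also progressed. In 2000, the products that later led to Red Hat Enterprise Linux and SUSE Linux Enterprise Server went on sale.
2007: The Open Source Development Labs (OSDL), the non-profit organization employing Linus Torvalds, merged with the Free Standards Group (FSG) to form the Linux Foundation. The Linux Foundation promotes the development of a very large number of OSS projects, not just Linux.
2008: GitHub launched, further energizing the OSS community.
Late 2000s: Microsoft, which had once called OSS a “cancer”, stopped opposing OSS and instead began to contribute to it actively. Contributions to OSS from other large companies also increased.
2023: The Open Source Initiative, the non-profit public benefit corporation that maintains the definition of OSS, began drafting a definition of Open Source AI. The resulting definition was published as the Open Source AI Definition (OSAID) v1.0 (2024). For training data, which can be called the “source code” of a machine learning model (although it is not quite the same thing), the definition tolerates non-disclosure for legal reasons.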

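Why should you contribute to OSS?

With the overview and history of OSS in mind, let us consider the title question.

Dependence on OSS is unavoidable

Synopsys reports that, as of 2023, 96% of commercial codebases contain OSS (2024 Open Source Security and Risk Analysis Report). Even those who are not aware of using OSS directly depend on it at least indirectly, given development tools, operating systems, and so on. As a familiar example, smartphones running iOS or Android contain a great deal of OSS, so even people who are not software engineers depend on OSS without knowing it.

Now that dependence on OSS is unavoidable, stagnation of and vulnerabilities in OSS translate directly into threats to business and society. In particular, enormous damage can result when an OSS project is hijacked and malicious code is planted in it.

Such threats can be mitigated when companies, schools, organizations, and individuals actively engage with OSS themselves. Engagement here is not limited to coding; if anything, it is about project management as well. That said, it is often impossible to take part in management without contributing code, so contributing code comes first.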

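Who develops and maintains OSS?

Developing and maintaining OSS tends to be regarded as a personal hobby, but that is not necessarily accurate. It is true that much OSS was developed as a hobby (“Just for Fun”), such as Linux as of 1991, but much of today's major OSS is developed as part of the work of companies and organizations. Linus Torvalds, who created Linux, has been employed by the aforementioned OSDL (now the Linux Foundation) since 2003.

According to a survey by Tidelift (The 2023 Tidelift state of the open source maintainer report), 13% of OSS maintainers earn most of their income, and 23% earn part of their income, from OSS work. In total, 36% of maintainers earn income from OSS work. A “maintainer” is a developer with administrative privileges for a project, and is also called a “committer” in some projects. For developers other than maintainers, the share earning income from OSS work is presumably lower. For large and active OSS projects, the share is probably higher (my personal impression).

Some companies, even those that depend heavily on OSS, do not allow their employees to work on OSS as part of their job. However, treating OSS as a personal hobby and exploiting contributors' sense of fulfillment (「やりがい搾取」) is not sustainable.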

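The fallacy of composition

In a market economy, for-profit companies should choose economically rational actions. However, what counts as a rational action is not self-evident. Pursuing rationality at the micro level can turn out to be irrational at the macro level. This is called the fallacy of composition, a term attributed to the economist Paul Samuelson.

Applied to OSS: OSS can be used free of charge, and someone else develops and maintains it anyway, so not contributing appears rational from a micro perspective. However, if everyone chooses this “rational” action, nobody contributes to OSS any more, so it is not rational after all. Contributing to OSS yourself and sharing its value with others is in fact the rational choice.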

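Is OSS a gift economy?

To promote understanding of OSS culture, Eric S. Raymond classified the ways of gaining social status into three categories in his essay “Homesteading the Noosphere” (1998):

Command hierarchy: Relies on military force and coercion.
Exchange economy: Relies on control over the things to be used or traded. A typical example is the free market economy.
Gift culture: Relies on what one gives away. A typical example is the potlatch of the indigenous peoples of North America. The potlatch is also called competitive gift-giving: a custom of gaining power by repeatedly giving food and furs until the recipient can no longer reciprocate. Although not mentioned in Raymond's essay, a related work is “The Gift” (Essai sur le don, 1925) by the sociologist Marcel Mauss. Its Japanese translation 「太平洋民族の原始經濟 : 古制社會に於ける交換の形式と理由」 (1943) can be read free of charge in the National Diet Library Digital Collections (registration required). Mauss points out that gifts are not free but carry an obligation to reciprocate, at times excessively, and that those who fail to fulfill this obligation lose their status.

Raymond classified OSS under the third category, “gift culture”, and pointed out that it gives rise to a situation in which reputation among one's peers is the only available measure of competitive success. However, reputation hardly seems to be the sole motivation in today's OSS. At the individual level, salary can be the primary motivation for those who work on OSS as part of their job. Even for hobbyists, self-improvement can be a motivation. At the company level, motivations such as the following are conceivable:

Ensuring security by keeping projects maintained
Creating new technology through collaboration with external developers
Reducing the burden of maintaining internal forks

Of course, reputation can also be a major motivation, but it can be interpreted as an economic benefit rather than a desire for approval or a matter of sentiment. At the individual level, promotions and job changes can be the economic benefit; at the company level, sales and talent acquisition.

In the end, the classical gift model does not seem to fully explain the dynamics of OSS communities. In particular, users who never took part in development do not lose any status within the developer community even if they neglect to “reciprocate” the “gift”. Or rather, a status that was never established cannot be lost. OSS activity can be interpreted as rational behavior under a free market economy without applying the gift model. It can be regarded as a rare example of pure public goods being supplied efficiently and sustainably by non-public sectors.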

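What happens if we keep free-riding on OSS?

Even if the classical gift model does not fully apply to OSS, the conclusion that free-riding is undesirable remains the same. If free-riding continues, OSS communities stagnate, new features stop being added, and bugs stop being fixed. This in itself, however, is not a big problem for users, because they have options such as switching to other software or forking and maintaining it themselves. The real problem is a “Gift” from a malicious developer. “Gift” means “present” in English but “poison” in German. The meaning apparently shifted from something given, to something administered, to poison.

The xz hijacking incident

One example of such a “Gift” is the hijacking of xz and liblzma that came to light in March 2024. xz, a compression and decompression tool, and its library liblzma are components included by default in most Linux distributions, and tended to be taken for granted as trustworthy. Indeed, the original maintainer had no malicious intent, but a maintainer going by the name “Jia Tan” (probably a pseudonym), who joined development partway through, planted a backdoor that enabled unauthorized SSH connections.

Behind the success of “Jia Tan” in hijacking xz and liblzma was the fact that the development community had stagnated even though xz and liblzma were widely used. In this stagnant community, “Jia Tan” spent two years building trust with useful and (seemingly) harmless contributions, and was granted maintainer privileges despite having no verified identity. Because far too much time was invested for an individual's prank, it is also suspected to have been an organized operation.

In the case of xz and liblzma, the backdoor was fortunately discovered before it reached stable releases of the major distributions. However, the xz and liblzma case may be just the tip of the iceberg. Other OSS may also have been hijacked by malicious individuals or organizations.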

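Other incidents

Here are some somewhat similar incidents.

January 2022: The JavaScript libraries colors.js and faker.js were intentionally sabotaged by their own developer, who added code that printed gibberish and ASCII art in an infinite loop. This affected AWS Cloud Development Kit, among others. Earlier, in 2020, the developer had posted the following comment: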

”Respectfully, I am no longer going to support Fortune 500s
( and other smaller sized companies ) with my free work.”
— faker.js developer (November 9, 2020)

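March 2022: node-ipc, an IPC library for Node.js, was intentionally sabotaged by its own developer, who added code that destroyed files when run in Russia or Belarus.
February 2024: The domain and GitHub account of polyfill.io, a JavaScript library that smooths over differences between browsers, were sold, and code redirecting to malicious sites was added.

These incidents might have been prevented if community contribution and mutual monitoring had been active.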

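What should you contribute?

Let us first consider what to contribute from the viewpoint of community sustainability.

Sustainability of the community

From the viewpoint of community sustainability, it is important to take on the work that other developers do not want to do: for example, fixing bugs, testing, refactoring, updating documentation, and answering questions. The saddest part of the xz incident is that we now have to doubt the good faith of people who volunteer for such unglamorous work.

Supporting and monitoring other developers is also important. Support includes reviewing pull requests and taking over abandoned ones. Things to watch for include suspicious commits, hijacked accounts, and false names or affiliations. Monitoring needs to be mutual and friendly.

Sustainability of your own and your company's activity

The items listed under “Sustainability of the community” are dull. If you or your company cannot stay motivated and keep going, you can no longer contribute to the community's sustainability either, so it is also important to contribute in ways that keep you motivated.

For example, proposing a large new feature can keep you active (or, put less kindly, tied down) for several years or more through maintaining that feature.

You may also build elaborate features as self-improvement or as a hobby to stay motivated, but you then need to consider whether other developers will be able to maintain them. If you would rather not, consider forking or starting a new project instead of having the feature merged into an existing project.

If you do not know what you want to do, it is fine to start with something simple such as fixing typos, but if you do nothing but fix typos you risk being regarded as a troll.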

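Writing code is not the only contribution

Writing code is not the only form of contribution; contributions on the management side are actually the most important. In particular, the key is how to get other companies and other people to act. One example is to sort out the parts that need coding or testing and recruit developers for them. It is not easy to prompt action from people at other companies, to whom you cannot give work orders, but if you can bring a project together across companies, development can proceed efficiently.

It is also important to identify pull requests and vulnerability reports that have been left unreviewed, and to assign appropriate reviewers or make the decision yourself. For a pull request that you do not intend to merge, rejecting it is kinder than leaving it unattended: with an explicit rejection, the developer can move on without waiting for a review. Handling of vulnerability reports can drag on, because it may be doubtful whether the vulnerability actually exists, or because deciding when to disclose is difficult from the standpoint of preventing zero-day attacks. Disclosure may also require negotiation with major users.

For this kind of management, personal connections and negotiating skills are often more important than software skills. Even so, someone who is all talk and writes no code does not gain influence, so in the end writing code remains the foundation.

In addition, funding the various foundations is of course also an important contribution.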

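Evaluating contributors

Let us consider how contributors should be evaluated, from the viewpoint of the community and from that of the company.

Evaluation within the community mainly concerns the selection of maintainers. A contributor who has added a large new feature may be entrusted with long-term maintenance of the module concerned. Those who can handle bug triage, release management, and coordination with other projects are more likely to be entrusted with managing the project as a whole.

Next, let us consider how a company should evaluate employees who work on OSS. Ideally, OSS work would be directly linked to sales of the company's products, but using that as the only metric risks undermining the sustainability of the community. From the viewpoint of community sustainability, it is desirable to incorporate evaluation within the community into internal evaluation as well. Note that many companies measure the number of commits or lines of code in order to evaluate results quantitatively, but too much emphasis on quantification encourages nuisance behavior. Some organizations are seen submitting large numbers of trivial pull requests, such as typo fixes, in order to meet numerical targets.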

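My own OSS contributions at NTT

The NTT Group has actively contributed to many OSS projects, including Linux, PostgreSQL, OpenStack, and Hadoop.

I belong to the Software Innovation Center of NTT (Nippon Telegraph and Telephone Corporation), and serve as a maintainer of major container-related OSS such as Docker/Moby (2016-), BuildKit (2017-), containerd (2017-), runc (2020-), and OCI Runtime Spec (2022-). At the end of 2015, I ran into a filesystem-related issue in Docker, which prompted me to start submitting pull requests and making other contributions such as triaging issues; as a result, I came to be entrusted with the role of maintainer.

As a functionally large contribution, I implemented a technology called Rootless containers, which strengthens security by running the container runtime without root privileges, in containerd, BuildKit, Docker, and Kubernetes (2018-). slirp4netns, which I developed as a module that makes container networking usable without root privileges, was also adopted by Podman, the Docker-compatible OSS led by Red Hat. Looking back, my motivation rose greatly when the proposal was adopted by Docker, but I feel that the gain was partly offset by the time it took to be merged and released.

Building on this experience, in 2020 I started developing nerdctl (contaiNERD CTL) as a Docker-compatible project based on containerd. The reason for launching a new compatible project was that releases of Docker as OSS (Moby) were stagnating at the time, and it remained difficult to take in the security and performance improvements that were progressing on the containerd side. Docker as OSS (Moby) has since regained its vitality.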

nerdctl: Docker-compatible CLI for contaiNERD

I also developed Lima as a tool for easily launching Linux virtual machines with nerdctl included. nerdctl and Lima have been incorporated into products such as SUSE's Rancher Desktop and AWS's Finch, and are widely used.

Lima is now a CNCF project 🎉

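Since around 2023, I have been pursuing several efforts in parallel to improve the supply chain security of OSS. As one of them, I have been promoting the adoption of Reproducible Builds, a technique that makes it easier to detect source code tampering, for containers. Although implementation of the tooling is progressing, negotiations on its adoption by Docker Hub are proving difficult. This has reminded me once again that negotiating skills matter more than implementation skills in OSS work.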

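The future of OSS

Finally, here is my outlook on the future of OSS.

Influence from the LLM community

For better or worse, OSS is increasingly exposed to influence from the LLM community. There is a concern that code generated by LLM-based assistants such as GitHub Copilot may contain other parties' copyrighted work, but from a productivity standpoint it is not realistic to ban the use of LLMs. Even if they were banned, contributors would use them anyway, so a ban would not be effective. The challenge is how to dispel licensing concerns while maintaining and improving productivity.

In addition, the culture of calling LLMs “open source” even when they restrict the purpose of use may spread to software other than LLMs. It is becoming more important than ever not to trust something merely because it calls itself “open source” or is reported as such, and to check the original license text.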

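Declining anonymity

Many OSS developers do not wish to disclose their names or affiliations, but such anonymous developers may find some activities difficult. In particular, since the xz and liblzma hijacking incident of March 2024, it has been becoming harder to appoint a person of unknown identity as a maintainer, however capable.

One way for new contributors to earn trust is to attend conferences such as Open Source Summit and FOSDEM (Free and Open source Software Developers’ European Meeting) in person and interact with other developers. However, how to include individual contributors who cannot charge travel expenses to an employer is an open issue. Even contributors who work on OSS as part of their job may find it hard to request a business trip when they are not speaking.

Quantifying projects

This article has repeatedly referred to the sustainability of OSS projects, but quantifying it is not easy. A rather grim metric called the “bus factor” (“truck factor”), which indicates how many developers can be hit by a bus before the project can no longer be sustained, has been proposed, but it lacks rigor and does not seem to be widely used. Another metric seems to be needed.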
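
As a rough illustration of why such a metric is easy to compute but hard to make rigorous, here is a minimal Python sketch of one common greedy heuristic. Everything in it is an assumption made for illustration: the file-to-author mapping is invented, the rule that a project “stalls” once more than half of its files have no remaining author is just one possible threshold, and bus_factor is a hypothetical helper rather than an established tool.

```python
from collections import Counter


def bus_factor(file_authors: dict, threshold: float = 0.5) -> int:
    """Greedy estimate of how many developers must leave before the project stalls.

    file_authors maps a file path to the set of developers who know that file.
    The project is treated as stalled once the share of files with no
    remaining author exceeds the threshold (an arbitrary assumption).
    """
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = 0
    while total:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if orphaned / total > threshold:
            break
        counts = Counter(a for authors in remaining.values() for a in authors)
        if not counts:
            break
        # Remove the developer who currently covers the most files.
        top_author, _ = counts.most_common(1)[0]
        for authors in remaining.values():
            authors.discard(top_author)
        removed += 1
    return removed


# Invented example data: alice knows most of the code base.
example = {
    "a.go": {"alice"},
    "b.go": {"alice", "bob"},
    "c.go": {"alice"},
    "d.go": {"carol"},
}
print(bus_factor(example))
```

The result depends heavily on how “knowing a file” and “stalled” are defined, which is one reason the metric lacks rigor.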

It would also be useful to build mathematical models of how projects grow, decline, and (sometimes) revive. It would be good to be able to grow or revive projects in a reproducible way.

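Summary

This has been a long article, but there are three points I wanted to convey:

Why should you contribute to OSS?
→ For sustainability (though not only for that)
Neglected OSS can be hijacked by malicious developers
Free-riding looks rational but is not

I hope this article helps, even a little, to further invigorate the OSS community.

We at NTT are looking for colleagues to work with us in various open source communities. Please see our recruiting page.

Why You Should Contribute to Open Source Software (なぜオープンソースソフトウェアにコントリビュートすべきなのか) was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.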

containerd v2.0, nerdctl v2.0, and Lima v1.0

Akihiro Suda — Wed, 06 Nov 2024 06:13:39 GMT

Ahead of KubeCon North America 2024 (November 12–15), this week saw the releases of containerd v2.0, nerdctl (contaiNERD CTL) v2.0, and Lima v1.0 🎉.

containerd v2.0

The internals and the latest trends of container runtimes (2023)

containerd was originally written by Docker, Inc. in 2015 to provide a minimalistic daemon to manage the lifecycles of containers, under the hood of the Docker daemon.

containerd was transferred to the Cloud Native Computing Foundation (CNCF) and reached its v1.0 in 2017, with the expanded scope of the project to support non-Docker use cases. The built-in support for Kubernetes was merged in v1.1 (2018).

containerd v2.0 focuses on the removal of legacy features that have been deprecated over the past nine years. These breaking changes resulted in bumping the major version from v1 to v2.

Removed features

The old containerd-shim and containerd-shim-runc-v1, in favor of containerd-shim-runc-v2. The old shims lacked support for modern features such as cgroup v2 and were inefficient at supporting Kubernetes pods. They had been deprecated since containerd v1.4 (2020).
The support for AUFS, in favor of OverlayFS, which has been merged into the upstream Linux kernel. The support for AUFS had been deprecated since containerd v1.5 (2021).
The support for the Kubernetes CRI v1alpha2 API, in favor of CRI v1. Kubernetes has already dropped the support for CRI v1alpha2, in Kubernetes v1.26 (2022).
The support for "Docker Schema 1" images is now disabled, in preparation for its removal in containerd v2.1. Schema 1 has been effectively deprecated since circa 2017 in favor of Schema 2, introduced in Docker v1.10 (2016), but some image registries did not support Schema 2 until around 2020. Docker already disabled pushing Schema 1 images in Docker v20.10 (2020), so almost all images built in the last few years should be in Schema 2 or its successor, OCI Image Spec v1. ("OCI" here refers to "Open Container Initiative", not to "Oracle Cloud Infrastructure".)

containerd v1.6.27+/v1.7.12+ users can investigate whether they are using those removed features, by running the ctr deprecations list command.

New features

User Namespaces for Kubernetes, so as to map the user IDs in pods to different user IDs on the host. In particular, this feature allows mapping the root user in the pod to an unprivileged user on the host.
Recursive Read-only Mounts for Kubernetes, so as to prohibit accidentally having writable submounts. See also my previous blog at kubernetes.io: <https://kubernetes.io/blog/2024/04/23/recursive-read-only-mounts/>.
Image verifier plugins, so as to enforce cryptographic signing, malware scanning, etc.

Other notable changes

Sandboxed CRI is now enabled by default, for efficient handling of pods
NRI (Node Resource Interface) is now enabled by default, for plugging vendor-specific logic into runtimes
CDI (Container Device Interface) is now enabled by default, for the enhanced support for Kubernetes Device Plugins.
/etc/containerd/config.toml now expects the version=3 header. The previous config versions are still supported.
The Go package github.com/containerd/containerd has been renamed to github.com/containerd/containerd/v2/client.

nerdctl v2.0

nerdctl (contaiNERD CTL) is a Docker-like command line interface tool for containerd.

nerdctl was originally written by myself in 2020 to facilitate experimental features such as eStargz that were not supported in Docker at that time. nerdctl became a subproject of containerd in 2021, and reached its v1.0 in 2022.

Released nerdctl v1.0

nerdctl v2.0 enables detach-netns for Rootless mode by default:

Faster and more stable nerdctl pull, nerdctl push, and nerdctl build
Proper support for nerdctl pull 127.0.0.1:.../...
Proper support for nerdctl run --net=host

The detach-netns mode may sound similar to bypass4netns, which utilizes SECCOMP_IOCTL_NOTIF_ADDFD to accelerate socket syscalls in rootless containers. While bypass4netns accelerates containers, detach-netns accelerates the runtime layers that are responsible for pulling and pushing images, by leaving them in the host network namespace. Containers are executed in the "detached" network namespace so that they can obtain IP addresses used for container-to-container communication.

Other major changes in nerdctl v2.0 include the addition of nerdctl run --systemd for running systemd in containers. Also, stability was significantly improved in this release, thanks to extensive refactoring and testing by the GitHub user @apostasie.

See also the release note: https://github.com/containerd/nerdctl/releases/tag/v2.0.0

Lima v1.0

Lima is a command line utility to run containerd and nerdctl on desktop operating systems such as macOS, by running a Linux virtual machine with automatic filesystem sharing and port forwarding. Lima is often compared with WSL2, the former Docker Machine, and Vagrant.

brew install lima
limactl start
lima nerdctl run -p 80:80 nginx

Lima was also originally written by myself, in 2021, and joined the CNCF in 2022. Lima has been adopted by several well-known third-party projects such as Colima, Rancher Desktop, and AWS’s Finch.
Lima is also used by several organizations including NTT Communications.

Lima is now a CNCF project 🎉

Lima finally reached v1.0 today, with the support from 110+ contributors and 15,000+ stargazers in the past 3+ years.

https://star-history.com/#lima-vm/lima

This release introduces several breaking changes, such as switching the default machine driver on macOS from QEMU to Virtualization.framework (VZ) for better filesystem performance.

The limactl CLI is designed to print hints when the user hits those breaking changes. For example, limactl create template://experimental/vz now fails with a hint that suggests using limactl create --vm-type=vz template://default instead.

Other notable changes include the addition of the support for nested virtualization, UDP port forwarding, and the limactl tunnel command (SOCKS proxy).

See also the release note: https://github.com/lima-vm/lima/releases/tag/v1.0.0

Visit the maintainers at KubeCon

Some of the maintainers of the projects, including myself, will show up at KubeCon North America 2024:

Wednesday, November 13

15:15–20:00: Project Kiosk: containerd

Friday, November 15

10:30-14:30: Project Kiosk: Lima
11:55-12:30: What Containerd 2.0 Means for You — Samuel Karp, Google
14:55-15:30: What’s Going on in the Containerd Neighborhood? — Phil Estes, AWS; Samuel Karp, Google; Akihiro Suda (myself), NTT; Michael Brown, IBM; Kirtana Ashok, Microsoft

The full schedule of the conference can be found at <https://kccncna2024.sched.com/>.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities in the fields of containers, etc. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

containerd v2.0, nerdctl v2.0, and Lima v1.0 was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

Accelerating Llama on Lima, with WASI-NN RPC

Akihiro Suda — Wed, 19 Jun 2024 23:38:14 GMT

WasmEdge v0.14 was released last month, with our contribution for exposing WASI-NN (WebAssembly System Interface API for Neural Networks) over gRPC.

The WASI-NN RPC is useful for accelerating LLM workloads (e.g., Llama) on virtual machines (e.g., Lima) that do not support virtualizing GPUs.

On Apple M2 Pro, Llama 2 can run 22.3 times faster. (0.66 tokens/s → 14.73 tokens/s)

Note: “Lima” in this context refers to <https://lima-vm.io> (VM), not to <https://gitlab.freedesktop.org/lima> (Mali GPU driver).

Problem: GPUs are inaccessible from VMs

Lima is a tool that creates a Linux virtual machine with a simple command line interface. Lima was originally made for running containerd including nerdctl (contaiNERD CTL) on macOS. However, Lima has gained popularity for other use cases as well.

Lima is now a CNCF project 🎉

For macOS hosts, Lima supports two backends: QEMU and Virtualization.framework. The lack of GPU support in these backends has been a huge burden for users who want to efficiently run AI workloads such as Llama inside Lima.

Solution: WASI-NN as the high-level RPC for neural networks on GPUs

Implementing GPU passthrough in these VM backends is not a straightforward task. Instead, we chose to implement an RPC subsystem that delegates neural network computations to a host process (WASI-NN RPC Server) with direct access to the host GPUs.

The RPC is built on top of gRPC and directly mapped to the WITX specification of the WASI-NN API.

// gRPC
message SetInputRequest {
  uint32 resource_handle = 1;
  uint32 index = 2;
  Tensor tensor = 3;
}

message ComputeRequest {
  uint32 resource_handle = 1;
}

message GetOutputRequest {
  uint32 resource_handle = 1;
  uint32 index = 2;
}

message GetOutputResult {
  bytes data = 1;
}

service GraphExecutionContextResource {
  rpc SetInput(SetInputRequest) returns (google.protobuf.Empty) {};
  rpc Compute(ComputeRequest) returns (google.protobuf.Empty) {};
  rpc GetOutput(GetOutputRequest) returns (GetOutputResult) {};
}

;; WITX
(@interface func (export "set_input")
  (param $context $graph_execution_context)
  (param $index u32)
  (param $tensor $tensor)
  (result $error (expected (error $nn_errno)))
)

(@interface func (export "compute")
  (param $context $graph_execution_context)
  (result $error (expected (error $nn_errno)))
)

(@interface func (export "get_output")
  (param $context $graph_execution_context)
  (param $index u32)
  (param $out_buffer (@witx pointer u8))
  (param $out_buffer_max_size $buffer_size)
  (result $error (expected $buffer_size (error $nn_errno)))
)
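
To illustrate how the two definitions line up, here is a minimal Python sketch of the call sequence. It does not use gRPC or WasmEdge at all: the dataclasses are plain stand-ins for the protobuf messages, FakeServer is a hypothetical in-process replacement for the host-side service whose “inference” merely reverses the input bytes, and the mapping of the WITX $context parameter to resource_handle, as well as the truncation to out_buffer_max_size, are assumptions read off the two listings above.

```python
from dataclasses import dataclass


# Plain-Python stand-ins for the protobuf messages shown above.
@dataclass
class SetInputRequest:
    resource_handle: int
    index: int
    tensor: bytes  # simplification: the real field is a Tensor message


@dataclass
class ComputeRequest:
    resource_handle: int


@dataclass
class GetOutputRequest:
    resource_handle: int
    index: int


@dataclass
class GetOutputResult:
    data: bytes


class FakeServer:
    """Hypothetical in-process replacement for the host-side service."""

    def __init__(self):
        self.inputs = {}
        self.outputs = {}

    def SetInput(self, req):
        self.inputs[(req.resource_handle, req.index)] = req.tensor

    def Compute(self, req):
        # Fake "inference": reverse every input that belongs to this context.
        for (handle, index), tensor in self.inputs.items():
            if handle == req.resource_handle:
                self.outputs[(handle, index)] = tensor[::-1]

    def GetOutput(self, req):
        key = (req.resource_handle, req.index)
        return GetOutputResult(data=self.outputs[key])


# Guest-side wrappers named after the WITX functions.
# Assumption: the WITX $context becomes resource_handle in each request.
def set_input(server, context, index, tensor):
    server.SetInput(SetInputRequest(context, index, tensor))


def compute(server, context):
    server.Compute(ComputeRequest(context))


def get_output(server, context, index, out_buffer_max_size):
    data = server.GetOutput(GetOutputRequest(context, index)).data
    # Assumption: output beyond the caller's buffer size is cut off.
    return data[:out_buffer_max_size]


server = FakeServer()
set_input(server, context=1, index=0, tensor=b"abc")
compute(server, context=1)
result = get_output(server, context=1, index=0, out_buffer_max_size=16)
print(result)
```

The point of the sketch is only the shape of the protocol: three calls that share one handle, with the pointer-and-size output parameters of WITX replaced by a bytes value in the response.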

The RPC client is implemented in WasmEdge. The RPC itself is agnostic to WASM and can be implemented by non-WASM applications too.

Why does WASM matter here?

Actually, it really doesn’t. WASM appears here simply because:

the WASI-NN API provides a quite simple abstraction for neural networks
the WasmEdge implementation of WASI-NN already covers several backends such as PyTorch and GGML, with the support for Apple Metal.

Alternatively, Dawn Wire (RPC for WebGPU) could be adopted instead of WASM and WASI-NN, but it would incur a higher implementation cost due to the difference in abstraction levels.

Demo: 22 times faster

Launching Lima

An instance of Lima virtual machine can be created as follows:

# Host (macOS)
brew install lima
limactl start --vm-type=vz
lima

At the time of writing, the brew command installs Lima v0.22 with Ubuntu 24.04 as the default VM template.

The --vm-type=vz flag in the limactl start command specifies Virtualization.framework (vz) as the VM driver. This flag is optional, but recommended for better performance and stability.

Installing WasmEdge onto the Lima guest

After running the lima command to open a shell for the VM, run the following commands to install WasmEdge inside the guest:

# Guest (Linux)
sudo apt-get install -y cmake libgrpc++-dev liblld-dev libopenblas-dev libopenblas64-dev llvm ninja-build pkg-config protobuf-compiler-grpc

git clone https://github.com/WasmEdge/WasmEdge.git 
cd WasmEdge
git checkout 0.14.0

cmake -S. -B ./build -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=ON \
  -DWASMEDGE_BUILD_WASI_NN_RPC=ON
cmake --build ./build
sudo cmake --install ./build

Running Llama on Lima, without the acceleration

Inside the Lima VM, Llama can be executed with WasmEdge as follows:

# Guest (Linux)
curl -OSL https://github.com/second-state/WasmEdge-WASINN-examples/raw/da18b35c3c911a40a5d2784947ce78610ce51daf/wasmedge-ggml/nnrpc/wasmedge-ggml-nnrpc.wasm
curl -OSL https://huggingface.co/wasmedge/llama2/resolve/23de599453ce999ab1dc650bd01f6298af38eb18/llama-2-7b-chat-q5_k_m.gguf

wasmedge \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  --env enable_log=true \
  wasmedge-ggml-nnrpc.wasm default

The license and acceptable use policy for the llama-2-7b-chat-q5_k_m.gguf file can be found at <https://huggingface.co/wasmedge/llama2/tree/23de599>.
Llama was chosen to be executed inside Lima as a pun; it is possible to use other GGUF-formatted models as well.

In the terminal, you can chat with the model, but it is quite slow (0.66 tokens per second on Apple M2 Pro) due to the lack of access to the host GPUs:

USER:
What is the capital city of Peru?
[...]
eval time =   13535.83 ms /     9 runs   ( 1503.98 ms per token,     0.66 tokens per second)
[...]
ASSISTANT:
The capital city of Peru is Lima.

It may even appear to hang, as the model’s output is not printed until text generation is complete. This issue is being addressed in <https://github.com/WasmEdge/WasmEdge/pull/3386> by implementing the WASI-NN Streaming Extension.

Installing WASI-NN RPC server onto the macOS host

The next step is to install WasmEdge along with the WASI-NN RPC server onto the macOS host, so that the guest can delegate the LLM inference computations to the host, which has access to the GPUs.

# Host (macOS)
brew install cmake grpc llvm@16 ninja pkg-config

git clone https://github.com/WasmEdge/WasmEdge.git 
cd WasmEdge
git checkout 0.14.0

export LLVM_DIR="${HOMEBREW_PREFIX}/opt/llvm@16/lib/cmake"
export CC="${HOMEBREW_PREFIX}/opt/llvm@16/bin/clang"
export CXX="${HOMEBREW_PREFIX}/opt/llvm@16/bin/clang++"
cmake -S. -B ./build -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_METAL=ON \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF \
  -DWASMEDGE_BUILD_WASI_NN_RPC=ON
cmake --build ./build
sudo cmake --install ./build

The WASI-NN RPC server listens on a UNIX domain socket on the host. The socket can be forwarded to the guest with ssh -R:

# Host (macOS)
curl -OSL https://huggingface.co/wasmedge/llama2/resolve/23de599453ce999ab1dc650bd01f6298af38eb18/llama-2-7b-chat-q5_k_m.gguf

wasi_nn_rpcserver \
  --nn-rpc-uri unix://$HOME/nn.sock \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf

ssh -F $HOME/.lima/default/ssh.config -R /home/${USER}.linux/nn.sock:$HOME/nn.sock lima-default

Running Llama on Lima, with the acceleration

WasmEdge running inside the Lima instance can now connect to the WASI-NN RPC server socket with the --nn-rpc-uri flag:

# Guest (Linux)
wasmedge \
  --nn-rpc-uri unix://$HOME/nn.sock \
  --env enable_log=true \
  wasmedge-ggml-nnrpc.wasm default

# Before
eval time =   13535.83 ms /     9 runs   ( 1503.98 ms per token,     0.66 tokens per second)

# After
eval time =     611.14 ms /     9 runs   (   67.90 ms per token,    14.73 tokens per second)

On Apple M2 Pro, the performance is improved from 0.66 tokens per second to 14.73 tokens per second. (22.3 times faster)
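
The figures can be recomputed from the two log lines above. The following Python sketch is only a worked check of the arithmetic; the regular expression and the helper name tokens_per_second are my own and are not part of WasmEdge.

```python
import re

# The two "eval time" lines quoted above.
BEFORE = "eval time =   13535.83 ms /     9 runs   ( 1503.98 ms per token,     0.66 tokens per second)"
AFTER = "eval time =     611.14 ms /     9 runs   (   67.90 ms per token,    14.73 tokens per second)"

EVAL_RE = re.compile(r"eval time =\s*([\d.]+) ms /\s*(\d+) runs")


def tokens_per_second(line: str) -> float:
    """Recompute the rate from total milliseconds and number of runs."""
    match = EVAL_RE.search(line)
    if match is None:
        raise ValueError("not an eval time line")
    total_ms = float(match.group(1))
    runs = int(match.group(2))
    return runs / (total_ms / 1000.0)


before = tokens_per_second(BEFORE)
after = tokens_per_second(AFTER)
speedup = after / before

print(f"{before:.2f} -> {after:.2f} tokens/s ({speedup:.1f}x)")
```

Computed from the unrounded timings, the ratio comes out at roughly 22.1; dividing the rounded rates (14.73 / 0.66) gives the 22.3 quoted above.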

Future: wRPC

In the future, WASI-NN RPC may be replaced by wRPC. wRPC is a fairly new Bytecode Alliance project that aims to define the standard for the distributed communication model of WASM components. wRPC could potentially be useful for exposing other host resources, such as biometric authenticators, to Lima as well.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities in the fields of containers, WASM, LLM, etc. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

We at NTT are looking for colleagues to work with us in open source communities in areas such as containers, WASM, and LLM. Please see our recruitment page: <https://www.rd.ntt/sic/recruit/>

Accelerating Llama on Lima, with WASI-NN RPC was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

[DockerCon 2023] Reproducible builds with BuildKit for software supply chain security

Akihiro Suda — Mon, 23 Oct 2023 14:13:41 GMT

This is a recap of my talk “Reproducible builds with BuildKit for software supply chain security” at DockerCon (October 5th, 2023).

Slide 1

This was similar to my previous talk at FOSDEM in February, but the toolchain was simplified since then.

Background

Security assessment of third-party Docker images has long been a challenge, due to the lack of verifiability in the software supply chain.

Images maintained by a reputable organization or individual are often considered trustworthy; however, it is hard to rule out the possibility that the maintainers have silently injected malicious code that is not present in the source repo. Also, even if they have no malicious intent, their images can still be compromised through an accidental leak of registry credentials.

Reproducible builds reduce this concern. A reproducible build ensures that a bit-for-bit identical image can be reproduced from its source code, by anybody, at any time. When multiple actors can attest to an image’s reproducibility, it signifies that the image contains no code of a secret origin.

Slide 3

Are Docker Hub images actually reproducible?

Most of them are not. You can run docker build https://github.com/docker-library/... to rebuild a Docker Hub image yourself, and use my diffoci (diff for Open Container Initiative images) tool <https://github.com/reproducible-containers/diffoci> to see why they are not reproducible:

docker pull golang:1.21.1-alpine@sha256:96634e55b363cb93d39f78fb18aa64abc7f96d372c176660d7b8b6118939d97b

# DOCKER_BUILDKIT=0 with Docker 20.10.23 corresponds to the current Docker Hub image (Will change in the future)
export DOCKER_BUILDKIT=0
docker build -t my-golang "https://github.com/docker-library/golang.git#585c8c1e705a7a458455f0629922a4f90628ce08:1.21/alpine3.18"

go install github.com/reproducible-containers/diffoci/cmd/diffoci@latest

diffoci diff docker://golang:1.21.1-alpine docker://my-golang

The diffoci result for golang:1.21.1-alpine contains more than 14,000 lines of diffs, but most of them are just the differences of the timestamps:

$ diffoci diff docker://golang:1.21.1-alpine docker://my-golang
TYPE     NAME                                                   INPUT-0                         INPUT-1
Desc     application/vnd.docker.distribution.manifest.v2+json   b25862...                       3c4eca0...
...
File     etc/ssl/certs/3e45d192.0                               2023-08-09 03:36:47 +0000 UTC   2023-09-21 08:35:31 +0000 UTC
...
(More than 14,000 lines)
...
File     go/                                                    2023-09-06 18:31:40 +0000 UTC   2023-09-21 08:35:45 +0000 UTC

The --semantic flag can be used to ignore such “boring” differences:

$ diffoci --semantic diff docker://golang:1.21.1-alpine docker://my-golang
TYPE     NAME                      INPUT-0                                                                        INPUT-1
Layer    ctx:/layers-1/layer       length mismatch (457 vs 454)                                                   
Layer    ctx:/layers-1/layer       name "usr/local/share/ca-certificates/.wh..wh..opq" only appears in input 0    
Layer    ctx:/layers-1/layer       name "etc/ca-certificates/.wh..wh..opq" only appears in input 0                
Layer    ctx:/layers-1/layer       name "usr/share/ca-certificates/.wh..wh..opq" only appears in input 0          
File     lib/apk/db/scripts.tar    eef110e...                                                                     e9bfe18...
Layer    ctx:/layers-2/layer       length mismatch (13939 vs 13938)                                               
Layer    ctx:/layers-2/layer       name "usr/local/go/.wh..wh..opq" only appears in input 0                       
File     lib/apk/db/scripts.tar    60e22bb...                                                                     67f2648...
Layer    ctx:/layers-3/layer       length mismatch (4 vs 3)                                                       
Layer    ctx:/layers-3/layer       name "go/.wh..wh..opq" only appears in input 0

The remaining differences are:

.wh..wh..opq (AUFS whiteouts) are missing in the local build due to the filesystem difference
lib/apk/db/scripts.tar differs due to the timestamp information inside itself (the --semantic flag still isn’t clever enough to ignore timestamps inside nested tar archives)

How to make images reproducible

Timestamps

Timestamps are one of the obvious challenges to achieve reproducibility. Docker/OCI (Open Container Initiative) images have timestamps in:

1. the created property in the OCI Image Config (shown in docker image ls)
2. the history property in the OCI Image Config (shown in docker image history)
3. the org.opencontainers.image.created annotation in the OCI Image Index
4. the timestamps of the files in the image layers

BuildKit v0.11 added support for rewriting the timestamps in 1, 2, and 3 to reduce non-reproducibility.
This feature was extended in BuildKit v0.13 (beta) to cover 4 as well.

# Configure buildx to use BuildKit v0.13 beta1
docker buildx create --use --driver-opt image=moby/buildkit:v0.13.0-beta1

# Rewrite the timestamps in the image to the timestamp of the latest git commit
docker buildx build --build-arg SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) \
  --output type=image,name=example.com/image,push=true,rewrite-timestamp=true .

SOURCE_DATE_EPOCH (uint64; seconds from 1970-01-01 00:00:00 UTC) here is an environment variable standardized by <https://reproducible-builds.org/>. This environment variable is also recognized by gcc, clang, cmake, etc., to make application binaries reproducible too. See <https://reproducible-builds.org/docs/source-date-epoch/> for the details.
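
To see why file timestamps alone are enough to break bit-for-bit reproducibility, here is a minimal Python sketch that builds the same tiny "layer" as a tar archive several times. This is a toy model and not BuildKit's implementation: the helper names, the file contents, and the epoch values are all arbitrary assumptions chosen for illustration.

```python
import hashlib
import io
import tarfile


def build_layer(files: dict[str, bytes], mtime: int) -> bytes:
    """Build an uncompressed tar archive in memory, stamping every entry with `mtime`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed ordering, so only mtime can differ
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


files = {"etc/hostname": b"example\n", "app/hello.txt": b"hello\n"}

# Two builds of identical content at different wall-clock times (arbitrary values).
digest_a = digest(build_layer(files, mtime=1691552207))
digest_b = digest(build_layer(files, mtime=1695285331))

# Two builds that both rewrite the timestamps to one agreed epoch (arbitrary value).
source_date_epoch = 1693526400
digest_c = digest(build_layer(files, mtime=source_date_epoch))
digest_d = digest(build_layer(files, mtime=source_date_epoch))

print("different build times, same digest:", digest_a == digest_b)
print("fixed epoch, same digest:          ", digest_c == digest_d)
```

The first comparison fails and the second succeeds even though the file contents never change, which is the effect that rewriting timestamps is meant to achieve.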

Pinning packages

The base image for a Dockerfile can be pinned with tags like FROM debian:bookworm-20230904-slim. However, this is not enough for reproducing apt-get results, as apt-get installs the packages from the latest repos, not from the snapshot on 2023-09-04.

To install packages from a past snapshot, you have to configure the package manager to use a past snapshot explicitly. For Debian, /etc/apt/sources.list can be configured to use snapshot.debian.org/archive/debian/20230904T000000Z as follows:

FROM debian:bookworm-20230904-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN rm -rf /etc/apt/sources.list* && \
  echo 'deb [check-valid-until=no] http://snapshot.debian.org/archive/debian/20230904T000000Z bookworm main' \
  >/etc/apt/sources.list && \
  echo 'deb [check-valid-until=no] http://snapshot.debian.org/archive/debian-security/20230904T000000Z bookworm-security main' \
  >>/etc/apt/sources.list && \
  echo 'deb [check-valid-until=no] http://snapshot.debian.org/archive/debian/20230904T000000Z bookworm-updates main' \
  >>/etc/apt/sources.list && \
  apt-get update && \
  apt-get install -y gcc
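
The mapping from a dated image tag to the three sources.list lines is mechanical, so it can be sketched in a few lines of Python. The function below is a hypothetical helper written for this article, not how repro-sources-list.sh works internally, and it assumes the tag has the form SUITE-YYYYMMDD with an optional -slim suffix.

```python
import re


def snapshot_sources(tag: str) -> list[str]:
    """Derive snapshot.debian.org sources.list lines from a dated Debian image tag."""
    match = re.fullmatch(r"(?P<suite>[a-z]+)-(?P<date>\d{8})(?:-slim)?", tag)
    if match is None:
        raise ValueError(f"tag does not look like SUITE-YYYYMMDD[-slim]: {tag!r}")
    suite = match["suite"]
    stamp = match["date"] + "T000000Z"  # midnight UTC of the tagged day
    base = "http://snapshot.debian.org/archive"
    return [
        f"deb [check-valid-until=no] {base}/debian/{stamp} {suite} main",
        f"deb [check-valid-until=no] {base}/debian-security/{stamp} {suite}-security main",
        f"deb [check-valid-until=no] {base}/debian/{stamp} {suite}-updates main",
    ]


for line in snapshot_sources("bookworm-20230904-slim"):
    print(line)
```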

I wrote a script <https://github.com/reproducible-containers/repro-sources-list.sh> to simplify setting up /etc/apt/sources.list and enabling the cache for /var/cache/apt:

FROM debian:bookworm-20230904-slim
ADD --chmod=0755 \
  https://raw.githubusercontent.com/reproducible-containers/repro-sources-list.sh/v0.1.0/repro-sources-list.sh \
  /usr/local/bin/repro-sources-list.sh
ENV DEBIAN_FRONTEND=noninteractive
RUN --mount=type=cache,target=/var/cache/apt \
  repro-sources-list.sh && \
  apt-get update && \
  apt-get install -y gcc

Caching /var/cache/apt is optional, but highly recommended, as the snapshot server isn’t as fast as regular apt-get servers. The cache for /var/cache/apt can be saved on GitHub Actions using <https://github.com/reproducible-containers/buildkit-cache-dance>:

steps:
  - uses: actions/cache@v3
    with:
      path: var-cache-apt
      key: var-cache-apt-${{ hashFiles('Dockerfile') }}
  - uses: reproducible-containers/buildkit-cache-dance@v2.1.2
    with:
      cache-source: var-cache-apt
      cache-target: /var/cache/apt

The techniques above work for Ubuntu (snapshot.ubuntu.com) and Arch Linux (archive.archlinux.org) too.

However, this is still challenging for Alpine Linux, Rocky Linux, AlmaLinux, etc., as they do not have snapshot servers. A workaround for these distros is to preserve /etc/apk/cache, /var/cache/dnf, etc. by yourself: <https://github.com/reproducible-containers/repro-pkg-cache>.
In the long term, BuildKit frontends may have a built-in feature to help with this: <https://github.com/moby/buildkit/issues/4259>.

Future work

After the general availability of BuildKit v0.13, I’ll submit PRs to make well-known images reproducible.

We also need a “single-click” platform for attesting reproducibility and sharing the result. This will probably need help from registry service providers.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities like Docker/Moby, BuildKit, and their relevant projects. Visit <https://www.rd.ntt/e/sic/recruit/> to see how to join us.

We at NTT are looking for colleagues to work with us in open source communities such as Docker/Moby and BuildKit. Please see our recruitment page: <https://www.rd.ntt/sic/recruit/>

Links

Tools and examples: <https://github.com/reproducible-containers>

diffoci: diff for OCI images, to analyze non-reproducible builds
repro-sources-list.sh: reproducibility helper for Debian, Ubuntu, etc.
repro-pkg-cache: reproducibility helper for Alpine, Alma, Rocky, etc.
buildkit-cache-dance: apt-get cache for GitHub Actions

BuildKit docs: <https://github.com/moby/buildkit/blob/master/docs/build-repro.md>

[DockerCon 2023] Reproducible builds with BuildKit for software supply chain security was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.

The internals and the latest trends of container runtimes (2023)

Akihiro Suda — Wed, 21 Jun 2023 20:38:01 GMT

Last week I had an opportunity to give an online lecture about containers to students at Kyoto University.

The slide deck can be found here (PDF):

Contents:

Introduction to containers
Internals of container runtimes
Latest trends in container runtimes

1. Introduction to containers

What are containers?

Containers are a set of various lightweight methods to isolate filesystems, CPU resources, memory resources, system permissions, etc. Containers are similar to virtual machines in many senses, but they are more efficient and often less secure than virtual machines. (Slide 5)

An interesting thing is that there is still no strict definition of “containers”. Even virtual machines can be called "containers" when they provide container-like interfaces, e.g., when they implement the OCI (Open Container Initiative) specs. Such "non-container" containers are discussed later in Section 3.

Docker

Docker is the most popular container engine. Docker natively supports Linux containers and Windows containers, but Windows containers are out of the scope of this talk.

A typical command line to start a Docker container is as follows:

docker run -p 8080:80 -v .:/usr/share/nginx/html nginx:1.25

After executing this command, the content of `index.html` in the current directory will be visible at `http://<host>:8080/`.

The `-p 8080:80` part of the command line forwards TCP port 8080 of the host to port 80 of the container.

The `-v .:/usr/share/nginx/html` part mounts the current directory on the host onto `/usr/share/nginx/html` in the container.

The `nginx:1.25` part selects the official nginx image on Docker Hub. Docker images are somewhat similar to virtual machine images; however, they usually do not contain additional daemons such as systemd and sshd.

You can find official images for other applications on Docker Hub too. You can also build your own images, using a language called Dockerfile:

FROM debian:12
RUN  apt-get update && apt-get install -y openjdk-17-jre
COPY myapp.jar /myapp.jar
CMD  ["java", "-jar", "/myapp.jar"]

An image can be built with the `docker build` command, and can be pushed to Docker Hub or other registry services with the `docker push` command.

Kubernetes

Kubernetes clusterizes multiple container hosts such as (but not limited to) Docker hosts to provide load balancing and fault-tolerance (Slide 10).

It is noteworthy that Kubernetes is also an abstraction framework for interacting with objects such as Pods (groups of containers that are always co-scheduled on the same host), Services (entities for network connectivity), and other kinds of objects, but that is beyond the scope of this talk.

Docker vs pre-Docker containers

While containers didn't get much attention until the release of Docker in 2013, Docker wasn’t the first container platform:

1999: FreeBSD Jail
2000: Virtual Environment system for Linux (precursor to Virtuozzo and OpenVZ)
2001: Linux Vserver
2002: Virtuozzo
2004: BSD Jail for Linux
2004: Solaris Containers (Apparently, the term "container" was coined at this time)
2005: OpenVZ
2008: LXC
2013: Docker

It is widely considered that FreeBSD Jail (circa 1999) is the first practical container implementation for Unix-like operating systems, although the term "container" wasn't coined at that time.

Since then, several implementations appeared for Linux too. However, pre-Docker containers were fundamentally different from Docker containers; they focused on mimicking an entire machine with System V init, sshd, syslogd, etc., inside it. It was also common to put a Web server, an application server, a database server, and everything else into a single container.

Docker changed the paradigm. In the case of Docker, a container usually contains only a single service (Slide 14) so that containers can be stateless and immutable. This design significantly reduces maintenance costs, as containers are now disposable: when something needs to be updated, you can just remove the container and recreate it from the latest image. You no longer need to install sshd and other utilities inside the container either, as you never need shell access to it. This also simplifies load balancing and fault tolerance for multi-host clusters.

2. Internals of container runtimes

This section assumes using Docker v24 with its default configuration, but most parts are applicable to non-Docker containers too.

Docker under the hood

Docker consists of the client program (`docker` CLI) and the daemon program (`dockerd`). The `docker` CLI connects to the `dockerd` daemon via a Unix socket (`/var/run/docker.sock`) to create containers.

However, the `dockerd` daemon doesn't create containers by itself. It delegates control to the `containerd` (/container-dee/) daemon to create containers (Slide 17). But `containerd` doesn't create containers either; it further delegates control to the `runc` (/run-see/) runtime, which composes multiple Linux kernel features such as Namespaces, Cgroups, and Capabilities to implement the concept of "containers". There is no "container" object in the Linux kernel.

Namespaces

Namespaces isolate resources from the host and from other containers.

The most well-known namespaces are mount namespaces (Slide 19). Mount namespaces isolate the filesystem view so that a container can change the rootfs to `/var/lib/docker/.../` using the `pivot_root(2)` syscall. This syscall is similar to traditional `chroot(2)` but more secure.

The container's rootfs has a structure very similar to the host's, but it has several restrictions on `/proc`, `/sys`, and `/dev`. For example:

The `/proc/sys` directory is remounted as a read-only bind mount to prohibit sysctl.
The `/proc/kcore` file (RAM) is masked by mounting `/dev/null` over it.
The `/sys/firmware` directory (firmware data) is masked by mounting an empty read-only tmpfs over it.
Accesses to the `/dev` directories are restricted by Cgroups (discussed later).

Network namespaces (Slide 21) allow assigning dedicated IP addresses to containers so that they can talk to each other by IP.

PID namespaces (Slide 23) isolate process trees so that a container can't control processes outside it.

User namespaces (Slide 24; not to be confused with "user spaces") isolate the root privilege by mapping a non-root user on the host to the pseudo "root" in a container. The pseudo root can behave like the root in the container to run `apt-get`, `dnf`, etc., but it doesn't have privileged accesses to resources outside the container.

User namespaces significantly mitigate potential container breakout attacks, but they are still not used by default in Docker.
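
The mapping between the pseudo root and a host user is expressed as ranges in `/proc/<pid>/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The following Python sketch translates IDs using that format. The example map content and the helper names are assumptions made up for illustration; the actual ranges on a given host depend on its configuration.

```python
from typing import NamedTuple, Optional


class IdRange(NamedTuple):
    inside: int   # first ID as seen inside the user namespace
    outside: int  # first ID as seen on the host
    count: int    # number of consecutive IDs in the range


def parse_id_map(text: str) -> list[IdRange]:
    """Parse text in the uid_map / gid_map format (three integers per line)."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            ranges.append(IdRange(inside, outside, count))
    return ranges


def to_host_id(ranges: list[IdRange], container_id: int) -> Optional[int]:
    """Translate an in-container ID to the host ID, or None if it is unmapped."""
    for r in ranges:
        if r.inside <= container_id < r.inside + r.count:
            return r.outside + (container_id - r.inside)
    return None


# Hypothetical map: container IDs 0..65535 map to host IDs 100000..165535.
EXAMPLE_UID_MAP = "0 100000 65536\n"
ranges = parse_id_map(EXAMPLE_UID_MAP)

print("container root is host UID", to_host_id(ranges, 0))
print("container UID 33 is host UID", to_host_id(ranges, 33))
print("container UID 70000 is", to_host_id(ranges, 70000))
```

With such a map, "root" in the container is just an ordinary unprivileged UID from the host's point of view.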

Other namespaces:

IPC namespaces: Isolate System V inter-process communication objects, etc.
UTS namespaces: Isolate the hostname. "UTS" (Unix Time Sharing system) seems a misnomer for this namespace.
(Optional) Cgroup namespaces: Isolate the `/sys/fs/cgroup` hierarchy.
(Optional) Time namespaces: Isolate clocks. Not used by most containers yet.
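
On Linux, the namespaces a process belongs to are visible as symbolic links under `/proc/<pid>/ns`; two processes share a namespace when the corresponding links show the same inode number. The Python sketch below lists them for the current process. It is only an illustration: the function name is made up, and on systems without a Linux-style `/proc` it simply returns an empty result.

```python
import os


def current_namespaces(pid: str = "self") -> dict[str, str]:
    """Return {namespace name: link target} for a process, e.g. {"mnt": "mnt:[...]"}."""
    ns_dir = f"/proc/{pid}/ns"
    result: dict[str, str] = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, or /proc is not mounted
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # skip entries that cannot be read
    return result


for name, ident in current_namespaces().items():
    print(f"{name:20} {ident}")
```

Running this on the host and again inside a container is a quick way to check which namespaces differ between the two.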

Cgroups

Cgroups (control groups) impose several resource quotas such as CPU usage, memory usage, block I/O, and number of processes in a container.

Cgroups also control accesses to device nodes. The default configuration of Docker allows unlimited accesses to `/dev/null`, `/dev/zero`, `/dev/urandom`, etc., and disallows accesses to `/dev/sda` (disk devices), `/dev/mem` (memory), etc.

Capabilities

On Linux, the root privilege is represented by a 64-bit capability flag set. 41 bits are in use today.

The default configuration of Docker drops system-wide administration capabilities such as `CAP_SYS_ADMIN`.

The retained capabilities include:

`CAP_CHOWN`: for running `chown` inside containers.
`CAP_NET_BIND_SERVICE`: for binding TCP and UDP ports beneath 1024 inside containers.
`CAP_NET_RAW`: for running legacy `ping` implementations that need to craft raw Ethernet packets. This capability is quite dangerous, as it allows ARP spoofing and DNS spoofing in the container's network. A future version of Docker may disable it by default.
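
Since capabilities are just bits in a flag set, checking whether a process holds one is a bit test. The Python sketch below models this with the bit numbers of the four capabilities named in this section, as defined in the kernel's `linux/capability.h`. The helper names are made up for this example, and the mask covers only the three retained capabilities listed above, not Docker's full default set.

```python
# Bit positions from linux/capability.h for the capabilities named in this section.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def to_mask(names: list[str]) -> int:
    """Combine capability names into a single integer bit mask."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_BITS[name]
    return mask


def has_cap(mask: int, name: str) -> bool:
    """Check whether the bit for `name` is set in `mask`."""
    return bool(mask & (1 << CAP_BITS[name]))


# Illustrative subset only: the three retained capabilities listed above.
retained = to_mask(["CAP_CHOWN", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW"])

print(hex(retained))
print("CAP_NET_RAW:", has_cap(retained, "CAP_NET_RAW"))
print("CAP_SYS_ADMIN:", has_cap(retained, "CAP_SYS_ADMIN"))
```

The effective mask of a real process can be read from the `CapEff` line of `/proc/<pid>/status` and decoded the same way.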

(Optional) Seccomp

Seccomp (Secure computing) allows specifying an explicit allowlist (or a denylist) of syscalls. The default configuration of Docker allows about 350 syscalls.

Seccomp is used for defense in depth; it is not a hard requirement for containers. For the sake of backward compatibility, Kubernetes still does not use seccomp by default, and it will probably not change the default configuration in the foreseeable future. Users can still opt in to seccomp via `KubeletConfiguration`.
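
The allowlist idea can be modeled in a few lines. The Python sketch below evaluates a toy profile shaped like Docker's seccomp JSON (`defaultAction` plus a list of `syscalls` rules with `names` and `action`). It is a simplified assumption for illustration: the syscall list is tiny and arbitrary, the `decide` function is hypothetical, and real profiles also carry argument filters and architecture settings that are ignored here. The actual filtering is done by the kernel, not by code like this.

```python
# Toy profile: deny by default, allow a handful of syscalls (arbitrary selection).
TOY_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}


def decide(profile: dict, syscall: str) -> str:
    """Return the action of the first rule naming `syscall`, else the default action."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


for name in ["read", "openat", "reboot"]:
    print(f"{name:8} -> {decide(TOY_PROFILE, name)}")
```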

(Optional) AppArmor XOR SELinux

AppArmor and SELinux (Security Enhanced Linux) are LSMs (Linux Security Modules) that provide further fine-grained configuration knobs.

These are mutually exclusive; one is chosen by host OS distributors (not by container image distributors):

AppArmor: chosen by Debian, Ubuntu, SUSE, etc.
SELinux: chosen by Fedora, Red Hat Enterprise Linux, and similar host OS distributions.

Docker's default AppArmor profile largely overlaps with its default configuration for capabilities, mount masks, etc., for the sake of defense in depth. Users may add custom settings for further security.

But the story is different for SELinux. To run containers in the `selinux-enabled` mode, you have to append the option `:z` (lowercase) or `:Z` (uppercase) to a bind mount, or run complex `chcon` commands by yourself to avoid permission errors.

The `:z` (lowercase) option is used for Type Enforcement (Slide 32). Type Enforcement protects host files from containers by assigning "types" to processes and files. A process running with the `container_t` type can read files with the `container_share_t` type, and read/write files with the `container_file_t` type, but it can't access files with other types.

The `:Z` (uppercase) option is used for Multi-category Security (Slide 33). Multi-category Security protects a container from another container by assigning category numbers to processes and files. For example, a process with Category 42 can't access files labeled with Category 43.
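
The two checks can be combined into a toy decision function. The Python sketch below is a deliberately simplified model of the rules described above, not the SELinux policy engine: the rule table contains only the three types named in the text, the `allowed` function is hypothetical, and the category check assumes the simple rule that a process may only touch files whose categories are a subset of its own.

```python
# Type Enforcement rules from the text: (process type, file type) -> permitted operations.
TE_RULES = {
    ("container_t", "container_share_t"): {"read"},
    ("container_t", "container_file_t"): {"read", "write"},
}


def allowed(proc_type: str, proc_cats: set[int],
            file_type: str, file_cats: set[int], op: str) -> bool:
    """Toy check: both Type Enforcement and Multi-category Security must pass."""
    te_ok = op in TE_RULES.get((proc_type, file_type), set())
    mcs_ok = file_cats <= proc_cats  # file categories must be a subset of the process's
    return te_ok and mcs_ok


container = ("container_t", {42})

# A private volume labeled with the container's own category: read/write works.
print(allowed(*container, "container_file_t", {42}, "write"))
# The same file type, but labeled for another container (Category 43): denied.
print(allowed(*container, "container_file_t", {43}, "read"))
# A shared read-only file without categories: read works, write does not.
print(allowed(*container, "container_share_t", set(), "read"))
print(allowed(*container, "container_share_t", set(), "write"))
# A host file with an unrelated type (hypothetical example type): denied.
print(allowed(*container, "etc_t", set(), "read"))
```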

What about Docker for Mac/Win?

Docker Desktop products support running Linux containers on Mac and Windows, but they are just running a Linux virtual machine under the hood to run containers on it. The containers are not directly running on macOS and Windows.

3. Latest trends in container runtimes

Alternatives to Docker (as Kubernetes runtimes)

The first version of Kubernetes (2014) was made solely for Docker (Slide 37). Kubernetes v1.3 (2016) added interim support for an alternative container runtime called rkt, but rkt was retired in 2019. The effort to support alternative container runtimes yielded the Container Runtime Interface (CRI) API in Kubernetes v1.5 (2016). After the debut of CRI, the industry converged on two alternative runtimes: containerd (/container-dee/) and CRI-O (/cry-oh/, /cree-oh/, or /see-er-eye-oh/).

Kubernetes still had built-in support for Docker (Slide 38), but it was finally removed in Kubernetes v1.24 (2022). Docker continues to work with Kubernetes as a third-party runtime (via the `cri-dockerd` shim), but Docker is now seeing less adoption for Kubernetes.

The big names in the industry have already switched away from Docker to containerd or CRI-O:

Adopters of containerd: Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), k3s, ... (many)
Adopters of CRI-O: Red Hat OpenShift, Oracle Container Engine for Kubernetes (OKE), ...

containerd focuses on extensibility and supports non-Kubernetes workloads as well as Kubernetes workloads. In contrast, CRI-O focuses on simplicity and solely supports Kubernetes.

Alternatives to Docker (as CLI)

While Kubernetes has become the standard for multi-node production clusters, users still want a Docker-like CLI for building and testing containers locally on their laptops. Docker basically satisfies this demand, but runtime developers in the community wanted to build their own "lab" CLIs to incubate new features ahead of Docker and Kubernetes, as it was often hard to propose new features to Docker and Kubernetes, for several technical/technological reasons.

Podman (formerly called kpod in 2016) is a Docker-compatible standalone container engine created by Red Hat and others. Its main difference from Docker is that it does not have the daemon process by default. Also, Podman is unique in the sense that it provides first-class support for managing Pods (groups of containers that share the same network namespace and often data volumes on the same host for efficient communication) as well as containers. However, most users seem to just use Podman for non-pod containers.

nerdctl (/nerd-see-tee-el/, founded by myself in 2020) is a Docker-compatible CLI for containerd (/container-dee/). nerdctl was originally made for experimenting with new features such as lazy-pulling (discussed later), but it is also useful for debugging Kubernetes nodes that are running containerd.

See also my blog article "Released nerdctl v1.0" (October 2022) for further information:

Released nerdctl v1.0

Running containers on Mac

Docker Desktop products for Mac and Windows are proprietary. Windows users can just run the Linux version of Docker (Apache License 2.0, no GUI) in WSL2, but there was no equivalent for Mac users so far.

Lima (/lee-mah/, founded by myself too in 2021) is a command line tool to create a WSL2-like environment on macOS for running containers. Lima uses nerdctl by default, but it supports Docker and Podman too.

See also my blog article "Lima is now a CNCF project" (October 2022).

Lima is now a CNCF project 🎉

Lima is also adopted by third party projects such as colima (2021), Rancher Desktop (2021), and Finch (2022).

The Podman community released Podman Machine (command line tool, 2021) and Podman Desktop (GUI, 2022) as alternatives to Docker Desktop. Podman Desktop optionally supports Lima too.

Docker being refactored

containerd mainly provides two subsystems: the runtime subsystem and the image subsystem. However, the latter one is not used by Docker. This is problematic because Docker's own legacy image subsystem is far behind containerd's modern image subsystem (and it caused me to launch the nerdctl project):

No support for lazy-pulling (on-demand image pulling)
Limited support for multi-platform images (e.g., AMD64/ARM64 dual-platform images)
Limited compliance with the OCI Image Spec

This long-standing problem is finally being resolved. Docker v24 (2023) added experimental support for using containerd's image subsystem via an undocumented option (subject to change) in `/etc/docker/daemon.json`:

{"features":{"containerd-snapshotter": true}}

A future version of Docker (2024? 2025?) is likely to use containerd's image subsystem by default.

Lazy-pulling

Most files in container images are never used:

“pulling packages accounts for 76% of container start time, but only 6.4% of that data is read”
From “Slacker: Fast Distribution with Lazy Docker Containers” (Harter, et al., FAST 2016)

"Lazy-pulling" is a technique to reduce container startup time by pulling partial image contents on demand. This is not possible with OCI-standard tar.gz images, as they do not support `seek()` operations. Several alternative formats are being proposed to support lazy-pulling:

eStargz (2019): Optimizes gzip granularity for seek()-ability; Forward compatible with OCI v1 tar.gz.
SOCI (2022): Captures a checkpoint of tar.gz decoder state; Forward compatible with OCI v1 tar.gz.
Nydus (2022): An alternate image format; Not compatible with OCI v1 tar.gz.
OverlayBD (2021): Block devices as container images; Not compatible with OCI v1 tar.gz.
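
The idea of making gzip "seekable" by changing its granularity can be demonstrated with the standard library. In the Python sketch below, each file is compressed as its own gzip member and the members are concatenated, so the whole blob is still a valid gzip stream, yet a single file can be decompressed from just its byte range (which, against a registry, would correspond to a ranged read). This is a toy illustration of the principle only, not the actual eStargz layout: the index structure and function names are assumptions, and the tar framing is omitted.

```python
import gzip


def pack(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Compress each file as a separate gzip member; return the blob and an offset index."""
    blob = bytearray()
    index: dict[str, tuple[int, int]] = {}
    for name, data in files.items():
        member = gzip.compress(data, mtime=0)  # mtime=0 keeps the output deterministic
        index[name] = (len(blob), len(member))  # (offset, compressed size)
        blob += member
    return bytes(blob), index


def read_one(blob: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Decompress a single file by touching only its own byte range."""
    offset, size = index[name]
    return gzip.decompress(blob[offset:offset + size])


files = {
    "bin/app": b"\x7fELF" + b"\x00" * 1000,
    "usr/share/doc/README": b"rarely read\n" * 100,
    "etc/app.conf": b"listen = 8080\n",
}
blob, index = pack(files)

# Random access: only the bytes of one member are needed.
print(read_one(blob, index, "etc/app.conf"))

# Compatibility: an ordinary gzip reader still sees one continuous stream.
print(gzip.decompress(blob) == b"".join(files.values()))
```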

Slide 51 shows a benchmark result of eStargz. Lazy-pulling (+additional optimizations) can reduce the container startup time to 1/9.

See also articles from my colleague Kohei Tokunaga:

Expanding adoption of User namespaces

User namespaces are still rarely used in the Docker and Kubernetes ecosystem, although Docker has supported them since v1.9 (2015).

One of the reasons is the complexity and overhead of “chowning” the container rootfs for a pseudo root. Linux kernel v5.12 (2021) added “idmapped mounts” to eliminate the need for chowning. Support for this is planned for runc v1.2.

After the release of runc v1.2, user namespaces are expected to become more popular for Docker and for Kubernetes, which added preliminary support for user namespaces only in v1.25 (2022). For compatibility's sake, it is unlikely that Kubernetes will ever enable user namespaces by default. However, Docker may still enable user namespaces by default in the future. Nothing is decided yet, though.

Rootless containers

Rootless containers is a technique to put container runtimes, as well as containers, in a user namespace that is created by a non-root user to mitigate potential vulnerabilities of runtimes.

Even if a container runtime has a bug that allows an attacker to escape from a container, the attacker can't gain privileged access to other users' files, the kernel, firmware, or devices.

Here is a brief history of rootless containers:

2014: LXC v1.0 introduced support for rootless containers. At that time, rootless containers were called "unprivileged containers". LXC's unprivileged containers are slightly different from modern rootless containers, as they require a SETUID binary for bringing up networks.
2017: runc v1.0-rc4 gained initial support for rootless containers
2018: Work began on supporting rootless containers in containerd, BuildKit (backend of `docker build`), Docker, Podman, etc. slirp4netns (Slide 56) was created (by myself) to allow SETUID-less networking by translating Ethernet packets to unprivileged socket syscalls.
2019: Docker v19.03 was released with experimental support for rootless containers. Podman v1.1 was also released with the same feature in this year, slightly ahead of Docker v19.03.
2020: Docker v20.10 was released with general availability of rootless containers.

From 2020 to 2022, we also worked on bypass4netns (Slide 57) to eliminate the overhead of slirp4netns, by hooking socket file descriptors inside a container and reconstructing them outside the container. The achieved throughput is even higher than that of "rootful" containers.

Rootless containers have successfully gained popularity, but there has also been criticism of rootless containers. In particular, it is controversial whether non-root users should be allowed to create the user namespaces that are required for running rootless containers. I'd answer yes for container users, because rootless containers are at least much safer than running everything as the root. However, I'd rather answer no for those who don't use containers, because user namespaces can also be an attack surface, e.g., CVE-2023-32233: "Privilege escalation in Linux Kernel due to a Netfilter nf_tables vulnerability".

The community has already been seeking remedies for this dilemma. Ubuntu (since 13.10) and Debian provide a sysctl knob `kernel.unprivileged_userns_clone=` to specify whether to allow or disallow creating unprivileged user namespaces. However, their patch is not merged in the upstream Linux kernel.

Instead, the upstream kernel introduced a new LSM (Linux Security Module) hook `userns_create` in Linux v6.1 (2022) so that an LSM can dynamically decide whether to allow or disallow creating a user namespace. This hook is callable from eBPF (`bpf_program__attach_lsm()`), so it is expected that there will be a fine-grained and non-distribution-specific knob that depends on neither AppArmor nor SELinux. However, userspace utilities for eBPF + LSM are not yet mature enough to provide a good user experience for this.

More LSMs

Landlock LSM was merged into Linux v5.13 (2021). Landlock is similar to AppArmor in the sense that it restricts file accesses by paths (`LANDLOCK_ACCESS_FS_EXECUTE`, `LANDLOCK_ACCESS_FS_READ_FILE`, etc.), but Landlock does not require the root privilege for setting up a new profile. Landlock is also very similar to OpenBSD's `pledge(2)`.

Landlock is still not supported by the OCI Runtime Spec, but I guess it can be included in the OCI Runtime Spec v1.2.

Kata Containers

As I mentioned in Section 1, "containers" is not a well-defined term. Anything can be called "containers" when it provides good compatibility with the existing container ecosystem.

Kata Containers (2017) are one such kind of "containers" that are not actually containers in the narrower sense. Kata Containers are actually virtual machines, but with support for the OCI Runtime Spec. Kata Containers are much more secure than runc containers; however, they have performance drawbacks and they do not work well on typical non-baremetal IaaS instances that do not support nested virtualization.

Kata Containers works as a containerd runtime plugin, and receives the same images and runtime configurations as runc containers. Its user experience is almost indistinguishable from that of runc containers.

gVisor

gVisor (2018) is yet another exotic container runtime. gVisor traps syscalls and executes them in a Linux-compatible usermode kernel to mitigate attacks. gVisor currently has three modes for trapping syscalls:

KVM mode: rarely used, but the best option for bare-metal hosts
ptrace mode: the most common option but slow
SIGSYS trap mode (since 2023): expected to replace ptrace mode eventually

gVisor has been used in several Google products, including Google Cloud Run. However, Google Cloud Run switched away from gVisor to microVMs in 2023:

“This means that software that previously didn’t run in Cloud Run due to unimplemented system call issues can now run in Cloud Run’s second-generation execution environment.”
From https://cloud.google.com/blog/products/serverless/cloud-run-jobs-and-second-generation-execution-environment-ga/?hl=en

This implies that gVisor's performance and compatibility issues are not negligible for their business.

WebAssembly

WebAssembly (WASM) is a platform-independent bytecode format that was originally designed for Web browsers in 2015. WebAssembly is somewhat similar to Java applets (1995), but it puts more focus on portability and security. One interesting aspect of WebAssembly is that it separates the code address space from the data address space; there are no instructions like `JMP <immediate>` and `JMP *<reg>`. It only supports jumping to labels that are resolved at compile time. This design reduces arbitrary code execution bugs, although it also makes it harder to JIT-compile other bytecode formats into WebAssembly.
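The label-based control flow can be seen directly in the binary format. The following Python sketch hand-assembles a tiny module (the export name `f` and the helper `section` are arbitrary choices for this example) whose single function returns 42 by branching out of a block. Note that the operand of `br` is a label depth (0 = the innermost enclosing block), not a code address.

```python
def section(section_id: int, payload: bytes) -> bytes:
    """Encode a module section: id, size, payload.

    Sizes are LEB128-encoded; payloads under 128 bytes fit in a single byte,
    which is all this sketch needs.
    """
    assert len(payload) < 128
    return bytes([section_id, len(payload)]) + payload


I32 = 0x7F  # value type i32

# Function body, equivalent to the text format:
#   (block (result i32) (i32.const 42) (br 0))
body = bytes([
    0x00,        # no local variable declarations
    0x02, I32,   # block (result i32)  <- introduces label 0
    0x41, 42,    #   i32.const 42
    0x0C, 0x00,  #   br 0              <- jump to the end of label 0
    0x0B,        # end of block
    0x0B,        # end of function
])

module = (
    b"\0asm" + bytes([1, 0, 0, 0])                      # magic + version 1
    + section(1, bytes([1, 0x60, 0, 1, I32]))           # type 0: () -> i32
    + section(3, bytes([1, 0]))                         # function 0 uses type 0
    + section(7, bytes([1, 1]) + b"f" + bytes([0, 0]))  # export "f" = function 0
    + section(10, bytes([1, len(body)]) + body)         # code for function 0
)

if __name__ == "__main__":
    print(module.hex(" "))
```

Writing `module` to a `.wasm` file and loading it in a WebAssembly runtime should expose a function `f` that returns 42; that step is left out here because it needs a third-party runtime.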

WebAssembly is also in the spotlight as a potential alternative to containers. For running WebAssembly outside browsers, WASI (WebAssembly System Interface) was proposed in 2019 to provide a low-level API (e.g., `fd_read()`, `fd_write()`, `sock_recv()`, `sock_send()`) that can be used for implementing POSIX-like layers on top of it. containerd added the "runWASI" plugin in 2022 to treat WASI workloads as containers.

In 2023, WASIX was proposed to extend WASI to provide more convenient (and somewhat controversial) functions:

Threads: `thread_spawn()`, `thread_join()`, ...
Processes: `proc_fork()`, `proc_exec()`, ...
Sockets: `sock_listen()`, `sock_connect()`, ...

Eventually, these efforts may replace a large portion of containers, though not all of them. Solomon Hykes, the founder of Docker, says that "If WASM+WASI existed in 2008, we wouldn’t have needed to created Docker":

Solomon Hykes / @shykes@hachyderm.io on Twitter: "If WASM+WASI existed in 2008, we wouldn't have needed to created Docker. That's how important it is. Webassembly on the server is the future of computing. A standardized system interface was the missing link. Let's hope WASI is up to the task! https://t.co/wnXQg4kwa4 / Twitter"

Recap

Containers are more efficient, but often less secure, than virtual machines. Many security technologies are being introduced to harden containers (user namespaces, rootless containers, Linux security modules, ...).
Alternatives to Docker are emerging (containerd, CRI-O, Podman, nerdctl, Finch, ...), but Docker isn’t fading out.
“Non-container” containers are a trend too (Kata: VM-based, gVisor: user-mode kernel, runWASI: WebAssembly, ...).

Slide 71 shows the landscape of the well-known runtimes.

See also the rest of the slides for further topics that could not be covered in the talk.

NTT is hiring!

We at NTT have been proudly leading the trends of containers and other open source software. Visit https://www.rd.ntt/e/sic/recruit/ to see how to join us.

(Translated from Japanese) We at NTT take pride in leading the trends in containers and other OSS. Please see our recruitment page: https://www.rd.ntt/sic/recruit/

The internals and the latest trends of container runtimes (2023) was originally published in nttlabs on Medium, where people are continuing the conversation by highlighting and responding to this story.