On The Way To LXC 3.0: Moving The Cgroup Pam Module Into The LXC Tree (Including A Detour About Fully Unprivileged Containers)

Hey everyone,
This is another update about the development of LXC 3.0.
A few days ago the pam_cgfs.so pam module was moved out of the LXCFS
tree and into the LXC tree.
This means LXC 3.0 will ship with pam_cgfs.so included. Whether the pam
module gets built is controlled by the configure flags --enable-pam and
--disable-pam. By default pam_cgfs.so is disabled. Distros that are
currently shipping pam_cgfs.so through LXCFS should adapt their packaging
accordingly and pass --enable-pam during the configure stage of LXC.
What’s That pam_cgfs.so Pam Module Again?
Let’s take a short detour (“short” cough cough). LXC has supported fully
unprivileged containers since 2013 when user namespace support was merged
into the kernel. (/me tips hat to Serge Hallyn and Eric Biederman). Fully
unprivileged containers are containers using user namespaces and idmappings
which are run by normal (non-root) users. But let’s not talk about this, let’s
show it. The first asciicast shows a fully unprivileged system container
running with a rather complex idmapping in a new user namespace:
The second asciicast shows a fully unprivileged application container running
without a mapping for root inside the container. In fact, it runs with just
a single idmap that maps my own host uid 1000 and host gid 1000 to
container uid 1000 and container gid 1000. Something which I can do
without requiring any privilege at all. We’ve been doing this a long time at
LXC:
As you can see, no non-standard privileges are used when setting up and running
such containers. In fact, you could remove even the standard privileges all
unprivileged users have available through standard system tools like
newuidmap and newgidmap to set up idmappings (this is what you see in the
second asciicast). But this comes at a price, namely that cgroup management is
not available for fully unprivileged containers. But we at LXC want you to
be able to restrict the containers you run in the same way that the system
administrator wants to restrict unprivileged users themselves. This is just
good practice to prevent excessive resource consumption. What this means is
that you should be free to delegate resources that you have been given by the
system administrator to containers. This allows you, for example, to limit the
cpu usage of the container, or the number of processes it is allowed to spawn,
or the memory it is allowed to consume. But unprivileged cgroup management is
not easily possible with most init systems. That’s why the LXC team came up
with pam_cgfs.so a long time ago to make things easier. In essence, the
pam_cgfs.so pam module takes care of placing unprivileged users into writable
cgroups at login. The cgroups that are supposed to be writable can be specified
in the corresponding pam configuration file for your distro (probably something
under /etc/pam.d). For example, if you wanted your user to be placed into
a writable cgroup for all enabled cgroup hierarchies you could specify all:
session optional pam_cgfs.so -c all
If you only want your user to be placed into writable cgroups for the
freezer, memory, unified and the named systemd hierarchy you would
specify:
session optional pam_cgfs.so -c freezer,memory,name=systemd,unified
This would lead pam_cgfs.so to create a common cgroup named user and also
create a cgroup just for my own user in there. For example, my user is called
chb. This would cause pam_cgfs.so to create the cgroup
/sys/fs/cgroup/freezer/user/chb/0 inside the freezer hierarchy. If
pam_cgfs.so finds that your init system has already placed your user inside
a session-specific cgroup, it will be smart enough to detect it and re-use that
cgroup. This is the case, for example, for the named systemd cgroup hierarchy.
chb@conventiont|~
> cat /proc/self/cgroup
12:hugetlb:/
11:devices:/user.slice
10:memory:/user.slice
9:perf_event:/
8:net_cls,net_prio:/
7:cpu,cpuacct:/user.slice
6:rdma:/
5:pids:/user.slice/user-1000.slice/session-1.scope
4:cpuset:/
3:blkio:/user.slice
2:freezer:/user/chb/0
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
0::/user.slice/user-1000.slice/session-1.scope
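If you want to check this from a script rather than by eye, the listing above is easy to parse. What follows is only a rough sketch, not anything that ships with LXC: the names CgroupEntry, parse_proc_cgroup and in_pam_cgfs_cgroup are made up for illustration, and the check assumes the /user/<name>/<index> layout shown in the output above.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]  # empty list for the unified (cgroup v2) hierarchy
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup into structured entries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        # Split at most twice so a colon inside the path survives.
        hier_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append(CgroupEntry(int(hier_id), names, path))
    return entries


def in_pam_cgfs_cgroup(entry: CgroupEntry, user: str) -> bool:
    """Guess whether an entry points at a per-user cgroup made by pam_cgfs.so.

    Assumption: the cgroup follows the /user/<name>/<index> layout seen
    in the listing above. This is a heuristic, not an official interface.
    """
    return entry.path.startswith(f"/user/{user}/")


# A few lines taken from the listing above.
SAMPLE = """\
2:freezer:/user/chb/0
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
0::/user.slice/user-1000.slice/session-1.scope
"""

for entry in parse_proc_cgroup(SAMPLE):
    label = ",".join(entry.controllers) or "unified"
    print(label, entry.path, in_pam_cgfs_cgroup(entry, "chb"))
```

On a real system you would pass the contents of /proc/self/cgroup to parse_proc_cgroup instead of SAMPLE. With the sample above only the freezer entry matches, since the systemd and unified entries point at the session scope that the init system already set up.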
Christian

