Containers from first principles

0xbadcafebee · on June 5, 2020

> Each container should have at least the following isolated: network stack, filesystem, processes

I'm going to go out on a limb here and say that in 99% of cases, there is no benefit to the network isolation. It adds unnecessary overhead and complexity, all so we don't have to configure individual services with a unique listening port. But the host that exposes the virtual networks still has to route something to them, so you still need to assign an arbitrary port and then do some port forwarding.

Going even further in the unnecessary abstractions category is process isolation. In most cases we don't need that either. From a security perspective, I trust the Linux kernel about as far as I can throw it, so I don't care what guarantees there supposedly are, container breakout and local privesc are (in my opinion) a near-certainty. So why are we forcing ourselves to jump through tons of hoops just to heap dump or ptrace() an application? The regular-old security mechanisms in Linux are enough for every other process on the system, why not container processes?

Filesystem abstraction (copy-on-write overlays and chroots) has been the killer feature of containers since day one. That is the one thing about containers that makes them useful: a reproducible application snapshot without dependency management hell. If we strip everything else about containers away, this is the one thing we need to keep the useful purpose of a container.

Docker threw in a lot of incredibly useful extra features, such as the Dockerfile (no more configuration management!), overlays, the build cache, layers, etc. Nobody would be using containers if all these features weren't present in one solution, and we all owe them a big debt of thanks. But if we really strip the container down to its essential useful element, it's basically just a wrapper around chroot().
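To make the "wrapper around chroot()" idea concrete, here is a minimal sketch in Python. It assumes an already-unpacked root filesystem directory and a caller with root privileges (CAP_SYS_CHROOT); the function name `run_in_chroot` is hypothetical, and none of Docker's layering, caching, or namespace setup is attempted.

```python
import os
import sys


def run_in_chroot(new_root, argv):
    """Fork, chroot the child into new_root, and exec argv there.

    Needs root (CAP_SYS_CHROOT). Returns the child's exit status,
    or 127 if the chroot or exec step failed.
    """
    pid = os.fork()
    if pid == 0:
        # Child: change the root directory, then chdir into it so the
        # working directory no longer points outside the new root.
        try:
            os.chroot(new_root)
            os.chdir("/")
            os.execvp(argv[0], argv)
        except OSError as exc:
            print(f"chroot/exec failed: {exc}", file=sys.stderr)
            os._exit(127)
    # Parent: wait for the child and translate its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Something like `run_in_chroot("/srv/rootfs", ["/bin/sh"])` would then start a shell that sees only that directory tree (the path is illustrative). Processes, network, and users remain shared with the host, which is exactly the trade-off being argued for here.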

Terretta · on June 5, 2020

> That is the one thing about containers that makes them useful: a reproducible application snapshot without dependency management hell.

Joe Stein, of Kafka renown, calls containers "21st century tarball".

// I realize I've mentioned this before, in 2017's "My VM is lighter and safer than your container": https://news.ycombinator.com/item?id=15614777

infogulch · on June 6, 2020

My favorite is "static linking for millennials"

barrkel · on June 5, 2020

If you're just using docker and running containers yourself, manually, or via simple daemon scripts, you're not far wrong.

If you're deploying to a cloud environment and you just want your stuff to run, have no restrictions around being on the same box or different boxes, and make your stuff find the things it needs to via service names, and have stateless things that auto-scale and load-balance transparently, then you need more isolation because you can't have an environment which delivers that kind of functionality without it.

jeffbee · on June 5, 2020

There are large-scale existence proofs of environments that do all this without any network, process, or filesystem isolation.

jayd16 · on June 5, 2020

Maybe, but network isolation is an easy way to solve it, and the existence of those environments shows it's an important thing to solve.

0xbadcafebee · on June 6, 2020

The thing it's solving is not needing to configure the software to bind on a particular port and IP at runtime. I will agree that this is the easiest possible solution from a usability standpoint. But this functionality adds a unique non-portable requirement to the host, in addition to making routing more difficult. An entrypoint could take care of making the port change at runtime, and then we wouldn't need the extra abstraction.
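A minimal sketch of the entrypoint idea, assuming the service can be told its listen address through environment variables. The names `LISTEN_HOST` and `LISTEN_PORT`, the defaults, and both function names are hypothetical, chosen only for illustration.

```python
import os
import socket


def resolve_bind_address(env, default_host="0.0.0.0", default_port=8080):
    """Pick the listen address from the environment at start-up.

    LISTEN_HOST and LISTEN_PORT are hypothetical variable names.
    """
    host = env.get("LISTEN_HOST", default_host)
    port = int(env.get("LISTEN_PORT", default_port))
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


def open_listener(env=os.environ):
    """Create a listening TCP socket on the address chosen at runtime."""
    host, port = resolve_bind_address(env)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

Whoever launches the process sets the variables, so the port is decided at deploy time without a separate network namespace. Something still has to hand out non-conflicting ports, though.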

One of the biggest problems of running containers in the cloud is that they're not natively rootless. That largely comes from all the extra layers and abstractions and functionality, which again are super useful, but not necessary. The cloud would be a lot easier to manage if we didn't make it so complex.

jayd16 · on June 6, 2020

I disagree. You'd still have to manage it, except instead of late binding and managing it in the image routing, you need to manage it in your apps instead.

Routing may be complex but IMO the system as a whole is a lot less complex.

And philosophically it's the same argument as for file system isolation. You could just do the hard work to make sure your software knows not to impact other apps running on the server...but that's the whole thing you want to avoid.

We use containers because it's hard to guess the environment they'll be run in, and we made the decision to jump through some hoops to be able to do this configuration at deploy time, not dev time.

barrkel · on June 5, 2020

Like Mesos, or YARN, they still package up an application for orchestration, and they're not as flexible.

jayd16 · on June 5, 2020

You're really underestimating the complexity the network isolation is solving for.

If you had dynamic port binding, you'd still need to solve for security around open ports, conflict management, and you're adding this complexity into apps of varying quality.

I don't know what you're using docker for but in most of my cases, a non-isolated network would require rewrites and a lot of extra management.
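For reference, "dynamic port binding" without a network namespace usually means asking the kernel for a free port by binding to port 0. A minimal sketch (the function name `bind_ephemeral` is made up for this example):

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind to port 0 so the kernel picks a free port.

    Returns the listening socket and the port number that was assigned.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]
```

This avoids conflicts on one host, but the assigned port is only known after start-up. It still has to be published to a registry or proxy and covered by firewall rules, which is the extra management being described above.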

nsajko · on June 5, 2020

> The regular-old security mechanisms in Linux

Do you mean setpriv (separation between users and user groups)? Tangentially related: https://news.ycombinator.com/item?id=23279068

JamesSwift · on June 5, 2020

I don't have a strong enough understanding of either Docker or the specific pieces you mention to form an opinion on your post. I would love if others could chime in to give their perspective on this position though.

emmelaich · on June 5, 2020

I mostly agree with peterwwillis, but I regard the security isolation not as prevention against malicious attacks, but as a convenience against accidental leakages.

Regarding network isolation, it's nice to not have to fiddle with e.g. changing ports in some bit of software to avoid conflicts.

Yes, there are individual ways to achieve most of what Docker does without wrapping it all up, but wrapping up all that functionality into one tool is where its value is.

Also, I love being able to create a container on my Mac laptop and then deploy that as a unit to production for repeatability's sake.

disqard · on June 5, 2020

I recently discovered systemd-nspawn and was amazed at how lightweight a basic container can be.

westurner · on June 5, 2020

"Docker Without Docker" (2015) explains /sbin/init and systemd-nspawn. Systemd did not exist when docker was first created. https://chimeracoder.github.io/docker-without-docker/

oso2k · on June 5, 2020

That's not what I remember. Wikipedia backs up that recollection as well [0], marking systemd's initial release in 2010. Docker's initial release is listed as 2013 [1]. Maybe that was true of the dotCloud internal releases, and certainly, not all of the EL and other Linux distros had adopted systemd during 2010 - 2013. Certainly after 2014 or 2015, systemd had spread to the major Linux distros so Docker could have chosen to take a systemd-based approach at that point.

[0] https://en.wikipedia.org/wiki/Systemd

[1] https://en.wikipedia.org/wiki/Docker_(software)

westurner · on June 6, 2020

Are there other systemd + containers solutions?

"Chapter 4. Running containers as Systemd services with Podman" https://access.redhat.com/documentation/en-us/red_hat_enterp...

AFAIU, when running containers with systemd:

- logs go to journald by default

- there's no docker-compose for just the [name-prefixed] containers in the docker-compose.yml

- you can use systemd unit template parametrization

- it's not as easy to collect metrics on every container on the system without a read-only docker socket: how many containers are running, how much RAM quota are they assigned and utilizing? What are the filesystem and port mappings?

- you can run containers as non-root

- you can run containers in systemd timer units

- you use runC to handle seccomp

... You can do cgroups and namespaces with just systemd; but keeping chroots/images upgraded is outside the scope of systemd: where is the ideal boundary between systemd and containers?

See this comment regarding per-container MAC MCS labels: https://news.ycombinator.com/item?id=23430959

There's much additional complexity that justifies k8s / OpenShift: when would I want to manage containers with just systemd units?

kcolford · on June 5, 2020

Why do we need to use mount and pivot_root when we have chroot available? Am I missing something here about why those can't be used?

setheron · on June 5, 2020

I'm pretty sure you can escape a chroot easily with relative paths.

In Linux it never surprises me that there are X ways to do Y. It's a side effect of the OSS ecosystem and of not wanting to break compatibility.

This guide was meant for newbies, so it doesn't broach these security concerns.

dataflow · on June 5, 2020

I was so confused why Fareed Zakaria would be talking about containers until I Googled and read the name more carefully...

chrisweekly · on June 5, 2020

Heh, apparently you're not the only one. Way down at the bottom of the linked page:

"I'm a software engineer, father and wishful amateur surfer. If you've come seeking my political views; you've found the wrong Fareed."

setheron · on June 5, 2020

(I'm the author)

It's a long running joke that I've come to terms with.

chrisweekly · on June 5, 2020

Favorited; thanks for this useful, well-written post!

setheron · on June 6, 2020

Thank you (author). This was a written version of a live tutorial I gave my peers.

It's always challenging to translate shell-focused teaching into prose, but I'm glad it struck a chord.

westurner · on June 5, 2020

> Many people might think the word “container” has a specific meaning within the Linux kernel; however the kernel has no notion of a “container”. The word has been synonymous with a variety of Linux tooling which when applied give the resemblance of what we expect a container to be.

Before LXC ( https://LinuxContainers.org ) and CNCF ( https://landscape.cncf.io/ ) and OCI ( https://opencontainers.org/ ), for shared-kernel VPS hosting ("virtual private server"; root on a shared box), there was OpenVZ (which requires a patched kernel and AFAIU still has features, like bursting, not present in cgroups).

Docker no longer has an LXC driver: libcontainer (opencontainers/runc) is the story now. The LXC docs have a great list of utilized kernel features that's also still true for docker-engine = runC + moby. The LXC docs: https://linuxcontainers.org/lxc/introduction/ :

> Current LXC uses the following kernel features to contain processes:

> ## Kernel namespaces (ipc, uts, mount, pid, network and user)

>> Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. https://en.wikipedia.org/wiki/Linux_namespaces
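One way to see which namespaces a process belongs to is to read the symlinks under /proc/<pid>/ns on a Linux host. Two processes share a namespace when the link targets match. A minimal sketch (the helper names are invented for this example):

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: identifier} for a process, read from /proc.

    Each entry in /proc/<pid>/ns is a symlink whose target looks like
    'net:[4026531992]'. Returns an empty dict where the directory is
    missing (non-Linux host, or no such process).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids
    for name in sorted(names):
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry vanished or is not readable
    return ids


def same_namespace(a, b, kind):
    """True if both mappings contain the same identifier for `kind`."""
    return kind in a and a.get(kind) == b.get(kind)
```

Comparing `namespace_ids()` with `namespace_ids(1)` from inside a container would typically show different `pid` and `net` entries than on the host, though reading another process's entries may need extra privileges.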

> ## Apparmor and SELinux profiles https://en.wikipedia.org/wiki/AppArmor / https://en.wikipedia.org/wiki/Security-Enhanced_Linux

udica is an interesting tool for creating SELinux policies for containers.

Is it possible for each container to run confined with a different SELinux label?

> ## Seccomp policies https://en.wikipedia.org/wiki/Seccomp

See below re: Seccomp.

> ## Chroots (using pivot_root) https://en.wikipedia.org/wiki/Chroot

Chroots and symlinks, Chroots and bind mounts, Chroots and overlay filesystems, Chroots and SELinux context labels.

FWIU, Chroots are a native feature of filesystem syscalls in Fuchsia.

> ## Kernel capabilities

https://wiki.archlinux.org/index.php/Capabilities :

>> "Capabilities (POSIX 1003.1e, capabilities(7)) provide fine-grained control over superuser permissions, allowing use of the root user to be avoided. Software developers are encouraged to replace uses of the powerful setuid attribute in a system binary with a more minimal set of capabilities. Many packages make use of capabilities, such as CAP_NET_RAW being used for the ping binary provided by iputils. This enables e.g. ping to be run by a normal user (as with the setuid method), while at the same time limiting the security consequences of a potential vulnerability in ping."
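As an illustration of how fine-grained these are, the effective capability set of a process shows up as a hex bitmask on the `CapEff:` line of /proc/<pid>/status. The sketch below decodes a handful of bits; the table is deliberately partial (bit numbers taken from linux/capability.h) and the function names are invented for this example.

```python
# Partial table of capability bit numbers; not the full list.
CAP_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask_hex):
    """Return the names of known capabilities set in a hex bitmask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


def effective_caps(status_text):
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split(":", 1)[1].strip())
    return []
```

A ping binary granted only CAP_NET_RAW would show a mask with just bit 13 set, i.e. `0000000000002000`.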

> ## CGroups (control groups) https://en.wikipedia.org/wiki/Cgroups

Control groups enable per-process (and thus per-container) resource quotas. Other than limiting the impact of resource exhaustion, cgroups are not a security feature of the Linux kernel.
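To see which cgroup a process has been placed in, /proc/<pid>/cgroup lists one `hierarchy-ID:controller-list:path` entry per line; on cgroup v2 there is a single `0::/path` line. A minimal parsing sketch follows. The helper names are invented, and `parse_memory_max` assumes the cgroup v2 `memory.max` file format (a byte count or the literal `max`).

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup text into (hierarchy_id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy), names, path))
    return entries


def parse_memory_max(text):
    """Interpret a cgroup v2 memory.max value; None means no limit."""
    value = text.strip()
    return None if value == "max" else int(value)
```

On a v2 host the quota for a given entry would then typically be found under /sys/fs/cgroup plus the parsed path, depending on how the hierarchy is mounted.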

Here's a helpful explainer of the differences between some of these kernel features; which, combined, have become somewhat ubiquitous:

From "Formally add support for SELinux" (k3s #1372) https://github.com/rancher/k3s/issues/1372#issuecomment-5817... :

> https://blog.openshift.com/securing-kubernetes/

>> The main thing to understand about SELinux integration with OpenShift is that, by default, OpenShift runs each container as a random uid and is isolated with SELinux MCS labels. The easiest way of thinking about MCS labels is they are a dynamic way of getting SELinux separation without having to create policy files and run restorecon.

>> If you are wondering why we need SELinux and namespaces at the same time, the way I view it is namespaces provide the nice abstraction but are not designed from a security first perspective. SELinux is the brick wall that’s going to stop you if you manage to break out of (accidentally or on purpose) from the namespace abstraction.

>> CGroups is the remaining piece of the puzzle. Its primary purpose isn’t security, but I list it because it regulates that different containers stay within their allotted space for compute resources (cpu, memory, I/O). So without cgroups, you can’t be confident your application won’t be stomped on by another application on the same node.

From Wikipedia: https://en.wikipedia.org/wiki/Seccomp :

> seccomp (short for secure computing mode) is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors. Should it attempt any other system calls, the kernel will terminate the process with SIGKILL or SIGSYS.[1][2] In this sense, it does not virtualize the system's resources but isolates the process from them entirely.
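The quoted passage describes strict mode. Container runtimes generally use the later filter mode (seccomp-bpf), which allows a configurable set of syscalls. Which mode a process is in can be read from the `Seccomp:` field of /proc/<pid>/status: 0 is disabled, 1 is strict, 2 is filter. A minimal sketch (the function name is invented for this example):

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text.

    Returns None when the field is absent (e.g. a kernel built
    without seccomp support).
    """
    for line in status_text.splitlines():
        # Matches 'Seccomp:' but not the separate 'Seccomp_filters:' field.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1])
            return SECCOMP_MODES.get(value, "unknown")
    return None
```

Feeding it the contents of /proc/self/status from inside a container started with a seccomp profile would be expected to report `filter`.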

... SELinux is one implementation of MAC (Mandatory Access Controls) that is built upon the LSM (Linux Security Modules) support in the Linux kernel. Some distros include policy sets for Docker hosts and lots of other packages that could be installed; see: "Formally add support for SELinux" (k3s #1372) https://github.com/rancher/k3s/issues/1372#issuecomment-5817...