> Each container should have at least the following isolated: network stack, filesystem, processes
I'm going to go out on a limb here and say that in 99% of cases, there is no benefit to the network isolation. It adds unnecessary overhead and complexity, all so we don't have to configure individual services with a unique listening port. But the host that exposes the virtual networks still has to route something to them, so you still need to assign an arbitrary port and then do some port forwarding.
Going even further in the unnecessary abstractions category is process isolation. In most cases we don't need that either. From a security perspective, I trust the Linux kernel about as far as I can throw it, so I don't care what guarantees there supposedly are, container breakout and local privesc are (in my opinion) a near-certainty. So why are we forcing ourselves to jump through tons of hoops just to heap dump or ptrace() an application? The regular-old security mechanisms in Linux are enough for every other process on the system, why not container processes?
Filesystem abstraction (copy-on-write overlays and chroots) has been the killer feature of containers since day one. That is the one thing about containers that makes them useful: a reproducible application snapshot without dependency management hell. If we strip everything else about containers away, this is the one thing we need to keep the useful purpose of a container.
Docker threw in a lot of incredibly useful extra features, such as the Dockerfile (no more configuration management!), overlays, build cache, layers, etc. Nobody would be using containers if all these features weren't present in one solution, and we all owe them a big debt of thanks. But if we really strip the container down to its essential useful element, it's basically just a wrapper around chroot().
If you're just using docker and running containers yourself, manually, or via simple daemon scripts, you're not far wrong.
If you're deploying to a cloud environment and you just want your stuff to run, with no restrictions around being on the same box or different boxes, with services finding the things they need via service names, and with stateless things that auto-scale and load-balance transparently, then you need more isolation, because you can't have an environment that delivers that kind of functionality without it.
The thing it's solving is not needing to configure the software to bind on a particular port and IP at runtime. I will agree that this is the easiest possible solution from a usability standpoint. But this functionality adds a unique non-portable requirement to the host, in addition to making routing more difficult. An entrypoint could take care of making the port change at runtime, and then we wouldn't need the extra abstraction.
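To make the entrypoint idea concrete, here is a minimal sketch in Python. It assumes the port arrives in an environment variable; the name `PORT`, the default of 8080, and the helper names `resolve_port` and `listen` are all made up for illustration and are not anything Docker defines.

```python
import os
import socket


def resolve_port(env=None, default=8080):
    """Pick the listening port at runtime from a (hypothetical) PORT variable."""
    env = os.environ if env is None else env
    port = int(env.get("PORT", str(default)))  # ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def listen(port, host="127.0.0.1"):
    """Bind and listen directly on the host's network stack, no port forwarding."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock
```

An entrypoint would call `listen(resolve_port())` before handing the socket to the application. Note that something outside the process still has to choose a value that does not collide with anything else on the host.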
One of the biggest problems of running containers in the cloud is that they're not natively rootless. That largely comes from all the extra layers and abstractions and functionality, which again are super useful, but not necessary. The cloud would be a lot easier to manage if we didn't make it so complex.
I disagree. You'd still have to manage it, except instead of late binding and managing it in the image routing, you need to manage it in your apps instead.
Routing may be complex but IMO the system as a whole is a lot less complex.
And philosophically it's the same argument as for filesystem isolation. You could just do the hard work to make sure your software knows not to impact other apps running on the server... but that's the whole thing you want to avoid.
We use containers because it's hard to guess the environment they'll be run in, and we made the decision to jump through some hoops to be able to do this configuration at deploy time, not dev time.
You're really underestimating the complexity the network isolation is solving for.
If you had dynamic port binding, you'd still need to solve for security around open ports, conflict management, and you're adding this complexity into apps of varying quality.
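A rough Python sketch of what that conflict management looks like on a shared network stack. The helper names are hypothetical; the only real mechanisms used are binding to port 0 (the kernel picks a free port) and the `EADDRINUSE` error you get when two processes want the same port.

```python
import errno
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Let the kernel choose a free port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]


def port_is_taken(port, host="127.0.0.1"):
    """Probe a port by trying to bind it; True if another socket holds it."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        probe.close()
    return False
```

The probe is inherently racy (another process can grab the port between the check and the real bind), and the chosen port still has to be advertised to whatever routes traffic to the app. Every app sharing the host would need logic like this.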
I don't know what you're using docker for but in most of my cases, a non-isolated network would require rewrites and a lot of extra management.
I don't have a strong enough understanding of either Docker or the specific pieces you mention to form an opinion on your post. I would love it if others could chime in to give their perspective on this position, though.
I mostly agree with peterwwillis, but I regard the security isolation as not prevention against malicious attacks, but a convenience against accidental leakages.
Regarding network isolation, it's nice to not have to fiddle with e.g. changing ports in some bit of software to avoid conflicts.
Yes, there are individual ways to achieve most of what Docker does without wrapping it all up, but wrapping all the functionality into one tool is where its value is.
Also, I love being able to create a container on my Mac laptop and then deploy that as a unit to production for repeatability's sake.
That's not what I remember. Wikipedia backs up that recollection as well [0], marking systemd's initial release in 2010. Docker's initial release is listed as 2013 [1]. Maybe that was true of the dotCloud internal releases, and certainly not all of the EL and other Linux distros had adopted systemd during 2010-2013. Certainly after 2014 or 2015, systemd had spread to the major Linux distros, so Docker could have chosen to take a systemd-based approach at that point.
- there's no docker-compose for just the [name-prefixed] containers in the docker-compose.yml,
- you can use systemd unit template parametrization
- it's not as easy to collect metrics on every container on the system without a read-only docker socket: how many containers are running, how much RAM quota are they assigned and utilizing? What are the filesystem and port mappings?
- you can run containers as non-root
- you can run containers in systemd timer units
- you use runC to handle seccomp
... You can do cgroups and namespaces with just systemd; but keeping chroots/images upgraded is outside the scope of systemd: where is the ideal boundary between systemd and containers?
> Many people might think the word “container” has a specific meaning within the Linux kernel; however the kernel has no notion of a “container”. The word has been synonymous with a variety of Linux tooling which when applied give the resemblance of what we expect a container to be.
Before LXC ( https://LinuxContainers.org ) and CNCF ( https://landscape.cncf.io/ ) and OCI ( https://opencontainers.org/ ), for shared-kernel VPS hosting ("virtual private server"; root on a shared box), there was OpenVZ (which requires a patched kernel and AFAIU still has features, like bursting, not present in cgroups).
Docker no longer has an LXC driver: libcontainer (opencontainers/runc) is the story now. The LXC docs have a great list of the kernel features they use, which still holds for docker-engine = runC + moby. The LXC docs: https://linuxcontainers.org/lxc/introduction/ :
> Current LXC uses the following kernel features to contain processes:
>> Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources. https://en.wikipedia.org/wiki/Linux_namespaces
>> "Capabilities (POSIX 1003.1e, capabilities(7)) provide fine-grained control over superuser permissions, allowing use of the root user to be avoided. Software developers are encouraged to replace uses of the powerful setuid attribute in a system binary with a more minimal set of capabilities. Many packages make use of capabilities, such as CAP_NET_RAW being used for the ping binary provided by iputils. This enables e.g. ping to be run by a normal user (as with the setuid method), while at the same time limiting the security consequences of a potential vulnerability in ping."
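As a small illustration of how fine-grained those capabilities are, here is a sketch in Python that decodes the hex capability mask the kernel exposes as `CapEff` in `/proc/<pid>/status`. The bit numbers come from `linux/capability.h`; the table is deliberately partial and the function names are made up for this example.

```python
# Partial table of capability bit numbers (from linux/capability.h).
CAP_NAMES = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Translate a hex capability mask into the names known to CAP_NAMES."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_NAMES.items()) if mask & (1 << bit)]


def effective_caps(status_path="/proc/self/status"):
    """Decode this process's effective set; None if the file is unavailable."""
    try:
        with open(status_path) as fh:
            for line in fh:
                if line.startswith("CapEff:"):
                    return decode_caps(line.split()[1])
    except OSError:
        pass
    return None
```

A process holding only `CAP_NET_RAW` (the ping case from the quote) would show a mask of `0000000000002000`, since bit 13 is the only one set.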
Control groups enable per-process (and thus per-container) resource quotas. Other than limiting the impact of resource exhaustion, cgroups are not a security feature of the Linux kernel.
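To see which control groups a process belongs to, you can read `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. A minimal Python sketch of parsing it follows; the function names are hypothetical and the sample paths in the usage note are invented.

```python
def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into its three fields."""
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        # cgroup v2 entries have an empty controller list ("0::/path").
        "controllers": controllers.split(",") if controllers else [],
        "path": path,
    }


def read_cgroups(path="/proc/self/cgroup"):
    """Parse every membership line; empty list if the file is unavailable."""
    try:
        with open(path) as fh:
            return [parse_cgroup_line(line) for line in fh if line.strip()]
    except OSError:
        return []
```

For example, a cgroup v2 line like `0::/system.slice/example.service` parses to hierarchy 0 with no named controllers, while a v1 line like `4:cpu,cpuacct:/docker/abc` lists the two controllers attached to that hierarchy.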
Here's a helpful explainer of the differences between some of these kernel features; which, combined, have become somewhat ubiquitous:
>> The main thing to understand about SELinux integration with OpenShift is that, by default, OpenShift runs each container as a random uid and is isolated with SELinux MCS labels. The easiest way of thinking about MCS labels is they are a dynamic way of getting SELinux separation without having to create policy files and run restorecon.
>> If you are wondering why we need SELinux and namespaces at the same time, the way I view it is namespaces provide the nice abstraction but are not designed from a security first perspective. SELinux is the brick wall that’s going to stop you if you manage to break out of (accidentally or on purpose) from the namespace abstraction.
>> CGroups is the remaining piece of the puzzle. Its primary purpose isn’t security, but I list it because it regulates that different containers stay within their allotted space for compute resources (cpu, memory, I/O). So without cgroups, you can’t be confident your application won’t be stomped on by another application on the same node.
> seccomp (short for secure computing mode) is a computer security facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors. Should it attempt any other system calls, the kernel will terminate the process with SIGKILL or SIGSYS.[1][2] In this sense, it does not virtualize the system's resources but isolates the process from them entirely.
... SELinux is one implementation of MAC (Mandatory Access Controls) that is built upon the LSM (Linux Security Modules) support in the Linux kernel. Some distros include policy sets for Docker hosts and lots of other packages that could be installed; see: "Formally add support for SELinux" (k3s #1372) https://github.com/rancher/k3s/issues/1372#issuecomment-5817...