Sigh. 1. I would hope the default seccomp policy blocks AF_ALG in these containe...

jeroenhd · 2026-05-05T08:09:49 1777968589

> I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.

I see a lot of projects blocking those sockets in containers as a response to this exploit, but it seems rather strange to me. We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use. It's not like we're mass-disabling kernel modules everywhere every time someone discovers an EoP bug, are we? Did we blacklist OpenSSL's binaries after Heartbleed?

I suppose it makes sense as a default on vulnerable kernels (though people running vulnerable kernels should put effort into patching rather than workarounds in my opinion), but these defaults are going to be around ten years from now when copy.fail is a distant memory.

throw0101a · 2026-05-05T11:25:30 1777980330

> We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use.

The need for this feature/functionality in the first place is questioned by some:

> As someone who works on the Linux kernel's cryptography code, the regularly occurring AF_ALG exploits are really frustrating. AF_ALG, which was added to the kernel many years ago without sufficient review, should not exist. It's very complex, and it exposes a massive attack surface to unprivileged userspace programs. And it's almost completely unnecessary, as userspace already has its own cryptography code to use. The kernel's cryptography code is just for in-kernel users (for example, dm-crypt).

> The algorithm being used in this [specific] exploit, "authencesn", is even an IPsec implementation detail, which never should have been exposed to userspace as a general-purpose en/decryption API. […]

* https://news.ycombinator.com/item?id=47952181#unv_47956312

staticassertion · 2026-05-05T12:14:52 1777983292

> a security bug in them that one time?

More than one time.

> a cryptographic performance enhancement feature

It's very rarely used.

> Did we blacklist OpenSSL's binaries after Heartbleed?

No, but lots of companies have since migrated away. OpenSSL was harder to move away from because there weren't as obvious drop-in replacements. Blocking a syscall that you never actually used is simple and effective.

e12e · 2026-05-05T09:53:36 1777974816

In fairness, after Heartbleed there was quite a push to move away from OpenSSL (to Google's BoringSSL, OpenBSD's LibreSSL, Mozilla's NSS, or GnuTLS), but the alternative here would be moving to a different kernel, like FreeBSD or OpenSolaris/illumos ...

PunchyHamster · 2026-05-05T09:57:33 1777975053

That's just moving to a kernel that had 1000x less eyes on it. Yeah, sure, it will have fewer exploits, but purely because nobody bothers to look when there are much juicier targets on Linux.

But I am disappointed that we still don't have a clear OpenSSL successor; there is nothing to be salvaged from this mess of a project.

DarkUranium · 2026-05-05T10:38:58 1777977538

1000x less eyes is true, but also: Linux, even in the kernel, has a long history of "move fast and break things".

Yes, the syscall API is (famously) stable, but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).

PunchyHamster · 2026-05-07T15:07:34 1778166454

The driver attitude in Linux could be summed up as "we'd rather have the hardware driver working than absent".

> but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).

Or vote with your wallet and get a device that has a well-supported card.

steve1977 · 2026-05-05T17:18:14 1778001494

Fewer eyes, but also fewer problems like "it's been fixed in the kernel but not in distro XYZ".

wahern · 2026-05-05T16:48:31 1777999711

If you're using a container as a sandbox, you should use a default-deny policy and allow only the facilities required by the container. In practice, though, containers are used to package a huge collection of software, most of which the container creator has no familiarity with and no ability to determine what runtime dependencies, beyond other package names, are required. This is one of the reasons why containers, generally speaking, don't offer reliable security. If you can't or won't carefully design your components to sandbox themselves (e.g. by using seccomp and landlock with policies tailored to the specific component), like Chrome or various OpenBSD daemons, then it's far better to use VMs for isolation; and if you do design your components that way, containers are superfluous from a security perspective.

nubinetwork · 2026-05-05T09:36:40 1777973800

> We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time?

To my knowledge, not many things were using the in-kernel code anyway; the recommended way is to use userland tools...

It's optional for openssl, systemd apparently needs it, but deleting the module from one of my systems didn't cause any issues. /shrug

PunchyHamster · 2026-05-05T09:50:37 1777974637

I haven't had it loaded on 100s of servers with kernel versions ranging from 5.10 to 6.14. Usage is just that low.
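For anyone who wants to run the same check on their own fleet, here is a rough sketch that scans `/proc/modules`-style text for the AF_ALG modules. The module name list (`af_alg` plus the `algif_*` front ends) and the helper name `loaded_af_alg_modules` are assumptions for illustration, and a kernel with the code built in rather than built as modules would not show up this way.

```python
# Module names assumed here: the AF_ALG core plus its algif_* front ends.
AF_ALG_MODULES = {"af_alg", "algif_hash", "algif_skcipher", "algif_aead", "algif_rng"}


def loaded_af_alg_modules(proc_modules_text: str) -> list[str]:
    """Return AF_ALG-related module names found in /proc/modules-style text."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] in AF_ALG_MODULES:
            found.append(fields[0])
    return sorted(found)
```

On a real host you would pass in the contents of `/proc/modules`; an empty list suggests nothing has pulled the modules in since boot.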

Retr0id · 2026-05-05T10:18:16 1777976296

iiuc the AF_ALG interface only offers real performance wins if you have specialized hardware that the kernel can offload computations to. If you're not using that hardware, there's little reason not to do the crypto in userspace.

hlieberman · 2026-05-05T04:45:21 1777956321

In fact, the authors specifically say on the very first line of their website that the copy/fail primitive can be used as a container escape. The entire premise of this article is flawed and irresponsible.

eqvinox · 2026-05-05T12:00:43 1777982443

AIUI they haven't shown a container escape and are just claiming it so far. Or did I miss something?

mjmas · 2026-05-05T12:41:23 1777984883

Having write access on anything you can read should be enough if libraries or binaries are shared (read-only) between the host and container.

eqvinox · 2026-05-05T14:36:33 1777991793

> if libraries or binaries are shared (read-only) between the host and container.

Yeah, exactly - that's a pretty big "if", and not how a lot of container automation does things. In particular you'd need to hit the base system, it's no help at all if some application files that the host does nothing with can be hit.

angry_octet · 2026-05-05T23:46:45 1778024805

It's not hard to see ways to escape the container with a cache write primitive. I suspect the copy.fail team have held back on releasing a PoC because of the disruption it could cause.

eqvinox · 2026-05-06T17:49:09 1778089749

It's not a cache write primitive though; it's a write-to-readable-mappings primitive. At least the way I understood it, you need to be able to get a (read) file descriptor to the target in order to throw it into the splice() syscall.

Now, there are some "funky" no-fs things that could be opened and are mmap'able/spliceable (some stuff in /proc/*, no idea what exactly though), but it's not immediately obvious to me how this is a generic container escape.

fguerraz · 2026-05-05T07:15:05 1777965305

I just contributed this [1] which does what you want for seccomp. Well, not by default, but profiling is now effective against this attack.

Oh, and this [2] just happened

[1] https://github.com/containers/oci-seccomp-bpf-hook/pull/209 [2] https://github.com/moby/moby/pull/52501

Jasper_ · 2026-05-05T17:33:45 1778002425

Blanket blocking socketcall() caused regressions for all 32-bit applications trying to make sockets. In theory, glibc disables socketcall when running on kernel version >= 4.3. In practice, Debian/Fedora/Ubuntu all set glibc's "expected kernel version" to 3.2, so socketcall() is still used on most 32-bit glibc binaries shipped.
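A toy model of why this is awkward, as a sketch only: a seccomp-BPF filter sees the syscall and its raw register values but cannot follow pointers, and `socketcall()` passes the real socket arguments behind a pointer. The `Call` type and the two filter functions are hypothetical names for illustration, not any runtime's actual policy.

```python
from dataclasses import dataclass

AF_INET = 2
AF_ALG = 38
SYS_SOCKET = 1  # socketcall() sub-call number for socket(), per <linux/net.h>
OPAQUE_ARGS_POINTER = 0x7FFF0000  # stand-in address; the filter never dereferences it


@dataclass(frozen=True)
class Call:
    name: str
    regs: tuple  # raw register values as the filter would see them


def direct_socket(family: int) -> Call:
    # Direct syscall: the address family sits in the first register.
    return Call("socket", (family, 1, 0))


def multiplexed_socket(family: int) -> Call:
    # socketcall(): the family lives in user memory behind the pointer,
    # so it does not appear in the registers at all.
    return Call("socketcall", (SYS_SOCKET, OPAQUE_ARGS_POINTER))


def narrow_filter(call: Call) -> str:
    # Deny only socket(AF_ALG, ...).
    if call.name == "socket" and call.regs[0] == AF_ALG:
        return "deny"
    return "allow"


def blanket_filter(call: Call) -> str:
    # Additionally deny every socketcall(), since its family is invisible.
    if call.name == "socketcall":
        return "deny"
    return narrow_filter(call)
```

In this model the narrow filter lets `multiplexed_socket(AF_ALG)` through, while the blanket filter also denies `multiplexed_socket(AF_INET)`, which is the regression described above.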

https://salsa.debian.org/glibc-team/glibc/-/blob/sid/debian/...

https://src.fedoraproject.org/rpms/glibc/blob/rawhide/f/glib...

fguerraz · 2026-05-05T18:52:32 1778007152

That’s… great. But who runs containerised 32 bit applications?

dwroberts · 2026-05-05T05:51:47 1777960307

There is an addendum at the bottom where they admit the page corruption is still problematic even with rootless podman.

Although using this to justify their migration to micro-VMs is very strange to me. Sure for this CVE it would have been better, but surely for a future attack it could hit a component shared across VMs but not containers? Are people really choosing technology based on CVE-of-the-week?

anygivnthursday · 2026-05-05T06:58:35 1777964315

Containers were never a security boundary. VMs have better isolation, which is why people choose them for security. Containers are convenience and usually have better performance.

dwroberts · 2026-05-05T08:00:17 1777968017

I see the ‘not a security boundary’ thing repeated constantly, and while it makes sense (eg. they’re sharing the underlying kernel or at least some access to it) if you think about it a little more, VMs are not magically different: they are better isolated, but VMs on the same host still share the host in common. A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers

throw0101c · 2026-05-05T12:44:08 1777985048

> […] VMs are not magically different: they are better isolated, but VMs on the same host still share the host in common.

VMs are not different due to 'magic' but through hardware assist with things like Intel VT-x and AMD-V:

* https://en.wikipedia.org/wiki/X86_virtualization#Hardware-as...

* https://blog.lyc8503.net/en/post/hypervisor-explore/

* https://binarydebt.wordpress.com/2018/10/14/intel-virtualisa...

amluto · 2026-05-05T17:01:49 1778000509

I disagree. VMs are better isolated to precisely the extent that (a) the attack surface is lower and (b) the implementation is simpler and thus less buggy.

Hardware virtualization has a strong effect on (b), but it’s not at all a foregone conclusion that it’s strictly in the direction of being more straightforward and thus more secure. And hardware features like fancy device passthrough encourage applications with a very, very large attack surface that has historically been full of holes.

necovek · 2026-05-05T08:22:35 1777969355

You are obviously right that these are similar in principle: a VM isolation exploit would lead to the same exposure as container-related isolation exploits.

VMs are considered vastly better because the surface area where exploits can happen is smaller and/or better isolated within the kernel.

If you are arguing the latter is not true (and we are all collectively hand-waving away a big chunk of the surface area, so that may be the case), it would help to be explicit about why you believe an exploit in that area is similarly likely.

robertlagrant · 2026-05-05T10:19:54 1777976394

I would say it's the fact that "not a security boundary" appears to be a pass/fail statement, whereas the reality is more like a security continuum, along which VMs are further than containers.

necovek · 2026-05-05T22:24:38 1778019878

I believe that is tautologically true, and thus not a very useful framing.

Security is obviously a continuum (eg. you can even have a bug in your IPMI FW, and a network packet could break in without any interaction with the OS; or there could be a HW bug too), but there is a discrete "jump" between containers and VMs to the extent that it is useful to call one a security boundary and the other not. Just like a firewall is a security boundary even if it can have security bugs.

Whether this jump between exploitable surface area warrants this distinction is what the point is: many believe it does.

anygivnthursday · 2026-05-06T05:26:56 1778045216

But you also cannot just handwave the difference by "it's a continuum". I did not use absolutes, but said "VMs are _better_ for security", which is already implicit about a "continuum".

Containers are mostly used as a deployment/packaging model where typically VMs are used where stronger security is needed. This has been the established industry standard for a while. Look at major cloud providers for example.

AWS:

> Unless explicitly stated, AWS does not consider a container or primitives such as an ECS task or a Kubernetes pod to be a security boundary. A notable exception to this is ECS tasks running AWS Fargate, where the isolation boundary is a task. To account for this, we recommend that you use Fargate with ECS if your applications have strict isolation requirements.

> When you’re using the Fargate launch type, each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.

They further recommend using different EC2 instances for even higher security requirements, which you can also run on dedicated hardware etc. But the fact that you can further increase isolation beyond VMs does not make containers the same as VMs.

https://aws.amazon.com/blogs/security/security-consideration...

GCP:

> There’s one myth worth clearing up: containers do not provide an impermeable security boundary, nor do they aim to. They provide some restrictions on access to shared resources on a host, but they don’t necessarily prevent a malicious attacker from circumventing these restrictions. Although both containers and VMs encapsulate an application, the container is a boundary for the application, but the VM is a boundary for the application and its resources, including resource allocation.

> If you're running an untrusted workload on Kubernetes Engine and need a strong security boundary, you should fall back on the isolation provided by the Google Cloud Platform project. For workloads sharing the same level of trust, you may get by with multi-tenancy, where a container is run on the same node as other containers or another node in the same cluster.

https://cloud.google.com/blog/products/gcp/exploring-contain...

> Applications that run in traditional Linux containers access system resources in the same way that regular (non-containerized) applications do: by making system calls directly to the host kernel.

> One approach to improve container isolation is to run each container in its own virtual machine (VM). This gives each container its own "machine," including kernel and virtualized devices, completely separate from the host. Even if there is a vulnerability in the guest, the hypervisor still isolates the host, as well as other applications/containers running on the host.

> gVisor is more lightweight than a VM while maintaining a similar level of isolation. The core of gVisor is a kernel that runs as a normal, unprivileged process that supports most Linux system calls. This kernel is written in Go, which was chosen for its memory- and type-safety. Just like within a VM, an application running in a gVisor sandbox gets its own kernel and set of virtualized devices, distinct from the host and other sandboxes.

https://cloud.google.com/blog/products/identity-security/ope...

These guys are experts when it comes to securing workloads on shared infra and while there are different levels of isolation using various techniques, the current industry practice is to not consider regular Linux containers a security boundary.

staticassertion · 2026-05-05T12:12:21 1777983141

Containers are a security boundary, yes.

> A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers

Yeah this almost never happens though whereas Linux privesc is 10x a day.

graemep · 2026-05-05T10:53:20 1777978400

They may not provide the same isolation as VMs but they clearly do limit some attacks. VMs do not provide the same isolation as using physically separate hardware either.

I would have thought they provide better isolation than using multiple users which is the traditional security boundary.

It might depend on what you mean by a container? Are sandboxes such as Bubblewrap and Firejail containers?

anygivnthursday · 2026-05-06T05:37:46 1778045866

> It might depend on what you mean by a container?

The article was about Podman and Linux namespaces

graemep · 2026-05-06T11:35:45 1778067345

I understood the comment I replied to (and many similar comments that are regularly made on HN) as talking about containers in general.

Namespaces are used as a security mechanism.

ButlerianJihad · 2026-05-05T07:14:31 1777965271

Containers are a convenience boundary and they increase complexity of your risk assessments.

It is easy for security scanners to scan a Linux system, but will they inspect your containers, and snaps, and flatpaks, and VMs? It is easy for DevOps to ssh into your Linux server, but can they also get logged in to each container, and do useful things? Your patches and all dependencies are up-to-date on your server, but those containers are still dragging around legacy dependencies, by design. Is your backup system aware of containers and capable of creating backup images or files, that are suitable for restoring back to service?

necovek · 2026-05-05T08:23:35 1777969415

Security scanners already support most container and VM image formats in widespread use.

Does this increase complexity? Yes, it does. Is it worth the cost? Depends on each individual case IMO.

throw0101c · 2026-05-05T12:56:21 1777985781

> Security scanners already support most container and VM image formats in widespread use.

E.g.,

> Container Security stores and scans container images as the images are built, before production. It provides vulnerability and malware detection, along with continuous monitoring of container images. By integrating with the continuous integration and continuous deployment (CI/CD) systems that build container images, Container Security ensures every container reaching production is secure and compliant with enterprise policy.

* https://docs.tenable.com/enclave-security/container-security...

firesteelrain · 2026-05-05T08:49:48 1777970988

You need a tool like Anchore and PrismaCloud to scan the container images then monitor them in runtime with PrismaCloud. Trellix can “scan” however most people turn off or exclude container directories on the host because it can interfere with the running container.

staticassertion · 2026-05-05T12:11:45 1777983105

These sorts of vulns are extremely common on Linux. This one is making the rounds for various reasons but it's a good justification for a migration away from containers if your threat model is concerned about it.

MicroVMs have much lower attack surface and you can even toss a container into one if you'd like.

Or use gvisor, which mitigates this vulnerability.

PunchyHamster · 2026-05-05T09:53:39 1777974819

> I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.

there is no reason it would be default policy. Else might as well block every socket and just multiplex everything on stdin/out

SV_BubbleTime · 2026-05-05T11:03:46 1777979026

>might as well block every socket and just multiplex everything on stdin/out

You may be on to something…

chadgpt2 · 2026-05-06T17:41:58 1778089318

Then we can build an encoding to allow arbitrary syscalls via stdin/out for convenience

cduzz · 2026-05-05T10:59:39 1777978779

I'd have guessed that the default paranoia-first policy would be "drop everything; verify what you need" which would include AF_ALG.

share and enjoy!

tremon · 2026-05-05T14:02:16 1777989736

How do you propose to implement that "drop everything except what you need" policy? Do your containers come with a detailed list of which OS services and syscalls are required? I think your idea has the same issue as what held back the adoption of selinux: many developers think that having to enumerate their application's behaviour like that is an undue burden.

A compounding issue is that using AF_ALG doesn't require a separate syscall: it's just using SYS_socket with the first argument set to 38. Your container behaviour specification needs to be specific enough to not only enumerate allowed syscalls, but the allowed values for each syscall parameter.
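To make the "allowed values for each syscall parameter" point concrete, here is a minimal sketch of an argument-aware rule. The JSON field names follow the style of OCI/Docker seccomp profiles as I understand them, so treat the exact shape as an assumption; `socket_rule_excluding`, `profile`, and `decide` are hypothetical helpers, and `decide` only understands the one comparison used here.

```python
import json

AF_ALG = 38  # first argument to socket(), as noted above


def socket_rule_excluding(family: int) -> dict:
    """Allow socket() only when arg 0 (the address family) differs from `family`."""
    return {
        "names": ["socket"],
        "action": "SCMP_ACT_ALLOW",
        "args": [{"index": 0, "value": family, "op": "SCMP_CMP_NE"}],
    }


def profile(allowed_plain: list[str]) -> dict:
    """Default-deny profile: plain allowlist plus the argument-filtered socket rule."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_plain), "action": "SCMP_ACT_ALLOW"},
            socket_rule_excluding(AF_ALG),
        ],
    }


def decide(prof: dict, name: str, args: tuple) -> str:
    """Tiny evaluator for illustration; handles only SCMP_CMP_NE conditions."""
    for rule in prof["syscalls"]:
        if name not in rule["names"]:
            continue
        conditions = rule.get("args", [])
        if all(c["op"] == "SCMP_CMP_NE" and args[c["index"]] != c["value"]
               for c in conditions):
            return rule["action"]
    return prof["defaultAction"]


if __name__ == "__main__":
    print(json.dumps(profile(["read", "write"]), indent=2))
```

With this profile, `decide(p, "socket", (2, 1, 0))` is allowed while `decide(p, "socket", (38, 5, 0))` falls through to the default deny, so the syscall name alone is not enough to express the policy.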

cduzz · 2026-05-05T14:50:37 1777992637

There are those who are paranoid and those who are expedient. If you're truly paranoid, you spin up the thing you want to run, measure what it does, and open the holes to allow it to do what it needs to. It's tedious and sometimes error-prone, but in some environments it is necessary.
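The "measure what it does" step could look roughly like this. It is a sketch that assumes you already have strace-style text for the workload; real profiling tools may collect this differently, and `observe` and `allowlist` are hypothetical names.

```python
import re
from collections import defaultdict

# Matches lines like: socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
# with an optional "[pid 123] " prefix.
LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\((.*)")


def observe(trace_text: str) -> dict:
    """Map each observed syscall name to the set of first arguments seen."""
    seen = defaultdict(set)
    for line in trace_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip exit markers, signals, and other non-call lines
        name, rest = match.groups()
        first_arg = re.split(r"[,)]", rest, maxsplit=1)[0].strip()
        seen[name].add(first_arg)
    return dict(seen)


def allowlist(seen: dict) -> list[str]:
    """The syscall names to open holes for, in a stable order."""
    return sorted(seen)
```

Keeping the first argument around matters for calls like `socket`, where the observed families (say `AF_INET` and `AF_UNIX`) tell you which values to permit and that `AF_ALG` was never needed.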

In the vast majority of the world, you set permissions to what's reasonable and trust that most of the time things will work out pretty well and have a plan for if you need to fix things on the fly.

I personally am not terribly paranoid, but I've worked places where we had to be pretty paranoid (shared hosting).

staticassertion · 2026-05-05T12:07:55 1777982875

The reason is that it's very rarely used and has a history of issues.

raesene9 · 2026-05-05T06:14:07 1777961647

I've not looked for podman but moby/docker I believe does now block this https://github.com/moby/profiles/commit/7158007a83005b14a24f...