How Cloudflare responded to the “Copy Fail” Linux vulnerability (blog.cloudflare.com)
95 points by mobeigi 12 days ago | 72 comments



sammy2255 12 days ago

To any Cloudflare employees reading this: your network map at https://www.cloudflare.com/network/ is missing a few PoPs, notably Perth (PER), Australia; Hobart (HBA), Australia; Wellington (WLG), New Zealand; Christchurch (CHC), New Zealand; and Nausori (SUV), Fiji.

> Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line.

Hopefully this is a wake-up call for those who believe older distro LTS kernels get all the security fixes that Canonical and Red Hat would have you believe they do.

skinfaxi 12 days ago

Would love to learn more about their internal behavioural detection program.

> One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.

srcreigh 12 days ago

It’s fascinating that they already had a system that could identify the exploit at runtime. How can I learn more about that?
mkj 12 days ago

If they're already running a custom Linux kernel build, why did they have AF_ALG enabled? That seems like the perfect situation to enable only the features actually in use.

For us it was:

* Get the list of modules from Puppet's facts and confirm the module isn't used anywhere (it wasn't)
* Add `install algif_aead /bin/false` to /etc/modprobe.d/disable-algif.conf
* Run the exploit code to confirm it no longer works

I imagine CF runs more software that could use it, but apparently it's not a commonly used API.
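A minimal Python sketch of the last two steps in the list above, assuming a Linux host. `disabled_modules` scans modprobe.d-style text for simple `install <module> /bin/false` lines, and `aead_socket_available` tries to bind an AF_ALG socket of type "aead" as a lighter-weight probe than running the exploit code. Both function names and the `gcm(aes)` algorithm choice are illustrative assumptions. A failed bind is only a hint: the `install` override has no effect if algif_aead is built into the kernel or already loaded, and the bind can also fail because the chosen algorithm is unavailable.

```python
from __future__ import annotations

import socket


def disabled_modules(conf_text: str) -> set[str]:
    """Return module names whose modprobe `install` command is /bin/false or /bin/true.

    Sketch only: handles simple one-line `install <module> <command>` directives
    and ignores line continuations and every other directive.
    """
    disabled = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "install" and parts[2] in ("/bin/false", "/bin/true"):
            # modprobe treats '-' and '_' in module names as equivalent
            disabled.add(parts[1].replace("-", "_"))
    return disabled


def aead_socket_available(alg: str = "gcm(aes)") -> bool:
    """Try to bind an AF_ALG "aead" socket; return False if anything refuses.

    A False result may also mean the algorithm itself is unavailable, so this
    is a hint, not proof that algif_aead is disabled.
    """
    if not hasattr(socket, "AF_ALG"):  # Python only exposes AF_ALG on Linux
        return False
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as s:
            s.bind(("aead", alg))
        return True
    except OSError:
        return False
```

On a host where the mitigation took effect, you would expect `"algif_aead"` to appear in `disabled_modules(...)` for the contents of the config file, and `aead_socket_available()` to return `False` provided the module was not already loaded.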


Has anyone figured out whether this CVE was intentional?
tptacek 12 days ago

This is an interesting post from Cloudflare, as usual, but it's not clear to me why they would have been vulnerable to CopyFail. Did I miss the point in this blog where that's addressed? What triggered the threat-hunting and mitigation effort? At what points in their architecture were they reliant on Linux user-based access control?

I would assume it was about protecting their servers from privilege escalation by internal sources, rather than about providing publicly accessible Linux shells.
kjt6 12 days ago

Yeah, both honestly. We ran into this at a previous job where the bigger concern was lateral movement from compromised internal services, not just external shells. If one service gets popped, you don't want it escalating on the same host. Defense-in-depth regardless of exposure.
jmclnx 12 days ago

> Linux kernel build based on the community's Long-Term Support (LTS)

CopyFail only highlights why companies want LTS. If there were a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.

The corporate mindset is usually "never upgrade unless new hardware requires it or there is a critical software failure". All CopyFail did was reinforce that mindset.

I wonder if CopyFail will cause enterprises to put pressure on the Linux Foundation to maintain an "ultra LTS" that is supported for 20 years?


> CopyFail only highlights why companies want LTS. If there were a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.

Sadly, that's not really how it works for, say, Red Hat. They routinely backport features while keeping the kernel's "stable" version number unchanged. We even had the displeasure of them backporting a bug... the same bug into two different RHEL versions.

dcn5 12 days ago

The backport bug is somehow worse than the original. At least upstream you get one fix. Red Hat gives you the same broken code labeled "stable" across three release streams simultaneously.
tempest_ 12 days ago

The longer you wait the more painful the switch will eventually be.
dboreham 12 days ago

The "Hunting for Exploitation" section is unclear to me: "The exploit leaves a distinctive trace in kernel logs when it runs." Hmm. Wouldn't a system with a compromised kernel also log exactly what the attacker wanted logged?

Also, isn't 48 hours prior to the disclosure a very narrow window? I wonder if their logs don't go back further or if there was another reason to look back only two days.
rithdmc 12 days ago

The attack itself creates the logs, which - reading between the lines - are shipped to a central log server. A compromised server might not send any new indicators to the logs, but existing logs moved off device would still be available.

I'd like to know what those distinctive traces are, which is also missing :(


Your exploit would have to get root and kill or exploit the logging daemon near-instantly; otherwise the log will already have been sent to the remote server before you can change it locally.
cobalt14 12 days ago

Minor correction: the compromised kernel wouldn't necessarily control what's already been written to the log. The exploit triggers the log before privilege escalation completes, so you'd need to retroactively scrub it. Though yeah, a sufficiently clever exploit could account for that.
cube00 12 days ago

I guess the hope is the kernel has been able to successfully transmit that log message to the immutable central logging infra before it gets compromised.

Although, given the tendency of endpoint logging agents to buffer their output to reduce network chattiness, I do wonder whether a fast-acting exploit could wipe that buffer before it is transmitted.

I don't think any of the agents are sophisticated enough to pick privilege-escalation log messages out of the regular background noise and transmit them immediately.
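To make the buffering concern concrete, here is a minimal Python sketch of a hypothetical agent that batches ordinary lines to cut network chatter but flushes its whole buffer as soon as a line matches an "urgent" pattern. The class name, the regex, the batch size, and the `send` callback are all illustrative assumptions, not the behavior of any real logging agent.

```python
import re
from typing import Callable, List


class BufferedShipper:
    """Hypothetical log agent: batches lines, but ships 'urgent' lines at once.

    `send` stands in for whatever transport a real agent would use; it
    receives one list of lines per simulated network write.
    """

    def __init__(self, send: Callable[[List[str]], None], batch_size: int = 100,
                 urgent_pattern: str = r"(?i)privilege|setuid|segfault"):
        self.send = send
        self.batch_size = batch_size
        self.urgent = re.compile(urgent_pattern)  # illustrative pattern only
        self.buffer: List[str] = []

    def log(self, line: str) -> None:
        self.buffer.append(line)
        # Flush early if the line looks security-relevant, or when the batch is full.
        # Earlier buffered lines go out too, so ordering is preserved.
        if self.urgent.search(line) or len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
```

Even with this design there is still a race between the kernel writing the line and the agent reading and sending it; prioritizing urgent lines only narrows the window an exploit has to clear the buffer.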


This is a technical dive into how Cloudflare responded, not a confirmation that they responded.

For whatever reason, unknown to me, HN automatically strips "How" from the start of titles. I can't remember ever seeing a title where this was an improvement.

dang 12 days ago

Of course you can't, because the cases it improves don't get noticed, while the remainder stick out like sore thumbs.

I learned a few years ago that HN also editorializes by dropping "world's" from titles:

Before: Teens break record for world's longest kickball game

After: Teens break record for longest kickball game


Interestingly, there's a current post on the front page with "How" at the start of the title.

> https://news.ycombinator.com/item?id=48018715 "How do I inform Windows that I’m writing a binary file?"

I wonder if it ending in a '?' has anything to do with it?

Edit: On review, at the time of posting it was actually on the second page.

varun_ch 12 days ago

I'm yet to see a good example of the title stripping, at least for "how" and "how to" (although perhaps this is survivorship bias).

Starting a title with “How” is standard clickbait.
anvil44 12 days ago

Slashdot did the same thing with "Ask:" and "Poll:" prefixes back in the early 2000s. Some worked, some didn't. Editors have always made these calls and they're always wrong about half the time. The survivors just fade into background noise.
cube00 12 days ago

> At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version

That could be as low as 50.1%; I wish they'd given an actual percentage.

kmc98 12 days ago

"Two-week patch cycle" and they were still a month behind. Classic.