> If thorough investigation revealed poor quality control investment compared to what would be appropriate for a company like this, then we can say for sure.
We don't really need that thorough of an investigation. They had no staged deploys when servicing millions of machines. That alone is enough to say they're not running the company correctly.
I also fall on the side of "stagger the rollout" (or "give customers tools to stagger the rollout"), but at the same time I recognize that a lot of customers would not accept delays on the latest malware data.
Before the incident, if you asked a customer if they would like to get updates faster even if it means that there is a remote chance of a problem with them... I bet they'd still want to get updates faster.
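For what it's worth, "stagger the rollout" doesn't have to mean days of delay. Here's a minimal sketch of the idea, not anything CrowdStrike actually runs: push to a tiny wave first, and only widen if that wave comes back healthy. The `staged_rollout` function, the `deploy` callback, and the stage fractions are all made up for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Deploy in growing waves and halt when a wave looks unhealthy.

    `deploy` is a hypothetical callback that pushes the update to one host
    and returns True if the host reports healthy afterwards.
    Returns the hosts that were updated successfully.
    """
    updated: list[str] = []
    done = 0  # number of hosts already attempted
    for fraction in stage_fractions:
        # Each stage covers at least one new host, capped at the fleet size.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        wave = hosts[done:target]
        if not wave:
            continue
        results = [(host, deploy(host)) for host in wave]
        failures = sum(1 for _, ok in results if not ok)
        updated.extend(host for host, ok in results if ok)
        done = target
        # Stop widening the rollout if this wave failed too often.
        if failures / len(wave) > max_failure_rate:
            break
    return updated
```

With a 100-host fleet and an update that breaks every machine, this touches exactly one host before stopping. How long each wave has to "soak" before the next one starts is the real trade-off the thread is arguing about, and this sketch doesn't model it.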
I would say that canary release is an absolute must 100%. Except I can think of cases where it might still not be enough. So, I just don't feel comfortable judging them out of the box. Does all the evidence seem to point against them? For sure. But I just don't feel comfortable giving that final verdict without knowing for sure.
Specifically because this is about fighting against malicious actors, where time can be of the essence to deploy some sort of protection against a novel threat.
If there are deadlines you can go over and nothing bad happens, then for sure: always have canary releases, perfect QA, and thorough monitoring. But I'm just saying, there can be cases where the damage that could be done if you don't act fast enough is just so much worse.
And I don't know that it wasn't the case for them. I just don't know.
> Specifically because this is about fighting against malicious actors, where time can be of the essence to deploy some sort of protection against a novel threat.
This is severely overstating the problem: an extra few minutes is not going to be the difference between their customers being compromised or not. Most of the devices they run on are never compromised, because anyone remotely serious has defense in depth.
If it were true, or even close to true, that would make the criticism stronger rather than weaker. If time is of the essence, you invest in things like reviewing test coverage (their most glaring lapse), fuzz testing, and common reliability engineering techniques like having the system roll back to the last known good configuration after it has failed to load. We think of progressive rollouts as common now, but they became that mainstream in large part because the Google Chrome team realized rapid updates are important and then asked what they needed to do to make them safe. CrowdStrike's report suggests that they wanted rapid updates but weren't willing to invest in the implementation because that isn't a customer-visible feature, until it very painfully became one.
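To make the "last known good" idea concrete, here's a rough sketch under heavy assumptions: the content is a JSON file with a `rules` list, and `parse_config` and `load_with_fallback` are hypothetical names. A real kernel-mode sensor parsing binary channel files is a very different beast, but the shape of the logic is the same.

```python
import json
import shutil
from pathlib import Path


def parse_config(path: Path) -> dict:
    """Parse and sanity-check a content file (hypothetical JSON format)."""
    config = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(config, dict) or not isinstance(config.get("rules"), list):
        raise ValueError(f"{path} is not a valid content file")
    return config


def load_with_fallback(current: Path, last_good: Path) -> dict:
    """Load `current`; if it fails to load, fall back to `last_good`."""
    try:
        config = parse_config(current)
    except (OSError, ValueError):
        # The new content is unreadable or malformed: keep running on the
        # previous version instead of failing outright.
        return parse_config(last_good)
    # The new content loaded cleanly, so it becomes the new "last known good".
    shutil.copyfile(current, last_good)
    return config
```

This only catches content that fails to parse or validate. Content that parses fine and then crashes the consumer needs something more, such as recording a "load in progress" marker and reverting on the next boot if the marker is still there.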
"CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.
...
The sensor release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.
The event of Friday, July 19, 2024 was not triggered by Sensor Content, which is only delivered with the release of an updated Falcon sensor. Customers have complete control over the deployment of the sensor — which includes Sensor Content and Template Types.
...
Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine.
Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and event volume. For each Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions.
Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production."
Do you seriously believe that all CrowdStrike on Windows customers were at such imminent risk of ransomware that taking one or two hours to run this on one internal setup, and catch the critical error they released, would have been dangerous?
This is a ludicrous position, and events have proven it obviously false: the systems that were crashed by this critical failure were not, in fact, attacked with ransomware once the CS agent was uninstalled (at great pain).
You don't want to be in a situation where you're taken hostage and asked for a hundred million in ransom just because you were too slow to mitigate the situation.
Mitigation: Validate the number of input fields in the Template Type at sensor compile time
Mitigation: Add runtime input array bounds checks to the Content Interpreter for Rapid Response Content in Channel File 291
- An additional check that the size of the input array matches the number of inputs expected by the Rapid Response Content was added at the same time.
- We have completed fuzz testing of the Channel 291 Template Type and are expanding it to additional Rapid Response Content handlers in the sensor.
Mitigation: Correct the number of inputs provided by the IPC Template Type
Mitigation: Increase test coverage during Template Type development
Mitigation: Create additional checks in the Content Validator
Mitigation: Prevent the creation of problematic Channel 291 files
Mitigation: Update Content Configuration System test procedures
Mitigation: The Content Configuration System has been updated with additional deployment layers and acceptance checks
Mitigation: Provide customer control over the deployment of Rapid Response Content updates
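The input-count and bounds-check mitigations in that list boil down to something like the sketch below. This is an illustration only, not CrowdStrike's code: `validate_instance`, `match_instance`, and the dict-of-patterns representation are invented here, and the real Content Interpreter does not run Python.

```python
from typing import Mapping, Sequence


class InputCountMismatch(ValueError):
    """Raised when content refers to an input field that does not exist."""


def validate_instance(criteria: Mapping[int, str], expected_inputs: int) -> None:
    """Publish-time check: every field index must be within the declared count."""
    for index in criteria:
        if not 0 <= index < expected_inputs:
            raise InputCountMismatch(
                f"field {index} is outside the {expected_inputs} declared inputs"
            )


def match_instance(
    criteria: Mapping[int, str],
    inputs: Sequence[str],
    expected_inputs: int,
) -> bool:
    """Runtime check: never index past the inputs that were actually supplied."""
    # Refuse to evaluate if the caller supplied a different number of inputs
    # than the content expects.
    if len(inputs) != expected_inputs:
        return False
    for index, pattern in criteria.items():
        # Bounds check before every read, even though validation ran earlier.
        if not 0 <= index < len(inputs):
            return False
        # "*" is a wildcard here (an assumption of this sketch).
        if pattern != "*" and inputs[index] != pattern:
            return False
    return True
```

The two checks are deliberately redundant: the validator stops bad content from being published, and the runtime check means a validator bug degrades to "rule doesn't match" instead of an out-of-bounds read.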
No, again, you've got it exactly backwards. "Solid precedent" can now be established via the development of jurisprudence under stare decisis, as is the judiciary's role.
Under Chevron, agencies themselves were interpreting and re-interpreting statute law as they saw fit, and there was no binding precedent at all.
Yes, sometimes that's necessary when bad rulings slip through. If there was no mechanism to reverse bad precedent, we'd be in a pretty bad situation -- we wouldn't have this ruling, we wouldn't have Brown v. Board of Education, etc.
But the courts do this carefully and infrequently, and create a level of stability in the law that's far stronger than allowing the ever-changing landscape of statutory law and administrative rule-making to provide the final say over complex questions.
That's exactly the point of this ruling -- the courts had delegated the responsibility for shaping jurisprudence that defined the limits of administrative power to the administrative agencies themselves, allowing those agencies to interpret and re-interpret the boundaries of their own authority in pursuit of whatever ephemeral issue they were focused on at the moment.
This was an abdication of the courts' own duty, and it's a very good thing that they've decided to take responsibility for it once more.
If I was starting from scratch, what resources should I start with to build up an understanding of what this code does and how to read it? It's quite dense and my knowledge of LLMs is quite minimal. Are these terse variable names standard in LLM-land?
“What resources would I need” -> you’re literally commenting on a teacher’s content. Karpathy (the author) has a very informative YouTube channel where he goes step by step through everything. He has a ton of repos and tutorials. Dig a little.
You’re not supposed to know that. You asked a question, and this is you being told the answer.
It’s very convenient that the author of the post is quite literally the world’s most prolific teacher on this topic. Makes it easy to find Karpathy. You shouldn’t be expected to otherwise know that (or else why ask if you knew).
> I didn't realize variables had to be so short in C. Glad I write C++ professionally where they've added support for longer variable names.
This feels like a joke, but old C compilers did have limits on the length of variable names. This is part of why C historically had shorter variable names than more modern languages.
Sorry if it came off rude, the internet is hard to communicate over.
As siblings have said, his video series is quite good. But if you're only looking at this repo, you probably want to look at the Python reference implementation. (The C is designed to exactly replicate its functionality.)