Hacker News | dehrmann's comments

> training the LLM in violation of a license

Bartz v. Anthropic found that this is fair use, so the license doesn't play into it.


If the trained LLM spits out large, recognizable portions of licensed code and you use it in your product, don't count on that case to keep you out of court. The court found in Bartz v. Anthropic that training was fair use. It also found that pirating content to train on was not, and Anthropic paid $1,500,000,000 in a settlement.

There are licenses on most software source code. If you redistribute works derived from that code, you must abide by those licenses or you are violating the copyright. That’s what’s meant by “piracy" here.

Now if you have an LLM that has trained on code and learned to actually write new software, only small snippets too short to be protected by copyright should be identical between the training material and the output. However, if you're getting output that is substantial in size and recognizably derivative of the original, that's an issue that, as far as I'm aware, hasn't yet been settled in court. One would hope the major LLM players don't copy and paste large functional chunks of existing programs.

It would certainly seem to me that the code you sell after using an LLM should meet the same standards for difference in implementation as if it was written by a human. That should apply to both copyright protection and patent protection.


I find it pretty horrible that a company can pay a mere fine that is a small percentage of its total funding in exchange for materially benefiting from a conspiracy to commit a series of criminal acts.

If Anthropic hadn’t pirated training materials would they even exist? Would they still have been as competitive?

Would they still have gotten every bit of VC funding in anticipation of future successes derived in part from past crimes?

What’s next? Armed bank robbery when VC funding dries up?


Also, fair use is much more limited in the EU. I don't know how it applies here or whether there were any rulings. Are you going to stop doing business with the EU (and Japan, etc.)?

The seller of the code has no visibility on the training set of the LLM. If the situation you're describing ends up being illegal, responsibility should fall on the LLM provider to provide tools to detect such overlap with their training sets, and on the clients to run the tools.

The provider of the LLM should want to enable this and take on that responsibility (taking it off the clients); otherwise no one will want to use the tool. Maybe there could be AI tool-use lawsuit insurance, but I feel like that's worse for everyone involved than the copyright infringement detection tool.

I can see the tool happening in the EU, but basically nowhere else. In the US especially, the government sees "AI dominance" as both a national priority and a national security priority.


I thought fair use was decided on a case by case basis, and could not be guaranteed? If true, wouldn't that mean that in other cases it could be ruled differently?

I don't have the exact ruling in front of me, but IIRC the judge pretty clearly said that training a model was fair use. IIRC, he declared it "quintessentially transformative".

The case by case basis was about acquisition and possession of the copyrighted material. Anthropic pirated a large number of books and illegally stored digital copies of many that they did purchase legally. The training being protected doesn't give them the right to violate copyright in that way.

Google, for example, purchased print versions of their training material and had a small army of employees digitize them and then delete the digital copies when they were done. That hasn't been challenged AFAIK, but would likely not have been found to be a violation. That's, I think, what was meant by case by case basis.

It's like if someone breaks into my house and I shoot them with my gun, that's very likely self defense, but if I'm not allowed to own a gun, I may still end up in trouble with the law.


Whether or not you’re pirating and making illegal copies of something depends greatly on the terms under which you’re allowed to make those copies. You can copy GPL-licensed code all day every day so long as you abide by the license. The same is true of the BSD licenses, MIT, ISC, Apache, et cetera.

If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright.


> If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright

I don't disagree with that.

What I'm saying is that the judge ruled that training a model using copyrighted books wasn't derivative. It was transformative, so the training wasn't a copyright violation.

He then went on to say that the way Anthropic acquired and handled that material was a copyright violation, because Anthropic pirated and copied a large number of books that were not under a license like the ones you mentioned. They downloaded a bunch of books you would find at most bookstores and only actually purchased copies much later, once they were accused of violating copyrights.

I'm just trying to make that clear because I've heard a lot of people who don't understand that the violation wasn't about the act of training or material they used, it was just how they acquired the training material.


That was one case in front of one judge. It’s weak precedent if it’s precedent at all.

Also, the reasoning behind it being transformative instead of derivative is that the output isn’t supposed to be large, unchanged chunks of the input. There’s no actual guarantee your small model run under OpenClaw won’t recreate whole modules of the input.


> AI can also be the best teacher in the world

I just ran this for just that purpose.

curl http://<local-ollama>:11434/api/generate -d "$(jq -n --arg hist "$(history)" '{
  "model": "qwen3.5:35b-a3b-q4_K_M",
  "stream": false,
  "prompt": "The following is my bash shell history. Are there any bad patterns I should fix or commands I should learn or master? \($hist)"
}')"


Fun trivia: Intel's PCI vendor ID is 8086.

Fun trivia: that's hexadecimal, written "0x8086". In decimal, Intel's PCI vendor ID is 32902.

https://pcisig.com/membership/member-companies

The decimal ID 8086 is not assigned, so it may be a reserved number. Nor is 6800 assigned in either notation.

0x8088 is, however, assigned to "Beijing Wangxun Technology Co., Ltd."

Decimal 8088 is assigned to "Akeana, Inc."

Decimal 8080 is assigned to "QUSIDE TECHNOLOGIES S.L."


As of a few years ago, some of the restrooms in Classic still had purple and yellow tiles.

These aren't related in the way you think they are. Stock price reacts quickly to broader market trends, but more slowly for company-specific trends where revenue is likely stable. The impact of AI in engineering work will take months to show up in the product, probably a year after that for customers and the market to take notice. An AI product is a different thing entirely.

You're thinking of an algorithmic tradeoff, but this is an abstraction tradeoff.

Some of the algorithms are built deep into the runtime. E.g. languages that rely on malloc/free allocators (which require maintaining free lists) are making a pretty significant tradeoff of wasting CPU to save on RAM, as opposed to languages using moving collectors.

Free lists aren't expensive for most usage patterns. For cases where they are we've got stuff like arena allocators. Meanwhile GC is hardly cheap.

Of course memory safety has a quality all its own.


hopefully not implying needing a gc for memory safety...

Yeah, there's always Fil-C (Rust isn't memory safe in practice).

> Free lists aren't expensive for most usage patterns.

Whatever little CPU they waste is often worth more than the RAM they save.

> For cases where they are we've got stuff like arena allocators.

... that work by using more RAM to save on CPU.


GC burns far more CPU cycles. Meanwhile I'm not sure where you got this idea about the value of CPU cycles relative to RAM. Most tasks stall on IO. Those that don't typically stall on either memory bandwidth or latency. Meanwhile CPU-bound tasks typically don't perform allocations and, if forced, avoid the heap like the plague.

> GC burns far more CPU cycles

Far less for moving collectors. That's why they're used: to reduce the overhead of malloc/free based memory management. The whole point of moving collectors is that they can make the CPU cost of memory management arbitrarily low, even lower than stack allocation. In practice it's more complicated, but the principle stands.

The reason some programs "avoid the heap like the plague" is because their memory management is CPU-inefficient (as in the case of malloc/free allocators).

> Meanwhile I'm not sure where you got this idea about the value of CPU cycles relative to RAM

There is a fundamental relationship between CPU and RAM. As we learn in basic complexity theory, the power of what can be computed depends on how much memory an algorithm can use. On the flip side, using memory and managing memory requires CPU.

To get the most basic intuition, let's look at an extreme example. Consider a machine with 1 GB of free RAM and two programs that compute the same thing and consume 100% CPU for their duration. One uses 80MB of RAM and runs for 100s; the other uses 800MB of RAM and runs for 99s (perhaps thanks to a moving collector). Which is more efficient? It may seem that we need to compare the value of 1% CPU reduction vs a 10x increase in RAM consumption, but that's not necessary. The second program is more efficient. Why? Because when a program consumes 100% of the CPU, no other program can make use of any RAM, and so both programs effectively capture all 1GB, only the second program captures it for one second less.

This scales even to cases where CPU consumption is less than 100%, as the important thing to realise is that the two resources are coupled. The thing that needs to be optimised isn't CPU and RAM separately, but the RAM/CPU ratio. A program can be less efficient by using too little RAM if using more RAM can reduce its CPU consumption to get the right ratio (e.g. by using a moving collector), and vice versa.


There are (at least) two glaring issues with your analysis. First, the vast majority of workloads don't block on CPU (as I previously pointed out) and when they do they almost never do heap allocations in the hot path (again, as I previously pointed out). Second, we don't use single core single thread machines these days. Most workloads block on IO or memory access; the CPU pipeline is out of order and we have SMT for precisely this reason.

Anyway I'm not at all inclined to blindly believe your claim that malloc/free is particularly expensive relative to various GC algorithms. At present I believe the opposite (that malloc/free is quite cheap) but I'm open to the possibility that I'm misinformed about that. You're going to need to link to reputable benchmarks if you expect me to accept the efficiency claim, but even then that wouldn't convince me that any extra CPU cycles were actually an issue for the reasons articulated in the preceding paragraph.


> There are (at least) two glaring issues with your analysis. First, the vast majority of workloads don't block on CPU (as I previously pointed out) and when they do they almost never do heap allocations in the hot path (again, as I previously pointed out). Second, we don't use single core single thread machines these days. Most workloads block on IO or memory access; the CPU pipeline is out of order and we have SMT for precisely this reason.

This doesn't matter because if you're running a single program on a machine, it might as well use all the CPU and all the RAM. As long as you're under 100% on both, you're good. But we want to utilise the hardware well because we typically want to run multiple programs (or VMs) on a single machine, and the machine is exhausted when the first of CPU or RAM is exhausted. So the question is how should your CPU and RAM usage be balanced to offer optimal utilisation given that the machine is spent when the first of CPU and RAM is spent. E.g. you can only run two programs, each using 50% of CPU; if they each use only 5% of RAM, you've saved nothing as no third program can run. So if you spend either one of these resources in an unbalanced way, you're not using your hardware optimally. Using 2% more CPU to save 200MB of RAM could be suboptimal.

I'm not saying that for every program that uses X% CPU should also use exactly X% of RAM or it must be wasting one or the other, but that's the general perspective of how to think about efficiency. Using a lot of one and little of the other is, broadly speaking, not very efficient.

> Anyway I'm not at all inclined to blindly believe your claim that malloc/free is particularly expensive relative to various GC algorithms. At present I believe the opposite (that malloc/free is quite cheap) but I'm open to the possibility that I'm misinformed about that.

You are.

> You're going to need to link to reputable benchmarks if you expect me to accept the efficiency claim, but even then that wouldn't convince me that any extra CPU cycles were actually an issue for the reasons articulated in the preceding paragraph.

I don't believe there are any reputable benchmarks of full applications (which is where memory-management matters) that are apples-to-apples. I'm speaking from over two decades of experience with C++ and Java.

The important property of moving collectors is that they give you a knob that allows you to turn RAM into CPU and vice-versa (to some extent), and that's what you want to achieve the efficient balance.


Moving collectors as generally used are a huge waste of memory throughput, and this shows up consistently in the performance measurements. Moving data is very expensive! The whole point of ownership tracking in programming languages is so that large chunks of "owned" data can just stay put until freed, and only the owning handle (which is tiny) needs to move around. Most GC programming languages do a terrible job of supporting that pattern.

That's just not true. To give you a few pieces of the picture, moving collectors move little memory and do so rarely (relative to the allocation rate):

In the young generation, few objects survive and so few are moved (the very few that survive longer are moved into the old gen); in the old generation, most objects survive, but the allocation rate is so low that moving them is rare (although the memory management technique in the old gen doesn't matter as much precisely because the allocation rate is so low, so whether you want a moving algorithm or not in the old gen is less about speed and more about other concerns).

On top of that, the general principle of moving collectors (and why in theory they're cheaper than stack allocation) is that the cost of the overall work of moving memory is roughly constant for a specific workload, but its frequency can be made as low as you want by using more RAM.

The reason moving collectors are used in the first place is to reduce the high overhead of malloc/free allocators.

Anyway, the general point I was making above is that a machine is exhausted not when both CPU and RAM are exhausted, but when one of them is. Efficient hardware utilisation is when the program strikes some good balance between them. There's not much point to reducing RAM footprint when CPU utilisation is high or reducing CPU consumption when RAM consumption is high. Using much of one and little of the other is wasteful when you can reduce the higher one by increasing the other. Moving collectors give you a convenient knob to do that: if a program consumes a lot of CPU and little RAM, you can increase the heap and turn some RAM into CPU and vice versa.


I see some of the value in planning, but experimentation is so cheap, there's also a lot of value in trying it, seeing what works, and learning from it. The main drawback I see from experimentation is failing to understand why something worked.

The cheapest option in all of software development is to develop the program in your head

That includes experimentation.


> JanSport still advertises a lifetime warranty...Go try to use it.

Literally every product with a lifetime warranty plays this same game. It's best to read such warranties as puffery.


A job in tech will get you $10k MRR.

At least from this blog post, I wouldn't take his advice on MRR. He's optimizing in the wrong places; he'd be better off spending $100 per month on hosting and focusing on MRR than driving hosting costs down to 0.2% of revenue. There's a middle ground without K8S where you use your cloud's autoscaling app hosting and a small, replicated DB. He strawmanned the enterprise approach, and he's trading off a lot of toil for uptime.


> Is violence HN worthy when it is directed upward on the org chart?

Generally, world news and politics are not supposed to be submitted unless there's a tech industry connection. The exception seems to be world-changing news, and there's a light touch on YC-affiliated news for conflict of interest reasons.

> Off-Topic: Most stories about politics, or crime, or sports, or celebrities, unless they're evidence of some interesting new phenomenon. If they'd cover it on TV news, it's probably off-topic.

https://news.ycombinator.com/newsguidelines.html


That's not really accurate in terms of how we moderate stories with political charge on HN. I've written about this many times: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so.... If you or anyone want to understand how we actually approach this, all the information is accessible through those links.

