Setting aside the actual legality of using training data in machine learning, let's look at these "options" in purely outcome-driven terms. Do that, and you'll quickly realize that what you describe would strengthen megacorporations even further. If training on public data were banned, the entities that hold complete ownership of their data would be handed a de facto monopoly. Stock image services, media conglomerates, music labels, and industry giants would start in the winning position. When one side gets their way for free and the other has to beg for scraps, do you really think that's a fair proposition?
The reason open-source AI exists at all is that we've always allowed the use of public data. It was fine when Google did it, it was fine when the Internet Archive did it, it was even fine when text translation services trained their models on that same data. Really, the same applies to basically anything ML-driven before generative AI.
There's a sea of reasons to criticize OpenAI, but arguing for extending IP law even further and accusing opponents of "literal theft" is one of the weaker ones to have caught on with so many people.