
> It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect is illusory. Mysteries!

I’m not even sure whether this is possible. The current corpus used for training includes virtually all known material. If we make it illegal for these companies to use copyrighted content without remuneration, either the task gets very expensive indeed, or the corpus shrinks. We can certainly make the models larger, with more and more parameters, subject only to silicon’s ability to give us more transistors for RAM density and GPU parallelism. But it honestly feels like, without another “Attention Is All You Need” level breakthrough, we’re starting to see the end of the runway.



I see a lot of researchers working on newer ideas, so I wouldn't be surprised if we get a breakthrough in 5-10 years. After all, the gap between AlexNet and Attention Is All You Need was only about five years, and the scaling-laws paper came roughly three years after that. It might seem like not much progress is being made, but I think that's in part because AI labs are extremely secretive now that ideas are worth billions (and, in the right hands, potentially more).

Of course, 5-10 years is a long time to bang our heads against the wall with untenable costs, but I don't know if we can solve our way out of that problem.


I think we will see models become small reasoning cores that don't remember tonnes of facts but can reason over data that is fed to them or that they can search for.


The echoes of A.I. winter.


Yep, the last time we got "a lot of researchers working on newer ideas", it took them 20 years to arrive at a working idea, and another 20 years to get it mature enough to produce an AI boom.


> The current corpus used for training includes virtually all known material.

This is just totally incorrect. It's one of those things everyone just assumes, but there's an immense amount of known material that isn't even digitized, much less in the hands of tech companies.


What large caches of undigitized content exist? Surely not everything has been digitized, but I can’t imagine it’s much in percentage terms.


The amount of data locked up inside private internal databases is huge. This is especially true of regulated industries. There is a wealth of it - financial data showing how to budget for things, pricing data on various B2B products, standard operating procedures at mature companies that have gone through multiple revisions, designs for manufacturing plants so people don't keep reinventing them and making the same mistakes again, and on and on.


I think it's implied that they're not talking about private data when they say they've run out.


Fair. I want to +1 the fact that there is a large amount of data unseen by LLMs.


I think there are post-training tweaks that can be done with corporate data to help fit an AI to a specific corporation. But I don’t think that private data will deliver us AGI. The knowledge for AGI is out in the world, not hidden inside corporations. Private data brings us knowledge of the XYZ project status, the division ABC budget, and whether Bob wants a chocolate cake for his going-away dinner or not.


I'm not seeing it the same way. Businesses in various industries have several types of moats - money, knowledge, experience, skills, etc. There is a ton of competitive intelligence hidden in private data.

It's one of the reasons you can't use ChatGPT to start manufacturing chips, vaccines, or anti-cancer medication. There is a gap between the publicly available data that informs academic "core science" research and the specific product-based knowledge that shows you how to make a successful drug candidate, one that can withstand regulatory scrutiny and be a safe and effective drug for the world's population.

We could iterate so quickly if this private data set were democratized.


The Vatican Library contains roughly 1.1 million printed books and around 75,000 codices, only a small percentage of which have been digitised.


Reddit alone contains about the same quantity of text (~10 billion posts * 10 words per post, vs 1 million books * 100k words per book). Messaging and document platforms (Google Docs, Slack, Discord, Telegram, etc.) probably each have 1-3 orders of magnitude more than Reddit. To your/GP's point though, those private platforms probably haven't been slurped up by LLMs yet.
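
A rough back-of-envelope check of that comparison, using the same assumed figures (the post counts and words-per-item are this thread's estimates, not measured values):

    # Back-of-envelope word-count comparison (all figures are rough assumptions)
    reddit_posts   = 10e9    # ~10 billion posts (assumed above)
    words_per_post = 10      # rough average (assumed above)
    library_books  = 1.1e6   # ~1.1 million printed books
    words_per_book = 100e3   # rough average (assumed above)

    reddit_words  = reddit_posts * words_per_post    # ~1e11 words
    library_words = library_books * words_per_book   # ~1.1e11 words
    print(f"Reddit ~{reddit_words:.1e} words, library ~{library_words:.1e} words")

Both land around 1e11 words, which is why the two look comparable in size despite the very different item counts.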


Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that content and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle.


1.1 million books being 0.000000001% would imply a total of about 1.1e6 / 1e-11 ≈ 1e17 books in the world - the real number is closer to 1e8.


You’re missing the point. And we’re not just talking about books, whatever that might mean. We’re talking about all documents ever made. Every magazine article, every blog and web page, every Word doc, etc. I’m pretty sure that whatever is in the Vatican archives is tiny by comparison. Given the age of the Vatican archives, I can also guarantee that many of those “books” are nothing more than page fragments. Very few will be full codices or long scrolls. Many will date from before the printing press, when document production was slow and laborious.


What makes you believe that most things have been digitised in the first place?


Has the whole of YouTube been indexed?


I’m sure Gemini has done it at some level. Google was pretty much founded on the assumption that more data is better. That has driven them to build or buy data sets that they can mine (Gmail, YouTube, etc.).


I think in domains like Math and Software Engineering, they are less constrained by training data anyway. They can synthetically generate and validate programs. To what extent that scales into novel insights is a different matter, but I think they dream of the AlphaGo Zero moment at least in verifiable domains.
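
A minimal sketch of what that generate-and-validate loop could look like for code (the candidate programs below are hard-coded stand-ins for model output, and the names solve, verify, and build_synthetic_dataset are made up for illustration): propose a program, run it against checks, and keep only the verified samples as new training data.

    # Sketch of a generate-and-verify loop for synthetic code data.
    # Candidates stand in for model-generated programs; the verifier (a set of
    # input/output checks) decides what is kept as training data.
    def verify(program_src, tests):
        """Return True only if the candidate defines solve() and passes every check."""
        scope = {}
        try:
            exec(program_src, scope)  # execute the candidate definition
            return all(scope["solve"](x) == y for x, y in tests)
        except Exception:
            return False

    def build_synthetic_dataset(candidates, tests):
        """Keep only candidates that pass verification."""
        return [src for src in candidates if verify(src, tests)]

    # Toy usage: two candidates for "double the input"; only the first survives.
    tests = [(1, 2), (3, 6)]
    candidates = [
        "def solve(x):\n    return x * 2",   # correct
        "def solve(x):\n    return x + 2",   # wrong, filtered out
    ]
    print(build_synthetic_dataset(candidates, tests))

The catch, as the reply below points out, is that the verifier only certifies what the checks can check, so the quality of the tests bounds the quality of the synthetic data.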


How can it ever play against itself on novel software tasks? First it has to come up with the task. Then it can write tests, but it also needs to verify that the tests themselves are correct; a mixture of experts can come to the wrong conclusions, etc...


> I’m not even sure whether this is possible.

Based on what's happened so far, maybe. At least that's exactly how we got to the current iteration back in 2022/2023: quite literally, "let's see what happens when we throw an enormous amount of data at them during training" worked up to a point, and then post-training took over as the place where labs currently differ.


It works for consolidating existing human knowledge, but general intelligence doesn't suddenly emerge from it. They're out of data now, but if there were 10x more (assuming it's not rehashed from existing data) and they had a 10x larger model, it would be better - but only because there would be more for it to copy from.


Right, but we played the scaling card; it worked, but it is now reaching its limits. What is the next card? You can surely argue that we can find a new one at any time. That’s the definition of a breakthrough. I just don’t see one at the moment.


> I just don’t see one at the moment.

Did you see the one before the current one was even found? Things tend to look easy in hindsight, and borderline impossible looking forward. Otherwise it sounds like you're in the same spot as before :)


That’s what I said. Breakthroughs happen, no doubt about it, and they are unpredictable; hence the word breakthrough. But right now we’re using up runway with nothing yet identified to take us to the next level. And while sometimes breakthroughs happen, sometimes they don’t.


better tooling and integration


We pay people to create more high-quality tokens (Mercor, Turing), which are then fed into data-generating processes (synthetic data) to create even more tokens to train on.


But does that really help, or do you get distortion? The frequency distribution of human-generated content moves slowly over time as new subjects are discussed. What frequency distribution do those “data generating processes” use? And at root, aren’t those “data generating processes” basically just another LLM (i.e., generating tokens according to a probability distribution)? Thus, aren’t we just feeding AI slop into the next training run and humoring ourselves by renaming the slop “synthetic data”? Not trying to be argumentative. I’m far from being an AI expert, so maybe I’m missing it. Feel free to explain why I’m wrong.


That's the problem in a nutshell. There is an art to how you generate the synthetic data so that you don't end up with badly trained models (especially when mistakes cost XX million dollars).

It's also, theoretically, why Facebook paid $14bn for Alexandr Wang and Scale AI.



