Hey HN! We're Guide Labs and we just launched Clarity, an AI platform powered by our Steerling-8B model.
You can click any chunk of output the model generates and see the concepts/ideas behind it, trace outputs back to the training data, and amplify or suppress concepts to control behavior.
Would love your feedback: happy to answer questions.
Hey HN, we recently released Steerling-8B, an 8B model designed to be interpretable from the ground up. The model has ~100K concept slots that it fills on its own during training, and we can read off what each one means by projecting it into vocabulary space.
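To make "projecting into vocabulary space" a bit more concrete, here is a minimal toy sketch. It assumes a concept slot can be treated as a vector in the same hidden space as the rows of an unembedding matrix, and that the tokens with the highest dot product are the ones that "name" the concept. The function name `top_tokens`, the 2-d vectors, and the five-token vocabulary are all made up for illustration; this is not the Steerling code or API.

```python
def top_tokens(concept_vec, unembed, vocab, k=3):
    """Return the k vocab tokens whose unembedding rows align best with concept_vec."""
    # Dot product of the concept vector with every token's unembedding row.
    scores = [sum(c * w for c, w in zip(concept_vec, row)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy, made-up data: a 2-d hidden space and a 5-token vocabulary.
vocab = ["colour", "favourite", "color", "favorite", "the"]
unembed = [
    [1.0, 0.1],   # colour
    [0.9, 0.2],   # favourite
    [-1.0, 0.1],  # color
    [-0.9, 0.2],  # favorite
    [0.0, 1.0],   # the
]

# Hypothetical concept slot that points along the "British spelling" direction.
british_concept = [1.0, 0.0]
print(top_tokens(british_concept, unembed, vocab, k=2))  # ['colour', 'favourite']
```

In a real model the vocabulary and hidden size are far larger, but the reading-off step would have the same shape: score every token against the concept vector and look at the top of the list.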
The model figured out things like British vs. American spelling, second-person pronouns across 6+ languages, and even broken Unicode.
We train the model with `explanations`. Most training asks the model to predict the next token or group of tokens. Our training says: predict the next group of tokens (causal diffusion), but these tokens should also be about {sports/art/coding/etc}. So in addition to token-level supervision, the model gets concept-level supervision, which forces it to learn these high-level concepts more quickly.
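As a rough sketch of what "token supervision plus concept supervision" could look like as a loss, the snippet below adds a concept cross-entropy term to the usual token cross-entropy. Everything here is an assumption for illustration: the single-label concept target, the `concept_weight` knob, and the function names are hypothetical, and the actual Steerling objective (including the causal diffusion part) is not shown.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(token_logits, token_target, concept_logits, concept_target,
               concept_weight=0.5):
    """Token loss plus a weighted concept loss (concept_weight is an assumed knob)."""
    token_term = cross_entropy(token_logits, token_target)
    concept_term = cross_entropy(concept_logits, concept_target)
    return token_term + concept_weight * concept_term


# Toy example: 3-token vocabulary, 4 concepts (say sports/art/coding/other).
loss = joint_loss(
    token_logits=[2.0, 0.5, -1.0], token_target=0,
    concept_logits=[0.1, 0.2, 3.0, 0.0], concept_target=2,
)
print(round(loss, 4))
```

Setting `concept_weight` to 0 recovers plain token-level training, which is one way to see the concept term as an extra signal layered on top.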
Take a look at the link in the blog posts. Here is a GitHub link as well: https://github.com/guidelabs/steerling. The model weights are on Hugging Face, so you can play with them.
Great questions. We weren't quite explicit about the training data attribution process, and we'll discuss it in more detail in future work. We can track down which parts of the training data were interpolated to create that sentence. For those training data sentences, we then compare the concepts in the generated text with the concepts in the training text.
We can attribute to exact sentences and chunks in the training data. For the first release, we are sharing only concept similarities. Over the coming weeks, we'll share and discuss how you can actually map to the exact training sentence and chunk with the model.
That would be great because "I got it from Wikipedia and Arxiv" isn't exactly useful.
From reading your second link (and please tell me if I got it wrong) it sounds like it isn't actually tracking to training data but to prototypes, which are then linked a posteriori to likely sections of the training data. The attribution isn't exact, right? It's more like "these are the likely texts that contributed to one of those prototypes that produced the final answer." Specifically, the bit in PRISM titled "Nearest neighbour Search" sounds like you could have a prototype that takes from 1000 sources but 3 of them more than the others, so the model identifies those 3, but the other ones might matter just as much in aggregate?
It says that the decomposition is linear. Can you remove a given prototype and infer again without it? That would be really cool.
This part of the claim is involved, so we have future posts planned to clarify it. And yes, you can remove a prototype and generate again. We show examples in that PRISM post.
In PRISM, for any token the model generates, you can say that it generated this token based on these sources. During training, the model is 'forced' to match all the prototypes to specific tokens (or groups of tokens) in the data. The prototype itself can actually be matched exactly to a training data point. Think of it like clustering: the prototype is a stand-in for training data that looks like that prototype, and we force (and know) how much the model will rely on that prototype for any token it generates.
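Here is a toy sketch of the "linear decomposition, then remove a prototype and infer again" idea discussed above. It assumes the output logits are a weighted sum of per-prototype logit vectors, so each prototype's share of any token's score can be read off directly and a prototype can be dropped before recombining. The names, weights, and numbers are invented for illustration, the weights are not renormalized after removal, and none of this is the PRISM implementation.

```python
def combined_logits(weights, prototype_logits, removed=frozenset()):
    """Sum w_k * logits_k over all prototypes k that are not in `removed`."""
    vocab_size = len(prototype_logits[0])
    out = [0.0] * vocab_size
    for k, (w, logits) in enumerate(zip(weights, prototype_logits)):
        if k in removed:
            continue  # ablate this prototype entirely
        for t in range(vocab_size):
            out[t] += w * logits[t]
    return out


def contributions(weights, prototype_logits, token_id):
    """Per-prototype share of one token's logit; the shares sum to that logit."""
    return [w * logits[token_id] for w, logits in zip(weights, prototype_logits)]


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


# Toy setup: 3 prototypes, 2-token vocabulary (all values made up).
weights = [0.6, 0.3, 0.1]
prototype_logits = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

full = combined_logits(weights, prototype_logits)
ablated = combined_logits(weights, prototype_logits, removed={0})

print(argmax(full))     # 0: prototype 0 dominates token 0
print(argmax(ablated))  # 1: without prototype 0, token 1 wins
print(contributions(weights, prototype_logits, token_id=0))
```

Because the combination is linear, the attribution for a token is just the list of per-prototype terms, and ablation is a matter of leaving one term out of the sum.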
The demo in the post is not as granular because we don't want to overwhelm folks. We'll show granular attribution in the future.
You got it exactly right :) And you can update the attribution.md to have it NOT rely on open-source projects that have been compromised. Imagine asking Claude Code to write a package/function in the style of a codebase that you care about, or forcing it to ALWAYS rely on some internal packages that you care about. The possibilities are endless when you insert such knobs into models.
Doesn’t the nature of most open source licenses allow for AI training though?
Example — MIT:
> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions
I remember seeing some new licenses like the Human license or something, IIRC, but they all drew the valid criticism that they would be unenforceable or that violations would be hard to catch.
I haven't looked at the project that much but this could seem exciting to me if maybe these two things can get merged.
I don't think the license is necessarily the problem here. Licenses can change and people can adopt new licenses.
Down to the very exact text chunk in a document! Check this out for an idea of what smaller versions of this style of model can do: https://www.guidelabs.ai/post/prism/. We'll have more to say soon about it. We can trace any generation to 11B chunks (not documents, but actual chunks in the training data).
Yes, that is the post that has the most up-to-date details of the model architecture. Take a look at this: https://github.com/guidelabs/steerling. It has the scaffolding for what you need :)
You are exactly right, it is guiding the model, during training, with concepts and the dictionary. This is important because dictionary learning for interpretability (post hoc) is not currently reliable: https://www.arxiv.org/abs/2602.14111