I was doing some experiments with removing the top 100-1000 most common English word...

computerphage · 2026-04-16T15:31:38 1776353498

Yeah, when I'm writing code I try to avoid zeros and ones, since those are the most common bits, making them essentially noise

ruairidhwm · 2026-04-16T15:28:49 1776353329

I literally just posted a blog on this. Some seemingly insignificant words are actually highly structural to the model. https://www.ruairidh.dev/blog/compressing-prompts-with-an-au...

cheschire · 2026-04-16T15:32:40 1776353560

I suspect even typos have an impact on how the model functions.

I wonder if there’s a pre-processor that runs to remove typos before processing. If not, that feels like a space that could be worked on more thoroughly.
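As a rough illustration of what such a pre-processor could look like, here is a minimal sketch that snaps unknown words to the closest entry in a known vocabulary using Python's standard `difflib`. The `VOCAB` list, the `fix_typos` name, and the 0.6 cutoff are all assumptions made up for this example; nothing here reflects what any real model pipeline does.

```python
import difflib

# Hypothetical vocabulary; a real pre-processor would need a far larger one.
VOCAB = ["remove", "common", "words", "from", "the", "prompt"]

def fix_typos(text, vocab=VOCAB, cutoff=0.6):
    """Replace each unknown word with its closest vocabulary match, if any."""
    fixed = []
    for word in text.split():
        if word in vocab:
            fixed.append(word)
            continue
        # get_close_matches returns the best matches above the similarity cutoff.
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        # Words with no close match are left untouched.
        fixed.append(matches[0] if matches else word)
    return " ".join(fixed)

print(fix_typos("comon words"))
```

A purely string-based approach like this cannot use context, which is the gap a model-based preprocess would have to fill.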

ruairidhwm · 2026-04-16T16:07:50 1776355670

I guess just a spell-check in the repo? But yes, I'd imagine that they have an effect. Even running the same input twice is non-deterministic.

cheschire · 2026-04-16T16:14:43 1776356083

The ability of audio processing to figure out spelling from context, especially with regard to acronyms that are pronounced as words, leads me to believe there’s potential for a more intelligent spell-check preprocessing step using a cheaper model.

mathieudombrock · 2026-04-16T17:54:26 1776362066

The same input twice is only nondeterministic if you don't control the seed.
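A toy sketch of the point, assuming sampling is the only source of randomness: the "model" here is just a weighted draw over a made-up token list, and `TOKENS`, `WEIGHTS`, and `sample_tokens` are hypothetical names, not any real inference API.

```python
import random

# Made-up next-token distribution standing in for a model's output.
TOKENS = ["hello", "world", "foo", "bar"]
WEIGHTS = [0.5, 0.3, 0.15, 0.05]

def sample_tokens(seed=None, k=8):
    """Draw k tokens; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    return rng.choices(TOKENS, weights=WEIGHTS, k=k)

# Same seed, same input: identical samples.
print(sample_tokens(seed=1234) == sample_tokens(seed=1234))

# No seed: two runs may or may not agree.
print(sample_tokens(), sample_tokens())
```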

0123456789ABCDE · 2026-04-16T15:52:48 1776354768

There is no pre-processor. I've had typos go through, with Claude asking to make sure I meant one thing instead of the other.

PhilipRoman · 2026-04-16T16:10:46 1776355846

I strongly suspected that there was some pre/postprocessing going on when trying to get it to output rot13("uryyb, jbyeq"), but it's probably just due to massively biased token probabilities. Still, it creates some hilarious output, even when you clearly point out the error:

  Hmm, but wait — the original you gave was jbyeq not jbeyq:
  j→w, b→o, y→l, e→r, q→d = world
  So the final answer is still hello, world. You're right that I was misreading the input. The result stands.
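For reference, the input can be decoded mechanically with Python's built-in `rot_13` text codec. The per-letter mapping in the quoted output is correct, but joined in order those letters spell "wolrd", not "world":

```python
import codecs

encoded = "uryyb, jbyeq"

# ROT13 is its own inverse, so encoding with rot_13 also decodes.
decoded = codecs.encode(encoded, "rot_13")
print(decoded)  # hello, wolrd

# The string that actually decodes to "hello, world" has the e and y swapped.
print(codecs.encode("hello, world", "rot_13"))  # uryyb, jbeyq
```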

AlecSchueler · 2026-04-16T15:35:36 1776353736

Doesn't it just use more tokens in reasoning?

slashdave · 2026-04-17T00:21:02 1776385262

> My hypothesis was that common words are effectively noise to agents

Umm... a few words can be combined in a rather large number of ways.

Punctuation is used a lot. Why not just remove all the periods and commas and see what happens? Probably not pretty.
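A quick sketch of that experiment on a single made-up sentence (the helper name and the sample text are just for illustration):

```python
def strip_periods_and_commas(text):
    """Delete every period and comma, leaving all other characters alone."""
    return text.translate(str.maketrans("", "", ".,"))

original = "Let's eat, Grandma. No, wait."
stripped = strip_periods_and_commas(original)
print(stripped)  # Let's eat Grandma No wait
```

Even in this tiny case the sentence boundaries and the direct address are lost.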