Hacker News

In the section "Programs into weights & training beyond gradient descent", near the end, they say:

    [...] **the compilation machinery we built for generating those weights** can go further. In principle, arbitrary programs can be compiled directly into the transformer weights, bypassing the need to represent them as token sequences at all. [...] [my emphasis]
They continue in the same section:

    Weights become a deployment target: instead of learning software-like behavior, models contain compiled program logic.

    If logic can be compiled into weights, then gradient descent is no longer the only way to modify a model. Weight compilation provides another route for inserting structure, algorithms, and guarantees directly into a network.
So they admit, almost in passing, that they compile logic into the weights, and the later sentences make it clear this was the intention all along.
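To make the quoted idea concrete: "compiling logic into weights" just means setting the weights by construction rather than learning them. A minimal sketch (my own toy example, not the article's actual compilation machinery, and using a two-layer ReLU net rather than a transformer) that hand-derives weights implementing XOR, with no gradient descent involved:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# "Compiled" weights: chosen by hand to implement XOR, not learned.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # input -> hidden: both units see a + b
b1 = np.array([0.0, -1.0])    # hidden: [relu(a+b), relu(a+b-1)]
W2 = np.array([1.0, -2.0])    # output: (a+b) - 2*max(a+b-1, 0)

def xor_net(a, b):
    h = relu(np.array([a, b], dtype=float) @ W1 + b1)
    return float(h @ W2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # 0 0 -> 0.0, 0 1 -> 1.0, 1 0 -> 1.0, 1 1 -> 0.0
```

The same principle, scaled up, is what the post gestures at: if you know the program, you can write the weight matrices directly, so gradient descent becomes one route to a set of weights rather than the only one.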

