There is no training in the usual sense of the term, i.e. no gradient descent and no differentiable loss function. They use deceptive language early on to make it sound as if there is, but near the end they make it clear that their model as-is isn't actually differentiable, and that in theory it might still work if made differentiable. But they don't actually know.
But IMO this is BS, because I don't know how one would get or generate training data, or how one would define a continuous loss function that scores partially correct / plausible outputs (e.g. is a "partially correct" program / algorithm / piece of code even a coherent concept?).
Yeah, a "100% correct" Sudoku solver fully trained by gradient descent from examples? That sure would be something entirely new.
To answer dwa3592, it's always possible to set the weights of a neural net by hand, although it is extremely fiddly and normally only done "on paper". This is e.g. how the Turing-completeness of RNNs was shown back in the '90s:
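To make "setting the weights by hand" concrete, here is a toy sketch. It is not the RNN construction mentioned above and not anything from the article under discussion, just a two-layer threshold network whose weights I picked by hand so that it computes XOR. The names (`step`, `xor_net`, `W_HIDDEN`, etc.) are made up for illustration. Note that the step activation has zero gradient almost everywhere, so a net like this could not be found by plain gradient descent as written.

```python
def step(x):
    """Threshold unit: outputs 1 when its input is positive, else 0."""
    return 1 if x > 0 else 0


# Hand-picked weights; nothing here is learned from data.
# Hidden unit 0 acts as OR(a, b), hidden unit 1 acts as AND(a, b).
W_HIDDEN = [[1.0, 1.0], [1.0, 1.0]]
B_HIDDEN = [-0.5, -1.5]

# The output unit fires when OR is on and AND is off, i.e. XOR.
W_OUT = [1.0, -1.0]
B_OUT = -0.5


def xor_net(a, b):
    """Forward pass through the hand-wired two-layer network."""
    hidden = [
        step(w[0] * a + w[1] * b + bias)
        for w, bias in zip(W_HIDDEN, B_HIDDEN)
    ]
    return step(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))
```

Even for four input cases you have to reason about each threshold individually, which gives a feel for why doing this for anything bigger is mostly a pen-and-paper exercise.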