"But being able to explain why we use MSE or cross-entropy or any other loss function and which output activations (hint: and probability distributions) they are typically associated with actually has a very deep origin in the foundations of probability theory which blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to. "
What is the "very deep origin"? What is this "new way of thinking"? And what's so wrong with using argmax to make a classifier, if I don't care about estimating probabilities and just want the answer?
A lot of processes downstream of inference benefit from a minimum of care being put into the system design. We're talking 80/20-rule stuff here. Compared to a janky argmax classifier it's a simple reorientation, but it results in your assumptions being obeyed broadly, in a max-entropy sense.
The key insight is that any prediction model can equally be framed as an energy-based model (y = f(x) becomes E = g(x, y)), and that the job of ML is to approximate the joint distribution of x and y with a suitable max-entropy surrogate distribution, then perform MLE on that variational distribution against some training data. All the math in the theory follows from this (perhaps excluding causal stuff, but I am not familiar enough with those techniques to say for sure). Things get a little more complicated when you consider e.g. autoencoders, but the above still holds.
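To make the correspondence concrete, here's a minimal numpy sketch of my own (an illustration, not anything from a specific library): read a classifier's logits as negative energies, so that softmax is the Gibbs distribution p(y|x) ∝ exp(-E(x, y)), and cross-entropy is exactly the negative log-likelihood of that max-entropy surrogate. Minimizing cross-entropy is then MLE; the same bookkeeping with a fixed-variance Gaussian surrogate recovers MSE for regression.

```python
import numpy as np

def energies(weights, x):
    # A linear model's logits, read as negative energies: E(x, y) = -(x . w_y)
    return -(x @ weights)

def gibbs(weights, x):
    # The max-entropy (Gibbs) distribution implied by those energies:
    # p(y | x) = exp(-E(x, y)) / sum_k exp(-E(x, k))  -- i.e. plain softmax.
    logits = -energies(weights, x)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def nll(weights, x, y):
    # Cross-entropy on the observed labels is exactly the negative
    # log-likelihood of this surrogate, so minimizing it is MLE.
    p = gibbs(weights, x)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))      # toy batch: 8 samples, 3 features
y = rng.integers(0, 4, size=8)   # labels from 4 classes
w = rng.normal(size=(3, 4))      # linear model parameters
print(nll(w, x, y))              # the quantity gradient descent would minimize
```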
Obviously, with a poor choice of surrogate distribution, your predictions will on average be worse. Yes, even if you don't care about probabilities and just want max-likelihood predictions: your predictions will on average be worse. By construction, the analysis proceeds by framing the problem this way and following through. A janky argmax classifier is not exempt from this -- it, too, already implies a surrogate distribution, but statistically speaking it's probably a pretty bad one. So it makes sense to put in a tiny bit more effort and get much closer to representing the space your data lives in.
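To see what I mean by "already implies a surrogate distribution" (again a toy sketch of my own, nothing canonical): a classifier that only ever emits argmax labels is equivalent to a point-mass predictive distribution. Under MLE accounting, that surrogate assigns zero likelihood to any example it misclassifies, which is about as badly mis-specified as a surrogate can be, whereas a softmax over the same scores stays finite and can still be compared and tuned.

```python
import numpy as np

scores = np.array([2.0, 1.5, -0.3])  # raw scores some model assigns to 3 classes
true_label = 1                        # the class actually observed

# The surrogate a hard argmax classifier implies: a point mass on the top class.
point_mass = np.zeros_like(scores)
point_mass[scores.argmax()] = 1.0

# A max-entropy surrogate over the same scores: softmax.
softmax = np.exp(scores - scores.max())
softmax /= softmax.sum()

with np.errstate(divide="ignore"):
    print("argmax surrogate NLL:", -np.log(point_mass[true_label]))  # inf
    print("softmax surrogate NLL:", -np.log(softmax[true_label]))    # finite
```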
Naturally, you could easily find a janky model that outperforms some relatively unoptimized principled model on a specific use case, and many do get lucky this way. But the principled model has a lot more headroom, specifically in terms of the information it can hold: if the design is more or less correct to the problem specification, then the inductive bias built into the model closely matches the structure of the observed data.
Very little of ML is "principled" (e.g., taking into account the probability distributions, priors, bounds on parameter values, etc.); most of the time it is a brute-force approach that lets modelers avoid "thinking" about probability distributions, transformations, and so on.
I did a lot of the "principled" modeling you talk about, in Stan, TMB, and JAGS back in the day. But outside of the need for an "explanation" of model behavior, which is a scientific need much more than an engineering one (mind you, not having an explanation does not mean having no idea what the model does; it concerns the relationship between x and y, both in how we arrive at the parameter estimates and in the interpretation of the parameters themselves), I would almost always favor a "brutish" model for prediction in industry, out of (1) convenience, (2) accuracy, which is almost always better for ML models even with un-principled methods, and (3) the fact that outside of proper causal inference, predictions are what matter; even when people demand an "interpretation", causality is just a guess anyway when the data and model are not up to that kind of analysis.
Scientific vs. engineering needs is a false dichotomy. Explanation of model behavior matters a great deal in many, many matters of engineering, but my point is trying to go further than that.
You may be thinking narrow-mindedly about what is meant by "interpretation". Or rather, conflating "interpretation of the predictions of an ML system", which is the common understanding in professional circles, with "interpretation of the real system whose aspects we are predicting with ML", which is a more colloquial frame. I hold you at no fault, as I have been ambiguous in my usage and the two overlap quite substantially, particularly at the outputs of the ML system.
An alleged association between homosexuality and passport photos, for instance, is an interpretation of the ways humans exist and what they fundamentally are (read: physiognomy). Automating this association encodes a specific human-level interpretation of what is true about people into the ML system. But this joint distribution between homosexuality and the way a face looks when you record a picture of it is bogus in ways that are hard to put into words. The principle is completely lacking. And this kind of system can very easily be used for extreme harm in the wrong hands.
Nevertheless, someone sufficiently motivated would surely (1) consider this approach convenient, (2) have a model that is accurate (against the data) once training completes, and (3) use the raw predictions, since they think those "are what matters".
I find, not only for myself but for others as well, that awareness of the technical foundations opens up other perspectives on these issues, ones that synthesize the technical and the social impacts of design decisions.
Do you have a reference to a paper that demonstrates the empirical superiority of energy-based models to well-tuned "janky argmax-classifiers"? I find it a little hard to believe there's a free lunch here given the relative popularity of basic argmax stuff – if energy-based models were obviously better, it seems like they'd be used more. But I am open to evidence on this point!
What is the "very deep origin"? What is this "new way of thinking"? And what's so wrong with using argmax to make a classifier, if I don't care about estimating probabilities and just want the answer?