
The worst part about this:

> Running experiments until you get a hit

Is that it's literally what we software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

Hence we are running experiments until we get a hit.

The only defense I know against this is to have a good perf CI. If your patch seemed like a speed-up before committing, but perf CI doesn't see the speed-up, then you just p-hacked yourself. But even that's not foolproof.

You just have to accept that statistics lie and that you will fool yourself. Prepare accordingly.
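
For illustration, here's a minimal sketch (Python; the function name, threshold, and 1% cutoff are all illustrative assumptions, not any particular CI's API) of the kind of gate a perf CI might apply: demand both a significance test and a practically meaningful delta before believing a speed-up.

    # Compare timing samples from the baseline build against the patched build.
    # Names and thresholds here are illustrative, not any real CI's interface.
    import statistics
    from scipy import stats

    def looks_like_a_real_speedup(baseline_s, patched_s, alpha=0.01):
        # baseline_s, patched_s: lists of wall-clock timings in seconds.
        # Non-parametric test: benchmark timings are rarely normally distributed.
        _, p = stats.mannwhitneyu(baseline_s, patched_s, alternative="greater")
        base_med = statistics.median(baseline_s)
        rel_delta = (base_med - statistics.median(patched_s)) / base_med
        # Require significance *and* a >1% median improvement, so that noise
        # alone can't clear the bar.
        return p < alpha and rel_delta > 0.01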



> Is that it's literally what we software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

I don't think that is what it is saying. It is saying you would write one particular optimization (your hypothesis), and then you would run the experiment (measuring speed-up) multiple times until you see a good number.

It's fine to keep trying more optimizations and use the ones that have a genuine speedup.

Of course the real world is a lot more nuanced -- oftentimes measuring the performance speedup involves a hypothesis as well ("Does this change to the allocator improve network packet transmission performance?"). You might find that it does not, but then run the same change on disk IO tests to see if it helps that case. That is presumably okay too if you're careful.


"Multiple times" doesn't have to mean "no modifications". Suppose the software is currently on version A. You think that changing it to a version B might make it more performant, so you implement and profile it. You find no difference, so you figure that your B implementation isn't good enough, and write a slight variation B', perhaps moving around some loops or function calls. If that makes no difference, you keep writing variations B'', B''', B'''', etc., until one of them finally comes out faster than version A. You finally declare that version B (when properly implemented) is better than version A, when you've really just tried a lot more samples.


Well, it does mean "no modifications" to the hypothesis, the hypothesis being about the performance of code A and B. Code B' would be a change.

It's just semantics, but the point is that the article wasn't saying the same thing OP was worried about. There's nothing wrong with testing B, B', B'', etc. until you find a significant performance improvement. You just wouldn't test B several times and take the last set of data when it looks good. Almost goes without saying really.


Sure, it may not be precise repetition, but my idea here is that none of B', B'', etc. are really different than B (they may even compile down to the exact same bytecode), they're just the same thing but written differently. And in fact, none of these are really faster than A, even if they're all "changes". But it's the same issue as any other form of p-hacking, where you keep trying more and more trivial B-variations until you eventually get the result that you're looking for, by random chance. (Cf. the example in xkcd 882, which does change the experimental protocol each time, but only trivially.)

There is, in fact, "something wrong" with this, which is what GP was pointing out. It's literally covered under "Playing with multiple comparisons" in TFA.

(Personally, to combat this, I've ignored the fancy p-values and resorted to the eyeball test of whether it very consistently produces a noticeable speedup.)


Why is this bad for you? You're optimizing software, not trying to describe reality. Monte Carlo and Drunkard's Walk are fine.


You're churning the user experience for no reason. Maybe constant optimization churn is one of the reasons why UIs are so bad.


Perf, though? If a perf optimization changes the UI noticeably other than by making it smoother or otherwise less janky, someone is lying to someone about what "performance" means. Likely though that be, we needn't embarrass ourselves by following the sad example.

No, UIs churn because when they get good and stay that way, PMs start worrying no one will remember what they're for. Cf. 90% of UI changes in iOS since about version 12.


I thought languages such as Rust, and flamegraphs, etc. were supposed to help us avoid doing all this testing and optimization, right? I use the built-in analysis tools that come with cargo and whatever my OS provides, plus tools like Cutter or other reverse-engineering tools. Even in Python I use the default or standard profiling and optimization tools. I sometimes wonder if I'm not doing enough -- shouldn't the default, recommended tools cover most edge cases and performance cases?
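
Those tools mostly tell you where the time goes; they don't by themselves settle whether a change is a real speed-up, and for that you still need repeated measurements. A minimal standard-library sketch in Python (cProfile to locate hotspots, timeit to compare; old_impl and new_impl are placeholder functions, not anything from the thread):

    # cProfile shows where the time goes; timeit gives repeated measurements
    # you can actually compare. old_impl/new_impl are placeholder functions.
    import cProfile
    import timeit

    def old_impl():
        return sum(i * i for i in range(10_000))

    def new_impl():
        return sum(map(lambda i: i * i, range(10_000)))

    cProfile.run("old_impl()")                        # where does the time go?
    old_t = min(timeit.repeat(old_impl, number=1_000, repeat=5))
    new_t = min(timeit.repeat(new_impl, number=1_000, repeat=5))
    print(f"old: {old_t:.4f}s  new: {new_t:.4f}s  (best of 5 x 1000 calls)")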


Yeah!

And software ultimately fails at perfect composability. So if you add code that purports to be an optimization then that code most likely makes it harder to add other optimizations.

Not to mention bugs. Security bugs, even.


Heck, even the AI doesn't start with security by default, judging by the models I have tested. It's really, really weird.


Well, what is the test you are using to measure performance? Maybe the optimizations help performance in some cases and hurt performance in others... your test might not fully match all real-world workloads.


These seem like two different things. Testing many different optimizations is not the same experiment; it's many different experiments. The SE equivalent of the practice being described would be repeatedly benchmarking code without making any changes and reporting results only from the favorable runs.


Doesn’t matter if it’s the same experiment or not.

Say I’m after p<0.05. That means that if I try 40 different purported optimizations that are all actually neutral duds, one of them will seem like a speedup and one of them will seem like a slowdown, on average.
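
Back-of-the-envelope check of that arithmetic (assuming 40 independent two-sided tests at alpha = 0.05 on truly neutral patches):

    n_duds, alpha = 40, 0.05
    expected_false_hits = n_duds * alpha      # 2.0 expected false positives
    # A two-sided test splits those roughly evenly between apparent speedups
    # and apparent slowdowns: about one of each, on average.
    print(expected_false_hits / 2, "false speedups,",
          expected_false_hits / 2, "false slowdowns")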


That's not p-hacking. That's just the nature of p-values. P-hacking is when you do things to make a particular experiment more likely to show as a success.


There's another cheeky example of this where you select a pseudo-random seed that makes your result significant. I have a personal seed, I use it in every piece of research that uses random number generation. It keeps me honest!


what they’re referring to might be better put as applying a patch once and then running the benchmark 500 times until you get a run that's better than baseline for some reason

which is understandably a bit more loony


Nah it could be 20 different patches.


How can I do this in Python? What modules?



