It's a conference paper, from EuroSciPy 2013, distributed through a preprint service. There's no expectation that it will be a high-quality paper; rather, it's of appropriate quality for where and how it was published.
Checking it now, it gives a reproducible way to download the specific packages used, and the benchmark framework. The actual code benchmarked is fatghol, from https://code.google.com/p/fatghol/ . There's also a link to a preprint describing the construction algorithm, at http://arxiv.org/pdf/1202.1820v2.pdf .
What you propose is an unrealistic expectation, and only possible for people with lots of money and time.
Instead, in real life what happens is people do A, and publish A, then do B (building on A), and publish B, then do C (building on B) and publish C. There's a trail of work backing up the final publication. It makes no sense for publication Q to revisit all of A-P, nor for the author to wait until Z before finally publishing everything. I also think knowledge transfer would be lower since someone interested in this paper's conclusions about the available documentation for the different Pythons (EuroSciPy is not a graph theory specialist conference) would almost certainly not be interested in the algorithm generation details.
You do realize LINPACK is the "gold standard" benchmark used to rank the top 500 supercomputers, right? And all it does is solve A x = b. In any case, performance suites like SPEC MPI still need to evaluate the individual benchmarks before assembling them into a suite. Even if you require a suite for something to be meaningful to you, this could be seen as a first step toward building such a meaningful suite.
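For concreteness, a LINPACK-style measurement reduces to roughly the sketch below. It uses NumPy's dense solver as a stand-in for the actual LINPACK routines; the 2/3 n^3 flop count for LU factorization is the standard accounting, but this is an illustration of the idea, not the official benchmark.

```python
import time
import numpy as np

def linpack_style_bench(n=1000, seed=0):
    """Sketch of a LINPACK-style measurement: time the solve of A x = b
    for a random dense system and report an approximate flop rate
    (LU factorization costs about 2/3 * n^3 floating-point operations)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)       # stand-in for the LINPACK solve
    elapsed = time.perf_counter() - t0

    flops = (2.0 / 3.0) * n ** 3
    # Relative residual, a basic sanity check on the computed solution.
    residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
    return elapsed, flops / elapsed / 1e9, residual

elapsed, gflops, residual = linpack_style_bench()
print(f"{elapsed:.3f} s, approx {gflops:.2f} GFLOP/s, residual {residual:.1e}")
```

The point is how small the kernel is: one dense solve, timed, converted to a flop rate. That a single routine this simple is nonetheless used to rank supercomputers is exactly why a single-program benchmark can still be meaningful.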
It appears to me, therefore, that you are being needlessly harsh and critical.
I don't know about EuroSciPy 2013. I guess from your comment that such a conference does not require very high quality submissions.
It is typically not good style to simply say "here is a repository which contains the benchmark code". That is necessary, but not sufficient. (Although, I will say many papers do not include any link to where code can be found, so this was a distinct advantage of this paper.)
There's no need to regurgitate all previous work, but a bit more than a reference is extremely beneficial to legibility and allows for emphasis on particular aspects of what will be measured.
My problem with the paper is my answer to the following question, "What can I conclude from the paper?" What I gathered was approximately the following:
1. The author has a library for computing homologies. The abstract method for their computation is referenced in a (peer reviewed? published?) paper. The library is linked to, though no particular version is mentioned. (Can we really call it reproducible then?)
2. The author has given a very brief overview of the stages of the FatGHoL program, two of which are relevant to the benchmark. The author does not discuss the structure of the objects as implemented, so I must view it as a black box, unless I read source code.
3. The author, in a few sentences, summarizes (but does not delve into) the few very high-level data structures used.
4. The author spends the rest of the paper showing CPU time and memory graphs.
5. The author makes conclusions from the data, with sometimes plausible explanations.
There is no outline as to what is actually being tested, except this black box library. There are no code samples as you would typically see in a survey or conference paper. As a reader, I've at best concluded, "a subset of some version of FatGHoL has the following time and space measurements for a few input parameters." Was this the conclusion the author wanted me to have?
But note the abstract says the tests "[are] an opportunity for every Python runtime to prove its strength in optimization." Is this true? The author has not even remotely convinced me that the code being run even exercises relevant optimization capabilities.
I don't think adding up to one or two pages more talking about these things would have cost excessive time or money on the author's part.
The unfortunate bit is that elsewhere, on HN and Reddit, people are now linking to this paper as almost the definitive resource for comparing the performance of Nuitka versus other implementations.
Lastly, I do realize LINPACK is among the benchmarks used for supercomputers (even though LAPACK would probably be more appropriate, and sometimes is used). I am very well aware of the details of the benchmark, having written an equivalent version myself.
Quoting from the web site: "The annual EuroSciPy Conferences allows participants from academic, commercial, and governmental organizations to: showcase their latest Scientific Python projects, learn from skilled users and developers, and collaborate on code development." It isn't a conference which requires rigorous submissions.
You say "very high quality". I used "rigorous" because quality has many dimensions. I believe people go to EuroSciPy in part to learn which other tools exist, and to learn from the experience of others. This paper appears to have that audience in mind. It's partially an experience paper, and discusses things like available documentation and the stage of development of the tools (eg, Falcon is in early development, and crashed on the test code).
If someone came to the conference, interested in performance (which is most of the audience) but not in NumPy (which is a smaller number), then this is a high quality paper for this type of conference for guiding them on which Python implementations to prioritize, even if the benchmark per se were ignored.
You quoted where the abstract said "an opportunity for every Python runtime to prove its strength in optimization". I can see how that might be interpreted as a very broad benchmark. But it earlier mentioned "Python library FatGHol ... moduli space of Riemann surfaces" and later says "This paper compares the results and experiences from running FatGHol with different Python runtimes", so I think you're reading too much into that quote.
My code is also non-numeric scientific code. It's extremely unlikely that I would understand the algorithm in that code, or that the mix of instructions would match my code, so I would skip the extra details as irrelevant to my interests. Whereas the other points, like how Nuitka's claim that it "create[s] the most efficient native code from this. This means to be fast with the basic Python object handling." has at least one real-world counter-example, and how PyPy can use a lot of memory, do affect how I weigh the available options.
Do you seriously think that one or two pages more would have had a significant effect on the comments on HN or Reddit? For that matter, I see eight comments total on HN about the paper, including mine and your three. I don't see people (on HN) regarding it as a 'definitive resource', only as a resource. I don't read Reddit, so I can't say anything about what's going on there, but surely complaining here about Reddit doesn't help.
Also, the paper was 4 1/2 pages long. You want the author to spend about 30% longer to write the paper, which I think is excessive.
"What algorithms were used"? The papers says "The code used to install the software and run the experiments is available on GitHub at https://github.com/riccardomurri/python-runtimes-shootout "