Ok I am confused, 60k hits in a day? What brought down the website? The 72MB file size, network congestion, or the 60k hits? Even with authentication for the download, what could bring the system down? I have handled more traffic on an RPi with a 100MBps connection. I really don't get it.
Love the irony on that!
Which makes me wonder how much valuable content and research is locked away on technologically ancient servers in Cambridge and Oxford.
More modern and up to date than you'd think :) Some departments here in Oxford have a server life-cycle of 3-5 years. It's just that nobody bothers to plan for higher-than-expected volumes of traffic, unfortunately (a practice that can be extrapolated to many things in academia).
If your server is configured for a low number of concurrent connections, it can easily seem swamped if a small number of people are concurrently downloading a large file slowly. All that's happening is that it's not accept()ing new connections until existing ones finish.
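A minimal sketch of that failure mode, assuming a plain blocking single-threaded TCP server (the backlog value and the `send_file` helper are hypothetical, not anything Cambridge actually runs):

```python
import socket

backlog = 2  # deliberately tiny, to illustrate the point

# A blocking server: the kernel queues at most `backlog` connections that
# have not yet been accept()ed; anything beyond that waits or is refused.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))   # ephemeral port, just for the sketch
srv.listen(backlog)

# In the serving loop, accept() is only reached again once the previous
# response has been fully written -- so one client trickling a 72 MB file
# at dial-up speed stalls every connection queued behind it:
#
#   while True:
#       conn, _ = srv.accept()
#       send_file(conn, "thesis.pdf")   # hypothetical slow, blocking send
#       conn.close()

srv.close()
```

The fix is the usual one: raise the backlog and serve connections concurrently (threads, an event loop, or a reverse proxy in front) so a slow download never blocks accept().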
Heh, I don't know what the current hardware is for the CUDL DSpace repository, but having helped upgrade it a few years ago, I can't say the software is the fastest.
It's a Java application with XSLT in the page-rendering pipeline, so not exactly optimised for speed.
If you're hitting a single box, this seems perfectly possible - especially if the server or the network is poorly configured. This is what CDNs are for.
This isn't Joe's blog running from a desktop found in a dumpster. It's a university with actual infrastructure. They should be able to handle this. Registration is probably more load.
A 72MB file being served 500,000 times over a 24hr period is 3-4Gb/sec.
The department probably has one or more web servers serving content off a network drive. Your average "rack of network storage" won't even blink at that.
There was probably a gigabit switch somewhere that was acting as a bottleneck, or the web server was simply misconfigured for this task.
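As a sanity check on the figures in this subthread (taking the 72MB file and the 500,000-downloads-per-day number quoted above at face value):

```python
# Back-of-the-envelope check of the quoted bandwidth figure.
file_bytes = 72 * 1024 ** 2      # 72 MB (MiB here; MB vs MiB barely matters)
downloads = 500_000              # per-day figure from the parent comment
seconds = 24 * 3600              # one day

gbps = file_bytes * downloads * 8 / seconds / 1e9
print(f"average rate: {gbps:.1f} Gb/s")  # ~3.5 Gb/s sustained
```

About 3.5 Gb/s sustained, so a single gigabit switch or uplink anywhere on the path would indeed be saturated several times over.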
With a half-life of interest of 3 days and a 100 MBps connection, the average download time would be 3 hours. That would indeed seem unusual today. But 30k page hits spaced out over 60 hours is not too much of a burden on top of that.
Guilty as charged, as well. I download just about any programming manual/book that's free in PDF or similar form and I've probably read maybe 10% of them.
Was just thinking, "This sounds like a job for IPFS!" I spent most of Sunday reading about and playing with IPFS. Great idea - I hope it gets some traction.
But even so, while 1966 was indeed early for "regular" use of fax - the first "user-friendly" Xerox fax machines hit the market around then - the first transmission of facsimiles of images dates to the 1840s, and the first fax that used methods similar to "modern" fax machines, scanning line by line (the "scanning phototelegraph"), dates to 1880. Commercial fax machines have been around since around 1900.
So it would indeed be possible.
One weird and wonderful product of early faxes (fax over radio predates "wired" fax machines): Finch Facsimile machines [1] were used to transmit "newspapers" via AM radio in the 1930s, which were then printed on thermal paper in the home of the subscriber.
From [1]: "Six hours overnight was enough time to print a six page two column news bulletin, delivered in time for breakfast."
I think the point was that he entered the information in digitally, so it should have been easier to provide a digital copy rather than scan actual printed pages.
I quoted UNIX to show how primitive technology was at the time, rather than to say he could've possibly used a computer. With ALS, using a computer would have been as hard as writing, I would presume.
Not only did he type it with a typewriter, all the math notation in the thesis is hand-written! There was no math typesetting available in 1966 outside major printing houses. (Or, I guess, any reasonably priced option for getting your PhD thesis properly typeset and printed at all, math or not.)
It's possible Hawking was able to type/write his thesis. It was published in 1965, two years after his ALS diagnosis. Wikipedia says he didn't begin to use crutches until the late 1960s.
"I've gone through early Chaplin work. I've seen Metropolis 17 times."
"I'm sure there's something"
(grabs his friend)
"You don't understand, Paul. I've been reading Shannon. A Mathematical Theory of Communication. I've run out. I've taken to begging strangers for a fix."
"That bad?"
(Guilty), "I just... " (resigned) "I just asked someone to put up a torrent of Stephen Hawking's Ph.D. thesis..."
I just have to ask why this is news? Is this really something newsworthy?
Cambridge's network probably isn't hardened against traffic spikes, since it doesn't normally see that much traffic. But still, it isn't 1995. They should have some form of load balancing or distributed/clustered web/data/file systems to handle temporary spikes in traffic and data requests. Serving simple static data isn't something that should "crash the site".
The technology behind the repository itself is not great (DSpace [1]). Add to that the fact that it is not actually built to handle this many requests, and scaling quickly is out of the question too because of the server setup.
Even without issues, it often felt a bit sluggish when serving locally. The pages are quite large, and the whole pipeline from content to webpage is rather tedious (Java, XSLT -> HTML).
It shouldn't have happened - but I assumed it would.
Suggestions on getting this in audio form? I guess it requires transcribing the handwritten parts. The Chrome OCR fails there. Is there a better one?
Sample:
This implies that the universe is spatially homogeneous and isotropic
since there is no direction defined in the 3- space orthogonal to Ua.
In this universe we consider small perturbations of the motion
of tl1e fluid and of the '.ifeyl tensore 1
Ne neglect products of small
quantities and perform derivatives with respect to the undisturbed
metric. Since all the quantities we are interested in with the
exception of the scalars, µ, ~' e have unperturbed value zero, we
avoid perturbations that merely represent coordinate transformation
and have no physical significance.
To the first order the equations (1) - (4) and (7) - (9) are
Stephen Hawking and his dissertation are high-profile as these things go. The NPR piece mentions other popular items generating 100s of requests per month. I've frequently run across items with lifetime request counts in the double or triple digits (and suspect I doubled the count on one particular item).
More often, though, the truth is that this material simply isn't available online. There are several thesis repositories (either Michigan State or University of Michigan are one, as I recall), and I can frequently turn up a shelf reference via WorldCat ... somewhere.
But there's work from surprisingly prominent names in numerous fields that simply isn't available in electronic format. The worst case is for materials from roughly 1924 - 1980: too late to be out of copyright, and too early to have been composed in, or converted to, digital formats (and 1980 is an early cut-off date for that, though it's when material seems to start appearing in bulk).
This includes PhD dissertations, Masters theses, and numerous academic or other writings, often including government documents not under copyright. Thankfully with Sci-Hub, actual published academic journal articles can be found, freely, with a very high success rate. Particularly painful for me are popular magazine and newspaper items, for which even the indices are very frequently locked behind site-restricted or affiliate-only access.
The time-and-effort differential of being able to look something up online, vs. travelling many miles to a facility for access, is tremendous. And it absolutely stops a great many incidental queries dead.
See Rick Falkvinge's excellent rant about how the KRACK vulnerability was blocked behind corporate-only paywalls for over a decade:
Note that the issues here are twofold. One element is the task of scanning and making available documents, and organising the results in a manner useful for search.
But much of the harm is the direct consequence of the present regime of copyright and paid access to information; on top of that, the perverse incentives of advertising-backed media and media manipulation have created a media regime that is actively harmful to society.
I'd really like to see the elements of this addressed.