> How the fuck does ext4 [...] still have problems with this?
I'm increasingly realising that most of the code out there isn't tested anywhere near as well as we think it is. Most programmers stop when the feature works, not when the feature is bulletproof.
I've been writing a database storage engine recently. It should be well behaved even in the event of sudden power loss. I'm doing "the obvious thing" to test it: building a fake, in-memory filesystem which is configured to randomly fail sometimes when write/fsync commands are issued (leaving a spec-compliant mess).
I can't be the first person to try this, but it feels like I'm walking over virgin ground. I can't find a clear definition of what guarantees modern block devices provide, either directly or via linux syscalls. And I can't find any rust crates for doing this programmatically. And googling it, it looks like many large "professional" databases misunderstood all this and used fsync wrong until recently. It's like there's a tiny corner of software that works reliably. And then everything else - which breaks utterly when the clock is set to 2038, because there aren't any tests and nobody tried it.
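For anyone curious, the core of the approach is small. Here's a minimal sketch of what I mean - an in-memory "file" that injects a failure after a configurable number of operations, and models the key property that un-fsynced writes vanish on power loss. All the names (FaultFile etc.) are made up for illustration, not from any real crate:

```rust
use std::io;

// Hypothetical fault-injecting in-memory file for crash testing.
struct FaultFile {
    data: Vec<u8>,       // bytes considered durable ("on disk")
    pending: Vec<u8>,    // written but not yet fsynced
    ops_until_fail: u32, // inject an I/O error once this hits zero
}

impl FaultFile {
    fn new(ops_until_fail: u32) -> Self {
        FaultFile { data: Vec::new(), pending: Vec::new(), ops_until_fail }
    }

    // Count down to the injected failure point.
    fn tick(&mut self) -> io::Result<()> {
        if self.ops_until_fail == 0 {
            return Err(io::Error::new(io::ErrorKind::Other, "injected I/O failure"));
        }
        self.ops_until_fail -= 1;
        Ok(())
    }

    fn write(&mut self, buf: &[u8]) -> io::Result<()> {
        self.tick()?;
        self.pending.extend_from_slice(buf);
        Ok(())
    }

    fn fsync(&mut self) -> io::Result<()> {
        self.tick()?;
        // Only a successful fsync makes pending bytes durable.
        self.data.extend_from_slice(&self.pending);
        self.pending.clear();
        Ok(())
    }

    // Simulated power loss: everything un-synced is gone.
    fn simulate_power_loss(&mut self) -> &[u8] {
        self.pending.clear();
        &self.data
    }
}
```

The test harness then runs the storage engine against this, and after every injected failure checks that whatever `simulate_power_loss` leaves behind still parses as a valid database. (A real version would also model torn/reordered sector writes, which is where it gets nasty.)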
I half remember a quote from Carmack after he ran some new analysis tools on the old quake source code. He said that after realising how many bugs there are in modern software, he's amazed that computers boot at all.
Dan Luu has a post (https://danluu.com/filesystem-errors/), which covers some of the same ground, and links to papers with more information on the failure modes of file systems in the face of errors from the underlying block device. Prabhakaran, et al. (https://research.cs.wisc.edu/wind/Publications/iron-sosp05.p...), did a bunch of filesystem testing (in 2005!), and their paper includes discussion on how to generate "realistic" filesystem errors, as well as discussion of how the then state-of-the-art filesystems (ext3, Reiser (!), and JFS) perform in the face of these errors.
I'm unaware of any research newer than Dan Luu's post on filesystem error handling.
I keep that link handy too! I wish there were newer research to quote, but I also don’t want to force anyone to do that job. It must be pretty depressing to rip all the bandaids off and really contemplate how bad the situation is.
Kudos to you! But you're right, you're not the first. SQLite has extensive testing including out-of-memory, I/O error, crash and power loss, fuzzing etc. And 100% branch test coverage.
3.2. I/O Error Testing
I/O error testing seeks to verify that SQLite responds sanely to failed I/O operations. I/O errors might result from a full disk drive, malfunctioning disk hardware, network outages when using a network file system, system configuration or permission changes that occur in the middle of an SQL operation, or other hardware or operating system malfunctions. Whatever the cause, it is important that SQLite be able to respond correctly to these errors and I/O error testing seeks to verify that it does.
I/O error testing is similar in concept to OOM testing; I/O errors are simulated and checks are made to verify that SQLite responds correctly to the simulated errors. I/O errors are simulated in both the TCL and TH3 test harnesses by inserting a new Virtual File System object that is specially rigged to simulate an I/O error after a set number of I/O operations. As with OOM error testing, the I/O error simulators can be set to fail just once, or to fail continuously after the first failure. Tests are run in a loop, slowly increasing the point of failure until the test case runs to completion without error. The loop is run twice, once with the I/O error simulator set to simulate only a single failure and a second time with it set to fail all I/O operations after the first failure.
In I/O error tests, after the I/O error simulation failure mechanism is disabled, the database is examined using PRAGMA integrity_check to make sure that the I/O error has not introduced database corruption.
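The escalating-failure-point loop they describe is a nice reusable pattern. A generic sketch (hypothetical names; this is the shape of the idea, not SQLite's actual harness):

```rust
// Rerun a test case with the injected failure point moved one
// operation later each iteration, until the whole test runs to
// completion with no injected failure.
//
// `run_test(fail_after, persistent)` stands in for running one test
// case against a fault-injecting VFS: it fails the I/O operation
// numbered `fail_after` (and, if `persistent`, every one after it),
// returning Err if the test hit an injected error.
fn run_until_clean<F>(mut run_test: F, persistent: bool)
where
    F: FnMut(u32, bool) -> Result<(), String>,
{
    let mut fail_after = 1;
    loop {
        match run_test(fail_after, persistent) {
            // Failure point is now past the end of the test: done.
            Ok(()) => break,
            Err(_) => {
                // A real harness would disable injection here and run
                // an integrity check on the database before retrying.
                fail_after += 1;
            }
        }
    }
}
```

Per the docs above, SQLite runs this loop twice: once with a single transient failure, once with all I/O failing after the first error.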
I can still get SQLite to trivially corrupt indexes and table data by running it on top of NFS and dropping the network. I wouldn't put much money on this statement on their website.
Yeah; I mentally put SQLite in the tiny corner of software that works well.
Every time I play video games and see that "Don't turn off your console when you see this icon" I die a little inside. We've known how to write data atomically for decades. I find it pretty depressing that most video games just give up and ask the user to make sure they don't turn their console off at inopportune moments.
And I don't even blame the game developers. Modern operating systems don't bother giving userland any simple & decent APIs for writing files atomically. Urgh.
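For reference, the decades-old recipe on POSIX systems is write-to-temp-then-rename, with fsyncs on both the file and its directory. A sketch of the pattern (Unix-only as written - directory fsync doesn't work like this on Windows - and real code would want a unique temp name and cleanup on error):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Classic POSIX atomic-replace recipe: write a temp file, fsync it,
// rename() over the target (rename is atomic within a filesystem),
// then fsync the directory so the rename itself survives power loss.
fn write_atomic(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // flush file data + metadata to disk
    fs::rename(&tmp, path)?; // readers see old contents or new, never a mix
    // Persist the directory entry; without this the rename can be
    // lost on power failure even though the file data was synced.
    // (Opening a directory for fsync like this is Unix-specific.)
    if let Some(dir) = path.parent() {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

A crash at any point leaves either the old file or the new one intact - exactly the guarantee the "don't turn off your console" icon is papering over.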
You may or may not be surprised at how many drives acknowledge a write, but instead of committing it to the physical storage put it in a cache, and then don't have enough power reserves to flush the cache on power failure... hard to design software around hardware that lies.
Look at yourself in a mirror. Do you even comprehend yourself? You're an incredibly big bunch of cells, of which only very few have any chance of continuing on. If you're male, it's not even real continuation, it's just part of a molecule.
We're hardwired to seek out and go with the most superficial of models, I guess because that's the most efficient way to go about in life.
> Look yourself in a mirror. Do you even comprehend yourself? You're an incredibly big bunch of cells
Sure; but that's the exact reason drug discovery is so difficult. If we understood the human body in its entirety like we understand computers, we could probably cure cancer & aging.
Our capacity to write correct software depends entirely on being able to build mental models of how the machine works. The deep stack of buggy crap that we just take for granted these days makes software development harder. The less understandable and the less deterministic our computers, the worse products we build. And the less effective craftsmen we become.