https://dolthub.com is the cool kid right now. There is also Pachyderm, Git LFS, IPFS...

wenc · on April 28, 2020

The utility of version-controlling production-sized data (not sample training data, and as opposed to code) is something I'm having trouble grasping. I may be missing something here, so please enlighten me.

It seems to me that to time-travel in data you almost need to store the write-ahead log of database transactions and be able to replay it. Debezium captures the CDC information, but it's an infrastructure-level tool rather than a version control tool.

In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.

But these are infrastructure level approaches. I'm not sure that it's a problem for a version control tool.
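For anyone who hasn't seen the timestamp-column trick, here is a minimal sketch using Python's built-in `sqlite3`. The table name `prices`, the column `written_at`, and the helper `as_of` are all made up for illustration; the only assumption is that the ETL appends rows instead of updating them in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        sku        TEXT NOT NULL,
        price      REAL NOT NULL,
        written_at TEXT NOT NULL  -- ISO date of the ETL run that wrote the row
    )
""")
# Append-only history: every ETL run adds rows, nothing is overwritten.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("widget", 10.0, "2020-04-26"),
        ("gadget", 5.0, "2020-04-26"),
        ("widget", 12.0, "2020-04-27"),
        ("widget", 11.0, "2020-04-28"),
    ],
)

def as_of(conn, cutoff):
    """Return the latest (sku, price) per sku written on or before cutoff."""
    query = """
        SELECT p.sku, p.price
        FROM prices AS p
        JOIN (
            SELECT sku, MAX(written_at) AS written_at
            FROM prices
            WHERE written_at <= ?
            GROUP BY sku
        ) AS latest
          ON p.sku = latest.sku AND p.written_at = latest.written_at
        ORDER BY p.sku
    """
    return conn.execute(query, (cutoff,)).fetchall()

snapshot_27 = as_of(conn, "2020-04-27")  # state after the 27 April ETL run
snapshot_28 = as_of(conn, "2020-04-28")  # state after the 28 April ETL run
```

Rolling back to an ETL point is then just a matter of picking the cutoff, at the cost of keeping every historical row in the table.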

timsehn · on April 28, 2020

Tim, CEO of Liquidata, the company that built Dolt and DoltHub, here. This is how we store the version-controlled rows so that we get structural sharing across versions (i.e., 50M rows plus a one-row change becomes 50M+1 entries in the database, not 100M, with no need to replay logs):

https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...
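To illustrate the general idea of structural sharing (this is not Dolt's actual storage format, which the linked post describes), here is a deliberately simplified sketch with fixed-size, content-addressed chunks. `CHUNK_SIZE`, `commit`, and the in-memory `store` dict are hypothetical names for illustration only.

```python
import hashlib
import json

CHUNK_SIZE = 4  # rows per chunk; arbitrary value for the example

def commit(store, rows, size=CHUNK_SIZE):
    """Write rows as content-addressed chunks and return the chunk hashes."""
    hashes = []
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        payload = json.dumps(chunk).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        # A chunk with identical content is stored only once.
        store.setdefault(digest, chunk)
        hashes.append(digest)
    return hashes

store = {}

v1_rows = [[i, f"name-{i}"] for i in range(20)]
v1 = commit(store, v1_rows)
chunks_after_v1 = len(store)

# Second version: copy the table and change a single row.
v2_rows = [list(row) for row in v1_rows]
v2_rows[7][1] = "renamed"
v2 = commit(store, v2_rows)
chunks_after_v2 = len(store)

# Chunks referenced by both versions are shared, not duplicated.
shared = len(set(v1) & set(v2))
# A diff only needs to look inside chunks whose hashes differ.
changed = [i for i, (a, b) in enumerate(zip(v1, v2)) if a != b]
```

Two versions of a 20-row table end up as 6 stored chunks instead of 10, and the diff can skip every chunk whose hash matches. Fixed-size chunks fall apart as soon as a row is inserted in the middle, since every later boundary shifts, so a real system needs a smarter way to pick chunk boundaries.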

wenc · on April 28, 2020

Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.

Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?

timsehn · on April 28, 2020

You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.

Right now, Dolt can't easily be distributed (i.e., data must fit on one hard drive), so it's not meant for big data. It's more for data that humans interact with, like mapping tables or daily summary tables. Long term, if we can get some traction, we plan on building "big dolt", a distributed version that can scale to as big as you want.

wenc · on April 28, 2020

Ah now I understand!

So for most analytic workloads, typically a columnstore db is used due to the need for performance and advanced SQL features (windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. Which means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).

Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.

seddonm1 · on April 28, 2020

If you are working within the Apache Spark ecosystem you can use Delta Lake (https://delta.io/) to create 'merge' datasets which are transactional, versioned and allow time travel by both version number and timestamp.

jamesblonde · on April 29, 2020

Another alternative to Delta Lake is Apache Hudi, which also includes Bloom filters for indexing time-travel queries (to efficiently exclude any files given the supplied time constraint). Z-ordered indexing is not yet available in open-source Delta Lake, only in the Databricks version.

zachmu · on April 28, 2020

One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:

`SELECT * FROM dolt_diff_$table WHERE from_commit = '230sadfo98' AND to_commit = 'sadf9807sdf'`

jacques_chester · on April 28, 2020

> In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written".

Not quite, this is "transaction time". You also need "valid time" to be truly bitemporal. Recovering the database as of some point in time is not enough to answer questions like "when will this fact become false?" or "when did our belief about when it would become false change?", because you didn't preserve assertions about the time range over which the fact was held to be true.

In terms of implementations, ranges are better than double timestamps. They provide their own assertion of monotonicity and can be easily used in exclusion indices.

I found that Snodgrass's textbook was a good introduction to the concepts and it's available for free: https://www2.cs.arizona.edu/~rts/tdbbook.pdf
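To make the transaction-time versus valid-time distinction concrete, here is a small sketch in plain Python. The `Row` class, the `believed` helper, and the account data are invented for illustration; the half-open ranges stand in for what a range type would give you in a database.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

FOREVER = date.max  # open-ended upper bound

@dataclass(frozen=True)
class Row:
    key: str
    value: str
    valid_from: date  # valid time: when the fact holds (inclusive)
    valid_to: date    # valid time: when it stops holding (exclusive)
    tx_from: date     # transaction time: when we recorded this belief
    tx_to: date       # transaction time: when we superseded it (exclusive)

def believed(rows: List[Row], key: str, valid_on: date, known_on: date) -> Optional[str]:
    """What did we believe on known_on about the state of key on valid_on?"""
    for r in rows:
        if (r.key == key
                and r.valid_from <= valid_on < r.valid_to
                and r.tx_from <= known_on < r.tx_to):
            return r.value
    return None

rows = [
    # Recorded on 1 Jan: the plan runs for the whole of 2020.
    Row("acct-1", "basic", date(2020, 1, 1), date(2021, 1, 1),
        date(2020, 1, 1), date(2020, 3, 1)),
    # Corrected on 1 Mar: the plan actually ends on 1 Jul.
    Row("acct-1", "basic", date(2020, 1, 1), date(2020, 7, 1),
        date(2020, 3, 1), FOREVER),
]

# Same question about 1 Sep, asked from two points in transaction time.
in_february = believed(rows, "acct-1", date(2020, 9, 1), date(2020, 2, 1))
in_april = believed(rows, "acct-1", date(2020, 9, 1), date(2020, 4, 1))
# A date inside the corrected valid range is still covered.
may_as_of_april = believed(rows, "acct-1", date(2020, 5, 1), date(2020, 4, 1))
```

In February we believed the account would be on "basic" in September; by April we no longer did. A single "written at" column can replay the first row and then the second, but it cannot express the valid ranges that make the two answers differ.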

wenc · on April 28, 2020

Yes, you're correct -- an omission on my part. You need "valid time" (otherwise it's just "uni"-temporal modeling).

Thank you for the link to Snodgrass' book. I've not seen a formal book on temporal modeling in SQL before, so this is fascinating.

jacques_chester · on April 28, 2020

Glad I could help! The research seems to have puttered on for a while after this book was written, but appears to have fizzled out by around the turn of the millennium.

Some notion of bitemporalism showed up in SQL:2011, but somewhat constrained compared to what Snodgrass describes.

sgt101 · on April 28, 2020

I worry about retraining every day. Isn't that a flag that says "It hasn't learned a thing and actually I'm just improving my backfitting score"?

wenc · on April 28, 2020

Not really -- in many forecasting applications in fast-changing markets, it is fairly common to dynamically retrain your recursive model on a moving window of historical data in order to adapt to your current environment (with some regularization). The length of the window depends on how fast the market changes.

For these types of recursive model applications, you cannot just fit the model once and forget about it.
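As a toy illustration of moving-window retraining, here is a sketch that refits a one-parameter autoregressive model on the most recent `window_size` points before every forecast. The model choice, the `ridge` penalty, and the function names are assumptions for the example, not a description of any production setup.

```python
def fit_ar1(window, ridge=1.0):
    """Ridge-regularized least-squares fit of x[t] = phi * x[t-1]."""
    num = sum(prev * cur for prev, cur in zip(window, window[1:]))
    den = sum(prev * prev for prev in window[:-1]) + ridge
    return num / den

def rolling_forecasts(series, window_size, ridge=1.0):
    """Refit on the trailing window before each one-step-ahead forecast."""
    forecasts = []
    for t in range(window_size, len(series)):
        window = series[t - window_size:t]  # only data available before t
        phi = fit_ar1(window, ridge)
        forecasts.append(phi * series[t - 1])
    return forecasts

# ridge=0.0 keeps the arithmetic exact for this doubling series;
# a positive ridge shrinks phi toward zero and avoids division by zero.
series = [1.0, 2.0, 4.0, 8.0, 16.0]
forecasts = rolling_forecasts(series, window_size=3, ridge=0.0)
```

Each forecast only ever sees data from before the point it predicts, so the comparison against the actual value is always out of sample.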

slt2021 · on April 28, 2020

As long as it works well on out-of-sample data at deployment time, it is okay.

Until some major data drift happens, but you would notice it anyway.

sgt101 · on April 29, 2020

Honestly, I've heard people in Vegas tell me the same about their strategies vs. slots. Genuinely, if you have made money from this - well done, take it out now, congratulate yourself. If you haven't...

yanovskishai · on April 28, 2020

Thanks! There are indeed many new players in the data versioning space (DVC and Quilt are also probably worth mentioning).

I totally agree that data management problems are not just ML-related. But I personally think there are additional challenges in the space beyond version control for data, such as the whole area of data quality management and monitoring. I liked the analogy to DevOps: source versioning was a super critical problem to solve in software development, but it didn't stop there, with things like CI/CD. I believe we'll see a similar evolution in the data space.