This is by no means a perfect match for your requirements, but I'll share a CLI tool I built, called Dud[0]. At the very least it may spur some ideas.
Dud is meant to be a companion to SCM (e.g. Git) for large files. I was turned off of Git LFS after a couple failed attempts at using it for data science work. DVC[1] is an improvement in many ways, but it has some rough edges and serious performance issues[2].
With Dud I focused on speed and simplicity. To your three points above:
1) Dud can comfortably track datasets in the 100s of GBs. In practice, the bottleneck is your disk I/O speed.
2) Dud checks out binaries as links by default, so it's super fast to switch between commits.
3) Dud includes a means to build data pipelines -- think Makefiles with fewer footguns. Dud can detect when a stage's outputs are up to date and skip re-running it (see the sketch below).
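
To give a flavor of that last point, a pipeline stage is just a small YAML file. A hypothetical training stage might look something like this (simplified; the README has complete, exact examples):

    # train.yaml -- illustrative only; field names simplified
    command: python train.py
    inputs:
      data/train.csv: {}
    outputs:
      model.pkl: {}

If the inputs and outputs haven't changed since the last run, Dud skips the command.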
I hope this helps, and I'd be happy to chat about it.
I'd be curious to see if you've tried git-annex, I use it instead of git-lfs when I need to manage big binary blobs. It does the same trick with a "check out" being a mere symlink.
I haven't used it, no. Around the time Git LFS was released, my read from the community was that Git LFS was favored to supersede git-annex, so I focused my time investigating Git LFS. Given that git-annex is still alive and well, I may have discounted it too quickly :) Maybe I'll revisit it in the future. Thanks for sharing!
Neither is favored; git-annex solves problems that Git LFS doesn't even try to address (distributed big files), at the cost of extra complexity.
Git LFS is intended more for a centralized "big repo" workflow, whereas git-annex's canonical usage is as a personal distributed backup system, but both can stretch into other domains.
In this case git-annex seems to have a feature that git LFS doesn't have that would be useful to you.
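
For a concrete picture, the day-to-day flow looks roughly like this (exact output and defaults vary; the git-annex walkthrough covers the details):

    $ git annex init
    $ git annex add big.bin        # big.bin becomes a symlink into .git/annex/objects
    $ git commit -m "add big.bin"
    $ git annex drop big.bin       # free local space; refuses unless enough copies exist elsewhere
    $ git annex get big.bin        # pull the content back on demand

Switching branches only rewrites symlinks, so checkouts stay fast even with huge files in the history.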
[0]: https://github.com/kevin-hanselman/dud
[1]: https://dvc.org
[2]: https://github.com/kevin-hanselman/dud#concrete-differences-...