
Thanks!

Yes, we have around 6,000+ assertions in TigerBeetle. A few of these were overtight, hence some of the crashes. But those were the assertions doing their job, alerting us that we needed to adjust our mental model, which we did.

Otherwise, apart from a small correctness bug in an internal testing feature we added (only in our Java client and only for Jepsen to facilitate the audit) there was only one correctness bug found by Jepsen, and it didn’t affect durability. We’ve written about it here: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...

Finally, to be fair, TigerBeetle can (and is tested to) survive more faults than Postgres can, since it was designed with an explicit storage fault model, using research that was not available when Postgres was released in ‘96. TB’s fault models are further tested with Deterministic Simulation Testing, and we use techniques such as static memory allocation, following NASA’s Power of Ten Rules for Safety-Critical Code. There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

For more on this, see the section in Kyle’s report on helical fault injection (most Raft and Paxos implementations were not designed to survive this) as well as a talk we gave at QCon London: https://m.youtube.com/watch?v=_jfOk4L7CiY



Hi Joran,

I have followed TigerBeetle with interest for a while, and thank you for your inspirational work and informative presentations.

However, you have stated on several occasions that the lack of memory safety in Zig is not a concern, since you don't dynamically allocate memory after startup. Yet one of the defects uncovered here (#2435) was caused by dereferencing an uninitialized pointer. I find this pretty concerning, so I wonder whether there is something you will be doing differently to eliminate similar bugs going forward?


TigerBeetle uses ReleaseSafe optimization mode, which means that the pointer was in fact initialized to 0xaaaaaaaaaaaaaaaa. Since nothing is mapped to this address, it reliably causes a segfault. This is equivalent to an assertion failure.
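As a hedged illustration (a minimal sketch, not TigerBeetle code) of what that safety mode buys you, assuming the 0xAA poisoning described above:

    const std = @import("std");

    pub fn main() void {
        // In safe builds (Debug/ReleaseSafe), `undefined` memory is filled with
        // the 0xAA byte pattern, so an uninitialized pointer lands on an
        // unmapped address rather than on plausible-looking garbage.
        const p: *u64 = undefined;
        _ = p;
        // Dereferencing p here (p.*) would segfault immediately, which is the
        // "equivalent to an assertion failure" behavior described above.
        std.debug.print("undefined pointers are poisoned in safe builds\n", .{});
    }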


That’s good to hear! Thanks for the clarification.


Note that that's a bug in the client, in the Zig-Java FFI code, which is inherently unsafe. We'd likely have made a similar bug in Rust.

Which is, yeah, one of the bigger technical challenges for us --- we ship language-native libraries for Go, Node, Java, C#, Python, and Rust, and, like in the Tolstoy novel, each one is peculiar in its own way. What's worse, they aren't directly covered by our deterministic simulator. That's one of the major reasons why we invest in full-system simulation with Jepsen, Antithesis, and Vortex (https://tigerbeetle.com/blog/2025-02-13-a-descent-into-the-v...). We are also toying with the idea of generating _more_ of that code, so there's less room for human error. Maybe one day we'll even do fully native clients (e.g., pure Java, pure Go), but we are not there yet.

One super-specific in-progress thing: at the moment, the _bulk_ of the client testing is duplicated per client, and the _bulk_ of that testing is example-based. Building a simulator/workload is a lot of work, and duplicating it for each client is unreasonable. What we want to do here is use a multi-process architecture, where a single Zig process generates the workloads and interesting sequences of commands for clients, and then in each client we implement just a tiny "interpreter" for the workload language, getting a test suite for free. This is still WIP though!
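To make that concrete, here is a rough, hypothetical sketch of what such a per-client "interpreter" loop could look like (the command names and client calls are made up for illustration):

    const std = @import("std");

    pub fn main() void {
        // In the real design, these lines would arrive from the single Zig
        // generator process; here they are hardcoded to keep the sketch small.
        const script = [_][]const u8{ "create_account 1", "transfer 1 2 100" };
        for (script) |line| {
            var tokens = std.mem.tokenizeScalar(u8, line, ' ');
            const command = tokens.next() orelse continue;
            if (std.mem.eql(u8, command, "create_account")) {
                std.debug.print("would call the client's create-accounts API\n", .{});
            } else if (std.mem.eql(u8, command, "transfer")) {
                std.debug.print("would call the client's create-transfers API\n", .{});
            }
        }
    }

The idea is that all the interesting decision-making stays in the one generator process, and each language binding only has to map a handful of commands onto its own API.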

Regarding the broader memory safety question in the database: we did have a couple of memory safety bugs, which were caught early in testing. We did have one very bad aliasing bug, which would have been totally prevented by Rust, and which slipped through the bulk of our testing and into the release (it was caught in testing _after_ it was introduced): https://github.com/tigerbeetle/tigerbeetle/pull/2774. Notably, while the bug was bad enough to completely mess up our internal data structure, it was immediately caught by an assert down the line, and downgraded from a correctness issue to a small availability issue (just restarting the replica would fix it). Curiously, the root cause for that bug was that we over-complicated our code. Long before the actual bug, we felt uneasy about the data structure in question and thought about refactoring it away (that refactor is underway; hilariously, it looks like just removing the thing, without any other code changes, improves performance!).

So, on balance, yeah, Rust would've prevented a small number of easy bugs, and one gnarly bug, but then the entire thing would have to look completely different, as the architecture of TigerBeetle is not at all Rust-friendly. I'd be curious to see someone replicate the single-threaded, io_uring, no-malloc-after-startup architecture in Rust! I personally don't know off the top of my head whether that would work or not.


I remember reading a similar thing about FoundationDB with their DST a while back. Over time, they surfaced relatively few bugs in the core server, but found a bunch in the client libraries because the clients were more complicated and were not run under their DST.

Anyways, really interesting report and project. I also like your youtube show - keep up the great work! :)


Oh, important clarification from andrewrk (https://lobste.rs/c/tf6jng), which I totally missed myself: this isn't actually a dereference of an uninitialized pointer, it's a dereference of a pointer which is explicitly set to a specific, invalid value.


This is indeed an important point: the way I originally understood the bug was that the memory was not initialized at all. Thanks for the clarification.


Well, per the Zig spec, any program that relies on that "explicitly set [and] specific" value of 0xAA isn't valid, so it's absolutely a bug.


The correctness bug was due to a combination of features. I'm curious whether you've looked into combinatorial testing, which NIST claims knocks out almost all bugs when 6-way testing is used.

https://csrc.nist.gov/projects/automated-combinatorial-testi...

My intro to other categories of test generation was usually this paper:

https://cs.stanford.edu/people/saswat/research/ASTJSS.pdf

Maybe see if your team can build combinatorial or path-based testing in Zig next.


Edit to add NIST's 2019 intro slides on combinatorial testing, since they're a better overview:

https://csrc.nist.gov/CSRC/media/Projects/Automated-Combinat...
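For a sense of what "t-way" coverage means, here is a small, hypothetical sketch (the flag names are invented): it enumerates every pair of boolean feature flags and every combination of their values, i.e. the coverage obligations that a covering-array tool such as NIST's ACTS would then pack into a much smaller set of concrete test cases:

    const std = @import("std");

    pub fn main() void {
        const flags = [_][]const u8{ "linked", "pending", "balancing_debit" };
        const values = [_]bool{ false, true };
        // 2-way coverage: every pair of flags, every combination of their values.
        for (flags, 0..) |flag_a, a| {
            for (flags[a + 1 ..]) |flag_b| {
                for (values) |value_a| {
                    for (values) |value_b| {
                        std.debug.print("must cover: {s}={} {s}={}\n", .{ flag_a, value_a, flag_b, value_b });
                    }
                }
            }
        }
    }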


Thanks!


> we have around 6,000+ assertions in TigerBeetle.

Are they enabled in production? Are there some expensive ones that aren’t?


Yes, we drive with the seat belts on.

It’s not expensive.

Because we batch, this naturally separates the control plane from the data plane, amortizing assertions against the (larger) buffers now flowing through the data plane.
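As a rough sketch of that amortization (hypothetical types, not TigerBeetle's actual code): the control-plane assertion runs once per batch, while the per-item checks are cheap comparisons over data that is flowing through anyway.

    const std = @import("std");

    const Transfer = struct { amount: u64 };

    fn processBatch(batch: []const Transfer) u64 {
        std.debug.assert(batch.len > 0); // control plane: paid once per batch
        var total: u64 = 0;
        for (batch) |transfer| {
            std.debug.assert(transfer.amount > 0); // data plane: cheap per item
            total += transfer.amount;
        }
        return total;
    }

    pub fn main() void {
        const batch = [_]Transfer{ .{ .amount = 10 }, .{ .amount = 20 } };
        std.debug.print("total = {}\n", .{processBatch(&batch)});
    }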

We do also have some intensive online verification checks, and these are gated behind a comptime flag.

Finally, we compile Zig with ReleaseSafe and further have all of Zig’s own assertions enabled. For example, checked arithmetic for integer overflow, which is not something you see enabled by default in release builds for most languages, but which is critically important for safety.
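A minimal, hypothetical sketch of what that checked arithmetic looks like in practice (not TigerBeetle code):

    const std = @import("std");

    // In Debug and ReleaseSafe modes, this addition panics on overflow
    // instead of silently wrapping.
    fn credit(balance: u64, amount: u64) u64 {
        return balance + amount;
    }

    pub fn main() void {
        std.debug.print("ok: {}\n", .{credit(100, 20)});
        // The next call trips the overflow check and panics in a safe build:
        // crash loudly rather than silently corrupt a balance.
        std.debug.print("unreachable: {}\n", .{credit(std.math.maxInt(u64), 1)});
    }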

The reason all this is so important is that if your program does something wrong in production, with people’s money, you want to know about it immediately and shut down safely.

In other words, production is where you most need the safety, not in development (although you obviously want them there too to find bugs faster). But again, it’s the bugs that make it to production that we’re trying to catch with assertions.


Thanks for your reply!

> it’s the bugs that make it to production that we’re trying to catch with assertions.

Nicely put, I think I’ll steal this!


Great to hear, I'll be using it in future too!


> There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

What are you referencing here?


The scenarios described in our QCon London talk linked above.

This surveys the excellent storage fault research from UW-Madison, and in particular:

  “Can Applications Recover from fsync Failures?”

  “Protocol-Aware Recovery for Consensus-Based Storage”

Finally, I'd recommend watching “Durability and the Art of Consensus”, our talk from SD24 in NYC last year:

https://www.youtube.com/watch?v=tRgvaqpQPwE


    [disks are] somewhere between non-Byzantine fault tolerance and
    Byzantine fault tolerance ... you expect the disk to be almost
    an active adversary ...
    ...
    so you start to see just a single disk as a distributed system

My goodness, not at all! If you can't trust the interface to a local disk then you're lost at a fundamental level. And even ignoring that, a disk is an implementation detail of a node in a distributed system; whatever properties that disk may have for that local node are irrelevant in the context of the broader system, and are the responsibility of the local node to manage before communicating anything with other nodes in that broader system.

Combined with https://www.youtube.com/watch?v=tRgvaqpQPwE it seems like the author/presenter is conflating local/disk-related properties/details with distributed/system-based requirements/guarantees. If consensus requires a node to have durably persisted some bit of state before it sends a particular message to other nodes in the distributed system, then it doesn't matter how that persistence is implemented, it only matters how that persistence is observable; disks and FS caches etc. aren't requirements, they're just one of many possible implementation choices.


Recommend you first read the FAST18-winning “Protocol-Aware Recovery for Consensus-Based Storage”.

It’s a mindbender of a paradigm-shift for how to think about local recovery actions in the context of the global consensus protocol!


I've read that paper for sure!

> Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8, 9, 32, 59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60, 62]. Thus, it is critical to reliably detect and recover from storage faults.

It's not true in general that a node in a distributed system binds its persisted state to a local disk, or flash device, or any specific implementation of any specific kind of storage system. The storage layer is the responsibility of the node to manage, and irrelevant to the wider distributed system in which the node is a participant. Any of those kinds of storage faults need to be accommodated by the node that utilizes those storage layers, but their specific details don't need to be communicated beyond the specific node where they apply. And it's not at all critical for those nodes to detect and recover from any faults in their storage layer; those faults can easily be communicated to the broader distributed system, which necessarily must be able to handle node-specific failures like those without breaking everything down!

> how to think about local recovery actions in the context of the global consensus protocol!

Local recovery actions, or any other kinds of node-specific details, have no relevance or influence on the information communicated through the global consensus protocol. A node can persist its state to a local disk, or to RAM, or to an S3 object, or anything else, and none of these details matter at all to the details that node communicates to other nodes in its cluster.

tl;dr: nodes don't necessarily persist state to local disk


I’m not disagreeing that diskless crash recovery protocols exist.

In fact, an early version of TigerBeetle implemented one of these from Cowling and Liskov’s VSR’12 paper.

However, since then, we invested in TigerBeetle’s stable storage, for reasons which I won’t go into further here.

If you are curious to learn more about the stable storage techniques in TigerBeetle in particular (again, this is not trying to suggest that VSR can’t also run without stable storage, or to deny other techniques such as object storage or tiering!), then I’d recommend diving into “Durability and the Art of Consensus”.


I've read that paper, and all of the papers that TigerBeetle mentions in any of its docs/blog posts/etc. Stable storage doesn't matter to the points I'm trying to shine a light on here. It's not "diskless" that I'm talking about; in fact, I'm not talking about "crash recovery" at all!


> I’ve read that paper

To be clear, “Durability and the Art of Consensus” is not a paper.

> tl;dr: nodes don't necessarily persist state to local disk

In the consensus literature, this is sometimes referred to as “diskless crash recovery”. For example, see work by Dan Ports.

> Local recovery actions, or any other kinds of node-specific details, have no relevance or influence on the information communicated thru the the global consensus protocol.

This goes directly against the central finding of PAR, which gives counter-examples where your statement does not universally hold true.

Again, it’s not intuitive. And that’s why it won FAST18—because it says “everything we know is wrong”. Complete red pill and mindbender.


PAR isn't any kind of panacea or golden rule for nodes in a distributed system, it describes properties of nodes that meet very narrowly-defined requirements, which are in no way universal, and which are in no way requirements for those nodes to participate in the distributed system.

More broadly, there's no concept of "crash recovery" at the system level which has any meaningful utility. Nodes are either there or they're not there; exactly how they crash or recover from their crashes is irrelevant to the overall distributed system, insofar as if a crashed-and-recovered node comes back online, it's gonna need to re-sync with its peers before it can talk to anyone else. And that's not anything to do with "crash recovery" related to local disk or anything like that (which is all implementation details of the node itself) -- it's just normal node synchronization, orthogonal to any state-storage stuff of the node.

PAR is something that your system can maybe implement, it's not any kind of rule or definition or requirement that all systems of some classification must satisfy..!


If I might add on to what you and Joran are both saying, after some time working with TigerBeetle, I found it useful to think of Protocol-Aware Recovery as similar to TAPIR (https://syslab.cs.washington.edu/papers/tapir-tr14.pdf). Normally we build distributed systems on top of clean abstraction layers, like "Nodes are pure state machines that do not corrupt or forget state", or "the transaction protocol assumes each key is backed by a sequentially-consistent system like a Paxos state machine". TAPIR and PAR show a path for building a more efficient, or more capable, system, by breaking the boundaries between those layers and coupling them together.



