Say No to Paxos Overhead: Replacing Consensus with Network Ordering

mritun · on Dec 9, 2016

Interesting research paper but multicast is a non-starter for non-trivial production systems.

1. As implemented by most (read: all) network hw, it works fine when congestion is low. As soon as congestion starts increasing, the packet loss amplifies.

2. If you have large multicast groups, the bloom filters used for membership test become very full, hence false positive rate rapidly increases. This leads to multicast storms.

In my experience, anything which is not TCP (or, to lesser degree UDP) is a non-starter for building reliable distributed systems. Despite the claims of "lower overhead" the TCP protocol stack gets most love and hence has been optimized heavily on every significant operating system.

jerf · on Dec 9, 2016

But we're not implementing multicast in some general sense, then pushing it to its limits as we try to scale it to thousands of devices. Isn't that the multicast failure case? What we're doing here is replacing something that would right now be using Paxos, which has a lot of message chatter and therefore can become bandwidth limited probably sooner than the multicast channel would be anyhow. You wouldn't use this to try to coordinate thousands of machines either way, you'd be doing this to replace something that ran at Paxos-scale previously. There's a bound here, and might it be below the multicast failure bound?

(Honest question; I have no direct experience with multicast.)

mritun · on Dec 9, 2016

Multicast fails before bandwidth starts becoming an issue. Collisions and congestion are fact of life and do not necessarily mean that links are full (you may just have lots of packets stuck close together in time).

Most TCP implementations handles it well and Linux has well written guides on how to tune parameters when the defaults don't match well to your specific situation well. Multicast just fails catastrophically in arcane ways that depend upon your specific hardware, network topology and mix of your firmware version on your switch/routers. It makes it very hard to "design" a product around that kind of fragile system.

hardwaresofton · on Dec 9, 2016

Would you mind expanding a little more? If I'm understanding your point correctly -- you're saying that the sequencer failure case (whether it's modified switches or programmable switches, or software-based) is more common as congestion increases, and that's what makes this unusable in practice?

Wouldn't traditional Paxos already be dead in the water at that point?

Also regarding the point about multicast storms -- I was imagining something like <100 nodes in a multicast group -- is that naive? What obvious use case am I missing? I was basically thinking of this scheme in front of a database layer

polskibus · on Dec 9, 2016

Multicast has been used TIBCO Rendezvous to deliver high performance, high volume messaging in financial sector for ages. Can you elaborate on your claims?

revertts · on Dec 9, 2016

Messaging in the financial sector has very different requirements from messaging for a general service backend, which I assume is what the OP is referencing. As an example, a certain ecommerce site back in the day was heavily tibco based and migrated off due to inefficiencies at scale, multicast storms, debugability, and the general nature of multicast failures (for the most part you get no graceful degradation - which is fine when you have tight control of the network and do careful capacity planning, but isn't the mode that most tech companies live in).

Anecdotally, non-multicast setups are easier to scale in hardware (because you can push commodity farther) and most critically are easier for the normal type of engineer you'd find in the tech sector to reason about (because many aren't comfortable with networking equipment, so it's much easier moving that state out of the network and into layers they can easily debug).

otterley · on Dec 9, 2016

I worked on such a site in the mid-2000s. TIBCO Rendezvous was extremely efficient in the steady state, but boy, it had a catastrophic failure mode: during network congestion or receiver saturation, it would start repeating messages, causing the network congestion and saturation to worsen. The only way to stop the bleeding was to shut everything down.

At any rate, most cloud SDNs don't support IP multicast (unless you use an overlay like Weave), so it's sort of a nonstarter for them.

These days I might use Redis pub/sub along with a multi-level cache (e.g. LMDB + local Redis + remote Redis) if I need to distribute data to a large number of clients efficiently.

tptacek · on Dec 9, 2016

This is my experience with Rendezvous as well. The very largest financial organizations using Tibco tend to have a whole group of engineers whose only job is to keep Rendesvous from throwing a rod. Their networks are also overprovisioned and their traffic rates are pretty straightforward to characterize.

You can, of course, use Rendezvous without multicast.

And: does anyone still use it? The last exchange I worked with that did was migrating from it to JMS (which sort of gives lie to the idea that Rendezvous's multicast performance was that big a deal).

otterley · on Dec 9, 2016

Back in the 1990s and early 2000s, network bandwidth and server capacity was a lot lower, so IP multicast was the most efficient way to distribute data to hundreds or thousands of clients.

Now that 10Gb+ is commonplace and modern servers can service hundreds or thousands of message bus consumers with efficient kernel-managed event loops, yes, it's arguably less needed now. But nobody will claim it's efficient from a network perspective. :)

tptacek · on Dec 9, 2016

Depends on what you mean by "from a network perspective". Bandwidth? Sure. Network processing? Not so much: multicast is expensive for routers.

SEJeff · on Dec 9, 2016

Most financial firms (HFTs like I work for at least) purged Tibco with scorn and hellfire long ago due to reasons like the ones you just mentioned.

polskibus · on Dec 9, 2016

What did they replace it with?

SEJeff · on Dec 9, 2016

Non-open source proprietary systems, as one does :/

kasey_junk · on Dec 9, 2016

https://github.com/real-logic/Aeron

Is well regarded & open source

SEJeff · on Dec 10, 2016

Several firms (not the one I currently work for) have implemented their own variants of NASDAQ's "MOLD" reliable udp protocol:

https://www.nasdaqtrader.com/content/technicalsupport/specif...

It is actually quite good.

mrchicity · on Dec 9, 2016

By general service do you mean a system where response time isn't especially important? Using TCP to fan-out messages in a PubSub pattern is going to waste network and compute resources in a serious way. It can easily exhaust capacity or result in convoluted, unreliable systems with layers of proxies for managing load.

FWIW when you talk about different requirements, I'm not sure you're correct. For traders, sure, oftentimes they can just dump old data if they fall behind because it's mostly irrelevant. However, stock exchanges have performance and reliability/availability as top requirements. They must record all transactions and can't crash or lock-up. Most use multicast internally and externally. They would never be able to scale and maintain a low response time with TCP.

sseveran · on Dec 9, 2016

If its the same ecommerce site I am thinking of, then there was also non-trivial message loss.

mritun · on Dec 9, 2016

Yes, and with significant outages too. Again, I'm not saying multicast doesn't work. Multicast works well as long as you can guarantee that your network topology, components (switches and routers), servers and usage pattern will be predictable (and you test everything rigorously before going live). I can imagine that being true at some places with extremely tight focussed requirements.

In a dynamic environment where you'd be scaling continuously and adding components and servers and building new services, this becomes a non-starter. Multicast implementations doesn't fail gracefully -- there's very little warning before things come to a complete halt and need complete shutdown and reboot of entire network.

SEJeff · on Dec 9, 2016

Multicast storms can generally be avoided by IGMP Snooping, which virtually all modern high performance switches will support. Are you saying the above with an understanding of IGMP Snooping, or only in primitive network environments where the hardware / software switches lack it?

When IGMP Snooping is enabled, the switches themselves will limit multicast messages only to ports that have actually joined a group, which prevents multicast storms (if I'm understanding you correctly) in the first place.

mritun · on Dec 9, 2016

It depends on the network. If the network is 100-1000 nodes and you can guarantee that the switches and routers you use to build it (and upgrade in future) implement it correctly and won't break/introduce bugs in future firmware release then yes.

In the kind of environments I worked in, this was a non-starter.

edit: also "limit multicast messages only to ports that have actually joined a group" part is usually implemented by bloom-filters that I talked about above. They work well as long as there are not many groups and not many members.

SEJeff · on Dec 9, 2016

Sure in network topologies that size, you often have tiered access, leaf, and spine switches (and you can even build nonblocking topologies if you do things rihgt). Usually doesn't mean always. Many (most?) higher end switches use custom ASICs for IGMP Snooping, which means they scale fine until the ASIC limit at which they don't handle any more. In my career, I've worked in single networks with a max of maybe 4000 server nodes in 1 broadcast domain or so. Not huge, but I've seen no issues with big multicast uses provided you're using the right network cards / HBA and the right network switches. I've seen people build FPGAs to split market data into 10k+ multicast channels and was asked for help making Linux join so many channels (it is a PITA). It was amusing when we had to call around to find switch vendors which even supported it at all, most of them told us we were crazy (this was maybe 9 years ago). Take the Cisco Nexus switch as an example, it supports 32,000 IGMP groups, certain Aristas support 64,000, etc, and that is just for ethernet switches. Lets not get into Infiniband or some of the exotic Cray/IBM interconnects.

I would be curious in your experiences if you're able to share it, but do very much believe there are options other than software for implementing multicast filtering on switches. What type of environments that you worked in made IGMP Snooping a non-starter?

nickpsecurity · on Dec 9, 2016

Don't forget there's protocols like UDT available, too. HPC field particularly has been doing TCP alternatives either on UDP and custom (eg Active Messages) for a long time to help tackle some of issues you get with TCP.

http://udt.sourceforge.net/

liveoneggs · on Dec 9, 2016

Not only these failures but I've had nothing but trouble with multicast implementations on hardware and OS stacks just refusing to talk to each other. It's a highly unstable feature of the networking world and is not to be relied on.

otoburb · on Dec 9, 2016

Using the correct capitalization makes more sense for the HN title because "NO" in this instance and context means "network ordering", which the article then discusses (even though the expansion is at the end of the article title).

IshKebab · on Dec 9, 2016

It's a pun. Although you are right, they have kind of ruined it by using No instead of NO.

hardwaresofton · on Dec 9, 2016

"NOPaxos" might me a catchy name but is probably not good in the long term as it might make people think it's NOT paxos. Maybe "NeOPaxos" is better?

Also If my understanding is anywhere near correct, it's a mix of Paxos and a bit of a proxying sequencer (which may be implemented in a physical switch) and all replicas & master need to be on the same switch/subnet...

Going to read the paper now -- anyone have a better understanding of the tech they could share? Excited but still a bit skeptical

lepacheco · on Dec 9, 2016

Yep, I also think it's kind of a misleading name. It's still Consensus but with stronger assumptions on the network (kind of, it depends on an external master for correctness, which could itself use something like classic Paxos). Simplifying it a bit, they use a single sequencer that every packet must go through to get "ordered". As the authors suggest, that could be implemented in a programmable switch. Given that abstraction, they can simplify the consensus protocol itself.

qznc · on Dec 9, 2016

"NoPaxos" is not "Not Paxos" like "NoSQL" is "Not SQL" ;)

If you use a NoSQL database, the relations do not disappear. You just have to implement your joins in a higher layer. NoPaxos pushes some parts of Paxos into lower levels, namely the OUM primitive of the network.

hardwaresofton · on Dec 9, 2016

Forgive me if I'm wrong, but aren't they still pretty similar (in that they sound like misnomers?)

Relations ~= Consensus (traditional full-featured Paxos)

To use your sentence

If you use NOPaxos, the consensus doesn't disappear. You just have to perform it whenever your sequencer, or replica-leader becomes unreachable/fails.

The "OUM primitive" addition isn't much more than two monotonically increasing integers (session and order-number) which depend on a central (bottleneck) sequencer... NoSQL moved the responsibility for relation-management up, NoPaxos is moving the responsibility for consensus down/to the worst case... My point was that neither are fundamentally different from the "X" after the "NO"...

altendo · on Dec 9, 2016

I am reminded of the old joke about how there are 2 major problems in computer science: caching, naming things, and off-by-one errors.

mamcx · on Dec 9, 2016

I dislike NoSql because the connotation that relational databases are "bad/old", and this is better. Because this look like "Say NO to Paxos!". NO is a strong word.

SQL as language can have it issues but the relational model is far easier and more flexible than people assume.

meta_AU · on Dec 9, 2016

Use a single node to provide all-or-nothing and ordering, then simplify Paxos based on those assumptions.

hardwaresofton · on Dec 9, 2016

Thanks -- After reading the paper I think I have a better understanding, and it is pretty exciting, this does seem to be a pretty good throughput improvement, especially if things are mostly happy-path.

The paper seemed to be missing an exploration of worst-case scenarios and equivalencies though, but I guess I guess worst case is paxos speed (consistently failing sequencer, or a leading replica that receives one messages then is unreachable, bad quorums, etc)

toast0 · on Dec 9, 2016

I haven't done much with Paxos and friends and I only read the blog, not the paper, but this seems like elect a coordinator, send all changes through the coordinator which creates an ordering and sends out the ordered changes to the group. And by the way, hide the coordinator in your switch.

This seems pretty straightforward -- running MySQL replication in a sane way is essentially appoint a master server, send all writes through that, and hope you don't need to have a new election.

altendo · on Dec 9, 2016

This is characteristic of all distributed systems, so in that sense it isn't a unique concept to NOPaxos. The emphasis here is minimizing the traffic that a typical Paxos implementation incurs while changing the constraints of the distributed system.

siscia · on Dec 9, 2016

In the Second to the last figure (fig 5) you can see a behaviour that is extremely weird in my opinion, the latency is almost in O(1) and then hit a wall and goes up as exponential (I guess).

It this common? What cause this behavior?

mritun · on Dec 9, 2016

That's multicast for you. Once you congestion and/or group size increase beyond a certain value or group size, things go /really/ bad; as in you may need to "shutdown, wait and reboot" your network to recover.

tssva · on Dec 9, 2016

All of the tested protocols including unreplicated flows follow the same pattern with the proposed multicast protocol most closely following the same pattern of behavior as the unreplicated data; therefore some intrinsic property of multicast is not responsible.

qaq · on Dec 9, 2016

https://www.youtube.com/watch?v=gas2v1emubU

toolslive · on Dec 9, 2016

There are other consensus algorithms that start with a monotonous channel (ie a channel that might drop messages, but does not reorder them). Mencius fe does this and also avoids most of the overhead http://www.sysnet.ucsd.edu/sysnet/miscpapers/mencius-osdi.pd...

matthewaveryusa · on Dec 9, 2016

I'm actually really curious about this. If you are geo-diverse and need to go over the internet, will this work?

hardwaresofton · on Dec 9, 2016

It It depends on what your definition of "work" is.

if you're geo-diverse, and you want writes to happen are more than one place (as in the top nodes at each geographic location can all process requests), then you have to go the traditional consensus route. One server in one region makes a write, and the other servers in the region need to acknowledge the write before they move on, lest state become inconsistent. Pretty sure there's no way around it.

if you're geo-diverse, but are OK with possibly stale reads from time to time, and only writing to one geographic area (server), then it seems possible -- you'd just apply these same principles across a bigger scale rather than same-datacenter. You'd probably also lose a lot of the latency gains though.

zb · on Dec 9, 2016

Probably not. It depends on guaranteed in-order packet arrival. In general the Internet very much does not guarantee that, so you would need control over all of the networks involved.

Of course you can always tunnel over TCP (you need to tunnel over something, because you can't route multicast over the Internet either), but that would almost certainly eliminate the performance benefits.

altendo · on Dec 9, 2016

I'm in agreement with the other replies here - this seems more useful within a datacenter (and any unit smaller than that, such as an AWS Availability Zone) and not so much useful over the general internet. Too much unpredictability in how things get routed, what types of routers are hit, etc.

socmag · on Dec 9, 2016

This gets a huge thumbs up from me.

I've worked through a lot of the same reasoning as these people, and as technically correct as PAXOS might be, in the real world, causality at high frequency is a blur at best, and there is absolutely no need to impose the hard constraints that PAXOS and others imply.

The only cases where guaranteed constraint systems are needed are when you don't have traffic and at each time step all parties are capable of asking each other to confirm that from all frames of reference they all agree who and why someone committed first. Which is great, but makes zero sense in the real world.

At scale transactions are and should be committed on a best effort basis.

You don't look at your bank account and wonder why the T-Mobile payment went out slightly after your ATM withdrawal.

Why? Because... shrug. God said so. He works in mysterious ways.

This is awesome. Sign me up.

pron · on Dec 9, 2016

You have misunderstood (as the title is a pun). It is "Say NO to Paxos Overhead", and is about a new Paxos invariant called NOPaxos (Network Ordering Paxos), that relies on code running in network switches to provide ordering. The same consensus guarantees apply -- in fact, it is still Paxos -- but the latency is significantly improved (and is within 4% of no replication at all). So it's Paxos with (almost) no overhead thanks to NO.

You are also very mistaken about there being "no need" for consistency guarantees. Even systems that don't provide consistency for every data update, do enjoy consistency for some more rare events (like cluster membership). Without strong consistency at all, the range of distributed programs we can build is much reduced. The need for a greater range of distributed programs arises not from geographically separated applications (like ATMs), but from the mere fact that distributed computing is currently our only way to scale computational power (within that data center). If we cannot have consistent distributed computations, our computational power and the range of applications we can build in general would be greatly reduced.

socmag · on Dec 10, 2016

I was definitely NOt saying there is no need for consistency guarantees at all. I was saying that this is a much better way to do it.

lepacheco · on Dec 9, 2016

That's not what the paper is about. They still provide Consensus (i.e., total ordering).

MrBuddyCasino · on Dec 9, 2016

Maybe I didn't get it, but ordering is still guaranteed, it has just moved to the network layer and is kind of "hardware accelerared" by a smart switch or SDN node, or am I wrong?

hardwaresofton · on Dec 9, 2016

Ordering is definitely guaranteed -- it's the only thing that is, by this scheme (delivery is not, latency is unbounded).

They provide 3 approaches in the paper: SDN node, programmed hardware, and software-only. Software-only approach lost some of the latency-gains, as might be expected.

MrBuddyCasino · on Dec 9, 2016

Thanks, thats how I understood it.

The tradeoffs are quite seductive, in my experience most companies do not operate their software on a scale where they really have to scale out, but they run multiple instances simply to increase availability.

But then those are probably the same kind of companies that will probably not have the required network hardware, and won't for many years to come.

I'm thinking about stuff like Redis or distributed caches like Hazelcast, maybe even Zookeeper and friends. Any ideas where this could shine?

Cloud vendors could manage the hardware part, but is this model attractive to them, essentially being constrained to single-node performance?

hardwaresofton · on Dec 9, 2016

I thought of mostly the same things (redis/some other database), maybe also IoT? This, in combination with the ultra low power wifi post that recently went up (https://news.ycombinator.com/item?id=13135607) seem like they would go together.

I think cloud vendors would definitely charge a pretty high premium for this, since it requires more than zero work on their part.

Also, aren't most simply-distributed databases constrained to single node performance (for writes, assuming one master)? I feel like you could work in fast reads to this scheme too, if you required replicas to have a complete log in order to respond to a read, since it's all or nothing. The possibility of a slightly (1 op) stale read is possible though.

jon-wood · on Dec 9, 2016

> You don't look at your bank account and wonder why the T-Mobile payment went out slightly after your ATM withdrawal.

Your bank account isn't being used for even nearly the same things distributed data stores like this are being used for. And even then they have to make workarounds to ordering issues, otherwise if there were a day when my salary comes in and my rent went out I'd be hit with overdraft charges if they happened in the wrong order.

ansible · on Dec 9, 2016

For a while, banks were maliciously ordering such transactions, to create such scenarios.

That's why I'm real cautious about debit cards.

function_seven · on Dec 9, 2016

Yup. I don’t think they ever processed debits before credits in any given day. (The standard AFAIK has always been to process all credits first, then do the debits. This might be an antiquated understanding, though).

But what they did do was to process the largest debit first, then the next largest, and so on. You have $100 in the account, and completed four transactions at $4, $8, $15, and $90? In that order? Too bad, 90 goes first, followed by three overdrafts.

tinco · on Dec 9, 2016

You're cautious about debit cards? The only alternative being credit cards, where each transaction by definition is an overdraft? Maybe I'm getting this wrong, I'm from Europe where we don't really use credit cards.

What fee are you talking about? If you use a debit card, and you don't have the money it will simply refuse the transaction right?

klodolph · on Dec 9, 2016

I think you are using the wrong definition of "overdraft". On a credit line, an "overdraft" is when you've already exhausted the line of credit. This is often possible, and there are often not really any fees or penalties associated with it. They are just giving you more credit than they originally promised to give you.

American perspective here—I feel safer using a credit card because it's not my money. If someone fraudulently bills my credit card, it's the bank that they're stealing from, not me. There are legal protections in place which limit my personal liability.

With a debit card, if someone fraudulently uses it, the money is gone and you have to fight to get it back. There are legal protections which limit my liability here too, but when I invoke those protections I'm starting from an inferior position (the money is already missing from my account), and those protections are a little better for credit cards.

Our accounts are also often reconciled with ACH, which is not as fast as it could be. So it's not always true that the transaction will be refused if you don't have the money. It's often true, but credit cards are safer because there's not really a penalty for going over your line of credit.

ansible · on Dec 9, 2016

Yes, all this is also my perspective as a USA citizen.

At the very least, I'd recommend that you have a separate debit card account that is used for store purchases and such. This gets filled (manually!) from another account which you also use only for really important stuff like utility payments, housing, car payments, etc.

computerphage · on Dec 9, 2016

No, they will charge you a fee called an overdraft fee for the "service" of allowing you to purchase the item. They basically extend you a small line of credit (there's some upper limit before they will deny the purchase) and then charge you, say, $35.

Credit cards are always overdrawn in the sense that the bank has extended you a line of credit, but they don't have a fee associated with typical use.

jon-wood · on Dec 9, 2016

At least my UK bank will immediately allow my bank account to become overdrawn while also sending me a text message letting me know that. So long as the account is in credit by 11pm that day no charges or interest are due.