A State of Xen – Chaos Monkey and Cassandra (netflix.com)
151 points by tweakz on Oct 3, 2014 | 53 comments


Is there a blog where they post about what happened during actual downtime? Like the one on September 21st?


By trading away C (Consistency), we’ve made a conscious decision to design our applications with eventual consistency in mind

How does one go about writing a user-facing application with eventual consistency?


It depends on the application. For Netflix, transient glitches like having outdated play history or catalog entries that take a while to propagate aren't hugely detrimental -- the core value of serving the videos that users request doesn't depend at all on having a consistent view of quickly-changing user data.

For more complex applications (think Facebook), there are useful consistency models other than strong consistency, with causal consistency being one of the most promising: http://queue.acm.org/detail.cfm?id=2610533


Do you know whether the linked article is what Facebook is currently using, or a proposal? Facebook is an example of a webapp that fairly often has user-visible weird behavior, like "read" notifications becoming unread again. But I'm not sure if those are glitches, or a result of their consistency model.


Facebook wrote Cassandra for such scenarios.

https://m.facebook.com/note.php?note_id=24413138919


If they are often watching full episodes or movies, I wouldn't call it very quick. Though progress state would be changing near constantly.


Embrace idempotence. Being able to perform an operation multiple times with the same result adds a lot of flexibility.

For example, using the Rails TODO-list tutorial: if you add a task to "buy milk" and then view the list on another client and don't see "buy milk" there (because consistency hasn't been reached yet), you might be inclined to think you forgot to save it. Then you might enter it again. If the add is not idempotent (as is the case in the classic tutorial), you will end up with two "buy milk" tasks instead of just one.
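
Here's a minimal sketch of the difference in plain Python (dicts standing in for the Rails models; the key-derivation rule is just an illustrative choice):

  tasks = {}

  def add_task_idempotent(tasks, text):
      # Derive a stable key from the task's content, so replaying the
      # same submission overwrites instead of duplicating.
      key = text.strip().lower()
      tasks[key] = text

  add_task_idempotent(tasks, "buy milk")
  add_task_idempotent(tasks, "buy milk")  # retried after a "lost" save
  assert list(tasks.values()) == ["buy milk"]  # still just one task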


This is actually great advice - but not all operations can be made idempotent. Adding and subtracting money from bank accounts comes to mind (although you can make the operation "write transaction to some log" idempotent, you can't perform the actual computations in an idempotent fashion).


You handle this by modeling the account as a ledger with immutable entries containing unique identifiers rather than a single mutable balance. The ledger entries are facts. The balance is a derived computation. Only store facts, not derived computations.

(At some point for efficiency reasons you might need to prune the transactions, but this often involves trade-offs like not allowing any new "lost" transactions older than date X.)
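
As a rough sketch of that idea (plain Python, with illustrative names): entries are immutable facts keyed by a unique id, so re-applying a duplicate is a no-op, and the balance is always derived rather than stored.

  ledger = {}  # entry_id -> amount: facts only

  def record(entry_id, amount):
      # Idempotent: replaying the same entry_id changes nothing.
      ledger.setdefault(entry_id, amount)

  def balance():
      return sum(ledger.values())  # derived computation, never stored

  record("txn-001", +100)
  record("txn-002", -30)
  record("txn-001", +100)  # duplicate delivery, safely ignored
  assert balance() == 70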


and how would you make that idempotent? ;)


The key is to consistently derive a primary key. With the same key, last write wins. The second create turns into an update.

Making the primary key "buy-milk" would do the job. Getting a consistent key requires a bit of UX but isn't actually that difficult.


These days you can often get away with just using a cryptographic hash function. Blake2b is really fast. Makes life unbelievably simple.

I don't think it would be very useful in a TODO list situation though.
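
A hedged sketch of both comments combined (Python's hashlib has supported BLAKE2b since 3.6; the store is a stand-in for whatever database you use): hash the record's natural content to get a stable primary key, and the second create becomes an update.

  import hashlib

  store = {}

  def derive_key(text):
      # Stable key derived from the content itself.
      return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()

  def upsert(text):
      store[derive_key(text)] = text  # same content -> same key -> update

  upsert("buy milk")
  upsert("buy milk")  # last write wins; collapses into one row
  assert len(store) == 1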


What if I want to enter buy-milk twice ;)


Different applications could use different schemas; you just keep eventual consistency in mind. E.g. event sourcing, or treating everything as a log that gets read and merged (think of it as a CRDT), should be able to handle those kinds of scenarios. That use case matches Cassandra quite well with its wide rows.
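
As a toy illustration of the "log that gets merged" idea, here is the simplest CRDT, a grow-only set, in Python: merge is set union, which is idempotent, commutative, and associative, so replicas converge no matter the order or number of merges.

  def merge(replica_a, replica_b):
      return replica_a | replica_b  # set union

  client_1 = {("task-1", "buy milk")}
  client_2 = {("task-2", "walk dog")}

  # Either order, or merging repeatedly, yields the same converged state.
  assert merge(client_1, client_2) == merge(client_2, client_1)
  assert merge(merge(client_1, client_2), client_1) == merge(client_1, client_2)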


Idempotence can handle that possibility by embedding the ordinal of the purchase you wish to make:

  buy-first-milk
  buy-second-milk


But now the user is burdened with having to be explicit.

This is exactly the type of case where I'd much rather have the app be explicit about its actions: allow both to be added, and at most show a notice after synchronisation saying "I've noticed you've added two buy-milk entries; merge them or leave them separate?".

I've had nothing but bad experiences with apps that think they know best and try to merge records behind my back.


The idea behind idempotence is that you can apply an operation multiple times, in any order, and still come up with the same result. http://en.wikipedia.org/wiki/Idempotence

If you say "Buy_First_Milk" and "Buy_Second_Milk", then even if the order is reversed - "Buy_Second_Milk" is applied first and "Buy_First_Milk" is applied twice - everything still happens as desired.

Being explicit is what allows everything to work without confusion.
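
A tiny Python demonstration of that claim: with explicit keys, applying the operations in reverse order, with one of them duplicated, still yields the same final state.

  ops = [("buy-first-milk", "Buy Milk"), ("buy-second-milk", "Buy Milk")]

  def apply_ops(op_sequence):
      state = {}
      for key, value in op_sequence:
          state[key] = value  # keyed writes: replays and reorders are harmless
      return state

  reordered = [ops[1], ops[0], ops[0]]  # reversed, plus a duplicate
  assert apply_ops(ops) == apply_ops(reordered)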


You don't know what the user's intent is. Maybe they meant to enter "buy milk" twice, maybe they didn't.


That's the point I'm trying to get across - you know the user's intent if they are explicit. If the user enters "Buy Milk" - you don't know their intent. But, if they enter "Buy First Milk" and then "Buy Second Milk" - you know precisely what they want.

From a user interface perspective, if the user ever enters "Buy Milk" - and there is no existing milk, that is always, "Buy First Milk". If they go to a different client, and the milk order is there, and they enter "Buy Milk" - that then becomes "Buy Second Milk". But, if "Buy Milk" hasn't synced yet, they can explicitly flag it as "Second Milk Order".


Right, you are shifting the burden of merging conflicts to the user. That may be the best approach, but you are giving up on achieving consistency purely through tech.


No, the app can manage this in most cases - if you enter "buy milk" twice in the same browser session, it will know how to construct the keys properly. If you enter "buy milk" once in each of two sessions that are not yet consistent, in both cases you would see only one in the list, and so adding another in either or both sessions will yield correct results.


But then you are guessing what the user's intent is, and you could be wrong.


Netflix has a couple of talks about this, but it mostly comes down to the fact that, for their specific use case, their users don't care about consistency.

For example, for your viewing history you might not care that it isn't 100% up to date and the last video you watched isn't shown yet. And for the times you do care, most users will simply refresh the page (and since eventual consistency in almost all cases means resolution within seconds rather than instantly, most users will have the correct info after a page refresh).

I can't find them right now, but Christos Kalantzis (@chriskalan) has a number of talks on using Cassandra at Netflix w.r.t eventual consistency.


Here are the slides from Netflix architect Christos Kalantzis's talk, "Eventual consistency != hopeful consistency": http://www.slideshare.net/planetcassandra/c-summit-2013-even...

Video: https://www.youtube.com/watch?v=lwIA8tsDXXE

But, it's important to note that Cassandra does recognize that sometimes you do need linearizable consistency. Cassandra provides lightweight transactions for this scenario: http://www.datastax.com/dev/blog/lightweight-transactions-in...
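
For the curious, a lightweight transaction looks like a conditional write. A minimal sketch with the DataStax Python driver (my choice of client; the contact point, keyspace, and users table are placeholders, not anything from the article):

  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect("demo")

  # IF NOT EXISTS turns the insert into a compare-and-set, run via Paxos.
  result = session.execute(
      "INSERT INTO users (username, email) VALUES (%s, %s) IF NOT EXISTS",
      ("cassie", "cassie@example.com"),
  )
  row = result.one()
  # The first column of an LWT result is the special [applied] flag.
  print(row[0])  # False means the row already existed; nothing was written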


I assume: serving content that doesn't change often (videos) and mostly needs to be available. Recording user scores and aggregating their effect later (last write wins means you could update the score a second time and only the first update would end up recorded - which is annoying but not end-of-the-world).

Likely if they have things that require stronger consistency they use another DB.


> rails new todolist

...seriously, though. All of our views are out of date by the time we see them, and all of the user inputs have to be dealt with for races and double-submits. Eventual consistency is just the patterns you already know, but bigger.


Eventual consistency is much more complicated than that. Patterns for dealing with nontrivial logical operations (i.e. anything that is bigger than a single 'object') are much easier on consistent DBs than eventually consistent ones. On typical RDBMSs you maintain a per-object version number and foreign keys (possibly also a per-object checksum if you're using more relaxed isolation modes). That's generally enough to prevent races, double submits, and other concurrency artifacts over multi-object operations, and it's pretty easy to implement.

The same is not true for eventually consistent systems - checking against a version number can't save you unless all your updates are trivial, single-object ones: some of your updates may succeed while others fail, and there's generally no way to perform an all-or-nothing operation. In general, there's no one good mechanism for maintaining eventual consistency on nontrivial operations - you have to think (hard) about it on a case-by-case basis.
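
For reference, the RDBMS version-number pattern mentioned above is only a few lines. A minimal sketch using sqlite3 from Python's standard library (table and column names are illustrative): the UPDATE only applies if the version we originally read is still current, so concurrent writers can't silently clobber each other.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, text TEXT, version INTEGER)")
  conn.execute("INSERT INTO tasks VALUES (1, 'buy milk', 1)")

  def update_task(conn, task_id, new_text, expected_version):
      cur = conn.execute(
          "UPDATE tasks SET text = ?, version = version + 1 "
          "WHERE id = ? AND version = ?",
          (new_text, task_id, expected_version),
      )
      return cur.rowcount == 1  # False: someone else got there first

  assert update_task(conn, 1, "buy oat milk", expected_version=1)
  assert not update_task(conn, 1, "buy soy milk", expected_version=1)  # stale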


If you really need consistency, you can use an external locking mechanism (Zookeeper comes to mind).
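
A sketch of what that looks like with the kazoo client library (an assumption - the comment only names ZooKeeper; the host and lock path are placeholders): ZooKeeper serializes access so only one client at a time runs the critical section.

  from kazoo.client import KazooClient

  zk = KazooClient(hosts="127.0.0.1:2181")
  zk.start()

  lock = zk.Lock("/locks/account-42", identifier="worker-1")
  with lock:  # blocks until the ZooKeeper-backed lock is acquired
      pass    # do the operation that needs mutual exclusion here

  zk.stop()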


Are any of these anti-chaos tools open source or shared with the community? Would love to see more companies that I'm dependent on have this sort of testing and robustness...


Chaos Monkey (and related software) is open source: https://github.com/Netflix/SimianArmy/wiki


2700+ Cassandra nodes! Anyone know how big Facebook's Cassandra is?


AFAIK, Facebook no longer uses Cassandra, but apparently Apple has 75,000+ Cassandra nodes.


hm.

from: http://xenbits.xen.org/xsa/advisory-108.html

MITIGATION
==========

Running only PV guests will avoid this vulnerability.

Did Amazon reboot all of its VMs, or just the HVM VMs? Why was Netflix running on HVM VMs?


HVM is more performant than PV in most situations with modern virtualization technology.

PV drivers on HVM have been a thing for a while now, so your IO and network go through PV even with an HVM instance. SRIOV/"Enhanced Networking" is even better than PV networking drivers, so HVM has another win there if you have NICs that support it (the larger Amazon instances do).

PV is significantly slower than HVM at system calls on 64-bit hardware, because the x86_64 spec removed two of the four CPU protection rings, one of which was where PV lived on 32-bit architectures - below the guest, allowing it to 'intercept' those system calls. Now that it shares a ring with the guest, it cannot intercept them, and you are left with double the number of context switches you had previously. Virtualization extensions from Intel and AMD allow HVM to bypass this and skip the context switch. Other hardware advantages like EPT also give HVM an edge when it comes to memory-related performance.

About the only area where PV is still better from a performance standpoint is interrupts/timers.


Amazon rebooted lots of PV guests. Presumably they colocate HVM and PV guests on the same box. If there were any HVM guests on the box, then there could be the possibility of an attack. (I guess they could forcibly kick off the HVM guests, but that wouldn't be very nice.)

Why shouldn't Netflix be running on HVM?


>Why shouldn't Netflix be running on HVM?

HVM, at least in the past, had a bunch more code that the guest DomU interacts with vs. fully PV guests. This has security implications.

Now, my knowledge of HVM is a few years - or more like half a decade - out of date. For example, I don't even know how to force an HVM guest to only use PV drivers (which would solve 90% of the problem), and I know that more and more of this has moved into hardware, so it's possible that what was true five years ago is not true now. But... yeah, I don't let untrusted users on HVM guests, for the same reason I don't let untrusted users use pygrub or load untrusted kernels directly.


Most people agree HVM is the way to go on EC2:

http://www.brendangregg.com/blog/2014-05-07/what-color-is-yo...

Forcing an HVM guest to use only PV drivers sounds like PVH, which is coming in Xen 4.4:

https://blog.xenproject.org/2014/01/31/linux-3-14-and-pvh/


HVM guests at Amazon will default to using PV drivers for IO and networking. (Unless using SRIOV/"Enhanced Networking", which will not use the PV drivers)

PVH is actually PV on top of an HVM container and is a bit different. You can think of it as PV sitting on top of enough HVM bits to take advantage of the hardware extensions Intel and AMD have invested so heavily in while still being majority PV. This gives you the best of both worlds, including the remaining PV performance benefits related to interrupts and timers that PV drivers on HVM can't utilize.


hm. thanks. Yeah. I need to spend more time on this as I update my stuff.


For network and storage at least, Amazon says that their HVM guests use PV drivers according to this page:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-...

On this page they don't say so specifically, but they mention the possibility in the "PV on HVM" section:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/virtualiz...

Perhaps/presumably it also depends on the AMI you're using.


All of the current generation of AWS instances (m3 type) are HVM.


Correct me if I'm wrong, but I don't think this is the case. In fact, I'm running PV on m3.medium instances (EDIT: and c3.large) right now. It depends on the AMI you use to launch the instance: some instance types only support HVM (e.g. t2.* or r3.*) or only PV (old-gen t1.* or m1.*), but some current-gen types, like m3.* and c3.*, support both. See the Amazon Linux AMI availability matrix for example: http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/


Correct, to have the instance itself running under HVM virtualisation, the AMI needs to have been registered with --virtualization-type hvm.
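
As a sketch with boto3 (my choice of client, not something the comment names; every value below is a placeholder except the virtualization type itself):

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")
  ec2.register_image(
      Name="my-hvm-ami",                 # placeholder name
      VirtualizationType="hvm",          # vs. "paravirtual" for PV AMIs
      RootDeviceName="/dev/xvda",
      BlockDeviceMappings=[{
          "DeviceName": "/dev/xvda",
          "Ebs": {"SnapshotId": "snap-0123456789abcdef0"},  # placeholder
      }],
  )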


My PV guests were not rebooted, but some others' were.


I'd love to see Netflix let the chaos monkey loose on their PostgreSQL servers.


I wonder if their chaos system causes network partitions as well as node failures.


The total toolkit they have is called the Simian Army -- https://github.com/Netflix/SimianArmy

They wrote a blog post about it here: http://techblog.netflix.com/2011/07/netflix-simian-army.html

I don't think they have one that introduces network partitions, but inside a datacenter, network partitions are rare.


Inside a single data center they're rare, but when running on AWS in multiple availability zones and regions they're a fact of life which you should be designing your systems to deal with. Given that, it would be great to have automated tooling to simulate them and keep everyone on their toes.


Ever lost a switch? Rare, but it happens regularly enough that you should prepare for it. Even if you use bonding/teaming, with dual-power-supply network kit + servers, it can still happen.


> ...but inside a datacenter, network partitions are rare

It's ... complicated.

http://aphyr.com/posts/288-the-network-is-reliable


Yeah, there was an excellent ACM article with him and Bailis. I think if the network starts to partition, or fail, in a datacenter, that's the time to evacuate the datacenter / AZ. If a handful of machines fail, they should disengage. If more than, say, 5% of the machines in the DC are having reachability issues at any given point (in a modern DC that's like ~2000 machines), it's time to shut it down.


I've seen them happen with the systems I work on a few times over the past five years. It isn't something you should ignore.


Not AFAIK - last time I looked it requires one or more AWS autoscaling groups, and works by semi-randomly terminating instances that match configured criteria.



