Netflix open sources resilience engineering library (github.com/netflix)
148 points by jondot on Dec 1, 2012 | hide | past | favorite | 34 comments


I used roughly the same philosophy recently at work in dealing with a library that could block for intolerable periods of time (sometimes indefinitely). Basically, I used a pthread_mutex_trylock and a pthread_cond_signal to pass data to a worker thread that interacted with the library. If the worker thread was still holding the mutex when the main thread wanted to send data to it, I simply said "f*ck it" and generated an error, rather than queueing up requests or locking up the system. The reason it was okay to do this is that the library calls were there to gather non-critical metrics, and calls into the library were fairly infrequent.

(I just drank an entire bottle of wine, so please excuse any apparent lack of reasoning, typing ability, or general coherence in the above comment.)
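A rough Java analog of that fail-fast hand-off, using ReentrantLock.tryLock in place of pthread_mutex_trylock (class and method names are made up for illustration; the worker loop and condition signaling are omitted for brevity):

```java
import java.util.concurrent.locks.ReentrantLock;

// Fail-fast hand-off to a metrics worker: if the worker is still busy
// (holding the lock), drop the sample instead of blocking or queueing.
class MetricsHandoff {
    private final ReentrantLock lock = new ReentrantLock();
    private String pending; // slot the worker thread would read

    // Returns false ("generate an error") when the worker holds the lock,
    // mirroring the pthread_mutex_trylock approach described above.
    boolean tryOffer(String sample) {
        if (!lock.tryLock()) {
            return false; // worker busy: give up rather than block
        }
        try {
            pending = sample;
            return true; // a real version would signal a Condition here
        } finally {
            lock.unlock();
        }
    }
}
```

The key point is the same in both languages: the producer never waits, so a stuck worker can only cost you dropped metrics, never a hung main thread.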


This is a great set of reading materials. It is great to find out how teams are building in resiliency at the micro level. That being said, I would never actually use this project for building things. IMHO: Libraries > Frameworks.


Glad to hear the documentation is thought-provoking - it was intended not just to explain the project but also to communicate what we've learned operating at scale and how the Hystrix library is used by Netflix.

I also agree that libraries are preferable to frameworks, and Hystrix is in fact just a Java library that can be used as little or as much as one wishes.

It purposefully tries to have a small number of dependencies, so it should be easy to pull in without significant impact.

@benjchristensen


I've had the "pleasure" of using the predecessor of this library at work.

It feels remarkably similar to code written by someone who's just finished reading the GoF book...


In the 10 minutes I spent glancing at the code I got a bit lost in the abstractions, which is why I decided it's worth a deeper look later, but I wouldn't say it's over-engineered.

Can you elaborate a bit?


My problem with this library (well, actually the internal version; I haven't looked much at the code on GitHub) is that it was written from the perspective of an edge service - the Netflix API. The needs of that system are quite different from the needs of middle-tier systems.

The API has a hell of a lot more surface area but is trivial in complexity compared to other systems at Netflix, and therefore this library has some huge gaps in design.

The two biggest issues IMHO are putting the throttling/fallback handling at the outermost edge of an external service rather than at the lowest level (i.e. the actual RPC), and a very C/errno-like method of handling errors.

I'm also quite unhappy with the API. It requires creating boilerplate classes to implement the commands. Yuck. A bit of magic with annotations or code gen would've been much cleaner and much less prone to errors caused by programmer fatigue or boredom.


Thanks - you expressed the uneasy feeling I had more eloquently than I could have. Between the library's architecture, the (IMHO) architectural inversion, and the (let's be honest) stilted writing style in the documentation, this project gives me an "I'm just out of college" feeling, which in turn usually makes me run like hell.


Thanks


So it's a wrapper that aborts an entire request if an API call made by the code processing that request times out? What am I missing?


It's not revolutionary, but it does a lot of things right and has more features than what you mentioned above:

* maintains a semaphore of currently processing requests, and will immediately fail a request if the semaphore is full.

* tracks service latency and other statistics for you

* maintains a circuit breaker to immediately fail requests if the breaker is open (based on statistics)

* watches for the health of the 3rd-party service to be restored and allows requests through again.

* built in request isolation via threadpools

* built in request collapsing and caching

Overall I like this because it's a great demonstration of how to think about, and engineer for, failure in a distributed system.
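The semaphore and circuit-breaker behavior in that list can be sketched in a few lines of plain Java. To be clear, this is an illustration of the concepts, not Hystrix's actual API; the class, threshold, and concurrency cap are all made up:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy sketch: reject immediately when too many calls are in flight, and
// trip a breaker after consecutive failures so later calls fail fast
// (to the fallback) without touching the backend at all.
class ToyCommand<T> {
    private final Semaphore inFlight = new Semaphore(10);      // concurrency cap
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private static final int TRIP_THRESHOLD = 5;               // made-up number

    T execute(Supplier<T> backendCall, Supplier<T> fallback) {
        if (consecutiveFailures.get() >= TRIP_THRESHOLD) {
            return fallback.get();          // circuit open: short-circuit
        }
        if (!inFlight.tryAcquire()) {
            return fallback.get();          // semaphore full: reject now
        }
        try {
            T result = backendCall.get();
            consecutiveFailures.set(0);     // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures.incrementAndGet();
            return fallback.get();
        } finally {
            inFlight.release();
        }
    }
}
```

The real library layers statistics windows, timeouts, and thread-pool isolation on top of this basic shape, but the fail-fast skeleton is the same.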


Automatic fail fast based on monitoring sounds like an awesome alternative to having latency cascade through the system and then scrambling to "fix" the problem by reducing client timeouts.


I agree, though my first thought is that many things work better if you add a little randomness to the equation. Something like: abort X% of the time, where X is based on latency over an acceptable threshold. That should find a happy medium where a resource is at max capacity but not overloaded.

Of course this is all assuming you have some sort of useful redundancy.
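That shedding idea might look something like the following sketch; the thresholds and the linear ramp are made up for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Probabilistic load shedding: the further observed latency is past an
// acceptable threshold, the larger the fraction of requests we abort.
class LatencyShedder {
    private static final double ACCEPTABLE_MS = 100.0; // assumed threshold
    private static final double FULL_SHED_MS = 500.0;  // abort 100% at/above this

    // Fraction of requests to abort, rising linearly from 0 to 1 as
    // latency climbs from ACCEPTABLE_MS to FULL_SHED_MS.
    static double abortProbability(double observedLatencyMs) {
        double over = observedLatencyMs - ACCEPTABLE_MS;
        if (over <= 0) return 0.0;
        return Math.min(1.0, over / (FULL_SHED_MS - ACCEPTABLE_MS));
    }

    static boolean shouldAbort(double observedLatencyMs) {
        return ThreadLocalRandom.current().nextDouble() < abortProbability(observedLatencyMs);
    }
}
```

Because the abort rate scales continuously with latency, the system settles near whatever throughput keeps latency at the threshold, instead of oscillating between fully open and fully closed.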


That's an interesting thought and probably something that would make a good addition to the library.

It could be part of the default library or perhaps a custom strategy for the circuit breaker (once I finish abstracting it so it can be customized via a plugin: https://github.com/Netflix/Hystrix/issues/9).

At the scale Netflix clusters operate at, they basically get this randomness already, because circuits open/close independently on each server (no cluster state or decision making).

In this screenshot https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... you'll see that in a cluster of 234 servers, about 1/3 of them are tripped and the rest are still letting traffic through.

Thus the cluster naturally levels out at however much traffic the degraded backend can handle, as circuits flip open/closed in a rolling manner across the instances.

Also, doing this makes sense even when a dependency doesn't have a useful redundancy and must fail fast and return an error.

It is far better to fail fast and let the end client (such as a browser, iPad, PS3, Xbox, etc.) retry and hopefully reach the 2/3 of instances that are still able to respond, rather than let the system queue (or go into server-side retry loops and DDoS the backend), fail, and not let anything through.

We obviously prefer to have valid fallbacks, but many dependencies don't, and in those cases that is what we do: fail fast (timeout, reject, short-circuit) on instances that can't serve the request and let the clients retry. In a large cluster of hundreds of instances, a retry almost always gets a different route through the instances of the API and backend dependencies.

@benjchristensen


thanks for posting, those insights/experiences with this architecture helped me understand some of the design decisions.

also, huge thanks to you and your team (and your employer!) for releasing an amazing volume of production-quality open source projects this year.


taligent seems to be hellbanned. Here's his response:

https://github.com/Netflix/Hystrix/wiki/How-it-Works It's more than just aborting a request. It also handles caching with different fallback modes, collapsing multiple requests, monitoring and most importantly it provides a consistent model for handling service integration. Probably not the most innovative thing in the world but definitely useful if you do have a large SOA system.


Weird, his post is showing up as not dead for me. Looking at his comment history, all of the comments since three days ago are dead except this one (which is the most recent at the moment).


I was reading it trying to tease out its core purpose as well. I reached a similar conclusion. The only piece that seems incremental is the way it remembers past performance and will fail fast if the service is down or failing, rather than keep sending every request to the downstream service. Like client-side back-off logic or throttling.


Right. This behavior, and a few other things, is nice, but I'd be hesitant to say it belongs in its own library. If this kind of baroque architecture (and the stilted writing demonstrated by the documentation) is de rigueur at Netflix, I don't think I'd ever want to work there.


Care to elaborate? I think Amazon does it similarly, so what's so baroque about it?


I agree that the behavior Hystrix provides is a Good Thing, but I'm not convinced that packaging up this behavior into a separate library and open source project is architecturally elegant. Instead, the base-level web services client library should provide this functionality. I'm wary of adding too many modules and too many dependencies to a system.


At a basic level, yes, it is to allow "giving up and moving on," and that is the point. There is more to it than that, but basically it allows failing fast and, if possible, degrading gracefully via fallbacks, instead of queueing requests on a latent backend dependency and saturating the resources of all machines in a fleet - which then causes everything to fail instead of just the functionality related to the single dependency.

The concepts behind Hystrix are well known, and written and spoken about by folks far better at communicating these topics than I am, such as Michael Nygard (http://pragprog.com/book/mnee/release-it) and John Allspaw (http://www.infoq.com/presentations/Anomaly-Detection-Fault-T...). Hystrix is a Java implementation of several of these concepts and patterns in a manner that has worked well and been battle-tested at the scale Netflix operates.

Netflix has a large service oriented architecture and applications can communicate with dozens of different services (40+ isolation groups for dependent systems and 100+ unique commands are used by the Netflix API). Each incoming API call will on average touch 6-7 backend services.

Having a standardized implementation of fault and latency tolerance functionality that can be configured, monitored, and relied upon for all dependencies has proven very valuable, instead of each team and system reinventing the wheel with different approaches to configuration, monitoring, alerting, etc. - which is how things were before.

As for incremental backoff - I agree that this would be an interesting thing to pursue (which is why a plan is in place to allow different strategies to be applied for circuit breaker logic => https://github.com/Netflix/Hystrix/issues/9) but thus far the simple strategy taken by tripping a circuit completely for a short period of time has worked well.

This may be an artifact of the size of Netflix clusters and a smaller installation could possibly benefit more from incremental backoff - or perhaps this is functionality that could be a great win for us as well that we just haven't spent enough time on.

The following link shows a screen capture from the dashboard monitoring a particular backend service that was latent:

https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... (from this page https://github.com/Netflix/Hystrix/wiki/Operations)

Note how it shows 76 circuits open (tripped) and 158 closed across the cluster of 234 servers.

When we are running clusters of 200-1200 instances the "incremental backoff" naturally occurs as circuits are tripping and closing independently on different servers across the fleet. In essence the incremental backoff is done at a cluster level rather than within a single instance and naturally reduces the throughput to what the failing dependency can handle.
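The cluster-level arithmetic behind this is trivial but worth making explicit (a sketch; class and method names are hypothetical):

```java
// With circuits tripping independently per server, throughput to a
// degraded backend is roughly the closed fraction of the cluster: in
// the screenshot above, 158 of 234 circuits closed means about 68% of
// traffic still flows to the backend.
class ClusterBackoff {
    static double servingFraction(int serversTotal, int circuitsOpen) {
        return (serversTotal - circuitsOpen) / (double) serversTotal;
    }
}
```

As more circuits trip or close, that fraction moves smoothly, which is why the cluster as a whole behaves like incremental backoff even though each instance only has a binary open/closed breaker.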

Feel free to send questions or requests at https://github.com/Netflix/Hystrix/issues or to me @benjchristensen on Twitter.


taligent your comment is [dead]


This has to be the most head-scratching hellban I've ever seen.


I've been seeing quite a few of those recently. I could only speculate why, but the trend is concerning.


What did I miss? taligent was banned from HN for his comment?


No, this one was dead on arrival, I'm just curious why. Lots of karma, nice posting history.. Nothing even sort of objectionable in his history, just a line of comments and then suddenly they're all dead.


..and now undead. The unwritten rules of this place confuse and concern me.


https://github.com/Netflix/Hystrix/wiki/How-it-Works

It's more than just aborting a request. It also handles caching with different fallback modes, collapsing multiple requests, monitoring and most importantly it provides a consistent model for handling service integration.

Probably not the most innovative thing in the world but definitely useful if you do have a large SOA system.


Reinventing Erlang in Java for investor's money?)

People who don't know CS are doomed to poorly reinvent Lisp or Erlang again and again..))

But why not, if someone pays for it..

And the whole idea of using Java for serving media content, while there is a specialized, well-engineered solution, created especially for this purpose in the telecom world, is such a brilliant management decision.. In Java we trust.)


What does predictable language advocacy flaming serve? Anyone who knows why Erlang or Lisp failed to gain widespread usage isn't going to be swayed by such an enormous assumption. Your time would be much better spent learning why few engineers use your language of choice and making Erlang/Lisp more competitive.


With due respect to the library, there are certain parts of it which are not supported out of the box in Erlang. On the other hand, building tools like what the library has is not going to take a long time.


What in particular jumps out at you? I don't know Erlang or this library well enough to pick it out right away, but I think it'd be an interesting thing to look at. Erlang is often cited, but probably not as widely used, so sometimes gets some 'magic properties' associated with it.


Erlang and lisp are not CS. They are engineering.


..which is impossible to do without a CS background.) It is a mutual-recursive process of evaluation (CS) and application (engineering).



