Architecting for Uptime (serverfault.com)
53 points by rudd on March 7, 2012 | hide | past | favorite | 14 comments


There are many more patterns for high availability than load balancing.

Distributed queues, pub/sub, and gossip protocols are just a few that come to mind.

In your example, you are using what is called a classic three-tier web architecture: a load balancer + stateless nodes + scalable storage. The most interesting part of an HA setup in a three-tier web architecture is the HA setup of the persistent storage component. It looks like you haven't figured out that piece yet and are waiting for a vendor (Microsoft) to solve it for you.

You can improve the availability of your persistent storage (MSSQL) in several simple (or not-so-simple) ways:

1. Use a SQL proxy load balancer (or a cluster setup) - the same load-balancer HA pattern you are already using.

2. Shard. You will scale writes and significantly reduce the probability of your system becoming completely unavailable.
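
To make item 2 concrete, here is a minimal sketch of hash-based shard routing (the shard names and the idea of keying on a user id are hypothetical, not from the article). Routing each key to a fixed shard means a single shard outage takes out only a fraction of users instead of everyone:

```python
import hashlib

# Hypothetical shard pool; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: str) -> str:
    """Route a key to a shard via a stable hash. The mapping is
    deterministic, so every web node agrees on where a user's data
    lives without any coordination."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Note that naive modulo routing reshuffles nearly every key when the shard count changes; consistent hashing is the usual refinement if you expect to add shards.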


Disk-device-level replication like DRBD, plus a failure-control framework like Linux Heartbeat, goes a long way toward providing an HA cluster for a database. Since the replication and failover happen at the device level, the solution works with any disk-based system, including databases.
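
A minimal sketch of what such a DRBD resource might look like (DRBD 8.x syntax; hostnames, disks, and addresses are hypothetical):

```
resource r0 {
  protocol C;                  # synchronous replication: writes ack on both nodes
  on primary-node {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on standby-node {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Heartbeat (or a successor like Pacemaker) then monitors the nodes and promotes the standby's DRBD device to primary on failure.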

In my experience, failure can be detected and switched over in seconds. With a gratuitous-ARP setup to share a virtual IP between the clustered servers, the clients won't even have to talk to a different IP in the case of failover. The only case the clients need to handle is retrying upon a failed connection, which they should already handle to cope with occasional network failures.
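
The client-side retry the comment describes can be sketched generically; the helper name and backoff parameters below are my own, not from any particular library:

```python
import time

def retry(fn, attempts=3, delay=0.01, exceptions=(OSError,)):
    """Call fn, retrying on transient failure with exponential backoff.
    Behind a virtual IP, a failover looks to the client exactly like a
    dropped connection, so this one retry path covers both cases."""
    for i in range(attempts):
        try:
            return fn()
        except exceptions:
            if i == attempts - 1:
                raise
            time.sleep(delay * (2 ** i))  # back off before the next attempt
```

A caller would wrap its connect call, e.g. `retry(lambda: connect(vip, port))`, and get failover handling for free.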


I found this to be an ironic read: "How to architect for uptime, from someone who hasn't quite figured it out themselves."


So should they not have posted at all or should they have used a different title?


If your article on how to architect for uptime ends with how the author's own system isn't architected for uptime, I don't think it should have been written. How do we know the advice applies once they actually institute the changes they talk about?


This is a really good point and I appreciate your viewpoint, but just because I wrote the article for/about Stack Exchange doesn't preclude me from having done it differently at a prior employer.


Talk about that, then? It would make it clear you're not just talking out your tush.


> It looks like you actually haven't figured out that piece yet and are waiting for a vendor (Microsoft) to solve this for you.

Grown-up vendors have had this solved for decades, literally. E.g. http://en.wikipedia.org/wiki/IBM_Parallel_Sysplex


A handy summary of different solutions:

PostgreSQL: Comparison of Different Solutions http://www.postgresql.org/docs/current/static/different-repl...

as well as the rest of chapter 25.


Excellent points! You're quite right that there are a lot more things we could do to improve high availability in the Stack Exchange environment. Unfortunately, I was hired well after the environment was designed, so any suggested changes would need to not only pass the rigors of review by the team but also be able to handle the load that stackoverflow.com generates. Changing pieces at this point would be nontrivial and difficult to get approval on if it meant that we'd affect performance of our main property.

That being said, I'm 100% in agreement with you that the two options you supplied would be wonderful additions to our environment.


Whether you're developing high-traffic web apps or not, I believe the practice of Continuous Integration has become a bare minimum these days.

The tricky parts of CI usually come up in test automation, archiving the builds, building for different environments (DEV, UAT, STAGE, PROD), and deploying immediately after a build (definitely hard if it involves a database and migrations).

Some CIs handle archiving the builds (Atlassian Bamboo comes to mind).

Some build systems handle multiple builds for different configurations in a fairly straightforward way (Maven Profiles, Rails to some extent).
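
As a rough illustration of the Maven approach, a `pom.xml` can declare one profile per target environment (the property name and hostnames here are hypothetical):

```xml
<!-- pom.xml fragment: per-environment configuration via profiles -->
<profiles>
  <profile>
    <id>uat</id>
    <properties>
      <db.url>jdbc:sqlserver://uat-db:1433</db.url>
    </properties>
  </profile>
  <profile>
    <id>prod</id>
    <properties>
      <db.url>jdbc:sqlserver://prod-db:1433</db.url>
    </properties>
  </profile>
</profiles>
```

Selecting the environment is then a build-time switch, e.g. `mvn -P uat package`.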

Automated deployment is still a bit tricky to set up.

Certain categories of test automation are hard to do in CI (typically around JS, UI code, and functional testing). This particular area requires a high level of discipline from the development team. This is one of the reasons why some teams/companies prefer to use GWT instead of pure JS, due to its known best practices and workarounds for "headless" front-end testing (or at least that was the case in the past; maybe the JS testing landscape has changed since then).


> The main thing one needs to think about when coding an app in a load balanced environment is that there’s no guarantee that request ‘n+1′ is going to land on the same server as request ‘n’, so you need to handle sessions in a centralized/db manner so that the cookies in the browser link you up to the proper session regardless of what server you hit.

Instead of doing it in a centralized/db manner, Beaker allows sessions to be distributed client side:

http://techspot.zzzeek.org/2011/10/01/thoughts-on-beaker/

It was one of the best things I learned last year.
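
The core idea of client-side sessions can be shown in a few lines: sign the serialized session with a server-held secret so any web node can verify the cookie without shared session storage. This is a minimal sketch of the concept, not Beaker's actual implementation; the function names and secret are hypothetical:

```python
import base64
import hashlib
import hmac
import json

# Shared by all web nodes; never sent to the client.
SECRET = b"server-side-secret"

def encode_session(data: dict) -> str:
    """Serialize and sign the session so it can live in the cookie
    itself; no centralized session store is needed."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_session(token: str) -> dict:
    """Verify the signature before trusting anything in the cookie."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered session cookie")
    return json.loads(base64.urlsafe_b64decode(payload))
```

The trade-offs: the client can read (though not alter) the session contents, cookie size limits apply, and revoking a session requires extra machinery - which is why Beaker-style client-side sessions suit small, non-secret session state.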


Could we have a tl;dr version please?


I just got their "we are overloaded" page, I shit you not. Is this part of a joke I'm lucky enough to be in on? :)



