Firebase outages and misleading status reporting (medium.com/scosta)
143 points by sauldcosta on Oct 13, 2018 | 49 comments


AppEngine had the same problems - seemingly every week some component of the service would be down for some non-negligible amount of time (laughably it was often search -- we're talking about Google here).

I've generally found AWS more reliable than GCP - even when GCP isn't having downtime, you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

AWS has had multiple instances of cascading EBS backplane failures, but outside of that I've found their core services pretty reliable -- 400+ days of uptime on a lot of VMs in systems I've worked on -- I avoid EBS when I can.

My advice is to keep your stuff simple - PaaS might seem attractive, but, as you mention, you have very little control when something goes down. Embrace multi-cloud by using the lowest common denominator of tech available - virtual machines, DNS, networking, and instance storage if that suits your needs. Treat VMs as disposable - and make sure you have system, service, and data redundancy at that level to survive the failure of an entire availability zone across your application.


AppEngine had some big failures early on, but I (and some friends) built a $$$$$$$$$$$$ company on AppEngine (and GCP) and couldn't have done it without it. The stability the last few years has been extremely good. Our base logic was that we trust Google to hire and train talented DevOps more than we can do it and it sure sucks carrying a pager.


Snapchat?

If your app is that big, someone is always carrying a pager for when there are problems. The difference is on PaaS, you can't do a damn thing about it if it's a problem with the platform.

I've helped multiple companies get off of app engine because even for companies losing money (startups), it's too unreliable -- and actually very slow (datastore) if your app is relational. Also, it's very very expensive if you hit the datastore hard.


Not quite, but first MVP in 3 months and $80m gross revenue in the first year. Selling t-shirts. We did it with 3 engineers, no devops or qa teams and definitely no pagers. We had zero downtime and the very rare bugs were fixed on the next push to master (CI/CD) and real testing.

I'm not saying the datastore is perfect, but using the datastore has well known and predictable limitations that need to be engineered for. It is definitely not something you can RTFM later on. Just like any database to be honest. It is not a relational database. It doesn't do aggregations. It is for storing data (using Objectify [0]) and memcache is for caching that data.

[0] https://github.com/objectify/objectify


>you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

No matter which cloud platform you're using you should do this[1]. I'm not familiar with the GCP SDK but I know the AWS SDK has it built in[2]. If you're not using the SDK then you have to build it yourself. There will always be a small percentage of transient errors due to the network, DNS, timeouts, hardware failure, etc.

[1] This is a blanket generalization, there are some situations where you shouldn't use the backoff/retry pattern even for retryable errors.

[2] https://docs.aws.amazon.com/general/latest/gr/api-retries.ht...
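
For anyone building it themselves, the backoff/retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not the AWS SDK's implementation: `TransientError`, `call_with_retries`, and all parameter names and defaults are hypothetical, and it assumes the caller has already mapped retryable failures (such as a 503) to `TransientError`.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures, e.g. an HTTP 503."""


def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff.

    The delay before each retry is drawn uniformly from
    [0, min(max_delay, base_delay * 2**attempt)], so concurrent clients
    do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Only wrap calls that are safe to repeat; retrying a non-idempotent write can apply it twice, which is one of the situations footnote [1] alludes to.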


Pretty amazing how virtualization has come to the point that we even need to virtualize our reliance on cloud vendors across multiple vendors to ensure reliability.


I'll say, I've only really done multi-cloud for cases where I liked different products on different clouds -- the main app and data on AWS, but using Google's data stack (BigQuery >> Redshift imho).

In terms of reliability, I think the first step is multi-region -- being able to failover to another region should your primary region have major failage. But assuming you can do that, doing multi-cloud for the same thing shouldn't be so hard provided you have some sort of common open source runtime to run on both platforms.
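
As a rough sketch of the failover idea above: keep an ordered list of regions and route to the first one that passes a health check. The function name, the region names in the usage note, and the `is_healthy` callback are all hypothetical; a real setup would typically do this at the DNS or load-balancer layer rather than in application code.

```python
def first_healthy(regions, is_healthy):
    """Return the first region, in priority order, that passes is_healthy.

    regions    -- iterable of region identifiers, primary first
    is_healthy -- caller-supplied callable returning True or False
    """
    for region in regions:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")
```

For example, `first_healthy(["us-east-1", "us-west-2"], check)` returns the secondary region whenever `check` reports the primary as down.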


Appreciate the thoughtful reply and summary of experiences with GCP vs AWS.


Retrying localizable errors (such as 503 Service Unavailable) should be universally practiced in any RPC scheme. Nobody can make a backend that's 100% reliable.


AWS suffers from the same status page gaslighting though.


We had the same exact problem with Firebase Realtime Database. Our product uses it heavily and is dependent on its latency, so we notice any time an issue appears.

The unacceptable thing is: not only are outages fairly common, but many smaller, briefer outages and disruptions are not even reported. For example, the day after the 2-hour outage mentioned in the article, there was an issue where writes to the database seemingly succeeded, but the clients listening for changes did NOT receive the notification that the data they were observing had been updated, for more than 30 minutes. It wasn't reported on Firebase's status dashboard.

Google bought Firebase back then, and to replace Firebase Realtime Database, Google developed Firebase Firestore (now in beta). I suspect that Firebase Realtime Database isn't receiving much attention these days and that the service will be closed after some time.


Have to say, having worked in a huge organization with multiple clients accessing services, I much prefer the firebase solution. You still have downtime in any polyglot solution and the problem is pretty clear here (it's firebase database, not one of dozens of legacy layers...). When you own the entire stack it is amazing how much of the organizational effort goes into obscuring who is responsible. And the stack is much more opaque.

It really is possible to design a system around firebase with a much smaller team. You give up control but control is a myth anyway. And, Firestore is actually designed to support offline mode, so wonder if they neglected to design for that feature which might help here.

The unfortunate reality is that we are in a moment where Firestore is beta and Firebase Database is not supported as it should be. Google should do a better job of helping people to migrate and explaining the roadmap. I imagine the writer of this article just doesn't have as much company clout to get that level of involvement from Google. This was probably an attempt to get that attention that other higher paying clients can get.


If you need to build a product that relies heavily on real time updates, I would look into using Elixir and Phoenix.[0] They nailed the channel abstraction which is the main entry point for realtime communication over websockets. It takes me hours to make scalable realtime applications in what would normally take me days using other systems. The language may take some time to get used to, and the ecosystem isn't as mature as other languages, but what is there is incredibly impressive.

[0]: https://phoenixframework.org/


Firebase does a lot more, including a slew of Auth options that make life much easier.

Add to that the ability to resolve connections dropping out (common on mobile) and that their libraries have been ported all over the place, and Firebase is a de facto answer for mobile developers. It can be up and running in less than 30 minutes for someone who has 0 experience in cloud development.

It is hard to replicate that.


The common use cases for firebase can be easily reproduced with Phoenix. Phoenix also comes with a handy presence feature that allows you to track whether someone is currently using the product. (Think of showing which users are present in a chat room.)

I understand the skepticism, but I would highly suggest taking a look and playing around. It's really, really good plus you get to fully own everything you build ;)


Firebase has this presence feature as well.

Also, "fully owning" everything isn't a selling point for everyone. Some don't want to own the uptime, manage infrastructure devops etc. Serverless/managed services have their use cases.

Small teams, individual developers, bootstrapping an app quickly, running a web app with no servers to manage... Those are often much more valuable capabilities than being able to reimplement functionality already available to you at very low cost.


Firebase is really awesome. However, these kinds of reliability issues and the lack of integrity and communication with which Google handles such things are major reasons I would avoid committing to it. On top of that, Google's history of overlapping products (Firebase or Firestore?) and of discontinuing products or dragging its feet on support makes decisions confusing and commitment harrowing.

Amazon on the other hand has a history of committing to clear product direction which makes committing to their platforms much easier. Amplify and AppSync for instance feel like safer choices.


The Amplify and AppSync models are also architecturally more scalable as you don't have one big opaque DB and endpoint in a single region.


I stopped using the realtime database once firestore was released in beta. So haven't experienced the downtime you have demonstrated in the status graphs, but Firebase's SLA [1] for realtime database apparently guarantees service credit for monthly uptime less than 99.95%. To corroborate your observations, check if you received this credit:

Less than 99.95% but equal to or greater than 99.0%: 10% credit

Less than 99.0%: 30% credit

[1] https://firebase.google.com/terms/service-level-agreement/
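
To see which tier a given month lands in, here is a small sketch of the arithmetic. It uses only the thresholds quoted above and assumes a simple "minutes down out of minutes in the month" definition of uptime; the actual SLA [1] defines downtime in its own terms, so treat this as an approximation. Function names are made up for illustration.

```python
def uptime_percent(downtime_minutes, days_in_month=30):
    """Monthly uptime as a percentage, given total minutes of downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit_percent(monthly_uptime_percent):
    """Map monthly uptime to the credit tiers quoted in this comment."""
    if monthly_uptime_percent >= 99.95:
        return 0
    if monthly_uptime_percent >= 99.0:
        return 10
    return 30
```

Under these assumptions, 99.95% of a 30-day month allows about 21.6 minutes of downtime, so a single 2-hour outage (roughly 99.72% uptime) would already fall into the 10% tier.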


Is Firestore a more reliable version of Firebase's real-time database?


It's a different database altogether - document-oriented at that.

https://firebase.google.com/docs/firestore/rtdb-vs-firestore


Check out the AWS offerings (Amplify + AppSync) if you're rolling off Firebase: https://aws-amplify.github.io https://docs.aws.amazon.com/appsync/latest/devguide/welcome....


Amplify+AppSync client SDK support is pathetic compared with Firebase. No official support for Flutter, Xamarin and Unity apps.


I never heard of these offerings. Thanks for mentioning them!


Not really convinced firebase is “covering it up”.

The official status page breaks down availability by-service with descriptions of each outage and updates with timestamps.

https://status.firebase.google.com


From the article: "When I tell our customers something is wrong outside of our control"

I think this is both the issue with the article, and the issue with Firebase (ironically).

First of all, it's an issue with Firebase. All software will break. This is inevitable. It's just a matter of time. Well engineered software/infrastructure gives you, the consumer, tools to mitigate this so your consumers never see it. If we look at Amazon, they expose AZs and Regions; well architected applications use these failure domains to accept that an AZ, and possibly even a region, will fail. So you can do failover.

Firebase really doesn't expose these primitives, in an effort to be simple and easy to use. Maybe they're doing something in the backend to use them, but the proof is in the pudding; if their stability is bad, it means they're not doing a good enough job at abstracting away these unavoidable failure domain principles.

Which brings us to the second problem: It's Always Your Fault. Stop trying to pass blame to Firebase. Your customers, seriously, full stop, unequivocally, no exceptions, do not care that Firebase caused you to go down. They care that you went down. You don't get to say "it's not our fault!"

Because It's Always Your Fault. It's your fault that you chose Firebase. It's your fault that you chose a service which doesn't expose core failure domain primitives that you can engineer to support. It's your fault for not getting off Firebase when you recognize these core issues with the platform.

Firebase's status page is for you, the engineer, to understand and diagnose issues. It's for you to interpret and surface on your own status page. It's not for you to link to your customers and say "see that red dot? that's why we went down."

And by the way, yes: even perfectly architected applications on AWS/Gcloud/Whatever, failing over across AZs and Regions, can go down due to things outside of your control. AWS ain't perfect. Remember: All software breaks. But when you word that to your customer, You Always Take The Blame. Period. This is what "it's always your fault" means; it's not about saying that there are ways to write an application that never breaks. It's about accepting that when (not if) it does break, your customers will blame you, so you need to accept that blame wholly.


> The official status page breaks down availability by-service

That’s part of the problem, actually. I’ve noticed for years that some Firebase service disruptions go unreported, and it was clear that reporting individual services was a way to avoid showing the end-to-end summary. It doesn’t matter that all of Firebase’s servers are up and running, if the end-to-end service they provide isn’t working.


Firebase offers a variety of individual services, and most apps pick up only the services they need. So reporting service-by-service makes more sense.


That’s true, and beside the point. The problem isn’t reporting individual services, the problem is giving the impression that uptime for individual services equals uptime for Firebase as a whole.
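
A quick back-of-the-envelope illustration of why per-service uptime can overstate the whole. It assumes an app needs every listed service to be up at once and that their outages are independent; both are simplifying assumptions, and the numbers in the usage note are invented, not taken from Firebase's status page.

```python
from math import prod


def end_to_end_availability(service_availabilities):
    """Availability of an app that needs all of the given services at once.

    Each value is a fraction between 0 and 1; outages are assumed to be
    independent, so the combined availability is the product.
    """
    return prod(service_availabilities)
```

Three services that each report 99.9% give `end_to_end_availability([0.999, 0.999, 0.999])` of about 0.997, i.e. roughly three times the downtime of any single service.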


I just built a service/website last weekend using App Engine and Firebase. After reading these comments and this blog post, I think I might migrate it over to AWS. I didn't realize that Firebase was so unreliable.


Firebase RTDB is basically the legacy product that barely works. Firestore is the post-acquisition product built on Google tech. It’s a rotten situation. I noticed the outage mentioned in this post because it took down Ford GoBike (and Citibike and all the other Motivate/Lyft bike share systems).


I've been thinking about implementing Firebase as part of Polar: https://getpolarized.io/

The idea is that you upload your documents (PDF, HTML, etc.) into Polar, tag them, and then we sync them to the cloud. Then when you go to another machine, like at work or at home, your documents are always synchronized.

At first I fell in love with Firebase and was very very excited to start implementing it.

They've spent a ton of time working on the initial implementation experience.

Their Firebase Auth support was amazingly simple to set up. Same with Firebase hosting. It's top notch. You can be up and running with CDN hosting with SSL in like 2 minutes, and the firebase tools are exceptional.

Cloud Firestore seems really interesting and easy to set up. It's basically designed for 'apps', i.e. user-facing apps, and works pretty well if all the data is private to the user.

I do struggle with these issues of reliability though. At Datastreamer (http://www.datastreamer.io/) we use Hetzner and have about a half petabyte stored there.

It's a blog content search engine which we license to other startups so high availability is critical.

Their infra is amazingly reliable. Very very happy here.

The problem of course is that you then have to manage your own software stack which of course requires extra effort on your part.


How does firestore fare in terms of reliability? I heard it is a cleaner and more scalable version of Firebase.


I think that PaaS and BaaS where you don't have access to the back end is a dead end. It's going to go the way of Windows Server. Open source solutions will always win in the end when it comes to developers.


People also balked during the transition away from FTP. SSHing into servers is precisely the thing you want to get away from whether it's to change code or to hotfix your nginx.conf or to do a quick apt-get install.

Doesn't mean that we don't need SSH ever, but 99% of the time it's something we use because we're too lazy to setup automation.

I reckon you're using open-source here to mean self-hosted, but that doesn't really change anything. For example, the reason every small company I've worked at didn't have a way to analyze their logs/stderr and coincide them with other events for debugging was because they didn't, not because they couldn't.


"FTP -> SSH -> proprietary console" does not look like evolution over a gradient of control to me. I don't understand why you're comparing FTP to SSH when SSH is lower level than FTP. FTP "throw it on the server and let mod_php deal with it" deployments were decidedly higher level than SSH-based ones. FTP deployments were often coupled with GUI-based steps, for instance database migrations run from a Drupal web app.


You are always going to have to rely on other service providers for critical things - networking, power, etc. I don’t think there is going to be some massive move for every business to be in control of every aspect of their supply chain. It simply isn’t feasible.


It's different. If a business has the option to do something themselves and doing so would cost them less in the long run and give them more flexibility, then doing it themselves is a competitive advantage.

If having solar panels becomes consistently cheaper than buying electricity from the grid (per megawatt), then individuals and businesses will all switch to solar panels... Especially if the business uses a lot of electricity.

The main reason that PaaS solutions are popular now is advertising and hype. It's a bubble.


That is interesting, because the advice I always hear for businesses is to keep in house their core business and contract/outsource everything else. That 'bring everything in house' strategy only works for the biggest companies that have enough scale.

For almost everyone else, the cost of providing a profit margin to the contracting company is dwarfed by the savings you get from the economy of scale the contractor is able to provide.

Really, this is the microeconomics version of the ideas behind free trade. It is better to produce what you are best at and trade for the rest.


I think that any business which hires developers in-house should consider their software systems as being their core business. That's where the competitive edge comes from.

Most companies who use BaaS or PaaS these days already hire developers and sometimes even DevOps engineers; for them it doesn't make sense to outsource huge parts of their software. Some open source systems just work really well.

For example, I tried several times to launch a business around my current popular open source project but it hasn't worked so far; the problem I have is that companies only need consulting for a short while at the beginning to adapt the system to their use case and then it just works perfectly so they don't need me anymore.

When I reach out to a previous customer who didn't contact me in months, it's very common to hear that the system has been running perfectly without any issues at all. Even a couple of companies which have millions of users. Most of them never needed any consulting at all. So yes, it's much cheaper for a company in the long run to self-host in many cases.


>> keep in house their core business and contract/outsource everything else

That makes sense, except we'd need some kind of working definition for "core business", since that's not necessarily self-evident.

Some businesses seem to think that the management of their brand is their only core business, and everything else is fair game for outsourcing.


I think that now, Firebase is built on Google Cloud Datastore. I have used Datastore in production since 2015, and have had no outages, but if I had to do it again I think I'd go with a normal RDB, just because query support is extremely limited (no full text search) and because of "schema change == data rebuild" issues.


Do you mean that you’d use something like a managed Postgres AND build and run a backend service that interfaces a web client to that database?


yeah exactly.


yeah, we have suffered too. Initially we were using firebase Real Time DB for authentication as well as delivering messages. Messages suffered outages every now and then (and we suffered more cos our backend is in Python Django and Pyrebase comes with its own set of issues on top of Firebase). When we found out messages weren't being delivered, we switched to pusher as a backup first and then to websocket. Now we use Firebase only for authentication (via real time database) and Notification sending, and still have a backend/app trigger every time there is an error on firebase.

I have always wondered what a reliable backup to the realtime db could be. Haven't found much to date.


AWS AppSync and Amplify


Serves you right for using a “real-time database” (whatever that is). I’m sure your chat product feature could have been designed using a flat file as a datastore and a simple web socket server.


Please don't be a jerk on Hacker News. The idea here is: if you have a substantive point to make, make it thoughtfully; if you don't, please don't comment until you do.

https://news.ycombinator.com/newsguidelines.html


> Serves you right for using a “real-time database” (whatever that is).

I see this is flagged, but FWIW, you might want to actually learn something about what they mean by “realtime database” because it’s incredibly useful, and people using Firebase aren’t the only people who think so.

https://en.m.wikipedia.org/wiki/Real-time_database

Firebase is also easy to use and scales to large sites and complex applications, despite the complaints here about reliability, reporting and control, or lack of. A flat file and simple web socket server crumbles under loads that Firebase handles easily.



