Hacker News

Does anyone else cringe when someone suggests using NFS in production?

I can't be the only one that has been woken up at 2am because of an NFS outage.



Although skeptical about it going into the project, I've deployed a virtualized Oracle RAC environment over 10G NFS, and with some tuning it was stable and performant. If it's good enough for RAC, which has some of the most stringent latency/performance requirements I've seen, it's probably good enough for quite a few production use cases, although to be fair this was only an 8-node cluster.
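For flavor, the "tuning" for database files over NFS usually lives in the mount options. This is only a sketch of the commonly documented options for Oracle datafiles on Linux NFS clients, not the parent poster's actual configuration (the server name, export path, and mount point below are made up):

```shell
# Illustrative NFS mount for Oracle datafiles (hypothetical paths).
# hard + nointr: retry I/O forever instead of silently failing it;
# actimeo=0: disable attribute caching so cluster nodes see consistent
# file metadata; large rsize/wsize help throughput on 10GbE.
mount -t nfs \
    -o rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,vers=3,timeo=600 \
    filer:/export/oradata /u02/oradata
```

The key trade-off is actimeo=0: it sacrifices client-side caching for correctness, which is exactly what a multi-node database needs and exactly what general-purpose NFS deployments usually avoid.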


Some naive questions if you don't mind... (I'm really curious about RAC... my only DB experience has been small MySQL and MS SQL Server clusters.)

1) I thought the really high end DBs like to manage their own block storage. Your NFS comment suggests that the database data files were running on an NFS mount, and you had a 10 gig Ethernet connection to the file server.

2) What would you say is the average size of a RAC cluster (in your opinion)? Is 8 considered a small cluster in this realm?

3) DBs have stringent requirements when it comes to operations like sync. Can you actually get ACID in an NFS backed DB?

Thanks for satisfying my curiosity :)


Just to offer a little bit of information, we're currently running a 2 node RAC cluster. I'm not entirely sure about the storage mechanism though.


Does anyone else cringe when someone suggests using XYZ in production? I can't be the only one that has been woken up at 2am because of an XYZ outage.

XYZ could be NFS, SCSI, MySQL, Rails, KVM, ..., you get the idea. Any technology that has seen wide use has caused someone to be woken up at 2am because of an outage. NFS has been very widely used for a very long time. As a distributed file system developer who once helped design a precursor of pNFS, I think NFS has some pretty fundamental problems, but the fact that NFS servers sometimes go down is not one of them. Often that's more to do with the implementation and/or deployment than the protocol, and no functionally similar protocol would do much better under similar circumstances.

People get woken up at 2am because of SMB failures too. My brother used to get woken up at 2am because of RFS failures. Nobody gets woken up at 2am because of 9p failures, but if 9p ever grew up enough to be deployed in environments with people on call, I'm sure they'd lose sleep too. EBS failures have bitten more than a few people.

Citing the mere existence of failures, rather than their rate relative to usage, isn't very convincing. I'd actually be more concerned about the technology on the back end of EFS, not the protocol used on the front.


Yes, I've had my fair share of issues with NFS too.

It works most of the time, but the problems start whenever it feels like not working, and you can never be sure when the next tantrum will come.

On Linux with nfs-kernel-server, most of the time you need a restart if a problem occurs. And I don't like restarts.

A few years ago, slabtop helped me troubleshoot some memory issues; it turned out nfs-kernel-server was leaking kernel memory. I had to upgrade the kernel.
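For anyone who hasn't used it: this kind of leak shows up as a slab cache whose object count only ever grows. A quick diagnostic sketch (cache names vary by kernel, and reading /proc/slabinfo typically requires root):

```shell
# One-shot snapshot of the largest slab caches, sorted by cache size.
slabtop -o -s c | head -n 15

# Or without slabtop: list the top caches by active object count.
# /proc/slabinfo has two header lines; field 1 is the cache name,
# field 2 the number of active objects.
awk 'NR > 2 { print $2, $1 }' /proc/slabinfo | sort -rn | head -n 10
```

Watching those counts over a few hours makes a leaking kernel module fairly obvious.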

The shame is, there is nothing like NFS to replace NFS: easy to deploy, easy for clients, works everywhere.


I can't say I have any problems with NFS - we use it for shared storage on some pretty busy servers without any issue. I'm not saying they don't happen - just that we don't experience them. I'd be interested to hear the problems you've encountered - did you submit bug reports for them that you could perhaps link to?


Maybe it only speaks the NFSv4 protocol, but the implementation is different?


At a previous workplace, we had a pretty beefy VMware setup backed by NFS. Performance was excellent, and file-level access offers a lot of functionality you can't have with, for example, iSCSI.


That sounds like an issue with the implementation, not the protocol. There are countless large environments I know of running NFS in production on NetApp without any issues at all.


I'd guess this is some sort of clustered solution with failover.



