Although I was skeptical going into the project, I've deployed a virtualized Oracle RAC environment over 10GbE NFS, and with some tuning it was stable and performant. If it's good enough for RAC, which has some of the most stringent latency and performance requirements I've seen, it's probably good enough for quite a few production use cases, although to be fair this was only an 8-node cluster.
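For a sense of what "some tuning" typically means here: Oracle's documentation recommends a fairly specific set of NFS mount options for database files, chiefly hard mounts and disabled attribute caching so all RAC nodes see a consistent view. The exact values below are a sketch of the commonly documented NFSv3 settings, not the specific configuration from this deployment, and they vary by Oracle version and NFS server vendor:

```shell
# Hypothetical /etc/fstab entry for an Oracle data-file mount over NFSv3.
# hard,nointr  - retry forever rather than return EIO to the database
# actimeo=0    - disable attribute caching (required for RAC shared files)
# timeo=600    - 60s RPC timeout, suited to a dedicated storage network
filer:/vol/oradata  /u02/oradata  nfs \
    rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0  0 0
```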
Some naive questions, if you don't mind... (I'm really curious about RAC; my only DB experience has been small MySQL and MS SQL Server clusters.)
1) I thought the really high-end DBs like to manage their own block storage. Your NFS comment suggests that the database data files were on an NFS mount, and you had a 10-gigabit Ethernet connection to the file server.
2) What would you say is the average size of a RAC cluster, in your opinion? Is 8 nodes considered small in this realm?
3) DBs have stringent requirements for operations like sync. Can you actually get ACID guarantees from an NFS-backed DB?
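The durability half of that question largely comes down to whether fsync() is honored end to end. On an NFSv3/v4 client, the kernel buffers writes and fsync() triggers an NFS COMMIT, which a spec-compliant server must not acknowledge until the data is on stable storage, so the same write-ahead-logging pattern a local database uses still works. A minimal sketch of that pattern (the function name and path are illustrative, not from any particular database):

```python
import os

def durable_append(path, record):
    """Append a record and force it to stable storage before returning.

    On a local filesystem, fsync() flushes to disk; on an NFS mount it
    issues an NFS COMMIT, so durability hinges on the server honoring
    COMMIT semantics (most enterprise filers do, via NVRAM).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # do not report the transaction committed before this returns
    finally:
        os.close(fd)

durable_append("/tmp/wal.log", b"txn-1 commit\n")
```

The failure mode to watch for is a server or export configured to acknowledge COMMIT (or unstable writes) before data is persistent; that silently breaks the D in ACID regardless of what the database does.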
Does anyone else cringe when someone suggests using XYZ in production?
I can't be the only one that has been woken up at 2am because of an XYZ outage.
XYZ could be NFS, SCSI, MySQL, Rails, KVM, ...; you get the idea. Any technology that has seen wide use has caused someone to be woken up at 2am because of an outage.

NFS has been very widely used for a very long time. As a distributed file system developer who once helped design a precursor of pNFS, I think NFS has some pretty fundamental problems, but the fact that NFS servers sometimes go down is not one of them. Often that has more to do with the implementation and/or deployment than with the protocol, and no functionally similar protocol would do much better under similar circumstances. People get woken up at 2am because of SMB failures too. My brother used to get woken up at 2am because of RFS failures. Nobody gets woken up at 2am because of 9p failures, but if 9p ever grew up enough to be deployed in environments with people on call, I'm sure they'd lose sleep too. EBS failures have bitten more than a few people.
Citing the existence of failures, other than proportionally to usage, isn't very convincing. I'd actually be more concerned about the technology on the back end of EFS, not the protocol used on the front.
I can't say I have any problems with NFS - we use it for shared storage on some pretty busy servers without any issue. I'm not saying they don't happen - just that we don't experience them. I'd be interested to hear the problems you've encountered - did you submit bug reports for them that you could perhaps link to?
At a previous workplace, we had a pretty beefy VMware setup backed by NFS. Performance was excellent, and file-level access offers a lot of functionality you can't get with, for example, iSCSI.
That sounds like an issue with the implementation, not the protocol. There are countless large environments I know of running NFS in production on NetApp without any issues at all.