To be fair, the FFTW/CUDA thing is due to fundamentally different hardware archi...

To be fair, the FFTW/CUDA thing is due to fundamentally different hardware architectures which drove design constraints for these types of libraries. FFTW was never meant to run on a dedicated, ultra-parallel processor with highly optimized floating point instructions (GPU), but it is incredibly fast considering it runs on general purpose hardware. I am sure the FFTW authors could have done something to squeeze out more performance if they controlled both the hardware and software as NVidia does. And the transfer time to/from the GPU does matter, especially for smaller/more frequent operations...

All that aside, the psychology of pure functional vs. pure OOP vs. some hybrid methodology is really interesting, and even the view of what a "clean solution" is becomes tainted based on past experiences with other code written in that style.