Thanks; I'll take a look. But OpenMP is CPU-only, right? Apple's got their (currently less portable, admittedly) Grand Central Dispatch that does something similar. But as far as I know, if you want portable GPU code your only option is OpenCL, and even then it needs optimising for whichever device you're running it on (or so I've heard).
OpenMP 4.0 is likely to have support for accelerator devices (i.e., move the necessary data onto the device, run the computation, and move the results back to the host). In fact, that's one of the ways you can use the Phi right now (Intel have extensions to OpenMP).
Or, if you can't be bothered to wait for such a standard, have a look at OpenACC[1], which does exactly this and exists now. You end up adding a directive like
#pragma acc kernels loop
on top of your for loops, and it does the low-level work for you.
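
To make that a bit more concrete, here's a rough sketch of what an annotated loop might look like in C. The function name saxpy and its parameters are made up for illustration, and the data clauses (copyin, copy) are one plausible choice rather than the only way to do it. A compiler without OpenACC support should just ignore the pragma (possibly with a warning) and run the loop on the CPU.

```c
#include <stddef.h>

/* Hypothetical example: computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* Assumed data clauses: copyin sends x to the device only;
       copy sends y to the device and brings it back afterwards.
       With OpenACC disabled, this line is ignored and the loop runs serially. */
    #pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The host-side code calling saxpy doesn't change at all, which is most of the appeal. Whether the generated device code is actually fast is a separate question and depends on the compiler and the hardware.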