PTX =
Parallel Thread Execution is a pseudo-assembly language used in nVidia's CUDA programming environment. The 'nvcc' compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler which translates the PTX into something that can be run on the processing cores. (source - wikipedia)
I see that "OpenCL" is listed above the LLVM box on the Ocelot page, but I'm not sure why. Several OpenCL toolchains (Nvidia, ATI, RapidMind) are known to make use of LLVM, but it is unclear in what capacity. For the sequence you described (PTX->OpenCL->GPU), there would have to be an OpenCL backend for LLVM. As far as I know, no such backend is publicly available, so another compiler would be necessary to take the OpenCL source code back down to PTX (whence it came), and then the driver would JIT that PTX for your specific GPU model.
> i am curious whether the analysis stages can improve the code.
The conventional wisdom is that any series of analyses that takes you from representation X, through one or more other representations, and back to X can only make things worse, assuming the JIT from PTX to GPU machine code isn't horrendous. This is because any analyses, optimizations, and transformations need to be conservative to maintain correctness, and high-level semantic information about the parallelism inherent in the application is usually lost in each translation step. In this particular case it might not be so bad, as long as the LLVM IR is rich enough to faithfully represent the Cooperative Thread Array (CTA) semantics in PTX and not flatten them to SPMD code. My intuition, however, is that it's not; LLVM was designed as a fairly generic virtual machine that would faithfully represent most CPU-like execution models, and hardware CTAs and warps (in Nvidia parlance) are mostly a GPU-only phenomenon. CPUs have SIMD units (e.g. SSE, MMX, Altivec, NEON), but the execution model there is fundamentally different from the GPU's.
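To make the "flatten to SPMD code" point concrete, here is a conceptual sketch in plain Python (not Ocelot's actual code, and the kernel is invented): a CTA with a barrier can be serialized by splitting the kernel at the barrier and looping over the threads twice. The result is correct, but the explicit parallelism between threads — exactly the information an optimizer would want — is no longer visible in the flattened form.

```python
# Model a CTA kernel as two phases separated by a __syncthreads() barrier.

def kernel_phase1(tid, shared):
    # before the barrier: each thread writes its own slot of shared memory
    shared[tid] = tid * 2

def kernel_phase2(tid, shared, out):
    # after the barrier: each thread reads a neighbour's slot
    out[tid] = shared[(tid + 1) % len(shared)]

def run_cta_serially(num_threads):
    """Flatten the CTA: run phase 1 for every thread, then phase 2.
    The barrier becomes the boundary between the two serial loops
    ('loop fission' around the barrier)."""
    shared = [0] * num_threads
    out = [0] * num_threads
    for tid in range(num_threads):
        kernel_phase1(tid, shared)
    for tid in range(num_threads):
        kernel_phase2(tid, shared, out)
    return out
```

Once the kernel looks like this, it is just ordinary sequential code to the compiler; recovering the fact that the loop iterations were independent threads is exactly the kind of analysis that tends to be conservative.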
Once the GT300 series hits the shelves that problem will be largely mitigated, though; the cores are supposed to be mostly independent, with a 'variable warp' size.
Of course that will introduce a new level of complexity to the optimization problem.
i don't understand this either. "variable warp size" sounds like a small efficiency fix for when things aren't multiples of 32, or when they exceed 512. a "variable warp size" doesn't alter the fundamental SIMD approach - you've still got a multiprocessor with slave processors that are doing very similar work.
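For illustration, a small Python sketch of the efficiency fix being discussed (the function and the warp size of 32 are assumptions for the example): threads are issued in whole warps, so a block whose size isn't a multiple of the warp size wastes lanes in its last warp, while a warp size of 1 makes utilization perfect by construction.

```python
def warp_utilization(threads_per_block, warp_size=32):
    """Fraction of issued SIMD lanes that do useful work, assuming
    threads are scheduled in whole warps of `warp_size` lanes."""
    warps = -(-threads_per_block // warp_size)  # ceiling division
    issued_lanes = warps * warp_size
    return threads_per_block / issued_lanes
```

With 100 threads and warps of 32, four warps (128 lanes) get issued for 100 threads' worth of work; shrink the warp size to 1 and nothing is wasted, which is the "multiples of 1" case mentioned below.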
for me, the big advances in fermi are a unified address space and some kind of cache for the global memory. neither of those change the paradigm, but they may make life significantly simpler when programming the thing.
Multiples of 32 are nice, multiples of 1 are better :) (and let's hope it goes that far down) - that would make things a lot easier as well.
By unified address space I assume you mean across multiple GPUs? A global memory cache is a double-edged sword that eats into the transistor budget at a very rapid pace; effectively you already have a cache, you just have to fill it yourself.
GPU programming is definitely a step back in the ease with which you can write programs, but if your problem maps well onto a GPU the speed increases are simply astounding. What would have taken you a cluster of 100 boxes now sits under your desk and consumes 250 watts tops. That's really very impressive.
The way intel seems to edge into gpu territory and nvidia into cpu territory will make for some interesting stuff happening in the next couple of years.
I believe the reference to unified address space refers to #6 in the PDF linked below, and the caches to #4. I agree that a unified memory address space will be wonderful, as managing all the various hierarchies by hand is a pain.
It's going to be really hard to graft that on there, given that a lot of the computational horsepower is directly related to the bandwidth to the 'local' memory store. That would mean the local memory store somehow has to be turned into a cache that stays coherent across many hundreds of processing units.
I'm not sure that's impossible, it just seems very hard.
If nvidia manages to crack that nut then the only thing you'll still need to keep in mind is how big your cache footprint is (as on every other cpu with a cache) in order to maximize throughput.
my original comment (about unified address space) was poorly thought out (it's not clear how much fermi will help, and how much is down to opencl being "cross-platform"). but the idea isn't that you no longer need to care about the memory hierarchy; only that pointers can be expected to work correctly. currently (particularly in opencl) there are various restrictions on pointers that make some code more complex than it needs to be. for example, you can only allocate 1/4 of the memory in a single chunk, and pointers are local to chunks, so patching together chunks of memory to get one large array is messy.
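A hypothetical Python sketch of the bookkeeping described above (the class and its names are invented for illustration): presenting one large logical array on top of several fixed-size allocations, translating every index into a (chunk, offset) pair by hand. This is the messiness that goes away once pointers just work across the whole address space.

```python
class ChunkedArray:
    """Emulate one large logical array on top of several fixed-size
    allocations, when a single allocation is capped in size."""

    def __init__(self, total, chunk_size):
        self.total = total
        self.chunk_size = chunk_size
        nchunks = -(-total // chunk_size)  # ceiling division
        self.chunks = [[0] * chunk_size for _ in range(nchunks)]

    def _locate(self, i):
        # every access pays for the index arithmetic a flat pointer avoids
        if not 0 <= i < self.total:
            raise IndexError(i)
        return i // self.chunk_size, i % self.chunk_size

    def __getitem__(self, i):
        c, off = self._locate(i)
        return self.chunks[c][off]

    def __setitem__(self, i, value):
        c, off = self._locate(i)
        self.chunks[c][off] = value
```

A "pointer" into this structure is really the pair (chunk, offset), so it can't be passed around or advanced past a chunk boundary the way an ordinary pointer can - which is exactly the complexity being complained about.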
I've spent about two months in total (spread out over the last year) understanding how this whole GPGPU thing fits in with the rest of computing. It is much like a specialty tool: harder to master, more work to get right once you have mastered it, and subject to change on shorter notice than most other solutions (because of the close tie to the hardware). But if you need it, you need it bad, and the pay-off is tremendous.
I posted a link for PLANG (PTX frontend for LLVM) in my earlier comment above...that PDF has some good info on how nVidia is integrating LLVM into their toolchain.
so the analysis is in the llvm? the diagram is very confusing; it looks to me as though the analysis is before the llvm and applies to three different backends. but then looking some more, it seems to imply you can go ptx -> ocelot -> nvidia gpu (no llvm at all).
(ie i agree with everything else you say - i just don't understand what that graphic is trying to show).
My interpretation, which I will further inform later today by scraping through the source, is that Ocelot takes in PTX from the Nvidia compiler and emits LLVM bitcode, which can be JITed to run on an x86 (using LLVM's JIT) or compiled by LLVM to run on any other architecture it supports now or in the future. I think the diagram may mean that Ocelot presents a single CUDA device which may do any of these things (x86, GPU, ???-via-LLVM) to execute a CUDA kernel behind the scenes, transparently to whatever made the kernel call.
So far it appears as though my initial suspicion was correct. Ocelot seems to present a single CUDA device interface which can use either the Nvidia driver or the LLVM JIT to run CUDA kernels. The eventual plan is to integrate Ocelot into a runtime system, Harmony, which presents a very high-level interface to parallel resources, and manages multiple backends to keep whatever mix of CPU and GPU cores you have busy. The plan is here (http://code.google.com/p/gpuocelot/wiki/Roadmap). Harmony is part of Greg Diamos' (http://gdiamos.net/) ongoing PhD thesis work at Georgia Tech.
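The "single CUDA device interface with pluggable backends" idea can be sketched like this (plain Python with invented names, not Ocelot's actual API): the caller launches kernels against one device object, and the device forwards each launch to whichever backend is active, so the caller never knows whether the Nvidia driver or a CPU JIT ran the kernel.

```python
class OcelotStyleDevice:
    """Hypothetical stand-in for a single CUDA device that can route
    kernel launches to any of several registered backends."""

    def __init__(self):
        self.backends = {}
        self.active = None

    def register(self, name, launch_fn):
        self.backends[name] = launch_fn
        if self.active is None:
            self.active = name  # first backend registered becomes default

    def launch(self, kernel, *args):
        # transparent to the caller: same call regardless of backend
        return self.backends[self.active](kernel, *args)

def llvm_jit_backend(kernel, *args):
    # stand-in for 'translate PTX to LLVM bitcode and JIT on the CPU';
    # here we simply run the kernel directly on the host
    return kernel(*args)

device = OcelotStyleDevice()
device.register("llvm-x86", llvm_jit_backend)
```

A runtime like Harmony would then sit above this, choosing the active backend per launch to keep whatever mix of CPU and GPU cores you have busy.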