
Once the GT300 series hits the shelves, though, that problem will be largely mitigated; they're supposed to be mostly independent, with a 'variable warp' size.

Of course that will introduce a new level of complexity to the optimization problem.



i don't understand this either. "variable warp size" sounds like a small efficiency fix for when things aren't multiples of 32, or when they exceed 512. a "variable warp size" doesn't alter the fundamental SIMD approach - you've still got a multiprocessor with slave processors that are doing very similar work.

for me, the big advances in fermi are a unified address space and some kind of cache for the global memory. neither of those changes the paradigm, but they may make life significantly simpler when programming the thing.


Multiples of 32 are nice, multiples of 1 are better :) (and let's hope it goes that far down); that would make things a lot easier as well.

By 'unified address space' I assume you mean across multiple GPUs? A global memory cache is a double-edged sword that eats into the transistor budget at a very rapid pace; effectively you already have a cache, you just have to fill it yourself.

GPU programming is definitely a step back in the ease with which you can write programs, but if your problem maps well onto a GPU the speed increases are simply astounding. What would have taken you a cluster of 100 boxes now sits under your desk and consumes 250 watts, tops. That's really very impressive.

The way Intel seems to be edging into GPU territory and NVIDIA into CPU territory will make for some interesting developments in the next couple of years.


I believe the unified address space refers to #6 in the PDF linked below, and the caches to #4. I agree that the unified memory address space will be wonderful, as managing all the various hierarchies by hand is a pain.

http://www.nvidia.com/content/PDF/fermi_white_papers/D.Patte...


It's going to be really hard to graft that on there, given that a lot of the computational horsepower is directly tied to the bandwidth of the 'local' memory store. It would mean that the local memory store somehow has to be turned into a cache that stays coherent across many hundreds of processing units.

I'm not sure that's impossible, it just seems very hard.

If NVIDIA manages to crack that nut, then the only thing you'll still need to keep in mind is how big your cache footprint is (as on every other CPU with a cache) in order to maximize throughput.


my original comment (about unified address space) was poorly thought out (it's not clear how much fermi will help, and how much is down to opencl being "cross-platform"). but the idea isn't that you no longer need to care about the memory hierarchy; only that pointers can be expected to work correctly. currently (particularly in opencl) there are various restrictions on pointers that make some code more complex than it needs to be. for example, you can only allocate 1/4 of the memory in a single chunk, and pointers are local to chunks, so patching together chunks of memory to get one large array is messy.


Ok, I see what you're getting at now.

That would definitely be a good thing.

I've spent about two months in total now (spread out over the last year) understanding how this whole GPGPU thing fits in with the rest of computing. It is much like a specialty tool: harder to master, more work to get right once you have mastered it, and subject to change on shorter notice than most other solutions (because of the close tie to the hardware). But if you need it, you need it bad, and the pay-off is tremendous.



