Now I'd like to know more about what general uses there are for these beasts. I see the point for simulating fluids etc., but apart from climate researchers and aerospace engineers, who needs this sort of tool nowadays? Sincerely wondering.
You can use GPUs to accelerate machine learning algorithms.
What can you do with machine learning? Basically everything: "Applications for machine learning include machine perception, computer vision, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, brain-machine interfaces and cheminformatics, detecting credit card fraud, stock market analysis, classifying DNA sequences, speech and handwriting recognition, object recognition in computer vision, game playing, software engineering, adaptive websites and robot locomotion." (http://en.wikipedia.org/wiki/Machine_learning)
In our lab, we have developed a package called Theano (http://www.deeplearning.net/software/theano/), which allows you to take Python numpy code, adapt it slightly, and have the mathematical functions automatically compiled into an optimized function graph, which is then transformed into C code and compiled to target the CPU or GPU. Which is to say: your matrix mathematics and machine learning algorithms just got a lot faster, at little cost in programmer time.
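If you are curious what "adapt it slightly" means in practice, here is a minimal sketch (theano.tensor, theano.shared and theano.function are the real entry points; the shapes and the softmax layer are just an invented illustration):

    import numpy as np
    import theano
    import theano.tensor as T

    # Symbolic variables stand in for the numpy arrays you would normally pass around.
    x = T.matrix('x')
    w = theano.shared(np.random.randn(784, 10).astype('float32'), name='w')

    # Describe the computation symbolically; Theano builds an optimized graph from it.
    y = T.nnet.softmax(T.dot(x, w))

    # Compiling the graph generates C (or CUDA) code behind the scenes.
    predict = theano.function(inputs=[x], outputs=y)

    batch = np.random.randn(64, 784).astype('float32')
    print(predict(batch).shape)   # (64, 10)

The same script runs on the GPU by setting Theano's device flag; the Python code itself does not change.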
Extremely fast string matching -- I've seen a few research papers on moving this to GPUs. It can play into DNA sequence matching or information retrieval.
GPUs are also good for anything that requires extensive parallel number crunching, like password cracking.
If you search for "GPU String Matching" you should find some good results.
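To give a feel for why it maps so well: every candidate alignment can be checked independently, one GPU thread per alignment. A toy CPU-side numpy sketch of that formulation (not taken from any particular paper):

    import numpy as np

    def match_positions(text, pattern):
        """Indices where `pattern` occurs in `text`; every alignment is checked
        independently, which is the structure a GPU kernel would exploit."""
        t = np.frombuffer(text, dtype=np.uint8)
        p = np.frombuffer(pattern, dtype=np.uint8)
        n, m = t.size, p.size
        if m == 0 or n < m:
            return np.empty(0, dtype=np.intp)
        # (n - m + 1, m) matrix holding every alignment, compared to the pattern at once.
        windows = t[np.arange(n - m + 1)[:, None] + np.arange(m)]
        return np.nonzero((windows == p).all(axis=1))[0]

    print(match_positions(b"GATTACAGATTACA", b"TTACA"))   # [2 9]

The O(n*m) intermediate is wasteful on a CPU, but thousands of independent, identical comparisons is exactly the shape of work that keeps a GPU's cores busy.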
i write code for gpus, sometimes. they're nice for many inverse problems - anything that constructs a map or image from some kind of measurement. in my case i work on geophysical applications.
This is a good introduction. I'm interested in finding articles that speculate about just how aligned the design requirements are between graphics and most matrix-based scientific and data-mining computation.
For instance, Nvidia has introduced double-precision support and L1 cache, which have marginal value in traditional graphics. This is going to hurt their profitability on the Fermi chip compared to the simpler ATI alternatives.
I was puzzled about how that can impact data mining or machine learning as a whole, too. The difference between data mining algorithms and image processing/string matching algorithms is huge: a computing kernel needs much more data before it produces any meaningful intermediate result. For example, a typical scenario in my research is to compare the performance of many proposed features (tens of thousands) against a large volume of data and pick the best one. It is an embarrassingly parallel problem, but the data throughput is huge. On supercomputers it is easy, since every node can hold a local copy of either the feature set or the data set. But on a GPGPU there is no way for each core to have a local copy of either set, so to compare them against each other the GPU must go back and forth to its shared memory, and the limited bandwidth may hurt performance badly.
Disclaimer: I am not very experienced in the GPGPU field, so my worries may be proven wrong.
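To make the shape of the problem concrete, here is a toy numpy sketch of the kind of comparison I mean (the sizes and the scoring rule are invented):

    import numpy as np

    n_features, n_samples, dim = 2000, 20000, 16
    rng = np.random.default_rng(0)
    features = rng.standard_normal((n_features, dim), dtype=np.float32)
    data = rng.standard_normal((n_samples, dim), dtype=np.float32)
    labels = rng.integers(0, 2, size=n_samples).astype(np.float32) * 2 - 1   # +/- 1

    # Every feature has to be evaluated on every sample: embarrassingly parallel,
    # but the whole data set has to stream past every feature, and the
    # intermediate response matrix is n_features x n_samples.
    responses = features @ data.T                 # (2000, 20000)
    scores = np.abs(responses @ labels)           # crude separation score per feature
    print("best feature index:", int(np.argmax(scores)))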
it's horribly hard to accurately predict what will work and what won't, and they do have some caching ability (the same hardware that would cache a texture map when rendering an image).
but what you are perhaps missing is that it's ok for gpus to read memory, as long as you have enough threads. they can switch context very quickly, so one set of threads can request memory (hopefully a contiguous chunk) and then drop into the background and let another set of threads do some work (on the same processing unit). this is critical to their efficiency and is very different to a cpu, which instead relies on cache and "sits doing nothing" if it needs to read data from "afar" (obviously there are trade-offs - there's only so much local memory for state, for example).
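as a rough illustration, here's a pycuda sketch (the gather kernel and the sizes are made up, not from my actual code): the launch puts millions of threads in flight, far more than the card has processing units, and each one may stall on a scattered read - the scheduler just swaps in warps whose data has arrived, which is how the card stays busy without a big cache.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void gather(float *out, const float *src, const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = src[idx[i]];   // scattered read: likely a trip to device memory
    }
    """)
    gather = mod.get_function("gather")

    n = 1 << 22
    src = np.random.rand(n).astype(np.float32)
    idx = np.random.randint(0, n, size=n).astype(np.int32)
    out = np.empty_like(src)

    threads = 256
    blocks = (n + threads - 1) // threads    # ~16k blocks of threads in flight
    gather(drv.Out(out), drv.In(src), drv.In(idx), np.int32(n),
           block=(threads, 1, 1), grid=(blocks, 1))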
i worked on a problem that was not as "nice" as you might hope - the memory access was unpredictable to some degree. but i still got a speed up of "tens" on a cheap ($200) graphics card, compared to a meaty xeon. it's more robust than you might expect.
That's true, it is hard to predict. But I'm really interested to see what the most optimistic outcome is for getting my particular problem to work on the GPGPU. In this case, I don't think an LRU cache will help much, since the access pattern is uniform (every piece of data has to be examined against every proposed feature). However, you do remind me that a load-ahead caching strategy might help: if the needed data is loaded into cache, with some synchronization to guarantee that all currently running kernels apply that piece of data to the features they are examining, the performance gain may be achieved. Actually, I'm going to spend this weekend trying it out.
i don't really get what you're doing, but have you considered making one dimension of your work vary over feature? if you arrange that correctly then you only need to scan the memory once (all features read the first byte of memory; then all features read the next...)
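roughly what i mean, as a pycuda sketch (the kernel and sizes are made up): one thread per feature, and every thread walks the data in lock-step, so each element is fetched from memory once and handed to all the features.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    // One thread per feature.  All threads walk the data together, so at each
    // step they ask for the same element and the data set streams past only once.
    __global__ void score_features(float *scores, const float *data, int n_data,
                                   const float *features, int n_features)
    {
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        if (f >= n_features) return;
        float w = features[f];
        float acc = 0.0f;
        for (int j = 0; j < n_data; ++j)
            acc += w * data[j];          // toy "response" of feature f to the data
        scores[f] = acc;
    }
    """)
    score_features = mod.get_function("score_features")

    n_features, n_data = 8192, 1 << 20
    features = np.random.randn(n_features).astype(np.float32)
    data = np.random.randn(n_data).astype(np.float32)
    scores = np.empty(n_features, dtype=np.float32)

    threads = 256
    score_features(drv.Out(scores), drv.In(data), np.int32(n_data),
                   drv.In(features), np.int32(n_features),
                   block=(threads, 1, 1), grid=(n_features // threads, 1))
    print("best feature:", int(np.argmax(scores)))

in practice you'd stage chunks of the data in shared memory and plug in your real scoring function, but the access pattern is the point.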
The GPU cores share RAM, so there is no need to have local copies. Though if you run out of the 4GB available on Tesla/Quadro cards, it gets more complicated (but so would it on the CPU).
i would guess double precision support is pretty much a no-brainer. it's a deal breaker for many numerical applications and isn't that expensive (amd/ati have 64 bit support too on the latest firestream cards).
fermi's cache, on the other hand, probably implies more trade-offs. but you can look at it as a necessary step in learning how to find a middle ground between gpus and cpus - which is the next big battle.