SIMD instructions are used to explicitly execute vector operations in parallel. The patent is about rearranging a stream of instructions across a loop at runtime to turn code that was not written with SIMD in mind into parallel code.
The across the loop part is the vaguely new part of it. To simplify it greatly, this can already be made parallel at runtime:
float a = 5.0;
float b = 10.0;
//The next two lines are executed in one clock tick
a*=2;
b*=2;
This patent deals with a method to turn this into parallel code:
float[] vals = { 1.0, 2.0, 3.0 }
for ( int i = 0; i < vals.length; i++ ) {
vals[ i ] *= 2;
}
Obviously it has to detect any data dependencies at runtime, which isn't trivial. This can't be rewritten, for instance, because i depends on i - 1:
float[] vals = { 1.0, 2.0, 3.0 }
for ( int i = 1; i < vals.length; i++ ) {
vals[ i ] *= vals[ i - 1 ];
}
Now, this is cool. I'd love to be working on stuff like this. But it's an extremely obvious step, once you've cleared the low hanging fruit. It's just that the extra complexity hasn't been worth it until now, because easier and simpler things to do have given enough speed up. That someone would do this was inevitable, which is why I say the patent is ridiculous.
Imagine history turning out completely differently. IBM is the largest CPU design firm in the world, after Shockley Semiconductor and HP who fab chips for the latest Commodore machine. Apple and Dell never existed. Every single person working on CPU design today for whatever reason ended up in another field.
We would still be seeing this technique first being used at the 1 billion transistor mark for CPUs. It's just a consequence of the nature of computation.
The biggest problem in our patent system is that obvious ideas get patents. That said, the current criterion that patent examiners use to decide if an idea is obvious is based on prior art. If an idea is clearly useful but no one is using it (or has publicly disclosed it), then it's considered non-obvious.
Isn't that already done with auto-vectorisation? http://gcc.gnu.org/projects/tree-ssa/vectorization.html It should take chunks of "vals" and automatically process X elements (depending on how many fit in the registers) at the same time via SIMD. It was available since ~2008 too.