
I'm not sure it is difficult to see why variable-length SIMD makes sense. If you want to process 15 elements with a vector width of 8, you need to write the function twice: once with SIMD processing whole batches of 8 elements, and a scalar version of the same loop to handle the last 7 elements. This makes it inherently difficult to write SIMD code even in the simple and happy case of plain data parallelism. With RISC-V V, all you do is have vsetvli set the active vector length to 7 on the last iteration.
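A scalar sketch of that strip-mining pattern (VLEN and the function are mine, and the inner loop stands in for what the hardware would do as one vector operation):

```c
#include <stddef.h>

#define VLEN 8  /* hypothetical hardware vector length */

/* Strip-mined add: each iteration processes min(remaining, VLEN)
   elements, mimicking what vsetvli gives you for free on RISC-V V.
   No separate scalar tail loop is needed. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VLEN) ? n - i : VLEN;  /* the "vsetvl" step */
        for (size_t j = 0; j < vl; j++)             /* one "vector" op */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

For n = 15 this runs one full batch of 8 and one short batch of 7 through the same code path.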

>what do you even pass to vsetvl in this case as you don’t know your string length.

I'm not sure what you are trying to say here. You must know the length of the buffer; if you don't, then processing the string is inherently sequential, just like walking a linked list, since accessing even a single byte beyond the null terminator risks a buffer overflow. Why pick an example that can't be vectorized by definition?



I wonder if you’re using a different definition of ‘vectorized’ from the one I would use. For example, glibc provides a vectorized strlen. Here is the SSE version: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...

It’s pretty simple to imagine how to write an unoptimized version: read a vector from the start of the string, compare each byte to zero, convert the result to a bitmask, and test whether the mask is zero. If it is, advance and loop; otherwise a count-trailing-zeros gives the position of the terminator.

I would call this vectorized because it operates on 16 bytes (sse) at a time.
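A minimal sketch of that naive version with SSE2 intrinsics (the function name is mine, and it deliberately keeps the problems listed below: unaligned loads that may cross cache lines and pages):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Naive vectorized strlen: unaligned 16-byte loads; may read past the
   terminator and across page boundaries, so it is unsafe near the end
   of mapped memory. For illustration only. */
static size_t strlen_sse2_naive(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)  /* some byte in this 16-byte block was zero */
            return i + (size_t)__builtin_ctz(mask);  /* gcc/clang builtin */
    }
}
```

Calling it on a string whose buffer is padded out to a 16-byte multiple avoids the out-of-bounds read in practice, which is exactly the kind of guarantee the real implementation has to manufacture via alignment.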

There are a few issues:

1. You’re still spending a lot of time in the scalar code checking loop conditions.

2. You’re doing unaligned reads, which are slower on older processors.

3. You may read across a cache line, forcing you to pull a second line into cache even if the string ends before then.

4. You may read across a page boundary, which could cause a segfault if the next page is not accessible.

So the fixes are to do 64-byte (i.e. cache-line) aligned accesses, which also means never crossing a page boundary: you won’t read from a page until you know the string doesn’t end in the previous one. That deals with the alignment problems. You read four vector registers at a time, which doesn’t really cost much more when the string is shorter, since it all comes from one cache line. Another trick in the linked code is that it first locates the cache line containing the terminator by reading the first 16 bytes and then merging in the next three groups with unsigned min, so it only needs one test against a zero vector instead of four. Only then does it find the exact zero byte within that line. You need to do a bit of work in the first iteration to become aligned; with AVX-512, you can instead use mask registers on the loads to handle that first step.
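The unsigned-min merging step can be sketched like this (a hypothetical helper, not glibc’s actual code; it assumes the four vectors cover one 64-byte line):

```c
#include <emmintrin.h>  /* SSE2 */

/* Merge four 16-byte vectors with unsigned byte-min: the merged vector
   contains a zero byte iff any of the four inputs did, so one
   compare+movemask covers the whole 64-byte cache line. */
static int line_has_zero(const __m128i v[4]) {
    __m128i m = _mm_min_epu8(_mm_min_epu8(v[0], v[1]),
                             _mm_min_epu8(v[2], v[3]));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128()));
}
```

A nonzero return tells you the terminator is somewhere in this line, and only then do you pay for the per-vector tests to pin down its exact position.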


Another point is that the CPU can sequence the multiple calls to its internal SIMD unit by itself, without that having to be done by user code. In the extreme case this degrades to a Cray-1-like vector unit, which still has a measurable performance impact and can be implemented even in very resource-constrained environments.



