
I'm not sure it is difficult to see why variable-length SIMD makes sense. If you want to process 15 elements with a vector width of 8, you need to write the function twice: once with SIMD processing whole batches of 8 elements, and a scalar version of the same loop to handle the last 7 elements. This makes it inherently difficult to write SIMD code even in the simple and happy case of plain data parallelism. With RISC-V V, all you do is have vsetvli set the active vector length to 7 on the last iteration.
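A scalar sketch of that strip-mining pattern (VLEN and the function are mine, and the inner loop stands in for what the hardware would do as one vector operation):

```c
#include <stddef.h>

#define VLEN 8  /* hypothetical hardware vector length */

/* Strip-mined add: each iteration processes min(remaining, VLEN)
   elements, mimicking what vsetvli gives you for free on RISC-V V.
   No separate scalar tail loop is needed. */
void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VLEN) ? n - i : VLEN;  /* the "vsetvl" step */
        for (size_t j = 0; j < vl; j++)             /* one "vector" op */
            dst[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

For n = 15 this runs one full batch of 8 and one short batch of 7 through the same code path.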

>what do you even pass to vsetvl in this case as you don’t know your string length.

I'm not sure what you are trying to say here. You must know the length of the buffer; if you don't, then processing the string is inherently sequential, just like walking a linked list, since accessing even a single byte beyond the null terminator risks a buffer overflow. Why pick an example that can't be vectorized by definition?



I wonder if you’re using a different definition of ‘vectorized’ from the one I would use. For example, glibc provides a vectorized strlen. Here is the SSE version: https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/m...

It’s pretty simple to imagine how to write an unoptimized version: read a vector from the start of the string, compare each byte to zero, convert the result to a bitmask, and test whether the mask is zero. If it is, advance and loop; otherwise a count-trailing-zeros gives the position of the terminator.

I would call this vectorized because it operates on 16 bytes (sse) at a time.
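A minimal sketch of that naive version with SSE2 intrinsics (the function name is mine, and it deliberately keeps the problems listed below: unaligned loads that may cross cache lines and pages):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Naive vectorized strlen: unaligned 16-byte loads; may read past the
   terminator and across page boundaries, so it is unsafe near the end
   of mapped memory. For illustration only. */
static size_t strlen_sse2_naive(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)  /* some byte in this 16-byte block was zero */
            return i + (size_t)__builtin_ctz(mask);  /* gcc/clang builtin */
    }
}
```

Calling it on a string whose buffer is padded out to a 16-byte multiple avoids the out-of-bounds read in practice, which is exactly the kind of guarantee the real implementation has to manufacture via alignment.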

There are a few issues:

1. You’re still spending a lot of time in the scalar code checking loop conditions.

2. You’re doing unaligned reads, which are slower on older processors.

3. You may read across a cache line, forcing you to pull a second line into cache even if the string ends before then.

4. You may read across a page boundary, which could cause a segfault if the next page is not accessible.

So the fixes are to do 64-byte (i.e. cache-line) aligned accesses, which also means never crossing a page boundary: you won’t read from a page until you know the string doesn’t end in the previous one. That deals with the alignment problems. You read four vector registers at a time, which doesn’t really cost much more when the string is shorter, since it all comes from one cache line. Another trick in the linked code is that it first locates the cache line containing the terminator by reading the first 16 bytes and then merging in the next three groups with unsigned min, so it only needs one test against a zero vector instead of four. Only then does it find the exact zero byte within that line. You need to do a bit of work in the first iteration to become aligned; with AVX-512, you can instead use mask registers on the loads to handle that first step.
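The unsigned-min merging step can be sketched like this (a hypothetical helper, not glibc’s actual code; it assumes the four vectors cover one 64-byte line):

```c
#include <emmintrin.h>  /* SSE2 */

/* Merge four 16-byte vectors with unsigned byte-min: the merged vector
   contains a zero byte iff any of the four inputs did, so one
   compare+movemask covers the whole 64-byte cache line. */
static int line_has_zero(const __m128i v[4]) {
    __m128i m = _mm_min_epu8(_mm_min_epu8(v[0], v[1]),
                             _mm_min_epu8(v[2], v[3]));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(m, _mm_setzero_si128()));
}
```

A nonzero return tells you the terminator is somewhere in this line, and only then do you pay for the per-vector tests to pin down its exact position.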


Another point is that the CPU can sequence the multiple calls to its internal SIMD unit by itself, without that having to be done by user code. In the extreme case this degrades to a Cray-1-like vector unit, which still has a measurable performance impact and can be implemented even in very resource-constrained environments.



