Go have the argument with glibc then (and all the other people who use optimizations like this).
"friendlier to the micro-architecture" doesn't even make sense. Check the chip timings for rep scasb. It's not friendly.
We're not really discussing whether you should be using strlen on large strings or not, but even if it's used, say, a million times on strings of length 80 or so, you'd see an improvement worth having.
Check any assembly language forum, book, etc. and there will be discussion on why rep scasb/movsb/cmpsb are lame.
Would you implement string copy with rep movsb as well?
Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
You know that VC++ does implement copies with movsd/movsb, right?
Sorry, I don't read a lot of books and forums on assembly programming. Just the PRM. I'm just stuck reading/writing a lot of assembly on projects.
>> Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.
If it's fewer clock cycles to do branching and comparing by dword (which it is for medium to long strings) than doing rep scasb, then what else matters...?
>> You know that VC++ does implement copies with movsd/movsb, right?
I've stepped through VC++ string copy code in SoftICE many a time, thanks.
Notice how I was asking about 'movsb', and you replied with 'movsd/movsb'. Notice the difference?
I have no idea what points you're trying to make here.
You can trade per-byte cycle counts for lower cost to invoke the routine, and for not evicting cache and BTB entries.
On your second point, I assumed it was the "rep" part of the instruction that you were railing against. Apparently it's the "not knowing the difference between a byte and a dword" part. That's awesome. You can have the last word, if you'd like.
rep movsb/rep movsd works well for moving data. However, you obviously can't use that approach for searching for a 0. That's why the code is optimized as it was. My point is that using rep scasb is suboptimal.
Don't know what you're talking about with "lower cost to invoke the routine", and the cache/BTB impact would be negligible on a small routine like this.
You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.
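For anyone following along, here's roughly what the "compare by dword" approach looks like when written in C instead of assembly. This is only a sketch of the general technique, not glibc's actual code: the name strlen_word is made up, the 32-bit word size is an assumption, and the zero-byte test is the well-known (w - 0x01010101) & ~w & 0x80808080 bit trick. It also assumes that reading the whole aligned word containing the terminator is safe, which holds in practice because an aligned word never crosses a page boundary, but strictly speaking it can read a few bytes past the end of the object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch of a word-at-a-time strlen, not glibc's code. */
size_t strlen_word(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Walk byte by byte until p is aligned to a 4-byte boundary. */
    while (((uintptr_t)p & (sizeof(uint32_t) - 1)) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Check four bytes per iteration. The expression below is nonzero
       exactly when at least one byte of w is zero.
       Assumption: reading the full aligned word is safe even if the
       terminator is not its last byte. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned load without aliasing issues */
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;
        p += sizeof w;
    }

    /* The terminator is somewhere in this word; find the exact byte. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The tradeoff being argued about is visible here: the inner loop does one load and a handful of ALU ops per four bytes, but the routine has three loops and more branches than a single rep scasb, so it costs more code size and more setup on very short strings.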