Go have the argument with glibc then (And all the other people who use optimizat...

tptacek · on March 10, 2009

Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.

You know that VC++ does implement copies with movsd/movsb, right?

Sorry, I don't read a lot of books and forums on assembly programming. Just the PRM. I'm just stuck reading/writing a lot of assembly on projects.

axod · on March 10, 2009

>> Friendlier to the microarchitecture means: fewer branches, fewer BTB entries, less impact on the icache. Sorry, you wrote like you might have already known that.

If it takes fewer clock cycles to branch and compare by dword (which it does for medium to long strings) than to do rep scasb, then what else matters...?

>> You know that VC++ does implement copies with movsd/movsb, right?

I've stepped through VC++ string copy code in SoftICE many a time, thanks.

Notice how I was asking about 'movsb', and you replied with 'movsd/movsb'. Notice the difference?

tptacek · on March 10, 2009

I have no idea what points you're trying to make here.

You can trade per-byte cycle counts for lower cost to invoke the routine, and for not evicting cache and BTB entries.

On your second point, I assumed it was the "rep" part of the instruction that you were railing against. Apparently it's the "not knowing the difference between a byte and a dword" part. That's awesome. You can have the last word, if you'd like.

axod · on March 10, 2009

rep movsb/rep movsd works well for moving data. However, you obviously can't use that approach to search for a 0 byte. That's why the code is optimized the way it is. My point is that using rep scasb is suboptimal.
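For anyone who hasn't seen the dword-at-a-time approach, here is a minimal sketch in C. It is not glibc's actual code: the name `strlen_dword` is made up for illustration, and it assumes the usual flat-memory situation where an aligned 4-byte load that starts inside the string can't fault, even though it may read a few bytes past the terminator (strictly speaking, ISO C doesn't promise that).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch of the dword-at-a-time idea, not glibc's code.
   Assumption: an aligned 4-byte load that begins inside the string is safe,
   even if it reads up to 3 bytes past the terminating 0. */
size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* Go byte by byte until p is 4-byte aligned, so the 32-bit loads
       below never straddle a page boundary. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Test four bytes per iteration. The expression is nonzero exactly
       when at least one byte of w is zero. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned 4-byte load */
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;
        p += 4;
    }

    /* The last word loaded contains the terminator; find its exact position. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Per byte it does less work than a byte-wise scan, at the cost of the alignment prologue and the tail loop, which is why it only pays off on medium to long strings.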

I don't know what you mean by "lower cost to invoke the routine", and the impact on the cache and BTB would be negligible for a small routine like this.

You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.

tptacek · on March 10, 2009

It costs cycles to call a C function. I seem angry and bitter all the time. But my point is just, there's an argument in favor of scasb.

axod · on March 10, 2009

So you're comparing an inlined rep scasb with a non-inlined alternative. Interesting comparison, I guess.

Sure, it would bloat the code a little to inline the optimized version, but it could be done in tight inner loops if required.

tptacek · on March 10, 2009

I'm assuming you're not inlining a function with a loop in it, but OK, you can also just expand the 7 insns everywhere you call strlen.