I have no idea what points you're trying to make here.
You can trade per-byte cycle counts for lower cost to invoke the routine, and for not evicting cache and BTB entries.
On your second point, I assumed it was the "rep" part of the instruction that you were railing against. Apparently it's the "not knowing the difference between a byte and a dword" part. That's awesome. You can have the last word, if you'd like.
rep movsb/rep movsd works well for moving data. However, you obviously can't use that approach to search for a 0, which is why the code was optimized the way it was. My point is that using rep scasb is suboptimal.
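To make the byte-vs-dword distinction concrete, here is a rough sketch in C. It is an illustration only, not the routine under discussion: the function names are made up, and it scans a bounded buffer rather than an unterminated string so the loads stay within the array. The first function does one comparison per byte, which is roughly the work rep scasb performs; the second tests four bytes per iteration with the usual "has a zero byte" bit trick.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time scan: one comparison per byte, roughly what rep scasb does.
   Returns the index of the first 0 byte, or len if there is none. */
size_t find_zero_bytewise(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (buf[i] == 0)
            return i;
    return len;
}

/* Dword-at-a-time scan (hypothetical helper for illustration).
   Returns the index of the first 0 byte, or len if there is none. */
size_t find_zero_dword(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (len - i >= 4) {
        uint32_t w;
        memcpy(&w, buf + i, 4);   /* unaligned-safe 32-bit load */
        /* Nonzero exactly when at least one of the four bytes is 0. */
        if ((w - 0x01010101u) & ~w & 0x80808080u)
            break;                /* pin down the exact byte below */
        i += 4;
    }

    /* Finish byte by byte: either the dword containing the 0, or the tail. */
    for (; i < len; i++)
        if (buf[i] == 0)
            return i;
    return len;
}
```

Whether the dword version actually wins depends on the CPU and on how long the strings typically are, so treat it as something to measure rather than a given.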
I don't know what you mean by "lower cost to invoke the routine", and the cache/BTB pressure from a small routine like this would be negligible.
You seem kinda angry and bitter whenever you reply to me :/ Chill out eh.