This would have saved me last week. I was pulling my hair out debugging a CUDA implementation of the SHA-3 candidate BLAKE, and after a full week of staring at the same 70 lines (and one complete rewrite), the issue turned out to be sign extension: the bytes were held in signed 'char' values, so arithmetic right shifts padded the result with copies of the most significant bit. I just changed every 'char' type to 'uint8_t' and the code worked perfectly.
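
For anyone who hasn't been bitten by this yet, here's a minimal sketch of the effect in plain C. It isn't my actual BLAKE code; the helper names `top_bit_signed` and `top_bit_unsigned` are made up for illustration. I'm using `int8_t` to stand in for 'char' on a platform where 'char' is signed (that part is implementation-defined, as is the exact result of right-shifting a negative value).

```c
#include <stdint.h>

/* Hypothetical helper: extract the top bit of a byte held in a signed type.
   The byte is promoted to int before the shift, so 0x80 arrives as -128.
   On compilers that shift arithmetically, the sign bit is copied downward
   and the result is all ones instead of 1. */
static uint32_t top_bit_signed(int8_t b) {
    return (uint32_t)(b >> 7);
}

/* Same operation on an unsigned byte: the value is promoted to a
   non-negative int, so the shift fills with zeros and 0x80 gives 1. */
static uint32_t top_bit_unsigned(uint8_t b) {
    return (uint32_t)(b >> 7);
}
```

With the compilers I'd expect most people to use, `top_bit_signed(-128)` should come out as 0xFFFFFFFF, while `top_bit_unsigned(0x80)` is always 1. Once that stray run of one bits gets OR'd or XOR'd into a hash state word, everything downstream is garbage.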