"Inherently" might be too strong of a word, but the default implementations of a lot of key operations are nondeterministic on GPU. With the parallel nature of GPU compute, you can often do things faster if you're willing to be a bit loosey-goosey. PyTorch and TF will typically provide deterministic alternatives, but those come at a cost of efficiency, and might be impractical for LLM training runs that are already massively expensive.

https://pytorch.org/docs/stable/notes/randomness.html
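For example, a minimal sketch of the switches those notes describe (assuming a reasonably recent PyTorch with CUDA; the CUBLAS_WORKSPACE_CONFIG value is one of the two the docs suggest):

    import os
    # the randomness notes say this env var is required for some
    # deterministic cuBLAS operations; set it before CUDA initializes
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    import torch

    torch.manual_seed(0)                      # seed the CPU and CUDA RNGs
    torch.use_deterministic_algorithms(True)  # raise on ops with no deterministic path
    torch.backends.cudnn.benchmark = False    # stop cuDNN autotuning from varying kernel choice

Note that use_deterministic_algorithms(True) makes ops with no deterministic implementation raise an error rather than silently run the nondeterministic one, which is also a handy way to find out which ops in your model are affected.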



I wonder what the actual speed difference is. I couldn't find any benchmarks.
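You could get a rough per-op number with something like this (an untested micro-benchmark sketch, not a real training benchmark; index_add_ is one of the CUDA ops the PyTorch notes list as nondeterministic by default, and the overhead for a full run depends on which ops dominate):

    import os
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA ops

    import time
    import torch

    def bench_index_add(deterministic: bool, iters: int = 100) -> float:
        torch.use_deterministic_algorithms(deterministic)
        dst = torch.zeros(1_000_000, device="cuda")
        src = torch.randn(10_000_000, device="cuda")
        # many source elements map to the same destination index, so the
        # accumulation order (and hence determinism) actually matters here
        idx = torch.randint(0, dst.numel(), (src.numel(),), device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dst.index_add_(0, idx, src)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    print(f"nondeterministic: {bench_index_add(False) * 1e3:.3f} ms/iter")
    print(f"deterministic:    {bench_index_add(True) * 1e3:.3f} ms/iter")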



