Yes but in practice, if you compute K=X.wk, Q=X.wq and then K.tQ you make three matrice multiplication.
Wouldn't be faster to compute W=wk.twq beforhand and then just X.W.tX which will be just two matrices multiplication ?
Is there something I am missing ?
Most models have a per-head dimension much smaller than the input dimension, so it's faster to multiply by the small wk and wk individually than to multiply by the large matrix W. Also, if you use rotary positional embeddings, the RoPE matrices need to be sandwiched in the middle and they're different for every token, so you could no longer premultiply just once.
> Contrast this with equivalent code that is full of logistics, where I’m using only basic Python language features and no special data wrangling package:
n = len(values)
# Calculate mean
mean = sum(values) / n
# Calculate standard deviation
variance = sum((x - mean) \* 2 for x in values) / (n - 1)
std_dev = math.sqrt(variance)
He doesn' t know about the statistics package in the standart library of Python (https://docs.python.org/3/library/statistics.html). Of course, if you do not know to use Python, you will have a lot of boilerplate.
- Take a list of n animals * m vehicule
- Ask a LLM to generate SVG for this n*m options
- Generate png from the svg
- Ask a Model with vision to grade the result
- Change your weight accordingly
No need to human to draw the dataset, no need of human to evaluate.