The analogy thing didn’t entirely hold up: the original demonstration was constrained not to return any of the words in the prompt, so “man is to woman as doctor is to ____” had to return something that’s close to, but not identical to, “doctor” in the embedding space. Hence, it returns nurse.
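To make the mechanics concrete, here’s a minimal sketch of the analogy query (the “3CosAdd” method: b − a + c, ranked by cosine similarity) with and without the exclusion constraint. The 2-d vectors are made-up toy values chosen purely for illustration; real word2vec embeddings are 300-d and learned from data.

```python
import numpy as np

# Hypothetical toy embedding table (values invented for illustration only).
vecs = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.8, 0.6]),
    "doctor": np.array([0.0, 1.0]),
    "nurse":  np.array([-0.5, 1.0]),
    "apple":  np.array([0.0, -1.0]),
}

def analogy(a, b, c, exclude_inputs=True):
    """Rank the vocabulary by cosine similarity to b - a + c (3CosAdd)."""
    query = vecs[b] - vecs[a] + vecs[c]
    query = query / np.linalg.norm(query)
    banned = {a, b, c} if exclude_inputs else set()
    scores = {
        w: float(v @ query / np.linalg.norm(v))
        for w, v in vecs.items() if w not in banned
    }
    return sorted(scores, key=scores.get, reverse=True)

# Without the exclusion, the query vector's nearest neighbor is "doctor" itself;
# only because the prompt words are banned does "nurse" surface instead.
print(analogy("man", "woman", "doctor", exclude_inputs=False))  # "doctor" first
print(analogy("man", "woman", "doctor"))                        # "nurse" first
```

The point of the sketch: the famous “nurse” answer is partly an artifact of the exclusion rule, since the honest nearest neighbor of the query vector is usually the query word itself.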
Ironically, this makes the original point nearly as well: we need to evaluate the hell out of machine learning systems to make sure that they’re doing what we think they are and that they’re not keying off something else instead, especially something biased. To date, the field has been...not great about this.
You can monkey around with it here: http://bionlp-www.utu.fi/wv_demo/ (choose the English Google News model, but this may not be exactly the same set/model as the original report).
Man is to Woman as Doctor is to ___ gives:
1) gynecologist 2) nurse 3) doctors 4) physician 5) pediatrician
Woman is to Man as Doctor is to ___ gives:
1) physician 2) doctors 3) surgeon 4) dentist 5) cardiologist
These are just generally near "Doctor" though: the ten nearest terms are physician, doctors, gynecologist, surgeon, dentist, pediatrician, pharmacist, neurologist, cardiologist, and nurse.
Some gender differences may persist (nurse is #2 for `woman` but #68 for `man`), though nurse is also near `woman` generally, and you could imagine it gets a bit of a boost from the verb ("to feed a baby") being attached almost exclusively to women too.
Anyway, my point is not that there's no bias (there certainly can be--seed GPT-3 with a prompt about Muslims) but that one should be wary of assuming they know what the model is doing.
- man is to woman as doctor is to reprovingly (nurse is the first noun, at position 4)
- woman is to man as doctor is to snodgrass (after a couple of nonsense/rare words)
The most important thing this teaches us is that big corpora (much bigger than PG) are essential for this method.