I'd be much more hesitant to talk about "meaning" in this context. Are we working with abstract concepts? We're working with images and text, and creating connections between the two. It may be that this approach, ramped up in processing power and complexity, can completely mimic a human's response to images; we may also hit a wall where new techniques are needed to address what you'd call "meaning."
Yes, could be. Word vectors and other kinds of embeddings seem promising, maybe a little too good to be true. There might be a ceiling we're not seeing yet.