One mode is natural language, the other is imagery. It is a combination because the model learns statistical associations between the modes, e.g. "text to image" or "voice to text".

Within each of these modes there are further subgroups, e.g. language translation or audio diarization. For SD, you can consider animation and photographs as separate modes the model has to learn. The language here is fuzzy, though, and I'm not being statistically rigorous, as that is a weak point of mine.