In my experience, when you encounter VLIW cores, there is almost no tooling for them, as they tend to be application-specific, DSP-ish cores. In that case hand-optimized assembly is the way to go, as there is no budget to produce an optimizing compiler or anything like it.
My experience was with Texas Instruments C6000 DSPs. The compiler is excellent, and if you use it well you rarely have to resort to assembly. Even then, you normally write linear assembly rather than parallel assembly, and let the assembly optimizer take care of scheduling, register allocation, and packing instructions into parallel execute packets.
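To give a rough idea of what "using it well" can look like, here is a minimal sketch in C99. The function name `dot_product` and the trip-count numbers are made up for illustration. The idea is to tell the compiler what it cannot prove on its own: `restrict` promises the two arrays do not overlap, and the TI-specific `MUST_ITERATE` pragma promises bounds on the trip count, which should make it easier for the compiler to software-pipeline the loop. The pragma is guarded so the code still builds with other compilers.

```c
/* Dot product of two 16-bit arrays, written to be friendly to a
 * software-pipelining compiler. Hypothetical example, not TI sample code. */
int dot_product(const short *restrict a, const short *restrict b, int n)
{
    int sum = 0;
    int i;

#ifdef __TI_COMPILER_VERSION__
    /* Assumption for this sketch: callers always pass 8 <= n <= 1024
     * with n a multiple of 4. The pragma is a promise, not a check. */
    #pragma MUST_ITERATE(8, 1024, 4)
#endif
    for (i = 0; i < n; i++) {
        /* restrict tells the compiler a[] and b[] never alias, so loads
         * from different iterations can be overlapped freely. */
        sum += a[i] * b[i];
    }
    return sum;
}
```

If a promise made through `restrict` or `MUST_ITERATE` is broken by a caller, the behavior is undefined, so hints like these only belong on loops whose inputs you control.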