These kinds of optimizations always make me wonder whether they are worth it. Might it be more efficient to spend those transistors on more, simpler cores instead? Perhaps the fact that most problems are inherently sequential makes timing and clock-rate optimizations inevitable.
I would like to see a superscalar out-of-order (OoO) CPU with a RISC-V ISA. Since RISC-V cores tend to be very small, I would expect to see a CPU with hundreds of cores.
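
To put rough numbers on the "more simple cores vs. faster cores" question, here is a minimal sketch using Amdahl's law. The core counts, the 3x per-core speedup, and the parallel fractions are made-up assumptions for illustration, not measurements of any real chip, and the function names are my own.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one baseline core when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def chip_speedup(parallel_fraction: float, cores: int, per_core_speedup: float) -> float:
    """Same model, but every core is per_core_speedup times faster than baseline.

    Assumes the per-core gain applies equally to serial and parallel work,
    which is a simplification.
    """
    return per_core_speedup * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical transistor budget: 256 simple cores, or 16 complex cores
    # that are each 3x faster on a single thread.
    for p in (0.5, 0.9, 0.99, 0.999):
        many_simple = chip_speedup(p, cores=256, per_core_speedup=1.0)
        few_fast = chip_speedup(p, cores=16, per_core_speedup=3.0)
        print(f"parallel={p}: 256 simple={many_simple:.1f}x, 16 fast={few_fast:.1f}x")
```

Under these assumed numbers the 16 fast cores win when 90% of the work is parallel, and the 256 simple cores only pull ahead once the parallel fraction gets very close to 1. That is roughly the intuition behind spending transistors on single-thread performance.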