Very nice. I wonder if training can be simplified by training pieces of the model separately, instead of training all together. For example, the DeepSpeech model has three layers of feedforward neurons (where the inputs to the first layer are overlapping contexts of audio), followed by a bi-directional recurrent layer, followed by another feedforward layer. What would the results be if we trained the first layers (perhaps all three) on a different problem, such as autoencoding or fill-in-the-blank (as in word2vec), and then fixed those network weights to train the rest of the network?
Breaking the network up like this would reduce training time and perhaps the amount of training data needed. Since the first layers could be trained without supervision, less labeled data would be needed to train the last two layers. It would also facilitate transferring models between problems: the output of the first few layers could, like word2vec embeddings, be fed as input features into arbitrary other machine learning tasks, e.g., translation.
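To make the two-stage idea concrete, here is a toy NumPy sketch (this is not the DeepSpeech architecture; every shape, name, and dataset below is made up for illustration): a linear autoencoder stands in for the unsupervised pretraining of the first layer, whose weights are then frozen while a small supervised head is trained on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled stand-in data: 200 samples, 8 features (imagine overlapping
# audio contexts; all sizes here are illustrative, not DeepSpeech's).
X = rng.normal(size=(200, 8))

def recon_loss(W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

# --- Stage 1: unsupervised pretraining (linear autoencoder, no labels) ---
W_enc = rng.normal(scale=0.1, size=(8, 4))   # first-layer weights we keep
W_dec = rng.normal(scale=0.1, size=(4, 8))   # decoder, discarded afterwards
lr = 0.05
recon_before = recon_loss(W_enc, W_dec)
for _ in range(500):
    H = X @ W_enc                       # hidden code
    err = H @ W_dec - X                 # reconstruction error
    g_dec = H.T @ err / len(X)          # grad of mean squared error w.r.t. W_dec
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
recon_after = recon_loss(W_enc, W_dec)

# --- Stage 2: freeze W_enc, train only a supervised head on labeled data ---
y = (X[:, 0] > 0).astype(float)[:, None]     # synthetic labels for the sketch
W_head = rng.normal(scale=0.1, size=(4, 1))
H = X @ W_enc                                # frozen features: computed once

def head_loss(W_head):
    p = 1 / (1 + np.exp(-(H @ W_head)))
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

head_loss_before = head_loss(W_head)
for _ in range(500):
    p = 1 / (1 + np.exp(-(H @ W_head)))
    W_head -= 0.5 * H.T @ (p - y) / len(X)   # only the head is updated
head_loss_after = head_loss(W_head)
```

In a real framework the "freeze" would just be excluding the pretrained weights from the optimizer's parameter list; here it falls out of never updating W_enc in stage 2.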
If this does not work, then how about training the whole model together, but only once? The final results are reported for an ensemble of six independently trained networks. What if we started by training one network, and then fixed its first three layers while training the other networks? (Instead of fixing the first layers, you could also just give them a slower learning rate, although it isn't clear how much that would save you.)
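The parenthetical alternative, giving the early layers a much slower learning rate instead of an outright freeze, can also be sketched in a few lines. Another toy NumPy example (all shapes, data, and step sizes are illustrative): the same SGD loop, but each weight matrix gets its own step size, so the "pretrained" first layer drifts far less than the freshly trained head.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny regression stand-in: 100 samples, 8 features, noisy linear targets.
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=(8, 1)) + 0.1 * rng.normal(size=(100, 1))

W1 = rng.normal(scale=0.1, size=(8, 4))      # "pretrained" early layer
W2 = rng.normal(scale=0.1, size=(4, 1))      # freshly initialized head
lr = {"W1": 0.001, "W2": 0.05}               # early layer crawls, head moves

def loss():
    return float(np.mean((X @ W1 @ W2 - y) ** 2))

loss_before = loss()
W1_start, W2_start = W1.copy(), W2.copy()
for _ in range(300):
    err = X @ W1 @ W2 - y
    g2 = (X @ W1).T @ err / len(X)           # grad w.r.t. the head
    g1 = X.T @ (err @ W2.T) / len(X)         # grad w.r.t. the early layer
    W2 -= lr["W2"] * g2                      # normal step for the head
    W1 -= lr["W1"] * g1                      # 50x slower step for the early layer
loss_after = loss()
W1_drift = float(np.abs(W1 - W1_start).mean())
W2_drift = float(np.abs(W2 - W2_start).mean())
```

In most deep-learning frameworks this is expressed as per-parameter-group learning rates in the optimizer, rather than a hand-rolled loop like the one above.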