It is true, though, that more powerful models are less explainable. But in return you get more compact models and training via gradient descent, which is far faster than the combinatorial optimization involved in ensembles of trees. (Trust me, I've implemented both.)
"there isn't a way to cluster similar cases together to analyze what rules bind different decisions together"
"PCA analysis of activations of the hidden layer. Could also do k-means clustering of activations of hidden layer."
I concur. However, for 2-D or 3-D visualization, you should use the more recently developed t-SNE algorithm instead of PCA or other alternatives; t-SNE works far better.
Software (in Matlab and Python) is available at: http://ict.ewi.tudelft.nl/~lvandermaaten/t-SNE.html
You should also look at the JMLR paper (http://ict.ewi.tudelft.nl/~lvandermaaten/t-SNE_files/vanderm...) and the supplemental material (http://ict.ewi.tudelft.nl/~lvandermaaten/t-SNE_files/Supplem...) to see the visualizations produced by t-SNE and competing methods. This qualitative evaluation by looking at pictures speaks for itself.
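If you want to try this on your own network, here is a minimal sketch assuming scikit-learn and a hidden-activation matrix you have already extracted (the file name is just an illustrative placeholder; the link above has the authors' reference implementation):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # activations: (n_examples, n_hidden) matrix of hidden-layer activations,
    # extracted however your network exposes them (placeholder file name).
    activations = np.load("hidden_activations.npy")

    # Baseline: linear projection to 2-D with PCA, plus k-means to group similar cases.
    pca_2d = PCA(n_components=2).fit_transform(activations)
    cluster_ids = KMeans(n_clusters=10, n_init=10).fit_predict(activations)

    # t-SNE: reduce to ~50 dims with PCA first (common practice), then embed in 2-D.
    reduced = PCA(n_components=50).fit_transform(activations)
    tsne_2d = TSNE(n_components=2, perplexity=30.0).fit_transform(reduced)

    # Scatter tsne_2d colored by cluster_ids (or by class label) to see which
    # cases the hidden layer treats as similar.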
"Also, one cannot find the partial dependence of the output on a given input variable."
If your model assumes that the output is a non-linear combination of inputs, then yes, it is hard to express the output in terms of a linear decomposition. But that was your choice of modelling assumption, presumably because linear models are insufficiently powerful to fit the underlying variations.
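You can, however, estimate the partial dependence numerically: sweep one input over a grid while leaving the remaining inputs at their observed values, and average the model's predictions. A rough sketch (model.predict and the column index are placeholders, not any particular library's API):

    import numpy as np

    def partial_dependence(model, X, feature_idx, grid_size=20):
        # Average model output as one input is swept over its observed range.
        grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), grid_size)
        averages = []
        for value in grid:
            X_mod = X.copy()
            X_mod[:, feature_idx] = value                 # clamp the chosen input everywhere
            averages.append(model.predict(X_mod).mean())  # average over the data
        return grid, np.array(averages)

Plotting averages against grid gives the usual partial dependence curve for that variable.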
"The random forest classifier looks interesting, I'll have to investigate that. Any suggestions for papers/tutorials?"
Random forests were developed by Leo Breiman (RIP).
"Ensemble methods in machine learning" by Diettrich (2000) compares different tree ensemble methods. He concludes that boosting an ensemble of decision trees is better, except when the data are very noisy, in which randomized trees are better. (Boosting is when you focus on the examples that the model is currently doing the worst. Randomized instead works on random subsets of examples.) The main reason boosting is worse than randomized trees in the noisy case is because the AdaBoost exponential loss is sensitive to outliers. Which is to say, AdaBoost boosts the wrong loss function. Boosting an appropriate choice of loss function (perhaps a regularized log-loss) is probably superior to randomized trees in most circumstances.
"Improved boosting algorithms using confidence-rated predictions" by Schapire and Singer (2000) is a great introduction to boosting.
Around the same time, Llew Mason and Jerome Friedman independently demonstrated that boosting is essentially fitting an additive model by gradient descent in function space, at each step choosing the basis function (weak learner) that most steeply decreases the loss. So you should follow up by looking at their work: Mason et al., "Boosting Algorithms as Gradient Descent" (NIPS 1999), and Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001).
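Their view is easy to see in code: at each round, compute the negative gradient of the loss at the current predictions, fit a small tree to it, and add that tree to the model. A bare-bones sketch for binary log-loss (regression trees via scikit-learn; the learning rate and round count are arbitrary choices here):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_logloss(X, y, n_rounds=100, lr=0.1):
        # Fits an additive model F(x) = sum_m lr * h_m(x) by functional gradient
        # descent on the binary log-loss; y must be 0/1.
        F = np.zeros(len(y))                           # current additive model's scores
        trees = []
        for _ in range(n_rounds):
            p = 1.0 / (1.0 + np.exp(-F))               # current probability estimates
            residual = y - p                           # negative gradient of log-loss w.r.t. F
            tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
            F += lr * tree.predict(X)                  # take a step in function space
            trees.append(tree)
        return trees

    def predict_proba(trees, X, lr=0.1):
        F = lr * sum(t.predict(X) for t in trees)
        return 1.0 / (1.0 + np.exp(-F))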
"Calculate partial derivatives of inputs w.r.t. outputs."
Yes (derivatives of the outputs with respect to the inputs, that is). There are other ways to interpret the output. See, for example, "Visualizing Higher Layer Features of a Deep Network" by Erhan et al. (2009): http://www.iro.umontreal.ca/~lisa/publications/?page=publica...
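Computing those derivatives is easy if your network library exposes gradients; if not, finite differences are enough for a sanity check. A small generic sketch (forward_fn is whatever maps an input vector to the scalar output of interest; it is a placeholder, not a particular library's API):

    import numpy as np

    def input_sensitivities(forward_fn, x, eps=1e-4):
        # Finite-difference estimate of d(output)/d(input_i) at a single input x
        # (x is a 1-D float array; forward_fn returns a scalar).
        base = forward_fn(x)
        grads = np.zeros(len(x), dtype=float)
        for i in range(len(x)):
            x_step = x.copy()
            x_step[i] += eps
            grads[i] = (forward_fn(x_step) - base) / eps
        return grads   # large |grads[i]| means input i matters most at this point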