When I see work like this, it gives me the impression that the ML hype is way too real.
The results of a machine learning model like this one should be compared with those of a simple "standard" model like linear regression.
Yeah, it is kinda cool that we can use 10 lines of TF to spin up a huge computation, but I'd guess that a simple linear regression would have provided results at least similar to those of the neural network.
> Yeah, it is kinda cool that we can use 10 lines of TF to spin up a huge computation, but I'd guess that a simple linear regression would have provided results at least similar to those of the neural network.
Plus, LR is not a black box, so it gives you both a prediction and the reason that prediction was made, which is a very desirable property in many problems.
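For what it's worth, here's a minimal sklearn sketch of that baseline on toy data (stand-in arrays, not the article's actual features): you get a prediction and coefficients you can read off directly.

    # Hypothetical baseline: plain linear regression on 25 features, toy data only.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))                                    # stand-in for 25 methylation features
    y = 50 + X @ rng.normal(size=25) + rng.normal(scale=3, size=700)  # stand-in for ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)

    print("test R^2:", model.score(X_te, y_te))
    print("per-feature weights:", model.coef_)  # the "reason" behind each prediction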
that's exactly what I'm seeing in my field (computational materials science). Basically a simple regression model (with very simple features) gets you 90% of the way there, yet people keep competing to publish ever-better results on ONE shitty benchmark dataset. The most-cited people use KRR, where each fit takes 2TB of RAM and "days" of CPU time (features of length O(1000) and 100,000 samples), while each underlying sample is probably a 30-second calculation (and still only a rough estimate). Sometimes it makes you want to question science, but hey, writing proposals with "ML" in them at least gives you a chance at that grant...
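For a sense of scale, some rough back-of-the-envelope arithmetic (my numbers, not anyone's actual pipeline):

    # Rough arithmetic: kernel ridge regression needs the full n x n kernel matrix in memory,
    # while a plain linear/ridge fit only needs the n x d feature matrix.
    n_samples, n_features, bytes_per_float = 100_000, 1_000, 8  # float64

    kernel_matrix_gb = n_samples**2 * bytes_per_float / 1e9
    feature_matrix_gb = n_samples * n_features * bytes_per_float / 1e9

    print(f"KRR kernel matrix alone: ~{kernel_matrix_gb:.0f} GB")       # ~80 GB
    print(f"linear model feature matrix: ~{feature_matrix_gb:.1f} GB")  # ~0.8 GB

And the solver typically wants working copies of that kernel matrix on top, which is presumably how you drift toward the terabyte range.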
This is great if you only need 90%. But often that extra 10% is the difference between a state-of-the-art, commercially viable product and vaporware that nobody will pay for.
It’s sort of like production software in general. Sometimes the core product is pretty easy to prototype... but serving it to all your users with very high reliability and uptime is not so easy, and that’s what actually gets them to pull out the credit card.
I probably haven't thought about this as much as you have, but OK, you get 90% of the way there; now what? How do you get that last 10%? Wouldn't that be a huge amount of work?
You get the last 10% by running a full calculation for that case (which itself is not necessarily accurate...). The whole situation is a little bit like Plato's cave... And from what I've seen so far, those models are either 90% there or overfitting like hell.
Part of the issue is that many of the neurons in a neural network end up contributing little to the performance. This is why the practice of pruning (removing neurons that don't contribute much) exists [1]. Neurons can end up with so little gradient that they get stuck (a big problem with ReLU), or can end up being insignificant because of the weights that subsequent layers assign to that neuron's output.
As the author is using ReLU, he will have a decent number of neurons "die", so some over-provisioning is not a bad idea in theory. Also, if my Keras isn't too rusty, I think the author is using fewer parameters than you are stating.
Still, a more reasonable dropout rate and maybe some regularization/batch normalization might help, but I would say not overfitting on only ~700 samples is hard, even with a network much smaller than that.
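For concreteness, the kind of slimmed-down setup I have in mind would look something like this (made-up layer sizes, not the author's actual architecture):

    # Hypothetical smaller network for ~700 samples x 25 features; not the author's model.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(25,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),   # milder dropout rate
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),       # single regression output: predicted age
    ])
    model.compile(optimizer="adam", loss="mae")
    # model.fit(X_train, y_train, epochs=200, validation_split=0.1)

Even then, with so few samples I'd lean on cross-validation to check that it's actually generalizing.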
I would be pretty shocked if this neural net wasn't overfit.
I know this is a joke, but the theory of generalization in NNs is rapidly advancing and it's not quite that simplistic: https://arxiv.org/abs/2003.02139
For anyone interested, there's a tool http://www.aging.ai which does exactly that, using deep-learning algorithms on, and I quote, "hundreds of thousands anonymized human blood tests".
I've used it myself for fun after doing a blood test. It's a free alternative to InsideTracker's InnerAge product.
yes valid point haha. It's more fun to use a neural net :) but after many similar comments I plan on implementing a simpler approach and seeing how it compares
It occurs to me that antibodies to diseases would make an interesting approach to age estimation. In the 1918 flu, older people were spared, presumably because they had immunity from an exposure in their own youth.
Interesting, but strewn with potential challenges.
Over time, cell populations with BCR/TCR that recognize and bind such antigens will cease proliferation. Moreover, some cell populations will be localized to certain tissues and not in circulation.
Our startup (Chronomics) has built the most accurate epigenetic clock from saliva (no needles...), which looks at 20 million positions (or features): https://www.chronomics.com/science
It's a really interesting area, and from DNA methylation we are starting to be able to define many more novel indicators of actionable health risk, such as smoke exposure, alcohol consumption, and metabolic status.
When you post an article, please don't publish the code somewhere an account must be created in order to read it. In this case I cannot check the full code because it is hosted at https://colab.research.google.com; a tarball attached to the article or a publicly accessible host like gitlab.com or github.com would have been fine.
Interesting write-up. I'd be interested to see how it performs with k-fold validation as well as shuffling. I'm kind of worried it's learning the ordering or memorizing specific samples.
I'll try it and get back to you with the performance under k-fold validation and shuffling.
I don't think it can be learning the order or memorizing samples, because the train and test data sets are separated very early on. If it were memorizing the order or the samples of the training set, it would perform very poorly on the test set.
This had been bugging me in the back of my head all day... it turns out shuffle is enabled by default, both in sklearn and in tf.keras (and in the original Keras too).
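For the other half of the request (k-fold), a quick sketch with sklearn on placeholder arrays:

    # Sketch: explicit shuffling plus 10-fold cross-validation on placeholder data.
    # Note: sklearn's train_test_split and Keras' model.fit both shuffle by default.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))      # placeholder for the 25 selected features
    y = rng.uniform(20, 80, size=700)   # placeholder ages

    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_mean_absolute_error")
    print("MAE per fold:", mae)
    print("mean MAE:", mae.mean())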
On a separate note, I think there may be a source file missing in your notebook. I kept getting an error when trying to load "GSE87571_series_matrix.csv". Might just be me.
Yes, I agree you want the number of samples to be greater than the number of features. This is why the number of features was reduced from over 400,000 to 25. After this reduction the number of features is less than the number of samples (~700).
Honestly I didn't prioritize hyperparameter tuning enough. I pretty much went with one of the first models I identified.
Could you elaborate on the idea of not capturing dependent features please?
For reference, the generally accepted standard for determining age from blood is the Horvath clock [1]. It seems to be accurate and uses only a penalized regression (a minimal sketch of that kind of model is below). Keep in mind this represents your age relative to a "healthy" person: for example, a 50-year-old who smokes may have the equivalent practical age of a 60-year-old who doesn't. The Horvath clock is useful for evaluating lifestyle changes and your overall healthspan.
If people want to learn more about how DNA methylation relates to aging, I recommend reading Lifespan by David Sinclair.
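If anyone wants to experiment with that family of models, a penalized regression is only a few lines of sklearn. A minimal sketch, assuming elastic-net-style penalization and dummy data (not the actual clock's CpG sites or coefficients):

    # Sketch: elastic-net penalized regression, the general family used by epigenetic clocks.
    # Dummy data; a real clock is trained on methylation beta values at thousands of CpG sites.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 400))                                   # placeholder methylation features
    y = 50 + X[:, :20].sum(axis=1) + rng.normal(scale=3, size=700)    # placeholder ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    clock = ElasticNetCV(cv=5).fit(X_tr, y_tr)
    print("test R^2:", clock.score(X_te, y_te))
    print("features with non-zero weight:", int((clock.coef_ != 0).sum()))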
Selecting the 25 features by maximum correlation with the target seems weak, because it is likely to introduce a lot of collinearity. Chapter 6 of the ISLR book covers many methods for working in high dimensions, i.e. when the number of features is larger than the number of samples: principal components regression, partial least squares, the lasso, ridge regression, and forward stepwise selection. All of these can be done in 10 or so lines of R using the packages and examples in the ISLR chapter 6 lab.
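And if you'd rather stay in Python like the article does, the sklearn equivalents are about as short. For example, principal components regression (a sketch on dummy data, not the article's features):

    # Sketch: principal components regression (PCR) = PCA followed by ordinary least squares.
    # Dummy data; swap in the real methylation matrix to try it properly.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 400))                                   # many (possibly collinear) features
    y = 50 + X[:, :20].sum(axis=1) + rng.normal(scale=3, size=700)

    pcr = make_pipeline(PCA(n_components=25), LinearRegression())
    mae = -cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_absolute_error")
    print("PCR mean MAE:", mae.mean())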
> Therefore I first split the data into training and testing sets at a ratio of 9:1, and selected the 25 most correlated features in the training set. Each of these features had a correlation with age between 0.83 and 0.94.
> The data was then split into training and testing sets at a ratio of 9:1, and fed into a sequential neural network.
What?? I thought it was split already.
(How the training and test sets were obtained sounds fairly confusing. Did the author make sure there's no "data snooping"?)
I apologize for the confusion; I've since removed this typo.
The data is only split once, before a correlation test is used to select the features that the model is trained on. As far as I can tell there is no data snooping, because the data is split into train and test sets before any decisions are made.
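Roughly, the order of operations looks like this (placeholder arrays, not the exact notebook code):

    # Sketch of split-before-select: correlations are computed on the training set only,
    # so the test set never influences which 25 features get picked.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(770, 400))      # placeholder; the real matrix has >400,000 CpG columns
    y = rng.uniform(20, 80, size=770)    # placeholder ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

    # absolute Pearson correlation of each training-set feature with age
    corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(X_tr.shape[1])])
    top25 = np.argsort(corr)[-25:]

    X_tr_sel, X_te_sel = X_tr[:, top25], X_te[:, top25]   # same 25 columns applied to the test set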
Is the underlying concept related to "By analyzing proteins in the blood, one can estimate a person's biological age, as well as weight, height, and hip circumference", mentioned in this article?
The two concepts may well be related, but the approach used in this paper doesn't make use of proteins in the blood. Rather, it uses DNA methylation extracted from white blood cells.
I don't know if you're the OP, but I didn't mean it in a negative way. This is extremely well written and researched, better than most graduate students' writing, let alone that of non-academics.
Far-reaching prediction: they're going to do facial prediction from blood samples as well. Law enforcement really wants to generate sketches from unknown DNA found at crime scenes.
That is, of course, in addition to the all-encompassing family trees we're providing them with 23andme.
What if I transfuse blood from healthy young subjects? That was a claim from some startups, a wacky VC or two, and even a joke in the Silicon Valley HBO show. The rumor mill says this is happening at a low level.
A linear model should always be compared to these DNNs.
There was a paper a year or so ago that compared correctly tuned linear models with the models from various deep belief net papers and found that the performance "gains" evaporated or were not nearly as great as originally published.
Looks like a beginner-level sklearn task. Linear regression would probably be fine; if not, there's random forest or a 2-layer perceptron. No need for a deep network.
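Something like this would settle it quickly (a sketch on stand-in data; plug in the actual 25 selected features):

    # Sketch: compare a few simple baselines with 5-fold CV before reaching for a deep net.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))                                    # stand-in for the selected features
    y = 50 + X @ rng.normal(size=25) + rng.normal(scale=3, size=700)  # stand-in for ages

    models = {
        "linear regression": LinearRegression(),
        "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "small MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0),
    }
    for name, m in models.items():
        mae = -cross_val_score(m, X, y, cv=5, scoring="neg_mean_absolute_error")
        print(f"{name}: mean MAE {mae.mean():.2f}")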