When I see work like this, it gives me the impression that the ML hype is way too real.
The results of a machine learning model like this one should be compared with those of a simple "standard" model like linear regression.
Yeah, it is kinda cool that we can use 10 lines of TF to spin up a huge computation, but I'd guess that a simple linear regression would have provided results at least similar to those of the neural network.
> Yeah, it is kinda cool that we can use 10 lines of TF to spin up a huge computation, but I'd guess that a simple linear regression would have provided results at least similar to those of the neural network.
Plus, LR is not a black box, so it gives you both a prediction and the reason that prediction was made, which is a very desirable property in many problems.
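For what it's worth, here's a minimal sklearn sketch of that baseline on toy data (stand-in arrays, not the article's actual features): you get a prediction and coefficients you can read off directly.

    # Hypothetical baseline: plain linear regression on 25 features, toy data only.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))                                    # stand-in for 25 methylation features
    y = 50 + X @ rng.normal(size=25) + rng.normal(scale=3, size=700)  # stand-in for ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)

    print("test R^2:", model.score(X_te, y_te))
    print("per-feature weights:", model.coef_)  # the "reason" behind each prediction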
that's exactly what I'm seeing in my field (computational materials science). Basically a simple regression model (with very simple features) gets you 90% of the way there, yet people keep competing to publish ever-better results on ONE shitty benchmark dataset. The most-cited people use KRR, where each fit takes 2TB of RAM and "days" of CPU time (features of length O(1000) and 100,000 samples), while each underlying sample is probably a 30-second calculation (and still only a rough estimate). Sometimes it makes you want to question science, but hey, writing proposals with "ML" in them at least gives you a chance at that grant...
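For a sense of scale, some rough back-of-the-envelope arithmetic (my numbers, not anyone's actual pipeline):

    # Rough arithmetic: kernel ridge regression needs the full n x n kernel matrix in memory,
    # while a plain linear/ridge fit only needs the n x d feature matrix.
    n_samples, n_features, bytes_per_float = 100_000, 1_000, 8  # float64

    kernel_matrix_gb = n_samples**2 * bytes_per_float / 1e9
    feature_matrix_gb = n_samples * n_features * bytes_per_float / 1e9

    print(f"KRR kernel matrix alone: ~{kernel_matrix_gb:.0f} GB")       # ~80 GB
    print(f"linear model feature matrix: ~{feature_matrix_gb:.1f} GB")  # ~0.8 GB

And the solver typically wants working copies of that kernel matrix on top, which is presumably how you drift toward the terabyte range.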
This is great if you only need 90%. But often that extra 10% is the difference between a state-of-the-art, commercially viable product and vaporware that nobody will pay for.
It’s sort of like production software in general. Sometimes the core product is pretty easy to prototype... but serving it to all your users with very high reliability and uptime is not so easy, and that’s what actually gets them to pull out the credit card.
I probably haven't thought about this as much as you have, but OK, you get 90% of the way there; now what? How do you get that last 10%? Wouldn't that be a huge amount of work?
You get the last 10% by running a full calculation for that case (which itself is not necessarily accurate...). The whole situation is a little bit like Plato's cave... And from what I've seen so far, those models are either 90% there or overfitting like hell.
Part of the issue is that many of the neurons in a neural network end up contributing little to the performance. This is why the practice of pruning (removing neurons that don't contribute much) exists [1]. Neurons can end up with so little gradient that they get stuck (a big problem with ReLU), or can end up being insignificant because of the weights that subsequent layers assign to that neuron's output.
As the author is using ReLU, he will have a decent number of neurons "die", so some over-provisioning is not a bad idea in theory. Also, if my Keras isn't too rusty, I think the author is using fewer parameters than you are stating.
Still, a more reasonable dropout rate and maybe some regularization/batch normalization might help, but I would say not overfitting on only ~700 samples is hard, even with a network much smaller than that.
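For concreteness, the kind of slimmed-down setup I have in mind would look something like this (made-up layer sizes, not the author's actual architecture):

    # Hypothetical smaller network for ~700 samples x 25 features; not the author's model.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(25,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),   # milder dropout rate
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),       # single regression output: predicted age
    ])
    model.compile(optimizer="adam", loss="mae")
    # model.fit(X_train, y_train, epochs=200, validation_split=0.1)

Even then, with so few samples I'd lean on cross-validation to check that it's actually generalizing.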
I would be pretty shocked if this neural net wasn't overfit.
I know this is a joke, but the theory of generalization in NNs is rapidly advancing and it's not quite that simplistic: https://arxiv.org/abs/2003.02139
For anyone interested, there's a tool http://www.aging.ai which does exactly that, using deep-learning algorithms on, and I quote, "hundreds of thousands anonymized human blood tests".
I've used it myself for fun after doing a blood test. It's a free alternative to InsideTracker's InnerAge product.
yes valid point haha. It's more fun to use a neural net :) but after many similar comments I plan on implementing a simpler approach and seeing how it compares
It occurs to me that antibodies to diseases would make an interesting approach to age estimation. In the 1918 flu, older people were spared, presumably because they had immunity from an exposure in their own youth.
Interesting, but strewn with potential challenges.
Over time, cell populations with BCR/TCR that recognize and bind such antigens will cease proliferation. Moreover, some cell populations will be localized to certain tissues and not in circulation.
Our startup (Chronomics) has built the most accurate epigenetic clock from saliva (no needles...), which looks at 20 million positions (or features): https://www.chronomics.com/science
It's a really interesting area, and from DNA methylation we are starting to be able to define many more novel indicators of actionable health risk, such as smoke exposure, alcohol consumption, and metabolic status.
When you post an article, please don't publish the code somewhere an account must be created in order to read it. In this case I cannot check the full code because it is hosted at https://colab.research.google.com; a tarball attached to the article or a publicly accessible host like gitlab.com or github.com would have been fine.
Interesting write-up. I'd be interested to see how it performs with k-fold validation as well as shuffling. I'm kind of worried it's learning the ordering or memorizing specific samples.
I'll try it and get back to you with the performance under k-fold validation and shuffling.
I don't think it can be learning the order or memorizing samples, because the train and test data sets are separated very early on. If it were memorizing the order or the samples of the training set, it would perform very poorly on the test set.
This had been bugging me in the back of my head all day... it turns out shuffle is enabled by default, both in sklearn and in tf.keras (and in the original Keras too).
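For the other half of the request (k-fold), a quick sketch with sklearn on placeholder arrays:

    # Sketch: explicit shuffling plus 10-fold cross-validation on placeholder data.
    # Note: sklearn's train_test_split and Keras' model.fit both shuffle by default.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))      # placeholder for the 25 selected features
    y = rng.uniform(20, 80, size=700)   # placeholder ages

    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_mean_absolute_error")
    print("MAE per fold:", mae)
    print("mean MAE:", mae.mean())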
On a separate note, I think there may be a source file missing in your notebook. I kept getting an error when trying to load "GSE87571_series_matrix.csv". Might just be me.
Yes, I agree you want the number of samples to be greater than the number of features. This is why the number of features was reduced from over 400,000 to 25. After this reduction the number of features is less than the number of samples (~700).
Honestly I didn't prioritize hyperparameter tuning enough. I pretty much went with one of the first models I identified.
Could you elaborate on the idea of not capturing dependent features please?
For reference, the generally accepted standard for determining age from blood is the Horvath clock [1]. It seems to be accurate and uses only a penalized regression (a minimal sketch of that kind of model is below). Keep in mind this represents your age relative to a "healthy" person: for example, a 50-year-old who smokes may have the equivalent practical age of a 60-year-old who doesn't. The Horvath clock is useful for evaluating lifestyle changes and your overall healthspan.
If people want to learn more about how DNA methylation relates to aging, I recommend reading Lifespan by David Sinclair.
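If anyone wants to experiment with that family of models, a penalized regression is only a few lines of sklearn. A minimal sketch, assuming elastic-net-style penalization and dummy data (not the actual clock's CpG sites or coefficients):

    # Sketch: elastic-net penalized regression, the general family used by epigenetic clocks.
    # Dummy data; a real clock is trained on methylation beta values at thousands of CpG sites.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 400))                                   # placeholder methylation features
    y = 50 + X[:, :20].sum(axis=1) + rng.normal(scale=3, size=700)    # placeholder ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    clock = ElasticNetCV(cv=5).fit(X_tr, y_tr)
    print("test R^2:", clock.score(X_te, y_te))
    print("features with non-zero weight:", int((clock.coef_ != 0).sum()))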
Selecting the 25 features by maximum correlation with the target seems weak, because it is likely to introduce a lot of collinearity. Chapter 6 of the ISLR book covers many methods for working in high dimensions, i.e. when the number of features is larger than the number of samples: principal components regression, partial least squares, the lasso, ridge regression, and forward stepwise selection. All of these can be done in 10 or so lines of R using the packages and examples in the ISLR chapter 6 lab.
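And if you'd rather stay in Python like the article does, the sklearn equivalents are about as short. For example, principal components regression (a sketch on dummy data, not the article's features):

    # Sketch: principal components regression (PCR) = PCA followed by ordinary least squares.
    # Dummy data; swap in the real methylation matrix to try it properly.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 400))                                   # many (possibly collinear) features
    y = 50 + X[:, :20].sum(axis=1) + rng.normal(scale=3, size=700)

    pcr = make_pipeline(PCA(n_components=25), LinearRegression())
    mae = -cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_absolute_error")
    print("PCR mean MAE:", mae.mean())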
> Therefore I first split the data into training and testing sets at a ratio of 9:1, and selected the 25 most correlated features in the training set. Each of these features had a correlation with age between 0.83 and 0.94.
> The data was then split into training and testing sets at a ratio of 9:1, and fed into a sequential neural network.
What?? I thought it was split already.
(How the training and test sets were obtained sounds fairly confusing. Did the author make sure there's no "data snooping"?)
I apologize for the confusion; I've since removed this typo.
The data is only split once, before a correlation test is used to select the features that the model is trained on. As far as I can tell there is no data snooping, because the data is split into train and test sets before any decisions are made.
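Roughly, the order of operations looks like this (placeholder arrays, not the exact notebook code):

    # Sketch of split-before-select: correlations are computed on the training set only,
    # so the test set never influences which 25 features get picked.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(770, 400))      # placeholder; the real matrix has >400,000 CpG columns
    y = rng.uniform(20, 80, size=770)    # placeholder ages

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

    # absolute Pearson correlation of each training-set feature with age
    corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(X_tr.shape[1])])
    top25 = np.argsort(corr)[-25:]

    X_tr_sel, X_te_sel = X_tr[:, top25], X_te[:, top25]   # same 25 columns applied to the test set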
Is the underlying concept related to "By analyzing proteins in the blood, one can estimate a person's biological age, as well as weight, height, and hip circumference", mentioned in this article?
The two concepts may well be related, but the approach used in this paper doesn't make use of proteins in the blood. Rather, it uses DNA methylation extracted from white blood cells.
I don't know if you're the OP, but I didn't mean it in a negative way. This is extremely well written and researched, better than most graduate students' writing, let alone that of non-academics.
Far-reaching prediction: they're going to do facial prediction from blood samples as well. Law enforcement really wants to generate sketches from unknown DNA found at crime scenes.
That is, of course, in addition to the all-encompassing family trees we're providing them with 23andme.
What if I transfuse blood from healthy young subjects? That was a claim from some startups, a wacky VC or two, and even a joke in the Silicon Valley HBO show. The rumor mill says this is happening at a low level.
A linear model should always be compared to these DNNs.
There was a paper a year or so ago that compared correctly tuned linear models with the models from various deep belief net papers and found that the performance "gains" evaporated or were not nearly as great as originally published.
Looks like a beginner-level sklearn task. Linear regression would probably be fine; if not, there's random forest or a 2-layer perceptron. No need for a deep network.
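Something like this would settle it quickly (a sketch on stand-in data; plug in the actual 25 selected features):

    # Sketch: compare a few simple baselines with 5-fold CV before reaching for a deep net.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(700, 25))                                    # stand-in for the selected features
    y = 50 + X @ rng.normal(size=25) + rng.normal(scale=3, size=700)  # stand-in for ages

    models = {
        "linear regression": LinearRegression(),
        "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "small MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0),
    }
    for name, m in models.items():
        mae = -cross_val_score(m, X, y, cv=5, scoring="neg_mean_absolute_error")
        print(f"{name}: mean MAE {mae.mean():.2f}")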