
Depends on your goal.

Outlier removal should be done much more carefully when the goal is inference, i.e. testing whether your hypothesis holds. Here, outlier removal is a super easy way to accidentally p-hack, because you can keep adjusting which points count as outliers until the result looks significant. Current best practice is to pre-register your analysis, including how you'll define and handle outliers.
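To make the pre-registration point concrete, here is a minimal sketch of an outlier rule that is fixed before looking at any data and then applied exactly once. The rule itself (a modified z-score based on the median absolute deviation), the 3.5 cutoff, and the names `MAD_CUTOFF` and `split_outliers` are my own assumptions for illustration, not something a pre-registration has to use.

```python
from statistics import median

# Hypothetical pre-registered rule: written down before any data is inspected.
MAD_CUTOFF = 3.5  # assumed threshold, chosen in advance and never tuned afterwards


def split_outliers(values, cutoff=MAD_CUTOFF):
    """Split values into (kept, removed) using a MAD-based modified z-score."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the rule cannot flag anything.
        return values, []
    kept, removed = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        (removed if score > cutoff else kept).append(v)
    return kept, removed


kept, removed = split_outliers([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
print("kept:", kept)
print("removed:", removed)
```

The important part is not the particular rule but that the cutoff is a constant decided up front, so the removed points can be reported alongside the result.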

For predictive goals, where the idea is to predict the class/value of unseen data, outlier removal is often a good way to keep a handful of extreme points from pulling your model towards them. The trade-off is that future outliers will be predicted as though they weren't outliers, i.e. much closer to the mean than they should be. This is what the article is trying to do.
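A toy sketch of that trade-off, using the simplest possible "model" (predict the training mean). The numbers are made up for illustration and the variable names are hypothetical.

```python
from statistics import mean

# Made-up training data with one extreme point (60).
train = [10, 11, 9, 10, 12, 60]
train_trimmed = [x for x in train if x != 60]  # outlier removed by hand for the demo

full_pred = mean(train)             # pulled upwards by the outlier
trimmed_pred = mean(train_trimmed)  # close to the typical values

typical_point = 10
future_outlier = 58

# Absolute errors on a typical future point: the trimmed model does better.
err_typical_full = abs(typical_point - full_pred)
err_typical_trimmed = abs(typical_point - trimmed_pred)

# Absolute errors on a future outlier: the trimmed model predicts it
# as if it were ordinary, so it does worse here.
err_outlier_full = abs(future_outlier - full_pred)
err_outlier_trimmed = abs(future_outlier - trimmed_pred)

print(full_pred, trimmed_pred)
print(err_typical_full, err_typical_trimmed)
print(err_outlier_full, err_outlier_trimmed)
```

Removing the 60 makes predictions for ordinary points much better, at the cost of being further off whenever another 60-like value shows up.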

There's also a whole wide world of outlier & anomaly detection, where you want to say e.g. "this new data point is probably an outlier".
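For the detection case, a very rough sketch: fit a mean and standard deviation on reference data, then flag a new point whose z-score exceeds a threshold. The z-score approach, the threshold of 3, and the function name `is_probable_outlier` are assumptions on my part; real anomaly detection usually uses something more robust.

```python
from statistics import mean, stdev

# Made-up reference data assumed to be "normal" behaviour.
reference = [10, 11, 9, 10, 12, 10, 11, 9]
ref_mean = mean(reference)
ref_sd = stdev(reference)  # sample standard deviation


def is_probable_outlier(x, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the reference mean."""
    z = abs(x - ref_mean) / ref_sd
    return z > threshold


print(is_probable_outlier(10.5))
print(is_probable_outlier(18))
```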


