Hacker News
Histogram vs. ECDF (brooker.co.za)
69 points by r4um on Sept 4, 2022 | hide | past | favorite | 24 comments


I think there's an issue with the histogram rendering in this post. The rapid descent from the spike on the left is not consistent with high ECDF impact and the apparent binning resolution visible in the piecewise line-segments. In general histograms should not be visualized with connected line-graphs in this way - the standard bar graph depiction makes the bin-width apparent and resolves some of the issues the article needs the ECDF for (e.g. relative impact can be assessed visually by comparing the relative areas of the associated bars). The bar visualization also makes it possible to use varying bin sizes, which is extremely useful with any distribution that has tails.


The ECDF is particularly useful for comparing two distributions. And it has a nice connection to the Kolmogorov-Smirnov test for testing whether two distributions are different: its test statistic is the maximum distance between the two ECDFs.
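A minimal sketch of that test using SciPy's `ks_2samp` (the latency numbers here are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(100.0, 10.0, 500)    # e.g. baseline latencies (ms), synthetic
b = rng.normal(110.0, 10.0, 500)    # e.g. a slower new version, synthetic

# Two-sample KS test: the statistic is the maximum vertical distance
# between the two empirical CDFs.
result = stats.ks_2samp(a, b)
```

`result.statistic` is the max ECDF gap (between 0 and 1) and `result.pvalue` the significance under the null that both samples come from the same distribution.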


This test seems really underrated. It's my go-to for comparing computer system performance (e.g. between versions on CI) since they often have very peculiar distributions and are relatively cheap to produce enough samples from.


Hmm, I was trying to do the same with the KS test on performance data but it seemed extremely sensitive to outliers even when an eyeball test of the two distributions looks near exact. Have you run into any of those issues?


Not with outliers, no, I think it handles them particularly well (conservatively). But yes to small consistent differences (e.g. a uniform 1%) that affect the relative order of results but not by a very important amount. So you have to consider effect size even when the test statistic is strong.


I've stopped using it as I found it far too sensitive to small differences.


Perhaps this comes from a misunderstanding of what statistical significance means. A test reporting a statistically significant difference doesn't mean the difference is big, just that it's big enough to separate out as a "signal" from the underlying random noise.

It's basically saying "yes, given this data I am very confident that there is an underlying difference that is not just an artefact of random sampling".


Which, if you have a million samples, can be true without the statement "these are basically acting the same" being falsified in some human sense.
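To make that concrete, a sketch with SciPy and synthetic data: with very large n, even a negligible shift is overwhelmingly "significant", so the statistic (the effect size) matters more than the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200_000
a = rng.normal(100.0, 10.0, n)
b = rng.normal(100.5, 10.0, n)      # a shift of only 0.05 standard deviations

# The p-value will be astronomically small, yet the KS statistic (the
# maximum ECDF gap) stays tiny: "significant" but practically negligible.
result = stats.ks_2samp(a, b)
```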


Yeah true dat.

Still, if you have 5-50 samples from an unknown distribution, I think the KS test can be a very valuable protection against false positives, i.e. seeing patterns in random noise.

Even more so if you are making multiple comparisons, e.g. comparing multiple response variables, since you can easily make a Bonferroni correction to avoid "the xkcd 882 problem" which is hard to do by eyeball.

So I see a lot of utility in the KS test (and Wilcoxon, etc.) when measuring computer systems on multiple metrics with unknown distributions (often not Gaussian) and where samples cost ~$0.10-$1.00 each. I see a lot of false positives in the absence of such tests, or otherwise generally low confidence in inferences based on e.g. eyeball/mean/median.
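The Bonferroni correction mentioned above is a one-liner: with k comparisons, test each at alpha/k. A minimal sketch (the metric names and p-values here are hypothetical):

```python
# Family-wise error control across multiple metrics ("the xkcd 882 problem"):
# with k tests, require p < alpha / k for each one.
alpha = 0.05
p_values = {"latency_p50": 0.004, "latency_p99": 0.030, "throughput": 0.200}
k = len(p_values)

significant = {metric: p < alpha / k for metric, p in p_values.items()}
```

With k = 3 the per-test threshold drops to about 0.0167, so only `latency_p50` survives here.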


This is the kind of thing I brought up in a sibling comment about 'why do you care?' Do you care about the distribution of some metric, or do you care about location or scale?


What do you use instead?


If I care about 'was this data generated by this specific distribution' then I'll quantize and do a chi-square.

If I care about 'were these two datasets generated by the same distribution' then I'll step back and ask 'why do I care?' and 'what will I do with the answer?' and then use a more specific test.
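A sketch of that quantize-and-chi-square approach with SciPy (the bin edges and the standard-normal reference are my choices for illustration, not prescribed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 1000)   # pretend this is the data under test

# Quantize into bins, then compare observed counts with the counts a
# standard normal would predict for the same bin edges.
edges = np.array([-10.0, -1.0, -0.3, 0.3, 1.0, 10.0])
observed, _ = np.histogram(data, bins=edges)
expected = np.diff(stats.norm.cdf(edges)) * len(data)

chi2, p = stats.chisquare(observed, f_exp=expected)
```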


This is a nice article, but one thing that's not quite right: you can in fact go from a histogram to an eCDF (basically view the bucketing as a loss in measurement precision).

I mention this because histograms, especially HDR histograms, are a very compact way of measuring distributions, and it's nice that you can keep those benefits and still convert to an eCDF.
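That conversion is just a cumulative sum over the bucket counts (hypothetical numbers for illustration):

```python
import numpy as np

counts = np.array([5, 20, 40, 25, 10])        # hypothetical bucket counts
edges = np.array([0, 10, 20, 30, 40, 50])     # hypothetical bucket edges (ms)

# Cumulative counts over the total give the eCDF, known exactly at the
# right edge of each bucket; within a bucket you only lose precision.
ecdf_at_right_edges = np.cumsum(counts) / counts.sum()
```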


I'm a behavioral scientist and I find both are useful. If you never look at a histogram it's surprisingly easy to fool yourself about what exactly the ecdf is telling you in certain situations, particularly when comparing distributions.


While this is nice, it seems like without bucketing you would run into complexity issues with large amounts of data, right? i.e. to plot a true eCDF you need a sorted list of all the collected datapoints. I guess for actual plotting you have to effectively bucketize based on the number of pixels in your plot, but that seems fairly arbitrary.

Histograms are nice in that they effectively compress non-trivial datasets (at least those that have a reasonable bounded domain) to something quite manageable.

I guess there is nothing stopping you from doing the same thing here, but it does kind of discount the author's claim of not being able to go between histogram and eCDF.

Am I missing something?
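For concreteness, the exact eCDF really is just the sorted samples (sketch with NumPy and synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.exponential(scale=5.0, size=1000)   # synthetic data

# A true eCDF keeps every sample: sort them, and give the i-th order
# statistic the height (i + 1) / n.
xs = np.sort(samples)
ys = np.arange(1, len(xs) + 1) / len(xs)
```

So the storage is O(n), which is exactly the concern raised above; any compression (bucketing, sketches) trades that away for some loss of precision.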


If you have more data points than horizontal pixels, yes, you will bucket the data on your display resolution. That happens with any kind of plotting.

That is a completely different thing from the arbitrary bucketing of histograms. A CDF doesn't go to zero or become misleading if you bucket it wrong; you just lose detail.


My point is more that for an eCDF you need to store n values (where n is the number of samples), or 2k values if there are dupes (where k is the number of distinct values); if you're storing that anyway, you could generate a histogram from the same data.

If there are duplicate sample values, you can still store a sorted list of (sample,count) here and generate either a histogram OR an eCDF, or any other plot really.

Effectively it is not a fair comparison to compare the two methods since they both have storage tradeoffs that are not really discussed.
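The (sample, count) representation mentioned above serves both plots; a minimal sketch with toy data:

```python
from collections import Counter
from itertools import accumulate

samples = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]     # toy data
table = sorted(Counter(samples).items())         # distinct (value, count) pairs

values = [v for v, _ in table]
counts = [c for _, c in table]
total = sum(counts)

# The counts give a histogram directly; cumulative counts over the
# total give the eCDF heights at each distinct value.
ecdf = [c / total for c in accumulate(counts)]
```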


Nobody is discussing the computing performance of calculating those plots. It's taken for granted that you have more than enough resources to do anything you want with the data. If you don't, you are really in big-data territory, and you will start to get all kinds of interesting tradeoffs that are completely different from one place to another.

The entire discussion is about the quality of the information the plot communicates to you. Histograms can be completely misleading; CDFs can't. Finding a bucketing such that a histogram faithfully communicates the underlying data is a non-trivial problem; for CDFs it's not a problem at all.


Look up t-digests, a streaming algorithm for eCDFs. I'm pretty sure you can do histograms with these as well.


Computing a sorted list online is an amortized O(1) operation, O(log n) worst case.


Adding elements one at a time is O(log n) into an already sorted list. But producing a complete sorted list requires doing that n times, so you end up with n log n anyways. Am I missing something?


I recommend Kernel Density Estimation as an alternative to histograms if you are specifically interested in the density - e.g. which values are particularly likely to occur (perhaps for multimodal distributions).

https://en.wikipedia.org/wiki/Kernel_density_estimation
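A minimal KDE sketch using SciPy's `gaussian_kde` on a made-up bimodal sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Synthetic bimodal data: two well-separated normal components.
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(2.0, 0.5, 500)])

kde = stats.gaussian_kde(data)        # bandwidth chosen by Scott's rule
xs = np.linspace(-4.0, 4.0, 201)
density = kde(xs)                     # both modes show up as peaks
```

Unlike a histogram, there are no bin edges to pick, though the bandwidth plays an analogous smoothing role.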


I've been told box plots with a kernel density estimate along the axis (violin plots) are very useful.


For anyone finding themselves doing a bit of analysis using an eCDF, seaborn[0] has a plot for it

https://seaborn.pydata.org/generated/seaborn.ecdfplot.html



