The Central Limit Theorem Makes Random Testing Hard (regehr.org)
56 points by sharkbot on June 19, 2012 | hide | past | favorite | 15 comments


Warning: slight pedantry ahead.

Specifically what this article is noticing is the law of large numbers [1], that sums or averages are "close" to their mean. The central limit theorem says that sums converge in distribution towards being normally distributed. But convergence in distribution isn't a strong statement about tail behavior, and so shouldn't be used in the sort of argument in the post.

[1] http://en.wikipedia.org/wiki/Law_of_large_numbers
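A quick sketch of the distinction (seed and sample sizes are arbitrary choices): the law of large numbers says the *average* of ±1 steps concentrates near its mean of 0, roughly like 1/sqrt(N), while the sum itself spreads out on the order of sqrt(N).

```python
import random

random.seed(0)

def average_of_steps(n):
    """Average of n independent +/-1 steps."""
    return sum(random.choice((-1, 1)) for _ in range(n)) / n

# Law of large numbers: the average shrinks toward the mean (0)
# roughly like 1/sqrt(n) as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, abs(average_of_steps(n)))
```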


If the goal is to randomly probe the parameter space of a model, the author is correct that independent Gaussian random numbers are a poor choice (although it's not really related to the central limit theorem). However, this isn't a novel revelation. There are other/better ways to do it, for instance http://en.wikipedia.org/wiki/Latin_hypercube_sampling
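For the curious, the core of Latin hypercube sampling fits in a few lines (a sketch, not tied to any particular library): split each dimension into N strata and use each stratum exactly once, so samples can't all cluster in one corner of the space.

```python
import random

def latin_hypercube(n_samples, n_dims, rng=random):
    """Return n_samples points in [0,1)^n_dims, stratified so that in
    each dimension every one of the n_samples strata is hit exactly once."""
    # For each dimension, a shuffled assignment of strata to samples.
    strata = []
    for _ in range(n_dims):
        perm = list(range(n_samples))
        rng.shuffle(perm)
        strata.append(perm)
    samples = []
    for i in range(n_samples):
        # Place the i-th point uniformly inside its assigned stratum
        # in each dimension.
        point = [(strata[d][i] + rng.random()) / n_samples
                 for d in range(n_dims)]
        samples.append(point)
    return samples
```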


Smart suggestion. There is a grain of truth in what the article says, but it's a little confused.

The article is also conflating two things: the (unknown) internal state of the system, which can live in a very high-dimensional space, and the system inputs, which move the internal state around in ways you don't understand, but which you at least get to choose directly.

The cases the article examines use a single number as the internal state, inputs of +1 or -1, and the system dynamics:

   state(t) = state(t-1) + input(t)
(E.g., "state" is "number of open files".) The CLT does apply in this case, but it's clear that this is a very narrow case.
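A tiny simulation of exactly this system (a sketch; the bounds and run counts are arbitrary choices) shows how rarely a pure ±1 walk wanders far from its start, which is the article's underlying complaint:

```python
import random

random.seed(1)

def walk_hits_bound(steps, bound):
    """Simulate state(t) = state(t-1) + input(t) with random +/-1 inputs;
    return True if |state| ever reaches the bound
    (e.g. "too many open files")."""
    state = 0
    for _ in range(steps):
        state += random.choice((-1, 1))
        if abs(state) >= bound:
            return True
    return False

RUNS, STEPS = 1000, 1000
# sqrt(1000) is about 32, so a bound of 100 is roughly 3 sigma out,
# while a bound of 10 is well inside the walk's typical spread.
far_hits = sum(walk_hits_bound(STEPS, 100) for _ in range(RUNS))
near_hits = sum(walk_hits_bound(STEPS, 10) for _ in range(RUNS))
```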

I'm not aware of convergence results that hold in any generality here. There are concepts from control theory of observability and controllability (where can the internal state go, and how can you read it out). And there are the concepts from nonlinear dynamics of attractors.


While it is well known that uninformed random test generation often does a poor job of probing the parameter space, this has nothing to do with the CLT. Even in his first example with the bounded queue, one where the CLT does make predictions, he applies it incorrectly. That's because what we care about is not the difference between the number of enqueue and dequeue operations (which has expected value 0), but the magnitude of the difference (which grows like sqrt(N)). And what we care about is not the magnitude of the difference after completing the tests, but the greatest magnitude at any point along the random walk. This is governed by a more advanced theorem, the Law of the Iterated Logarithm, which tells us that the probability of finding the bug is quite a bit higher than the author realizes.
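The endpoint-versus-peak distinction is easy to check numerically (a sketch with arbitrary sizes): the maximum excursion of a walk is noticeably larger than its final displacement, which is why a bound-violating bug is more reachable than an analysis of the final state suggests.

```python
import random

random.seed(2)

def walk_extremes(steps):
    """Return (|final state|, max |state| along the way) for a +/-1 walk."""
    state, peak = 0, 0
    for _ in range(steps):
        state += random.choice((-1, 1))
        peak = max(peak, abs(state))
    return abs(state), peak

N, TRIALS = 10_000, 200
finals, peaks = zip(*(walk_extremes(N) for _ in range(TRIALS)))
avg_final = sum(finals) / TRIALS  # on the order of sqrt(N)
avg_peak = sum(peaks) / TRIALS    # systematically larger than avg_final
```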


The central limit theorem only applies when random variables are combined linearly. I see no reason to suspect that this is the case with random inputs for testing. (Most functions are nonlinear!)


Why the central limit theorem is usually off:

http://daniellefong.com/2008/01/28/outliers-why-the-central-...


That's a great essay, but the author is talking about random-walk situations, which are sums of independent random variables. We know this because that's how the automatic software testing tools are (currently) written.


Correct; I probably should have provided more context here.

Random testing in this fashion is a problem only if the actual distribution of demands doesn't match the central limit theorem's assumptions, for example if user demand is strongly correlated. Because the central limit theorem is usually off, it is not safe to assume that testing with independent random variables will adequately reproduce the risky situations testing should cover.


I think one approach would be to use RNGs in a slightly different way. Instead of a 'random walk' of various API calls, define a possible outcome category, and choose a random number to be the quantity of that outcome.

So for the bounded queue problem, set the outcome category to be "elements in queue". Say you get 3, 12, and 7 as your first three random numbers. Then the first three test cases would be "get the load to 3 elements", "get the load to 12 elements", and "get the load to 7 elements".
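A minimal sketch of this idea, assuming a hypothetical BoundedQueue (the class, its capacity, and the driver are all made up for illustration): draw a random target load, then drive the queue straight to it instead of random-walking.

```python
import random

random.seed(3)

class BoundedQueue:
    """Hypothetical system under test, capacity 16."""
    CAPACITY = 16
    def __init__(self):
        self.items = []
    def enqueue(self, x):
        # A real overflow bug would fire on the next line.
        assert len(self.items) < self.CAPACITY
        self.items.append(x)
    def dequeue(self):
        # A real underflow bug would fire on the next line.
        assert self.items
        return self.items.pop(0)

def drive_to_load(q, target):
    """Instead of a random walk, drive the queue straight to `target`."""
    while len(q.items) < target:
        q.enqueue(0)
    while len(q.items) > target:
        q.dequeue()

q = BoundedQueue()
visited = set()
for _ in range(100):
    target = random.randint(0, BoundedQueue.CAPACITY)
    drive_to_load(q, target)
    visited.add(target)
# Uniform targets hit the empty and full boundaries far more often
# than a random walk of enqueues/dequeues would.
```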


This does indeed work. The problem is that there is a very large number of possible goals that a random tester might aim for, and we do not know ahead of time which ones are most important. Also, for complex testers like Csmith, it is not clear how to reach some goals. This kind of thing complicates the code a lot too.


I'll state up front that I have no familiarity with Csmith or compilers.

You can characterize a random test as a random walk through a many-dimensional space, right? And the problem is that a random walk will naturally spend more time close to the starting point? What if, when choosing the next step to take, you bias the options towards moving away from the initial state? Is it even possible to talk about 'away' and 'towards' in the state space?

Going back to the queue example, this would mean that if the starting load of the queue is 5, whenever the load is above 5, bias the distribution of enqueues/dequeues towards enqueueing.
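This "drift away from start" policy can be sketched directly (the bias value of 0.75 and all sizes are made-up parameters): once above the starting load, prefer enqueues; once below, prefer dequeues; compare how far each policy gets from the start.

```python
import random

random.seed(4)

def biased_walk(steps, start=5, bias=0.75):
    """Random walk over queue load that prefers to move away from its
    starting point. Returns the maximum distance from start reached."""
    load, far = start, 0
    for _ in range(steps):
        if load > start:
            step = 1 if random.random() < bias else -1   # keep climbing
        elif load < start:
            step = -1 if random.random() < bias else 1   # keep falling
        else:
            step = random.choice((-1, 1))
        load = max(0, load + step)  # load can't go negative
        far = max(far, abs(load - start))
    return far

def fair_walk(steps, start=5):
    # bias=0.5 makes the walk unbiased.
    return biased_walk(steps, start, bias=0.5)

TRIALS = 200
biased_avg = sum(biased_walk(200) for _ in range(TRIALS)) / TRIALS
fair_avg = sum(fair_walk(200) for _ in range(TRIALS)) / TRIALS
```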


Csmith is much less like a random walk; most program generation steps result in forward progress. The problem is that at the end of many forward steps, we may have a test case that is, in some sense, very similar to all of the other programs generated by Csmith. So when we compile this test case, nothing very interesting happens.

If we decided up front that, for example, the next program to generate would have 95% stores through pointers and only 5% scalar operations, then this would probably be pretty useful. But nothing in Csmith currently supports this kind of high-level goal-seeking behavior.
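The up-front decision could be as simple as a weighted choice over construct kinds (a sketch with made-up categories; this is not how Csmith actually works):

```python
import random

random.seed(5)

# Hypothetical per-test-case "profile": 95% stores through pointers,
# 5% scalar operations.
PROFILE = {"pointer_store": 0.95, "scalar_op": 0.05}

def next_construct(profile=PROFILE, rng=random):
    """Pick the next program construct according to the profile's weights."""
    kinds = list(profile)
    weights = [profile[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]

program = [next_construct() for _ in range(1000)]
stores = program.count("pointer_store")
```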

So overall there are engineering issues, but also some hard open questions about what it is that a random tester is trying to accomplish. By necessity these programs start with only very weak hypotheses about what the bugs they are trying to find look like.


I can't understand the relation to the CLT. When using an RNG you basically sample from the same distribution (thus have only one random variable), and the CLT comes into play when you add (sum, average) different random variables. In the case of the queue, this has nothing to do with the number of cases missed.


Consider a coin labeled +1 on one side and -1 on the other side. A single coin toss is Bernoulli distributed with mean 0.

Toss the coin N times and let S be the sum of the results. The individual tosses are binary distributed, but for larger N, S is approximately Gaussian distributed thanks to the central limit theorem.
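Concretely (a sketch, with arbitrary sizes): the sum of N such tosses has standard deviation sqrt(N), and roughly 95% of sums land within 2·sqrt(N) of zero, as the Gaussian approximation predicts.

```python
import math
import random

random.seed(6)

N = 2_500      # tosses per experiment; sqrt(N) = 50
TRIALS = 1000

def toss_sum(n):
    """Sum of n independent +/-1 coin tosses."""
    return sum(random.choice((-1, 1)) for _ in range(n))

sums = [toss_sum(N) for _ in range(TRIALS)]
# Gaussian prediction: about 95% of sums fall within 2 standard
# deviations (2 * sqrt(N)) of the mean 0.
within_2sigma = sum(abs(s) <= 2 * math.sqrt(N) for s in sums) / TRIALS
```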


Yes, but you're talking about the sum of the results, so the CLT is relevant only to programs that sum their inputs and nothing else.

