I think this is a pretty correct approximation. An even better way is to assume vectors are uniformly distributed on the sphere. The dot product has expectation 0 and the variance is 1/300 (cf http://stats.stackexchange.com/questions/85916/distribution-...).
Chebyshev's inequality shows that the probability that the dot product be greater than 0.3 in absolute value is less than 4%.
If you want to generate vectors uniformly on a n-dimensional unit sphere, generate them as i.i.d vectors of n standard normal variates and renormalize.
x = np.random.randn(300)
x /= np.sqrt(np.sum(x*x))
To see why consider that the joint pdf depends solely on the vector's distance from the origin.
It's uniform if you reject points with length > 1 before normalizing (your link mentions this), but not uniform as written. It's a simple adjustment, but you do have to cull.
Edit: yeah oops, I goofed and I'm wrong. I didn't notice it was randn - normally distributed. No culling necessary.
Is it bad for a unit sphere specifically, or just a generally bad idea? I was just thinking the volume ratio of the hyper sphere to the hyper cube is always over 0.5 in any number of dimensions? But I'm not sure about that.
Or do you mean more that it sucks pretty bad to generate 300 rands and then throw them away, even once?
Rejection sampling is not necessarily a bad idea in general, but in this particular case it would be bad.
You are going to choose 300 random numbers between -1 and 1, and then throw away this batch if their norm is greater than 1. As you increase dimension (in this case 300), you are including more numbers that give a non-negative contribution to this norm, and so you are increasing your chance of rejection with each dimension.
Re-reading your comment, I notice you made a higher-level note regarding the volume of the hypercube vs. hypersphere. While my comment was right, I now think you are right that the volume ratio of hyper sphere to hypercube is, in fact, always over 0.5. I'm gonna run a quick simulation anyway, but you can convince yourself of this by considering that the line segment connecting (1,0) and (0,1) in 2D lies within the unit circle, but a similar construction will always work for high dimensions (I think).
Jeez, my intuition stinks. I initially thought it was obvious why the 3D sphere to unit cube ratio is > 0.5, by analogy to the 2d diamond embedded in a square, but in 3D the embedded octagon volume isn't 0.5. Obvious now, but that'll teach me not to guess.
Indeed, rejection sampling the 300 dimensional unit sphere will suck really bad.
Say we forget about dimensions and hyperspheres for a moment and just talk about quizzes. Specifically, a quiz of 300 true-false questions.
We have two students fill out the quiz at random, by flipping a coin. What's the probability that they answer all the questions the same way? 2^300 - vanishingly small. Similarly for answering all the questions in exactly the opposite way.
If we only have two or three dimensions, that's like a quiz with only two or three questions. If the possible answers to the questions are constrained, it's fairly likely that some students would get similar results just by chance. But the number of combinations goes up exponentially as we add more questions.
So, I'm not understanding something. If you take a vector formed only of ±1 coordinates, it's not going to be within the n-dimensional ball, nor on the n-1-dimensional sphere, it's going to be at the corner of the unit hypercube, right?
Secondly, I don't understand by which measure the distribution of these vectors would be uniformly distributed or random. Normalizing the vectors to get the coordinates on the unit hypersphere is only ever going to land on 2^n different coordinates. Clearly in low dimension this wouldn't be a good way to select random points on the sphere…
Does his approach make sense only because 2^300 is so much larger than 3 million?
> If you take a vector formed only of ±1 coordinates, it's not going to be within the n-dimensional ball, nor on the n-1-dimensional sphere
The points are going to be on a hypersphere, just not the unit one. The radius will be about 17, right?
> Does his approach make sense only because 2^300 is so much larger than 3 million?
It has to be larger, I don't think it needs to be crushingly larger. As long as you're still sparse, all those extra digits on your random numbers don't matter. You can round it all the way to a single bit. Each dimension is only a tiny factor on your position on the sphere, so you get perfectly good coverage for something like dot products.
300 dimensions seems like way too many to have any kind of spatial intuition about. For me, it's easier to think of a n-tuple where n is 300. I'm not sure really what the cardinality of each dimension in this problem is, but I'll assume its a) the same for every dimension and b) something like a subset of the natural numbers (like, one int per 100k english words or something).
The sphere part is a little confusing at first, but it's really just a fancy (and yes, elegant) way of specifying a constraint on the n-tuple. Given some radius R, it defines the constraint sum(n_i^2) = R^2 for i=1..300.
It might be useful to normalize R, and it is convenient to select R=300 so that the average contribution of each n_i^2 term is 1. At this point, you just have to pick 3M n-tuples. I would probably use a random number generator to a) fill in the n-tuple, b) adjust a constant factor to fit with real world R, and c) repeat 3M times.
This general area is called "concentration of measure" in case you want to Google around and find other examples. A simple but informative calculation is to compute the volume of a unit n-sphere and the volume contained between radius .9-1, and then see how the ratio of these scales with dimension. You'll see that that almost all of the volume is contained close to the boundary (exponential in dimension.)
Chebyshev's inequality shows that the probability that the dot product be greater than 0.3 in absolute value is less than 4%.