Quote:
Originally Posted by MechaMan
So nobody is exploring the possibility of certain stores or neighborhoods getting more winning tickets than others?
They make it difficult to analyze the winning location trends. There is a reason for that.

That's a much more complicated analysis than you might think. Assuming you have data on where the big winners were purchased, it can be done, but it likely would not show anything definite. Just to give you an example, suppose you theorized that a roulette wheel at a casino comes up with the number 36 more often than the other numbers. (If we ignore 0 and 00 on the wheel, this is mathematically analogous to numbering your neighborhoods 1 to 36 and theorizing that neighborhood 36 gets more big winners than the other neighborhoods.) How would you go about proving it? Well, you could record the results from that wheel. There are problems here, though. Suppose you watch 10 spins and see 36 come up twice. Is that enough to prove your theory, or could that have just happened by random chance? Anyone who's ever gone to a casino and played roulette would probably tell you that seeing the same number come up twice in 10 spins is not really all that uncommon. Mathematically, for any given number, there's only about a 3% chance that you'd see two occurrences of that number in 10 spins.
That seems pretty low, but consider that you haven't specified in advance which neighborhood you think is the one with more winners. Therefore, the relevant probability is the probability of ANY number occurring at least twice in 10 spins, which is considerably higher, about 75%. Certainly, this provides no proof of anything.
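If you want to check those two figures yourself, here is a minimal sketch in Python. It assumes a 36-slot wheel (0 and 00 ignored, as above) and independent spins; the function names are just mine for illustration. The first function is the ordinary binomial probability for one number picked in advance, which comes out a little under 3%. The second is the "birthday problem" calculation for any number at all repeating.

```python
from math import comb, prod

SLOTS = 36   # numbers 1 to 36, ignoring 0 and 00 as in the example
SPINS = 10

def p_exactly(k, n=SPINS, slots=SLOTS):
    """Chance that one pre-chosen number shows up exactly k times in n spins."""
    p = 1 / slots
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def p_any_repeat(n=SPINS, slots=SLOTS):
    """Chance that at least one number (any of them) shows up two or more times."""
    # Probability that all n spins land on different numbers.
    p_all_different = prod((slots - i) / slots for i in range(n))
    return 1 - p_all_different

print(f"pre-chosen number exactly twice: {p_exactly(2):.3f}")
print(f"some number comes up twice:      {p_any_repeat():.3f}")
```

The gap between the two results is the whole point: picking the "suspicious" number after looking at the data makes an unremarkable result look rare.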
The next step, of course, is to continue to watch the results and see if the frequency of 36 comes out higher than expected (1/36 is the expected frequency). How much higher does it have to be to become meaningful? Again, that's a tough question, and there is always the possibility that, no matter how unlikely it would be, the wheel actually is unbiased and the result you see is simply the result of random chance. Statistics can provide some guidelines, but they will always be probabilistic guidelines. That is, we can say there's a 0.01% chance (or whatever the number turns out to be) of these results having come randomly from an unbiased roulette wheel. We can never say "this roulette wheel is definitely biased."
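To make "how much higher does it have to be" a little more concrete, here is a rough sketch of the kind of calculation involved: an exact one-sided binomial tail probability. The 360 spins and 20 hits below are numbers I made up purely for illustration, and the function name is hypothetical; it also assumes you picked 36 as your suspect before collecting any data.

```python
from math import comb

def p_at_least(k, n, p=1 / 36):
    """Chance of seeing a given number k or more times in n fair spins."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical observation: 36 came up 20 times in 360 spins (10 expected).
n_spins, hits = 360, 20
tail = p_at_least(hits, n_spins)
print(f"expected hits: {n_spins / 36:.0f}, observed: {hits}")
print(f"chance of {hits} or more on a fair wheel: {tail:.4f}")
```

A small result here says the data would be unusual for a fair wheel. It still does not say the wheel is biased, only that chance alone would rarely produce that many hits.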
There is a further complication that comes with the actual problem, rather than with the roulette wheel model. The roulette wheel model is simplified because the expected probability of each possible outcome is the same. That is, we have no reason to believe that 1, 2, 3, 36 or any other number should occur more often than any other value. In fact, real roulette wheels are intentionally designed with this in mind; a biased roulette wheel is a potential disaster for a casino. (There actually was a real-life case in a European casino where some mathematicians detected a bias and made a good bit of money from it.) In the "which neighborhoods have more winners" problem, by contrast, we expect the frequencies of winners to be different even if there is no bias in the distribution of tickets.
For example, let's divide NYC into five neighborhoods, corresponding to the five boroughs, and assume for simplicity that all NYC residents are equally likely to buy tickets. Suppose we observe 50 lotteries and find that there were 8 winners in the Bronx, 13 winners in Queens, 11 winners in Manhattan, 14 winners in Brooklyn, and 4 winners in Staten Island. Does this prove that the lottery is biased against Staten Island? Of course not! We would expect fewer winners in Staten Island simply because we expect fewer tickets to be sold there. In fact, based on my assumptions, SI is overrepresented; we should expect only about 3 winners there. Again, statistics can help us answer questions about how different from expected the distribution of winners would have to be before we would think something might be amiss (google the chi-squared distribution for more info), but again, the answer is only probabilistic; statistics can never say that something nonrandom is definitely happening, only that the probability is low of seeing these data if the distribution really is random.
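Here is a minimal sketch of that chi-squared check on the borough example. The population figures are my own rough, rounded numbers (in millions) and are only an assumption for illustration, as is the idea that ticket sales track population. The p-value uses a closed-form tail formula that only works for an even number of degrees of freedom (here 5 boroughs minus 1 = 4). With an expected count as small as Staten Island's, the chi-squared approximation is crude, so treat the result as a ballpark.

```python
from math import exp, factorial

# Observed winners from the example above (50 lotteries).
observed = {"Bronx": 8, "Queens": 13, "Manhattan": 11,
            "Brooklyn": 14, "Staten Island": 4}

# ASSUMPTION: rough borough populations in millions, for illustration only.
population = {"Bronx": 1.4, "Queens": 2.3, "Manhattan": 1.6,
              "Brooklyn": 2.6, "Staten Island": 0.5}

def expected_counts(observed, weights):
    """Split the total number of winners in proportion to the weights."""
    total_obs = sum(observed.values())
    total_w = sum(weights.values())
    return {k: total_obs * weights[k] / total_w for k in observed}

def chi2_statistic(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

def chi2_tail_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this shortcut needs a positive, even df")
    half = x / 2
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))

expected = expected_counts(observed, population)
stat = chi2_statistic(observed, expected)
p_value = chi2_tail_even_df(stat, df=len(observed) - 1)

for borough in observed:
    print(f"{borough:14s} observed {observed[borough]:2d}  expected {expected[borough]:5.2f}")
print(f"chi-squared = {stat:.2f}, p = {p_value:.2f}")
```

Under these assumed populations the p-value comes out large, meaning a spread like this is entirely ordinary for a fair lottery.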