Show full content
Introduction
T. Kelly writes in [1] about “What Every Experimenter Must Know About Randomization.” In a running example scenario throughout the paper, Kelly describes a hypothetical poultry farmer who, wondering whether exposing his turkeys to television will fatten them up, conducts a test using turkey subjects, from which
are randomly selected to receive the TV treatment, with the remaining
control turkeys getting no screen time.
How should the farmer randomize the assignment of turkeys to the treatment and control groups? The main point of the paper– very subtly emphasized in underlined, bold italics– is to “never use a PRNG [pseudo-random number generator] for random assignment.“
I don’t really agree with this guidance at all. I think the arguments presented in its favor are flawed, or at least potentially misleading, on practical, mathematical, and even philosophical grounds. But the paper does present at least one very interesting and subtle observation that I think deserves highlighting and clarification.
Balls into bins
To make the case against using a pseudo-random number generator, Kelly provides a C program, balls_into_bins.c, that simulates the random partition of turkeys into two groups of 16, with one group receiving the treatment and the other being the control group. If we start with all 32 turkeys arranged in a fixed initial order, then uniformly randomly shuffle them, we can designate the leftmost 16 turkeys in the treatment group, and the rightmost 16 turkeys in the control group.
(We don’t actually have to shuffle all 32 turkeys; once we have selected the treatment group– in a uniformly random order– we don’t need to continue rearranging the control turkeys. That is, Kelly effectively implements the first 16 iterations of the mirror_shuffle version of the Fisher-Yates shuffle.)
Kelly’s program then iterates over all possible seeds of the C standard library random number generator via
srand(seed), executing this simulated random assignment once for each seed. The idea is to show the distribution of possible assignments, assuming that our farmer conducts the experiment by selecting a “true random” value used as the seed for the PRNG to then (pseudo)randomly shuffle the turkeys to select an assignment.
For a 32-bit seed and turkeys, these numbers are small enough that we can manage both the time to simulate each seed (i.e., each “ball”), as well as the memory to track all
possible treatment assignments (i.e., “bins”), recording how many times each assignment is selected (i.e., how many balls end up in each bin). Kelly’s results are shown in red in the figure below, indicating the number of bins containing a given number of balls.

For example, the red point at (28,1) indicates that there was a single assignment that was generated 28 times by each of 28 different random seeds. The red point at (0,474467), on the other hand, indicates that 474,467 assignments were never generated, never appearing even once among all possible seeded treatment selections. As Kelly writes:
The resulting distribution of balls into bins will be far from uniform: Some bins end up with more balls than others, and some bins end up empty— even, perhaps surprisingly, when the number of balls is considerably larger than the number of bins.
That is, there are times as many balls as bins, so that the presumably ideal behavior would instead be to distribute the balls as evenly as possible, so that every one of the possible assignments would be selected 7 or 8 times.
Comparing generators
There is a useful point here, which we’ll get to shortly. But first, it wasn’t clear to me whether Kelly is arguing that the non-uniformity illustrated above is because we are using a PRNG. That is certainly not the case… but at the same time, some PRNGs are better than others, and at first glance it seemed like a rather disingenuous straw man to use the one PRNG that most everyone agrees shouldn’t be used “for serious random number generation needs,” namely the one in the C standard library.
For example, how do things look if we try this experiment on Windows, using the MSVCRT/UCRT implementation of rand()? This took some work, since Kelly’s code does not compile due to lines like this:
static_assert(RAND_MAX == INT_MAX); // would be very weird otherwise
This assertion is, well, weird. Not only is there no language standard guarantee that these constants be equal– and on Windows, they aren’t– neither is there anything in the code that depends on this assertion, so it’s unclear why it’s there in the first place.
The program also uses the stdc_count_ones function defined in the stdbit.h header, which is new enough that it may not be available depending on how up to date your environment is, on either Linux or Windows. But here as well, we don’t really need this function, or even an equivalent replacement, to get the job done.
So to get things working, I wrote balls_into_bins.cpp (code is on GitHub, where I’ve included Kelly’s original balls_into_bins.c for side-by-side comparison), with the same behavior, but which (1) works on both Linux and Windows, and (2) allows us to select which of several random number generation approaches we want to evaluate:
- The Linux GLIBC pseudo-random generator that reproduces Kelly’s original results exactly.
- The Windows MSVCRT/UCRT pseudo-random generator.
- The “true” random RDSEED hardware-based generator using the
_rdseed64_stepfunction.
The hardware random number generator is understandably much slower; on my laptop it produces only a little over half a million 64-bit words per second. But eventually we see the output shown in blue above, with “non-uniform” behavior very similar to the GLIBC PRNG.
In other words, we should expect to see some treatment assignments selected more often than others. And we should expect to see some assignments not selected at all; with seeds (“balls”) and
possible assignments (“bins”), the expected number of empty bins is
which matches pretty closely with the 474,467 empty bins using the GLIBC generator, and the 473,435 empty bins from my particular output using RDSEED.
However, there is clearly something different going on with the Windows UCRT implementation, shown in green above. One interesting “feature” of the UCRT behavior is that we only see even numbers. That is, each assignment (“bin”) is always randomly generated an even number of times– possibly zero times, or twice, or four times, etc., but never exactly once, nor exactly thrice, etc. (Obligatory puzzle: why?)
Repeatability
So what, exactly, is the point being made here? Certainly a poor PRNG like the Windows UCRT generator can cause unintended bias. But the GLIBC PRNG is better behaved, at least from the perspective of the balls-into-bins experiment described here. For example, for both the GLIBC output as well as the “true” random RDSEED output, a chi-squared test would fail to reject the null hypothesis that the selected assignments constitute a truly (uniformly) random sample.
However, there is a key difference between the two generators, namely their repeatability. That is, although it’s true that the RDSEED hardware random generator “misses” nearly half a million possible turkey treatment assignments in the above output, if we were to run the program again with the RDSEED generator, we would once again fail to select roughly half a million assignments… but it would be a different subset of missed assignments. But with the GLIBC PRNG, it is the same persistent subset of exactly 474,467 specific assignments that will never be selected, no matter what seed we use, no matter how many times we run the program. For any one of those persistently-missed assignments, the probability that the farmer selects it for his experiment should ideally be , but instead it’s zero.
“True” vs. pseudo-random numbers
I think it’s an interesting question when, and just how much, this matters in any practical sense for many real experimental scenarios. There are certainly scenarios where it can matter— but in this running example of the farmer experimenting on his turkeys, it’s less clear to me what actual problem might arise from what effectively amounts to bootstrapping from a single prior uniformly sampled multiset from the population of treatment assignments.
But if we did want to “fix” our farmer’s methodology, there are at least a couple of simple ways to do so. For example, Kelly’s setup assumes that the farmer starts by generating 32 “truly” random bits with which to seed a pseudo-random number generator. Maybe those seed bits are obtained by repeatedly flipping a fair coin; but if we’re willing to assume access to 32 “truly random” bits to seed the PRNG, then why subsequently bother with the PRNG at all? As Kelly points out, we only need bits of entropy, so we need on average at most 2 more, or at most 32 coin flips (on average), to select a random integer and unrank it to determine the corresponding assignment, rather than wasting a lot more coin flips Fisher-Yates shuffling turkeys around.
The above alternative abandons pseudo-random numbers altogether, and only uses a “true” random source of numbers to generate an assignment. I’ve been putting “true” in quotes here, since Kelly never defines or explains exactly what is meant by “true” randomness. It doesn’t seem to require anything quantum mechanical or otherwise theoretical, for example; indeed, the requirements seem pretty lax:
Random assignment does not require a computer. Physical coins or urns are reasonable options sanctioned by many statisticians today, for the same reasons that casinos and lotteries continue to use dice, roulette wheels, and other mechanical contraptions.
Kelly is even okay with “re-using” randomness for an experiment, as long as it’s “unrelated” to that past use:
Regarding “canned” entropy (e.g., from an old-fashioned table of random numbers), I’m aware of no reason why such entropy would lose its usefulness merely from being recorded and stored for a period of time. Similarly, regarding “recycled” entropy that has already been used for purposes unrelated to the one at hand, I’m aware of no reason why such entropy would be any less useful than if it had not been used before for any purpose.
This is what I find confusing and contradictory. A not-completely-terrible intuitive description of a PRNG is “an old-fashioned table of random numbers”… but one that is so large that we never need to re-use any of it. That is, for example, suppose that when we complete our undergraduate statistics education, we seed a pseudo-random number generator, one time, and use that single stream of random numbers for the duration of our lifetime, or at least for our entire career as an experimenter. When we need some random numbers, we extract them, then store the subsequent generator state on disk. When we need some more random numbers, we restore the generator state from disk, extract some more random numbers, rinse and repeat.
If our experiments are “slow” like testing turkeys, and not “fast” like computer simulations, then such a single pseudo-random stream is plenty long enough to support more experiments on more test subjects than our farmer can possibly conduct in their lifetime. And if we are worried, as Kelly writes, that “the outcome is a foregone conclusion, completely devoid of uncertainty, no matter how random the sequence might appear,” then why is an old-fashioned published book of random numbers acceptable?
Reference:
- Kelly, T., What Every Experimenter Must Know About Randomization, ACM Queue Magazine, 23(6), November/December 2025. [PDF]









