How to pick a random sample from a list
A problem that crops up fairly frequently is that we want to pick
n random elements from a given list or array. For example,
a quiz program might have 1000 possible questions and need to pick 20 of them.
We would generally like each element to have an equal chance of being picked.
Probably the most common method is what I call the "naive" approach:
continuously pick a random element from the whole list. Each time, if the element was already
picked on a previous run, then we just repeat until we find one that hasn't yet been picked. It
should be obvious that as the number of elements to pick relative to the number available increases,
this method becomes dreadfully inefficient. If you insist on using this method,
then at least store the currently selected sample in a HashSet or other structure giving
efficient random access. About the only saving grace of this method is that it is easy to understand.
The semi-naive approach: shuffle and slice
In Java, there's usually not much reason to use the naive approach, because at least one
more efficient approach can be coded in a couple of lines using calls from the collections library.
The technique, which I call "shuffle and slice", is simply the following, assuming the elements to choose from are in an array:
- shuffle the entire array using a correct algorithm such as Arrays.shuffle();
- take the first n elements of the shuffled array.
This technique is efficient in the sense that the time taken increases linearly as
either n or the number of elements to choose from increases. And it has the significant
advantage of being based on library methods— in other words, you pretty much
"can't go wrong". In just a few cases, a minor inconvenience of the method is that it disturbs the order of
the data that you are sampling from.
It turns out that there is another method available which is a little quicker on average,
and which preserves the order of the elements being sampled from.
(Slightly) more efficient sampling
Shuffling the entire array as in the previous method essentially scales linearly according to
the number of elements to choose from. This is because to shuffle the entire array, we generate
a random number at every position in the array.
Another technique, which also scales linearly with the number of elements being chosen from,
but on average requires fewer calculations per element, is as follows:
- for the first position in the array, generate a random number that gives that element
an n/N chance of being picked (where n is the number of elements to
pick and N is the total array size);
- repeat for the rest of the array, each time altering the probability of the element
being picked to reflect the number of elements picked so far and the number left.
At each position, we want the probability of selecting the given element to be
the number of elements still needed over the number of elements still left. For example, let's
say we want to pick 3 random elements from a total of 15. At the first index, the chance
of picking that element should be 3/15. If that element is indeed picked, then at the next
index, we need to pick 2 out of a remaining 14 elemnts, so the chance of that second element
being picked is 2/14. If the first
element wasn't picked, then we still need to pick 3 out of the remaining 14 elements, so
the chance of the second one being picked is 3/14, etc.
An example implementation in Java goes something like this:
public static <T> T[] pickSample(T[] population, int nSamplesNeeded, Random r) {
T[] ret = (T[]) Array.newInstance(population.getClass().getComponentType(),
nSamplesNeeded);
int nPicked = 0, i = 0, nLeft = population.length;
while (nSamplesNeeded > 0) {
int rand = r.nextInt(nLeft);
if (rand < nSamplesNeeded) {
ret[nPicked++] = population[i];
nSamplesNeeded--;
}
nLeft--;
i++;
}
return ret;
}
On average, this method is a little quicker than shuffling the entire array, since there
is a chance that we'll pick the number of required elements before we've scanned right to the end of the array.
Thus, fewer random numbers generally need to be generated. On average, the proportion of the
elements that we'll need to scan in order to pick n of them is roughly n/(n+1).
In other words, to pick 2 random elements, we need to scan roughly 2/3 of the data;
to pick 3, we need to scan roughly 3/4; to pick 4, roughly 4/5 etc. So the method is likely
to give a tangible advantage only when we need to pick a very small sample from a large number of elements.
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.