Reservoir Sampling (online)

Given a array A read in as a data stream (one by one) and a sample size k , return a sample of size k in the data stream.

Example

Input: A = [3, 5, 11, 7], k = 2
Output: When read in A = [3, 5], samples are [3, 5]. Next when read in 11, samples could be [3, 5] or [3, 11] or [5, 11].

Solution

Idea:

  • Assume we already choose k samples from n data, now we read in the n+1 data. The new data has probability kn+1\frac{k}{n+1} to be sampled. And the existing samples have the same probability to be removed.

Time complexity: O(n)O(n)

Space complexity: O(k)O(k)

# Assumption: there are at least k elements in the stream
def online_random_sample (A, k) :
    # set up default values
    sample = A[:k]

    for i in range(k, len(A)) :
        temp = random.randrange(k+1)
        if temp < k:
            sample[temp] = A[i]

    return sample

Last updated