While reading papers recently, I found that we often need to estimate the size of a subset. The subset comes from an extremely large set, which means the subset itself may also be extremely large. The naive way to obtain its size is to go through all of its elements; however, because of the extreme size, this is impractical. So, what we need is a way to estimate the size of the subset quickly. One typical scenario is the ranking problem: for each instance, we need to know how many instances rank before it, and this operation is applied to every instance. Clearly, it is infeasible to go through the whole training set each time.

Sampling

A reasonable approach is to estimate the size of the subset by sampling.

Let's first set up some background. We assume the whole set is $Y$, with its size denoted as $\vert Y\vert$. The set $S \subset Y$ is the subset, and $\vert S\vert$ is what we need to estimate.

Geometric Distribution

Weston et al. [1] mention a sampling method based on the geometric distribution. The sampling process can be described as follows:

Keep randomly choosing (with replacement) an element $y$ from $Y$ until $y \in S$. Record the number of draws, denoted as $N$. Then, the size of $S$ can be estimated as $\vert S \vert = \frac{\vert Y \vert}{E[N]} \approx \frac{\vert Y\vert}{N}$.
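As a concrete illustration, here is a minimal Python sketch of this process. The function names and the use of a Python set for membership tests are my own choices for illustration, not anything from the paper:

```python
import random

def draws_until_hit(universe, subset):
    """Draw (with replacement) from the universe Y until an element
    lands in the subset S; return the number of draws N."""
    n = 0
    while True:
        n += 1
        if random.choice(universe) in subset:
            return n

def estimate_size_geometric(universe, subset):
    """Single-round geometric estimate: |S| ~= |Y| / N."""
    return len(universe) / draws_until_hit(universe, subset)

# Illustrative example: |Y| = 1,000,000 and |S| = 1,000.
Y = range(1_000_000)
S = set(range(1_000))
print(estimate_size_geometric(Y, S))  # noisy estimate around 1000
```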

Proof:

The proof is simple. We use $p$ to denote the probability that an element $y$ belongs to the set $S$:

$$p = Pr[y\in S] = \frac{\vert S \vert}{\vert Y \vert}$$

Based on the sampling process, we can get:

$$Pr[N=i] = (1-p)^{i-1}p$$

When $N=i$, we sampled $i$ times, which means the first $i-1$ draws all failed to pick an element belonging to $S$. This is exactly a geometric distribution, so its expectation is:

$$E[N] = \frac{1}{p} = \frac{\vert Y\vert}{\vert S \vert}$$

Rearranging the equation above:

$$\vert S \vert = \frac{\vert Y \vert}{E[N]}$$

Also, we know that as the number of sampling rounds goes to infinity, the empirical average converges to the expectation:

$$E[N] = \lim\limits_{m\rightarrow\infty}\frac{1}{m} \sum\limits_{i=1}^m N_i$$

where $N_i$ represents the number of draws in the $i^{th}$ round of sampling. In order to estimate quickly, we can set $m=1$, and we finally get:

$$\vert S\vert \approx \frac{\vert Y\vert}{N}$$
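Since a single round is noisy, the draw counts $N_i$ can also be averaged over $m$ rounds before inverting, exactly as in the limit above. A minimal sketch, reusing the hypothetical `draws_until_hit` helper from the previous snippet:

```python
def estimate_size_geometric_avg(universe, subset, m=100):
    """Multi-round estimate: |S| ~= |Y| / ((1/m) * sum(N_i))."""
    total = sum(draws_until_hit(universe, subset) for _ in range(m))
    return len(universe) * m / total
```

Note that this averages the counts $N_i$ first and then inverts; averaging the per-round estimates $\vert Y\vert / N_i$ instead would give a noticeably biased result.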

Bernoulli Distribution

Apart from the previous sampling method, there are other options. One choice is to use Bernoulli trials. The process is as follows:

Keep randomly choosing (with replacement) an element $y$ from $Y$ for $N$ times, then count how many of these $N$ elements belong to $S$; denote this count as $M$. Clearly, $M \le N$. Then, the size of $S$ can be estimated as $\vert S\vert = \frac{E[M]\vert Y \vert}{N} \approx \frac{M\vert Y \vert}{N}$.
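A minimal Python sketch of this second estimator, under the same illustrative assumptions as before:

```python
import random

def estimate_size_bernoulli(universe, subset, n_draws):
    """Draw (with replacement) N elements from Y, count the hits M in S,
    and estimate |S| ~= M * |Y| / N."""
    m = sum(1 for _ in range(n_draws) if random.choice(universe) in subset)
    return m * len(universe) / n_draws

# Same illustrative Y and S as before, with N = 10,000 draws.
print(estimate_size_bernoulli(range(1_000_000), set(range(1_000)), 10_000))
```

Unlike the geometric method, the cost here is fixed at $N$ draws no matter how small $S$ is.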

Proof:

Similarly, we use $p$ to denote the probability that any element $y\in Y$ belongs to $S$:

$$p = Pr[y\in S] = \frac{\vert S \vert}{\vert Y\vert}$$

Based on this sampling method, we know:

$$Pr[M=i] = \binom{N}{i}(1-p)^{N-i}p^i$$

This means we independently choose (with replacement) $N$ elements from $Y$, among which $i$ elements belong to $S$. Each draw is a Bernoulli trial, so $M$ actually follows a binomial distribution, and its expectation is:

$$E[M] = pN = \frac{\vert S \vert}{\vert Y\vert}\cdot N$$

Rewrite this as:

$$\vert S\vert = \frac{E[M]}{N}\cdot \vert Y\vert$$

Based on the definition of expectation:

$$E[M] = \lim\limits_{m\rightarrow\infty}\frac{1}{m} \sum\limits_{i=1}^m M_i$$

Similarly, we can set $m=1$, and we get:

$$\vert S\vert \approx \frac{M\vert Y\vert}{N}$$

Experiments

To compare these two methods, I ran some experiments. The results are shown below.

Accuracy

I first compared the estimation accuracy of the two sampling methods. I generated four plots based on different ratios $\vert S \vert / \vert Y\vert$.

From these four plots, we can conclude:

When the ratio $\vert S \vert / \vert Y\vert$ is large ($0.5$), the Bernoulli method is more stable; when the ratio becomes small ($0.0005$), the geometric method becomes more stable.

Time

Another important aspect of sampling is time; after all, the whole point is to accelerate the computation. The following four plots compare the running time under the same settings.

The conclusion is:

When the ratio $\vert S \vert / \vert Y\vert$ is large ($0.5$), the geometric method is faster than the Bernoulli method; when the ratio becomes small ($0.0005$), the geometric method takes more and more time and the Bernoulli method gains the advantage. This matches the derivation above: the geometric method needs $E[N] = \vert Y\vert / \vert S\vert$ draws on average, which grows as the subset shrinks, while the Bernoulli method always performs a fixed number of draws.
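For completeness, a toy harness along these lines could reproduce both comparisons. This is a minimal sketch under my own assumptions: the intermediate ratios, set size, trial count, and draw count are illustrative guesses (the post only names $0.5$ and $0.0005$), and it reuses the estimators sketched earlier.

```python
import statistics
import time

def compare(ratio, size_y=1_000_000, trials=50, n_draws=1_000):
    """Print mean, spread, and wall-clock time of each estimator
    for a subset with |S| / |Y| = ratio."""
    Y = range(size_y)
    S = set(range(int(ratio * size_y)))
    for name, est in [
        ("geometric", lambda: estimate_size_geometric(Y, S)),
        ("bernoulli", lambda: estimate_size_bernoulli(Y, S, n_draws)),
    ]:
        start = time.perf_counter()
        estimates = [est() for _ in range(trials)]
        elapsed = time.perf_counter() - start
        print(f"ratio={ratio}: {name} mean={statistics.mean(estimates):.1f} "
              f"stdev={statistics.stdev(estimates):.1f} time={elapsed:.3f}s")

for ratio in (0.5, 0.05, 0.005, 0.0005):
    compare(ratio)
```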

References

  1. J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Catalonia, Spain, July 16-22, 2011, pp. 2764-2770. doi:10.5591/978-1-57735-516-8/IJCAI11-460