Isn't that necessary? Can I directly use numpy's vectorization operation to directly achieve time O(1) and space O(n)?
Alias Method: discrete sampling method with time complexity O(1)
Recently, I saw some things about graph embedding and found that node2vec uses Alias method when sampling node paths and sampling edges in line. Here is a brief summary.
Problem definition
The probability distribution law of a discrete random variable is givenIt is hoped that the sampling method can follow a probability distribution as much as possible
O(N) method
Imagine that random events are distributed on a line segment with a length of 1 according to their probability. Then take a random point in the line segment and observe the interval corresponding to which the point falls, and then take the event corresponding to the interval. The specific implementation is as follows:
- Generate 01 uniform distribution
- Find subscript k
- Return k
Obviously, first construct an array to store the cumulative event probability, and the time complexity is, the time complexity of each linear search is
O(logN) method
After constructing the cumulative event probability array by accumulation, we can find that the array meets the non decreasing order. For the search of ordered array, it is easy to think of using binary search for optimization. The time complexity after using binary search is
O(1) method
Now let's introduce the method of using space for time optimization Alias.
Alias method compresses the whole probability distribution into oneRectangle for each event
, converted to the area in the corresponding rectangle is
.
Through the above operations, generally, the area of some locations is greater than 1, and the area of some locations is less than 1. By adding the additional area of events with an area greater than 1 to the events with an area less than 1, we can ensure that the area of each small square is 1, and ensure that each square can store at most two events.
Maintain two arrays accept and alias. accept[i] in the accept array indicates the proportion of event I in the area of the rectangle in column I. alias[i] indicates the number of another event in column I that is not event I.
During sampling, a random number is generated at a time, regenerate into a random number
, if
, it means to accept the event i; otherwise, it means to reject the event
Return alias[i]
At the end of the code article corresponding to the algorithm, we can see that the time complexity of preprocessing alias table is still, and the time complexity of events generated by each sampling is
.
Visual display of sampling results

Python code
import numpy as np
def gen_prob_dist(N):
p = np.random.randint(0,100,N)
return p/np.sum(p)
def create_alias_table(area_ratio):
l = len(area_ratio)
accept, alias = [0] * l, [0] * l
small, large = [], []
for i, prob in enumerate(area_ratio):
if prob < 1.0:
small.append(i)
else:
large.append(i)
while small and large:
small_idx, large_idx = small.pop(), large.pop()
accept[small_idx] = area_ratio[small_idx]
alias[small_idx] = large_idx
area_ratio[large_idx] = area_ratio[large_idx] - (1 - area_ratio[small_idx])
if area_ratio[large_idx] < 1.0:
small.append(large_idx)
else:
large.append(large_idx)
while large:
large_idx = large.pop()
accept[large_idx] = 1
while small:
small_idx = small.pop()
accept[small_idx] = 1
return accept,alias
def alias_sample(accept, alias):
N = len(accept)
i = int(np.random.random()*N)
r = np.random.random()
if r < accept[i]:
return i
else:
return