Alias Method: discrete sampling method with time complexity O(1)

Recently, I saw some things about graph embedding and found that node2vec uses Alias method when sampling node paths and sampling edges in line. Here is a brief summary.

Problem definition

The probability mass function of a discrete random variable is given: P(X = i) = p_i for i = 1, ..., N. We want a sampling method that produces events following this probability distribution.

O(N) method

Imagine that random events are distributed on a line segment with a length of 1 according to their probability. Then take a random point in the line segment and observe the interval corresponding to which the point falls, and then take the event corresponding to the interval. The specific implementation is as follows:

  1. Generate a uniform random number r on [0, 1)
  2. Scan the cumulative probability array for the smallest index k such that r < p_1 + ... + p_k
  3. Return event k

Obviously, we first construct an array storing the cumulative event probabilities, which takes O(N) time; each subsequent linear search then also takes O(N) time.
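The steps above can be sketched as follows (the function name and the round-off guard at the end are my additions, not part of the original article):

```python
import numpy as np

def linear_sample(probs):
    """O(N) sampling: scan the cumulative-probability array linearly."""
    cum = np.cumsum(probs)       # cumulative probabilities, built in O(N)
    r = np.random.random()       # uniform point on [0, 1)
    for k, c in enumerate(cum):  # O(N) scan per sample
        if r < c:
            return k
    return len(probs) - 1        # guard against floating-point round-off
```

In practice the cumulative array would be built once and reused across samples; it is rebuilt here only to keep the sketch self-contained.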

O(logN) method

After constructing the cumulative probability array by accumulation, we can see that it is non-decreasing. For searching an ordered array, binary search is the natural optimization. With binary search, each sample takes O(log N) time (the O(N) preprocessing is unchanged).
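A minimal sketch of the O(log N) variant, using the standard-library bisect module for the binary search (the function name and the clamp on the result are my additions):

```python
import bisect
import random

def binary_search_sample(probs):
    """O(log N) sampling: binary-search the cumulative-probability array."""
    cum = []                     # non-decreasing cumulative probabilities
    total = 0.0
    for p in probs:
        total += p
        cum.append(total)
    r = random.random()          # uniform point on [0, 1)
    # index of the first interval whose cumulative bound exceeds r,
    # clamped in case floating-point error makes the total fall below 1
    return min(bisect.bisect_right(cum, r), len(probs) - 1)
```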

O(1) method

Now let's introduce the Alias method, which trades space for time.

The Alias method scales the whole probability distribution into a 1 × N rectangle: each event i with probability p_i is converted into a column whose area is N · p_i.

After this scaling, some columns generally have area greater than 1 and some less than 1. By moving the excess area of events whose area exceeds 1 into the columns of events whose area falls below 1, we can make every column's area exactly 1, while ensuring each column stores at most two events.

Maintain two arrays, accept and alias. accept[i] records the proportion of column i's area occupied by event i; alias[i] records the index of the other event (if any) stored in column i.

During sampling, first generate a uniform random integer i in [0, N), then generate a uniform random number r in [0, 1). If r < accept[i], accept and return event i; otherwise, reject event i and return alias[i].

From the code at the end of the article, we can see that preprocessing the alias table still takes O(N) time, while generating each sample takes O(1) time.

Visual display of sampling results

(Figure: sampling results with N = 30 events and k = 10000 samples.)

Python code

import numpy as np

def gen_prob_dist(N):
    # generate a random discrete probability distribution over N events
    p = np.random.randint(0, 100, N)
    return p / np.sum(p)

def create_alias_table(area_ratio):
    # area_ratio[i] = N * p_i: each probability scaled by N, so the
    # average column area is 1. Note: the list is modified in place.
    l = len(area_ratio)
    accept, alias = [0] * l, [0] * l
    small, large = [], []  # indices of columns with area < 1 / >= 1

    for i, prob in enumerate(area_ratio):
        if prob < 1.0:
            small.append(i)
        else:
            large.append(i)

    while small and large:
        small_idx, large_idx = small.pop(), large.pop()
        accept[small_idx] = area_ratio[small_idx]
        alias[small_idx] = large_idx
        area_ratio[large_idx] = area_ratio[large_idx] - (1 - area_ratio[small_idx])
        if area_ratio[large_idx] < 1.0:
            small.append(large_idx)
        else:
            large.append(large_idx)

    while large:
        large_idx = large.pop()
        accept[large_idx] = 1
    while small:
        small_idx = small.pop()
        accept[small_idx] = 1

    return accept, alias

def alias_sample(accept, alias):
    N = len(accept)
    i = int(np.random.random() * N)  # pick a column uniformly at random
    r = np.random.random()           # uniform point within the column
    if r < accept[i]:
        return i
    else:
        return alias[i]
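As a sanity check, we can build the table for a small distribution such as p = [1/2, 1/3, 1/6] and compare empirical sampling frequencies with the target probabilities. Remember that create_alias_table expects the probabilities already scaled by N. The sketch below repeats compact copies of the article's two functions so that it runs standalone; the variable names and the frequency check are my additions:

```python
import numpy as np

def create_alias_table(area_ratio):
    # area_ratio[i] = N * p_i; the list is modified in place
    l = len(area_ratio)
    accept, alias = [0] * l, [0] * l
    small, large = [], []
    for i, prob in enumerate(area_ratio):
        (small if prob < 1.0 else large).append(i)
    while small and large:
        s, g = small.pop(), large.pop()
        accept[s], alias[s] = area_ratio[s], g
        area_ratio[g] -= 1 - area_ratio[s]
        (small if area_ratio[g] < 1.0 else large).append(g)
    for g in large:
        accept[g] = 1
    for s in small:
        accept[s] = 1
    return accept, alias

def alias_sample(accept, alias):
    N = len(accept)
    i = int(np.random.random() * N)
    r = np.random.random()
    return i if r < accept[i] else alias[i]

p = [1/2, 1/3, 1/6]
area_ratio = [x * len(p) for x in p]  # scale probabilities by N
accept, alias = create_alias_table(area_ratio)

# draw many samples and compare empirical frequencies with p
counts = np.zeros(len(p))
for _ in range(100_000):
    counts[alias_sample(accept, alias)] += 1
print(counts / counts.sum())  # ≈ [0.5, 0.333, 0.167]
```

For p = [1/2, 1/3, 1/6] the construction yields accept ≈ [1, 0.5, 0.5] and alias = [0, 0, 1]: column 0 is fully occupied by event 0, while columns 1 and 2 each split their area between their own event and an alias.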

Posted by netzverwalter on Thu, 12 May 2022 16:32:33 +0300