Bayesian experiment report

1, Experimental purpose

Learn how to infer the behavior model and solve the problem by using the short message mca-3.

2, Experiment contents and steps (including problems, codes, results and conclusions)

2.1 SMS data inference behavior

Title:

code:

Data reading and display:

import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12.5, 3.5))
count_data = np.loadtxt("./txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color='#348ABD')
plt.xlabel("Time (days)")
plt.ylabel("Text messages received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data)

Build model:

with pm.Model() as model:
    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)
    
    with model:
        idx = np.arange(n_count_data)
        lambda_ = pm.math.switch(tau >= idx, lambda_1, lambda_2)
        #print(tau>=idx)
    with model:
        observation = pm.Poisson("obs", lambda_, observed=count_data)
    with model:
        step = pm.Metropolis()
        trace = pm.sample(10000, tune=5000, step=step)
    
        lambda_1_samples = trace['lambda_1']
        lambda_2_samples = trace['lambda_2']
        tau_samples = trace['tau']
        print(lambda_1_samples)
        print(lambda_2_samples)

Result display:

Draw a posteriori histogram of three parameters:

plt.figure(figsize=(12.5, 10))

ax = plt.subplot(311)

ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
        label='posterior of $\lambda_1$', color='#A60628', density=True)

plt.legend(loc='upper left')
plt.title(r"""Posterior distributions of the varibles $\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel("$\lambda_1$ value")

ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
        label='posterior of $\lambda_2$', color="#7A68A6", density=True)
plt.legend(loc='upper left')
plt.xlim([15,30])
plt.xlabel("$\lambda_2$ value")

plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1, label=r"posterior of $\tau$",
        color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))

plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability")
plt.show()

Conclusion:

one λ 1 is about 18, λ 2 is about 23, these two λ There are significant differences in the posterior distribution, indicating that the behavior of users receiving short messages has indeed changed.

two λ The posterior distribution of is not like exponential distribution. In fact, the posterior distribution is not any distribution that we can distinguish from the original model, which can be well displayed by computer.

three τ It is a discrete variable with different posterior distribution. 50% assurance indicates that the user's behavior changes in 44 days.

Result display:

Draw the fitting result:

plt.figure(figsize=(12.5, 5))

N = tau_samples.shape[0]
excepted_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    ix = day < tau_samples
    excepted_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                  +lambda_2_samples[~ix].sum()) / N

plt.plot(range(n_count_data), excepted_texts_per_day, lw=4, color='#E24A33',
        label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Excepted number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
       label='observed texts per day')
plt.legend(loc="upper left")

Conclusion:

Observing the results in the figure above, it is found that the analysis results are very consistent with the previous estimates. The user's behavior has indeed changed, and the change is very sudden. Therefore, it can be speculated that the reason for the situation may be: the reduction of SMS charges, or during the Spring Festival, or the weather reminder SMS subscription, etc.

2.2 SMS data inference behavior (expansion)

Title:

code:

Build model:

with pm.Model() as model:
    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    lambda_3 = pm.Exponential("lambda_3", alpha)
    
    tau_1 = pm.DiscreteUniform("tau_1", lower=0, upper=n_count_data)
    tau_2 = pm.DiscreteUniform("tau_2", lower=0, upper=n_count_data)
    with model:
        idx = np.arange(n_count_data)
        #pm.math.switch(tau_2>=idx, lambda_2, lambda_3)
        lambda_ = pm.math.switch(tau_1>=idx, lambda_1, pm.math.switch(tau_2>=idx, lambda_2, lambda_3))
        #print(lambda_)
    with model:
        observation = pm.Poisson("obs", lambda_, observed=count_data)
    with model:
        step = pm.Metropolis()
        trace = pm.sample(10000, tune=5000, step=step)
    
        lambda_1_samples = trace['lambda_1']
        lambda_2_samples = trace['lambda_2']
        lambda_3_samples = trace['lambda_3']
        tau_1_samples = trace['tau_1']
        tau_2_samples = trace['tau_2']
        print(tau_1_samples)
        print(lambda_1_samples.sum())
        print(lambda_2_samples.sum())
        print(lambda_3_samples.sum())

Result display:

Draw a posteriori histogram of five parameters:

plt.figure(figsize=(12.5, 10))

ax = plt.subplot(511)
ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
        label='posterior of $\lambda_1$', color='#A60628', density=True)

plt.legend(loc='upper left')
plt.title(r"""Posterior distributions of the varibles $\lambda_1,\;\lambda_2,\;\tau$""")
plt.xlim([15, 30])
plt.xlabel("$\lambda_1$ value")

ax = plt.subplot(512)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
        label='posterior of $\lambda_2$', color="#7A68A6", density=True)
plt.legend(loc='upper left')
plt.xlim([5,20])
plt.xlabel("$\lambda_2$ value")

ax = plt.subplot(513)
ax.set_autoscaley_on(False)
plt.hist(lambda_3_samples, histtype='stepfilled', bins=30, alpha=0.85,
        label='posterior of $\lambda_3$', color='#467821', density=True)

plt.legend(loc='upper left')
plt.xlim([15, 30])
plt.xlabel("$\lambda_3$ value")

plt.subplot(514)
w = 1.0 / tau_1_samples.shape[0] * np.ones_like(tau_1_samples)
plt.hist(tau_1_samples, bins=n_count_data, alpha=1, label=r"posterior of $\tau$_1",
        color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))
plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$_1 (in days)")
plt.ylabel("probability")

plt.subplot(515)
w = 1.0 / tau_2_samples.shape[0] * np.ones_like(tau_2_samples)
plt.hist(tau_2_samples, bins=n_count_data, alpha=1, label=r"posterior of $\tau$_2",
        color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))
plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$_2 (in days)")
plt.ylabel("probability")

plt.show()

Conclusion:

one λ 1 is about 20, λ 2 is about 10, λ 3 is about 23, these three λ There are significant differences in the posterior distribution, indicating that the behavior of users receiving short messages has indeed changed.

two λ The posterior distribution of is not like exponential distribution. In fact, the posterior distribution is not any distribution that we can distinguish from the original model, which can be well displayed by computer.

three τ It is a discrete variable, and the posterior distribution is different. 40% assurance indicates that the user's behavior changes in 43 days.

Result display:

Draw the fitting result:

plt.figure(figsize=(12.5, 5))

N = tau_1_samples.shape[0]
excepted_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    ix = day<tau_1_samples
    iy = day>tau_2_samples
    iz = day>tau_1_samples
    if ix.sum()/len(ix) == 1:
        excepted_texts_per_day[day] = (lambda_1_samples.sum())/N
    elif iy.sum()/len(iy) == 1:
        excepted_texts_per_day[day] = (lambda_3_samples.sum())/N
    else:
        excepted_texts_per_day[day] = (lambda_2_samples.sum())/N
    
plt.plot(range(n_count_data), excepted_texts_per_day, lw=4, color='#E24A33',
        label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Excepted number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
       label='observed texts per day')
plt.legend(loc="upper left")

Conclusion:

Observing the results in the figure above, it is found that the analysis results are very consistent with the previous estimates. The user's behavior has indeed changed, and the change is very sudden. Therefore, it can be speculated that the reason for the situation may be: the reduction of SMS charges, or during the Spring Festival, or the weather reminder SMS subscription, etc.

2.3 A-B test

Title:

2.3.1 method I

Beta distribution is a good idea for a priori distribution (the conversion rate is between 0 and 1, which is just consistent with the value range of beta distribution). When the sample is binomial distribution, the conjugate prior distribution family of parameters is beta distribution, so MCMC sampling is not required.

code:

Import required libraries:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize

Data to be imported:

visitors_to_A=1300
visitors_to_B=1300
conversions_from_A=120
conversions_from_B=125
alpha_prior=1
beta_prior=1

Obtain the posterior distribution and sample:

posterior_A=beta(alpha_prior+conversions_from_A,
                beta_prior+visitors_to_A-conversions_from_A)
posterior_B=beta(alpha_prior+conversions_from_B,
                beta_prior+visitors_to_B-conversions_from_B)
samples=2000
samples_posterior_A=posterior_A.rvs(samples)
samples_posterior_B=posterior_B.rvs(samples)

Draw a posteriori histogram and estimate p A p_A pA​, p B p_B pB​:

plt.figure(figsize=(12.5,4))
plt.rcParams['savefig.dpi']=300
plt.rcParams['figure.dpi']=300
x=np.linspace(0,1,500)
plt.plot(x,posterior_A.pdf(x),label='posterior of A')
plt.plot(x,posterior_B.pdf(x),label='posterior of B')
plt.xlim(0.05,0.15)
plt.xlabel('value')

2.3.2 method II

MCMC method is used for sampling.

code:

Import required libraries:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize

Generate analog data:

p_true=0.05
N=1500
occurrences=stats.bernoulli.rvs(p_true,size=N)
print("What is the observed frequency in Group A?%.4f"%np.mean(occurrences))
print("Does this equal the true frequency?%s"%(np.mean(occurrences)==p_true))

Build the model and import the required data:

with pm.Model() as model:
    p=pm.Uniform('p',lower=0,upper=1)
    obs=pm.Bernoulli('obs',p=p,observed=occurrences)
    step=pm.Metropolis()
    trace=pm.sample(12000,tune=1200,step=step,cores=2)
    burned_trace=trace[1000:]
    pm.traceplot(trace)

Check the sampling results:

 pm.traceplot(trace)

Draw the posterior distribution diagram:

plt.figure(figsize=(12.5,4))
plt.title("Posterior distribution of $p_A$,the true effectiveness of site A")
plt.vlines(p_true,0,90,linestyles="--",label='true $p_A$(unknown)')
plt.hist(burned_trace['p'],bins=25,histtype='stepfilled',density=True)
plt.legend()
plt.show()

2.4 A-B test (expansion)

Title:

code:

The uniform distribution is selected as a priori distribution.

Import required libraries:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize

Generate analog data:

true_p_A=0.05
true_p_B=0.04
N_A=1500
N_B=750
observations_A=stats.bernoulli.rvs(true_p_A,size=N_A)
observations_B=stats.bernoulli.rvs(true_p_B,size=N_B)

Build the model and import the required data:

with pm.Model() as model:
    p_A=pm.Uniform('p_A',0,1)
    p_B=pm.Uniform('p_B',0,1)
    delta=pm.Deterministic('delta',p_A-p_B)
    obs_A=pm.Bernoulli('obs_A',p_A,observed=observations_A)    
    obs_B=pm.Bernoulli('obs_B',p_B,observed=observations_B)
    step=pm.Metropolis()
    trace=pm.sample(20000,step=step,cores=1)
    burned_trace=trace[1000:]

Result display:

p_A_samples=burned_trace["p_A"]
p_B_samples=burned_trace["p_B"]
delta_samples=burned_trace["delta"]
plt.figure(figsize=(12.5,4))
plt.figure(1)
ax=plt.subplot(311)
plt.xlim(0,.1)
plt.hist(p_A_samples,histtype='stepfilled',bins=25,alpha=0.85,
        label="posterior of $p_A$",color="#A60628",density=True)
plt.vlines(true_p_A,0,80,linestyle="--",label="true $p_A$(unknown)")
plt.legend(loc="upper right")
plt.title("Poserior distributions of $p_A$,$_B$,and delta unknowns")

ax=plt.subplot(312)
plt.xlim(0,.1)
plt.hist(p_B_samples,histtype='stepfilled',bins=25,alpha=0.85,
        label="posterior of $p_B$",color="#467821",density=True)
plt.vlines(true_p_B,0,80,linestyle="--",label="true $p_B$(unknown)")
plt.legend(loc="upper right")
ax=plt.subplot(313)
plt.hist(delta_samples,histtype='stepfilled',bins=30,alpha=0.85,
        label="posterior of delta",color="#7A68A6",density=True)
plt.vlines(true_p_A-true_p_B,0,60,linestyle="--",label="true delta(unknown)")
plt.vlines(0,0,60,color="black",alpha=0.2)
plt.legend(loc="upper right")


plt.figure(2)
plt.xlim(0,.1)
plt.hist(p_A_samples,histtype='stepfilled',bins=30,alpha=0.80,
        label="posterior of $p_A$",color='#A60628',density=True)
plt.hist(p_B_samples,histtype='stepfilled',bins=30,alpha=0.80,
        label="posterior of $p_B$",color='#467821',density=True)
plt.legend(loc='upper right')
plt.xlabel("Value")
plt.ylabel("Density")
plt.title('Posterior distribution of $p_A$ and $p_B$')
plt.ylim(0,80)
plt.show()

compare p A p_A pA​, p B p_B pB ^ size

print("Probability site A is WRSE than site B:%.3f"%np.mean(delta_samples<0))
print("Probability site A is BETTER than site B:%.3f"%np.mean(delta_samples>0))

3, Experimental experience

Through this experiment, I am more familiar with the use of pymc3 and have a better understanding of Bayesian statistical inference. In case of difficulties during the experiment, we should communicate with teachers and students in time, so that the experiment can be carried out smoothly. We hope that the follow-up experiment can master Bayesian statistical inference better.

Tags: Python Machine Learning Deep Learning

Posted by GYK on Mon, 16 May 2022 22:26:11 +0300