1, Experimental purpose
Learn how to infer a model of user behavior from SMS data and solve the problem using the PyMC3 library.
2, Experiment contents and steps (including problems, codes, results and conclusions)
2.1 SMS data inference behavior
Title:
code:
Data reading and display:
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12.5, 3.5))
count_data = np.loadtxt("./txtdata.csv")
n_count_data = len(count_data)
plt.bar(np.arange(n_count_data), count_data, color='#348ABD')
plt.xlabel("Time (days)")
plt.ylabel("Text messages received")
plt.title("Did the user's texting habits change over time?")
plt.xlim(0, n_count_data)
Build model:
with pm.Model() as model:
    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

with model:
    idx = np.arange(n_count_data)
    # lambda_1 before the change point tau, lambda_2 after it
    lambda_ = pm.math.switch(tau >= idx, lambda_1, lambda_2)

with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)

with model:
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=5000, step=step)

lambda_1_samples = trace['lambda_1']
lambda_2_samples = trace['lambda_2']
tau_samples = trace['tau']
print(lambda_1_samples)
print(lambda_2_samples)
Result display:
Draw posterior histograms of the three parameters:
plt.figure(figsize=(12.5, 10))

ax = plt.subplot(311)
ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r'posterior of $\lambda_1$', color='#A60628', density=True)
plt.legend(loc='upper left')
plt.title(r"Posterior distributions of the variables $\lambda_1,\;\lambda_2,\;\tau$")
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_1$ value")

ax = plt.subplot(312)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r'posterior of $\lambda_2$', color="#7A68A6", density=True)
plt.legend(loc='upper left')
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_2$ value")

plt.subplot(313)
w = 1.0 / tau_samples.shape[0] * np.ones_like(tau_samples)
plt.hist(tau_samples, bins=n_count_data, alpha=1,
         label=r"posterior of $\tau$", color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))
plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau$ (in days)")
plt.ylabel("probability")
plt.show()
Conclusion:
1. λ1 is about 18 and λ2 is about 23. The two posterior distributions differ markedly, which indicates that the user's SMS-receiving behavior really did change.
2. The posterior distributions of the λ parameters do not look exponential; in fact, the posterior is not any named distribution we can recognize from the original model, but it is easy to display with the computer.
3. τ is a discrete variable, so its posterior looks different from the others. With roughly 50% certainty, the user's behavior changed on day 44.
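The "roughly 50% on day 44" statement can be read straight off the posterior samples of τ by tabulating their empirical distribution. A minimal sketch; since the real `trace['tau']` is only available after running the sampler, the samples here are a simulated stand-in with the mass reported above:

```python
import numpy as np

# Simulated stand-in for tau_samples = trace['tau'] (illustrative only):
# roughly 50% of the posterior mass on day 44, as in the conclusions above.
rng = np.random.default_rng(0)
tau_samples = rng.choice([43, 44, 45], size=10000, p=[0.25, 0.5, 0.25])

# Empirical posterior probability mass for each candidate change day.
days, counts = np.unique(tau_samples, return_counts=True)
probs = counts / len(tau_samples)
for d, p in zip(days, probs):
    print(f"P(tau = {d}) = {p:.2f}")
```

With the real trace, the same three lines after the sampling call give the exact posterior weights behind the histogram in subplot 313.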
Result display:
Draw the fitting result:
plt.figure(figsize=(12.5, 5))
N = tau_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    # For each posterior sample, the day uses lambda_1 before the
    # change point and lambda_2 after it; average over all samples.
    ix = day < tau_samples
    expected_texts_per_day[day] = (lambda_1_samples[ix].sum()
                                   + lambda_2_samples[~ix].sum()) / N
plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color='#E24A33',
         label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
        label='observed texts per day')
plt.legend(loc="upper left")
Conclusion:
The fitted curve in the figure above agrees well with the earlier estimates: the user's behavior really did change, and the change was abrupt. Possible explanations include a reduction in SMS charges, the Spring Festival period, or a new weather-alert SMS subscription.
2.2 SMS data inference behavior (expansion)
Title:
code:
Build model:
with pm.Model() as model:
    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    lambda_3 = pm.Exponential("lambda_3", alpha)
    tau_1 = pm.DiscreteUniform("tau_1", lower=0, upper=n_count_data)
    tau_2 = pm.DiscreteUniform("tau_2", lower=0, upper=n_count_data)

with model:
    idx = np.arange(n_count_data)
    # lambda_1 before tau_1, lambda_2 between tau_1 and tau_2, lambda_3 after tau_2
    lambda_ = pm.math.switch(tau_1 >= idx, lambda_1,
                             pm.math.switch(tau_2 >= idx, lambda_2, lambda_3))

with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)

with model:
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=5000, step=step)

lambda_1_samples = trace['lambda_1']
lambda_2_samples = trace['lambda_2']
lambda_3_samples = trace['lambda_3']
tau_1_samples = trace['tau_1']
tau_2_samples = trace['tau_2']
print(tau_1_samples)
print(lambda_1_samples.sum())
print(lambda_2_samples.sum())
print(lambda_3_samples.sum())
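The nested `switch` above assigns each day to one of three rate regimes. The partition can be checked in plain NumPy; a minimal sketch, where the fixed values for `tau_1` and `tau_2` are hypothetical choices for illustration and the rates are the rough posterior means reported in the conclusions below:

```python
import numpy as np

n_count_data = 74
idx = np.arange(n_count_data)
tau_1, tau_2 = 40, 45                             # illustrative change points (tau_1 < tau_2)
lambda_1, lambda_2, lambda_3 = 20.0, 10.0, 23.0   # rough posterior means, for illustration

# NumPy mirror of:
#   pm.math.switch(tau_1 >= idx, lambda_1,
#                  pm.math.switch(tau_2 >= idx, lambda_2, lambda_3))
lambda_ = np.where(tau_1 >= idx, lambda_1,
                   np.where(tau_2 >= idx, lambda_2, lambda_3))

print(lambda_[39], lambda_[42], lambda_[50])  # 20.0 10.0 23.0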
Result display:
Draw posterior histograms of the five parameters:
plt.figure(figsize=(12.5, 10))

ax = plt.subplot(511)
ax.set_autoscaley_on(False)
plt.hist(lambda_1_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r'posterior of $\lambda_1$', color='#A60628', density=True)
plt.legend(loc='upper left')
plt.title(r"Posterior distributions of the variables $\lambda_1,\;\lambda_2,\;\lambda_3,\;\tau_1,\;\tau_2$")
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_1$ value")

ax = plt.subplot(512)
ax.set_autoscaley_on(False)
plt.hist(lambda_2_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r'posterior of $\lambda_2$', color="#7A68A6", density=True)
plt.legend(loc='upper left')
plt.xlim([5, 20])
plt.xlabel(r"$\lambda_2$ value")

ax = plt.subplot(513)
ax.set_autoscaley_on(False)
plt.hist(lambda_3_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r'posterior of $\lambda_3$', color='#467821', density=True)
plt.legend(loc='upper left')
plt.xlim([15, 30])
plt.xlabel(r"$\lambda_3$ value")

plt.subplot(514)
w = 1.0 / tau_1_samples.shape[0] * np.ones_like(tau_1_samples)
plt.hist(tau_1_samples, bins=n_count_data, alpha=1,
         label=r"posterior of $\tau_1$", color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))
plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau_1$ (in days)")
plt.ylabel("probability")

plt.subplot(515)
w = 1.0 / tau_2_samples.shape[0] * np.ones_like(tau_2_samples)
plt.hist(tau_2_samples, bins=n_count_data, alpha=1,
         label=r"posterior of $\tau_2$", color='#467821', weights=w, rwidth=2)
plt.xticks(np.arange(n_count_data))
plt.legend(loc='upper left')
plt.ylim([0, .75])
plt.xlim([35, len(count_data) - 20])
plt.xlabel(r"$\tau_2$ (in days)")
plt.ylabel("probability")
plt.show()
Conclusion:
1. λ1 is about 20, λ2 about 10, and λ3 about 23. The three posterior distributions differ markedly, which indicates that the user's SMS-receiving behavior really did change.
2. The posterior distributions of the λ parameters do not look exponential; in fact, the posterior is not any named distribution we can recognize from the original model, but it is easy to display with the computer.
3. The τ parameters are discrete variables, so their posteriors look different from the others. With roughly 40% certainty, the first behavior change falls on day 43.
Result display:
Draw the fitting result:
plt.figure(figsize=(12.5, 5))
N = tau_1_samples.shape[0]
expected_texts_per_day = np.zeros(n_count_data)
for day in range(0, n_count_data):
    # For each posterior sample, the day falls into one of three regimes:
    # before tau_1 (lambda_1), between tau_1 and tau_2 (lambda_2), after tau_2 (lambda_3).
    ix1 = day < tau_1_samples
    ix3 = day > tau_2_samples
    ix2 = ~ix1 & ~ix3
    expected_texts_per_day[day] = (lambda_1_samples[ix1].sum()
                                   + lambda_2_samples[ix2].sum()
                                   + lambda_3_samples[ix3].sum()) / N
plt.plot(range(n_count_data), expected_texts_per_day, lw=4, color='#E24A33',
         label="expected number of text-messages received")
plt.xlim(0, n_count_data)
plt.xlabel("Day")
plt.ylabel("Expected number of text-messages received")
plt.ylim(0, 60)
plt.bar(np.arange(len(count_data)), count_data, color="#348ABD", alpha=0.65,
        label='observed texts per day')
plt.legend(loc="upper left")
Conclusion:
The fitted curve again agrees with the posterior estimates above: the user's behavior changed abruptly, this time at two points. The possible explanations are the same as before: a reduction in SMS charges, the Spring Festival period, or a weather-alert SMS subscription.
2.3 A-B test
Title:
2.3.1 method I
The Beta distribution is a natural choice of prior: a conversion rate lies between 0 and 1, which matches the support of the Beta distribution, and because the observations are binomial, the Beta family is the conjugate prior for the rate parameter. The posterior is therefore available in closed form and no MCMC sampling is required.
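The conjugate update can be written out explicitly as a sanity check. A minimal sketch, using the same counts as the data below (1300 visitors to A, 120 conversions, a Beta(1, 1) prior):

```python
from scipy.stats import beta

# Beta(1, 1) prior; binomial likelihood => Beta posterior with parameters
# (alpha + successes, beta + failures). This is the conjugacy that lets
# us skip MCMC entirely.
alpha_prior, beta_prior = 1, 1
visitors, conversions = 1300, 120

posterior = beta(alpha_prior + conversions,
                 beta_prior + visitors - conversions)

# The posterior mean alpha/(alpha+beta) = 121/1302 is close to the raw rate 120/1300.
print(round(posterior.mean(), 4))
```

The same two-line update is exactly what the code below applies to both sites A and B.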
code:
Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize
Data to be used:
visitors_to_A = 1300
visitors_to_B = 1300
conversions_from_A = 120
conversions_from_B = 125
alpha_prior = 1
beta_prior = 1
Obtain the posterior distribution and sample:
posterior_A = beta(alpha_prior + conversions_from_A,
                   beta_prior + visitors_to_A - conversions_from_A)
posterior_B = beta(alpha_prior + conversions_from_B,
                   beta_prior + visitors_to_B - conversions_from_B)
samples = 2000
samples_posterior_A = posterior_A.rvs(samples)
samples_posterior_B = posterior_B.rvs(samples)
Draw the posterior densities and estimate $p_A$ and $p_B$:
plt.figure(figsize=(12.5, 4))
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.dpi'] = 300
x = np.linspace(0, 1, 500)
plt.plot(x, posterior_A.pdf(x), label='posterior of A')
plt.plot(x, posterior_B.pdf(x), label='posterior of B')
plt.xlim(0.05, 0.15)
plt.xlabel('value')
plt.legend()
plt.show()
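Because method I gives both posteriors in closed form, the probability that B converts better than A can be estimated by a direct Monte Carlo comparison of posterior draws. A minimal self-contained sketch (it rebuilds the same Beta posteriors as above; the sample count of 20000 is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import beta

# Same conjugate posteriors as above: Beta(1 + conversions, 1 + failures).
posterior_A = beta(1 + 120, 1 + 1300 - 120)
posterior_B = beta(1 + 125, 1 + 1300 - 125)

rng = np.random.default_rng(42)
samples = 20000
samples_A = posterior_A.rvs(samples, random_state=rng)
samples_B = posterior_B.rvs(samples, random_state=rng)

# Fraction of joint posterior draws in which B's conversion rate exceeds A's.
prob_B_better = np.mean(samples_B > samples_A)
print(f"P(p_B > p_A) ~ {prob_B_better:.3f}")
```

This is the same comparison that section 2.4 performs on MCMC samples via the `delta` variable, only here no sampler run is needed.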
2.3.2 method II
Here MCMC sampling is used instead of the conjugate closed form.
code:
Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize
Generate simulated data:
p_true = 0.05
N = 1500
occurrences = stats.bernoulli.rvs(p_true, size=N)
print("What is the observed frequency in Group A? %.4f" % np.mean(occurrences))
print("Does this equal the true frequency? %s" % (np.mean(occurrences) == p_true))
Build the model and import the required data:
with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1)
    obs = pm.Bernoulli('obs', p=p, observed=occurrences)
    step = pm.Metropolis()
    trace = pm.sample(12000, tune=1200, step=step, cores=2)
burned_trace = trace[1000:]
Check the sampling results:
pm.traceplot(trace)
Draw the posterior distribution diagram:
plt.figure(figsize=(12.5, 4))
plt.title("Posterior distribution of $p_A$, the true effectiveness of site A")
plt.vlines(p_true, 0, 90, linestyles="--", label='true $p_A$ (unknown)')
plt.hist(burned_trace['p'], bins=25, histtype='stepfilled', density=True)
plt.legend()
plt.show()
2.4 A-B test (expansion)
Title:
code:
The uniform distribution is selected as the prior.
Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pymc3 as pm
from scipy.stats import beta
from IPython.core.pylabtools import figsize
Generate simulated data:
true_p_A = 0.05
true_p_B = 0.04
N_A = 1500
N_B = 750
observations_A = stats.bernoulli.rvs(true_p_A, size=N_A)
observations_B = stats.bernoulli.rvs(true_p_B, size=N_B)
Build the model and import the required data:
with pm.Model() as model:
    p_A = pm.Uniform('p_A', 0, 1)
    p_B = pm.Uniform('p_B', 0, 1)
    # delta is the quantity of interest: the difference in conversion rates
    delta = pm.Deterministic('delta', p_A - p_B)
    obs_A = pm.Bernoulli('obs_A', p_A, observed=observations_A)
    obs_B = pm.Bernoulli('obs_B', p_B, observed=observations_B)
    step = pm.Metropolis()
    trace = pm.sample(20000, step=step, cores=1)
burned_trace = trace[1000:]
Result display:
p_A_samples = burned_trace["p_A"]
p_B_samples = burned_trace["p_B"]
delta_samples = burned_trace["delta"]

plt.figure(figsize=(12.5, 4))
plt.figure(1)
ax = plt.subplot(311)
plt.xlim(0, .1)
plt.hist(p_A_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_A$", color="#A60628", density=True)
plt.vlines(true_p_A, 0, 80, linestyle="--", label="true $p_A$ (unknown)")
plt.legend(loc="upper right")
plt.title("Posterior distributions of $p_A$, $p_B$, and delta unknowns")
ax = plt.subplot(312)
plt.xlim(0, .1)
plt.hist(p_B_samples, histtype='stepfilled', bins=25, alpha=0.85,
         label="posterior of $p_B$", color="#467821", density=True)
plt.vlines(true_p_B, 0, 80, linestyle="--", label="true $p_B$ (unknown)")
plt.legend(loc="upper right")
ax = plt.subplot(313)
plt.hist(delta_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of delta", color="#7A68A6", density=True)
plt.vlines(true_p_A - true_p_B, 0, 60, linestyle="--", label="true delta (unknown)")
plt.vlines(0, 0, 60, color="black", alpha=0.2)
plt.legend(loc="upper right")
plt.figure(2)
plt.xlim(0, .1)
plt.hist(p_A_samples, histtype='stepfilled', bins=30, alpha=0.80,
         label="posterior of $p_A$", color='#A60628', density=True)
plt.hist(p_B_samples, histtype='stepfilled', bins=30, alpha=0.80,
         label="posterior of $p_B$", color='#467821', density=True)
plt.legend(loc='upper right')
plt.xlabel("Value")
plt.ylabel("Density")
plt.title('Posterior distribution of $p_A$ and $p_B$')
plt.ylim(0, 80)
plt.show()
Compare the sizes of $p_A$ and $p_B$:
print("Probability site A is WORSE than site B: %.3f" % np.mean(delta_samples < 0))
print("Probability site A is BETTER than site B: %.3f" % np.mean(delta_samples > 0))
3, Experimental experience
Through this experiment I became more familiar with PyMC3 and gained a better understanding of Bayesian statistical inference. When difficulties arose during the experiment, discussing them promptly with teachers and classmates kept the work moving smoothly. I hope to master Bayesian statistical inference even better in the follow-up experiments.