ARIMA model
Time series model ARIMA
Application process of ARIMA model

The stationarity is identified according to the scatter diagram, autocorrelation function and partial autocorrelation function diagram of time series.

The nonstationary time series data are smoothed. Until the processed autocorrelation function and partial autocorrelation function are non significant and nonzero.

According to the identified features, the corresponding time series model is established. After the stabilization, if the partial autocorrelation function is truncated and the autocorrelation function is trailing, the AR model is established; If the partial autocorrelation function is trailing and the autocorrelation function is truncated, the MA model is established; If the partial autocorrelation function and autocorrelation function are trailing, the sequence is suitable for ARMA model.

Parameter estimation to test whether it has statistical significance.

Hypothesis test to judge (diagnose) whether the residual sequence is a white noise sequence.

Use the tested model for prediction.
## Load package import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns ## The image is displayed in the Jupiter notebook %matplotlib inline ## The displayed picture format (HD format in mac) can also be set to "bmp" and other formats %config InlineBackend.figure_format = "retina" ## The output diagram shows Chinese from matplotlib.font_manager import FontProperties fonts = FontProperties(fname = "D:\Desktop\python Application in machine learning\Founder bold black song simplified Chinese.ttf",size=14) import statsmodels.api as sm import statsmodels.formula.api as smf from sklearn import metrics from sklearn.model_selection import train_test_split ## Ignore reminder import warnings warnings.filterwarnings("ignore")
## Read the data set of the number of aircraft passengers Airp = pd.read_csv("D:\Desktop\python Application in machine learning\AirPassengers.csv",index_col="Month") Airp = Airp.astype("float64",copy=False) ## Adjust data type ## Adjust data index Airp.index = pd.date_range(start=Airp.index.values[0],periods=Airp.shape[0],freq="MS") Airp.head(5)
The index column index is specified_ For the convenience of using the method of "Month" *, convert the data to the format of "Month" *, and use the method of "Month" * date_ The range() * function is used to generate a time series. Start specifies the start time, periods specifies the length of the backward generated series, * freq = "MS" * represents the time mode, and draws a line chart for the beginning of each Month
## The data set is from 1949 to 1960 Airp.plot(kind = "line",figsize=(12,6)) plt.grid("on") plt.title("Trend of air passenger volume",FontProperties = fonts) plt.show()
The data shows a rising trend of waveform, which shows that the data set has a strong seasonal law. The socalled seasonal law is that the change law of the data is the same as before every other period of time, such as the change of temperature in the four seasons of the year
White noise test of time series
This test is used to check whether the sequence is a random sequence. If it is a random sequence, there is no relationship between their values
## LB Test is used to test whether the sequence is white noise. The original assumption is that the sequences are independent of each other within the number of delay periods. lags = [4,8,12,16,20,24,28,32,36,64,128] LB = sm.stats.diagnostic.acorr_ljungbox(Airp,lags = lags) LB = pd.DataFrame(data = np.array(LB).T,columns=["LBvalue","Pvalue"]) LB["lags"] = lags LB
Delay order [4,8,12,16,20,24,28,32,36,64128]
If the P value of LB test is less than 0.05, it is considered that the data is not random and the data is regular
From the above LB Test, the P values are far less than 0.05, indicating that we can reject the original hypothesis and cannot consider the sequence as white noise.
After the white noise test, the stability of the data shall be judged
## The unit root test of the sequence is to test the stationarity of the sequence dftest = sm.tsa.adfuller(Airp.Passengers,autolag='BIC') print("adf:",dftest[0]) print("pvalue:",dftest[1]) print("usedlag:",dftest[2]) dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','pvalue','#Lags Used','Number of Observations Used']) dfoutput
The P value is far greater than 0.05, indicating that the sequence is not stable. It is necessary to make a firstorder difference on the data set, and then conduct white noise test and unit root test again after the firstorder difference
## First order difference of data datadiff = Airp.diff().dropna() ## Visualize time series datadiff.plot(kind = "line",figsize=(12,6)) plt.grid("on") plt.title('First order difference',FontProperties = fonts) plt.show()
## White noise test of sequence after firstorder difference ## This test is used to check whether the sequence is a random sequence. If it is a random sequence, there is no relationship between their values ## LB Test is used to test whether the sequence is white noise. The original assumption is that the sequences are independent of each other within the number of delay periods. lags = [4,8,12,16,20,24,28,32,36,64,128] LB = sm.stats.diagnostic.acorr_ljungbox(datadiff,lags = lags) LB = pd.DataFrame(data = np.array(LB).T,columns=["LBvalue","Pvalue"]) LB["lags"] = lags print("White noise test of differential sequences:\n",LB) print("") ## The unit root test of the sequence after difference is to test the stationarity of the sequence dftest = sm.tsa.adfuller(datadiff.Passengers,autolag='AIC') print("adf:",dftest[0]) print("pvalue:",dftest[1]) print("usedlag:",dftest[2])
After the firstorder difference: carry out white noise test, and the sequence can be excluded as white noise; In the unit root test, Pvalue=0.054, which is very close to 0.05. When the confidence is 90%, it is considered that the modified sequence is stable. It can be determined that parameter I = 1
## Autocorrelation and partial correlation diagram of differential sequence fig = plt.figure(figsize=(10,6)) ax1 = fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(datadiff, lags=80, ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(datadiff, lags=80, ax=ax2) plt.subplots_adjust(hspace = 0.3) plt.ylim(0.5,1.5) plt.show()
p can be determined according to the autocorrelation coefficient diagram, Q can be determined according to the partial correlation coefficient diagram, and p = 2 can be determined in the diagram; q=10
Moreover, we can also see the periodic change of correlation coefficient from the figure. By observing the autocorrelation diagram, the period fluctuates around 12. We can calculate multiple models of 12 such as period positioning [12, 24], and select the best one. seasonal_ Determination of the other three parameters in order = (1,0,8,12)
ARIMA model
## We can calculate multiple models by cycle and determine the order of the model by comparing the bic value #Generally, the order shall not exceed length/10 pmax = 12 #Generally, the order shall not exceed length/10 qmax = 15 #bic matrix bic_matrix = [] for p in range(pmax+1): tmp = [] for q in range(qmax+1): # There are some error reports, so try is used to skip the error reports. try: tmp.append(sm.tsa.ARIMA(Airp, (p,1,q)).fit().bic) except: tmp.append(None) bic_matrix.append(tmp) print("End of model iteration") #The minimum value can be found bic_matrix = pd.DataFrame(bic_matrix) # print("bic matrix: \ n",bic_matrix) #First use stack to flatten, and then use idxmin to find the minimum position. p,q = bic_matrix.stack().idxmin() print("More appropriate p:",p) print("More appropriate q:",q) bic_matrix ARIMAmol = sm.tsa.ARIMA(Airp,order=(4,1,9)).fit() print(ARIMAmol.summary2()) fig, ax = plt.subplots(figsize=(12, 8)) ax = Airp.ix['1949':].plot(ax=ax,marker = "*") fig = ARIMAmol.plot_predict(start=143,end=168,dynamic=True, ax=ax, plot_insample=False) plt.xlabel("Year") plt.ylabel("Number") plt.title("ARIMA(4,1,9)") plt.legend(loc = "upper left") plt.show()
ARIMAX model
## We can calculate multiple models by cycle and determine the order of the model by comparing the bic value #Generally, the order does not exceed length/10 (I have reduced the running time) pmax = 10 #Generally, the order shall not exceed length/10 qmax = 10 #bic matrix bic_matrix = [] for p in range(pmax+1): tmp = [] for q in range(qmax+1): # There are some error reports, so try is used to skip the error reports. try: # tmp.append(sm.tsa.ARIMA(Airp, (p,1,q)).fit().bic) tmp.append(SARIMAX(Airp,order=(p,1,q),seasonal_order=(0,0,0,12)).fit().bic) except: tmp.append(None) bic_matrix.append(tmp) print("End of model iteration") #The minimum value can be found bic_matrix = pd.DataFrame(bic_matrix) print("bic matrix:\n",bic_matrix) #First use stack to flatten, and then use idxmin to find the minimum position. p,q = bic_matrix.stack().idxmin() print("More appropriate p:",p) print("More appropriate q:",q) modl = SARIMAX(Airp,order=(10,1,0),seasonal_order=(0,0,0,12)).fit() print(modl.summary()) ## Prediction using models datapre = modl.predict(start=120,end= 160) ## Draw raw data and training data Airp.plot(kind = "line",figsize=(12,6)) datapre.plot(kind = "line",color = "r") # FontProperties = fonts plt.title("ARIMAX((10,1,0),(0,0,0,12))") plt.show()