5 time series prediction

ARIMA model

Time series model ARIMA

Application process of ARIMA model
  1. The stationarity is identified according to the scatter diagram, autocorrelation function and partial autocorrelation function diagram of time series.

  2. The non-stationary time series data are smoothed. Until the processed autocorrelation function and partial autocorrelation function are non significant and non-zero.

  3. According to the identified features, the corresponding time series model is established. After the stabilization, if the partial autocorrelation function is truncated and the autocorrelation function is trailing, the AR model is established; If the partial autocorrelation function is trailing and the autocorrelation function is truncated, the MA model is established; If the partial autocorrelation function and autocorrelation function are trailing, the sequence is suitable for ARMA model.

  4. Parameter estimation to test whether it has statistical significance.

  5. Hypothesis test to judge (diagnose) whether the residual sequence is a white noise sequence.

  6. Use the tested model for prediction.

## Load package
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
## The image is displayed in the Jupiter notebook
%matplotlib inline
## The displayed picture format (HD format in mac) can also be set to "bmp" and other formats
%config InlineBackend.figure_format = "retina"
## The output diagram shows Chinese
from matplotlib.font_manager import FontProperties
fonts = FontProperties(fname = "D:\Desktop\python Application in machine learning\Founder bold black song simplified Chinese.ttf",size=14)
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import metrics
from sklearn.model_selection import train_test_split
## Ignore reminder
import warnings
warnings.filterwarnings("ignore")
## Read the data set of the number of aircraft passengers
Airp = pd.read_csv("D:\Desktop\python Application in machine learning\AirPassengers.csv",index_col="Month")
Airp = Airp.astype("float64",copy=False) ## Adjust data type
## Adjust data index
Airp.index = pd.date_range(start=Airp.index.values[0],periods=Airp.shape[0],freq="MS")
Airp.head(5)


The index column index is specified_ For the convenience of using the method of "Month" *, convert the data to the format of "Month" *, and use the method of "Month" * date_ The range() * function is used to generate a time series. Start specifies the start time, periods specifies the length of the backward generated series, * freq = "MS" * represents the time mode, and draws a line chart for the beginning of each Month

## The data set is from 1949 to 1960
Airp.plot(kind = "line",figsize=(12,6))
plt.grid("on")
plt.title("Trend of air passenger volume",FontProperties = fonts)
plt.show()


The data shows a rising trend of waveform, which shows that the data set has a strong seasonal law. The so-called seasonal law is that the change law of the data is the same as before every other period of time, such as the change of temperature in the four seasons of the year

White noise test of time series

This test is used to check whether the sequence is a random sequence. If it is a random sequence, there is no relationship between their values

## LB Test is used to test whether the sequence is white noise. The original assumption is that the sequences are independent of each other within the number of delay periods.
lags = [4,8,12,16,20,24,28,32,36,64,128]
LB = sm.stats.diagnostic.acorr_ljungbox(Airp,lags = lags)
LB = pd.DataFrame(data = np.array(LB).T,columns=["LBvalue","P-value"])
LB["lags"] = lags
LB


Delay order [4,8,12,16,20,24,28,32,36,64128]
If the P value of LB test is less than 0.05, it is considered that the data is not random and the data is regular
From the above LB Test, the P values are far less than 0.05, indicating that we can reject the original hypothesis and cannot consider the sequence as white noise.
After the white noise test, the stability of the data shall be judged

##  The unit root test of the sequence is to test the stationarity of the sequence
dftest = sm.tsa.adfuller(Airp.Passengers,autolag='BIC')
print("adf:",dftest[0])
print("pvalue:",dftest[1])
print("usedlag:",dftest[2])
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
dfoutput


The P value is far greater than 0.05, indicating that the sequence is not stable. It is necessary to make a first-order difference on the data set, and then conduct white noise test and unit root test again after the first-order difference

## First order difference of data
datadiff = Airp.diff().dropna()
## Visualize time series
datadiff.plot(kind = "line",figsize=(12,6))
plt.grid("on")
plt.title('First order difference',FontProperties = fonts)
plt.show()

## White noise test of sequence after first-order difference
## This test is used to check whether the sequence is a random sequence. If it is a random sequence, there is no relationship between their values
## LB Test is used to test whether the sequence is white noise. The original assumption is that the sequences are independent of each other within the number of delay periods.
lags = [4,8,12,16,20,24,28,32,36,64,128]
LB = sm.stats.diagnostic.acorr_ljungbox(datadiff,lags = lags)
LB = pd.DataFrame(data = np.array(LB).T,columns=["LBvalue","P-value"])
LB["lags"] = lags
print("White noise test of differential sequences:\n",LB)
print("-------------------")
##  The unit root test of the sequence after difference is to test the stationarity of the sequence
dftest = sm.tsa.adfuller(datadiff.Passengers,autolag='AIC')
print("adf:",dftest[0])
print("pvalue:",dftest[1])
print("usedlag:",dftest[2])


After the first-order difference: carry out white noise test, and the sequence can be excluded as white noise; In the unit root test, P-value=0.054, which is very close to 0.05. When the confidence is 90%, it is considered that the modified sequence is stable. It can be determined that parameter I = 1

## Autocorrelation and partial correlation diagram of differential sequence
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(datadiff, lags=80, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(datadiff, lags=80, ax=ax2)
plt.subplots_adjust(hspace = 0.3)
plt.ylim(-0.5,1.5)
plt.show()


p can be determined according to the autocorrelation coefficient diagram, Q can be determined according to the partial correlation coefficient diagram, and p = 2 can be determined in the diagram; q=10

Moreover, we can also see the periodic change of correlation coefficient from the figure. By observing the autocorrelation diagram, the period fluctuates around 12. We can calculate multiple models of 12 such as period positioning [12, 24], and select the best one. seasonal_ Determination of the other three parameters in order = (1,0,8,12)

ARIMA model

## We can calculate multiple models by cycle and determine the order of the model by comparing the bic value
#Generally, the order shall not exceed length/10 
pmax = 12
#Generally, the order shall not exceed length/10
qmax = 15
#bic matrix
bic_matrix = [] 
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        # There are some error reports, so try is used to skip the error reports.
        try: 
            tmp.append(sm.tsa.ARIMA(Airp, (p,1,q)).fit().bic)
        except:
            tmp.append(None)
    bic_matrix.append(tmp)
print("End of model iteration")

#The minimum value can be found
bic_matrix = pd.DataFrame(bic_matrix) 
# print("bic matrix: \ n",bic_matrix)
#First use stack to flatten, and then use idxmin to find the minimum position.
p,q = bic_matrix.stack().idxmin() 
print("More appropriate p:",p)
print("More appropriate q:",q)
bic_matrix

ARIMAmol = sm.tsa.ARIMA(Airp,order=(4,1,9)).fit()
print(ARIMAmol.summary2())

fig, ax = plt.subplots(figsize=(12, 8))
ax = Airp.ix['1949':].plot(ax=ax,marker = "*")
fig = ARIMAmol.plot_predict(start=143,end=168,dynamic=True,
                            ax=ax, plot_insample=False)
plt.xlabel("Year")
plt.ylabel("Number")
plt.title("ARIMA(4,1,9)")
plt.legend(loc = "upper left")
plt.show()

ARIMAX model

## We can calculate multiple models by cycle and determine the order of the model by comparing the bic value
#Generally, the order does not exceed length/10 (I have reduced the running time)
pmax = 10
#Generally, the order shall not exceed length/10
qmax = 10
#bic matrix

bic_matrix = [] 
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        # There are some error reports, so try is used to skip the error reports.
        try: 
#             tmp.append(sm.tsa.ARIMA(Airp, (p,1,q)).fit().bic)
            tmp.append(SARIMAX(Airp,order=(p,1,q),seasonal_order=(0,0,0,12)).fit().bic)
        except:
            tmp.append(None)
    bic_matrix.append(tmp)
print("End of model iteration")

#The minimum value can be found
bic_matrix = pd.DataFrame(bic_matrix) 
print("bic matrix:\n",bic_matrix)
#First use stack to flatten, and then use idxmin to find the minimum position.
p,q = bic_matrix.stack().idxmin() 
print("More appropriate p:",p)
print("More appropriate q:",q)


modl = SARIMAX(Airp,order=(10,1,0),seasonal_order=(0,0,0,12)).fit()
print(modl.summary())

## Prediction using models
datapre = modl.predict(start=120,end= 160)

## Draw raw data and training data
Airp.plot(kind = "line",figsize=(12,6))
datapre.plot(kind = "line",color = "r")
# FontProperties = fonts
plt.title("ARIMAX((10,1,0),(0,0,0,12))")
plt.show()

Tags: Machine Learning

Posted by flunn on Thu, 05 May 2022 02:36:54 +0300