20 Key Points for Mastering Pandas Time Series Analysis

Author | Soner Yıldırım
Compiled by | VK
Source | Towards Data Science

Time series data has many definitions that express the same idea in different ways. A simple one: time series data consists of data points attached to sequential timestamps.

Time series data comes from periodic measurements or observations, and it appears in many industries. For example:

  • Stock prices change over time
  • Daily, weekly and monthly sales
  • Periodic measurements in a process
  • Electricity or natural gas consumption over a period of time

In this article, I will list 20 points that will help you understand how to handle time series data with Pandas.

1. Different forms of time series data

Time series data can be in the form of specific dates, durations, or fixed defined intervals.

A timestamp can be as coarse as a calendar date or as fine as a nanosecond within a given date, depending on the precision. For example, "2020-01-01 14:59:30" is a timestamp with second precision.
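
To make the idea of precision concrete, here is a minimal sketch. The variable names and the fractional-second value are made up for illustration; only the "2020-01-01 14:59:30" string comes from the text above.

```python
import pandas as pd

# A timestamp with second-level precision, as in the example above
ts_second = pd.Timestamp('2020-01-01 14:59:30')

# The same moment with a made-up fractional part down to nanoseconds
ts_nano = pd.Timestamp('2020-01-01 14:59:30.123456789')

print(ts_second.second)      # 30
print(ts_nano.microsecond)   # 123456
print(ts_nano.nanosecond)    # 789
```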

2. Time series data structure

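Pandas provides flexible and efficient data structures for working with all kinds of time series data.

The three main structures are Timestamp (a point in time), Timedelta (a duration), and Period (a span of time), with the corresponding index types DatetimeIndex, TimedeltaIndex, and PeriodIndex. In addition to these three structures, Pandas also supports the concept of a date offset, which is a relative time duration that respects calendar arithmetic.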
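
As a rough illustration, the sketch below builds one scalar of each kind, assuming the three structures in question are Timestamp, Timedelta, and Period, and then applies a date offset. The dates and variable names are arbitrary choices for the example.

```python
import pandas as pd

ts = pd.Timestamp('2020-09-13')            # a point in time
delta = pd.Timedelta(days=2, hours=6)      # a duration
period = pd.Period('2020-09', freq='M')    # a time span (the whole month)

# Timestamp + Timedelta is plain arithmetic on absolute time
print(ts + delta)                      # 2020-09-15 06:00:00

# A Period knows where its span starts and ends
print(period.start_time)               # 2020-09-01 00:00:00
print(period.end_time)                 # 2020-09-30 23:59:59.999999999

# A DateOffset follows calendar rules: one month later, not a fixed number of days
print(ts + pd.DateOffset(months=1))    # 2020-10-13 00:00:00
```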

3. Create timestamp

The most basic time series data structure is the Timestamp, which can be created with the to_datetime or Timestamp function.

import pandas as pd

pd.to_datetime('2020-9-13')
Timestamp('2020-09-13 00:00:00')

pd.Timestamp('2020-9-13')
Timestamp('2020-09-13 00:00:00')

4. Access timestamp information

We can get information about the date, month and year stored in the timestamp.

a = pd.Timestamp('2020-9-13')

a.day_name()
'Sunday'

a.month_name()
'September'

a.day
13

a.month
9

a.year
2020
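
The same attributes are also available for a whole column of dates through the .dt accessor of a Series. This goes slightly beyond the single-timestamp example above, and the two dates are chosen only for illustration.

```python
import pandas as pd

# A Series of datetimes exposes the Timestamp attributes via .dt
s = pd.Series(pd.to_datetime(['2020-09-13', '2020-12-25']))

print(s.dt.day_name().tolist())   # ['Sunday', 'Friday']
print(s.dt.month.tolist())        # [9, 12]
print(s.dt.year.tolist())         # [2020, 2020]
```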

5. Get less obvious information

The Timestamp object also holds information about date arithmetic. For example, we can ask whether the year is a leap year. Here is some of the more specific information we can get:

b = pd.Timestamp('2020-9-30')

b.is_month_end
True

b.is_leap_year
True

b.is_quarter_start
False

b.weekofyear
40

6. European date

We can use the to_datetime function to handle European-style dates (i.e. day first) by setting the dayfirst parameter to True.

pd.to_datetime('10-9-2020', dayfirst=True)
Timestamp('2020-09-10 00:00:00')

pd.to_datetime('10-9-2020')
Timestamp('2020-10-09 00:00:00')

Note: if the first item is greater than 12, Pandas knows that it cannot be a month and parses it as the day.

pd.to_datetime('13-9-2020')
Timestamp('2020-09-13 00:00:00')
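
When the order of day and month matters, one way to remove the guesswork is to pass an explicit format string. This is a small sketch using the format parameter of to_datetime, which the examples above do not use; the variable names are illustrative.

```python
import pandas as pd

# The same string parsed with two explicit formats
day_first = pd.to_datetime('10-9-2020', format='%d-%m-%Y')
month_first = pd.to_datetime('10-9-2020', format='%m-%d-%Y')

print(day_first)     # 2020-09-10 00:00:00
print(month_first)   # 2020-10-09 00:00:00
```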

7. Convert the data frame into time series data

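The to_datetime function can convert a data frame with appropriately named columns into a time series. Consider a data frame with year, month, and day columns (reconstructed here to match the output below):

df = pd.DataFrame({'year': [2020, 2020, 2019],
                   'month': [4, 5, 4],
                   'day': [13, 16, 11]})

pd.to_datetime(df)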

0   2020-04-13 
1   2020-05-16 
2   2019-04-11 
dtype: datetime64[ns]

8. Time series data

In real life, we almost always deal with sequences of time series data rather than individual dates. Pandas makes working with such sequences very simple.

We can pass a list of dates to the to_datetime function.

pd.to_datetime(['2020-09-13', '2020-08-12', 
'2020-08-04', '2020-09-05'])

DatetimeIndex(['2020-09-13', '2020-08-12', '2020-08-04', '2020-09-05'], dtype='datetime64[ns]', freq=None)

The object returned is a DatetimeIndex.

There are more practical ways to create a series of dates.

9. Create a time series with to_datetime and to_timedelta

You can create a DatetimeIndex by adding a TimedeltaIndex to the timestamp.

import numpy as np

pd.to_datetime('10-9-2020') + pd.to_timedelta(np.arange(5), 'D')

"D" means "day", but many other options are available. You can view the entire list here: https://pandas.pydata.org/pan...
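
The following sketch shows what the sum looks like and how changing the unit changes the step. The start date is written in ISO form to avoid the day/month ambiguity, and the hourly variant is an added illustration.

```python
import numpy as np
import pandas as pd

start = pd.Timestamp('2020-10-09')

# Five consecutive days starting at the given timestamp
daily = start + pd.to_timedelta(np.arange(5), unit='D')
print(daily[0], daily[-1])   # 2020-10-09 00:00:00 2020-10-13 00:00:00

# The same idea with hours as the unit
hourly = start + pd.to_timedelta(np.arange(5), unit='h')
print(hourly[-1])            # 2020-10-09 04:00:00
```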

10. date_range function

It provides a more flexible way to create a DatetimeIndex.

pd.date_range(start='2020-01-10', periods=10, freq='M')

The periods parameter specifies the number of entries in the index. freq is the frequency; "M" indicates the last day of each month. (In pandas 2.2 and later, the alias "ME" replaces "M" for month end, and "M" triggers a deprecation warning.)

The freq parameter makes date_range quite flexible.

pd.date_range(start='2020-01-10', periods=10, freq='6D')

We created an index with a frequency of 6 days.
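
A few more variations, as a sketch: an explicit end date instead of periods, business days only, and month ends expressed with an offset object rather than a string alias. The 'B' alias and pd.offsets.MonthEnd are not used in the examples above.

```python
import pandas as pd

# Every calendar day between two dates (both endpoints included)
days = pd.date_range(start='2020-01-10', end='2020-01-16')
print(len(days))   # 7

# Business days only; Saturday the 11th and Sunday the 12th are skipped
bdays = pd.date_range(start='2020-01-10', periods=3, freq='B')
print(bdays.strftime('%Y-%m-%d').tolist())
# ['2020-01-10', '2020-01-13', '2020-01-14']

# Month ends via an offset object, which does not depend on the alias spelling
month_ends = pd.date_range(start='2020-01-10', periods=3,
                           freq=pd.offsets.MonthEnd())
print(month_ends.strftime('%Y-%m-%d').tolist())
# ['2020-01-31', '2020-02-29', '2020-03-31']
```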

11. period_range function

It returns a PeriodIndex. The syntax is similar to that of the date_range function.

pd.period_range('2018', periods=10, freq='M')
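
To see what the call above produces, this sketch inspects the first and last entries and converts the periods back to timestamps. The variable names are illustrative.

```python
import pandas as pd

periods = pd.period_range('2018', periods=10, freq='M')
print(periods[0], periods[-1])   # 2018-01 2018-10

# Each entry is a Period that spans a whole month
first = periods[0]
print(first.start_time)          # 2018-01-01 00:00:00
print(first.end_time.day)        # 31

# to_timestamp converts the spans to timestamps at the start of each period
stamps = periods.to_timestamp()
print(stamps[0])                 # 2018-01-01 00:00:00
```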

12. timedelta_range function

It returns a TimedeltaIndex.

pd.timedelta_range(start='0', periods=24, freq='H')
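
A short sketch of what the result contains and how it can be combined with a date. It uses the lowercase 'h' alias, which newer pandas releases prefer over 'H'; the date is arbitrary.

```python
import pandas as pd

hours = pd.timedelta_range(start='0', periods=24, freq='h')
print(hours[0])     # 0 days 00:00:00
print(hours[-1])    # 0 days 23:00:00

# Adding the deltas to a date gives an hourly DatetimeIndex for that day
stamps = pd.Timestamp('2020-01-01') + hours
print(stamps[-1])   # 2020-01-01 23:00:00
```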

13. Time zone

By default, Pandas time series objects do not have an assigned time zone.

dates = pd.date_range('2019-01-01','2019-01-10')

dates.tz is None
True

We can use the tz_localize method to assign a time zone to these objects.

dates_lcz = dates.tz_localize('Europe/Berlin')

dates_lcz.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>
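
It can help to contrast tz_localize with tz_convert, which is not covered above. In this sketch, localizing attaches a zone to naive dates, while converting re-expresses the same instants in another zone.

```python
import pandas as pd

dates = pd.date_range('2019-01-01', '2019-01-03')

# tz_localize attaches a zone; the wall-clock time stays at midnight
berlin = dates.tz_localize('Europe/Berlin')
print(berlin[0])   # 2019-01-01 00:00:00+01:00

# tz_convert keeps the instant and changes the wall-clock representation
utc = berlin.tz_convert('UTC')
print(utc[0])      # 2018-12-31 23:00:00+00:00
```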

14. Create a time series with a specified time zone

We can also use the tz keyword parameter to create a time series object with a time zone.

pd.date_range('2020-01-01', periods = 5, freq = 'D', tz='US/Eastern')

15. Offset

Suppose we have a time series index and want to offset all dates by a specific amount of time.

A = pd.date_range('2020-01-01', periods=10, freq='D')
A

Let's add an offset of one week to this sequence.

A + pd.offsets.Week()
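
The sketch below checks the result of the weekly offset and adds one calendar-aware example. pd.offsets.MonthEnd is an extra illustration that does not appear in the text above.

```python
import pandas as pd

A = pd.date_range('2020-01-01', periods=10, freq='D')

# Every date moves forward by exactly seven days
shifted = A + pd.offsets.Week()
print(shifted[0], shifted[-1])   # 2020-01-08 00:00:00 2020-01-17 00:00:00

# A calendar-aware offset: roll a single date forward to its month end
month_end = pd.Timestamp('2020-01-15') + pd.offsets.MonthEnd()
print(month_end)                 # 2020-01-31 00:00:00
```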

16. Shifting time series data

Time series analysis may require shifting data points to make comparisons. The shift function shifts data in time.

A.shift(10, freq='M')
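
To show what shifting an index does, here is a sketch with two frequencies. It passes pd.offsets.MonthEnd() instead of the 'M' string so that it does not depend on the alias spelling; that substitution is an assumption of this example.

```python
import pandas as pd

A = pd.date_range('2020-01-01', periods=10, freq='D')

# Shifting by 10 days moves every date forward by 10 days
by_days = A.shift(10, freq='D')
print(by_days[0])     # 2020-01-11 00:00:00

# Shifting by 10 month ends sends every January date to the 10th month end
by_month_end = A.shift(10, freq=pd.offsets.MonthEnd())
print(by_month_end.unique().strftime('%Y-%m-%d').tolist())   # ['2020-10-31']
```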

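17. shift and tshift

  • shift: moves the data
  • tshift: moves the time index

Note that tshift was deprecated in pandas 1.1 and removed in 2.0; shift with the freq parameter does the same thing. Let's create a data frame with a time series index and plot it to see the difference between the two.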

import numpy as np

dates = pd.date_range('2020-03-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
df = pd.DataFrame({'values':values}, index=dates)

df.head()

Let's plot the original time series together with the shifted ones.

import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=3, figsize=(10,6), sharey=True)
plt.tight_layout(pad=4)
df.plot(ax=axs[0], legend=None)
df.shift(10).plot(ax=axs[1], legend=None)
# tshift was removed in pandas 2.0; shift with freq moves the index the same way
df.shift(10, freq='D').plot(ax=axs[2], legend=None)
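
The difference can also be checked without a plot. This sketch uses sequential values instead of random ones so the result is reproducible, and it uses shift with freq as the replacement for tshift.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-03-01', periods=30, freq='D')
df = pd.DataFrame({'values': np.arange(30)}, index=dates)

moved_data = df.shift(10)              # index unchanged, values pushed down
moved_index = df.shift(10, freq='D')   # values unchanged, index pushed forward

print(moved_data.index[0])                      # 2020-03-01 00:00:00
print(int(moved_data['values'].isna().sum()))   # 10
print(moved_index.index[0])                     # 2020-03-11 00:00:00
print(int(moved_index['values'].iloc[0]))       # 0
```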

18. Resampling with the resample function

Another common operation on time series data is resampling. Depending on the task, we may need to resample the data at a higher or lower frequency.

Resampling groups the data into intervals of a specified length and lets you aggregate each group.

Let's create a Pandas Series with 30 values and a time series index.

A = pd.date_range('2020-01-01', periods=30, freq='D')
values = np.random.randint(10, size=30)
S = pd.Series(values, index=A)

The following returns the average value of each 3-day period.

S.resample('3D').mean()
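
With random values the result changes on every run, so this sketch swaps in sequential values to make the grouping easy to verify by hand. The agg call with several functions is an added illustration.

```python
import numpy as np
import pandas as pd

A = pd.date_range('2020-01-01', periods=30, freq='D')
S = pd.Series(np.arange(30), index=A)   # values 0..29 instead of random ones

# 30 daily values fall into 10 groups of 3 days each
means = S.resample('3D').mean()
print(len(means))        # 10
print(means.iloc[0])     # 1.0  (mean of 0, 1, 2)

# Several aggregations per group at once
summary = S.resample('3D').agg(['min', 'max', 'sum'])
print(summary.iloc[0].tolist())   # [0, 2, 3]
```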

19. asfreq function

In some cases, we may only be interested in the values at a certain frequency. The asfreq function returns the value at each timestamp of the specified frequency, starting from the first index entry. For example, from the Series created in the previous step, we may only need the value every 3 days (instead of the 3-day average).

S.asfreq('3D')
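
A quick way to see how asfreq differs from resample is to run both on the same data. As before, this sketch assumes sequential values instead of random ones so the output is predictable.

```python
import numpy as np
import pandas as pd

A = pd.date_range('2020-01-01', periods=30, freq='D')
S = pd.Series(np.arange(30), index=A)

# asfreq picks the existing value at every third timestamp
picked = S.asfreq('3D')
print(picked.tolist()[:4])     # [0, 3, 6, 9]
print(picked.index[1])         # 2020-01-04 00:00:00

# resample aggregates all values inside each 3-day group
averaged = S.resample('3D').mean()
print(averaged.tolist()[:4])   # [1.0, 4.0, 7.0, 10.0]
```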

20. Rolling

Rolling is a very useful operation for time series data. It means creating a rolling window of a specified size and performing calculations on the data inside the window as the window rolls through the data. The following figure illustrates the concept of rolling.

Note that the calculation starts only once the entire window is inside the data. In other words, if the window size is 3, the first aggregation is computed at the third row.

Let's apply a 3-day rolling window to our Series.

S.rolling(3).mean()[:10]
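
The sketch below makes the "first aggregation at the third row" behaviour visible by using sequential values instead of random ones. The min_periods variant is an added illustration of how to get results before the window is full.

```python
import numpy as np
import pandas as pd

A = pd.date_range('2020-01-01', periods=30, freq='D')
S = pd.Series(np.arange(30), index=A)

# The first two rows have no full window, so they are NaN
roll = S.rolling(3).mean()
print(roll.iloc[:4].tolist())      # [nan, nan, 1.0, 2.0]

# min_periods=1 computes on partial windows instead of returning NaN
partial = S.rolling(3, min_periods=1).mean()
print(partial.iloc[:3].tolist())   # [0.0, 0.5, 1.0]
```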

Conclusion

We have covered a broad introduction to time series analysis with Pandas. Note that Pandas offers much more for time series analysis than what is shown here.

The official documentation covers all the time series functions and methods. It may seem overwhelming at first glance, but you will get comfortable with practice.

Official documentation: https://pandas.pydata.org/doc...

Thank you for reading. If you have any feedback, please let me know.

Original link: https://towardsdatascience.co...

Posted by alexk1781 on Tue, 10 May 2022 03:32:07 +0300