logo

Pandas-Timeseries

王哲峰 / 2022-11-28


目录

Pandas 提供了多种功能来支持时间序列数据。以下主要功能对于使用 Pandas 进行时间序列预测非常重要:

Pandas 支持的四个与时间相关的概念

概念 标量 Class 数组 Class 数据类型 创建方法
Date Time Timestamp DatetimeIndex datetime64[ns] to_datetime()
datetime64[ns,tz] date_range()
Time Deltas Timedelta TimedeltaIndex timedelta64[ns] to_timedelta()
timedelta_range()
Time Spans Period PeriodIndex period[freq] Period()
period_range()
Date Offset DateOffset None None DateOffset

Freq

Alias Description
B business day frequency
C custom business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A, Y year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds

Format

Directive Example
Year
%Y 0001,2014,9999
%y 00,01,99
Month
%m 01,…12
%B January,…,December
%b Jan,…,Dec
Day
%d 01,…,30,31
Week
%w 0,…,6
%A Monday,…,Sunday
%a Mon,…,Sun
Time
%H 00,…,23
%I 00,…,12
%M 00,…,59
%S 00,…,59
%p AM,PM
Timezone
%Z 001,002,…,366

Date Time

Timestamp

import datetime
import pandas as pd

# datetime object
datetime.datetime(2019, 8, 21)

# scalar datetime
pd.Timestamp(datetime.datetime(2019, 8, 21))
pd.Timestamp("2019-08-21")
pd.Timestamp(2019, 8, 21)

# day_name
wednesday = pd.Timestamp("2019-08-21")
print(wednesday.day_name())
datetime.datetime(2019, 8, 21, 0, 0)
Timestamp('2019-08-21 00:00:00')
Timestamp('2019-08-21 00:00:00')
Timestamp('2019-08-21 00:00:00')
Wednesday

DatetimeIndex

DatetimeIndex 对象

dates = [
    pd.Timestamp("2019-08-21"), 
    pd.Timestamp("2019-08-22"), 
    pd.Timestamp("2019-08-23"),
]
ts = pd.Series(data = np.random.randn(3), index = dates)
print(ts)
print("-" * 25)
print(ts.index)
print("-" * 25)
print(type(ts.index))
2019-08-21   -0.195642
2019-08-22    0.433605
2019-08-23    0.519007
dtype: float64
-------------------------
DatetimeIndex(['2019-08-21', '2019-08-22', '2019-08-23'], dtype='datetime64[ns]', freq=None)
-------------------------
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

slice and index

start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)
rng = pd.date_range(start, end, freq = "BM") # Business Month 
ts = pd.Series(np.random.randn(len(rng)), index = rng)
print(ts)
print("-" * 100)

print(ts.index)
print("-" * 100)

print(ts[:5].index)
print("-" * 100)

print(ts[::2].index)
print("-" * 100)

print(ts["1/31/2011"])
print("-" * 100)

print(ts[datetime.datetime(2011, 12, 25):])
print("-" * 100)

print(ts['10/31/2011':'12/31/2011'])
print("-" * 100)

print(ts["2011"])
print("-" * 100)

print(ts["2011-6"])
2011-01-31    0.341942
2011-02-28    0.859577
2011-03-31    0.985406
2011-04-29   -0.224811
2011-05-31   -0.888166
2011-06-30    0.710712
2011-07-29    0.355943
2011-08-31   -2.422465
2011-09-30   -0.769204
2011-10-31    1.886651
2011-11-30    0.456291
2011-12-30    1.312205
Freq: BM, dtype: float64
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')
----------------------------------------------------------------------------------------------------
0.3419422485361135
----------------------------------------------------------------------------------------------------
2011-12-30    1.312205
Freq: BM, dtype: float64
----------------------------------------------------------------------------------------------------
2011-10-31    1.886651
2011-11-30    0.456291
2011-12-30    1.312205
Freq: BM, dtype: float64
----------------------------------------------------------------------------------------------------
2011-01-31    0.341942
2011-02-28    0.859577
2011-03-31    0.985406
2011-04-29   -0.224811
2011-05-31   -0.888166
2011-06-30    0.710712
2011-07-29    0.355943
2011-08-31   -2.422465
2011-09-30   -0.769204
2011-10-31    1.886651
2011-11-30    0.456291
2011-12-30    1.312205
Freq: BM, dtype: float64
----------------------------------------------------------------------------------------------------
2011-06-30    0.710712
Freq: BM, dtype: float64

freq infer

print(pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"]))
print(pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"], freq = "infer"))
DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq='2D')

DataFrame slice and index

dft = pd.DataFrame(np.random.randn(100000, 1),
                   columns = ["A"],
                   index = pd.date_range("20130101", periods = 100000, freq = "T")) # minutely 
print(dft)
print("-" * 100)
print(dft["2013"])
print("-" * 100)
print(dft["2013-1":"2013-2"])
print("-" * 100)
print(dft["2013-1":"2013-2-28"])
print("-" * 100)
print(dft["2013-1":"2013-2-28 00:00:00"])
print("-" * 100)
print(dft["2013-1-15":"2013-1-15 12:30:00"])
dft2 = pd.DataFrame(
    data = np.random.randn(20, 1),
    columns = ["A"],
    index = pd.MultiIndex.from_product([
        pd.date_range("20130101", periods = 10, freq = "12H"),
        ["a", "b"]
    ]))
print(dft2)
print("-" * 100)

try:
    print(dft2["2013-01-05"])
except:
    print("ERROR")
print("-" * 100)

print(dft2.loc["2013-01-05"])
print("-" * 100)

dft2 = dft2.swaplevel(0, 1).sort_index()
print(dft2)
print("-" * 100)

idx = pd.IndexSlice
print(dft2.loc[idx[:, "2013-01-05"], :])

to_datetime

pd.Series

s = pd.Series(data = ["Jul 31, 2009", "2010-01-10", None])
ts  = pd.to_datetime(s)
print(ts)
0   2009-07-31
1   2010-01-10
2          NaT
dtype: datetime64[ns]

List

l = [
    '1/1/2018',
    np.datetime64('2018-01-01'), 
    datetime.datetime(2018, 1, 1),
]
ts = pd.to_datetime(l)
ts
DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], 
dtype='datetime64[ns]', freq=None)

dayfirst

d = ["04-01-2012 10:00"]
td1 = pd.to_datetime(d)
td2 = pd.to_datetime(d, dayfirst = True)
print(td1)
print(td2)
print("-" * 80)
d2 = ["14-01-2019", "01-14-2019"]
td3 = pd.to_datetime(d2)
td4 = pd.to_datetime(d2, dayfirst = True)
print(td3)
print(td4)
DatetimeIndex(['2012-04-01 10:00:00'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2012-01-04 10:00:00'], dtype='datetime64[ns]', freq=None)
--------------------------------------------------------------------------------
DatetimeIndex(['2019-01-14', '2019-01-14'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2019-01-14', '2019-01-14'], dtype='datetime64[ns]', freq=None)

format

print(pd.to_datetime("2010/11/12", format = "%Y/%m/%d"))
print(pd.to_datetime("12-11-2010 00:00", format = "%d-%m-%Y %H:%M"))
2010-11-12 00:00:00
2010-11-12 00:00:00

pd.DataFrame

df = pd.DataFrame({
    "year": [2015, 2016],
    "month": [2, 3],
    "day": [4, 5],
    "hour": [2, 3]
})
print(df)
print()
print(pd.to_datetime(df))
print()
print(pd.to_datetime(df[["year", "month", "day"]]))
   year  month  day  hour
0  2015      2    4     2
1  2016      3    5     3

0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]

0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

errors

try:
    pd.to_datetime(["2019-08-21", "asd"], errors = "raise")
except ValueError as e:
    print(e)

print(pd.to_datetime(["2019-08-21", "asd"], errors = "ignore"))
print(pd.to_datetime(["2019-08-21", "asd"], errors = "coerce"))
('Unknown string format:', 'asd')
['2019-08-21' 'asd']
DatetimeIndex(['2019-08-21', 'NaT'], dtype='datetime64[ns]', freq=None)

unit

print(pd.to_datetime([
    1349720105, 1349806505, 1349892905, 1349979305, 1350065705
], unit = "s"))
print(pd.to_datetime([
    1349720105000, 1349806505000, 1349892905000, 1349979305000, 1350065705000
], unit = "ms"))
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
               '2012-10-10 18:15:05', '2012-10-11 18:15:05',
               '2012-10-12 18:15:05'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
               '2012-10-10 18:15:05', '2012-10-11 18:15:05',
               '2012-10-12 18:15:05'],
              dtype='datetime64[ns]', freq=None)

origin

print(pd.to_datetime([1, 2, 3], unit = "D", origin = pd.Timestamp('1960-01-01')))
print(pd.to_datetime([1, 2, 3], unit = "D"))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)

to_localize

print(pd.to_datetime([1262347200000000000]).tz_localize("US/Pacific"))
print(pd.to_datetime([1262347200000000000]).tz_localize("UTC"))
print(pd.DatetimeIndex([1262347200000000000]).tz_localize("US/Pacific"))
DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
DatetimeIndex(['2010-01-01 12:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

tz_convert

print(pd.DatetimeIndex([1262347200000000000]).tz_convert('US/Pacific'))
DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

date_range

start = datetime.datetime(2018, 1, 1)
end = datetime.datetime(2019, 1, 1)

print(pd.date_range(start, end))  # Calendar day 
print(pd.bdate_range(start, end)) # Business day

print(pd.date_range(start, periods = 1000, freq = "M"))   # Month End
print(pd.bdate_range(start, periods = 250, freq = "BQS")) # Business Quarter Start

print(pd.date_range(start, end, freq = "BM")) # Business Month End
print(pd.date_range(start, end, freq = "W")) # Weekly

print(pd.date_range(end = end, periods = 20))     # end and length
print(pd.date_range(start = start, periods = 20)) # start and length

print(pd.date_range("2018-01-01", "2018-01-05", periods = 5)) # start and end and length
print(pd.date_range("2018-01-01", "2018-01-05", periods = 10))# start and end and length
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-23', '2018-12-24', '2018-12-25', '2018-12-26',
               '2018-12-27', '2018-12-28', '2018-12-29', '2018-12-30',
               '2018-12-31', '2019-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08', '2018-01-09', '2018-01-10',
               '2018-01-11', '2018-01-12',
               ...
               '2018-12-19', '2018-12-20', '2018-12-21', '2018-12-24',
               '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28',
               '2018-12-31', '2019-01-01'],
              dtype='datetime64[ns]', length=262, freq='B')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31',
               ...
               '2100-07-31', '2100-08-31', '2100-09-30', '2100-10-31',
               '2100-11-30', '2100-12-31', '2101-01-31', '2101-02-28',
               '2101-03-31', '2101-04-30'],
              dtype='datetime64[ns]', length=1000, freq='M')
DatetimeIndex(['2018-01-01', '2018-04-02', '2018-07-02', '2018-10-01',
               '2019-01-01', '2019-04-01', '2019-07-01', '2019-10-01',
               '2020-01-01', '2020-04-01',
               ...
               '2078-01-03', '2078-04-01', '2078-07-01', '2078-10-03',
               '2079-01-02', '2079-04-03', '2079-07-03', '2079-10-02',
               '2080-01-01', '2080-04-01'],
              dtype='datetime64[ns]', length=250, freq='BQS-JAN')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-30', '2018-04-30',
               '2018-05-31', '2018-06-29', '2018-07-31', '2018-08-31',
               '2018-09-28', '2018-10-31', '2018-11-30', '2018-12-31'],
              dtype='datetime64[ns]', freq='BM')
DatetimeIndex(['2018-01-07', '2018-01-14', '2018-01-21', '2018-01-28',
               '2018-02-04', '2018-02-11', '2018-02-18', '2018-02-25',
               '2018-03-04', '2018-03-11', '2018-03-18', '2018-03-25',
               '2018-04-01', '2018-04-08', '2018-04-15', '2018-04-22',
               '2018-04-29', '2018-05-06', '2018-05-13', '2018-05-20',
               '2018-05-27', '2018-06-03', '2018-06-10', '2018-06-17',
               '2018-06-24', '2018-07-01', '2018-07-08', '2018-07-15',
               '2018-07-22', '2018-07-29', '2018-08-05', '2018-08-12',
               '2018-08-19', '2018-08-26', '2018-09-02', '2018-09-09',
               '2018-09-16', '2018-09-23', '2018-09-30', '2018-10-07',
               '2018-10-14', '2018-10-21', '2018-10-28', '2018-11-04',
               '2018-11-11', '2018-11-18', '2018-11-25', '2018-12-02',
               '2018-12-09', '2018-12-16', '2018-12-23', '2018-12-30'],
              dtype='datetime64[ns]', freq='W-SUN')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2018-12-13', '2018-12-14', '2018-12-15', '2018-12-16',
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-22', '2018-12-23', '2018-12-24',
               '2018-12-25', '2018-12-26', '2018-12-27', '2018-12-28',
               '2018-12-29', '2018-12-30', '2018-12-31', '2019-01-01'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
               '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
               '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20'],
              dtype='datetime64[ns]', freq='D')
----------------------------------------------------------------------------------------------------
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 10:40:00',
               '2018-01-01 21:20:00', '2018-01-02 08:00:00',
               '2018-01-02 18:40:00', '2018-01-03 05:20:00',
               '2018-01-03 16:00:00', '2018-01-04 02:40:00',
               '2018-01-04 13:20:00', '2018-01-05 00:00:00'],
              dtype='datetime64[ns]', freq=None)

bdate_range

weekmask = "Mon Wed Fri"
holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]

idx1 = pd.bdate_range(start, end, freq = "C", weekmask = weekmask, holidays = holidays)
print(idx1)

idx2 = pd.bdate_range(start, end, freq = "CBMS", weekmask = weekmask)
print(idx2)
DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05', '2018-01-08',
               '2018-01-10', '2018-01-12', '2018-01-15', '2018-01-17',
               '2018-01-19', '2018-01-22',
               ...
               '2018-12-10', '2018-12-12', '2018-12-14', '2018-12-17',
               '2018-12-19', '2018-12-21', '2018-12-24', '2018-12-26',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', length=157, freq='C')
DatetimeIndex(['2018-01-01', '2018-02-02', '2018-03-02', '2018-04-02',
               '2018-05-02', '2018-06-01', '2018-07-02', '2018-08-01',
               '2018-09-03', '2018-10-01', '2018-11-02', '2018-12-03'],
              dtype='datetime64[ns]', freq='CBMS')

min/max Timestamp

print(pd.Timestamp.min)
print(pd.Timestamp.max)
1677-09-21 00:12:43.145225
2262-04-11 23:47:16.854775807

Time Deltas

Timedelta

friday = pd.Timestamp("2019-08-23")
print(friday.day_name())
stariday = friday + pd.Timedelta("1 day")
print(stariday.day_name())
Friday
Saturday

TimedeltaIndex

to_timedelta

timedelta_range

Time Spans

Period

pd.Period("2019-08")
pd.Period("2019-08", freq = "D")
Period('2019-08', 'M')
Period('2019-08-01', 'D')

PeriodIndex

periods = [
    pd.Period("2019-08"), 
    pd.Period("2019-07"), 
    pd.Period("2019-06"),
]
ts = pd.Series(data = np.random.randn(3), index = periods)
print(ts)
print("-" * 20)
print(ts.index)
print("-" * 20)
print(type(ts.index))
2019-08   -0.999400
2019-07   -0.213444
2019-06   -1.463501
Freq: M, dtype: float64
--------------------
PeriodIndex(['2019-08', '2019-07', '2019-06'], dtype='period[M]', freq='M')
--------------------
<class 'pandas.core.indexes.period.PeriodIndex'>

period_range

ps = pd.Series(
    data = pd.period_range("1/1/2011", freq = "M", periods = 3)
)
ps
0   2011-01
1   2011-02
2   2011-03
dtype: object

Date Offset

DataOffset

ps = pd.Series(data = [pd.DateOffset(1), pd.DateOffset(2)])
ps
0         <DateOffset>
1    <2 * DateOffsets>
dtype: object

offsets

friday = pd.Timestamp("2019-08-23")
print(friday.day_name())
monday = friday + pd.offsets.BDay()
print(monday.day_name())
Monday

Time Zone

to_localize

tz_convert

NaT

print(pd.Timestamp(pd.NaT))
print(pd.Timedelta(pd.NaT))
print(pd.Period(pd.NaT))
print(pd.NaT == pd.NaT)
NaT
NaT
NaT
False

Window

Resampling

idx = pd.date_range("2019-01-01", periods = 6, freq = "H")
ts = pd.Series(
    data = range(len(idx)),
    index = idx,
)
ts
2019-01-01 00:00:00    0
2019-01-01 01:00:00    1
2019-01-01 02:00:00    2
2019-01-01 03:00:00    3
2019-01-01 04:00:00    4
2019-01-01 05:00:00    5
Freq: H, dtype: int64
ts.resample("2H").mean()
2019-01-01 00:00:00    0.5
2019-01-01 02:00:00    2.5
2019-01-01 04:00:00    4.5
Freq: 2H, dtype: float64

Difference

Interpolate