
tsfresh

Feature engineering for machine learning

王哲峰 / 2022-05-03



tsfresh is a library that automates the extraction of features from time series.

tsfresh installation

$ pip install tsfresh

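tsfresh workflow

A typical tsfresh workflow has the following steps:

Training phase:

  1. Data preparation: prepare a dataset in an input format that tsfresh accepts
  2. Sampling: draw sliding-window samples with a stride of s
  3. Feature generation: generate features for each sampled window and collect them
  4. Feature selection: gather the derived features of all input variables and run feature selection on them

Deployment phase:

  1. Data preparation: prepare a dataset in an input format that tsfresh accepts
  2. Feature generation: generate features for the sliding-window samples and collect them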
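
The sampling step of the training phase (sliding windows with a stride of s) can be sketched with plain pandas. This is only an illustration, not part of tsfresh: the helper name `sliding_windows` and its parameters `window` and `stride` are assumptions made for this example. Each window receives its own id, so the result already has the `id` / `time` / `value` layout described in the data format section.

```python
import pandas as pd


def sliding_windows(series: pd.Series, window: int, stride: int) -> pd.DataFrame:
    """Cut `series` into windows of length `window`, one every `stride` steps.

    Returns a long DataFrame with the columns id (window number), time, value.
    Assumes the series contains at least `window` observations.
    """
    frames = []
    starts = range(0, len(series) - window + 1, stride)
    for window_id, start in enumerate(starts):
        chunk = series.iloc[start:start + window]
        frames.append(pd.DataFrame({
            "id": window_id,                   # one id per window
            "time": chunk.index.to_numpy(),    # original time stamps
            "value": chunk.to_numpy(),
        }))
    return pd.concat(frames, ignore_index=True)


# Hypothetical toy series with 10 observations at times 0..9
ts = pd.Series(range(10), dtype=float)
samples = sliding_windows(ts, window=4, stride=2)
```

With `window=4` and `stride=2`, the windows start at times 0, 2, 4, and 6, and consecutive windows overlap by two observations.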

tsfresh data formats

Input data format

| column_id | column_value | column_sort | column_kind |
|-----------|--------------|-------------|-------------|
| id        | value        | sort        | kind        |

Supported input formats:

Flat DataFrame

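| id | time | x        | y        |
|----|------|----------|----------|
| A  | t1   | x(A, t1) | y(A, t1) |
| A  | t2   | x(A, t2) | y(A, t2) |
| A  | t3   | x(A, t3) | y(A, t3) |
| B  | t1   | x(B, t1) | y(B, t1) |
| B  | t2   | x(B, t2) | y(B, t2) |
| B  | t3   | x(B, t3) | y(B, t3) |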

Stacked DataFrame

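| id | time | kind | value    |
|----|------|------|----------|
| A  | t1   | x    | x(A, t1) |
| A  | t2   | x    | x(A, t2) |
| A  | t3   | x    | x(A, t3) |
| A  | t1   | y    | y(A, t1) |
| A  | t2   | y    | y(A, t2) |
| A  | t3   | y    | y(A, t3) |
| B  | t1   | x    | x(B, t1) |
| B  | t2   | x    | x(B, t2) |
| B  | t3   | x    | x(B, t3) |
| B  | t1   | y    | y(B, t1) |
| B  | t2   | y    | y(B, t2) |
| B  | t3   | y    | y(B, t3) |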
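
The flat and the stacked layout carry the same information, so one can be converted into the other. A minimal sketch with pandas `melt`, using made-up numbers in place of x(A, t1), y(A, t1), and so on:

```python
import pandas as pd

# Flat DataFrame: one column per kind of time series (values are made up)
flat = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 3, 1, 2, 3],
    "x":    [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "y":    [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Stacked DataFrame: one row per (id, time, kind) and a single value column
stacked = flat.melt(
    id_vars=["id", "time"],
    value_vars=["x", "y"],
    var_name="kind",
    value_name="value",
)
```

The stacked layout is convenient when the different kinds of series are not all observed at the same time stamps, because no missing cells have to be filled in.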

Dictionary of flat DataFrame

{
    "x":
        | id | time | value    |
        |----|------|----------|
        | A  | t1   | x(A, t1) |
        | A  | t2   | x(A, t2) |
        | A  | t3   | x(A, t3) |
        | B  | t1   | x(B, t1) |
        | B  | t2   | x(B, t2) |
        | B  | t3   | x(B, t3) |
  , "y":
        | id | time | value    |
        |----|------|----------|
        | A  | t1   | y(A, t1) |
        | A  | t2   | y(A, t2) |
        | A  | t3   | y(A, t3) |
        | B  | t1   | y(B, t1) |
        | B  | t2   | y(B, t2) |
        | B  | t3   | y(B, t3) |
}
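
Such a dictionary can be built from a stacked DataFrame by splitting on the kind column. The following is a sketch with pandas `groupby` and made-up values; the variable names are chosen for this example only.

```python
import pandas as pd

# Stacked DataFrame with made-up values
stacked = pd.DataFrame({
    "id":    ["A", "A", "B", "B", "A", "A", "B", "B"],
    "time":  [1, 2, 1, 2, 1, 2, 1, 2],
    "kind":  ["x", "x", "x", "x", "y", "y", "y", "y"],
    "value": [0.1, 0.2, 1.1, 1.2, 5.0, 6.0, 8.0, 9.0],
})

# One flat DataFrame per kind, keyed by the kind name
by_kind = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in stacked.groupby("kind")
}
```

When a dictionary like this is handed to tsfresh, the dictionary keys play the role of the kind column, so only the id, sort, and value columns should need to be named.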

Output data format

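| id | x feature 1 | ... | x feature N | y feature 1 | ... | y feature N |
|----|-------------|-----|-------------|-------------|-----|-------------|
| A  | ...         | ... | ...         | ...         | ... | ...         |
| B  | ...         | ... | ...         | ...         | ... | ...         |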

scikit-learn Transformers

Feature extraction

Feature selection

Feature extraction and selection

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tsfresh.examples import load_robot_execution_failures
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter

# Download the example dataset (robot execution failures)
download_robot_execution_failures()

# Step 1 extracts and selects relevant features, step 2 trains a classifier on them
pipeline = Pipeline([
    ("augmenter", RelevantFeatureAugmenter(column_id="id", column_sort="time")),
    ("classifier", RandomForestClassifier()),
])

# df_ts holds the time series, y the target indexed by id
df_ts, y = load_robot_execution_failures()
# X starts without feature columns; only its index (the ids) is needed
X = pd.DataFrame(index=y.index)

# Hand the time series to the augmenter step, then fit the whole pipeline
pipeline.set_params(augmenter__timeseries_container=df_ts)
pipeline.fit(X, y)

Big data

Dask

PySpark

Rolling and time series forecasting


References