tsfresh

Feature Engineering for Machine Learning

wangzf / 2022-05-03

Installing tsfresh
Using tsfresh: the workflow
tsfresh data formats
- Input data formats
- Output data format
scikit-learn Transformers
- Feature extraction
- Feature selection
- Feature extraction and selection
Big data
- Dask
- PySpark
Rolling and time series forecasting
References

tsfresh is a library that automatically extracts features from time series.

Installing tsfresh

$ pip install tsfresh

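Using tsfresh: the workflow

A typical tsfresh workflow has two stages.

Training stage:

- Data preparation: build a dataset in one of the input formats tsfresh accepts
- Sampling: cut samples out of the series with a sliding window that advances by a step of s
- Feature generation: compute features for each sampled window and collect them
- Feature selection: pool the features derived from all series and keep only the relevant ones

Deployment stage:

- Data preparation: build a dataset in one of the input formats tsfresh accepts
- Feature generation: compute features for each sliding-window sample and collect them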

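tsfresh data formats

Input data formats

- Flat DataFrame
- Stacked DataFrame
- Dictionary of flat DataFrames

| column_id | column_value | column_sort | column_kind |
|-----------|--------------|-------------|-------------|
| id        | value        | sort        | kind        |

APIs that accept these formats: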

tsfresh.extract_features()
tsfresh.

Flat DataFrame

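| id | time | x        | y        |
|----|------|----------|----------|
| A  | t1   | x(A, t1) | y(A, t1) |
| A  | t2   | x(A, t2) | y(A, t2) |
| A  | t3   | x(A, t3) | y(A, t3) |
| B  | t1   | x(B, t1) | y(B, t1) |
| B  | t2   | x(B, t2) | y(B, t2) |
| B  | t3   | x(B, t3) | y(B, t3) |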

Stacked DataFrame

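| id | time | kind | value    |
|----|------|------|----------|
| A  | t1   | x    | x(A, t1) |
| A  | t2   | x    | x(A, t2) |
| A  | t3   | x    | x(A, t3) |
| A  | t1   | y    | y(A, t1) |
| A  | t2   | y    | y(A, t2) |
| A  | t3   | y    | y(A, t3) |
| B  | t1   | x    | x(B, t1) |
| B  | t2   | x    | x(B, t2) |
| B  | t3   | x    | x(B, t3) |
| B  | t1   | y    | y(B, t1) |
| B  | t2   | y    | y(B, t2) |
| B  | t3   | y    | y(B, t3) |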

Dictionary of flat DataFrames

{
    "x":
        | id | time | value    |
        |----|------|----------|
        | A  | t1   | x(A, t1) |
        | A  | t2   | x(A, t2) |
        | A  | t3   | x(A, t3) |
        | B  | t1   | x(B, t1) |
        | B  | t2   | x(B, t2) |
        | B  | t3   | x(B, t3) |
  , "y":
        | id | time | value    |
        |----|------|----------|
        | A  | t1   | y(A, t1) |
        | A  | t2   | y(A, t2) |
        | A  | t3   | y(A, t3) |
        | B  | t1   | y(B, t1) |
        | B  | t2   | y(B, t2) |
        | B  | t3   | y(B, t3) |
}
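To make the relationship between the three input formats concrete, here is a small pandas-only sketch that builds a flat DataFrame and derives the other two from it. The numeric values and the column names `kind` and `value` are illustrative assumptions; tsfresh does not require these particular names, since they are passed in through the `column_*` arguments.

```python
import pandas as pd

# Flat format: one row per (id, time), one column per kind of series.
flat = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "B", "B"],
    "time": [1, 2, 3, 1, 2, 3],
    "x":    [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "y":    [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Stacked format: one row per (id, time, kind), with all values in one column.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

# Dictionary of flat DataFrames: one (id, time, value) frame per kind.
as_dict = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in stacked.groupby("kind")
}

print(stacked.head())
print(sorted(as_dict))
```

With these names, the flat frame would be passed with `column_id="id", column_sort="time"`, the stacked frame additionally with `column_kind="kind", column_value="value"`, and the dictionary with `column_id="id", column_sort="time", column_value="value"`.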

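Output data format

| id | x feature 1 | … | x feature N | y feature 1 | … | y feature N |
|----|-------------|---|-------------|-------------|---|-------------|
| A  | …           | … | …           | …           | … | …           |
| B  | …           | … | …           | …           | … | …           |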

scikit-learn Transformers

Feature extraction

tsfresh.transformers.FeatureAugmenter

Feature selection

tsfresh.transformers.FeatureSelector

Feature extraction and selection

tsfresh.transformers.RelevantFeatureAugmenter

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tsfresh.examples import load_robot_execution_failures
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter

# Download the example dataset (required before it can be loaded)
download_robot_execution_failures()

# Extract and select features, then feed them to a classifier
pipeline = Pipeline([
    ("augmenter", RelevantFeatureAugmenter(column_id="id", column_sort="time")),
    ("classifier", RandomForestClassifier()),
])

# df_ts holds the time series; y holds one label per id
df_ts, y = load_robot_execution_failures()
# X starts out empty, with one row per id; the augmenter fills in the features
X = pd.DataFrame(index=y.index)

# Hand the time series to the augmenter step before fitting
pipeline.set_params(augmenter__timeseries_container=df_ts)
pipeline.fit(X, y)
