logo

Statsmodels

wangzf / 2022-05-02


目录

statistical models, hypothesis tests, and data exploration

安装

依赖

conda

$ conda install -c conda-forge statsmodels

PyPI

$ pip install statsmodels

使用

简单线性回归

statsmodels 支持模型使用以下两种风格:

公式风格

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# data
data = sm.datasets.get_rdataset("Guerry", "HistData").data
print(data)

# regression model
results = smf.ols(
    "Lottery ~ Literacy + np.log(Pop1831)", 
    data = data
).fit()
print(results.summary())
print(dir(results))
print(results.__doc__)
    dept Region    Department  Crime_pers  ...  Prostitutes  Distance  Area  Pop1831
0      1      E           Ain       28870  ...           13   218.372  5762   346.03
1      2      N         Aisne       26226  ...          327    65.945  7369   513.00
2      3      C        Allier       26747  ...           34   161.927  7340   298.26
3      4      E  Basses-Alpes       12935  ...            2   351.399  6925   155.90
4      5      E  Hautes-Alpes       17488  ...            1   320.280  5549   129.10
..   ...    ...           ...         ...  ...          ...       ...   ...      ...
81    86      W        Vienne       15010  ...           18   170.523  6990   282.73
82    87      C  Haute-Vienne       16256  ...            7   198.874  5520   285.13
83    88      E        Vosges       18835  ...           43   174.477  5874   397.99
84    89      C         Yonne       18006  ...          272    81.797  7427   352.49
85   200    NaN         Corse        2199  ...            1   539.213  8680   195.41

[86 rows x 23 columns]
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     22.20
Date:                Thu, 17 Nov 2022   Prob (F-statistic):           1.90e-08
Time:                        00:01:21   Log-Likelihood:                -379.82
No. Observations:                  86   AIC:                             765.6
Df Residuals:                      83   BIC:                             773.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         246.4341     35.233      6.995      0.000     176.358     316.510
Literacy           -0.4889      0.128     -3.832      0.000      -0.743      -0.235
np.log(Pop1831)   -31.3114      5.977     -5.239      0.000     -43.199     -19.424
==============================================================================
Omnibus:                        3.713   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
Skew:                          -0.487   Prob(JB):                        0.183
Kurtosis:                       3.003   Cond. No.                         702.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Results class for for an OLS model.

Parameters
----------
model : RegressionModel
    The regression model instance.
params : ndarray
    The estimated parameters.
normalized_cov_params : ndarray
    The normalized covariance parameters.
scale : float
    The estimated scale of the residuals.
cov_type : str
    The covariance estimator used in the results.
cov_kwds : dict
    Additional keywords used in the covariance specification.
use_t : bool
    Flag indicating to use the Student's t in inference.
**kwargs
    Additional keyword arguments used to initialize the results.

See Also
--------
RegressionResults
    Results store for WLS and GLW models.

Notes
-----
Most of the methods and attributes are inherited from RegressionResults.
The special methods that are only available for OLS are:

- get_influence
- outlier_test
- el_test
- conf_int_el

numpy array 风格

import numpy as np
import statsmodels.api as sm

# data
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, 0.1, 0.5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# regression model
results = sm.OLS(y, X).fit()
print(results.summary())
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.178
Model:                            OLS   Adj. R-squared:                  0.161
Method:                 Least Squares   F-statistic:                     10.51
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           7.41e-05
Time:                        17:12:45   Log-Likelihood:                -20.926
No. Observations:                 100   AIC:                             47.85
Df Residuals:                      97   BIC:                             55.67
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4713      0.075     19.579      0.000       1.322       1.620
x1             0.1045      0.105      1.000      0.320      -0.103       0.312
x2             0.4831      0.107      4.503      0.000       0.270       0.696
==============================================================================
Omnibus:                       39.684   Durbin-Watson:                   1.848
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.503
Skew:                           0.096   Prob(JB):                       0.0387
Kurtosis:                       1.766   Cond. No.                         5.09
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

详细使用

载入模块和函数

import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices

数据

df = sm.datasets.get_rdataset("Guerry", "HistData").data
vars = ["Department", "Lottery", "Literacy", "Wealth", "Region"]
df = df[vars]
df = df.dropna()
df[-5:]
      Department  Lottery  Literacy  Wealth Region
80        Vendee       68        28      56      W
81        Vienne       40        25      68      W
82  Haute-Vienne       55        13      67      C
83        Vosges       14        62      82      E
84         Yonne       51        47      30      C

构造设计矩阵

y, X = dmatrices(
    "Lottery ~ Literacy + Wealth + Region", 
    data = df,
    return_type = "dataframe"
)
y[:3]
x[:3]
   Lottery
0     41.0
1     38.0
2     66.0

   Intercept  Region[T.E]  Region[T.N]  ...  Region[T.W]  Literacy  Wealth
0        1.0          1.0          0.0  ...          0.0      37.0    73.0
1        1.0          0.0          1.0  ...          0.0      51.0    22.0
2        1.0          0.0          0.0  ...          0.0      13.0    61.0

[3 rows x 7 columns]

模型

model = sm.OLS(y, X)
result = model.fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           1.07e-05
Time:                        17:12:43   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
result.params
result.rsquared
Intercept      38.651655
Region[T.E]   -15.427785
Region[T.N]   -10.016961
Region[T.S]    -4.548257
Region[T.W]   -10.091276
Literacy       -0.185819
Wealth          0.451475
dtype: float64

0.337950869192882

模型诊断和测试

sm.stats.linear_rainbow(result)
print(sm.stats.linear_rainbow.__doc__)
# (F-statistic, p-value)
(0.8472339976156916, 0.6997965543621643)
sm.graphics.plot_partregress(
    "Lottery",
    "Wealth",
    ["Region", "Literacy"],
    data = df,
    obs_labels = False
)

Alt text

回归与线性模型

时间序列分析

ARMA

statsmodels.tsa

statsmodels.tsa 命名空间中的模块结构:

估计

指数平滑 Exponential Smoothing

Holt Winter’s Exponential Smoothing

Simple Exponential Smoothing

Holt’s Exponential Smoothing

自相关系数和偏自相关系数计算

import statsmodels.api as sm

X = [2, 3, 4, 3, 8, 7]
print(sm.tsa.stattools.acf(X, nlags = 1, adjusted = True))
[1, 0.3559322]

自相关图和偏自相关图绘制

import numpy as np
import pandas as pd
import akshare as ak
from matplotlib import pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(123)

# -------------- 准备数据 --------------
# 白噪声
white_noise = np.random.standard_normal(size=1000)

# 随机游走
x = np.random.standard_normal(size=1000)
random_walk = np.cumsum(x)

# GDP
df = ak.macro_china_gdp()
df = df.set_index('季度')
df.index = pd.to_datetime(df.index)
gdp = df['国内生产总值-绝对值'][::-1].astype('float')

# GDP DIFF
gdp_diff = gdp.diff(4).dropna()

# -------------- 绘制图形 --------------
fig, ax = plt.subplots(4, 2)
fig.subplots_adjust(hspace = 0.5)

plot_acf(white_noise, ax = ax[0][0])
ax[0][0].set_title('ACF(white_noise)')
plot_pacf(white_noise, ax = ax[0][1])
ax[0][1].set_title('PACF(white_noise)')

plot_acf(random_walk, ax = ax[1][0])
ax[1][0].set_title('ACF(random_walk)')
plot_pacf(random_walk, ax = ax[1][1])
ax[1][1].set_title('PACF(random_walk)')

plot_acf(gdp, ax = ax[2][0])
ax[2][0].set_title('ACF(gdp)')
plot_pacf(gdp, ax = ax[2][1])
ax[2][1].set_title('PACF(gdp)')

plot_acf(gdp_diff, ax = ax[3][0])
ax[3][0].set_title('ACF(gdp_diff)')
plot_pacf(gdp_diff, ax = ax[3][1])
ax[3][1].set_title('PACF(gdp_diff)')

plt.show()

状态空间方法(State Space Methods)

向量自回归(Vector Autoregressions)

其他模型

生存分析

非参数方法

GMM(Generalized Method of Moments)

小众模型

多元统计分析

统计学和工具

统计

分布

图性

Input-Output

参考