文章/答案/技术大牛

发布

社区首页 >问答首页 >Python中面板数据的多重共线性

问Python中面板数据的多重共线性
EN

Stack Overflow用户

提问于 2019-03-08 22:10:52

回答 1查看 2.9K关注 0票数 2

我习惯于使用Stata或R来进行线性回归模型，但我正在将更多的工作流转换为Python。

这两个程序的有用之处在于，它们直觉地知道，您并不关心线性模型中的所有实体或时间固定效应，因此，当估计面板模型时，它们会从模型中删除多色线性模型(报告它们丢弃的是哪些)。

虽然我理解以这种方式估计模型并不理想，人们应该小心回归以运行(等等)，但这在实践中是有用的，因为这意味着您可以先看到结果，然后再担心虚拟模型的一些细微之处(特别是在完全饱和的固定效果模型中，您不关心虚拟模型)。

让我举一个例子。以下操作需要linearmodels并加载数据集并尝试运行面板回归。这是他们的文档中示例的修改版本。

# Load the data (requires statsmodels and linearmodels)
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())

# Run the panel regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)

这会产生以下错误：

AbsorbingEffectError:无法估计模型。所包含的影响已经完全吸收了一个或多个变量。当使用模型中包含的效果完美地解释了一个或多个因变量时，就会发生这种情况。

但是，如果通过将相同的数据导出到Stata，在Stata中进行估计，则运行：

data.drop(columns='year').to_stata('data.dta')

然后在stata文件中运行等效值(在加载数据之后)：

xtset nr year
xtreg lwage exper union married i.year, fe

这将完成以下工作：

> . xtreg lwage exper union married i.year, fe
note: 1987.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =      4360
Group variable: nr                              Number of groups   =       545

R-sq:  within  = 0.1689                         Obs per group: min =         8
       between = 0.0000                                        avg =       8.0
       overall = 0.0486                                        max =         8

                                                F(9,3806)          =     85.95
corr(u_i, Xb)  = -0.1747                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0638624   .0032594    19.59   0.000     .0574721    .0702527
       union |   .0833697   .0194393     4.29   0.000     .0452572    .1214821
     married |   .0583372   .0183688     3.18   0.002     .0223235    .0943509
             |
        year |
       1981  |   .0496865   .0200714     2.48   0.013     .0103348    .0890382
       1982  |   .0399445    .019123     2.09   0.037     .0024521    .0774369
       1983  |   .0193513    .018662     1.04   0.300    -.0172373    .0559398
       1984  |   .0229574   .0186503     1.23   0.218    -.0136081    .0595229
       1985  |   .0081499   .0191359     0.43   0.670    -.0293677    .0456674
       1986  |   .0036329   .0200851     0.18   0.856    -.0357456    .0430115
       1987  |          0  (omitted)
             |
       _cons |   1.169184   .0231221    50.57   0.000     1.123851    1.214517
-------------+----------------------------------------------------------------
     sigma_u |  .40761229
     sigma_e |  .35343397
         rho |  .57083029   (fraction of variance due to u_i)
------------------------------------------------------------------------------

请注意，stata从1987年的回归中任意下降，但仍在运行。有没有办法在linearmodels或statsmodels中获得类似的功能？

python

pandas

stata

statsmodels

linearmodels

回答 1

Stack Overflow用户

回答已采纳

发布于 2019-03-09 06:00:30

我唯一能想到的办法就是手动。

样本数据

import pandas as pd
import numpy as np
import statsmodels.api as sm

from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS

data = wage_panel.load()

首先，我们将跟随Stata的脚步，为每年的固定效果生成假人，并且省略按字典顺序排序的第一个值(用drop_first=True参数完成)。使用np.unique获取标签是很重要的，因为这也是一种分类。statsmodels不需要添加常量，只需自己做：

data = wage_panel.load()
data = pd.concat([data, pd.get_dummies(data.year, drop_first=True)], axis=1)
exog_vars = ['exper','union','married'] + np.unique(data.year)[1::].tolist()
data = data.set_index(['nr', 'year'])

exog = data[exog_vars].assign(constant=1)

如果我们尝试运行这个模型，它会因为完美的共线性而失败。因为我们做的是一个内部回归，我们不能只是在外部测试共线性，我们需要首先在面板中去平均，因为非平均自变量的共线性才是最重要的。我将在这里复制一个exog，我们将在最终回归中使用它。

exog2 = exog.copy()
exog2 = exog2 - exog2.groupby(level=0).transform('mean') + exog2.mean()

现在我们可以看到存在着完美的共线性；当我们将一个不平均的外部变量与每一个其他变量相对应时，我们得到了一些回归的完全R-平方：

for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 1.0
#union 0.004326532427527674
#married 0.18901677578724896
#1981 1.0
#1982 1.0
#1983 1.0
#1984 1.0
#1985 1.0
#1986 1.0
#1987 1.0

现在，Stata或其他软件包是如何决定要删除哪个变量的，这对我来说是个谜。在这种情况下，您可能会倾向于放弃您的一年中的一个假人，而不是其他变量。让我们选择1987年，这样我们就可以得到和Stata一样的结果了。

exog2 = exog2.drop(columns=1987)
for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 0.48631
#union 0.0043265
#married 0.1890
#1981 0.34978
#1982 0.28369
#1983 0.2478680
#1984 0.2469180
#1985 0.2846552
#1986 0.35067075
#constant 0.94641

因此，我们已经处理了共线，可以回到模型。由于我们已经手动包含了年度固定效果，所以我们可以从模型中删除time_effects。

mod = PanelOLS(data.lwage, exog.drop(columns=1987), entity_effects=True, time_effects=False)
print(mod.fit())

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1689
Estimator:                   PanelOLS   R-squared (Between):             -0.0882
No. Observations:                4360   R-squared (Within):               0.1689
Date:                Sat, Mar 09 2019   R-squared (Overall):              0.0308
Time:                        00:59:14   Log-likelihood                   -1355.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      85.946
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(9,3806)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             85.946
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(9,3806)
Avg Obs:                       545.00                                           
Min Obs:                       545.00                                           
Max Obs:                       545.00                                           

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exper          0.0639     0.0033     19.593     0.0000      0.0575      0.0703
union          0.0834     0.0194     4.2887     0.0000      0.0453      0.1215
married        0.0583     0.0184     3.1759     0.0015      0.0223      0.0944
1981           0.0497     0.0201     2.4755     0.0133      0.0103      0.0890
1982           0.0399     0.0191     2.0888     0.0368      0.0025      0.0774
1983           0.0194     0.0187     1.0369     0.2998     -0.0172      0.0559
1984           0.0230     0.0187     1.2309     0.2184     -0.0136      0.0595
1985           0.0081     0.0191     0.4259     0.6702     -0.0294      0.0457
1986           0.0036     0.0201     0.1809     0.8565     -0.0357      0.0430
constant       1.1692     0.0231     50.566     0.0000      1.1239      1.2145
==============================================================================

与Stata输出匹配。没有真正的理由放弃1987年，我们可以选择其他的东西，但至少我们可以看到结果匹配xtreg。

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/55071706

复制

相似问题

问Python中面板数据的多重共线性
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python中面板数据的多重共线性EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python中面板数据的多重共线性
EN