首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >patsy中的交互效果,patsy.dmatrices为":“和"+”或"*“提供重复列

patsy中的交互效果,patsy.dmatrices为":“和"+”或"*“提供重复列
EN

Stack Overflow用户
提问于 2014-05-15 08:02:25
回答 2查看 6K关注 0票数 4

我有一个列的dataframe,我打算将这两个列作为分类变量来处理。

第一列是country,它的值有SGP、AUS、MYS等。第二列是一天中的时间,它的值为24小时格式,如00、11、14、15等,事件是一个具有1/0标志的二进制变量。我明白要对它们进行分类,我需要在进行Logistic回归之前使用patsy。这个,我使用dmatrices构建。

Usecase :只考虑和time_day的交互效应(以及其他属性,如“操作系统”)

代码语言:javascript
复制
f= 'event_int ~ time_day:country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')

我希望看到列名与两个国家& time_day,但事实并非如此。我可以通过指定X = X.ix[:,range(7,len(X.columns))]来手动获取子集,但这意味着对每个dataset进行HARDCODING。

我的理解是A*B与A:B的不同之处在于它没有列出A+B有趣的东西,尽管我没有看到A,即仅在上面的输出中看到time_day的分类值。

此外,当我执行以下操作时,要显式地将"country“单独从"X”dataframe中排除在外,这是不起作用的,而且我得到了与上面相同的输出。

代码语言:javascript
复制
f='event_int ~ time_day:country-country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')

--这让我觉得":“是"*”的一种简化形式,因为它错过了一个范畴变量。,我认为它无法理解两者都是绝对变量?

代码语言:javascript
复制
f='event_int ~ time_day*country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'time_day[T.02]', u'time_day[T.03]', u'time_day[T.04]', u'time_day[T.05]', u'time_day[T.06]', u'time_day[T.07]', u'time_day[T.08]', u'time_day[T.09]', u'time_day[T.10]', u'time_day[T.11]', u'time_day[T.12]', u'time_day[T.NA]', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[T.HKG]', u'time_day[T.03]:country[T.HKG]', u'time_day[T.04]:country[T.HKG]', u'time_day[T.05]:country[T.HKG]', u'time_day[T.06]:country[T.HKG]', u'time_day[T.07]:country[T.HKG]', u'time_day[T.08]:country[T.HKG]', u'time_day[T.09]:country[T.HKG]', u'time_day[T.10]:country[T.HKG]', u'time_day[T.11]:country[T.HKG]', u'time_day[T.12]:country[T.HKG]', u'time_day[T.NA]:country[T.HKG]', u'time_day[T.02]:country[T.IDN]', u'time_day[T.03]:country[T.IDN]', u'time_day[T.04]:country[T.IDN]', u'time_day[T.05]:country[T.IDN]', u'time_day[T.06]:country[T.IDN]', u'time_day[T.07]:country[T.IDN]', u'time_day[T.08]:country[T.IDN]', u'time_day[T.09]:country[T.IDN]', u'time_day[T.10]:country[T.IDN]', u'time_day[T.11]:country[T.IDN]', u'time_day[T.12]:country[T.IDN]', u'time_day[T.NA]:country[T.IDN]', u'time_day[T.02]:country[T.IND]', u'time_day[T.03]:country[T.IND]', u'time_day[T.04]:country[T.IND]', u'time_day[T.05]:country[T.IND]', u'time_day[T.06]:country[T.IND]', u'time_day[T.07]:country[T.IND]', u'time_day[T.08]:country[T.IND]', u'time_day[T.09]:country[T.IND]', u'time_day[T.10]:country[T.IND]', u'time_day[T.11]:country[T.IND]', u'time_day[T.12]:country[T.IND]', u'time_day[T.NA]:country[T.IND]', u'time_day[T.02]:country[T.MYS]', u'time_day[T.03]:country[T.MYS]', u'time_day[T.04]:country[T.MYS]', u'time_day[T.05]:country[T.MYS]', u'time_day[T.06]:country[T.MYS]', u'time_day[T.07]:country[T.MYS]', u'time_day[T.08]:country[T.MYS]', u'time_day[T.09]:country[T.MYS]', u'time_day[T.10]:country[T.MYS]', u'time_day[T.11]:country[T.MYS]', u'time_day[T.12]:country[T.MYS]', u'time_day[T.NA]:country[T.MYS]', u'time_day[T.02]:country[T.NZL]', u'time_day[T.03]:country[T.NZL]', u'time_day[T.04]:country[T.NZL]', u'time_day[T.05]:country[T.NZL]', u'time_day[T.06]:country[T.NZL]', u'time_day[T.07]:country[T.NZL]', u'time_day[T.08]:country[T.NZL]', u'time_day[T.09]:country[T.NZL]', u'time_day[T.10]:country[T.NZL]', u'time_day[T.11]:country[T.NZL]', u'time_day[T.12]:country[T.NZL]', u'time_day[T.NA]:country[T.NZL]', u'time_day[T.02]:country[T.PHL]', u'time_day[T.03]:country[T.PHL]', u'time_day[T.04]:country[T.PHL]', u'time_day[T.05]:country[T.PHL]', u'time_day[T.06]:country[T.PHL]', u'time_day[T.07]:country[T.PHL]', u'time_day[T.08]:country[T.PHL]', u'time_day[T.09]:country[T.PHL]', u'time_day[T.10]:country[T.PHL]', u'time_day[T.11]:country[T.PHL]', u'time_day[T.12]:country[T.PHL]', u'time_day[T.NA]:country[T.PHL]', u'time_day[T.02]:country[T.SGP]', u'time_day[T.03]:country[T.SGP]', u'time_day[T.04]:country[T.SGP]', u'time_day[T.05]:country[T.SGP]', u'time_day[T.06]:country[T.SGP]', u'time_day[T.07]:country[T.SGP]', u'time_day[T.08]:country[T.SGP]', u'time_day[T.09]:country[T.SGP]', ...], dtype='object')

--如果我显式地将它们声明为“绝对”的vars,我将得到以下-:

代码语言:javascript
复制
f='event_int ~ C(time_day):C(country)'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'C(country)[T.HKG]', u'C(country)[T.IDN]', u'C(country)[T.IND]', u'C(country)[T.MYS]', u'C(country)[T.NZL]', u'C(country)[T.PHL]', u'C(country)[T.SGP]', u'C(time_day)[T.02]:C(country)[AUS]', u'C(time_day)[T.03]:C(country)[AUS]', u'C(time_day)[T.04]:C(country)[AUS]', u'C(time_day)[T.05]:C(country)[AUS]', u'C(time_day)[T.06]:C(country)[AUS]', u'C(time_day)[T.07]:C(country)[AUS]', u'C(time_day)[T.08]:C(country)[AUS]', u'C(time_day)[T.09]:C(country)[AUS]', u'C(time_day)[T.10]:C(country)[AUS]', u'C(time_day)[T.11]:C(country)[AUS]', u'C(time_day)[T.12]:C(country)[AUS]', u'C(time_day)[T.NA]:C(country)[AUS]', u'C(time_day)[T.02]:C(country)[HKG]', u'C(time_day)[T.03]:C(country)[HKG]', u'C(time_day)[T.04]:C(country)[HKG]', u'C(time_day)[T.05]:C(country)[HKG]', u'C(time_day)[T.06]:C(country)[HKG]', u'C(time_day)[T.07]:C(country)[HKG]', u'C(time_day)[T.08]:C(country)[HKG]', u'C(time_day)[T.09]:C(country)[HKG]', u'C(time_day)[T.10]:C(country)[HKG]', u'C(time_day)[T.11]:C(country)[HKG]', u'C(time_day)[T.12]:C(country)[HKG]', u'C(time_day)[T.NA]:C(country)[HKG]', u'C(time_day)[T.02]:C(country)[IDN]', u'C(time_day)[T.03]:C(country)[IDN]', u'C(time_day)[T.04]:C(country)[IDN]', u'C(time_day)[T.05]:C(country)[IDN]', u'C(time_day)[T.06]:C(country)[IDN]', u'C(time_day)[T.07]:C(country)[IDN]', u'C(time_day)[T.08]:C(country)[IDN]', u'C(time_day)[T.09]:C(country)[IDN]', u'C(time_day)[T.10]:C(country)[IDN]', u'C(time_day)[T.11]:C(country)[IDN]', u'C(time_day)[T.12]:C(country)[IDN]', u'C(time_day)[T.NA]:C(country)[IDN]', u'C(time_day)[T.02]:C(country)[IND]', u'C(time_day)[T.03]:C(country)[IND]', u'C(time_day)[T.04]:C(country)[IND]', u'C(time_day)[T.05]:C(country)[IND]', u'C(time_day)[T.06]:C(country)[IND]', u'C(time_day)[T.07]:C(country)[IND]', u'C(time_day)[T.08]:C(country)[IND]', u'C(time_day)[T.09]:C(country)[IND]', u'C(time_day)[T.10]:C(country)[IND]', u'C(time_day)[T.11]:C(country)[IND]', u'C(time_day)[T.12]:C(country)[IND]', u'C(time_day)[T.NA]:C(country)[IND]', u'C(time_day)[T.02]:C(country)[MYS]', u'C(time_day)[T.03]:C(country)[MYS]', u'C(time_day)[T.04]:C(country)[MYS]', u'C(time_day)[T.05]:C(country)[MYS]', u'C(time_day)[T.06]:C(country)[MYS]', u'C(time_day)[T.07]:C(country)[MYS]', u'C(time_day)[T.08]:C(country)[MYS]', u'C(time_day)[T.09]:C(country)[MYS]', u'C(time_day)[T.10]:C(country)[MYS]', u'C(time_day)[T.11]:C(country)[MYS]', u'C(time_day)[T.12]:C(country)[MYS]', u'C(time_day)[T.NA]:C(country)[MYS]', u'C(time_day)[T.02]:C(country)[NZL]', u'C(time_day)[T.03]:C(country)[NZL]', u'C(time_day)[T.04]:C(country)[NZL]', u'C(time_day)[T.05]:C(country)[NZL]', u'C(time_day)[T.06]:C(country)[NZL]', u'C(time_day)[T.07]:C(country)[NZL]', u'C(time_day)[T.08]:C(country)[NZL]', u'C(time_day)[T.09]:C(country)[NZL]', u'C(time_day)[T.10]:C(country)[NZL]', u'C(time_day)[T.11]:C(country)[NZL]', u'C(time_day)[T.12]:C(country)[NZL]', u'C(time_day)[T.NA]:C(country)[NZL]', u'C(time_day)[T.02]:C(country)[PHL]', u'C(time_day)[T.03]:C(country)[PHL]', u'C(time_day)[T.04]:C(country)[PHL]', u'C(time_day)[T.05]:C(country)[PHL]', u'C(time_day)[T.06]:C(country)[PHL]', u'C(time_day)[T.07]:C(country)[PHL]', u'C(time_day)[T.08]:C(country)[PHL]', u'C(time_day)[T.09]:C(country)[PHL]', u'C(time_day)[T.10]:C(country)[PHL]', u'C(time_day)[T.11]:C(country)[PHL]', u'C(time_day)[T.12]:C(country)[PHL]', u'C(time_day)[T.NA]:C(country)[PHL]', u'C(time_day)[T.02]:C(country)[SGP]', u'C(time_day)[T.03]:C(country)[SGP]', u'C(time_day)[T.04]:C(country)[SGP]', u'C(time_day)[T.05]:C(country)[SGP]', u'C(time_day)[T.06]:C(country)[SGP]', u'C(time_day)[T.07]:C(country)[SGP]', u'C(time_day)[T.08]:C(country)[SGP]', u'C(time_day)[T.09]:C(country)[SGP]', ...], dtype='object')

问题:

1.如何只包含交互效果&对于这些变量不包括其他任何内容?

2.为什么在第二种情况下将-country排除在外?

相关:Statsmodels formula API (patsy): How to exclude a subset of interaction components?

编辑了基于@Nathaniel j.smith以下-:的答案来排序自己的疑难解答

代码语言:javascript
复制
f2='event_int ~ country:time_day'
y2,X2 = patsy.dmatrices(f2, df, return_type='dataframe')
X2.design_info.term_names
['Intercept', 'country:time_day']

f1='event_int ~ country:time_day-1'
y1,X1 = patsy.dmatrices(f1, df, return_type='dataframe')
X1.design_info.term_names
['country:time_day']
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2014-05-15 14:01:18

简单回答:试试event_int ~ -1 + time_day:country

较长的答覆:

首先要理解的是,patsy如何决定构建一个设计矩阵有两个不同的阶段。首先,它确定要包含哪些术语。术语是像aa:b这样的东西。( aba:b中称为a;术语a包含一个单独的因素,也被拼写为a。)找出存在的术语需要扩展和简化您给出的公式,直到您有了一个只使用+:的表达式。a*b扩展为a + b + a:b等。减法是在这个阶段发生的操作:a + b - a简化为简单的b。所以a*b - a扩展到a + b + a:b - a,简化为b + a:b,但是a:b - aa:b是一样的,因为没有a可以减去,所以- a被忽略了。这就是为什么编写time_day:country - country和编写time_day:country是一样的。

然后在第二阶段,一旦patsy决定了包含哪些术语,就必须决定如何对这些术语进行编码。这是你遇到麻烦的阶段。

一般规则是,patsy遍历每个包含绝对因素的术语,并计算出它可以使用的一组列,这些列将使模型足够灵活,可以包括指定的交互,但不会与已经添加的任何术语重复使用。

在这种情况下,您的麻烦是由patsy默认添加的截取术语引起的:event_int ~ time_day:country被解释为event_int ~ 1 + time_day:country。这告诉patsy,您希望有一个列表示单独的截获项,然后是包含交互的第二组列--但是这些列与拦截不重叠。显然的虚拟编码方法是time_daycountry都是冗余的(共线的)与拦截,所以patsy找到了一个有点复杂的方案,没有这个属性。如果您删除拦截,那么您告诉patsy,它可以继续使用简单的方案,所以它做到了。

关于patsy如何选择编码方案的详细说明如下:http://patsy.readthedocs.org/en/latest/formulas.html#redundancy-and-categorical-factors

手册部分的第一部分可能有太多的数学,但是如果你向下滚动,有一些很好的图表,可以让它更清楚地说明发生了什么(并为数学提供了一些上下文)。如果您搜索y ~ 1 + a:b,您将看到一个图表,它具体显示了当您键入event_int ~ time_day:country时所处的情况。如果您搜索y ~ 1 + a + b + a:b,您将看到event_int ~ time_day*country案例中正在发生的事情的图片。

除了查看X.columns之外,查看X.design_info.term_namesX.design_info.term_slices也很有用,它们向您展示了patsy认为存在哪些“术语”以及它们对应的列。(aa:b是术语,每个术语都生成多个列。)y ~ 1 + a:b图中的粗轮廓表示,在这种情况下,单项a:b生成两组列:一组列编码b和处理编码,另一组列编码虚拟编码b和处理编码a的成对产品。

最后,解释输出的两个技巧:(1)您可以确定patsy实际上是将这些因素看作是绝对的,因为列名看起来类似于varname[something involving the var's value]。数值因素要么类似于varname,要么(在少有的情况下,将2d矩阵作为预测器传递) varname[column index]。(2)注意country[T.HKG]country[HKG]的区别-前者表明patsy使用的是降低级别的“处理”编码来避免冗余,而后者则表示简单的虚拟编码。当然,就单个列而言,这些列是相同的,但从概念上讲,差异是非常重要的-- T.模式意味着它删除了其中一个列(注意没有country[T.AUS]),所以像您考虑的那样对列进行细分将不会奏效!

希望这能有所帮助!

票数 9
EN

Stack Overflow用户

发布于 2014-05-15 13:36:32

看起来我们需要删除常量以避免引用编码。

代码语言:javascript
复制
    >>> f0 = 'event_int ~ C(time_day):C(country) - 1'
    >>> y,X0 = patsy.dmatrices(f0, df, return_type='dataframe')
    >>> print len(X0.columns)
    12
    >>> print X0.columns
    Index([u'C(time_day)[1]:C(country)[a]', u'C(time_day)[2]:C(country)[a]', 
           u'C(time_day)[3]:C(country)[a]', u'C(time_day)[4]:C(country)[a]', 
           u'C(time_day)[1]:C(country)[b]', u'C(time_day)[2]:C(country)[b]', 
           u'C(time_day)[3]:C(country)[b]', u'C(time_day)[4]:C(country)[b]', 
           u'C(time_day)[1]:C(country)[c]', u'C(time_day)[2]:C(country)[c]', 
           u'C(time_day)[3]:C(country)[c]', u'C(time_day)[4]:C(country)[c]'],
           dtype=object)

这是基于一个简单的例子,有3个国家和4个时间段:

代码语言:javascript
复制
import numpy as np
import pandas as pd
import patsy

time = np.tile([1, 2, 3, 4], 3)
country = np.repeat(['a', 'b', 'c'], 4)
event_int = np.random.randint(0, 2, size=len(time))

df = pd.DataFrame({'event_int':event_int, 'time_day':time, 'country':country})
票数 3
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/23672466

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档