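I have a dataframe with two columns that I intend to treat as categorical variables.
The first column is country, with values such as SGP, AUS, MYS. The second column is the time of day, with values in 24-hour format such as 00, 11, 14, 15. event is a binary variable with a 1/0 flag. I understand that to treat them as categorical I need to use patsy before running the logistic regression. I build the design matrices with dmatrices.
Use case: consider only the interaction effect of country and time_day (and other attributes, such as "operating system").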
f= 'event_int ~ time_day:country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
I expected to see only column names that combine both country and time_day, but that is not the case. I can get the subset manually by specifying X = X.ix[:,range(7,len(X.columns))], but that means hardcoding the offset for every dataset.
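My understanding was that A*B differs from A:B in that A:B does not list the A + B main effects, although I do not see A on its own (that is, just the categorical values of time_day) in the output above.
Also, when I do the following to explicitly exclude "country" on its own from the "X" dataframe, it does not work, and I get the same output as above.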
f='event_int ~ time_day:country-country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[AUS]', u'time_day[T.03]:country[AUS]', u'time_day[T.04]:country[AUS]', u'time_day[T.05]:country[AUS]', u'time_day[T.06]:country[AUS]', u'time_day[T.07]:country[AUS]', u'time_day[T.08]:country[AUS]', u'time_day[T.09]:country[AUS]', u'time_day[T.10]:country[AUS]', u'time_day[T.11]:country[AUS]', u'time_day[T.12]:country[AUS]', u'time_day[T.NA]:country[AUS]', u'time_day[T.02]:country[HKG]', u'time_day[T.03]:country[HKG]', u'time_day[T.04]:country[HKG]', u'time_day[T.05]:country[HKG]', u'time_day[T.06]:country[HKG]', u'time_day[T.07]:country[HKG]', u'time_day[T.08]:country[HKG]', u'time_day[T.09]:country[HKG]', u'time_day[T.10]:country[HKG]', u'time_day[T.11]:country[HKG]', u'time_day[T.12]:country[HKG]', u'time_day[T.NA]:country[HKG]', u'time_day[T.02]:country[IDN]', u'time_day[T.03]:country[IDN]', u'time_day[T.04]:country[IDN]', u'time_day[T.05]:country[IDN]', u'time_day[T.06]:country[IDN]', u'time_day[T.07]:country[IDN]', u'time_day[T.08]:country[IDN]', u'time_day[T.09]:country[IDN]', u'time_day[T.10]:country[IDN]', u'time_day[T.11]:country[IDN]', u'time_day[T.12]:country[IDN]', u'time_day[T.NA]:country[IDN]', u'time_day[T.02]:country[IND]', u'time_day[T.03]:country[IND]', u'time_day[T.04]:country[IND]', u'time_day[T.05]:country[IND]', u'time_day[T.06]:country[IND]', u'time_day[T.07]:country[IND]', u'time_day[T.08]:country[IND]', u'time_day[T.09]:country[IND]', u'time_day[T.10]:country[IND]', u'time_day[T.11]:country[IND]', u'time_day[T.12]:country[IND]', u'time_day[T.NA]:country[IND]', u'time_day[T.02]:country[MYS]', u'time_day[T.03]:country[MYS]', u'time_day[T.04]:country[MYS]', u'time_day[T.05]:country[MYS]', u'time_day[T.06]:country[MYS]', u'time_day[T.07]:country[MYS]', u'time_day[T.08]:country[MYS]', u'time_day[T.09]:country[MYS]', u'time_day[T.10]:country[MYS]', 
u'time_day[T.11]:country[MYS]', u'time_day[T.12]:country[MYS]', u'time_day[T.NA]:country[MYS]', u'time_day[T.02]:country[NZL]', u'time_day[T.03]:country[NZL]', u'time_day[T.04]:country[NZL]', u'time_day[T.05]:country[NZL]', u'time_day[T.06]:country[NZL]', u'time_day[T.07]:country[NZL]', u'time_day[T.08]:country[NZL]', u'time_day[T.09]:country[NZL]', u'time_day[T.10]:country[NZL]', u'time_day[T.11]:country[NZL]', u'time_day[T.12]:country[NZL]', u'time_day[T.NA]:country[NZL]', u'time_day[T.02]:country[PHL]', u'time_day[T.03]:country[PHL]', u'time_day[T.04]:country[PHL]', u'time_day[T.05]:country[PHL]', u'time_day[T.06]:country[PHL]', u'time_day[T.07]:country[PHL]', u'time_day[T.08]:country[PHL]', u'time_day[T.09]:country[PHL]', u'time_day[T.10]:country[PHL]', u'time_day[T.11]:country[PHL]', u'time_day[T.12]:country[PHL]', u'time_day[T.NA]:country[PHL]', u'time_day[T.02]:country[SGP]', u'time_day[T.03]:country[SGP]', u'time_day[T.04]:country[SGP]', u'time_day[T.05]:country[SGP]', u'time_day[T.06]:country[SGP]', u'time_day[T.07]:country[SGP]', u'time_day[T.08]:country[SGP]', u'time_day[T.09]:country[SGP]', ...], dtype='object')
This makes me think that ":" is a reduced form of "*", because it misses out one of the categorical variables. Or is it unable to understand that both are categorical variables?
f='event_int ~ time_day*country'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'time_day[T.02]', u'time_day[T.03]', u'time_day[T.04]', u'time_day[T.05]', u'time_day[T.06]', u'time_day[T.07]', u'time_day[T.08]', u'time_day[T.09]', u'time_day[T.10]', u'time_day[T.11]', u'time_day[T.12]', u'time_day[T.NA]', u'country[T.HKG]', u'country[T.IDN]', u'country[T.IND]', u'country[T.MYS]', u'country[T.NZL]', u'country[T.PHL]', u'country[T.SGP]', u'time_day[T.02]:country[T.HKG]', u'time_day[T.03]:country[T.HKG]', u'time_day[T.04]:country[T.HKG]', u'time_day[T.05]:country[T.HKG]', u'time_day[T.06]:country[T.HKG]', u'time_day[T.07]:country[T.HKG]', u'time_day[T.08]:country[T.HKG]', u'time_day[T.09]:country[T.HKG]', u'time_day[T.10]:country[T.HKG]', u'time_day[T.11]:country[T.HKG]', u'time_day[T.12]:country[T.HKG]', u'time_day[T.NA]:country[T.HKG]', u'time_day[T.02]:country[T.IDN]', u'time_day[T.03]:country[T.IDN]', u'time_day[T.04]:country[T.IDN]', u'time_day[T.05]:country[T.IDN]', u'time_day[T.06]:country[T.IDN]', u'time_day[T.07]:country[T.IDN]', u'time_day[T.08]:country[T.IDN]', u'time_day[T.09]:country[T.IDN]', u'time_day[T.10]:country[T.IDN]', u'time_day[T.11]:country[T.IDN]', u'time_day[T.12]:country[T.IDN]', u'time_day[T.NA]:country[T.IDN]', u'time_day[T.02]:country[T.IND]', u'time_day[T.03]:country[T.IND]', u'time_day[T.04]:country[T.IND]', u'time_day[T.05]:country[T.IND]', u'time_day[T.06]:country[T.IND]', u'time_day[T.07]:country[T.IND]', u'time_day[T.08]:country[T.IND]', u'time_day[T.09]:country[T.IND]', u'time_day[T.10]:country[T.IND]', u'time_day[T.11]:country[T.IND]', u'time_day[T.12]:country[T.IND]', u'time_day[T.NA]:country[T.IND]', u'time_day[T.02]:country[T.MYS]', u'time_day[T.03]:country[T.MYS]', u'time_day[T.04]:country[T.MYS]', u'time_day[T.05]:country[T.MYS]', u'time_day[T.06]:country[T.MYS]', u'time_day[T.07]:country[T.MYS]', u'time_day[T.08]:country[T.MYS]', u'time_day[T.09]:country[T.MYS]', u'time_day[T.10]:country[T.MYS]', u'time_day[T.11]:country[T.MYS]', u'time_day[T.12]:country[T.MYS]', 
u'time_day[T.NA]:country[T.MYS]', u'time_day[T.02]:country[T.NZL]', u'time_day[T.03]:country[T.NZL]', u'time_day[T.04]:country[T.NZL]', u'time_day[T.05]:country[T.NZL]', u'time_day[T.06]:country[T.NZL]', u'time_day[T.07]:country[T.NZL]', u'time_day[T.08]:country[T.NZL]', u'time_day[T.09]:country[T.NZL]', u'time_day[T.10]:country[T.NZL]', u'time_day[T.11]:country[T.NZL]', u'time_day[T.12]:country[T.NZL]', u'time_day[T.NA]:country[T.NZL]', u'time_day[T.02]:country[T.PHL]', u'time_day[T.03]:country[T.PHL]', u'time_day[T.04]:country[T.PHL]', u'time_day[T.05]:country[T.PHL]', u'time_day[T.06]:country[T.PHL]', u'time_day[T.07]:country[T.PHL]', u'time_day[T.08]:country[T.PHL]', u'time_day[T.09]:country[T.PHL]', u'time_day[T.10]:country[T.PHL]', u'time_day[T.11]:country[T.PHL]', u'time_day[T.12]:country[T.PHL]', u'time_day[T.NA]:country[T.PHL]', u'time_day[T.02]:country[T.SGP]', u'time_day[T.03]:country[T.SGP]', u'time_day[T.04]:country[T.SGP]', u'time_day[T.05]:country[T.SGP]', u'time_day[T.06]:country[T.SGP]', u'time_day[T.07]:country[T.SGP]', u'time_day[T.08]:country[T.SGP]', u'time_day[T.09]:country[T.SGP]', ...], dtype='object')
If I explicitly declare them as categorical variables, I get the following:
f='event_int ~ C(time_day):C(country)'
y,X = patsy.dmatrices(f, df, return_type='dataframe')
X.columns
Index([u'Intercept', u'C(country)[T.HKG]', u'C(country)[T.IDN]', u'C(country)[T.IND]', u'C(country)[T.MYS]', u'C(country)[T.NZL]', u'C(country)[T.PHL]', u'C(country)[T.SGP]', u'C(time_day)[T.02]:C(country)[AUS]', u'C(time_day)[T.03]:C(country)[AUS]', u'C(time_day)[T.04]:C(country)[AUS]', u'C(time_day)[T.05]:C(country)[AUS]', u'C(time_day)[T.06]:C(country)[AUS]', u'C(time_day)[T.07]:C(country)[AUS]', u'C(time_day)[T.08]:C(country)[AUS]', u'C(time_day)[T.09]:C(country)[AUS]', u'C(time_day)[T.10]:C(country)[AUS]', u'C(time_day)[T.11]:C(country)[AUS]', u'C(time_day)[T.12]:C(country)[AUS]', u'C(time_day)[T.NA]:C(country)[AUS]', u'C(time_day)[T.02]:C(country)[HKG]', u'C(time_day)[T.03]:C(country)[HKG]', u'C(time_day)[T.04]:C(country)[HKG]', u'C(time_day)[T.05]:C(country)[HKG]', u'C(time_day)[T.06]:C(country)[HKG]', u'C(time_day)[T.07]:C(country)[HKG]', u'C(time_day)[T.08]:C(country)[HKG]', u'C(time_day)[T.09]:C(country)[HKG]', u'C(time_day)[T.10]:C(country)[HKG]', u'C(time_day)[T.11]:C(country)[HKG]', u'C(time_day)[T.12]:C(country)[HKG]', u'C(time_day)[T.NA]:C(country)[HKG]', u'C(time_day)[T.02]:C(country)[IDN]', u'C(time_day)[T.03]:C(country)[IDN]', u'C(time_day)[T.04]:C(country)[IDN]', u'C(time_day)[T.05]:C(country)[IDN]', u'C(time_day)[T.06]:C(country)[IDN]', u'C(time_day)[T.07]:C(country)[IDN]', u'C(time_day)[T.08]:C(country)[IDN]', u'C(time_day)[T.09]:C(country)[IDN]', u'C(time_day)[T.10]:C(country)[IDN]', u'C(time_day)[T.11]:C(country)[IDN]', u'C(time_day)[T.12]:C(country)[IDN]', u'C(time_day)[T.NA]:C(country)[IDN]', u'C(time_day)[T.02]:C(country)[IND]', u'C(time_day)[T.03]:C(country)[IND]', u'C(time_day)[T.04]:C(country)[IND]', u'C(time_day)[T.05]:C(country)[IND]', u'C(time_day)[T.06]:C(country)[IND]', u'C(time_day)[T.07]:C(country)[IND]', u'C(time_day)[T.08]:C(country)[IND]', u'C(time_day)[T.09]:C(country)[IND]', u'C(time_day)[T.10]:C(country)[IND]', u'C(time_day)[T.11]:C(country)[IND]', u'C(time_day)[T.12]:C(country)[IND]', u'C(time_day)[T.NA]:C(country)[IND]', 
u'C(time_day)[T.02]:C(country)[MYS]', u'C(time_day)[T.03]:C(country)[MYS]', u'C(time_day)[T.04]:C(country)[MYS]', u'C(time_day)[T.05]:C(country)[MYS]', u'C(time_day)[T.06]:C(country)[MYS]', u'C(time_day)[T.07]:C(country)[MYS]', u'C(time_day)[T.08]:C(country)[MYS]', u'C(time_day)[T.09]:C(country)[MYS]', u'C(time_day)[T.10]:C(country)[MYS]', u'C(time_day)[T.11]:C(country)[MYS]', u'C(time_day)[T.12]:C(country)[MYS]', u'C(time_day)[T.NA]:C(country)[MYS]', u'C(time_day)[T.02]:C(country)[NZL]', u'C(time_day)[T.03]:C(country)[NZL]', u'C(time_day)[T.04]:C(country)[NZL]', u'C(time_day)[T.05]:C(country)[NZL]', u'C(time_day)[T.06]:C(country)[NZL]', u'C(time_day)[T.07]:C(country)[NZL]', u'C(time_day)[T.08]:C(country)[NZL]', u'C(time_day)[T.09]:C(country)[NZL]', u'C(time_day)[T.10]:C(country)[NZL]', u'C(time_day)[T.11]:C(country)[NZL]', u'C(time_day)[T.12]:C(country)[NZL]', u'C(time_day)[T.NA]:C(country)[NZL]', u'C(time_day)[T.02]:C(country)[PHL]', u'C(time_day)[T.03]:C(country)[PHL]', u'C(time_day)[T.04]:C(country)[PHL]', u'C(time_day)[T.05]:C(country)[PHL]', u'C(time_day)[T.06]:C(country)[PHL]', u'C(time_day)[T.07]:C(country)[PHL]', u'C(time_day)[T.08]:C(country)[PHL]', u'C(time_day)[T.09]:C(country)[PHL]', u'C(time_day)[T.10]:C(country)[PHL]', u'C(time_day)[T.11]:C(country)[PHL]', u'C(time_day)[T.12]:C(country)[PHL]', u'C(time_day)[T.NA]:C(country)[PHL]', u'C(time_day)[T.02]:C(country)[SGP]', u'C(time_day)[T.03]:C(country)[SGP]', u'C(time_day)[T.04]:C(country)[SGP]', u'C(time_day)[T.05]:C(country)[SGP]', u'C(time_day)[T.06]:C(country)[SGP]', u'C(time_day)[T.07]:C(country)[SGP]', u'C(time_day)[T.08]:C(country)[SGP]', u'C(time_day)[T.09]:C(country)[SGP]', ...], dtype='object')
Questions:
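1. How do I include only the interaction effect and nothing else for these variables?
2. Why does the exclusion of country (- country) not happen in the second case?
Related: Statsmodels formula API (patsy): How to exclude a subset of interaction components?
Edit: based on the answer by @Nathaniel J. Smith below, I ran the following to sort out my own troubleshooting: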
f2='event_int ~ country:time_day'
y2,X2 = patsy.dmatrices(f2, df, return_type='dataframe')
X2.design_info.term_names
['Intercept', 'country:time_day']
f1='event_int ~ country:time_day-1'
y1,X1 = patsy.dmatrices(f1, df, return_type='dataframe')
X1.design_info.term_names
['country:time_day']
Posted on 2014-05-15 14:01:18
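Short answer: try event_int ~ -1 + time_day:country
Longer answer:
The first thing to understand is that patsy decides how to build a design matrix in two distinct stages. First, it works out which terms to include. Terms are things like a or a:b. (In a:b, a and b are called factors; the term a contains a single factor, which is also spelled a.) Working out which terms exist requires expanding and simplifying the formula you gave until you have an expression that uses only + and :. a*b expands to a + b + a:b, and so on. Subtraction is an operation that happens at this stage: a + b - a simplifies to just b. So a*b - a expands to a + b + a:b - a, which simplifies to b + a:b. But a:b - a is the same as a:b, because there is no a to subtract, so the - a is ignored. That is why writing time_day:country - country is the same as writing time_day:country.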
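The term-expansion stage can be inspected without any data. The following is a minimal sketch, assuming patsy's ModelDesc.from_formula and Term.name() behave as documented; y, a and b are placeholder names, and rhs_terms is a hypothetical helper written for this illustration.

```python
import patsy

def rhs_terms(formula):
    """Return the right-hand-side term names patsy derives from a formula string."""
    desc = patsy.ModelDesc.from_formula(formula)
    return [term.name() for term in desc.rhs_termlist]

print(rhs_terms("y ~ a*b - a"))  # the main effect a is subtracted from a + b + a:b
print(rhs_terms("y ~ a:b - a"))  # there is no a term to subtract
print(rhs_terms("y ~ a:b"))      # same terms as the previous line
print(rhs_terms("y ~ a:b - 1"))  # - 1 removes the implicit intercept
```

Only the symbolic formula is parsed here, so no dataframe is needed to see which terms survive.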
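Then, in the second stage, once patsy has decided which terms to include, it has to decide how to code them. This is the stage where you are running into trouble.
The general rule is that patsy goes through each term that contains categorical factors and works out a set of columns that makes the model flexible enough to include the specified interaction, but that is not redundant with any terms already added.
In this case, your trouble is caused by the intercept term that patsy adds by default: event_int ~ time_day:country is interpreted as event_int ~ 1 + time_day:country. This tells patsy that you want one column for the intercept on its own, followed by a second set of columns that contains the interaction but does not overlap with the intercept. The obvious approach of dummy-coding both time_day and country is redundant (collinear) with the intercept, so patsy finds a somewhat more complicated scheme that does not have this property. If you remove the intercept, you are telling patsy that it can go ahead and use the simple scheme, and so it does.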
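To make the effect of the intercept concrete, here is a small sketch on made-up data (3 countries, 4 time periods, one row per combination). The dataframe and variable names are assumptions for illustration only; they are not the asker's data.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data: every (time_day, country) pair appears exactly once.
df = pd.DataFrame({
    "time_day": np.tile([1, 2, 3, 4], 3),
    "country": np.repeat(["a", "b", "c"], 4),
})

# Default: patsy adds an intercept and picks a coding that avoids redundancy with it.
with_intercept = patsy.dmatrix("C(time_day):C(country)", df, return_type="dataframe")
# "- 1" drops the intercept, so plain dummy coding of the interaction can be used.
no_intercept = patsy.dmatrix("C(time_day):C(country) - 1", df, return_type="dataframe")

print(list(with_intercept.columns))
print(list(no_intercept.columns))

# On this toy data both matrices have the same number of columns and the same rank.
print(with_intercept.shape[1], no_intercept.shape[1])
print(np.linalg.matrix_rank(with_intercept.values),
      np.linalg.matrix_rank(no_intercept.values))
```

The two design matrices span the same model; only the parameterization of the columns differs.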
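A detailed explanation of how patsy chooses coding schemes is here: http://patsy.readthedocs.org/en/latest/formulas.html#redundancy-and-categorical-factors
The first part of that manual section may have a bit too much math, but if you scroll down there are some nice diagrams that make it clearer what is going on (and give some context for the math). If you search for y ~ 1 + a:b, you will see a diagram showing exactly the situation you are in when you type event_int ~ time_day:country. If you search for y ~ 1 + a + b + a:b, you will see a picture of what is happening in the event_int ~ time_day*country case.
Besides looking at X.columns, it is also useful to look at X.design_info.term_names and X.design_info.term_slices, which show you which "terms" patsy thinks exist and which columns they correspond to. (a and a:b are terms, and each term can generate multiple columns.) The thick outline in the y ~ 1 + a:b diagram shows that in this case the single term a:b generates two sets of columns: one set that codes b with treatment coding, and another set that codes the pairwise products of dummy-coded b and treatment-coded a.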
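As a rough illustration of reading terms and their columns, the sketch below uses design_info.term_names together with design_info.term_name_slices (the name-keyed counterpart of term_slices). The toy dataframe is hypothetical and stands in for the real data.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical toy data: 3 countries x 4 time periods.
df = pd.DataFrame({
    "time_day": np.tile([1, 2, 3, 4], 3),
    "country": np.repeat(["a", "b", "c"], 4),
})

X = patsy.dmatrix("C(time_day):C(country)", df, return_type="dataframe")
info = X.design_info

# Only two terms exist, even though the matrix has many columns.
print(info.term_names)

# Map each term name to the columns it generated.
for name, cols in info.term_name_slices.items():
    print(name, "->", list(X.columns[cols]))
```

Here the single interaction term owns every column except the intercept, including the columns that look like a main effect of country.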
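Finally, two tips for interpreting the output: (1) You can tell that patsy really is treating these factors as categorical, because the column names look like varname[something involving the var's value]. Numerical factors look either like varname or (in the rare case where you pass a 2-d matrix as a predictor) varname[column index]. (2) Note the difference between country[T.HKG] and country[HKG]: the former indicates that patsy is using reduced-rank "treatment" coding to avoid redundancy, while the latter indicates simple dummy coding. Taken as individual columns they are identical, but conceptually the difference matters: the T. pattern means that one of the columns has been dropped (note that there is no country[T.AUS]), so subsetting the columns the way you were considering will not work!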
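The naming conventions can be seen on a tiny made-up example. The column names g (categorical) and x (numeric) and the values below are assumptions chosen purely for illustration.

```python
import pandas as pd
import patsy

# Hypothetical data: g is categorical (strings), x is numeric.
df = pd.DataFrame({"g": ["a", "b", "c", "a"], "x": [0.5, 1.0, 1.5, 2.0]})

# With the default intercept, g gets reduced-rank treatment coding: g[T.b], g[T.c].
reduced = patsy.dmatrix("g + x", df, return_type="dataframe")
# Without an intercept, g gets full dummy coding: g[a], g[b], g[c].
full = patsy.dmatrix("0 + g + x", df, return_type="dataframe")

print(list(reduced.columns))
print(list(full.columns))
```

In both cases the numeric factor appears simply as x, while the categorical factor carries a bracketed level, with the T. prefix only when a reference level has been dropped.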
Hope that helps!
Posted on 2014-05-15 13:36:32
It looks like we need to drop the constant to avoid the reference coding.
>>> f0 = 'event_int ~ C(time_day):C(country) - 1'
>>> y,X0 = patsy.dmatrices(f0, df, return_type='dataframe')
>>> print len(X0.columns)
12
>>> print X0.columns
Index([u'C(time_day)[1]:C(country)[a]', u'C(time_day)[2]:C(country)[a]',
u'C(time_day)[3]:C(country)[a]', u'C(time_day)[4]:C(country)[a]',
u'C(time_day)[1]:C(country)[b]', u'C(time_day)[2]:C(country)[b]',
u'C(time_day)[3]:C(country)[b]', u'C(time_day)[4]:C(country)[b]',
u'C(time_day)[1]:C(country)[c]', u'C(time_day)[2]:C(country)[c]',
u'C(time_day)[3]:C(country)[c]', u'C(time_day)[4]:C(country)[c]'],
dtype=object)
This is based on a simple example with 3 countries and 4 time periods:
import numpy as np
import pandas as pd
import patsy
time = np.tile([1, 2, 3, 4], 3)
country = np.repeat(['a', 'b', 'c'], 4)
event_int = np.random.randint(0, 2, size=len(time))
df = pd.DataFrame({'event_int':event_int, 'time_day':time, 'country':country})
https://stackoverflow.com/questions/23672466