我想给一个函数一个任意的dataframe、dateindex和column,并要求它返回有多少个连续的前面的行(包括它本身)具有相同的值。我已经能够保存我的大部分熊猫代码的矢量化。想清楚我怎么能做到这一点。
下面是一个小玩具数据集,以及我希望从函数中得到什么输出的例子。
bar foo
2016-06-01 False True
2016-06-02 True False
2016-06-03 True True
2016-06-06 True False
2016-06-07 False False
2016-06-08 True False
2016-06-09 True False
2016-06-10 False True
2016-06-13 False True
2016-06-14 True True
import pandas as pd
rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False,False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]
d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df = pd.DataFrame(d)
"""
consec('foo','2016-06-09') => 4 # it's the fourth continuous 'False' in a row
consec('foo', '2016-06-08') => 3 # It's the third continuous'False' in a row
consec('bar', '2016-06-02') => 1 # It's the first continuou true in a row
consec('foo', '2016-06-14') => 3 # It's the third continuous True
"""==================
最后,我使用了迭代工具--下面的答案,做了一个小小的改动,因为它完全满足了我的需要(比我最初的问题规范稍微多了一点)。谢谢你的许多建议。
rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False,False, True, True, True]*10
colb = [False, True, True, True, False, True, True, False, False, True]*10
d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df2 = pd.DataFrame(d)
def make_new_col_of_consec(df,col_list):
for col_name in col_list:
lst = []
for state, repeat_values in itertools.groupby(df1[col_name]):
if state == True:
lst.extend([i+1 for i,v in enumerate(repeat_values)])
elif state == False:
lst.extend([0 for i,v in enumerate(repeat_values)])
df1[col_name + "_consec"] = lst
return df
print make_new_col_of_consec(df1,["bar","foo"])产出如下:
bar foo bar_consec foo_consec
2016-06-01 False True 0 1
2016-06-02 True False 1 0
2016-06-03 True True 2 1
2016-06-06 True False 3 0
2016-06-07 False False 0 0
2016-06-08 True False 1 0
2016-06-09 True False 2 0
2016-06-10 False True 0 1
2016-06-13 False True 0 2
2016-06-14 True True 1 3
2016-06-15 False True 0 4
2016-06-16 True False 1 0
2016-06-17 True True 2 1
2016-06-20 True False 3 0
2016-06-21 False False 0 0
2016-06-22 True False 1 0发布于 2016-06-30 14:09:29
下面是另一种方法,它为每一行创建一个具有相关的连续计数的新列。当dataframe有10000行时,我测试了它,花费了24 ms。它使用来自groupby的itertools。它利用了这样一个事实:每当键值(在本例中是foo和bar )发生变化时,就会创建一个中断,因此我们可以在那里使用索引。
rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False,False, True, True, True]*1000
colb = [False, True, True, True, False, True, True, False, False, True]*1000
d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)
def make_new_col_of_consec(df,col_list):
for col_name in col_list:
lst = []
for state, repeat_values in itertools.groupby(df1[col_name]):
lst.extend([i+1 for i,v in enumerate(repeat_values)])
df1[col_name + "_consec"] = lst
return df
print make_new_col_of_consec(df1,["bar","foo"])输出:
bar foo bar_consec foo_consec
2016-06-01 False True 1 1
2016-06-02 True False 1 1
2016-06-03 True True 2 1
2016-06-06 True False 3 1
2016-06-07 False False 1 2
2016-06-08 True False 1 3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop发布于 2016-06-30 13:37:40
试试这个:
In [135]: %paste
def consec(df, col, d):
return (df[:d].groupby((df[col] != df[col].shift())
.cumsum())[col]
.transform('size').tail(1)[0])
## -- End pasted text --
In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4
In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3
In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1
In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1解释:
In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01 1
2016-06-02 2
2016-06-03 3
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 5
2016-06-13 5
2016-06-14 5
Freq: B, Name: foo, dtype: int32
In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01 1
2016-06-02 1
2016-06-03 1
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 3
2016-06-13 3
2016-06-14 3
Freq: B, dtype: int64
In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14 3
Freq: B, dtype: int64发布于 2016-06-30 13:41:23
您可以使用:
#reorder index in df
df = df[::-1]
def consec(col, date):
#select df by date
df1 = df.ix[date:,:]
#get first group == 1
colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
return 'Value is ' + str(df1.ix[0,col]) + ', Len is: '+ str(len(df1[colconsec]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3另一种解决方案是通过colconsec为创建掩码找到iat系列的最后值:
def consec(col, date):
df1 = df.ix[:date,:]
colconsec = (df1[col] != df1[col].shift()).cumsum()
mask = colconsec == colconsec.iat[-1]
return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: '+ str(len(df1[mask]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3 https://stackoverflow.com/questions/38123904
复制相似问题