文章/答案/技术大牛

发布

社区首页 >问答首页 >如何很好地测量熊猫数据中相同数据的运行情况

问如何很好地测量熊猫数据中相同数据的运行情况
EN

Stack Overflow用户

提问于 2016-06-30 13:00:29

回答 3查看 151关注 0票数 4

我想给一个函数一个任意的dataframe、dateindex和column，并要求它返回有多少个连续的前面的行(包括它本身)具有相同的值。我已经能够保存我的大部分熊猫代码的矢量化。想清楚我怎么能做到这一点。

下面是一个小玩具数据集，以及我希望从函数中得到什么输出的例子。

            bar     foo
2016-06-01  False   True
2016-06-02  True    False
2016-06-03  True    True
2016-06-06  True    False
2016-06-07  False   False
2016-06-08  True    False
2016-06-09  True    False
2016-06-10  False   True
2016-06-13  False   True
2016-06-14  True    True


import pandas as pd

rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False,False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]

d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df = pd.DataFrame(d)    

"""
consec('foo','2016-06-09') => 4  # it's the fourth  continuous 'False' in a row
consec('foo', '2016-06-08') => 3  # It's the third continuous'False' in a row
consec('bar', '2016-06-02') => 1  # It's the first continuou true in a row
consec('foo', '2016-06-14') => 3  # It's the third continuous True
"""

==================

最后，我使用了迭代工具--下面的答案，做了一个小小的改动，因为它完全满足了我的需要(比我最初的问题规范稍微多了一点)。谢谢你的许多建议。

rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False,False, True, True, True]*10
colb = [False, True, True, True, False, True, True, False, False, True]*10

d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df2 = pd.DataFrame(d)    

def make_new_col_of_consec(df,col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df1[col_name]):
            if state == True:
                lst.extend([i+1 for i,v in enumerate(repeat_values)])
            elif state == False:
                lst.extend([0 for i,v in enumerate(repeat_values)])
        df1[col_name + "_consec"] = lst
    return df


print make_new_col_of_consec(df1,["bar","foo"])

产出如下：

              bar    foo  bar_consec  foo_consec
2016-06-01  False   True           0           1
2016-06-02   True  False           1           0
2016-06-03   True   True           2           1
2016-06-06   True  False           3           0
2016-06-07  False  False           0           0
2016-06-08   True  False           1           0
2016-06-09   True  False           2           0
2016-06-10  False   True           0           1
2016-06-13  False   True           0           2
2016-06-14   True   True           1           3
2016-06-15  False   True           0           4
2016-06-16   True  False           1           0
2016-06-17   True   True           2           1
2016-06-20   True  False           3           0
2016-06-21  False  False           0           0
2016-06-22   True  False           1           0

python

pandas

回答 3

Stack Overflow用户

回答已采纳

发布于 2016-06-30 14:09:29

下面是另一种方法，它为每一行创建一个具有相关的连续计数的新列。当dataframe有10000行时，我测试了它，花费了24 ms。它使用来自groupby的itertools。它利用了这样一个事实:每当键值(在本例中是foo和bar )发生变化时，就会创建一个中断，因此我们可以在那里使用索引。

rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False,False, True, True, True]*1000
colb = [False, True, True, True, False, True, True, False, False, True]*1000

d = {'foo':pd.Series(cola, index =rng), 'bar':pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)    

def make_new_col_of_consec(df,col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df1[col_name]):
            lst.extend([i+1 for i,v in enumerate(repeat_values)])
        df1[col_name + "_consec"] = lst
    return df


print make_new_col_of_consec(df1,["bar","foo"])

输出：

              bar    foo  bar_consec  foo_consec
2016-06-01  False   True           1           1
2016-06-02   True  False           1           1
2016-06-03   True   True           2           1
2016-06-06   True  False           3           1
2016-06-07  False  False           1           2
2016-06-08   True  False           1           3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop

票数 1

Stack Overflow用户

发布于 2016-06-30 13:37:40

试试这个：

In [135]: %paste
def consec(df, col, d):
    return (df[:d].groupby((df[col] != df[col].shift())
                           .cumsum())[col]
                  .transform('size').tail(1)[0])
## -- End pasted text --

In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4

In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3

In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1

In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1

解释：

In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01    1
2016-06-02    2
2016-06-03    3
2016-06-06    4
2016-06-07    4
2016-06-08    4
2016-06-09    4
2016-06-10    5
2016-06-13    5
2016-06-14    5
Freq: B, Name: foo, dtype: int32

In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01    1
2016-06-02    1
2016-06-03    1
2016-06-06    4
2016-06-07    4
2016-06-08    4
2016-06-09    4
2016-06-10    3
2016-06-13    3
2016-06-14    3
Freq: B, dtype: int64

In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14    3
Freq: B, dtype: int64

票数 1

Stack Overflow用户

发布于 2016-06-30 13:41:23

您可以使用：

#reorder index in df
df = df[::-1]

def consec(col, date):
    #select df by date
    df1 = df.ix[date:,:]
    #get first group == 1
    colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
    return 'Value is ' + str(df1.ix[0,col]) + ', Len is: '+ str(len(df1[colconsec]))

print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08')) 
print (consec('bar', '2016-06-02'))   
print (consec('foo', '2016-06-14'))   
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3

另一种解决方案是通过colconsec为创建掩码找到iat系列的最后值：

def consec(col, date):
    df1 = df.ix[:date,:]
    colconsec = (df1[col] != df1[col].shift()).cumsum() 
    mask = colconsec == colconsec.iat[-1]
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: '+ str(len(df1[mask]))

print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08')) 
print (consec('bar', '2016-06-02'))   
print (consec('foo', '2016-06-14')) 
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/38123904

复制

相似问题

问如何很好地测量熊猫数据中相同数据的运行情况
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何很好地测量熊猫数据中相同数据的运行情况EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何很好地测量熊猫数据中相同数据的运行情况
EN