在this question上,我有一个类似于这样的数据,我想创建一个新的列next_domain和next_next_domain。
它是通过根据时间戳查找IP的下一个域(和下一个域)来计算的。我怎样才能在熊猫身上做到这一点?
输入:
domain ip timestamp
0 Google 101 2020-04-01 23:01:41
1 Google 101 2020-04-01 23:01:59
2 Google 101 2020-04-02 12:01:41
3 Facebook 101 2020-04-02 13:11:33
4 Facebook 101 2020-04-02 13:11:35
5 Youtube 103 2020-04-21 13:01:41
6 Youtube 103 2020-04-21 13:11:46
7 Youtube 103 2020-04-22 01:01:01
8 Google 103 2020-04-22 02:11:23
9 Facebook 103 2020-04-23 14:11:13
10 Youtube 103 2020-04-23 14:11:55预期输出:
domain ip timestamp next_domain next_next_domain
0 Google 101 2020-04-01 23:01:41 Facebook N/A
1 Google 101 2020-04-01 23:01:59 Facebook N/A
2 Google 101 2020-04-02 12:01:41 Facebook N/A
3 Facebook 101 2020-04-02 13:11:33 N/A N/A
4 Facebook 101 2020-04-02 13:11:35 N/A N/A
5 Youtube 103 2020-04-21 13:01:41 Google Facebook
6 Youtube 103 2020-04-21 13:11:46 Google Facebook
7 Youtube 103 2020-04-22 01:01:01 Google Facebook
8 Google 103 2020-04-22 02:11:23 Facebook Youtube
9 Facebook 103 2020-04-23 14:11:13 Youtube N/A
10 Youtube 103 2020-04-23 14:11:55 N/A N/A发布于 2022-01-25 17:13:24
在前面问题的my answer基础上,您可以创建一个函数并迭代该过程:
def next_domain(df, col='domain', group='ip'):
s = df[col]
return (s.where(s.ne(s.shift())) # keep only first domain of each stretch
.groupby(df[group]) # per group
.apply(lambda s: s.bfill().shift(-1)) # bfill and shift up
)
df['next_domain'] = next_domain(df, 'domain')
df['next_next_domain'] = next_domain(df, 'next_domain')产出:
domain ip timestamp next_domain next_next_domain
0 Google 101 2020-04-01 23:01:41 Facebook NaN
1 Google 101 2020-04-01 23:01:59 Facebook NaN
2 Google 101 2020-04-02 12:01:41 Facebook NaN
3 Facebook 101 2020-04-02 13:11:33 NaN NaN
4 Facebook 101 2020-04-02 13:11:35 NaN NaN
5 Youtube 103 2020-04-21 13:01:41 Google Facebook
6 Youtube 103 2020-04-21 13:11:46 Google Facebook
7 Youtube 103 2020-04-22 01:01:01 Google Facebook
8 Google 103 2020-04-22 02:11:23 Facebook Youtube
9 Facebook 103 2020-04-23 14:11:13 Youtube NaN
10 Youtube 103 2020-04-23 14:11:55 NaN NaNhttps://stackoverflow.com/questions/70852633
复制相似问题