I have the following input df:
domain ip timestamp
0 Google 101 2020-04-01 23:01:41
1 Google 101 2020-04-01 23:01:59
2 Google 101 2020-04-02 12:01:41
3 Facebook 101 2020-04-02 13:11:33
4 Facebook 101 2020-04-02 13:11:35
5 Youtube 103 2020-04-21 13:01:41
6 Youtube 103 2020-04-21 13:11:46
7 Youtube 103 2020-04-22 01:01:01
8 Google 103 2020-04-22 02:11:23
9 Facebook 103 2020-04-23 14:11:13
10 Youtube 103 2020-04-23 14:11:55
How can I get this output? Here domain_num is a counter that increments every time the domain changes within an ip.
domain ip timestamp domain_num
0 Google 101 2020-04-01 23:01:41 1
1 Google 101 2020-04-01 23:01:59 1
2 Google 101 2020-04-02 12:01:41 1
3 Facebook 101 2020-04-02 13:11:33 2
4 Facebook 101 2020-04-02 13:11:35 2
5 Youtube 103 2020-04-21 13:01:41 1
6 Youtube 103 2020-04-21 13:11:46 1
7 Youtube 103 2020-04-22 01:01:01 1
8 Google 103 2020-04-22 02:11:23 2
9 Facebook 103 2020-04-23 14:11:13 3
10 Youtube 103 2020-04-23 14:11:55 4
I tried the following, which produces the counter, but I need it computed per ip:
df['domain'].ne(df['domain'].shift()).cumsum()
The code below raises an error:
df.groupby('ip').apply(lambda x : x[x.domain.ne(x.domain.shift().cumsum())])
Data:
import pandas as pd
data = {'domain':['Google', 'Google', 'Google', 'Facebook', 'Facebook', 'Youtube', 'Youtube', 'Youtube', 'Google', 'Facebook', 'Youtube'],
'ip':[101, 101, 101, 101, 101, 103, 103, 103, 103, 103, 103],
'timestamp' : ['2020-04-01 23:01:41', '2020-04-01 23:01:59', '2020-04-02 12:01:41', '2020-04-02 13:11:33',
'2020-04-02 13:11:35', '2020-04-21 13:01:41', '2020-04-21 13:11:46',
'2020-04-22 01:01:01', '2020-04-22 02:11:23','2020-04-23 14:11:13', '2020-04-23 14:11:55' ]}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
Answered 2022-01-18 17:25:54
Assuming your data is sorted by the timestamp column:
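# Within each ip group, flag rows whose domain differs from the previous row, then take the running total.
inc_domain_num = lambda x: x.ne(x.shift()).cumsum()
# transform keeps the original index, so the result aligns with df on assignment.
df['domain_num'] = df.groupby('ip')['domain'].transform(inc_domain_num)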
print(df)
# Output
domain ip timestamp domain_num
0 Google 101 2020-04-01 23:01:41 1
1 Google 101 2020-04-01 23:01:59 1
2 Google 101 2020-04-02 12:01:41 1
3 Facebook 101 2020-04-02 13:11:33 2
4 Facebook 101 2020-04-02 13:11:35 2
5 Youtube 103 2020-04-21 13:01:41 1
6 Youtube 103 2020-04-21 13:11:46 1
7 Youtube 103 2020-04-22 01:01:01 1
8 Google 103 2020-04-22 02:11:23 2
9 Facebook 103 2020-04-23 14:11:13 3
10 Youtube 103 2020-04-23 14:11:55 4
Answered 2022-01-18 17:32:02
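Assuming the rows for each ip are contiguous (the groups themselves need not be in sorted order), first flag every position where the counter should increment:
df['domain_num'] = (df['domain'] != df['domain'].shift(1)) | (df['ip'] != df['ip'].shift(1))
Now replace the flags with their cumulative sum, computed independently for each ip group:
df['domain_num'] = df.groupby('ip')['domain_num'].cumsum()
https://stackoverflow.com/questions/70759785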
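Both answers rely on an ordering assumption (rows sorted by timestamp, or rows of each ip kept together). If that cannot be guaranteed, one possible sketch is to sort first and compare each row with the previous row of the same ip. The helper name add_domain_num is hypothetical, not something from either answer, and the sketch assumes the frame has a unique index so the original row order can be restored afterwards.

```python
import pandas as pd


def add_domain_num(df):
    """Hypothetical helper: number the domain switches within each ip."""
    # Order rows by ip, then chronologically, so "previous row" is meaningful.
    out = df.sort_values(["ip", "timestamp"]).copy()
    # Previous domain within the same ip; NaN on the first row of each ip.
    prev_domain = out.groupby("ip")["domain"].shift()
    # True where the domain differs from the previous one (always True on first rows).
    changed = out["domain"].ne(prev_domain)
    # Running count of changes, restarted for every ip.
    out["domain_num"] = changed.astype(int).groupby(out["ip"]).cumsum()
    # Restore the original row order (assumes a unique index).
    return out.sort_index()


data = {
    "domain": ["Google", "Google", "Google", "Facebook", "Facebook", "Youtube",
               "Youtube", "Youtube", "Google", "Facebook", "Youtube"],
    "ip": [101, 101, 101, 101, 101, 103, 103, 103, 103, 103, 103],
    "timestamp": ["2020-04-01 23:01:41", "2020-04-01 23:01:59", "2020-04-02 12:01:41",
                  "2020-04-02 13:11:33", "2020-04-02 13:11:35", "2020-04-21 13:01:41",
                  "2020-04-21 13:11:46", "2020-04-22 01:01:01", "2020-04-22 02:11:23",
                  "2020-04-23 14:11:13", "2020-04-23 14:11:55"],
}
df = pd.DataFrame(data)
df["timestamp"] = pd.to_datetime(df["timestamp"])

result = add_domain_num(df)
print(result["domain_num"].tolist())
# [1, 1, 1, 2, 2, 1, 1, 1, 2, 3, 4]

# The same numbering comes back even if the rows arrive shuffled.
shuffled_result = add_domain_num(df.sample(frac=1, random_state=0))
```

Because the comparison uses a per-ip shift, there is no need for the extra `df['ip'] != df['ip'].shift(1)` term: the first row of every ip has no predecessor and is therefore always counted as a change.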