I have this dataset (exported to a .csv file):
email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, growl@dog.com, dog.com
4, meow@cat.net, cat.net
5, purr@cat.net, cat.net,
6, sleep@cat.net, cat.net
7, scream@monkey.eu, monkey.eu
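8, run@horse.com, horse.com

As you can see, some links repeat, while the emails are always unique. I want to keep at most 2 rows for each link and drop the third and any later rows, like this: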
email, link
0,,
1, hello@dog.com, dog.com
2, bark@dog.com, dog.com
3, meow@cat.net, cat.net
4, purr@cat.net, cat.net,
5, scream@monkey.eu, monkey.eu
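6, run@horse.com, horse.com

How can I do this? I tried the solution below, but it only outputs the links. Merging the result back with the email addresses gets messy, because the filtered list no longer has the same length as the original column: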
from collections import Counter

def keep_n_dupes(remove_from, how_many):
    # Yield each item until it has been seen how_many times.
    counts = Counter()
    for item in remove_from:
        counts[item] += 1
        if counts[item] <= how_many:
            yield item

new_links = list(keep_n_dupes(df['link'], 2))

Answered 2019-11-04 19:01:26
df.groupby('link').head(2)
email link
0 hello@dog.com dog.com
1 bark@dog.com dog.com
3 meow@cat.net cat.net
4 purr@cat.net cat.net
6 scream@monkey.eu monkey.eu
7 run@horse.com horse.com

Answered 2019-11-04 19:27:45
Another approach is to use nth:
df.groupby('link', as_index=False).nth([0,1])
Out[587]:
email link
1 hello@dog.com dog.com
2 bark@dog.com dog.com
4 meow@cat.net cat.net
5 purr@cat.net cat.net
7 scream@monkey.eu monkey.eu
8 run@horse.com horse.com

Answered 2019-11-04 19:04:51
Pandas has a groupby feature for this:
import pandas as pd
df = pd.read_csv('path to the file')
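df.groupby('link').head(2)

The command above groups the rows by link and returns the first 2 rows of each group, keeping the email column alongside the link.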
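
For anyone who wants to try this without the original file, here is a minimal, self-contained sketch. It assumes the CSV has just the two columns email and link; the inline sample data is modelled on the question, with the empty row 0 left out, and the names csv_text, kept, mask and result are only illustrative. It also shows a boolean mask built with cumcount(), which should select the same rows and may be handy if you want to combine the condition with other filters.

```python
import io
import pandas as pd

# Sample data modelled on the question (assumption: the empty row 0 is omitted).
csv_text = """email,link
hello@dog.com,dog.com
bark@dog.com,dog.com
growl@dog.com,dog.com
meow@cat.net,cat.net
purr@cat.net,cat.net
sleep@cat.net,cat.net
scream@monkey.eu,monkey.eu
run@horse.com,horse.com
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep the first two rows of every link group; the original row order
# and index labels are preserved.
kept = df.groupby('link').head(2)

# Same selection as a boolean mask: cumcount() numbers the rows
# within each group starting from 0.
mask = df.groupby('link').cumcount() < 2
kept_via_mask = df[mask]

# Renumber the index so it runs 0..n-1, as in the desired output.
result = kept.reset_index(drop=True)
print(result)
```

Because whole rows are filtered rather than a single column, the emails stay attached to their links, which avoids the length mismatch described in the question.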
https://stackoverflow.com/questions/58692016