I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'token': {0: 'FATHER', 1: 'MILTON', 2: 'IAN', 3: 'SMITH', 4: '.', 5: 'JOINTLY', 6: 'WITH', 7: 'BROTHER', 8: 'GREG', 9: 'I', 10: 'SMITH'}, 'tag': {0: 'O', 1: 'PERSON', 2: 'PERSON', 3: 'PERSON', 4: 'O', 5: 'O', 6: 'O', 7: 'O', 8: 'PERSON', 9: 'PERSON', 10: 'PERSON'}})
token tag
0 FATHER O
1 MILTON PERSON
2 IAN PERSON
3 SMITH PERSON
4 . O
5 JOINTLY O
6 WITH O
7 BROTHER O
8 GREG PERSON
9 I PERSON
10 SMITH PERSON

What I want to do is group all consecutive rows tagged PERSON and join their tokens.
Expected output:
token
0 MILTON IAN SMITH
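1 GREG I SMITH

Answered 2019-08-22 21:08:23
Create a unique group id for each run of consecutive PERSON values with shift and cumsum, keep only the matching rows with a boolean mask, then pass the result to groupby and concatenate each group with GroupBy.apply and join.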
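To see why shift plus cumsum yields one id per run, it can help to print the intermediate Series side by side. The following is only a sketch of that idea: the DataFrame is the one from the question written with plain lists, and the names is_person, run_start, run_id and steps are mine, not part of the answer.

```python
import pandas as pd

# Same data as in the question, written as plain lists.
df = pd.DataFrame({
    'token': ['FATHER', 'MILTON', 'IAN', 'SMITH', '.', 'JOINTLY',
              'WITH', 'BROTHER', 'GREG', 'I', 'SMITH'],
    'tag': ['O', 'PERSON', 'PERSON', 'PERSON', 'O', 'O',
            'O', 'O', 'PERSON', 'PERSON', 'PERSON'],
})

# True where the row is tagged PERSON.
is_person = df['tag'].eq('PERSON')

# True wherever the mask differs from the previous row, i.e. at the
# first row of every run. Row 0 is compared against NaN, so it always
# counts as a change.
run_start = is_person.ne(is_person.shift())

# Running count of run starts: all rows of one run share the same id.
run_id = run_start.cumsum()

# Hypothetical helper frame, only for inspecting the steps.
steps = pd.DataFrame({'token': df['token'], 'is_person': is_person,
                      'run_start': run_start, 'run_id': run_id})
print(steps)

# Keeping only the PERSON rows leaves one id per PERSON run.
print(run_id[is_person].unique())
```

Every run gets an id here, including the non-PERSON runs, which is why the mask is applied afterwards to discard them.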
m = df['tag'].eq('PERSON')
s = m.ne(m.shift()).cumsum()[m]
df = df.groupby(s)['token'].apply(' '.join).reset_index(drop=True).to_frame('token')
print (df)
token
0 MILTON IAN SMITH
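1 GREG I SMITH

Answered 2019-08-22 22:35:52
jezrael's answer is already good enough; I'll just throw in another solution here. The key is to create a label for each group of PERSON rows.
Create the groups:
group = df['tag'].ne('PERSON').cumsum().where(df['tag'].eq('PERSON'))
Output: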
0 NaN
1 1.0
2 1.0
3 1.0
4 NaN
5 NaN
6 NaN
7 NaN
8 5.0
9 5.0
10 5.0
Then:
df['token'].groupby(group).apply(' '.join).reset_index(drop=True)
Note that groupby automatically drops the rows whose group label is NaN.
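As a quick sanity check, the two grouping keys from this thread can be built on the same data and compared. This is a rough sketch rather than a benchmark; the names key_a, key_b, joined_a and joined_b are mine. In both cases the non-PERSON rows end up with a missing key (missing after index alignment for key_a, explicit NaN for key_b), and groupby skips them with its default settings.

```python
import pandas as pd

# Same data as in the question, written as plain lists.
df = pd.DataFrame({
    'token': ['FATHER', 'MILTON', 'IAN', 'SMITH', '.', 'JOINTLY',
              'WITH', 'BROTHER', 'GREG', 'I', 'SMITH'],
    'tag': ['O', 'PERSON', 'PERSON', 'PERSON', 'O', 'O',
            'O', 'O', 'PERSON', 'PERSON', 'PERSON'],
})

is_person = df['tag'].eq('PERSON')

# Key A: number every run of equal mask values, then keep PERSON rows only.
key_a = is_person.ne(is_person.shift()).cumsum()[is_person]

# Key B: count the non-PERSON rows seen so far (constant inside a PERSON
# run), then replace the key of non-PERSON rows with NaN.
key_b = df['tag'].ne('PERSON').cumsum().where(is_person)

# Rows without a usable key are left out of the groups by default.
joined_a = df.groupby(key_a)['token'].apply(' '.join).reset_index(drop=True)
joined_b = df['token'].groupby(key_b).apply(' '.join).reset_index(drop=True)

print(joined_a.tolist())
print(joined_b.tolist())
```

The group ids differ between the two keys (2 and 4 versus 1.0 and 5.0), but that does not matter because reset_index(drop=True) discards them.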
https://stackoverflow.com/questions/57610062