I am trying to remove duplicate words within a cell to clean up data that holds names (husband and wife).
   Current                Desired
0  John Doe and Jane Doe  John Doe and Jane doe
1  John Doe and John Doe  John Doe
2  John Doe               John Doe
3  Jane Doe and Jane Doe  Jane Doe
4  Jane Doe and Jane      Jane Doe
5  John and John Doe      John Doe

I have the following, which works well for cleaning "John and John" down to "John":

df['out'] = df.Current.str.split(' and ').map(lambda x: ' and '.join(set(x)))

How can I adjust my code to handle the other cases, so that both first and last names are cleaned up?
Posted on 2022-07-21 18:56:02

Try:
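import re

# Capture the text before and after the first " and ".
pat = re.compile(r"(.+?)\s+and\s+(.+)")

def clean(x):
    m = pat.match(x)
    if not m:
        # No " and " in the cell: a single name, keep it unchanged.
        return x
    # Same first name on both sides: keep the longer (more complete) name.
    if m.group(1).split()[0] == m.group(2).split()[0]:
        return max(m.group(1), m.group(2), key=len)
    return x

df["Desired 2"] = df["Current"].apply(clean)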
print(df)

Prints:
   Current                Desired                Desired 2
0  John Doe and Jane Doe  John Doe and Jane doe  John Doe and Jane Doe
1  John Doe and John Doe  John Doe               John Doe
2  John Doe               John Doe               John Doe
3  Jane Doe and Jane Doe  Jane Doe               Jane Doe
4  Jane Doe and Jane      Jane Doe               Jane Doe
5  John and John Doe      John Doe               John Doe

https://stackoverflow.com/questions/73071114
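
If a cell might contain more than two names, or the same first name in different letter case, the same idea (group by first name, keep the longest variant) could be sketched without a fixed two-group regex. This is only a sketch under assumptions: the function name dedupe_names and the SEP pattern are made up for illustration, names are assumed to be separated by the word "and", and people are assumed to be identified by first name alone (so "John Doe and John Smith" would be collapsed too).

```python
import re

# Assumed separator: the word "and" with whitespace on both sides.
SEP = re.compile(r"\s+and\s+")

def dedupe_names(cell: str) -> str:
    """Collapse names sharing a first name, keeping the longest variant."""
    best = {}  # lower-cased first name -> longest full name seen so far
    for name in SEP.split(cell.strip()):
        if not name:
            continue  # skip empty pieces, e.g. from an empty cell
        key = name.split()[0].lower()
        # dicts keep insertion order, so names stay in first-seen order
        if key not in best or len(name) > len(best[key]):
            best[key] = name
    return " and ".join(best.values())

for cell in ["John Doe and Jane Doe", "Jane Doe and Jane", "John and John Doe"]:
    print(cell, "->", dedupe_names(cell))
```

It would be applied the same way as clean above, for example df["Current"].apply(dedupe_names).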