I want to extract the following strings from the title column and put them in a new column named hazard_extract, as in the example below.
import pandas as pd
test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other'], 'hazard_extract':['Other', 'Microbiological', 'Extraneous Material', 'Chemical', 'Chemical', 'Labelling']}
example = pd.DataFrame(test)
example
                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling

However, I am using the code below, and it extracts nothing when the string contains no - or ,. In that case, how can I extract both words from Extraneous Material as well as the single word from Chemical or Other?
example['hazard_extract'] = example['title'].str.extract(r'^(.*?),? ')

                        title   hazard_extract
0                       Other              NaN
1  Microbiological - Listeria  Microbiological
2         Extraneous Material       Extraneous
3                    Chemical              NaN
4        Chemical - Histamine         Chemical
5            Labelling, Other        Labelling

Thank you very much for your help!
Posted on 2021-03-15 12:34:20
No complicated regular expression is needed:
import pandas as pd
test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other']}
example = pd.DataFrame(test)
print(example)
print()
example['hazard_extract'] = example['title'].str.split(' -|,').str[0]
print(example)

                        title
0                       Other
1  Microbiological - Listeria
2         Extraneous Material
3                    Chemical
4        Chemical - Histamine
5            Labelling, Other

                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling

Posted on 2021-03-15 12:35:59
The simplest way is to use split:
example['title'].str.split(r'[-,]').str[0].str.strip()

0                  Other
1        Microbiological
2    Extraneous Material
3               Chemical
4               Chemical
5              Labelling

Posted on 2021-03-15 12:32:12
Try this:
example['title'].str.extract(r'^(\w*\s*\w*)\s*[\,\-]?.*')

https://stackoverflow.com/questions/66632333
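For anyone who prefers to stay with str.extract, here is a minimal sketch of a regex variant. It first shows why the question's pattern yields NaN: `^(.*?),? ` requires a literal space after the captured text, so single-word titles never match. It also shows that the two-word pattern above can keep a trailing space. The names `pattern`, `original` and `two_word` are illustrative only, and the sketch assumes that every - or , in a title acts as a separator, so a hyphenated word would be cut at the hyphen.

```python
import pandas as pd

example = pd.DataFrame({'title': [
    'Other', 'Microbiological - Listeria', 'Extraneous Material',
    'Chemical', 'Chemical - Histamine', 'Labelling, Other',
]})

# Pattern from the question: the trailing ' ' must match a real space,
# so 'Other' and 'Chemical' (no space at all) come back as NaN.
original = example['title'].str.extract(r'^(.*?),? ', expand=False)

# Two-word pattern: \s* sits inside the group, so the space before '-'
# is captured along with the word.
two_word = example['title'].str.extract(r'^(\w*\s*\w*)\s*[\,\-]?.*', expand=False)

# Variant (an assumption, not taken from any answer): lazily capture
# everything that is not ',' or '-', then require optional whitespace
# followed by either a separator or the end of the string.
pattern = r'^([^,-]+?)\s*(?:[-,]|$)'
example['hazard_extract'] = example['title'].str.extract(pattern, expand=False)

print(example)
```

Passing expand=False returns a Series rather than a one-column DataFrame, which makes the column assignment explicit.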