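I want to create a new column containing 1 or 0, depending on whether any phrase from a list matches exactly inside a string column of a dataframe.
The phrases in the list can have multiple spaces between their words, so I cannot use str.split() for exact matching.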
list_provided=["mul the","a b c"]
#how my dataframe looks
id  text
a   simultaneous there the
b   simultaneous there
c   mul why the
d   mul the
e   simul a b c
f   a c b

Expected output:
id  text                    found
a   simultaneous there the  0
b   simultaneous there      0
c   mul why the             0
d   mul the                 1
e   simul a b c             1
f   a c b                   0

The order of the words within a list element also matters!
Code I have tried so far:
import numpy as np
import pandas as pd

data=pd.DataFrame({"id":("a","b","c","d","e","f"), "text":("simultaneous there the","simultaneous there","mul why the","mul the","simul a b c","a c b")})
list_of_word=["mul the","a b c"]
pattern = '|'.join(list_of_word)
# Checks each single word of a row against the list; the list holds whole phrases, so nothing matches.
data['found'] = data['text'].apply(lambda x: sum(i in list_of_word for i in x.split()))
data['found']=np.where(data['found']>0,1,0)
data
###Output generated###
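id  text                    found
a   simultaneous there the  0
b   simultaneous there      0
c   mul why the             0
d   mul the                 0
e   simul a b c             0
f   a c b                   0

How can I get the expected output, where I have to search the dataframe's string column for an exact match of the phrases in the list, given that they may contain multiple spaces?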
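Posted on 2018-04-11 11:26:22
You're almost there. You have done all the groundwork; all that is left is to call the right function, which in this case is str.contains. It treats the pattern as a regular expression by default, so the '|' in your joined pattern acts as alternation between the phrases.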
data['found'] = data.text.str.contains(pattern).astype(int)
data
  id                    text  found
0  a  simultaneous there the      0
1  b      simultaneous there      0
2  c             mul why the      0
3  d                 mul the      1
4  e             simul a b c      1
5  f                   a c b      0

If your patterns themselves contain regex metacharacters or pipes, escape them first:
import re
pattern = '|'.join([re.escape(i) for i in list_of_word])

https://stackoverflow.com/questions/49773486
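
Putting the pieces together, here is a minimal end-to-end sketch using the sample rows from the question. The helper phrase_to_regex is hypothetical and not part of the original answer: it escapes each word and joins the words with \s+, on the assumption that a run of several spaces between words should still count as a match.

```python
import re

import pandas as pd

data = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "text": ["simultaneous there the", "simultaneous there", "mul why the",
             "mul the", "simul a b c", "a c b"],
})
list_of_word = ["mul the", "a b c"]


def phrase_to_regex(phrase):
    # Hypothetical helper: escape every word so regex metacharacters are
    # taken literally, then allow one or more whitespace characters
    # between consecutive words (assumption: extra spaces should match).
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Alternation of all phrases; word order inside each phrase is preserved.
pattern = "|".join(phrase_to_regex(p) for p in list_of_word)

# str.contains returns booleans; astype(int) converts them to 1/0.
data["found"] = data["text"].str.contains(pattern, regex=True).astype(int)
print(data)
```

Keep in mind that str.contains looks for the pattern anywhere in the string, which is why "simul a b c" is flagged. If a phrase must start and end on word boundaries, wrapping each alternative in \b would be one option, though that goes beyond what the answer above proposes.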