I have a DataFrame with the following columns:
Index(['category', 'synonyms_text', 'enabled', 'stems_text'], dtype='object')
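I am only interested in rows whose synonyms_text contains the standalone word food, and not words such as seafood. This filter:
df_text = df_syn.loc[df_syn['synonyms_text'].str.contains('food')]
gives the following result (including seafood, foodlocker and other unwanted matches):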
     category         synonyms_text  \
130  Fishing          seafarm, seafood, shellfish, sportfish
141  Refrigeration    coldstorage, foodlocker, freeze, fridge, ice, refrigeration
183  Food Service     cook, fastfood, foodserve, foodservice, foodtruck, mealprep
200  Restaurant       expresso, food, galley, gastropub, grill, java, kitchen
377  fastfood         carryout, fastfood, takeout
379  Animal Supplies  feed, fodder, grain, hay, petfood
613  store            convenience, food, grocer, grocery, market
Then I put the results into a list, to get food as a separate word:
food_l = df_text['synonyms_text'].str.split().tolist()
However, the values in my list look like this:
['carryout,', 'fastfood,', 'takeout']
So I strip the commas:
food_l = [[x.replace(",", "") for x in l] for l in food_l]
Then I look for the word food in each list:
food_l = [[l for x in l if "food" == x] for l in food_l]
After that, I drop the empty lists:
food_l = [x for x in food_l if x != []]
Finally, I flatten the list to get the final result:
food_l = [item for sublist in food_l for item in sublist]
The final result is:
[['bar', 'bistro', 'breakfast', 'buffet', 'cabaret', 'cafe', 'cantina', 'cappuccino', 'chai', 'coffee', 'commissary', 'cuisine', 'deli', 'dhaba', 'dine', 'diner', 'dining', 'eat', 'eater', 'eats', 'edible', 'espresso', 'expresso', 'food', 'galley', 'gastropub', 'grill', 'java', 'kitchen', 'latte', 'lounge', 'pizza', 'pizzeria', 'pub', 'publichouse', 'restaurant', 'roast', 'sandwich', 'snack', 'snax', 'socialhouse', 'steak', 'sub', 'sushi', 'takeout', 'taphouse', 'taverna', 'tea', 'tiffin', 'trattoria', 'treat', 'treatery'], ['convenience', 'food', 'grocer', 'grocery', 'market', 'mart', 'shop', 'store', 'variety']]
@Erfan this data can be used for testing:
df = pd.DataFrame({'category': ['Fishing', 'Refrigeration', 'store'], 'synonyms_text': ['seafood', 'foodlocker', 'food']})
Both of these come back empty:
df_tmp= df.loc[df['synonyms_text'].str.match('\bfood\b')]
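df_tmp= df.loc[df['synonyms_text'].str.contains(pat='\bfood\b', regex= True)]
Do you know a better way to get only the rows that contain the single word food, without going through this whole painful process? Is there another function for finding values in a DataFrame that match a given value exactly?
Thanks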
Posted on 2019-10-31 00:00:39
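A likely reason both attempts in the question come back empty: in an ordinary Python string literal, '\b' is the backspace character, so the regex engine never sees a word-boundary assertion. The sketch below reuses the test data from the question and compares the plain literal with a raw string; the variable names plain, raw, empty and hits are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'category': ['Fishing', 'Refrigeration', 'store'],
                   'synonyms_text': ['seafood', 'foodlocker', 'food']})

# In a normal string literal, '\b' is the backspace character (code point 8),
# so this pattern looks for literal backspaces around "food".
plain = '\bfood\b'
# In a raw string, the backslash is kept, and the regex engine reads \b
# as a word boundary.
raw = r'\bfood\b'

empty = df.loc[df['synonyms_text'].str.contains(plain, regex=True)]
hits = df.loc[df['synonyms_text'].str.contains(raw, regex=True)]

print(len(empty))                 # number of rows matched by the plain literal
print(hits['category'].tolist())  # categories matched by the raw-string pattern
```

With this data the plain-literal pattern matches no rows, while the raw-string pattern keeps only the row whose text is exactly food.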
Example DataFrame:
df = pd.DataFrame({'category':['Fishing','Refrigeration','store'],
'synonyms_text':['seafood','foodlocker','food']})
print(df)
        category synonyms_text
0        Fishing       seafood
1  Refrigeration    foodlocker
2          store          food   # <-- we want only the rows with exact "food"
There are three ways to do this:
1. str.match
2. str.contains
3. str.extract (not very useful here)
# 1
df['synonyms_text'].str.match(r'\bfood\b')
# 2
df['synonyms_text'].str.contains(r'\bfood\b')
# 3
df['synonyms_text'].str.extract(r'(\bfood\b)').eq('food')
Output
0 False
1 False
2 True
Name: synonyms_text, dtype: bool
Finally, we use the boolean Series with .loc to filter the DataFrame.
m = df['synonyms_text'].str.match(r'\bfood\b')
df.loc[m]
Output
  category synonyms_text
2    store          food
Bonus
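To match case-insensitively, use the inline flag (?i). Put it at the very start of the pattern: Python 3.11 and later reject a global inline flag placed anywhere else.
For example:
df['synonyms_text'].str.match(r'(?i)\bfood\b')
This matches food, Food, FOOD, fOoD and so on.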
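One caveat for the original data, where each cell holds several comma-separated synonyms: str.match only tests from the start of each string, so a row such as "expresso, food, galley" is not picked up, whereas str.contains searches the whole string. The sketch below uses a small made-up sample shaped like the rows in the question (with one "Food" capitalised for illustration) and the case=False keyword as an alternative to the inline flag.

```python
import pandas as pd

# Hypothetical sample modelled on the rows shown in the question;
# "Food" is capitalised here only to demonstrate case-insensitive matching.
df_syn = pd.DataFrame({
    'category': ['Fishing', 'Restaurant', 'fastfood', 'store'],
    'synonyms_text': ['seafarm, seafood, shellfish, sportfish',
                      'expresso, Food, galley, gastropub',
                      'carryout, fastfood, takeout',
                      'convenience, food, grocer, grocery, market'],
})

pattern = r'\bfood\b'

# str.match anchors at the beginning of each string, so "food" appearing
# later in the cell is not found.
at_start = df_syn['synonyms_text'].str.match(pattern)

# str.contains searches anywhere in the string; case=False ignores case.
anywhere = df_syn['synonyms_text'].str.contains(pattern, case=False, regex=True)

print(at_start.tolist())
print(df_syn.loc[anywhere, 'category'].tolist())
```

On this sample the str.match mask is False for every row, while the str.contains mask keeps the Restaurant and store rows and leaves out seafood and fastfood.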
https://stackoverflow.com/questions/58634748