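I want to search a DataFrame for certain keywords and then label the rows where a keyword is found, so that I can work with them later.
Suppose I have a DataFrame and some keywords structured like this: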
import pandas as pd
data = {"metals": ["copper", "zinc", "aluminium", "iron", "platinum", "gold", "silver", "copper and zinc"]}
df = pd.DataFrame(data)
keywords = ["copper", "zinc"]

Ultimately, I would like to obtain the following:
# What I would like to obtain
[in] data
[out]
| ID | metals | label |
| -- | ----------------- | -------------- |
|0 |copper | copper |
|1 |zinc | zinc |
|2 |aluminium | 0 |
|3 |iron | 0 |
|4 |platinum | 0 |
|5 |gold | 0 |
|6 |silver | 0 |
|7 |copper and zinc | [copper, zinc] |

I came up with the following loop, but it only returns:
df['label'] = 0
for word in keywords:
    df['label'][df['metals'].str.contains(word)] = word
# What I actually obtain
[in] data
[out]
| ID | metals | label |
| -- | ----------------- | -------------- |
|0 |copper | copper |
|1 |zinc | zinc |
|2 |aluminium | 0 |
|3 |iron | 0 |
|4 |platinum | 0 |
|5 |gold | 0 |
|6 |silver | 0 |
|7 |copper and zinc | zinc |

How can I build a loop that updates the 'label' column with all the matching words in each row? I would greatly appreciate your feedback.

Posted on 2021-03-02 22:41:25
You can simply use str.findall to find all occurrences of the matching pattern:
pat = fr"\b({'|'.join(keywords)})\b"
df['label'] = df['metals'].str.findall(pat)
>>> df
            metals           label
0           copper        [copper]
1             zinc          [zinc]
2        aluminium              []
3             iron              []
4         platinum              []
5             gold              []
6           silver              []
7  copper and zinc  [copper, zinc]

If you specifically want the output in the desired format shown in the question, you can also use np.select:
import numpy as np

s = df['metals'].str.findall(fr"\b({'|'.join(keywords)})\b")
l = s.str.len()
df['label'] = np.select([l.ge(2), l.eq(1)], [s, s.str[0]], 0)
>>> df
            metals           label
0           copper          copper
1             zinc            zinc
2        aluminium               0
3             iron               0
4         platinum               0
5             gold               0
6           silver               0
7  copper and zinc  [copper, zinc]

Posted on 2021-03-02 22:28:03
Use str.extractall:
pattern = '|'.join(keywords)
df['label'] = (df['metals'].str.extractall(rf'\b({pattern})\b')[0]
               .groupby(level=0).agg(list)
               )

Output:
            metals           label
0           copper        [copper]
1             zinc          [zinc]
2        aluminium             NaN
3             iron             NaN
4         platinum             NaN
5             gold             NaN
6           silver             NaN
7  copper and zinc  [copper, zinc]

https://stackoverflow.com/questions/66441252
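
Both answers join the keywords straight into a regular expression, which could misbehave if a keyword ever contained a regex metacharacter such as "." or "+". The following is a minimal sketch, not taken from either answer, that escapes each keyword with re.escape and then collapses the matches into the format the question asks for using plain Python instead of np.select. The names label_keywords and collapse, and the extra "matches" column, are hypothetical and introduced only for illustration.

```python
import re

import pandas as pd


def label_keywords(series, keywords):
    """Return, for each row, the list of whole-word keyword matches."""
    # re.escape guards against keywords that contain regex metacharacters.
    alternatives = "|".join(re.escape(word) for word in keywords)
    # A non-capturing group makes findall return the full matched text.
    pattern = rf"\b(?:{alternatives})\b"
    return series.str.findall(pattern)


def collapse(matches):
    """Hypothetical helper: 0 for no match, a scalar for one, a list otherwise."""
    if not matches:
        return 0
    return matches[0] if len(matches) == 1 else matches


data = {"metals": ["copper", "zinc", "aluminium", "iron", "platinum",
                   "gold", "silver", "copper and zinc"]}
df = pd.DataFrame(data)
keywords = ["copper", "zinc"]

# Keep the raw lists in one column and the collapsed labels in another.
df["matches"] = label_keywords(df["metals"], keywords)
df["label"] = df["matches"].map(collapse)
print(df)
```

Keeping the list-valued column is usually easier to work with later (every cell has the same type), so the collapsed "label" column is probably only worth building if the mixed format is really required.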