I have a data file such as:
COL1 COL2
A eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus
B viruses; Retroviridae
C viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus
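D Unclassified; RNA virus
I would like to parse the COL2 column, whose elements are separated by ";", and add a COL3 column that holds, for each row, the element containing "viridae".
I should then get: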
COL1 COL2 COL3
A eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus carnivoridae
B viruses; Retroviridae Retroviridae
C viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus Phenuiviridae
D Unclassified; RNA virus NA
Does anyone have an idea?
Here is the data in dict format, if that helps:
{'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2 ': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}
Posted on 2022-05-27 10:09:02
You can do it like this:
import pandas as pd
df = {'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}
df = pd.DataFrame(df)
Then proceed as follows. First, convert the column into a column of lists:
df['COL2_list'] = df['COL2'].str.split(';')
df = df.reset_index()
Then go through each row of df to find the desired string (here I chose 'ridae'):
# Collect the first matching element of each row, or None when nothing matches
col3 = []
for items in df['COL2_list']:
    # strip() drops the space that the split leaves after each ';'
    matches = [s.strip() for s in items if 'ridae' in s]
    col3.append(matches[0] if matches else None)
DF = pd.DataFrame({'COL3': col3})
DF
This gives you:
COL3
0 carnivoridae
1 Retroviridae
2 Phenuiviridae
3 None
Then concatenate the results:
Full = pd.concat([df, DF], axis=1)
which gives:
index COL1 COL2 \
0 0 A eucaryotes; mammal; carnivoridae; carnivorinae...
1 1 B viruses; Retroviridae
2 2 C viruses; mononegavirales; Phenuiviridae; Ascov...
3 3 D Unclassified; RNA virus
COL2_list COL3
0 [eucaryotes, mammal, carnivoridae, carnivor... carnivoridae
1 [viruses, Retroviridae] Retroviridae
2 [viruses, mononegavirales, Phenuiviridae, A... Phenuiviridae
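3 [Unclassified, RNA virus ] None
This differs slightly from what you wrote, because the spelling in your example is inconsistent: "carnivoridae" does not contain "viridae", which is why I matched on 'ridae' instead.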
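The same split-and-search idea can be sketched without the intermediate list column and the concatenation step. This is only an illustration: the helper name `first_element_containing` and its `needle` parameter are hypothetical, and the substring 'ridae' is the same assumption as in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'COL1': ['A', 'B', 'C', 'D'],
    'COL2': [
        'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus',
        'viruses; Retroviridae',
        'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus',
        'Unclassified; RNA virus ',
    ],
})

def first_element_containing(cell, needle):
    """Return the first ';'-separated element of `cell` that contains `needle`, else None."""
    for element in cell.split(';'):
        element = element.strip()  # remove the spaces around each element
        if needle in element:
            return element
    return None

# Series.apply forwards the extra keyword argument to the helper
df['COL3'] = df['COL2'].apply(first_element_containing, needle='ridae')
print(df[['COL1', 'COL3']])
```

Writing the result straight into `df['COL3']` keeps the original index, so `reset_index` and `pd.concat` are not needed in this variant.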
Posted on 2022-05-27 17:11:07
To match your example:
df.assign(COL3=df['COL2'].str.extract(r'(\w+v[io]ridae)', expand=False))
COL1 COL2 COL3
0 A eucaryotes; mammal; carnivoridae; carnivorinae... carnivoridae
1 B viruses; Retroviridae Retroviridae
2 C viruses; mononegavirales; Phenuiviridae; Ascov... Phenuiviridae
3 D Unclassified; RNA virus NaN
To match your stated requirement of finding the word that ends in "viridae":
df.assign(COL3=df['COL2'].str.extract(r'(\w+viridae)', expand=False))
COL1 COL2 COL3
0 A eucaryotes; mammal; carnivoridae; carnivorinae... NaN
1 B viruses; Retroviridae Retroviridae
2 C viruses; mononegavirales; Phenuiviridae; Ascov... Phenuiviridae
3 D Unclassified; RNA virus NaN
https://stackoverflow.com/questions/72403130
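As a rough sketch of how the two regex variants differ, the suffix can be turned into a parameter. The helper name `extract_by_suffix` is hypothetical, and the `\b` word boundaries and case-insensitive matching are extra assumptions that are not part of the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'COL1': ['A', 'B', 'C', 'D'],
    'COL2': [
        'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus',
        'viruses; Retroviridae',
        'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus',
        'Unclassified; RNA virus ',
    ],
})

def extract_by_suffix(column, suffix):
    """Return the first whole word in each cell that ends with `suffix` (NaN if none)."""
    # re.escape protects any regex metacharacters in the suffix;
    # \b on both sides makes the match cover a complete word
    pattern = r'\b(\w*' + re.escape(suffix) + r')\b'
    return column.str.extract(pattern, flags=re.IGNORECASE, expand=False)

loose = extract_by_suffix(df['COL2'], 'ridae')     # also accepts "carnivoridae"
strict = extract_by_suffix(df['COL2'], 'viridae')  # only words ending in "viridae"
print(df.assign(COL3_loose=loose, COL3_strict=strict))
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which can be passed directly to `assign`.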