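I have a data frame (df1) listing genes and the organs associated with each of them, and a second mapping data frame (df2) that translates those organs into a unique organ type.
For example: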
df1 <-
data.frame("Gene_name"=c("Gene1", "Gene2", "Gene3", "Gene4"),
"Organ_name"=c("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus", "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine"))
df2 <-
data.frame("Type"=c("External", "External", "External", "External"......"Internal", "Internal", "Internal"...),
"Organ"=c("Skin", "Eyes", "Hair", "Legs",.... "Lungs", "Small intestine", "Oesophagus".....))
I want to see which main category each individual gene falls into: does it occur more often in internal or in external organs?
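If I split "Organ_name" with str.split(","), I end up with around 20 columns in some cases. Merging "Type" into df1 for each of those separate "Organ_name" columns, using Organ as the key, is a real pain.
Is there a better way to analyse this data? How can I get the frequency/count of each organ "Type"? Please let me know.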
Posted on 2018-03-23 22:23:30
Here is an example of how you can build the logic using pandas.
Setup
import pandas as pd
df1 = pd.DataFrame({"Gene_name": ("Gene1", "Gene2", "Gene3", "Gene4"),
                    "Organ_name": ("Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus",
                                   "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine")})
df2 = pd.DataFrame({"Type": ("External", "External", "External", "External", "Internal", "Internal", "Internal"),
                    "Organ": ("Skin", "Eyes", "Hair", "Legs", "Lungs", "Small intestine", "Oesophagus")})
Solution
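# Series mapping each organ to its type
t = df2.set_index('Organ')['Type']
# Split the comma-separated organ string into a list
df1['Organ_list'] = df1['Organ_name'].str.split(', ')
# Look up each organ's type; organs missing from df2 return None and are dropped
df1['Int_Ext'] = [list(filter(None, map(t.get, x))) for x in df1['Organ_list']]
# Flag a gene as 'Internal' when at least half of its mapped organs are internal
df1['Int_Ext_Flag'] = df1['Int_Ext'].apply(lambda x: 'Internal' if
                                           x.count('Internal') / len(x) >= 0.5 else 'External')
Result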
  Gene_name                    Organ_name                      Organ_list  \
0     Gene1     Skin, Stomach, Eyes, Hair     [Skin, Stomach, Eyes, Hair]
1     Gene2      Lungs, Mouth, Oesophagus      [Lungs, Mouth, Oesophagus]
2     Gene3  Pharynx, Lungs, Throat, Skin  [Pharynx, Lungs, Throat, Skin]
3     Gene4      Stomach, Small intestine      [Stomach, Small intestine]

                          Int_Ext Int_Ext_Flag
0  [External, External, External]     External
1            [Internal, Internal]     Internal
2            [Internal, External]     Internal
3                      [Internal]     Internal
Explanation
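df2 is used to create a mapping from organ to type. The strings in df1['Organ_name'] are split into lists and stored in df1['Organ_list']. pd.Series.apply adds the logic that decides between 'Internal' and 'External'. list(filter(None, ...)) filters out organs that have not been mapped to a type.
https://stackoverflow.com/questions/49458108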
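The question also asks for the frequency/count of each organ "Type" per gene. One possible way to get that, which is not part of the answer above, is to reshape the data to one row per gene and organ and then cross-tabulate. The following is a minimal sketch under those assumptions; the names long_df and type_counts are illustrative, and DataFrame.explode requires pandas 0.25 or later.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Gene_name": ["Gene1", "Gene2", "Gene3", "Gene4"],
    "Organ_name": ["Skin, Stomach, Eyes, Hair", "Lungs, Mouth, Oesophagus",
                   "Pharynx, Lungs, Throat, Skin", "Stomach, Small intestine"],
})
df2 = pd.DataFrame({
    "Type": ["External", "External", "External", "External",
             "Internal", "Internal", "Internal"],
    "Organ": ["Skin", "Eyes", "Hair", "Legs",
              "Lungs", "Small intestine", "Oesophagus"],
})

# One row per (gene, organ) pair instead of one column per organ
long_df = (
    df1.assign(Organ=df1["Organ_name"].str.split(", "))
       .explode("Organ")
       .reset_index(drop=True)
)

# Look up each organ's type; organs absent from df2 become NaN
long_df["Type"] = long_df["Organ"].map(df2.set_index("Organ")["Type"])

# Count organ types per gene; rows whose organ has no type are not counted
type_counts = pd.crosstab(long_df["Gene_name"], long_df["Type"])
print(type_counts)
```

If only the most frequent type is needed, type_counts.idxmax(axis=1) would return it, but note that on a tie it picks the first column, so a gene with equal counts would come out as 'External' rather than 'Internal' as under the >= 0.5 rule in the answer.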