I have a DataFrame with genotype data:
pName genotype feture
person_1 TT feature_1
person_1 TY feature_2
person_1 YY feature_3
person_1 TY feature_4
person_2 TT feature_1
person_2 TT feature_2
person_2 YY feature_3
person_2 YY feature_4
I have collected a set of conditions. Most of them depend on a single genotype, for example:
IF feature 1 == YY interpretation = RED
IF feature 1 == TY interpretation = BLUE
IF feature 1 == TT interpretation = Green
I wrote pandas code for this:
data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'YY'),'interpretation'] = "RED"
data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'TY'),'interpretation'] = "BLUE"
data.loc[(data['feture'] == 'feature_1') & (data['genotype'] == 'TT'),'interpretation'] = "Green"
etc. (3 x 10 features)
So I get:
pName genotype feture interpretation
person_1 TT feature_1 Green
person_1 TY feature_2 ...
person_1 YY feature_3
person_1 TY feature_4
person_2 TT feature_1 Green
person_2 TT feature_2 ...
person_2 YY feature_3
person_2 YY feature_4
But I have a problem with interpretations that depend on two features. For example:
IF feature_3 == YY interpretation = RED
IF feature_4 == TT interpretation = BLUE
But additionally:
(IF feature_3 == YY) & (IF feature_4 == TT) interpretation = R/B
As you can see, I need to add a new row for every person who has both feature_3 and feature_4.
The final DataFrame would look like this:
pName genotype feture interpretation
person_1 TT feature_1 Green
person_1 TY feature_2 ...
person_1 YY feature_3 RED
person_1 TY feature_4 BLUE
person_1 YYTY new_feature_34 R/W #new feature based on two others
person_2 TT feature_1 Green
person_2 TT feature_2 ...
person_2 YY feature_3 BLUE
person_2 YY feature_4 BLUE
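person_2 YYYY new_feature_34 W/W #new feature based on two others
So, if:
(IF feature_3 == YY) & (IF feature_4 == TY)
I add a new row: the person, the two genotypes concatenated, the new feature name, and the combined interpretation, as shown in the example.
I don't know how to do this in pandas. I looked for a solution but did not find one.
I solved my problem in pure Python:
1) Create a list of persons.
2) Iterate over the df and check the two features for each person.
3) Append a new row to the DataFrame: person + CAT(genotype1, genotype2) + newFeatureXY + interpretation
But with more than 1000 persons this is too slow. Can it be done in pandas?
Answered 2020-03-10 21:55:27
You can use groupby and apply here to build the new rows, and then append them to the DataFrame. Since the function that builds a new row is not trivial, I will define it explicitly: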
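import pandas as pd

def feat34(x):
    # x holds all rows of one person; check each single-feature condition
    y = (x['feture'] == 'feature_3') & (x['genotype'] == 'YY')
    z = (x['feture'] == 'feature_4') & (x['genotype'] == 'TY')
    if y.any() and z.any():
        # one new row; pName is restored from the group key by reset_index below
        return pd.DataFrame([['YYTY', 'new_feature_34', 'R/B']],
                            columns=['genotype', 'feture', 'interpretation'])
    return None

new_rows = data.groupby('pName').apply(feat34).reset_index(level=0)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([data, new_rows]).sort_values('pName')
With the sample data, it gives: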
pName genotype feture interpretation
0 person_1 TT feature_1 Green
1 person_1 TY feature_2 NaN
2 person_1 YY feature_3 NaN
3 person_1 TY feature_4 NaN
0 person_1 YYTY new_feature_34 R/B
4 person_2 TT feature_1 Green
5 person_2 TT feature_2 NaN
6 person_2 YY feature_3 NaN
7 person_2 YY feature_4 NaN
Answered 2020-03-10 23:42:43
You can generate a new column 'feature_genotype' and use groupby and apply, as Serge Ballesta suggested:
import pandas as pd
n = 20_000
name = ['person_']*4*n
name = [p + str(i//4) for i, p in enumerate(name)]
df = pd.DataFrame({'pName': name,
                   'genotype': ['TT', 'TY', 'YY', 'TY']*n,
                   'feature': ['feature_1', 'feature_2', 'feature_3', 'feature_4']*n,
                   'interpretation': ['Green', '...', 'RED', 'BLUE']*n})

def fill_values(x, new):
    # x holds all rows of one person; x.name is the group key (the pName)
    v = x.feature_genotype.values
    if 'feature_3_YY' in v and 'feature_4_TY' in v:
        new.append({'pName': x.name,
                    'genotype': 'YYTY',
                    'feature': 'new_feature_34',
                    'interpretation': 'R/W'})

df['feature_genotype'] = df.feature + '_' + df.genotype
new = []
# %time is an IPython magic; drop it when running as a plain script
%time df.groupby('pName').apply(lambda x: fill_values(x, new))
Wall time: 1.19 s
So it takes 1.19 s for a dataset of 80,000 rows. It is also important to drop duplicates, because apply sometimes processes the first group twice:
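new = pd.DataFrame(new)
new = new.drop_duplicates()
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, new]).drop('feature_genotype', axis=1).sort_values('pName')
In practice, though, I would suggest that it is more convenient to work with this df in a wide layout: one row per unique 'pName', and one column for each unique value of the feature column.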
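A minimal sketch of that wide layout, not taken from either answer. It assumes each (pName, feature) pair occurs at most once, uses the column name 'feature' as in this answer, and uses the label 'R/B' from the rule stated in the question. The names wide, hit, matched and result are illustrative only. Once each feature is its own column, the two-feature rule becomes a single boolean mask with no per-group Python function:

```python
import pandas as pd

# Sample data from the question (interpretation filled only where it was shown).
df = pd.DataFrame({
    'pName': ['person_1'] * 4 + ['person_2'] * 4,
    'genotype': ['TT', 'TY', 'YY', 'TY', 'TT', 'TT', 'YY', 'YY'],
    'feature': ['feature_1', 'feature_2', 'feature_3', 'feature_4'] * 2,
    'interpretation': ['Green', None, None, None, 'Green', None, None, None],
})

# One row per person, one column per feature, genotype as the cell value.
# Assumption: each (pName, feature) pair is unique, otherwise pivot raises.
wide = df.pivot(index='pName', columns='feature', values='genotype')

# Vectorized two-feature rule: feature_3 == YY and feature_4 == TY.
hit = (wide['feature_3'] == 'YY') & (wide['feature_4'] == 'TY')
matched = wide[hit]

# Build the extra rows for every matching person in one step.
new = pd.DataFrame({
    'pName': matched.index,
    'genotype': (matched['feature_3'] + matched['feature_4']).to_numpy(),
    'feature': 'new_feature_34',
    'interpretation': 'R/B',
})

# Stable sort keeps each person's original rows ahead of the appended one.
result = pd.concat([df, new], ignore_index=True).sort_values('pName', kind='stable')
print(result)
```

Further two-feature rules can be added as more masks on wide, so the cost grows with the number of rules rather than with the number of persons.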
https://stackoverflow.com/questions/60617460