我试图使用以下形式的数据完成一项任务(生物信息学,TCGA数据):
df = pd.DataFrame({'ID':['TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0003','TCGA-AB-0002'],
'Reference':['HG19','HG18','HG19','GRCh37','HG18','HG19','GRCh37','HG19','GRCh37','GRCh37'],
'SampleType':['Tumor','Tumor','Normal','Normal','Tumor','Normal','Normal','Tumor','Tumor','Tumor']
})看上去像是:
ID Reference SampleType
0 TCGA-AB-0001 HG19 Tumor
1 TCGA-AB-0001 HG18 Tumor
2 TCGA-AB-0001 HG19 Normal
3 TCGA-AB-0001 GRCh37 Normal
4 TCGA-AB-0002 HG18 Tumor
5 TCGA-AB-0002 HG19 Normal
6 TCGA-AB-0002 GRCh37 Normal
7 TCGA-AB-0002 HG19 Tumor
8 TCGA-AB-0003 GRCh37 Tumor
9 TCGA-AB-0002 GRCh37 Tumor如果行具有相同的“引用”和不同的“SampleType”,我将尝试匹配它们。其结果将是以下形式的新数据:
TUMOR NORMAL
index ID Reference SampleType index ID Reference SampleType
0 TCGA-AB-0001 HG19 Tumor 2 TCGA-AB-0001 HG19 Normal
7 TCGA-AB-0002 HG19 Tumor 5 TCGA-AB-0002 HG19 Tumor
9 TCGA-AB-0002 GRCh37 Tumor 6 TCGA-AB-0002 GRCh37 Normal现在我想删除重复的is,但是这样做的优先级是根据列表GRCh37、HG19、HG18。因此,例如,如果HG19和HG18都以相同的ID存在,我将保留HG19。结果应该如下所示:
TUMOR NORMAL
index ID Reference SampleType index ID Reference SampleType
0 TCGA-AB-0001 HG19 Tumor 2 TCGA-AB-0001 HG19 Normal
9 TCGA-AB-0002 GRCh37 Tumor 6 TCGA-AB-0002 GRCh37 Normal是否有办法通过群居或其他熊猫功能来做到这一点?
谢谢!
发布于 2016-06-02 21:37:54
对我来说,仍然不能百分之百地清楚所期望的产出是什么。但根据我的理解,这似乎起了作用。
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0001','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0002','TCGA-AB-0003','TCGA-AB-0002', 'TCGA-AB-0001', 'TCGA-AB-0001'],
'Reference':['HG19','HG18','HG19','GRCh37','HG18','HG19','GRCh37','HG19','GRCh37','GRCh37', 'GRCh37', 'GRCh37'],
'SampleType':['Tumor','Tumor','Normal','Normal','Tumor','Normal','Normal','Tumor','Tumor','Tumor', 'Normal', 'Tumor']
})这比原始示例和具有冗余候选行的测试稍微长一些。
ID Reference SampleType
0 TCGA-AB-0001 HG19 Tumor
1 TCGA-AB-0001 HG18 Tumor
2 TCGA-AB-0001 HG19 Normal
3 TCGA-AB-0001 GRCh37 Normal
4 TCGA-AB-0002 HG18 Tumor
5 TCGA-AB-0002 HG19 Normal
6 TCGA-AB-0002 GRCh37 Normal
7 TCGA-AB-0002 HG19 Tumor
8 TCGA-AB-0003 GRCh37 Tumor
9 TCGA-AB-0002 GRCh37 Tumor
10 TCGA-AB-0001 GRCh37 Normal
11 TCGA-AB-0001 GRCh37 Tumor现在,我们创建了一个具有“冗余”行的临时df。
##
# Create the df with sort and first level filtering
##
df_2 = df.groupby(['ID','Reference']).filter(lambda x:set(x.SampleType)=={'Tumor','Normal'}).drop_duplicates(['ID', 'Reference', 'SampleType']).sort(['ID','Reference', 'SampleType'])
# By dropping dups and sorting, the SampleType column must alternate: Normal, Tumor, Normal...
# Break into two pieces for horizontal concat
left = df_2.iloc[np.arange(0,df_2.shape[0], 2)]
right = df_2.iloc[np.arange(1, df_2.shape[0], 2)]
# Reindex by ID so that pd.concat can properly match rows
left['old_index'] = left.index.values
left.index = left['ID']
right['old_index'] = right.index.values
right.index = right['ID']
right.columns = [c + '_2' for c in right.columns] # Rename right side columns so we can groupby(['ID'])
# Horizontal concat
temp = pd.concat([left, right], axis=1) # with possible duplicates for each unique (ID, Reference) tuple
temp.index = np.arange(temp.shape[0])
temp
ID Reference SampleType old_index ID_2 Reference_2 \
0 TCGA-AB-0001 GRCh37 Normal 3 TCGA-AB-0001 GRCh37
1 TCGA-AB-0001 HG19 Normal 2 TCGA-AB-0001 HG19
2 TCGA-AB-0002 GRCh37 Normal 6 TCGA-AB-0002 GRCh37
3 TCGA-AB-0002 HG19 Normal 5 TCGA-AB-0002 HG19
SampleType_2 old_index_2
0 Tumor 11
1 Tumor 0
2 Tumor 9
3 Tumor 7如果我正确理解,我们只想为每个ID保留一行,按照priority = ['GRCh37', 'HG19', 'HG18']的顺序选择它们
##
# Second level of filtering using priority list
##
priority = ['GRCh37', 'HG19', 'HG18']
g = temp.groupby(['ID'])
def filter_2(grp, priority = ['GRCh37', 'HG19', 'HG18']):
pos = np.argsort(grp['Reference'], priority).iloc[0]
idx = grp.index[pos]
return grp.loc[idx, :]
final = temp.groupby(['ID']).apply(filter_2)
final.index = np.arange(final.shape[0])这就产生了我对最终期望输出的理解。注意:这与最初的示例不同,因为我在输入df中进行了扩展。
final
ID Reference SampleType old_index ID_2 Reference_2 \
0 TCGA-AB-0001 GRCh37 Normal 3 TCGA-AB-0001 GRCh37
1 TCGA-AB-0002 GRCh37 Normal 6 TCGA-AB-0002 GRCh37
SampleType_2 old_index_2
0 Tumor 11
1 Tumor 9发布于 2016-06-02 08:41:38
为了创建新的数据,您可以使用熊猫条件切片:(在您的问题中,您在索引5行的dataframe NORMAL上犯了一个错误,SampleType应该是Normal而不是Tumor)
NORMAL = df[df['SampleType']=='Normal'].copy()
TUMOR = df[df['SampleType']=='Tumor'].copy()或者,如果你有机会拥有'normal'和'tumor'以外的任何东西,而你只想得到'normal'以外的一切
NORMAL = df[df['SampleType']=='Normal']
TUMOR = df[~df['SampleType']=='Normal']然后,为了删除重复项并保留特定值,可以创建另一列,该列保留相同的信息,但由整数组成(比字符串列表更容易排序):
NORMAL['Whatever'] = 0
TUMOR['Whatever'] = 0当然,您可以在拆分dataframe之前这样做(然后只在一个dataframe而不是两个dataframe上这样做)。填写本栏:
NORMAL.ix[NORMAL['Reference'] == 'HG19','Whatever'] = 1
TUMOR.ix[TUMOR['Reference'] == 'HG19','Whatever'] = 1
NORMAL.ix[NORMAL['Reference'] == 'HG18','Whatever'] = 2
TUMOR.ix[TUMOR['Reference'] == 'HG18','Whatever'] = 2然后按照这个新列进行排序,并删除重复项,只保留第一行:
NORMAL.sort_values(by = 'Whatever', inplace = True)
NORMAL.drop_duplicates(subset = 'ID',inplace = True)
TUMOR.sort_values(by = 'Whatever', inplace = True)
TUMOR.drop_duplicates(subset = 'ID',inplace = True)为了获得预期的输出,删除临时列,然后按索引执行:
NORMAL.drop('Whatever',1,inplace = True)
NORMAL.sort_index(inplace = True)
TUMOR.drop('Whatever',1,inplace = True)
TUMOR.sort_index(inplace = True)产出:
Out[3]: NORMAL
ID Reference SampleType
3 TCGA-AB-0001 GRCh37 Normal
6 TCGA-AB-0002 GRCh37 Normal
Out[32]: TUMOR
ID Reference SampleType
0 TCGA-AB-0001 HG19 Tumor
8 TCGA-AB-0003 GRCh37 Tumor
9 TCGA-AB-0002 GRCh37 Tumorhttps://stackoverflow.com/questions/37579803
复制相似问题