I have a dataframe such as:
Names1 Gene_name Status
SP1 GENE1 0
SP1 GENE1 1
SP1 GENE1 1
SP1 GENE1 2
SP1 GENE1 2
SP1 GENE2 0
SP3 GENE2 0
SP1 GENE2 1
SP2 GENE2 2
SP4 GENE3 1
SP4 GENE3 2
SP5 GENE3 0
SP5 GENE3 0
Then I would like to fill a new dataframe in which each Gene_name is a column and each Names1 value is a row:
Names GENE1 GENE2 GENE3
SP1
SP2
SP3
SP4
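SP5
and fill each cell depending on the Status values found for that Names1 / Gene_name pair:
if there is only 0 > value = 0
if there are both 1 and 2 > value = 1-2
if there are all of 0, 1 and 2 > value = 0-1-2
For example, GENE1 shows Status 0, 1 and 2 in SP1, so I fill the cell with 0-1-2: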
Names GENE1 GENE2 GENE3
SP1 0-1-2
SP2
SP3
SP4
SP5
Then SP2, SP3, SP4 and SP5 have no value for GENE1, so I put NA in those cells:
Names GENE1 GENE2 GENE3
SP1 0-1-2
SP2 NA
SP3 NA
SP4 NA
SP5 NA
Then for GENE2:
GENE2 shows both Status 0 and 1 in SP1, so I fill the cell with 0-1:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA
SP3 NA
SP4 NA
SP5 NA
GENE2 shows only Status 2 in SP2, so I fill the cell with 2:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA
SP4 NA
SP5 NA
GENE2 shows only Status 0 in SP3, so I fill the cell with 0:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA 0
SP4 NA
SP5 NA
The other names have no value for GENE2, so I put NA:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA 0
SP4 NA NA
SP5 NA NA
And so on. In the end, I should get a complete dataframe such as:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1 NA
SP2 NA 2 NA
SP3 NA 0 NA
SP4 NA NA 1-2
SP5 NA NA 0
Does anyone have an idea?
Here is the dataframe in dict format, if it helps:
{'Names1': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP1', 5: 'SP1', 6: 'SP3', 7: 'SP1', 8: 'SP2', 9: 'SP4', 10: 'SP4', 11: 'SP5', 12: 'SP5'}, 'Gene_name': {0: 'GENE1', 1: 'GENE1', 2: 'GENE1', 3: 'GENE1', 4: 'GENE1', 5: 'GENE2', 6: 'GENE2', 7: 'GENE2', 8: 'GENE2', 9: 'GENE3', 10: 'GENE3', 11: 'GENE3', 12: 'GENE3'}, 'Status': {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 0, 6: 0, 7: 1, 8: 2, 9: 1, 10: 2, 11: 0, 12: 0}}

Answered 2022-12-04 10:51:30
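The snippets in the answers refer to a DataFrame named df that the thread never builds explicitly. A minimal setup sketch, assuming the dict from the question is bound to a variable (called data here, a name chosen only for illustration):

```python
import pandas as pd

# The dict posted in the question: column name -> {row index -> value}.
data = {
    'Names1': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP1', 5: 'SP1',
               6: 'SP3', 7: 'SP1', 8: 'SP2', 9: 'SP4', 10: 'SP4',
               11: 'SP5', 12: 'SP5'},
    'Gene_name': {0: 'GENE1', 1: 'GENE1', 2: 'GENE1', 3: 'GENE1', 4: 'GENE1',
                  5: 'GENE2', 6: 'GENE2', 7: 'GENE2', 8: 'GENE2',
                  9: 'GENE3', 10: 'GENE3', 11: 'GENE3', 12: 'GENE3'},
    'Status': {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 0, 6: 0, 7: 1, 8: 2,
               9: 1, 10: 2, 11: 0, 12: 0},
}

# A dict of dicts becomes one column per outer key, indexed by the inner keys.
df = pd.DataFrame(data)
print(df.head())
```

With df defined this way, the answer snippets can be pasted and run as they are.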
Code:
g = df.groupby(['Names1', 'Gene_name'])
g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique())).unstack()
Output:
Gene_name GENE1 GENE2 GENE3
Names1
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0
To get the desired output:
(g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique()))
 .unstack().rename_axis(index='Name', columns=''))
Result:
GENE1 GENE2 GENE3
Name
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0

Answered 2022-12-04 11:04:47
The solution above is more concise, but I just wanted to propose an alternative approach to the problem:
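import numpy as np
import pandas as pd

# Sorted so the rows come out as SP1..SP5 rather than in order of appearance
names = np.sort(df['Names1'].unique())
genes = df['Gene_name'].unique()
result_df = pd.DataFrame({'Names': names})
for gene in genes:
    values = []
    for name in names:
        # Status values recorded for this (name, gene) pair
        mask = (df['Names1'] == name) & (df['Gene_name'] == gene)
        statuses = df.loc[mask, 'Status']
        # Distinct values, sorted, joined with '-'; empty string means no rows
        result = '-'.join(map(str, sorted(statuses.unique())))
        values.append(result if result else np.nan)
    result_df[gene] = values
result_df.set_index('Names')
Output: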
GENE1 GENE2 GENE3
Names
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0

Answered 2022-12-04 11:44:30
Using a pivot table, the solution could look like this:
df.pivot_table('Status','Names1','Gene_name',
aggfunc=lambda x: '-'.join(x.astype(str).unique())).rename_axis(columns=None)
>>>
'''
GENE1 GENE2 GENE3
Names1
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0
'''

https://stackoverflow.com/questions/74674654
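A possible follow-up to the answers above, offered as a sketch rather than a definitive fix. The pivot_table answer joins the values in order of first appearance (there is no sort), and the groupby answer sorts them as strings, which would put '10' before '2' if Status ever held larger numbers. All of the answers also leave empty cells as NaN, whereas the question shows NA. The version below assumes Status is numeric and that a literal 'NA' string is acceptable in empty cells; join_statuses and wide are names invented for this example.

```python
import pandas as pd

# Same rows as the question, written as plain lists for brevity.
df = pd.DataFrame({
    'Names1': ['SP1'] * 6 + ['SP3', 'SP1', 'SP2', 'SP4', 'SP4', 'SP5', 'SP5'],
    'Gene_name': ['GENE1'] * 5 + ['GENE2'] * 4 + ['GENE3'] * 4,
    'Status': [0, 1, 1, 2, 2, 0, 0, 1, 2, 1, 2, 0, 0],
})


def join_statuses(statuses):
    """Join the distinct values of a Series, sorted numerically, with '-'."""
    return '-'.join(str(v) for v in sorted(statuses.unique()))


wide = (
    df.groupby(['Names1', 'Gene_name'])['Status']
      .agg(join_statuses)                          # one string per (name, gene)
      .unstack()                                   # genes become columns
      .rename_axis(index='Names', columns=None)    # header as in the question
      .fillna('NA')                                # literal 'NA' for missing pairs
)
print(wide)
```

Because the sort happens inside join_statuses, the result does not depend on the row order of the input.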