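I want to merge the raw_clinical_patient and raw_clinical_sample dataframes.
However, before merging, the SAMPLE_ID column in raw_clinical_sample should be renamed to PATIENT_ID (because it was mislabeled).
I want the new PATIENT_ID to become the index of raw_clin. However, my code still keeps the original PATIENT_ID.
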
import pandas as pd

# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient = raw_clinical_patient.sort_index()  # sort_index() returns a new frame, so keep the result
# Clinical sample info
# Keep info only if `Patient_ID` value exists in `raw_clinical_patient`
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample.sort_index()
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = pd.merge(raw_clinical_patient, raw_clinical_sample, on="PATIENT_ID")
raw_clin.rename(columns={'PATIENT_ID':'PI'}, inplace=True)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin

Example of the raw_clinical_patient dataframe:

raw_clinical_patient = pd.DataFrame(
    [
        ["TCGA-02-0003", 2003, "WITH TUMOR", 100, 50],
        ["TCGA-02-0004", 2002, "WITH TUMOR", 80, 59],
        ["TCGA-02-0006", 2002, "WITH TUMOR", 80, 56],
        ["TCGA-02-0007", 2002, "WITH TUMOR", 80, 40],
    ],
    columns=["PATIENT_ID", 'INITIAL_PATHOLOGIC_DX_YEAR', 'TUMOR_STATUS', 'KARNOFSKY_PERFORMANCE_SCORE', 'AGE'],
)

Example of the raw_clinical_sample dataframe:

raw_clinical_sample = pd.DataFrame(
    [
        ["TCGA-02-0003-01", "123ebb05-33c0-441c-9549-262433b14810", "TCGA-02-0003"],
        ["TCGA-02-0006-01", "03c9930f-7695-419a-bf24-184be75aea63", "TCGA-02-0006"],
        ["TCGA-02-0007-01", "1b41259b-0cb2-4194-ba1e-352d4ae98ed1", "TCGA-02-0007"],
        ["TCGA-02-0009-01", "a731e84e-436f-4e4c-9012-5d68949d1d0f", "TCGA-02-0009"],
    ],
    columns=['SAMPLE_ID', 'OTHER_SAMPLE_ID', 'PATIENT_ID'],
)

Desired output for raw_clin:

raw_clin = pd.DataFrame(
    [
        ["TCGA-02-0003-01", 2003, "WITH TUMOR", 100, 50, "123ebb05-33c0-441c-9549-262433b14810"],
        ["TCGA-02-0006-01", 2002, "WITH TUMOR", 80, 54, "03c9930f-7695-419a-bf24-184be75aea63"],
        ["TCGA-02-0007-01", 2002, "WITH TUMOR", 80, 56, "1b41259b-0cb2-4194-ba1e-352d4ae98ed1"],
    ],
    columns=['PATIENT_ID', 'INITIAL_PATHOLOGIC_DX_YEAR', 'TUMOR_STATUS', 'KARNOFSKY_PERFORMANCE_SCORE', 'AGE', 'OTHER_SAMPLE_ID'],
)

The new index values of raw_clin should be:

TCGA-02-0003-01
TCGA-02-0006-01
TCGA-02-0007-01

Answered 2022-05-05 02:53:59

I'm not entirely sure I follow your desired output, but I think the following should work:

raw_clin = (
    raw_clinical_patient.merge(raw_clinical_sample, on='PATIENT_ID', how='left')
    .dropna()
    .set_index('PATIENT_ID')
    # edit: also rename sample id:
    .rename({'SAMPLE_ID': 'PATIENT_ID'}, axis=1)
)

This gives the following output (shown as it was before the SAMPLE_ID rename was added):

pd.DataFrame(
    {'INITIAL_PATHOLOGIC_DX_YEAR': {'TCGA-02-0003': 2003,
                                    'TCGA-02-0006': 2002,
                                    'TCGA-02-0007': 2002},
     'TUMOR_STATUS': {'TCGA-02-0003': 'WITH TUMOR',
                      'TCGA-02-0006': 'WITH TUMOR',
                      'TCGA-02-0007': 'WITH TUMOR'},
     'KARNOFSKY_PERFORMANCE_SCORE': {'TCGA-02-0003': 100,
                                     'TCGA-02-0006': 80,
                                     'TCGA-02-0007': 80},
     'AGE': {'TCGA-02-0003': 50, 'TCGA-02-0006': 56, 'TCGA-02-0007': 40},
     'SAMPLE_ID': {'TCGA-02-0003': 'TCGA-02-0003-01',
                   'TCGA-02-0006': 'TCGA-02-0006-01',
                   'TCGA-02-0007': 'TCGA-02-0007-01'},
     'OTHER_SAMPLE_ID': {'TCGA-02-0003': '123ebb05-33c0-441c-9549-262433b14810',
                         'TCGA-02-0006': '03c9930f-7695-419a-bf24-184be75aea63',
                         'TCGA-02-0007': '1b41259b-0cb2-4194-ba1e-352d4ae98ed1'}}
)

You can copy and print this to check that it is correct.
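
If the goal is for the sample IDs themselves to become the index, a variation on the chain above may be closer to the desired output. In the question's code, both frames already have PATIENT_ID as their index, so the merge keeps it as the index, and rename(columns=...) only touches column labels, which is likely why the original IDs survive. The sketch below is only one possible approach, assuming PATIENT_ID is still an ordinary column in both frames (as in the example dataframes) and that an inner merge is acceptable in place of how='left' plus dropna():

```python
import pandas as pd

# Example frames copied from the question; PATIENT_ID is a regular column here.
raw_clinical_patient = pd.DataFrame(
    [
        ["TCGA-02-0003", 2003, "WITH TUMOR", 100, 50],
        ["TCGA-02-0004", 2002, "WITH TUMOR", 80, 59],
        ["TCGA-02-0006", 2002, "WITH TUMOR", 80, 56],
        ["TCGA-02-0007", 2002, "WITH TUMOR", 80, 40],
    ],
    columns=["PATIENT_ID", "INITIAL_PATHOLOGIC_DX_YEAR", "TUMOR_STATUS",
             "KARNOFSKY_PERFORMANCE_SCORE", "AGE"],
)
raw_clinical_sample = pd.DataFrame(
    [
        ["TCGA-02-0003-01", "123ebb05-33c0-441c-9549-262433b14810", "TCGA-02-0003"],
        ["TCGA-02-0006-01", "03c9930f-7695-419a-bf24-184be75aea63", "TCGA-02-0006"],
        ["TCGA-02-0007-01", "1b41259b-0cb2-4194-ba1e-352d4ae98ed1", "TCGA-02-0007"],
        ["TCGA-02-0009-01", "a731e84e-436f-4e4c-9012-5d68949d1d0f", "TCGA-02-0009"],
    ],
    columns=["SAMPLE_ID", "OTHER_SAMPLE_ID", "PATIENT_ID"],
)

raw_clin = (
    raw_clinical_patient
    # Keep only patients that appear in both frames.
    .merge(raw_clinical_sample, on="PATIENT_ID", how="inner")
    # Drop the original patient ID, then let SAMPLE_ID take over its name.
    .drop(columns="PATIENT_ID")
    .rename(columns={"SAMPLE_ID": "PATIENT_ID"})
    # Only now make the renamed column the index.
    .set_index("PATIENT_ID")
)

print(raw_clin.index.tolist())
```

The order matters: dropping the old column and renaming happen before set_index, so the index is built from the sample IDs rather than from the original patient IDs.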
https://stackoverflow.com/questions/72120722