
How do I replace a DataFrame index with column values from another DataFrame?

Stack Overflow user
Asked on 2022-05-05 00:21:31
Answers: 1 · Views: 29 · Followers: 0 · Votes: 0

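I want to merge the raw_clinical_patient and raw_clinical_sample DataFrames.

Before merging, however, the SAMPLE_ID column in raw_clinical_sample should be renamed to PATIENT_ID (because it was mislabeled).

I want the new PATIENT_ID to become the index of raw_clin. However, my code still keeps the original PATIENT_ID as the index.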

# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient = raw_clinical_patient.sort_index()  # sort_index() returns a new frame, so assign it

# Clinical sample info
# Keep info only if `Patient_ID` value exists in `raw_clinical_patient`
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample.sort_index()
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]


# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = pd.merge(raw_clinical_patient, raw_clinical_sample, on="PATIENT_ID")
raw_clin.rename(columns={'PATIENT_ID':'PI'}, inplace=True)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin
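
A likely reason the index stays unchanged, sketched with toy data: when both frames have PATIENT_ID as their index and are merged with on="PATIENT_ID", pandas keeps PATIENT_ID as the index of the result, so `rename(columns=...)` has no PATIENT_ID column to match. The frames `left` and `right` and their values are hypothetical stand-ins, not the real clinical files.

```python
import pandas as pd

# Hypothetical toy frames, both indexed by PATIENT_ID like the code above
left = pd.DataFrame(
    {"PATIENT_ID": ["P1", "P2"], "AGE": [50, 56]}
).set_index("PATIENT_ID")
right = pd.DataFrame(
    {"PATIENT_ID": ["P1", "P2"], "SAMPLE_ID": ["P1-01", "P2-01"]}
).set_index("PATIENT_ID")

# Merging on an index level name present in both frames keeps it as the index
merged = pd.merge(left, right, on="PATIENT_ID")

# PATIENT_ID is the index name, not a column, so this rename matches nothing
renamed = merged.rename(columns={"PATIENT_ID": "PI"})
print(renamed.index.name)
print(list(renamed.columns))

# Replacing the index with SAMPLE_ID and relabeling the axis does change it
fixed = merged.set_index("SAMPLE_ID").rename_axis("PATIENT_ID")
print(list(fixed.index))
```

Renaming columns only touches column labels; changing the index itself needs `set_index` (and `rename_axis` if the index should carry a different name).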

Example raw_clinical_patient DataFrame:

raw_clinical_patient = pd.DataFrame([["TCGA-02-0003", 2003, "WITH TUMOR", 100, 50], ["TCGA-02-0004", 2002, "WITH TUMOR", 80, 59], ["TCGA-02-0006", 2002, "WITH TUMOR", 80, 56], ["TCGA-02-0007",2002, "WITH TUMOR", 80, 40]], columns=["PATIENT_ID", 'INITIAL_PATHOLOGIC_DX_YEAR', 'TUMOR_STATUS', 'KARNOFSKY_PERFORMANCE_SCORE', 'AGE'])

Example raw_clinical_sample DataFrame:

raw_clinical_sample = pd.DataFrame([["TCGA-02-0003-01", "123ebb05-33c0-441c-9549-262433b14810", "TCGA-02-0003"], ["TCGA-02-0006-01", "03c9930f-7695-419a-bf24-184be75aea63", "TCGA-02-0006"], ["TCGA-02-0007-01", "1b41259b-0cb2-4194-ba1e-352d4ae98ed1", "TCGA-02-0007"], ["TCGA-02-0009-01", "a731e84e-436f-4e4c-9012-5d68949d1d0f", "TCGA-02-0009"]], columns=['SAMPLE_ID', 'OTHER_SAMPLE_ID', 'PATIENT_ID'])

Desired output for raw_clin:

raw_clin = pd.DataFrame([["TCGA-02-0003-01", 2003, "WITH TUMOR", 100, 50, "123ebb05-33c0-441c-9549-262433b14810"], ["TCGA-02-0006-01", 2002, "WITH TUMOR", 80, 56, "03c9930f-7695-419a-bf24-184be75aea63"], ["TCGA-02-0007-01", 2002, "WITH TUMOR", 80, 40, "1b41259b-0cb2-4194-ba1e-352d4ae98ed1"]], columns=['PATIENT_ID', 'INITIAL_PATHOLOGIC_DX_YEAR', 'TUMOR_STATUS', 'KARNOFSKY_PERFORMANCE_SCORE', 'AGE', 'OTHER_SAMPLE_ID'])

The new index values of raw_clin should be:

TCGA-02-0003-01
TCGA-02-0006-01
TCGA-02-0007-01

1 Answer

Stack Overflow user

Answered on 2022-05-05 02:53:59

I'm not entirely sure I follow your desired output, but I think the following should work:

raw_clin = (
    raw_clinical_patient.merge(raw_clinical_sample, on='PATIENT_ID', how='left')
    .dropna()
    .set_index('PATIENT_ID')
    # edit: also rename sample id:
    .rename({'SAMPLE_ID': 'PATIENT_ID'}, axis=1)
)

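This gives the following output (the sample column is still labeled SAMPLE_ID here, that is, as it looks without the final rename step):
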
pd.DataFrame(
{'INITIAL_PATHOLOGIC_DX_YEAR': {'TCGA-02-0003': 2003,
  'TCGA-02-0006': 2002,
  'TCGA-02-0007': 2002},
 'TUMOR_STATUS': {'TCGA-02-0003': 'WITH TUMOR',
  'TCGA-02-0006': 'WITH TUMOR',
  'TCGA-02-0007': 'WITH TUMOR'},
 'KARNOFSKY_PERFORMANCE_SCORE': {'TCGA-02-0003': 100,
  'TCGA-02-0006': 80,
  'TCGA-02-0007': 80},
 'AGE': {'TCGA-02-0003': 50, 'TCGA-02-0006': 56, 'TCGA-02-0007': 40},
 'SAMPLE_ID': {'TCGA-02-0003': 'TCGA-02-0003-01',
  'TCGA-02-0006': 'TCGA-02-0006-01',
  'TCGA-02-0007': 'TCGA-02-0007-01'},
 'OTHER_SAMPLE_ID': {'TCGA-02-0003': '123ebb05-33c0-441c-9549-262433b14810',
  'TCGA-02-0006': '03c9930f-7695-419a-bf24-184be75aea63',
  'TCGA-02-0007': '1b41259b-0cb2-4194-ba1e-352d4ae98ed1'}}
)

You can copy this and print it to check that it is correct.
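
If the goal is for the sample IDs themselves to become the index, labeled PATIENT_ID, as in the desired output in the question, one possible variation is sketched below. It is not the answer's code: it swaps the left merge plus `dropna()` for an inner merge, drops the original PATIENT_ID column, and uses `rename_axis` to relabel the index. It assumes the example frames from the question and at most one sample per patient.

```python
import pandas as pd

# Example frames copied from the question
raw_clinical_patient = pd.DataFrame(
    [["TCGA-02-0003", 2003, "WITH TUMOR", 100, 50],
     ["TCGA-02-0004", 2002, "WITH TUMOR", 80, 59],
     ["TCGA-02-0006", 2002, "WITH TUMOR", 80, 56],
     ["TCGA-02-0007", 2002, "WITH TUMOR", 80, 40]],
    columns=["PATIENT_ID", "INITIAL_PATHOLOGIC_DX_YEAR", "TUMOR_STATUS",
             "KARNOFSKY_PERFORMANCE_SCORE", "AGE"],
)
raw_clinical_sample = pd.DataFrame(
    [["TCGA-02-0003-01", "123ebb05-33c0-441c-9549-262433b14810", "TCGA-02-0003"],
     ["TCGA-02-0006-01", "03c9930f-7695-419a-bf24-184be75aea63", "TCGA-02-0006"],
     ["TCGA-02-0007-01", "1b41259b-0cb2-4194-ba1e-352d4ae98ed1", "TCGA-02-0007"],
     ["TCGA-02-0009-01", "a731e84e-436f-4e4c-9012-5d68949d1d0f", "TCGA-02-0009"]],
    columns=["SAMPLE_ID", "OTHER_SAMPLE_ID", "PATIENT_ID"],
)

raw_clin = (
    raw_clinical_patient
    # inner merge keeps only patients that also appear in the sample table
    .merge(raw_clinical_sample, on="PATIENT_ID", how="inner")
    # drop the original patient-level ID column
    .drop(columns="PATIENT_ID")
    # the sample ID becomes the index ...
    .set_index("SAMPLE_ID")
    # ... and the index is relabeled PATIENT_ID
    .rename_axis("PATIENT_ID")
)
print(raw_clin)
```

An inner merge avoids relying on `dropna()`, which would also discard rows that have missing values in unrelated columns.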

Votes: 0

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/72120722
