How to read an XML file in R and convert it to a data.frame

Stack Overflow user
Asked 2016-09-25 21:57:24
2 answers, 3.5K views, 0 followers, 0 votes

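library(XML)

file <- "E:/aaa.xml"
doc <- xmlInternalTreeParse(file)

# Prefix of the root element's namespace ("brca" for this file)
ns <- names(xmlNamespace(xmlRoot(doc)))

# Select every <ns:patient> under the root <ns:tcga_bcr>
patient <- getNodeSet(doc, path = paste0("/", ns, ":tcga_bcr/", ns, ":patient"))
row <- xmlToDataFrame(nodes = patient, stringsAsFactors = FALSE)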

shared_stage:stage_event has many child nodes. How can I extract each child node as a column?

If a node has a preferred_name attribute, preferred_name should be used as the data.frame column name.

aaa.xml:

<?xml version="1.0" encoding="UTF-8"?>
<brca:tcga_bcr xsi:schemaLocation="http://tcga.nci/bcr/xml/clinical/brca/2.7 http://tcga-data.nci.nih.gov/docs/xsd/BCR/tcga.nci/bcr/xml/clinical/brca/2.7/TCGA_BCR.BRCA_Clinical.xsd" schemaVersion="2.7" xmlns:brca="http://tcga.nci/bcr/xml/clinical/brca/2.7" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:admin="http://tcga.nci/bcr/xml/administration/2.7" xmlns:clin_shared="http://tcga.nci/bcr/xml/clinical/shared/2.7" xmlns:shared="http://tcga.nci/bcr/xml/shared/2.7" xmlns:brca_shared="http://tcga.nci/bcr/xml/clinical/brca/shared/2.7" xmlns:shared_stage="http://tcga.nci/bcr/xml/clinical/shared/stage/2.7" xmlns:brca_nte="http://tcga.nci/bcr/xml/clinical/brca/shared/new_tumor_event/2.7/1.0" xmlns:nte="http://tcga.nci/bcr/xml/clinical/shared/new_tumor_event/2.7" xmlns:follow_up_v2.1="http://tcga.nci/bcr/xml/clinical/brca/followup/2.7/2.1" xmlns:rx="http://tcga.nci/bcr/xml/clinical/pharmaceutical/2.7" xmlns:rad="http://tcga.nci/bcr/xml/clinical/radiation/2.7">
<brca:patient>
    <admin:additional_studies/>
    <clin_shared:tumor_tissue_site preferred_name="submitted_tumor_site" display_order="9999" cde="3427536" cde_ver="2.000" xsd_ver="2.6" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175314">Breast</clin_shared:tumor_tissue_site>
    <clin_shared:race_list>
        <clin_shared:race preferred_name="race" display_order="12" cde="2192199" cde_ver="1.000" xsd_ver="1.8" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175301">WHITE</clin_shared:race>
    </clin_shared:race_list>
    <shared:bcr_patient_barcode preferred_name="" display_order="9999" cde="2673794" cde_ver="" xsd_ver="1.8" owner="TSS" procurement_status="Completed" restricted="false">TCGA-A2-A0EV</shared:bcr_patient_barcode>
    <shared:tissue_source_site cde="" cde_ver="" xsd_ver="2.4" owner="TSS" procurement_status="Completed" restricted="false">A2</shared:tissue_source_site>
    <shared_stage:stage_event system="AJCC">
        <shared_stage:system_version preferred_name="ajcc_staging_edition" display_order="51" cde="2722309" cde_ver="1.000" xsd_ver="2.6" tier="1" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="1080001">6th</shared_stage:system_version>
        <shared_stage:tnm_categories>
            <shared_stage:pathologic_categories>
                <shared_stage:pathologic_T preferred_name="ajcc_tumor_pathologic_pt" display_order="52" cde="3045435" cde_ver="1.000" xsd_ver="2.6" tier="1" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175336">T1c</shared_stage:pathologic_T>
            </shared_stage:pathologic_categories>
        </shared_stage:tnm_categories>
    </shared_stage:stage_event>       
    <rx:drugs/>
    <rad:radiations/>
</brca:patient>
</brca:tcga_bcr>

Desired data.frame:

submitted_tumor_site  race   bcr_patient_barcode  ajcc_staging_edition  ajcc_tumor_pathologic_pt
Breast                WHITE  TCGA-A2-A0EV         6th                   T1c

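2 Answers

Stack Overflow user

Answered 2016-09-26 01:10:29

Because you have nested descendants and several different namespaces, consider running a separate XPath expression for each XML value you need, then binding the results into a data frame. Run an outer lapply() over the brca:patient nodes, with a checkpath() helper to account for child or descendant nodes that may be missing:
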
# Index of each brca:patient node in the document
patientnum <- seq_along(getNodeSet(doc, "//brca:patient"))

# Return the first match, or NA when the XPath query found nothing
checkpath <- function(matches) {
  if (length(matches) > 0) matches[[1]] else NA_character_
}

patientdata <- lapply(patientnum, function(i) {
  patient <- paste0("//brca:patient[", i, "]")
  temp <- c(checkpath(xpathSApply(doc, paste0(patient, "/clin_shared:tumor_tissue_site"), xmlValue)),
            checkpath(xpathSApply(doc, paste0(patient, "/descendant::clin_shared:race"), xmlValue)),
            checkpath(xpathSApply(doc, paste0(patient, "/descendant::shared:bcr_patient_barcode"), xmlValue)),
            checkpath(xpathSApply(doc, paste0(patient, "/descendant::shared_stage:system_version"), xmlValue)),
            checkpath(xpathSApply(doc, paste0(patient, "/descendant::shared_stage:pathologic_T"), xmlValue)))

  setNames(temp, c("tumor_tissue_site", "race", "bcr_patient_barcode", "system_version", "pathologic_T"))
})

# One row per patient, one column per extracted value
patients <- do.call(rbind, patientdata)
patients <- data.frame(patients, stringsAsFactors = FALSE)
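
The column names in the code above are hard-coded from element names, whereas the question asks for preferred_name where one is present. As a rough sketch only (not part of the original answer), the names could instead be read from each leaf node. The inline sample below is a trimmed copy of aaa.xml, leaf_row() is a hypothetical helper, and the final rbind() assumes every patient yields the same set of columns, which may not hold for real files:

```r
library(XML)

# Trimmed copy of aaa.xml (most attributes removed) so the sketch is self-contained
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<brca:tcga_bcr xmlns:brca="http://tcga.nci/bcr/xml/clinical/brca/2.7"
               xmlns:clin_shared="http://tcga.nci/bcr/xml/clinical/shared/2.7"
               xmlns:shared="http://tcga.nci/bcr/xml/shared/2.7"
               xmlns:shared_stage="http://tcga.nci/bcr/xml/clinical/shared/stage/2.7">
  <brca:patient>
    <clin_shared:tumor_tissue_site preferred_name="submitted_tumor_site">Breast</clin_shared:tumor_tissue_site>
    <clin_shared:race_list>
      <clin_shared:race preferred_name="race">WHITE</clin_shared:race>
    </clin_shared:race_list>
    <shared:bcr_patient_barcode preferred_name="">TCGA-A2-A0EV</shared:bcr_patient_barcode>
    <shared_stage:stage_event system="AJCC">
      <shared_stage:system_version preferred_name="ajcc_staging_edition">6th</shared_stage:system_version>
      <shared_stage:tnm_categories>
        <shared_stage:pathologic_categories>
          <shared_stage:pathologic_T preferred_name="ajcc_tumor_pathologic_pt">T1c</shared_stage:pathologic_T>
        </shared_stage:pathologic_categories>
      </shared_stage:tnm_categories>
    </shared_stage:stage_event>
  </brca:patient>
</brca:tcga_bcr>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: build a one-row data.frame from one patient node
leaf_row <- function(patient) {
  # Leaf elements: descendants with no element children and non-blank text
  leaves <- getNodeSet(patient, ".//*[not(*) and normalize-space()]")
  values <- vapply(leaves, xmlValue, character(1))
  col_names <- vapply(leaves, function(node) {
    preferred <- xmlGetAttr(node, "preferred_name", default = "")
    # Fall back to the element's local name when preferred_name is missing or empty
    if (nzchar(preferred)) preferred else xmlName(node)
  }, character(1))
  as.data.frame(as.list(setNames(values, col_names)), stringsAsFactors = FALSE)
}

# Assumption: all patients produce identical columns, otherwise rbind() fails
patients <- do.call(rbind, lapply(getNodeSet(doc, "//brca:patient"), leaf_row))
names(patients)
```

Repeated leaves (for example several race elements) would end up with de-duplicated names such as race.1, so a real file may need extra handling.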

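Alternatively, you can still use xmlToDataFrame(), but you first need to flatten and simplify your XML. This can be done with XSLT (a language for transforming XML, and a sibling of XPath).

Although R has no dedicated, general-purpose XSLT library, you can use an external processor: one from another language (Python, Java, PHP, even Excel VBA), a dedicated executable (Saxon, Xalan), or a command-line interpreter (PowerShell, Bash). R can call any of these with system().

XSLT script
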
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:brca="http://tcga.nci/bcr/xml/clinical/brca/2.7"
               xmlns:clin_shared="http://tcga.nci/bcr/xml/clinical/shared/2.7"
               xmlns:shared="http://tcga.nci/bcr/xml/shared/2.7"              
               xmlns:shared_stage="http://tcga.nci/bcr/xml/clinical/shared/stage/2.7">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <xsl:template match="/brca:tcga_bcr">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="brca:patient"/>
    </xsl:element>
  </xsl:template>    

  <xsl:template match="brca:patient">    
    <xsl:element name="{local-name()}">
        <tumor_tissue_site><xsl:value-of select="clin_shared:tumor_tissue_site"/></tumor_tissue_site>
        <race><xsl:value-of select="descendant::clin_shared:race"/></race>
        <bcr_patient_barcode><xsl:value-of select="descendant::shared:bcr_patient_barcode"/></bcr_patient_barcode>
        <system_version><xsl:value-of select="descendant::shared_stage:system_version"/></system_version>
        <pathologic_T><xsl:value-of select="descendant::shared_stage:pathologic_T"/></pathologic_T>
    </xsl:element>
  </xsl:template>

</xsl:transform>

R script

system("command line call to transform xml source with xslt")
# e.g. system('python "path/to/transformation_script.py"')   # example: Python script

doc <- xmlParse("path/to/transformed.xml")
doc
# <?xml version="1.0" encoding="UTF-8"?>
# <tcga_bcr>
#   <patient>
#     <tumor_tissue_site>Breast</tumor_tissue_site>
#     <race>WHITE</race>
#     <bcr_patient_barcode>TCGA-A2-A0EV</bcr_patient_barcode>
#     <system_version>6th</system_version>
#     <pathologic_T>T1c</pathologic_T>
#   </patient>
# </tcga_bcr>

patients <- xmlToDataFrame(nodes = getNodeSet(doc, "//patient"), stringsAsFactors = FALSE)
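
The system() call above is left as a placeholder. As one possible way to fill it in, the sketch below assumes the xsltproc command-line tool (an XSLT 1.0 processor) is installed. The file names flatten.xsl, aaa.xml and transformed.xml are hypothetical, and any other processor would work just as well:

```r
# Hypothetical file names; adjust to wherever the stylesheet and source XML live
xsl_file <- "flatten.xsl"
in_file  <- "aaa.xml"
out_file <- "transformed.xml"

# xsltproc usage: xsltproc -o <output> <stylesheet> <input>
args <- c("-o", shQuote(out_file), shQuote(xsl_file), shQuote(in_file))

# Only run when the tool and both input files are actually available
if (nzchar(Sys.which("xsltproc")) && file.exists(xsl_file) && file.exists(in_file)) {
  status <- system2("xsltproc", args)
  if (status != 0) stop("xsltproc failed with exit status ", status)
}
```

system2() is used here rather than system() only because it takes the arguments as a vector; the transformed file can then be parsed with xmlParse() as shown above.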

Votes: 1

Stack Overflow user

Answered 2016-09-26 20:23:25

doc <- xmlInternalTreeParse(file)
ns <- names(xmlNamespace(xmlRoot(doc)))
patient <- getNodeSet(doc, path = paste0("/", ns, ":tcga_bcr/", ns, ":patient"))

patient.fields <- xmlChildren(patient[[1]])
patient.fields[[2]]

The result is:

<clin_shared:tumor_tissue_site preferred_name="submitted_tumor_site" display_order="9999" cde="3427536" cde_ver="2.000" xsd_ver="2.6" tier="2" owner="TSS" procurement_status="Completed" restricted="false" source_system_identifier="175314">Breast</clin_shared:tumor_tissue_site> 

How can I extract the value of preferred_name from patient.fields[[2]]?
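
For reference, the XML package reads attributes with xmlGetAttr() (one attribute by name) or xmlAttrs() (all attributes as a named character vector). A minimal sketch, using a trimmed stand-in node in place of patient.fields[[2]]:

```r
library(XML)

# Stand-in for patient.fields[[2]] (namespace prefix and most attributes dropped for brevity)
node <- xmlRoot(xmlParse(
  '<tumor_tissue_site preferred_name="submitted_tumor_site" cde="3427536">Breast</tumor_tissue_site>',
  asText = TRUE))

preferred <- xmlGetAttr(node, "preferred_name")     # a single attribute by name
same      <- xmlAttrs(node)[["preferred_name"]]     # or index into the full attribute vector
value     <- xmlValue(node)                         # the element's text content

preferred
```

With the real data, the same call would be xmlGetAttr(patient.fields[[2]], "preferred_name"); passing default = NA avoids a NULL result when the attribute is absent.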

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/39687633
