Parsing XML files in R using rentrez

Stack Overflow user
Asked 2017-03-04 15:30:50
Answers: 2 | Views: 546 | Followers: 0 | Score: 1

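I'm not an XML expert, and I'm having trouble parsing an XML file with rentrez. For each pmid (the article ID in the PubMed database), I'm trying to output the authors and their affiliations. My code works well except when an author has multiple affiliations: in that case the columns first_names, last_names, and affiliation have different lengths and an error is returned. I don't have the XML-parsing expertise to deal with this. The result I'm looking for is exactly this: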

pmid         first_names  last_names              affiliation
27869504     Luca           Villa         Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy 
27869504     Luca           Villa         Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France
27869504     Tarik Emre     Şener         Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France
27869504     Tarik Emre     Şener         Department of Urology, Marmara University School of Medicine, Istanbul, Turkey

A sample XML file returned by my entrez_fetch call has the following structure:

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2017//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
<PubmedArticleSet>
  <PubmedArticle>
   <MedlineCitation Status="In-Data-Review" Owner="NLM">
    <PMID Version="1">27869504</PMID>
    <DateCreated>
     <Year>2016</Year>
     <Month>11</Month>
      <Day>21</Day>
    </DateCreated>
  <DateRevised>
    <Year>2017</Year>
    <Month>01</Month>
    <Day>06</Day>
  </DateRevised>
  <Article PubModel="Print-Electronic">
    <Journal>
      <ISSN IssnType="Electronic">1557-900X</ISSN>
      <JournalIssue CitedMedium="Internet">
        <Volume>31</Volume>
        <Issue>1</Issue>
        <PubDate>
          <Year>2017</Year>
          <Month>Jan</Month>
        </PubDate>
      </JournalIssue>
      <Title>Journal of endourology</Title>
      <ISOAbbreviation>J. Endourol.</ISOAbbreviation>
    </Journal>
    <ArticleTitle>Initial Content Validation Results of a New Simulation Model for Flexible Ureteroscopy: The Key-Box.</ArticleTitle>
    <Pagination>
      <MedlinePgn>72-77</MedlinePgn>
    </Pagination>
    <ELocationID EIdType="doi" ValidYN="Y">10.1089/end.2016.0677</ELocationID>
    <Abstract>
      <AbstractText Label="PURPOSE" NlmCategory="OBJECTIVE">We sought to test the content validity of a new training model for flexible ureteroscopy: the Key-Box.</AbstractText>
      <AbstractText Label="MATERIAL AND METHODS" NlmCategory="METHODS">Sixteen medical students were randomized to undergo a 10-day training consisting of performing 10 different exercises aimed at learning specific movements with the flexible ureteroscope, and how to catch and release stones with a nitinol basket using the Key-Box (n&#x2009;=&#x2009;8 students in the training group, n&#x2009;=&#x2009;8 students in the nontraining control group). Subsequently, an expert endourologist (O.T.) blindly assessed skills acquired by the whole cohort of students through two exercises on ureteroscope manipulation and one exercise on stone capture selected among those used for the training. A performance scale (1-5) assessing different steps of the procedure was used to evaluate each student. Time to complete the exercises was measured. Mann-Whitney Rank Sum test was used for comparisons between the two groups.</AbstractText>
      <AbstractText Label="RESULTS" NlmCategory="RESULTS">Mean scores obtained by trained students were significantly higher compared with those obtained by nontrained students (all p&#x2009;&lt;&#x2009;0.001). All trained students were able to complete the two exercises on ureteroscope manipulation within 3 minutes, whereas two students (25%) were not able to finish the exercise on stone capture. Conversely, four (50%) and six (75%) nontrained students were not able to finish one out of the two exercises on ureteroscope manipulation and the exercise on stone capture, respectively. The mean time to complete the three exercises was 76.3, 69.9, and 107 and 172.5, 137.9, and 168 seconds in the trained and nontrained groups, respectively (all p&#x2009;&lt;&#x2009;0.001).</AbstractText>
      <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">The K-Box(&#xAE;) seems to be a valid easy-to-use training model for initiating novel endoscopists to flexible ureteroscopy.</AbstractText>
    </Abstract>
    <AuthorList CompleteYN="Y">
      <Author ValidYN="Y">
        <LastName>Villa</LastName>
        <ForeName>Luca</ForeName>
        <Initials>L</Initials>
        <AffiliationInfo>
          <Affiliation>1 Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>&#x15E;ener</LastName>
        <ForeName>Tarik Emre</ForeName>
        <Initials>TE</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>3 Department of Urology, Marmara University School of Medicine , Istanbul, Turkey .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Somani</LastName>
        <ForeName>Bhaskar K</ForeName>
        <Initials>BK</Initials>
        <AffiliationInfo>
          <Affiliation>4 Department of Urology, University Hospital Southampton NHS Trust , Southampton, United Kingdom .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Cloutier</LastName>
        <ForeName>Jonathan</ForeName>
        <Initials>J</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>5 Department of Urology, University Hospital Centre of Quebec City , Quebec, Canada .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Buttic&#xE8;</LastName>
        <ForeName>Salvatore</ForeName>
        <Initials>S</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>6 Department of Urology, University of Messina , Messina, Italy .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Marson</LastName>
        <ForeName>Francesco</ForeName>
        <Initials>F</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>7 Department of Urology, Citt&#xE0; della Salute e della Scienza, Turin, Italy .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Doizi</LastName>
        <ForeName>Steeve</ForeName>
        <Initials>S</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Proietti</LastName>
        <ForeName>Silvia</ForeName>
        <Initials>S</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
          <Affiliation>8 Department of Urology, IRCCS San Raffaele Scientific Institute , Ville Turro Division, Milan, Italy .</Affiliation>
        </AffiliationInfo>
      </Author>
      <Author ValidYN="Y">
        <LastName>Traxer</LastName>
        <ForeName>Olivier</ForeName>
        <Initials>O</Initials>
        <AffiliationInfo>
          <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation>
        </AffiliationInfo>
      </Author>
    </AuthorList>
    <Language>eng</Language>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList>
    <ArticleDate DateType="Electronic">
      <Year>2016</Year>
      <Month>12</Month>
      <Day>16</Day>
    </ArticleDate>
  </Article>
  <MedlineJournalInfo>
    <Country>United States</Country>
    <MedlineTA>J Endourol</MedlineTA>
    <NlmUniqueID>8807503</NlmUniqueID>
    <ISSNLinking>0892-7790</ISSNLinking>
  </MedlineJournalInfo>
  <KeywordList Owner="NOTNLM">
    <Keyword MajorTopicYN="N">flexible ureteroscopy</Keyword>
    <Keyword MajorTopicYN="N">learning curve</Keyword>
    <Keyword MajorTopicYN="N">training model</Keyword>
    <Keyword MajorTopicYN="N">ureteroscopy curriculum</Keyword>
  </KeywordList>
</MedlineCitation>
<PubmedData>
  <History>
    <PubMedPubDate PubStatus="pubmed">
      <Year>2016</Year>
      <Month>11</Month>
      <Day>22</Day>
      <Hour>6</Hour>
      <Minute>0</Minute>
    </PubMedPubDate>
    <PubMedPubDate PubStatus="medline">
      <Year>2016</Year>
      <Month>11</Month>
      <Day>22</Day>
      <Hour>6</Hour>
      <Minute>0</Minute>
    </PubMedPubDate>
    <PubMedPubDate PubStatus="entrez">
      <Year>2016</Year>
      <Month>11</Month>
      <Day>22</Day>
      <Hour>6</Hour>
      <Minute>0</Minute>
    </PubMedPubDate>
  </History>
  <PublicationStatus>ppublish</PublicationStatus>
  <ArticleIdList>
    <ArticleId IdType="pubmed">27869504</ArticleId>
    <ArticleId IdType="doi">10.1089/end.2016.0677</ArticleId>
  </ArticleIdList>
 </PubmedData>
</PubmedArticle>
</PubmedArticleSet>

Here is the code I'm using. It works fine unless an author of an article in the PubMed database has multiple affiliations:

 library(rentrez)
 library(XML)

 pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope Simulation Model", 
                          retmax = 10)
 SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", 
                          parsed=TRUE)

 xmlGetValue <- function(x, node){
   a <- xpathSApply(x, node, xmlValue)
   if(length(a) == 0) {a <- NA} else {a}
 }

 parse_paper <- function(paper){
    pmid <- xmlGetValue(paper, ".//ArticleId[@IdType='pubmed']")
    first_names <- xmlGetValue(paper, ".//Author/ForeName")
    last_names <- xmlGetValue(paper, ".//Author/LastName")
    affiliation <- xmlGetValue(paper, ".//AffiliationInfo/Affiliation")
    data.frame(pmid=pmid, first_names=first_names, last_names=last_names,
         affiliation=affiliation)
 }  

parse_multiple_papers <- function(papers){
  res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
  do.call(rbind.data.frame, res)
}

test_df <- parse_multiple_papers(SearchResults) 

Any help and support is much appreciated.


2 Answers

Stack Overflow user

Accepted answer

Answered 2017-03-04 16:34:29

This question was also raised as an issue in rentrez's repository, where the details of a possible solution are given. I'll include the code here as well:

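# Extract one author's names; multiple affiliations are collapsed
# into a single "; "-separated string so every author yields one row
parse_author <- function(author){
  fn  <- xmlValue(author[["ForeName"]])
  ln  <- xmlValue(author[["LastName"]])
  aff <- paste(xpathApply(author, "AffiliationInfo/Affiliation", xmlValue), collapse="; ")
  list(forename=fn, lastname=ln, affiliation=aff)
}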

parse_paper <- function(paper){
  author_info <- xpathApply(paper, ".//AuthorList/Author", parse_author)
  res <- do.call(rbind.data.frame, author_info)
  res$pmid <-xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  res
}

parse_multiple_papers <- function(papers){
 res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
 do.call(rbind.data.frame, res)
}

head(parse_multiple_papers(SearchResults))
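
Note that the code above joins an author's affiliations into one string. If one row per affiliation is wanted instead, as in the expected output in the question, something along these lines may work. This is only a sketch: `sample_xml` is a made-up, trimmed-down stand-in for the `entrez_fetch` result, and `value_or_na`, `author_rows`, and `paper_rows` are hypothetical helper names, not part of rentrez or XML.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the parsed entrez_fetch() result
sample_xml <- '<PubmedArticleSet><PubmedArticle>
  <MedlineCitation><Article><AuthorList>
    <Author>
      <LastName>Villa</LastName><ForeName>Luca</ForeName>
      <AffiliationInfo><Affiliation>Milan, Italy</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Paris, France</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Doizi</LastName><ForeName>Steeve</ForeName></Author>
  </AuthorList></Article></MedlineCitation>
  <PubmedData><ArticleIdList>
    <ArticleId IdType="pubmed">27869504</ArticleId>
  </ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>'
papers <- xmlParse(sample_xml, asText = TRUE)

# Return the matched text values, or NA when the path matches nothing
value_or_na <- function(node, path) {
  v <- xpathSApply(node, path, xmlValue)
  if (length(v) == 0) NA_character_ else v
}

# One row per affiliation; the scalar names are recycled by data.frame()
author_rows <- function(author) {
  data.frame(
    first_names = value_or_na(author, "./ForeName")[1],
    last_names  = value_or_na(author, "./LastName")[1],
    affiliation = value_or_na(author, "./AffiliationInfo/Affiliation"),
    stringsAsFactors = FALSE
  )
}

# All author rows of one article, with the article's pmid prepended
paper_rows <- function(paper) {
  rows <- xpathApply(paper, ".//AuthorList/Author", author_rows)
  if (length(rows) == 0) return(NULL)  # article without authors
  data.frame(
    pmid = value_or_na(paper, ".//ArticleId[@IdType='pubmed']")[1],
    do.call(rbind, rows),
    stringsAsFactors = FALSE
  )
}

long_df <- do.call(
  rbind,
  xpathApply(papers, "/PubmedArticleSet/PubmedArticle", paper_rows)
)
print(long_df)
```

With the real data, `papers` would be the `SearchResults` object from the question rather than the inline sample. Authors without any affiliation get a single row with `NA` in the affiliation column.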
Score: 3

Stack Overflow user

Answered 2017-03-04 16:30:58

You can do the following with xml2 and purrr:

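library(xml2)
library(purrr)

# doc: the raw XML text, e.g. from entrez_fetch(..., rettype = "xml") with parsed = FALSE
doc <- read_xml(doc)
# XPath is case-sensitive, so element names must match the XML exactly
scope <- doc %>% xml_find_all("//Author")
# One row per affiliation: the single name values are recycled by data.frame()
scope %>% map_df(~data.frame(
  first_names = xml_find_first(.x, "./ForeName") %>% xml_text,
  last_names = xml_find_first(.x, "./LastName") %>% xml_text,
  affiliation = xml_find_all(.x, ".//Affiliation") %>% xml_text,
  stringsAsFactors = FALSE
))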

This gives you:

   first_names last_names                                                                                             affiliation
1         Luca      Villa  1 Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy .
2         Luca      Villa            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
3   Tarik Emre      Şener            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
4   Tarik Emre      Şener                     3 Department of Urology, Marmara University School of Medicine , Istanbul, Turkey .
5    Bhaskar K     Somani      4 Department of Urology, University Hospital Southampton NHS Trust , Southampton, United Kingdom .
6     Jonathan   Cloutier            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
7     Jonathan   Cloutier                   5 Department of Urology, University Hospital Centre of Quebec City , Quebec, Canada .
8    Salvatore    Butticè            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
9    Salvatore    Butticè                                       6 Department of Urology, University of Messina , Messina, Italy .
10   Francesco     Marson            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
11   Francesco     Marson                             7 Department of Urology, Città della Salute e della Scienza, Turin, Italy .
12      Steeve      Doizi            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
13      Silvia   Proietti            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
14      Silvia   Proietti 8 Department of Urology, IRCCS San Raffaele Scientific Institute , Ville Turro Division, Milan, Italy .
15     Olivier     Traxer            2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .
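
The output above has no pmid column, and the affiliations keep their leading footnote number and trailing " .", neither of which appears in the expected output in the question. One possible extension, sketched here with xml2 and base R only: `sample_xml` is a made-up miniature of the PubMed XML, and `clean_affiliation` and `article_rows` are hypothetical helper names. The cleanup patterns are an assumption based on the sample shown in the question and may not fit every record.

```r
library(xml2)

# Hypothetical miniature of the PubMed XML from the question
sample_xml <- '<PubmedArticleSet><PubmedArticle>
  <MedlineCitation><Article><AuthorList>
    <Author>
      <LastName>Villa</LastName><ForeName>Luca</ForeName>
      <AffiliationInfo><Affiliation>1 Division of Urology , Milan, Italy .</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>2 Department of Urology , Paris, France .</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Doizi</LastName><ForeName>Steeve</ForeName></Author>
  </AuthorList></Article></MedlineCitation>
  <PubmedData><ArticleIdList>
    <ArticleId IdType="pubmed">27869504</ArticleId>
  </ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>'
doc <- read_xml(sample_xml)

# Strip the leading footnote number and the trailing " ." seen in the sample
clean_affiliation <- function(x) {
  x <- sub("^[0-9]+[[:space:]]+", "", x)
  sub("[[:space:]]*\\.[[:space:]]*$", "", x)
}

# One row per (author, affiliation) pair for a single article
article_rows <- function(article) {
  pmid <- xml_text(xml_find_first(article, ".//ArticleId[@IdType='pubmed']"))
  authors <- xml_find_all(article, ".//AuthorList/Author")
  do.call(rbind, lapply(authors, function(author) {
    aff <- xml_text(xml_find_all(author, "./AffiliationInfo/Affiliation"))
    if (length(aff) == 0) aff <- NA_character_  # keep authors without affiliation
    data.frame(
      pmid = pmid,
      first_names = xml_text(xml_find_first(author, "./ForeName")),
      last_names = xml_text(xml_find_first(author, "./LastName")),
      affiliation = clean_affiliation(aff),
      stringsAsFactors = FALSE
    )
  }))
}

articles <- xml_find_all(doc, "/PubmedArticleSet/PubmedArticle")
result <- do.call(rbind, lapply(articles, article_rows))
print(result)
```

Looping over `PubmedArticle` first, rather than over every `Author` in the document, is what keeps each author tied to the right pmid when the fetch returns several articles.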
Score: 4
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/42593415