How do I use xml.etree.ElementTree to get all the relevant fields of an XML file into Python?

Stack Overflow user
Asked 2020-05-19 01:42:43
2 answers, 132 views, 0 followers, 0 votes

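I am trying to parse an XML file from the Gene Expression Omnibus (GEO). I have worked out how to get some of the data fields, but I don't know how to get information such as <Title>.

I tried adapting "How to convert an XML file to nice pandas dataframe?", but could only get part of the information.

How can I extract all of the available data into a pandas DataFrame?

Here is an example of the XML file: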

<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
    <Channel-Count>1</Channel-Count>
    <Channel position="1">
      <Source>AZ-LolCDE</Source>
      <Organism taxid="679895">Escherichia coli BW25113</Organism>
      <Characteristics tag="strain">
BW25113
      </Characteristics>
      <Characteristics tag="type">
Gram-negative bacteria
      </Characteristics>
      <Characteristics tag="moa">
cell wall synthesis inhibitor / lipoprotein
      </Characteristics>
      <Characteristics tag="phenotype">
EC90 of phenotype
      </Characteristics>
      <Characteristics tag="treatment time">
~ 25 min
      </Characteristics>
      <Characteristics tag="treatment concentration">
200 uM
      </Characteristics>
      <Treatment-Protocol>
bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes
      </Treatment-Protocol>
      <Growth-Protocol>
bacteria were grown in iso-sensitest medium
      </Growth-Protocol>
      <Molecule>total RNA</Molecule>
      <Extract-Protocol>
after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme &amp; proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).
For RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.
      </Extract-Protocol>
    </Channel>
    <Data-Processing>
Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)
Supplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample
    </Data-Processing>
    <Platform-Ref ref="GPL20227" />
    <Library-Strategy>RNA-Seq</Library-Strategy>
    <Library-Source>transcriptomic</Library-Source>
    <Library-Selection>cDNA</Library-Selection>
    <Instrument-Model>
      <Predefined>Illumina HiSeq 2500</Predefined>
    </Instrument-Model>
    <Contact-Ref ref="contrib1" />
    <Supplementary-Data type="unknown">
NONE
    </Supplementary-Data>
    <Relation type="BioSample" target="https://www.ncbi.nlm.nih.gov/biosample/SAMN08466802" />
    <Relation type="SRA" target="https://www.ncbi.nlm.nih.gov/sra?term=SRX3648429" />
  </Sample>

This is the parser I am working on, but it misses too many fields.

import xml.etree.ElementTree as ET
from collections import defaultdict  # needed for defaultdict(dict) below
import pandas as pd

def read_geo_xml(path, index_name=None):
    # Parse the XML tree
    tree = ET.parse(path)
    root = tree.getroot()
    # Extract the attributes
    data = defaultdict(dict)
    for record in root:
        id_record = record.attrib["iid"]
        for x in record.findall("*"):
            for y in x:
                for k,v in y.attrib.items():
                    data[id_record][(k,v)] = y.text.strip()

    # Create pd.DataFrame
    df = pd.DataFrame(data).T
    df.index.name = index_name
    return df

url = "https://pastebin.com/raw/AJp5pshP"
import requests
from io import StringIO
text = requests.get(url).text
xml_data = StringIO(text)
df = read_geo_xml(xml_data)
df.head()
#   taxid   tag
# 679895    strain  type    moa phenotype   treatment time  treatment concentration
# GSM2978339    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978340    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978341    Escherichia coli BW25113    BW25113 Gram-negative bacteria  cell wall synthesis inhibitor / lipoprotein EC90 of phenotype   ~ 25 min    200 uM
# GSM2978342    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM
# GSM2978343    Escherichia coli BW25113    BW25113 Gram-negative bacteria  new hit EC90 of phenotype   ~ 25 min    50 uM

Expected output:

# Everything within a <field>  </field>
Submission-Date
Release-Date
Last-Update-Date
Title
Accession
Type
Channel-Count
Source
Organism
Treatment-Protocol
Growth-Protocol
Molecule
Data-Processing
Library-Strategy
Library-Source
Library-Selection
Instrument-Model
Supplementary-Data

# Everything under <Characteristics>
strain
type
moa
phenotype
treatment time
treatment concentration

Right now I can only extract the fields under <Characteristics>.
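
The parser above stores a value only when an element two levels below `<Sample>` carries at least one attribute, because the assignment sits inside the loop over `y.attrib.items()`. `<Characteristics tag="...">` and `<Organism taxid="...">` qualify; `<Title>`, `<Type>` and the other attribute-free elements never reach it. Below is a minimal sketch of an alternative built on `Element.iter()`, which visits every descendant. The helper name `sample_to_record` and the shortened sample are illustrative, and the sketch assumes the file declares no XML namespace, as in the excerpt.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question (illustrative only).
XML = """
<Sample iid="GSM2978341">
  <Status database="GEO">
    <Submission-Date>2018-02-05</Submission-Date>
  </Status>
  <Title>PDD_P2_70</Title>
  <Type>SRA</Type>
  <Channel position="1">
    <Organism taxid="679895">Escherichia coli BW25113</Organism>
    <Characteristics tag="type">
Gram-negative bacteria
    </Characteristics>
  </Channel>
  <Instrument-Model>
    <Predefined>Illumina HiSeq 2500</Predefined>
  </Instrument-Model>
</Sample>
"""


def sample_to_record(sample):
    """Map every descendant of <Sample> that has text to {name: text}."""
    record = {}
    for el in sample.iter():
        if el is sample:
            continue
        text = (el.text or "").strip()
        if not text:
            # Containers such as <Status> or <Channel> hold only whitespace.
            continue
        # Characteristics are keyed by their "tag" attribute,
        # everything else by the element name.
        if el.tag == "Characteristics":
            key = el.get("tag", el.tag)
        else:
            key = el.tag
        record[key] = text
    return record


record = sample_to_record(ET.fromstring(XML))
print(record)
```

For a file with many samples, calling the helper on each `<Sample>` and passing the resulting list of dicts to `pd.DataFrame` should give one row per sample. ElementTree keeps names case-sensitive, so the `Type` element and the `type` characteristic end up under separate keys.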


2 Answers

Stack Overflow user

Posted 2020-05-19 02:59:12

I would use parsel, with XPath, to extract the data.

from parsel import Selector

data = """[your data above]"""
selector = Selector(text=data)

Get the data from the Characteristics nodes:

# All Characteristics nodes have a "tag" attribute,
# which is not found on the other nodes, so I'll select on that.
tags = []
contents = []
for ent in selector.xpath(".//sample//*[@tag]"):
    contents.append(ent.xpath("./text()").get().strip())
    tags.append(ent.attrib.get('tag'))
xters = dict(zip(tags,contents))

Get the data from all the other nodes, excluding Characteristics:

elements = []
vals = []

#this searches through the nodes and excludes characteristics
for ent in selector.xpath(".//sample//*[not(self::characteristics)]"):
    #some nodes have no text, so we have to cater to that
    if not ent.xpath("./text()").get():
        continue
    elements.append(ent.xpath("name(.)").get())
    vals.append(ent.xpath("./text()").get().strip())

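#create dictionary from the two lists
#and merge the xters dict into it to form one main dict
#(a characteristic whose tag matches an element name, e.g. "type",
#overwrites that element's value here)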
results = dict(zip(elements,vals))
results.update(xters)


print(results)

{'status': '',
 'submission-date': '2018-02-05',
 'release-date': '2019-03-25',
 'last-update-date': '2019-03-25',
 'title': 'PDD_P2_70',
 'accession': 'GSM2978341',
 'type': 'Gram-negative bacteria',
 'channel-count': '1',
 'channel': '',
 'source': 'AZ-LolCDE',
 'organism': 'Escherichia coli BW25113',
 'treatment-protocol': 'bacteria were treated with different antibiotics for ~ 25 min till  ~OD 0.2  in 2 ml tubes',
 'growth-protocol': 'bacteria were grown in iso-sensitest medium',
 'molecule': 'total RNA',
 'extract-protocol': "after treament bacteria were resuspended in QiaGen RNAprotect Bacteria Reagent (QiaGen #76506), incubated for 5min, centrifuged, and flash frozen on dry ice. Total RNA was extracted by incubating bacteria in Enzymatic Lysis Buffer  (lysozyme & proteinase K) for 5 min followed by addition of QiaGen RLT Lysis Buffer and RNA purification  using the QiaGen RNeasy Mini kit combined with DNase treatment on a solid support (QiaGen #74104). RNA quality assessment and quantification was performed using microfluidic chip analysis on an Agilent 2100 bioanalyzer (Agilent Technologies).\nFor RNA-sequencing library preparation, 1000 ng total RNA was used as input. First, bacterial ribosomal RNA was depleted using the Ribo-Zero Magnetic Kit Bacteria (Illumina #MRZB12424). After depletion, RNA was resuspended in TruSeq Total RNA Sample Prep Kit Fragmentation buffer (8.5 ul RNA and 8.5 buffer) and reversed transcribed into cDNA using random hexamer primer. Then cDNA was further processed for the construction of sequencing libraries according to the manufacturer's recommendations using the TruSeq Stranded mRNA Sample Prep Kit (Illimina #RS-122-2101). Sequencing was performed with the Illumina TruSeq SBS Kit v4-HS chemistry (Illumina #FC-401-4003) on an Illumina HiSeq2500 instrument with 50 cycles of 2x50 bp paired-end sequencing.",
 'data-processing': 'Illumina CASAVA v1.8.2  software used for basecalling and fastq file generation\nSequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096) genome using bowtie2\nReads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.\nGenome_build: Escherichia coli str. K-12 substr. MG1655, complete genome (GenBank: U00096)\nSupplementary_files_format_and_content: tab-delimited text files in GCT format include read counts of uniquely and fraction of multiple mapped reads (counts.gct.gz), and normalized counts RPKM (rpkms.gct.gz) values for each sample',
 'library-strategy': 'RNA-Seq',
 'library-source': 'transcriptomic',
 'library-selection': 'cDNA',
 'instrument-model': '',
 'predefined': 'Illumina HiSeq 2500',
 'supplementary-data': 'NONE',
 'strain': 'BW25113',
 'moa': 'cell wall synthesis inhibitor / lipoprotein',
 'phenotype': 'EC90 of phenotype',
 'treatment time': '~ 25 min',
 'treatment concentration': '200 uM'}
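
In the output above, `'type'` holds `'Gram-negative bacteria'` rather than `'SRA'`. The keys come back lowercased, so `<Type>` becomes `type`, and `results.update(xters)` then replaces it with the characteristic whose tag is also `type`. One way around this, sketched here with plain dicts, is to prefix the characteristic keys before merging. The `characteristics:` prefix is an arbitrary choice made for this sketch, not something the answer prescribes.

```python
# Values copied from the XML sample and the answer's output.
results = {"title": "PDD_P2_70", "type": "SRA"}
xters = {"strain": "BW25113", "type": "Gram-negative bacteria"}

# Plain update: the characteristic silently replaces the element value.
clobbered = dict(results)
clobbered.update(xters)

# Prefixed update: both values survive under distinct keys.
merged = dict(results)
merged.update({f"characteristics:{k}": v for k, v in xters.items()})

print(clobbered["type"])
print(merged["type"], "|", merged["characteristics:type"])
```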

You can then read the data into a DataFrame:

pd.DataFrame.from_dict(results,orient='index')
0 votes

Stack Overflow user

Posted 2020-05-19 14:28:54

Here is an example.

from simplified_scrapy import SimplifiedDoc, utils

def foo(ele, row):
  children = ele.children
  for a in ele:
      if a != 'html' and a != 'tag': row.append(ele[a])
  if children:
    for child in children:
      foo(child,row)
  elif ele['html']:
    row.append(ele['html'])

html = '''
<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
      <Release-Date>2019-03-25</Release-Date>
      <Last-Update-Date>2019-03-25</Last-Update-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
    <Type>SRA</Type>
</Sample>
'''
doc = SimplifiedDoc(html)
row = []
foo(doc,row)
print(row)

Result:

['GSM2978341', 'GEO', '2018-02-05', '2019-03-25', '2019-03-25', 'PDD_P2_70', 'GEO', 'GSM2978341', 'SRA']
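
The list above holds the values only, so the field names are lost. Here is a rough sketch of the same recursive walk using the standard library's `xml.etree.ElementTree`, keeping a path for every attribute and text value. The `flatten` helper and the `path@attribute` key format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<Sample iid="GSM2978341">
    <Status database="GEO">
      <Submission-Date>2018-02-05</Submission-Date>
    </Status>
    <Title>PDD_P2_70</Title>
    <Accession database="GEO">GSM2978341</Accession>
</Sample>
"""


def flatten(el, prefix="", out=None):
    """Collect attributes and text of el and its descendants as {path: value}."""
    if out is None:
        out = {}
    path = f"{prefix}/{el.tag}" if prefix else el.tag
    # Attributes are stored under "path@name".
    for name, value in el.attrib.items():
        out[f"{path}@{name}"] = value
    # Text is stored under the bare path; whitespace-only text is skipped.
    text = (el.text or "").strip()
    if text:
        out[path] = text
    for child in el:
        flatten(child, path, out)
    return out


flat = flatten(ET.fromstring(XML))
for key, value in flat.items():
    print(key, "=", value)
```

Repeated siblings with the same name, such as the several `<Characteristics>` or `<Relation>` elements in the full sample, would overwrite one another under this key scheme, so a real parser would need to disambiguate them, for example by including the `tag` or `type` attribute in the key.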
0 votes
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61881872
