I am trying to parse PubMed Central (PMC) XML files using Biopython's Bio.Entrez parse function. This is what I have tried so far:
    import glob

    from Bio import Entrez

    def read_xml (handle, outh):
        records = Entrez.parse(handle)
        for record in records:
            print record

    # outfp is defined elsewhere in the full script (not shown here)
    for xmlfile in glob.glob ('samplepmcxml.xml'):
        print xmlfile
        fh = open (xmlfile, "r")
        read_xml (fh, outfp)
        fh.close()

I get the following error:
    Traceback (most recent call last):
      File "3parse_info_from_pmc_nxml.py", line 78, in <module>
        read_xml (fh, outfp)
      File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
        for record in records:
      File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
        self.parser.Parse(text, False)
      File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
        raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
    NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces

I have already downloaded the archivearticle.dtd file. Are there any other DTD files that need to be installed to describe the schema of PMC files? Has anyone successfully parsed PMC articles with the Bio functions or any other method?
Thanks for your help!

Answered 2014-08-01 11:13:40

Use another parser, such as minidom:
    from xml.dom import minidom
    data = minidom.parse("pmc_full.xml")

Now, depending on what data you want to extract, dig into the XML and have fun:
    # Print the text directly inside every <article-title> element
    for title in data.getElementsByTagName("article-title"):
        for node in title.childNodes:
            if node.nodeType == node.TEXT_NODE:
                print(node.data)

https://stackoverflow.com/questions/25075690
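The minidom loop in the answer only prints text nodes that are direct children of article-title, so a title containing inline markup such as italics would come out in pieces. Here is a minimal sketch of one way to collect the whole title text. The inline SAMPLE_XML document and the helper names node_text and article_titles are made up for illustration and are not part of the original answer; a real PMC file is far larger, and the sample only mimics the namespace declaration that trips up Bio.Entrez.

```python
from xml.dom import minidom

# Hypothetical, heavily trimmed stand-in for a PMC article file.
# The xmlns:xlink declaration mimics the namespace use that Bio.Entrez rejects.
SAMPLE_XML = """<?xml version="1.0"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Growth of <italic>E. coli</italic> in culture</article-title>
      </title-group>
    </article-meta>
  </front>
</article>
"""


def node_text(node):
    """Concatenate every text node found beneath `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Recurse into inline markup such as <italic> or <sub>
            parts.append(node_text(child))
    return "".join(parts)


def article_titles(xml_string):
    """Return the full text of each <article-title> element in the document."""
    doc = minidom.parseString(xml_string)
    return [node_text(t) for t in doc.getElementsByTagName("article-title")]


titles = article_titles(SAMPLE_XML)
print(titles)  # ['Growth of E. coli in culture']
```

For a file on disk, replacing minidom.parseString(xml_string) with minidom.parse("pmc_full.xml") should work the same way, assuming the file is well-formed XML.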