
Parsing TEI-XML with Beautiful Soup

Stack Overflow user
Asked 2022-09-13 13:30:30
Answers: 1 · Views: 74 · Followers: 0 · Votes: 2

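I am trying to parse the metadata in GROBID output (GROBID parses academic papers in PDF format). The references look like this in the original TEI-XML file, which is read with soup = read_tei('paper1.tei.xml'):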

<?xml version="1.0" encoding="UTF-8"?><html><body><tei xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
<teiheader xml:lang="en">
<filedesc>
<titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt>
<publicationstmt>
<publisher></publisher>
<availability status="unknown"><licence></licence></availability>
<date type="published" when="2022-09-05">September 5, 2022</date>
</publicationstmt>
<sourcedesc>
<biblstruct>
<analytic>
<author role="corresp">
<persname><forename type="first">Titus</forename><surname>Barik</surname></persname>
<email>titus@barik.net</email>
<affiliation key="aff0">
<orgname type="institution">Georgia Institute of Technology</orgname>
</affiliation>
</author>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</analytic>
<monogr>
<imprint>
<date type="published" when="2022-09-05">September 5, 2022</date>
</imprint>
</monogr>
<idno type="MD5">2E695CAEA5E3B30D896FE14E59153667</idno>
</biblstruct>
</sourcedesc>
</filedesc>
<encodingdesc>
<appinfo>
<application ident="GROBID" version="0.7.1" when="2022-09-08T11:25+0000">
<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
<ref target="https://github.com/kermitt2/grobid"></ref>
</application>
</appinfo>
</encodingdesc>
<profiledesc>
<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here is a bit of text in the middle of the document.</p></div>
</abstract>
</profiledesc>
</teiheader>
<text xml:lang="en">
</text>
<back>
<div type="references">
<listbibl>
<biblstruct xml:id="b0">
<analytic>
<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>
<author>
<persname><forename type="first">S</forename><surname>Feiner</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Macintyre</surname></persname>
</author>
<author>
<persname><forename type="first">M</forename><surname>Haupt</surname></persname>
</author>
<author>
<persname><forename type="first">E</forename><surname>Solomon</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. UIST'93</title>
<meeting>UIST'93</meeting>
<imprint>
<date type="published" when="1993">1993</date>
<biblscope from="145" to="155" unit="page"></biblscope>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b1">
<analytic>
<title level="a" type="main">What's real about virtual reality</title>
<author>
<persname><forename type="first">F</forename><forename type="middle">P B</forename><genname>Jr</genname></persname>
</author>
</analytic>
<monogr>
<title level="j">IEEE Computer Graphics and Applications</title>
<imprint>
<biblscope unit="volume">19</biblscope>
<biblscope unit="issue">6</biblscope>
<biblscope from="16" to="27" unit="page"></biblscope>
<date type="published" when="1999-12">Nov.-Dec. 1999</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b2">
<analytic>
<title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>
<author>
<persname><forename type="first">D</forename><surname>Rémy</surname></persname>
</author>
<author>
<persname><forename type="first">J</forename><surname>Vouillon</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">Theory And Practice of Objects Systems</title>
<imprint>
<biblscope unit="volume">4</biblscope>
<biblscope unit="issue">1</biblscope>
<biblscope from="27" to="50" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b3">
<analytic>
<title level="a" type="main">Visualizing data mining models</title>
<author>
<persname><forename type="first">K</forename><surname>Thearling</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Becker</surname></persname>
</author>
<author>
<persname><forename type="first">D</forename><surname>Decosta</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. Integration of Data Mining and Data Visualization Workshop</title>
<meeting>Integration of Data Mining and Data Visualization Workshop</meeting>
<imprint>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic>
<title level="a" type="main">Why no one uses functional languages</title>
<author>
<persname><forename type="first">P</forename><surname>Wadler</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">ACM SIGPLAN Notices</title>
<imprint>
<biblscope unit="volume">33</biblscope>
<biblscope from="23" to="27" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b5">
<analytic>
<title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>
<author>
<persname><forename type="first">Y</forename><surname>Wang</surname></persname>
</author>
<author>
<persname><forename type="first">C</forename><surname>Mackenzie</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. CHI'99</title>
<meeting>CHI'99</meeting>
<imprint>
<date type="published" when="1999-05">May 1999</date>
</imprint>
</monogr>
</biblstruct>
</listbibl>
</div>
</back>
</tei>
</body></html>

I have a class that tries to extract the titles of the references:

class TEIFile(object):
    @property
    def reference_titles(self):
        reference_data = self.soup.listbibl.find_all('title', type="main")

        result = []

        for reference in reference_data:
            layer1 = reference

            result.append(layer1)
          

        return result

This returns:

'\'[<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>, <title level="a" type="main">What\'s real about virtual reality</title>, <title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>, <title level="a" type="main">Visualizing data mining models</title>, <title level="a" type="main">Why no one uses functional languages</title>, <title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>]\' '

I am having trouble extracting the titles into a list. How can I improve this so that I get just the title text as output?

1 Answer

Stack Overflow user

Answered 2022-09-13 14:07:34

Adapt this example:

from bs4 import BeautifulSoup

# Read the whole XML file into one string; the file declares UTF-8
with open("sample.xml", "r", encoding="utf-8") as file:
    content = file.read()

# "lxml" is the HTML parser: it lowercases tag names and wraps
# the document in <html><body>
bs_content = BeautifulSoup(content, "lxml")

# Print the text of every <title> element in the document
for t in bs_content.find_all("title"):
    print(t.text)

Result:

Fuel Cell Technology An Annotated Bibliography
Fuel Cell Technology An Annotated Bibliography
Windows on the world: 2D windows for 3D augmented reality
Proc. UIST'93
What's real about virtual reality
IEEE Computer Graphics and Applications
Objective ML: An effective object-oriented extension to ML
Theory And Practice of Objects Systems
Visualizing data mining models
Proc. Integration of Data Mining and Data Visualization Workshop
Why no one uses functional languages
ACM SIGPLAN Notices
Object manipulation in virtual environments: Relative size matters
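
The output above includes the paper's own title and the journal or proceedings titles as well. If only the reference titles are wanted as a list of strings, one possible approach is to keep the question's find_all call on listbibl and convert each Tag to text with get_text(). The following is a minimal sketch, not a definitive implementation: SAMPLE is a trimmed-down stand-in for the question's document, the constructor signature is an assumption (the question does not show how self.soup is set), and "html.parser" is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the question's document
# (one header title and two references only).
SAMPLE = """
<tei>
<teiheader><filedesc><titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt></filedesc></teiheader>
<back><div type="references"><listbibl>
<biblstruct xml:id="b0">
<analytic><title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title></analytic>
<monogr><title level="m">Proc. UIST'93</title></monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic><title level="a" type="main">Why no one uses functional languages</title></analytic>
<monogr><title level="j">ACM SIGPLAN Notices</title></monogr>
</biblstruct>
</listbibl></div></back>
</tei>
"""


class TEIFile:
    def __init__(self, markup):
        # Assumption: the soup is built from a markup string. "html.parser"
        # ships with Python and lowercases tag names, which matches the
        # lowercased markup shown in the question.
        self.soup = BeautifulSoup(markup, "html.parser")

    @property
    def reference_titles(self):
        # Search only inside the reference list, so the header title is
        # skipped; type="main" excludes the journal/proceedings titles.
        tags = self.soup.listbibl.find_all("title", type="main")
        # get_text() turns each Tag into a plain string.
        return [tag.get_text(strip=True) for tag in tags]


titles = TEIFile(SAMPLE).reference_titles
print(titles)
```

For the sample above this gives a plain list of two strings rather than a list of Tag objects. Note that the lowercase name listbibl only works with an HTML parser such as "html.parser" or "lxml"; GROBID's TEI files spell the element listBibl, so with Beautiful Soup's "xml" parser, which preserves case, the lookup would need the original spelling.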
Votes: 2
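The original content of this page is provided by Stack Overflow. Translation support was provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/73703884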
