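I am trying to parse the metadata that GROBID outputs when it processes academic papers in PDF format. The references look like this.
The raw TEI-XML file is shown below (read with soup = read_tei('paper1.tei.xml')):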
<?xml version="1.0" encoding="UTF-8"?><html><body><tei xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemalocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
<teiheader xml:lang="en">
<filedesc>
<titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt>
<publicationstmt>
<publisher></publisher>
<availability status="unknown"><licence></licence></availability>
<date type="published" when="2022-09-05">September 5, 2022</date>
</publicationstmt>
<sourcedesc>
<biblstruct>
<analytic>
<author role="corresp">
<persname><forename type="first">Titus</forename><surname>Barik</surname></persname>
<email>titus@barik.net</email>
<affiliation key="aff0">
<orgname type="institution">Georgia Institute of Technology</orgname>
</affiliation>
</author>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</analytic>
<monogr>
<imprint>
<date type="published" when="2022-09-05">September 5, 2022</date>
</imprint>
</monogr>
<idno type="MD5">2E695CAEA5E3B30D896FE14E59153667</idno>
</biblstruct>
</sourcedesc>
</filedesc>
<encodingdesc>
<appinfo>
<application ident="GROBID" version="0.7.1" when="2022-09-08T11:25+0000">
<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
<ref target="https://github.com/kermitt2/grobid"></ref>
</application>
</appinfo>
</encodingdesc>
<profiledesc>
<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here is a bit of text in the middle of the document.</p></div>
</abstract>
</profiledesc>
</teiheader>
<text xml:lang="en">
</text>
<back>
<div type="references">
<listbibl>
<biblstruct xml:id="b0">
<analytic>
<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>
<author>
<persname><forename type="first">S</forename><surname>Feiner</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Macintyre</surname></persname>
</author>
<author>
<persname><forename type="first">M</forename><surname>Haupt</surname></persname>
</author>
<author>
<persname><forename type="first">E</forename><surname>Solomon</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. UIST'93</title>
<meeting>UIST'93</meeting>
<imprint>
<date type="published" when="1993">1993</date>
<biblscope from="145" to="155" unit="page"></biblscope>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b1">
<analytic>
<title level="a" type="main">What's real about virtual reality</title>
<author>
<persname><forename type="first">F</forename><forename type="middle">P B</forename><genname>Jr</genname></persname>
</author>
</analytic>
<monogr>
<title level="j">IEEE Computer Graphics and Applications</title>
<imprint>
<biblscope unit="volume">19</biblscope>
<biblscope unit="issue">6</biblscope>
<biblscope from="16" to="27" unit="page"></biblscope>
<date type="published" when="1999-12">Nov.-Dec. 1999</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b2">
<analytic>
<title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>
<author>
<persname><forename type="first">D</forename><surname>Rémy</surname></persname>
</author>
<author>
<persname><forename type="first">J</forename><surname>Vouillon</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">Theory And Practice of Objects Systems</title>
<imprint>
<biblscope unit="volume">4</biblscope>
<biblscope unit="issue">1</biblscope>
<biblscope from="27" to="50" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b3">
<analytic>
<title level="a" type="main">Visualizing data mining models</title>
<author>
<persname><forename type="first">K</forename><surname>Thearling</surname></persname>
</author>
<author>
<persname><forename type="first">B</forename><surname>Becker</surname></persname>
</author>
<author>
<persname><forename type="first">D</forename><surname>Decosta</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. Integration of Data Mining and Data Visualization Workshop</title>
<meeting>Integration of Data Mining and Data Visualization Workshop</meeting>
<imprint>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b4">
<analytic>
<title level="a" type="main">Why no one uses functional languages</title>
<author>
<persname><forename type="first">P</forename><surname>Wadler</surname></persname>
</author>
</analytic>
<monogr>
<title level="j">ACM SIGPLAN Notices</title>
<imprint>
<biblscope unit="volume">33</biblscope>
<biblscope from="23" to="27" unit="page"></biblscope>
<date type="published" when="1998">1998</date>
</imprint>
</monogr>
</biblstruct>
<biblstruct xml:id="b5">
<analytic>
<title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>
<author>
<persname><forename type="first">Y</forename><surname>Wang</surname></persname>
</author>
<author>
<persname><forename type="first">C</forename><surname>Mackenzie</surname></persname>
</author>
</analytic>
<monogr>
<title level="m">Proc. CHI'99</title>
<meeting>CHI'99</meeting>
<imprint>
<date type="published" when="1999-05">May 1999</date>
</imprint>
</monogr>
</biblstruct>
</listbibl>
</div>
</back>
</tei>
</body></html>
I have a class that tries to extract the titles of the references.
class TEIFile(object):
    @property
    def reference_titles(self):
        # Collect every <title type="main"> element inside the bibliography
        reference_data = self.soup.listbibl.find_all('title', type="main")
        result = []
        for reference in reference_data:
            # Each item is still a bs4 Tag here, not a plain string
            layer1 = reference
            result.append(layer1)
        return result
This returns:
'\'[<title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title>, <title level="a" type="main">What\'s real about virtual reality</title>, <title level="a" type="main">Objective ML: An effective object-oriented extension to ML</title>, <title level="a" type="main">Visualizing data mining models</title>, <title level="a" type="main">Why no one uses functional languages</title>, <title level="a" type="main">Object manipulation in virtual environments: Relative size matters</title>]\' '
I am now struggling to extract the titles into a list. How can I improve this so that I get the title text as output?
Posted on 2022-09-13 14:07:34
Adapting this example:
from bs4 import BeautifulSoup as bs

# Read the whole XML file into a single string
with open("sample.xml", "r") as file:
    content = file.read()

bs_content = bs(content, "lxml")
# find_all("title") matches every <title> element in the document
result = bs_content.find_all("title")
for t in result:
    # .text gives the element's text without the surrounding tag
    print(t.text)
Result:
Fuel Cell Technology An Annotated Bibliography
Fuel Cell Technology An Annotated Bibliography
Windows on the world: 2D windows for 3D augmented reality
Proc. UIST'93
What's real about virtual reality
IEEE Computer Graphics and Applications
Objective ML: An effective object-oriented extension to ML
Theory And Practice of Objects Systems
Visualizing data mining models
Proc. Integration of Data Mining and Data Visualization Workshop
Why no one uses functional languages
ACM SIGPLAN Notices
Object manipulation in virtual environments: Relative size matters
https://stackoverflow.com/questions/73703884
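The loop above prints every title in the file, including the paper's own title and the journal or proceedings titles. To get only the reference titles as a list of strings, one possible approach is to restrict the search to listbibl, keep the question's type="main" filter, and call get_text() on each tag. The following is a minimal sketch under some assumptions: the SAMPLE_TEI string is an abbreviated stand-in for the file in the question, the standalone reference_titles function is a hypothetical helper rather than the asker's class, and html.parser is used only so that the sketch does not require lxml.

```python
from bs4 import BeautifulSoup

# Abbreviated, illustrative stand-in for the GROBID output in the question.
SAMPLE_TEI = """
<tei xmlns="http://www.tei-c.org/ns/1.0">
<teiheader><filedesc><titlestmt>
<title level="a" type="main">Fuel Cell Technology An Annotated Bibliography</title>
</titlestmt></filedesc></teiheader>
<back><div type="references"><listbibl>
<biblstruct xml:id="b0">
<analytic><title level="a" type="main">Windows on the world: 2D windows for 3D augmented reality</title></analytic>
<monogr><title level="m">Proc. UIST'93</title></monogr>
</biblstruct>
<biblstruct xml:id="b1">
<analytic><title level="a" type="main">What's real about virtual reality</title></analytic>
<monogr><title level="j">IEEE Computer Graphics and Applications</title></monogr>
</biblstruct>
</listbibl></div></back>
</tei>
"""


def reference_titles(soup):
    """Return the main title of each reference as a plain string."""
    # Search only inside the bibliography so the header title is skipped.
    bibliography = soup.find("listbibl")
    if bibliography is None:
        return []
    # type="main" keeps article titles and drops journal/proceedings titles.
    tags = bibliography.find_all("title", attrs={"type": "main"})
    # get_text() converts each Tag into its text content.
    return [tag.get_text(strip=True) for tag in tags]


soup = BeautifulSoup(SAMPLE_TEI, "html.parser")
titles = reference_titles(soup)
print(titles)
# ['Windows on the world: 2D windows for 3D augmented reality', "What's real about virtual reality"]
```

In the question's class, the same list comprehension could replace the loop inside the reference_titles property, so that it returns strings instead of Tag objects.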