Two tags with the same name but in different positions -- XML

Stack Overflow user
Asked on 2020-05-25 07:26:27
3 answers, 151 views, 0 followers, 0 votes

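I want to convert an XML file to a JSON file using Python. I am currently trying to extract the information from the XML file into a dict or a DataFrame.

Here is the XML file: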

<?xml version="1.0" encoding="UTF-8"?>
<Terms>
    <Term>
        <Title>.177 (4.5mm) Airgun</Title>
        <Description>The standard airgun calibre for international target shooting.</Description>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
    <Term>
        <Title>.22</Title>
        <Description>A rimfire calibre, much used in target shooting and often synonymous with the term smallbore.</Description>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
    <Term>
        <Title>.22 Long Rifle</Title>
        <Description>The standard .22 rimfire cartridge for target rifle and pistol use.</Description>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
    <Term>
        <Title>.22 Short</Title>
        <Description>Used as a target shooting round for timed fire pistol competitions.</Description>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
</Terms>

When I look up the Title tag, I get every Title tag. However, I want to separate the main Title tags from the Title tags nested inside the RelatedTerms tags.

import json

from bs4 import BeautifulSoup

xml_file = open('xml.xml', encoding='UTF-8')
soup = BeautifulSoup(xml_file, 'lxml-xml', from_encoding='UTF-8')


Terms = soup.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

print(json.dumps(jsonObj))

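OK, I have updated the code above and it mostly works. However, the line "Title": rterm.find('Title').text raises this error:

AttributeError: 'NoneType' object has no attribute 'text'

I don't know why, because there is text inside the tag.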

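3 Answers

Stack Overflow user

Accepted answer

Posted on 2020-05-25 09:19:52

I would use parsel to extract the data. Your data is nested inside the terms and the related terms, so adjust the code accordingly: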

from parsel import Selector

data = """[your code above here]"""

selector = Selector(text=data)

# Extract the titles that sit directly under each top-level Term.
# parsel's default HTML parser lowercases tag names, hence the lowercase XPath.
title_in_terms = selector.xpath(".//terms/term/title/text()").getall()
print(title_in_terms)
# ['.177 (4.5mm) Airgun', '.22', '.22 Long Rifle', '.22 Short']

# Extract the titles nested inside RelatedTerms.
title_in_relationship_terms = selector.xpath(".//relatedterms/term/title/text()").getall()
print(title_in_relationship_terms)
# ['Shooting sport equipment',
#  'Shooting sport equipment',
#  'Shooting sport equipment',
#  'Shooting sport equipment']
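
The same path-based separation can also be sketched without third-party packages. The example below is an illustration only: it swaps parsel for the standard library's `xml.etree.ElementTree`, and `xml_text` is a shortened, two-term copy of the XML from the question. Because ElementTree keeps the original tag case, the paths use `Term` and `Title` rather than the lowercase names above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML (two terms only, descriptions omitted).
xml_text = """<Terms>
    <Term>
        <Title>.177 (4.5mm) Airgun</Title>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
    <Term>
        <Title>.22</Title>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
</Terms>"""

root = ET.fromstring(xml_text)  # root is the <Terms> element

# Titles that are direct children of a top-level <Term>.
main_titles = [t.text for t in root.findall("./Term/Title")]

# Titles that belong to a <Term> nested inside <RelatedTerms>.
related_titles = [t.text for t in root.findall("./Term/RelatedTerms/Term/Title")]

print(main_titles)
print(related_titles)
```

Each path names every step from the root, so a Title is matched only when it sits at that exact depth, which is what keeps the two groups apart.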
Votes: 1

Stack Overflow user

Posted on 2020-05-25 10:20:14

I created a working solution that uses only the packages you specified in your code. It looks like this:

from bs4 import BeautifulSoup as bs
import lxml

xml_file = open('xml.xml', encoding='UTF-8')
soup = bs(xml_file, 'lxml-xml', from_encoding='UTF-8')

term = soup.find_all('Term')[0]
main_title = term.find_all('Title')[0]
related_terms = term.find_all('RelatedTerms')[0]
embedded_title = related_terms.find_all('Title')[0]

print(main_title.string)
print(embedded_title.string)

Output:

.177 (4.5mm) Airgun
Shooting sport equipment

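This code relies heavily on every tag having at least one of the specified child tags. If your XML file does not guarantee that, you must check whether each resulting tag list is empty before indexing into it.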
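
A minimal sketch of such a check follows. It is an assumption-laden illustration rather than part of the answer's code: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, the helper `child_text` is a hypothetical name, and the input is a made-up term whose nested Term has no Title. ElementTree's `find` returns `None` for a missing child, just as BeautifulSoup's `find` does, so the guard has the same shape in either library.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the nested <Term> is missing its <Title>.
xml_text = """<Terms>
    <Term>
        <Title>.22</Title>
        <RelatedTerms>
            <Term>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
</Terms>"""


def child_text(parent, tag):
    """Return the text of the first direct child named `tag`, or None."""
    if parent is None:
        return None
    child = parent.find(tag)
    # Compare against None explicitly: an element without children is falsy.
    return child.text if child is not None else None


term = ET.fromstring(xml_text).find("Term")
related = term.find("RelatedTerms/Term")

main_title = child_text(term, "Title")
embedded_title = child_text(related, "Title")
relationship = child_text(related, "Relationship")

print(main_title, embedded_title, relationship)
```

Here the missing Title yields `None` instead of raising `AttributeError`, which is the same error reported in the question when `.text` is read from a failed `find`.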

Votes: 1

Stack Overflow user

Posted on 2020-05-25 18:31:11

Using only BeautifulSoup, and assuming xml_text holds the XML text from the question, you can use the following script:

import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_text, 'xml')

data = []
for title, description in zip(soup.select('Terms > Term > Title'), soup.select('Terms > Term > Description')):
    data.append({'Title': title.get_text(strip=True),
                 'Description': description.get_text(strip=True),
                 'Related Terms': [(rel_title.get_text(strip=True), rel.get_text(strip=True)) for rel_title, rel in zip(
                        title.find_parent('Term').select('RelatedTerms > Term > Title'),
                        title.find_parent('Term').select('RelatedTerms > Term > Relationship') )]})

df = pd.DataFrame(data)
print(df)

This creates the following pandas DataFrame:

                 Title                                        Description                                Related Terms
0  .177 (4.5mm) Airgun  The standard airgun calibre for international ...  [(Shooting sport equipment, Narrower Term)]
1                  .22  A rimfire calibre, much used in target shootin...  [(Shooting sport equipment, Narrower Term)]
2       .22 Long Rifle  The standard .22 rimfire cartridge for target ...  [(Shooting sport equipment, Narrower Term)]
3            .22 Short  Used as a target shooting round for timed fire...  [(Shooting sport equipment, Narrower Term)]
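
Since the question ultimately asks for JSON rather than a DataFrame, the same nesting can be written out as a JSON document. The sketch below is only one possible approach under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, `term_to_dict` is a hypothetical helper name, the `"thesaurus"` key is taken from the question's code, and the input is a shortened sample in which the second term deliberately has no RelatedTerms.

```python
import json
import xml.etree.ElementTree as ET

# Shortened sample; the second term has no <RelatedTerms> on purpose.
xml_text = """<Terms>
    <Term>
        <Title>.22</Title>
        <Description>A rimfire calibre.</Description>
        <RelatedTerms>
            <Term>
                <Title>Shooting sport equipment</Title>
                <Relationship>Narrower Term</Relationship>
            </Term>
        </RelatedTerms>
    </Term>
    <Term>
        <Title>.22 Short</Title>
        <Description>A target shooting round.</Description>
    </Term>
</Terms>"""


def term_to_dict(term):
    """Convert one top-level <Term> element into a plain dict."""
    detail = {
        # findtext only looks at direct children here and falls back to ""
        # when the child is missing, so no AttributeError can occur.
        "Title": term.findtext("Title", default=""),
        "Description": term.findtext("Description", default=""),
    }
    related = [
        {
            "Title": rel.findtext("Title", default=""),
            "Relationship": rel.findtext("Relationship", default=""),
        }
        for rel in term.findall("RelatedTerms/Term")
    ]
    if related:
        detail["RelatedTerms"] = related
    return detail


root = ET.fromstring(xml_text)
json_obj = {"thesaurus": [term_to_dict(t) for t in root.findall("Term")]}

print(json.dumps(json_obj, indent=2))
```

Using `findall("Term")` on the root selects only the top-level terms, so the nested Term elements never appear as separate entries in the output.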
Votes: 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61997647
