Q: Using a TEI XML parser in a loop

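Stack Overflow user

Asked 2020-08-01 02:53:31

1 answer · 220 views · 0 followers · 0 votes

I found this article about parsing TEI very helpful when working with a single XML file. However, I have a whole directory of files. My loop over the directory does not execute, and I don't know why.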

from bs4 import BeautifulSoup
import os.path
import glob

tei_docs = "../input/tei-xml-files"
    def read_tei(tei_docs):
    os.chdir(tei_docs)
for i in glob.glob(os.path.join(tei_docs, "*.xml")):
    read_file(i)

def read_file(i):

with open(i, "r") as tei:
    soup = BeautifulSoup(tei, 'lxml')

soup.title.getText()

I get this output:

NameError

Traceback (most recent call last)

-> 1 soup.title.getText()

NameError: name 'soup' is not defined

python

html

beautifulsoup

html-parsing

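1 Answer

Stack Overflow user

Answered 2020-08-01 03:09:22

There are several problems in your code, including indentation errors. I have also written a script that uses Beautiful Soup to read the titles from XML files, shown below: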

#!/usr/bin/env python
# coding: utf-8

import os

from bs4 import BeautifulSoup

directory = "C:\\Users\\mbarg\\Documents\\corpus"  # location of the XML files on the local drive

results = []  # collected title strings
for infile in os.listdir(directory):
    if not infile.lower().endswith(".xml"):
        continue  # skip anything in the folder that is not an XML file
    filename = os.path.join(directory, infile)
    # errors="ignore" silently drops bytes that are not valid UTF-8
    with open(filename, "r", encoding="utf-8", errors="ignore") as indata:
        soup = BeautifulSoup(indata, "xml")  # the "xml" parser requires lxml to be installed
    titles = soup.find_all("title")  # every <title> element in the file
    for title in titles:
        print(title.get_text())
        results.append(title.get_text())
print(results)  # show the full result list
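
For the function-based layout used in the question, a minimal sketch might look like the following. It addresses three likely causes of the original failure: read_tei is never called, soup exists only inside read_file so it is undefined at module level, and calling os.chdir(tei_docs) before globbing the same relative path makes the pattern resolve against the new working directory, where it probably matches nothing. To stay free of third-party dependencies, this sketch swaps Beautiful Soup for the standard library's xml.etree.ElementTree (the `{*}` wildcard needs Python 3.8+). The helper name read_title is hypothetical, not from the question.

```python
import glob
import os
import xml.etree.ElementTree as ET


def read_title(path):
    """Return the text of the first <title> element in one XML file, or None."""
    root = ET.parse(path).getroot()
    # "{*}title" matches <title> in any namespace (Python 3.8+); TEI files
    # normally declare the http://www.tei-c.org/ns/1.0 namespace.
    title = root.find(".//{*}title")
    return None if title is None else "".join(title.itertext()).strip()


def read_tei(tei_dir):
    """Collect {filename: title} for every .xml file directly inside tei_dir."""
    titles = {}
    # No os.chdir() here: glob receives the full pattern, so the paths it
    # returns can be opened as they are.
    for path in sorted(glob.glob(os.path.join(tei_dir, "*.xml"))):
        titles[os.path.basename(path)] = read_title(path)
    return titles


if __name__ == "__main__":
    # Directory name taken from the question; adjust it to your own layout.
    for name, title in read_tei("../input/tei-xml-files").items():
        print(name, title)
```

The same structure should carry over to Beautiful Soup: assuming every file contains a title, read_title could open the file and return BeautifulSoup(f, "xml").title.get_text() instead, while the loop and the returned values stay as they are.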

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63197592
