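I found this article on parsing TEI very helpful for handling a single XML file. However, I have a whole directory of them. My loop over the directory never runs, and I don't know why.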
from bs4 import BeautifulSoup
import os.path
import glob

tei_docs = "../input/tei-xml-files"

def read_tei(tei_docs):
    os.chdir(tei_docs)
    for i in glob.glob(os.path.join(tei_docs, "*.xml")):
        read_file(i)

def read_file(i):
    with open(i, "r") as tei:
        soup = BeautifulSoup(tei, 'lxml')

soup.title.getText()
I get this output:
NameError                                 Traceback (most recent call last)
----> 1 soup.title.getText()
NameError: name 'soup' is not defined
Posted on 2020-08-01 03:09:22
Your code has several problems, including indentation errors. I also wrote a script using Beautiful Soup that reads the titles from XML files; here it is:
#!/usr/bin/env python
# coding: utf-8
from bs4 import BeautifulSoup
import os
from os.path import join

directory = "C:\\Users\\mbarg\\Documents\\corpus"  # location of XML files on local drive
results = []  # collected titles

for infile in os.listdir(directory):
    filename = join(directory, infile)
    # Undecodable bytes are dropped instead of raising UnicodeDecodeError
    with open(filename, "r", encoding="utf-8", errors="ignore") as indata:
        contents = indata.read()
    soup = BeautifulSoup(contents, 'xml')  # the 'xml' parser requires lxml
    titles = soup.find_all('title')  # every <title> element in the file
    for title in titles:
        print(title.get_text())
        results.append(title.get_text())

print(results)  # show the collected titles
https://stackoverflow.com/questions/63197592
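For the loop in the question itself, three things appear to go wrong: `read_tei` is defined but never called, so the loop never starts; `soup` exists only inside `read_file`, so the later `soup.title.getText()` raises the `NameError`; and calling `os.chdir(tei_docs)` before joining the same relative path makes `glob` search relative to the new working directory. The following is a minimal sketch of one way to restructure it, not a definitive fix. It swaps in the standard library's `xml.etree.ElementTree` for Beautiful Soup so that it runs without extra packages, and the dictionary return value and the `tei_dir` parameter name are assumptions made for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET


def read_file(path):
    """Return the text of the first <title> element in one XML file, or None."""
    root = ET.parse(path).getroot()
    # "{*}" matches any namespace (Python 3.8+); TEI files normally declare one.
    title = root.find(".//{*}title")
    if title is None:
        return None
    # itertext() also collects text from child elements nested inside <title>.
    return "".join(title.itertext()).strip()


def read_tei(tei_dir):
    """Return {file name: title} for every .xml file directly inside tei_dir."""
    titles = {}
    # No os.chdir(): the pattern already includes tei_dir, so changing into the
    # directory first would make a relative path resolve to the wrong place.
    for path in sorted(glob.glob(os.path.join(tei_dir, "*.xml"))):
        titles[os.path.basename(path)] = read_file(path)
    return titles


if __name__ == "__main__":
    # Path taken from the question; adjust it to your own layout.
    print(read_tei("../input/tei-xml-files"))
```

The same structure should carry over to Beautiful Soup: have `read_file` return the title text rather than assigning `soup` and discarding it, and make sure `read_tei` is actually called.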