Q: etree error when parsing a page fetched with urllib

Stack Overflow user

Asked on 2015-12-05 15:01:41

1 answer, 707 views, 0 followers, 0 votes

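I am trying to parse an HTML table into Python (2.7) using the solutions from this post. When I try either of the first two methods on a string (as in the example), it works perfectly. But when I call etree.XML on a page that I read with urllib, I get an error. For each solution I checked that the variable I pass in is also a str. For the following code: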

from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)

I get this error:

  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
    table = etree.XML(s)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48

And for this code:

from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)

I get this error:

Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111

elementtree

python

python-2.7

html-parsing

1 Answer

Stack Overflow user

Accepted answer

Posted on 2015-12-06 18:11:30

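Although they may look like the same kind of markup, HTML is not as strict as XML, which must follow well-formedness rules (every opened node is closed, entities are escaped, and so on). As a result, an HTML page will often not pass as XML, which is why both etree.XML and ET.XML reject this page with a mismatched-tag error.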
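
To illustrate the difference, here is a minimal sketch that needs only the standard library's xml.etree.ElementTree. The markup is made up for this example (it is not the Box Office Mojo page), and parses_as_xml is a hypothetical helper name. The sample has an unclosed <link> inside <head>, which browsers accept but which is not well-formed XML.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample markup, not taken from the real page.
# <link> is never closed: fine for HTML, an error for an XML parser.
html_like = (
    "<html><head><link rel='stylesheet' href='a.css'></head>"
    "<body><p>Hi</p></body></html>"
)

# The same markup with <link/> self-closed, so it is well-formed XML.
xml_like = html_like.replace("href='a.css'>", "href='a.css'/>")


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError as err:
        # Same exception class as in the second traceback in the question.
        print("XML parser rejected the input:", err)
        return False


print(parses_as_xml(html_like))  # False
print(parses_as_xml(xml_like))   # True
```

The first call fails for the same reason as the question's page: the parser reaches </head> while <link> is still open.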

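So consider using lxml's etree.HTML() function to parse the page instead. You can then use XPath to target the specific area you want to extract or work with. Below is an example that attempts to pull the main table of the page. Note that the web page uses quite a few nested tables.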

from lxml import etree
# Python 3 import; on Python 2.7 use "import urllib" and urllib.urlopen instead
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))

# PARSE PAGE
htmlpage = etree.HTML(s)

# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")

for row in htmltable:
    print(row)
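
The XPath above returns a flat list of text nodes. If you would rather have the cells grouped by row, one possible approach is to iterate over the <tr> elements and join the text of each cell. The sketch below runs on a small made-up snippet, so the sample data, its markup, and the rows_from_table helper are all assumptions for illustration; the XPath for the real page would need adjusting to its nested tables.

```python
from lxml import etree

# Made-up sample, not the real page. The unquoted attribute and the
# unclosed <br> tags are not valid XML, but etree.HTML() tolerates them.
sample = """
<html><body>
<table border=0>
  <tr><td><b>Rank</b></td><td><b>Title</b></td></tr>
  <tr><td>1</td><td>Example Movie A<br></td></tr>
  <tr><td>2</td><td>Example Movie B<br></td></tr>
</table>
</body></html>
"""


def rows_from_table(page_source):
    """Return each row of the 'Rank' table as a list of cell strings."""
    root = etree.HTML(page_source)
    rows = []
    # Select the rows of the table whose cells include a bold 'Rank' header.
    for tr in root.xpath("//table[tr/td/b='Rank']/tr"):
        # itertext() gathers text from nested tags such as <b> inside a cell.
        cells = ["".join(td.itertext()).strip() for td in tr.xpath("./td")]
        rows.append(cells)
    return rows


for row in rows_from_table(sample):
    print(row)
```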

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/34106948
