Q: Reading a web page with Python

Stack Overflow user

Asked 2010-08-09 23:13:21

Answers: 2 | Views: 905 | Followers: 0 | Votes: 0

I am trying to read and process a web page in Python. The page contains lines like the following:

              <div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">

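For now I am only interested in the artist name (AC/DC) and the album name (Live). I can read and print them with libxml2dom, but I can't work out how to tell the links apart, because the node value of every link is None.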

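One obvious approach is to read the input one line at a time, but is there a smarter way to process this HTML file, so that I can build two separate lists in which each index matches the other, or build a single structure that holds this information?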

# Python 2 code (print statements, urllib.urlopen)
import urllib
import libxml2dom

def collect_text(node):
  "A function which collects text inside 'node', returning that text."

  s = ""
  for child_node in node.childNodes:
    if child_node.nodeType == child_node.TEXT_NODE:
      s += child_node.nodeValue
    else:
      s += collect_text(child_node)
  return s

# Read the saved page from disk
f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
f.close()

doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for link in links:
  print "--\nNode ", link.childNodes
  # localName is the tag name ("a"), so this comparison never matches
  if link.localName == "artist":
    print "artist"
  print collect_text(link).encode('utf-8')

python

libxml2

2 Answers

Stack Overflow user

Accepted answer

Posted 2010-08-10 00:19:45

Given only this small snippet of HTML, I don't know whether this will work on the whole page, but here is how to extract "AC/DC" and "Live" using lxml.etree and XPath.

>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']
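
To keep each artist matched with its album, as the question asks, the same XPath idea can be applied row by row instead of across the whole document. The following is a minimal sketch, not a tested solution for the real page: it assumes every row of interest holds one `or_q_artist` cell and one `or_q_album` cell as in the snippet, the `extract_pairs` helper and `SAMPLE` string are hypothetical names introduced here, and it is written for Python 3.

```python
from lxml import etree

# Hypothetical sample input, trimmed from the snippet in the question.
SAMPLE = """<html><body><table>
<tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
</tr>
</table></body></html>"""


def extract_pairs(html_text):
    """Return a list of (artist, album) tuples, one per matching table row."""
    doc = etree.HTML(html_text)
    pairs = []
    # Visit only rows that contain an artist cell.
    for row in doc.xpath('//tr[td[@class="or_q_artist"]]'):
        # string(...) yields the text of the first matching link, or "" if absent.
        artist = str(row.xpath('string(td[@class="or_q_artist"]/a)'))
        album = str(row.xpath('string(td[@class="or_q_album"]/a)'))
        pairs.append((artist, album))
    return pairs


pairs = extract_pairs(SAMPLE)
print(pairs)
```

Because each tuple is built from a single row, an artist cannot drift out of step with its album if some row is missing one of the two cells, which can happen with two independently built lists.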

Votes: 2

Stack Overflow user

Posted 2010-08-10 04:15:48

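See whether you can solve this in JavaScript, using jQuery-style DOM/CSS selectors to get the elements and text you want.
If you can get hold of a copy of BeautifulSoup for Python, you should be able to get the job done in a few minutes.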
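
As a rough illustration of the BeautifulSoup suggestion, the sketch below builds the two parallel lists the question mentions. It assumes the `bs4` package (BeautifulSoup 4) on Python 3 and that the links carry the `artist` and `album` classes shown in the snippet; `SAMPLE`, `artists`, and `albums` are hypothetical names introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, trimmed from the snippet in the question.
SAMPLE = """<table><tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
</tr></table>"""

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(SAMPLE, "html.parser")

# Select links by their class attribute rather than by node value.
artists = [a.get_text() for a in soup.find_all("a", class_="artist")]
albums = [a.get_text() for a in soup.find_all("a", class_="album")]

print(artists)
print(albums)
```

The two lists line up index by index only if every row has both an artist link and an album link; if that is not guaranteed, collecting the values per row is the safer choice.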

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/3441447
