Q: Scraping an HTML page

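Stack Overflow user

Asked 2015-02-05 22:45:33

Answers: 2 · Views: 92 · Followers: 0 · Votes: 0

I want to get the title, year, rating, genres, and runtime of five movies from the HTML page given in the code. They are in the rows of a table named results.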

from bs4 import BeautifulSoup
import urllib2

def read_from_url(url, num_m=5):
    html_string =  urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results')  # table of movie
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii","ignore")     # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii","ignore")     # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii","ignore")     # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count+=1
        if count==num_m:
            break
    return list_movies

print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015',2)

Expected output:

[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]

web-crawler

beautifulsoup

python

web-scraping

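2 Answers

Stack Overflow user

Answered 2015-02-05 22:58:42

You are accessing a variable that has not been defined yet. When the interpreter sees title.encode("ascii", "ignore"), it looks up the variable title, which has not been assigned anywhere before that line. Python has no way of knowing what title is, so you cannot call encode on it. The same goes for year and rank. Use this instead: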

title = 'How to Beat a Bully'.encode('ascii','ignore')
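To go beyond a hard-coded title, each value has to be read from the current row before it is used. Below is a minimal sketch of that idea, assuming Python 3 and a made-up table layout: the cell class names (`rank`, `title`, `year`, `runtime`), the second row, and the helper name `read_movies` are hypothetical and do not reflect the real page's markup. The `.encode("ascii", "ignore")` calls are left out because in Python 3 they would return bytes rather than text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page. The cell class names
# and the second row are assumptions made for this sketch only.
SAMPLE_HTML = """
<table class="results">
  <tr>
    <td class="rank">1</td><td class="title">How to Beat a Bully</td>
    <td class="year">2014</td><td class="runtime">90</td>
  </tr>
  <tr>
    <td class="rank">2</td><td class="title">Placeholder Movie</td>
    <td class="year">2010</td><td class="runtime">120</td>
  </tr>
</table>
"""

def read_movies(html, num_m=5):
    """Return up to num_m dicts, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    movie_table = soup.find("table", class_="results")
    movies = []
    for row in movie_table.find_all("tr"):
        if len(movies) == num_m:
            break
        movie = {}
        for field in ("rank", "title", "year", "runtime"):
            # Look the value up in the row first, then store it.
            cell = row.find("td", class_=field)
            movie[field] = cell.get_text(strip=True) if cell else None
        movies.append(movie)
    return movies

movies = read_movies(SAMPLE_HTML, num_m=2)
print(movies)
```

For a real page, the lookups inside the loop would need to match whatever tags and classes that page actually uses.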

Votes: 1

Stack Overflow user

Answered 2015-02-06 22:33:02

Why?

Use CSS selectors to make your life easier.

<table>
 <tr class="my_class">
  <td id="id_here">

     <a href="link_here">First Link</a>

  </td>
  <td id="id_here">

     <a href="link_here">Second Link</a>

  </td>
 </tr>
</table>

for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
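The selector loop can be tried end to end on a small document. This is a minimal sketch, assuming Python 3 and the built-in `html.parser`. Because an id is supposed to be unique within a document, it swaps the repeated id for a hypothetical class name, `link_cell`, and the `href` values are placeholders too.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; class names and hrefs are placeholders.
SAMPLE_HTML = """
<table>
 <tr class="my_class">
  <td class="link_cell"><a href="/first">First Link</a></td>
  <td class="link_cell"><a href="/second">Second Link</a></td>
 </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
for tr in soup.select("tr.my_class"):
    for td in tr.select("td.link_cell"):
        a = td.select_one("a")  # first <a> inside the cell, or None
        if a is not None:
            links.append((a["href"], a.get_text(strip=True)))

for href, text in links:
    print("Link " + href)
    print("Text " + text)
```

`select_one` returns `None` when nothing matches, so the explicit check avoids the `IndexError` that `select("a")[0]` would raise on a cell without a link.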

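Votes: 1

The original content of this page is provided by Stack Overflow. Translation support is provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link: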

https://stackoverflow.com/questions/28346884
