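I want to get the title, year, rating, genres, and runtime of five movies from the HTML page referenced in the code. They are in the rows of a table with the class results.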
from bs4 import BeautifulSoup
import urllib2

def read_from_url(url, num_m=5):
    html_string = urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results')  # table of movies
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii", "ignore")  # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii", "ignore")  # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii", "ignore")  # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_m:
            break
    return list_movies
print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015', 2)

Expected output:

[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]

Posted on 2015-02-05 22:58:42
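You are using a variable that has not been assigned yet. When the interpreter reaches title.encode("ascii", "ignore"), it looks up a variable named title that was never given a value. Python has no way of knowing what title is, so you cannot call encode on it, and it raises a NameError (an UnboundLocalError here, because title is assigned inside the function). The same applies to year, rank, and runtime. Assign a value first, for example:

title = 'How to Beat a Bully'.encode('ascii', 'ignore')

Posted on 2015-02-06 22:33:02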
Why?
Use CSS selectors to make your life easier. For example, given markup like this:
<table>
  <tr class="my_class">
    <td id="id_here">
      <a href="link_here">First Link</a>
    </td>
    <td id="id_here">
      <a href="link_here">Second Link</a>
    </td>
  </tr>
</table>
for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)

https://stackoverflow.com/questions/28346884
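Putting the two answers together, each field has to be read out of the current row before it is stored, which avoids the unassigned-variable error in the question. The following is a minimal sketch rather than a definitive solution: the markup in HTML and the class names movie, title, year, and runtime are assumptions made up for illustration (IMDb's real page structure is not shown in the question), read_movies is a hypothetical helper, and the second row is placeholder data. It assumes the beautifulsoup4 package is installed and parses an inline string instead of fetching the URL.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class names are assumptions for illustration,
# not IMDb's real page structure. The second row is placeholder data.
HTML = """
<table class="results">
  <tr class="movie">
    <td class="title"><a href="link_here">How to Beat a Bully</a></td>
    <td class="year">2014</td>
    <td class="runtime">90</td>
  </tr>
  <tr class="movie">
    <td class="title"><a href="link_here">Second Movie</a></td>
    <td class="year">2013</td>
    <td class="runtime">100</td>
  </tr>
</table>
"""


def read_movies(html, limit=5):
    """Return at most `limit` movies as dicts (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # select() returns a list, so slicing caps the number of rows processed.
    for row in soup.select("table.results tr.movie")[:limit]:
        # Each value is assigned from the row before it is used.
        title = row.select("td.title a")[0].get_text(strip=True)
        year = row.select("td.year")[0].get_text(strip=True)
        runtime = row.select("td.runtime")[0].get_text(strip=True)
        movies.append({"title": title, "year": year, "runtime": runtime})
    return movies


print(read_movies(HTML, limit=2))
```

Slicing the list returned by select() replaces the manual counter from the question. For a live page, the inline string could be swapped for the downloaded response body once the real selectors are known.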