Web scraping IMDb with BeautifulSoup

Stack Overflow user
Asked 2015-03-07 06:14:27
Answers: 2, Views: 2.8K, Followers: 0, Votes: 4

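I'm new to web scraping, Python, and BeautifulSoup, and I'm having trouble getting my code to work.

I want to scrape the URL http://m.imdb.com/feature/bornondate to get the following:

  • Celebrity name
  • Celebrity image
  • Profession
  • Best work

for the ten celebrities on that page. I don't know what I'm doing wrong.

Here is my code: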

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

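Currently nothing is printed. Also, if this is too vague, could you explain how to do this for just the celebrity names, for example?

Here is the HTML for the first celebrity, in case it helps: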

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

2 Answers

Stack Overflow user

Accepted answer

Posted 2015-03-07 06:40:22

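First of all, the IMDb "Conditions of Use" explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools, except with our express written consent as noted below.

Try exploring the IMDb JSON API rather than a web-scraping approach.

Your current problem is that the list of people born on a given date is loaded by a separate call to the IMDb API, with some JavaScript logic involved. The HTML that urllib2 downloads therefore does not contain the posters yet.

The easiest option now is to switch to the selenium browser automation tool. Here is a working example using the headless PhantomJS browser: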

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
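
The question also asks for the profession and the best work as separate values, while the `div.detail` text above combines them (for example `Actor, "Ozymandias"`). A minimal sketch of one way to split them, assuming every detail string has the form `Profession, "Title"` as in the output above. `split_detail` is a hypothetical helper name, and the sketch is written for Python 3, unlike the Python 2 code in this thread:

```python
def split_detail(detail):
    """Split a detail string such as 'Actor, "Ozymandias"' into two parts.

    Assumption: the profession comes first and is separated from the quoted
    title by the first ', ' in the string.
    """
    # partition() splits on the first separator only, so commas inside the
    # title itself are left untouched.
    profession, _, best_work = detail.partition(', ')
    return profession.strip(), best_work.strip().strip('"')


if __name__ == '__main__':
    # Sample strings copied from the printed output above.
    for detail in ['Actor, "Ozymandias"', 'Producer, "Kick-Ass"']:
        profession, best_work = split_detail(detail)
        print(profession, '|', best_work)
```

If a profession on the page ever contains a comma itself, this split would be wrong, so it is worth checking the real `div.detail` text before relying on it.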
Votes: 4

Stack Overflow user

Posted 2015-08-26 13:58:34

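I was working on the same task. The urllib library loads only the static content of the page. Use selenium to get the full HTML, which also includes the dynamic content. If you use the urllib2 library, the resulting HTML will be
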
<span class="loading"></span>
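
A quick way to confirm this is to count the poster links in whatever HTML you downloaded. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, it is written for Python 3, and `PosterCounter` and `count_posters` are hypothetical names. The `rendered_html` sample is abbreviated from the HTML in the question.

```python
from html.parser import HTMLParser


class PosterCounter(HTMLParser):
    """Count <a> tags whose class list includes 'poster'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # The class attribute may be missing or have no value.
        classes = (dict(attrs).get('class') or '').split()
        if 'poster' in classes:
            self.count += 1


def count_posters(html):
    """Return how many poster links appear in the given HTML string."""
    parser = PosterCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# What a static download contains, according to this answer.
static_html = '<span class="loading"></span>'

# Abbreviated from the rendered HTML shown in the question.
rendered_html = (
    '<section class="posters list">'
    '<a href="/name/nm0186505/" class="poster ">'
    '<div class="label"><span class="title">Bryan Cranston</span></div></a>'
    '</section>'
)

print(count_posters(static_html))
print(count_posters(rendered_html))
```

A count of zero for the downloaded page suggests the posters are being filled in later by JavaScript, which would explain why the loop in the question prints nothing.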

Hope this helps.

Votes: 0

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/28912004
