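I want to build an end-to-end sentiment analysis project, starting with data collection. To that end I am starting with IMDB reviews, specifically from this page: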
http://www.imdb.com/title/tt2137109/reviews?start=0
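I am going to use scrapy for this. With the following code I can get the reviews and the title:

    import requests
    from scrapy.http import TextResponse
    from urllib.parse import urljoin  # Python 3 location of urljoin

    base_url = "http://www.imdb.com/title/tt2137109/reviews?start=0"
    r = requests.get(base_url)
    # Wrap the downloaded page so scrapy's XPath selectors can be used on it.
    response = TextResponse(r.url, body=r.text, encoding='utf-8')
    title = response.xpath('//*[contains(@id,"title")]//text()').re('".+"')[0]
    reviews = response.xpath('//*[contains(@id,"1")]/p/text()').extract()

The problem I am facing is: how do I crawl the site to get a random sample? I am looking for a sample of 10k titles, which I plan to collect over 5-10 days to avoid hitting the site unnecessarily and getting banned.

There are some starting points, such as the Top 250 list, but I am looking for a random sample.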
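For a sense of scale, 10k titles spread over 5-10 days works out to roughly one request every 43 to 86 seconds if each title costs a single request. Below is a minimal sketch of that arithmetic. The helper names `base_delay` and `jittered_delay` are made up for this illustration, and the 25% jitter is an arbitrary assumption, not something IMDb recommends.

```python
import random

SECONDS_PER_DAY = 24 * 60 * 60


def base_delay(n_requests, total_days):
    """Average pause in seconds so n_requests are spread over total_days."""
    return total_days * SECONDS_PER_DAY / n_requests


def jittered_delay(base, jitter=0.25, rng=random):
    """Pause near `base`, varied by +/- `jitter` so requests are not evenly spaced."""
    return base * rng.uniform(1 - jitter, 1 + jitter)


if __name__ == "__main__":
    for days in (5, 10):
        print(days, "days:", base_delay(10_000, days), "seconds between requests")
```

In a crawl loop, `time.sleep(jittered_delay(base_delay(10_000, 7)))` between requests would keep the pace inside that window.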
Posted on 2020-01-21 21:34:51
**#All U.S. Released Movies: 1972-2016 #**
    from requests import get
    from bs4 import BeautifulSoup
    import pandas as pd

    df = pd.DataFrame()
    # Walk the paginated list, one request per page.
    for f in range(4, 101):
        print(f)
        url = "https://www.imdb.com/list/ls057823854/?st_dt=&mode=detail&page=" + str(f) + "&sort=release_date,desc"
        response = get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        movie_containers = html_soup.find_all('div', class_='lister-item mode-detail')

        # Per-page columns, collected as parallel lists.
        names = []
        years = []
        imdb_ratings = []
        metascores = []
        votes = []
        plots = []
        genre = []
        lengths = []
        collections = []
        directors = []

        for container in movie_containers:
            # Keep only movies that have a Metascore.
            if container.find('div', class_='ratings-metascore') is not None:
                name = container.h3.a.text
                names.append(name)
                imdb = container.find('span', class_='ipl-rating-star__rating').text
                imdb_ratings.append(imdb)
                year = container.h3.find('span', class_='lister-item-year').text
                years.append(year)
                m_score = container.find('span', class_='metascore').text
                metascores.append(int(m_score))
                # The first 'nv' span holds the votes; a second one, if present, holds the revenue.
                b = container.find_all('span', attrs={'name': 'nv'})
                vote = b[0].text
                votes.append(vote)
                if len(b) == 2:
                    collection = b[1].text
                    collections.append(collection)
                else:
                    collections.append('0')
                par = container.find_all('p')
                genre_text = par[0].find('span', class_='genre').text
                genre.append(genre_text)
                length = par[0].find('span', class_='runtime').text
                lengths.append(length)
                plot = par[1].text
                plots.append(plot)
                # The third paragraph lists the director(s) and stars together.
                stars_director = container.find_all('p')[2].text
                directors.append(stars_director)

        test_df = pd.DataFrame({'movie': names,
                                'year': years,
                                'imdb': imdb_ratings,
                                'metascore': metascores,
                                'votes': votes,
                                'Plot': plots,
                                'genre': genre,
                                'duration': lengths,
                                'revenue': collections,
                                'directors': directors})
        df = pd.concat([df, test_df])
        # Save a checkpoint every 10 pages.
        if f % 10 == 0:
            df.to_csv(str(f) + "page.csv")

Posted on 2020-01-21 21:47:53
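Here is one possible approach I can think of: download the compressed file named title.akas.tsv.gz. It contains all the titles of the movies in the IMDb database. Write a function that selects random titles from the file, then loop over the resulting list by changing your code to:

    import requests
    from scrapy.http import TextResponse
    from urllib.parse import urljoin  # Python 3 location of urljoin

    def random_list():
        # This should read the file and return random title IDs as a list when called.
        raise NotImplementedError("read title.akas.tsv.gz and return random title IDs")

    for i in random_list():
        base_url = f"http://www.imdb.com/title/{i}/reviews?start=0"
        r = requests.get(base_url)
        response = TextResponse(r.url, body=r.text, encoding='utf-8')
        title = response.xpath('//*[contains(@id,"title")]//text()').re('".+"')[0]
        reviews = response.xpath('//*[contains(@id,"1")]/p/text()').extract()

The IMDb documentation describes this file.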
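One way `random_list` could be filled in is reservoir sampling, which picks a uniform random sample in a single pass without loading the whole file into memory. This is only a sketch under some assumptions: that the first tab-separated column of title.akas.tsv.gz is the title ID (for example `tt2137109`), that the file starts with a header row, and that all rows for one title sit next to each other. The helper `reservoir_sample_ids` and the `path`, `k`, and `seed` parameters are names invented for this illustration.

```python
import gzip
import random


def reservoir_sample_ids(rows, k, rng=None):
    """Return up to k distinct title IDs chosen uniformly at random.

    rows: iterable of tab-separated lines whose first field is the title ID.
    Assumes rows for the same title are adjacent, so comparing with the
    previous ID is enough to skip duplicates.
    """
    rng = rng or random.Random()
    sample = []
    seen = 0          # number of distinct IDs seen so far
    previous = None
    for line in rows:
        title_id = line.split("\t", 1)[0].strip()
        if not title_id or title_id == previous:
            continue
        previous = title_id
        seen += 1
        if len(sample) < k:
            # Fill the reservoir with the first k distinct IDs.
            sample.append(title_id)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = title_id
    return sample


def random_list(path="title.akas.tsv.gz", k=10_000, seed=None):
    """Sample k title IDs from the gzipped TSV file at `path` (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        next(fh, None)  # skip the header row
        return reservoir_sample_ids(fh, k, random.Random(seed))
```

With the file downloaded next to the script, `for i in random_list():` would then iterate over the sampled IDs. Passing a fixed `seed` makes the sample reproducible across runs.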
https://stackoverflow.com/questions/44580538