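I'm trying to scrape IMDb to get a list of the top 1000 movies along with some details about each one. However, when I run the code, instead of fetching the first 50 movies and then moving on to the next page, it repeats the same page on every iteration and inserts the same 50 entries into my DataFrame 20 times.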
import pandas as pd
import requests
from bs4 import BeautifulSoup as bsp  # alias assumed from the bsp(...) call below

# Dataframe template
data = pd.DataFrame(columns=['ID','Title','Genre','Summary'])

#Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

#Run for top 1000
for start in range(1,1001,50):
    getPageContent(start)
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = movie.find('span', 'genre').contents[0]
        genres = [g.strip() for g in genres.split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        i = data.shape[0]
        data.loc[i] = [id,title,genres,summary]

#Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
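data.head(51)

     ID  Title                     Genre                      Summary
0    1.  The Shawshank Redemption  Drama                      \nTwo imprisoned men bond over a number of ye...
1    2.  The Dark Knight           Action, Crime, Drama       \nWhen the menace known as the Joker ...
2    3.  Inception                 Action, Adventure          \nA thief who steals corporate secrets ...
3    4.  Fight Club                Drama                      \nAn insomniac office worker and a devil...
...
46  47.  The Usual Suspects        Crime, Drama, Mystery      \nA sole survivor tells of the twisty events ...
47  48.  The Truman Show           Comedy, Drama              \nAn insurance salesman discovers his whole l...
48  49.  Avengers: Infinity War    Action, Adventure, Sci-Fi  \nThe Avengers and their allies must be willi...
49  50.  Iron Man                  Action, Adventure, Sci-Fi  \nAfter being held captive in an Afghan cave ...
50   1.  The Shawshank Redemption  Drama                      \nTwo imprisoned men bond over a number of ye...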
Posted on 2022-11-01 07:16:09
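Remove the 'start = 1' assignment inside the 'getPageContent' function. It resets 'start' to 1 on every call, so the value passed in by the loop is ignored and the first page is fetched every time.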
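The effect of that reassignment can be seen in isolation. The sketch below uses two hypothetical stand-in functions, buggy_offset and fixed_offset (neither is part of the original code), that simply return the offset that would be appended to the URL:

```python
def buggy_offset(start=1):
    start = 1  # overwrites whatever the caller passed in
    return start


def fixed_offset(start=1):
    return start  # the caller's value is used as-is


offsets = list(range(1, 152, 50))  # first four page offsets: 1, 51, 101, 151
print([buggy_offset(s) for s in offsets])  # [1, 1, 1, 1]
print([fixed_offset(s) for s in offsets])  # [1, 51, 101, 151]
```

Note also that the loop in the question calls getPageContent(start) without storing the result. For each page to be parsed, the return value needs to be assigned, for example bs = getPageContent(start).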
#Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

Posted on 2022-11-01 03:14:00
I can't test this code. See the inline comments for what I think is the main issue.
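For reference when reading the inline comment about the step value: assuming each results page holds 50 titles and the start parameter is the rank of the first title on that page (which is consistent with the 50 repeated rows described in the question), range(1, 1001, 50) already yields one offset per page. A quick check in plain Python, with no network access:

```python
page_offsets = list(range(1, 1001, 50))

print(len(page_offsets))   # 20 pages
print(page_offsets[:3])    # [1, 51, 101]
print(page_offsets[-1])    # 951, so the last page covers ranks 951-1000

# Number of calls made by a nested loop of the form
#   for group in range(0, 1001, 50):
#       for item in range(group, group + 50):
nested_calls = sum(1 for group in range(0, 1001, 50)
                   for item in range(group, group + 50))
print(nested_calls)        # 1050, one request per item
```

Under those assumptions, a single loop over page_offsets, combined with removing 'start = 1' from the function, would request each page exactly once.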
# Dataframe template
data = pd.DataFrame(columns=['ID', 'Title', 'Genre', 'Summary'])

# Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

# Run for top 1000
# for start in range(1, 1001, 50): # 50 is a
# step value so this gets every 50th movie
# Try 2 loops
start = 0
for group in range(0, 1001, 50):
    for item in range(group, group + 50):
        getPageContent(item)
        movies = bs.findAll("div", "lister-item-content")
        for movie in movies:
            id = movie.find("span", "lister-item-index").contents[0]
            title = movie.find('a').contents[0]
            genres = movie.find('span', 'genre').contents[0]
            genres = [g.strip() for g in genres.split(',')]
            summary = movie.find("p", "text-muted").find_next_sibling("p").contents
            i = data.shape[0]
            data.loc[i] = [id, title, genres, summary]

# Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)

https://stackoverflow.com/questions/74270977