I'm using https://www.realtor.com/realestateagents/phoenix_az//pg-2 as my starting point. I want to go from page 2 through page 5, and every page in between, collecting names and numbers along the way. I'm collecting the information from page 2 perfectly, but I can't get the scraper to move on to the next page without pasting in a new URL. I've been trying to set up a loop to do this automatically, but after writing what I thought was a loop, I still only get the information from page 2 (the starting point) before the scraper stops. I'm new to this and have tried several approaches, none of which have worked.
Below is the full code as it stands.
import requests
from requests import get
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import numpy as np
from numpy import arange
import pandas as pd
from time import sleep
from random import randint

headers = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}

my_url = 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'

#opening up connection, grabbing the page
uClient = uReq(my_url)
#read page
page_html = uClient.read()
#close page
uClient.close()

pages = np.arange(2, 3, 1)

for page in pages:
    page = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-", headers=headers)

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #finds all realtors on page
    containers = page_soup.findAll("div", {"class": "agent-list-card clearfix"})

    #creating csv file
    filename = "phoenix.csv"
    f = open(filename, "w")
    headers = "agent_name, agent_number\n"
    f.write(headers)

    #controlling scrape speed
    for container in containers:
        try:
            name = container.find('div', class_='agent-name text-bold')
            agent_name = name.a.text.strip()
        except AttributeError:
            print("-")
        try:
            number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
            agent_number = number.text.strip()
        except AttributeError:
            print("-")
        except NameError:
            print("-")
        try:
            print("name: " + agent_name)
            print("number: " + agent_number)
        except NameError:
            print("-")
        try:
            f.write(agent_name + "," + agent_number + "\n")
        except NameError:
            print("-")

f.close()
I'm not sure if this is exactly what you need, but here is working (and simplified) code based on your example that scrapes the first five pages.
If you look closely, I'm using a for loop to "move" through the pages by appending the page number to the URL. For each page I fetch the HTML, parse it for the agent divs, grab the name and number (adding N/A if the number is None), and finally dump the collected list to a csv file.
EDIT: To address the comments, I added the city (Phoenix) to each row and a wait_for feature that pauses the script for anywhere from 1 to 10 seconds between pages; this is adjustable.
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

realtor_data = []

for page in range(1, 6):
    print(f"Scraping page {page}...")
    url = f"https://www.realtor.com/realestateagents/phoenix_az/pg-{page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for agent_card in soup.find_all("div", {"class": "agent-list-card clearfix"}):
        name = agent_card.find("div", {"class": "agent-name text-bold"}).find("a")
        number = agent_card.find("div", {"itemprop": "telephone"})
        realtor_data.append(
            [
                name.getText().strip(),
                number.getText().strip() if number is not None else "N/A",
                "Phoenix",
            ],
        )
    wait_for = random.randint(1, 10)
    print(f"Sleeping for {wait_for} seconds...")
    time.sleep(wait_for)

with open("data.csv", "w") as output:
    w = csv.writer(output)
    w.writerow(["NAME:", "PHONE NUMBER:", "CITY:"])
    w.writerows(realtor_data)

Output:
A .csv file containing the realtors' names and phone numbers.
NAME:                     PHONE NUMBER:   CITY:
------------------------  --------------  -------
Shawn Rogers              (480) 313-7031  Phoenix
The Jason Mitchell Group  (480) 470-1993  Phoenix
Kyle Caldwell             (602) 390-2245  Phoenix
THE VALENTINE GROUP       N/A             Phoenix
Nancy Wolfe               (602) 418-1010  Phoenix
Rhonda DuBois             (623) 418-2970  Phoenix
Sabrina Hurley            (602) 410-1985  Phoenix
Bryan Adams               (480) 375-1292  Phoenix
DeAnn Fry                 (623) 748-3818  Phoenix
Esther P Goh              (480) 703-3836  Phoenix
...
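As an aside, since the question already imports pandas, the collected rows could also be dumped with a DataFrame instead of the csv module. A minimal sketch, assuming realtor_data has been filled by the loop above:

import pandas as pd

#realtor_data holds [name, number, city] rows collected by the loop above
df = pd.DataFrame(realtor_data, columns=["NAME:", "PHONE NUMBER:", "CITY:"])
#index=False writes the same columns as the csv.writer version, without a row index
df.to_csv("data.csv", index=False)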
You have to actually move between the pages:
page_html = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" + str(page), headers=headers)
#html parsing
page_soup = soup(page_html.text, "html.parser")
Also, the variable name was wrong: it should be page_html. Otherwise the loop keeps re-parsing the page_html you fetched once before the loop, which is why you only ever get page 2.
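Applied to the question's loop, a minimal sketch of the corrected version might look like the following. It reuses the question's selectors, fetches pages 2 through 5 with range(2, 6), and guards against missing fields with None checks instead of try/except; the headers dict is the one from the question, with the user-agent shortened here.

import requests
from bs4 import BeautifulSoup as soup

#same headers dict as in the question (user-agent abbreviated)
headers = {'user-agent': 'Mozilla/5.0',
           'referer': 'https://www.realtor.com/realestateagents/phoenix_az//pg-2'}

filename = "phoenix.csv"
with open(filename, "w") as f:
    f.write("agent_name,agent_number\n")
    #pages 2 through 5, as asked in the question
    for page in range(2, 6):
        #append the page number so each request fetches a different page
        page_html = requests.get("https://www.realtor.com/realestateagents/phoenix_az//pg-" + str(page), headers=headers)
        #parse the page we just fetched, not the one from before the loop
        page_soup = soup(page_html.text, "html.parser")
        #finds all realtors on the current page
        for container in page_soup.findAll("div", {"class": "agent-list-card clearfix"}):
            name = container.find('div', class_='agent-name text-bold')
            number = container.find('div', class_='agent-phone hidden-xs hidden-xxs')
            #skip cards missing either field instead of relying on try/except
            if name is not None and name.a is not None and number is not None:
                f.write(name.a.text.strip() + "," + number.text.strip() + "\n")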
https://stackoverflow.com/questions/64144929