我正在尝试探索zillow的住房数据进行分析。但我发现,我从Zillow抓取的数据比列表中的数据要少得多。
这里有一个例子:
我们可以看到有76条记录。如果我使用google chrome扩展: Zillow-to-excel,列表中的所有76个房子都可以被抓取。https://chrome.google.com/webstore/detail/zillow-to-excel/aecdekdgjlncaadbdiciepplaobhcjgi/related
但是当我使用Python请求抓取zillow数据时,只能抓取18-20条记录。下面是我的代码:
import requests
import json
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
cnt=0
stop_check=0
ele=[]
url='https://www.zillow.com/birmingham-al-35216/'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6',
'upgrade-insecure-requests': '1',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i in range(1,2):
params = {
'searchQueryState':'{"pagination":{"currentPage":'+str(i)+'},"usersSearchTerm":"35216","mapBounds":{"west":-86.83314614582643,"east":-86.73781685417354,"south":33.32843303639682,"north":33.511017584543204},"regionSelection":[{"regionId":73386,"regionType":7}],"isMapVisible":true,"filterState":{"sort":{"value":"globalrelevanceex"},"ah":{"value":true}},"isListVisible":true,"mapZoom":13}'
}
page=requests.get(url, headers=headers,params=params,timeout=2)
sp=soup(page.content, 'lxml')
lst=sp.find_all('address',{'class':'list-card-addr'})
ele.extend(lst)
print(i, len(lst))
if len(lst)==0:
stop_check+=1
if stop_check>=3:
print('stop on three empty')Headers和params来自使用chrome开发工具的web。我还尝试了其他搜索,发现我只能在每个页面上刮前9-11条记录。
我知道有一个zillow API,但它可以用于一般搜索,就像zipcode中的所有房屋一样。所以我想试试网络抓取。
我可以有一些建议如何修复我的代码吗?
非常感谢!
发布于 2021-07-19 18:09:51
你可以试一下
import requests
import json
url = 'https://www.zillow.com/search/GetSearchPageState.htm'
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'upgrade-insecure-requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
houses = []
for page in range(1, 3):
params = {
"searchQueryState": json.dumps({
"pagination": {"currentPage": page},
"usersSearchTerm": "35216",
"mapBounds": {
"west": -86.97413567189196,
"east": -86.57244804982165,
"south": 33.346263857015515,
"north": 33.48754107532057
},
"mapZoom": 12,
"regionSelection": [
{
"regionId": 73386, "regionType": 7
}
],
"isMapVisible": True,
"filterState": {
"isAllHomes": {
"value": True
},
"sortSelection": {
"value": "globalrelevanceex"
}
},
"isListVisible": True
}),
"wants": json.dumps(
{
"cat1": ["listResults", "mapResults"],
"cat2": ["total"]
}
),
"requestId": 3
}
# send request
page = requests.get(url, headers=headers, params=params)
# get json data
json_data = page.json()
# loop via data
for house in json_data['cat1']['searchResults']['listResults']:
houses.append(house)
# show data
print('Total houses - {}'.format(len(houses)))
# show info in houses
for house in houses:
if 'brokerName' in house.keys():
print('{}: {}'.format(house['brokerName'], house['price']))
else:
print('No broker: {}'.format(house['price']))Total houses - 76
RealtySouth-MB-Crestline: $424,900
eXp Realty, LLC Central: $259,900
ARC Realty Mountain Brook: $849,000
Ray & Poynor Properties: $499,900
Hinge Realty: $1,550,000
...附注:如果我帮助您,请不要忘记将答案标记为正确:)
https://stackoverflow.com/questions/68430574
复制相似问题