I am trying to scrape a list of 600+ real-estate listings from a property site. Name, price, area and value per m² are mandatory and present on every page, so they are easy to scrape. But other features, such as number of rooms, suites, garage spaces and tax, are optional, so soup.findAll('h6', class_='mb-0 text-normal') returns elements whose count and order vary from listing to listing.
I tried to create keys and values to store them in the data dictionary, but it fails when I reach k2 and v2, probably because some listings have only one optional feature. I am thinking of using len(soup.findAll('h6', class_='mb-0 text-normal')) to build a condition that only adds the optional features that exist, but I have not managed to make it work:
import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1, 40):
    r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('li', class_='property-list__item')
    for item in productlist:
        for link in item.find_all('meta', itemprop='url'):
            productlinks.append(baseurl + link['content'])

for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find_all('h1', class_='mb-0 font-weight-600 fs-1-5')[0].text.strip()
    price = soup.find_all('small', class_='display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_='display-5 text-warning')[0].text.replace('m²', '').strip()
    valueperm2 = soup.find_all('small', class_='display-5 text-warning')[1].text.strip()
    k1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[0]
    v1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[1].strip()
    k2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[0]
    v2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[1].strip()
    data = {'name': name,
            'price': price,
            'area': area,
            'valueperm2': valueperm2,
            k1: v1,
            k2: v2
            }

Then I get this output:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-74-6ee7d6edeb81> in <module>
9 v1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n ','').strip().split(':')[1].strip()
10 k2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[0]
---> 11 v2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[1].strip()
12
13 ap = {'name':name,
IndexError: list index out of range

Posted on 2022-07-31 03:55:07
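The traceback can be reproduced in isolation: when a listing has only one optional feature, find_all returns a one-element list, and indexing position 1 raises the same IndexError. The HTML snippet below is invented for illustration (it only mimics the site's markup):

```python
from bs4 import BeautifulSoup

# A hypothetical listing page with only ONE optional <h6> feature
html = '<h6 class="mb-0 text-normal">Quartos: 2</h6>'
soup = BeautifulSoup(html, 'html.parser')

results = soup.find_all('h6', class_='mb-0 text-normal')
print(len(results))   # only one element, so results[1] would raise IndexError
```

Because the number of optional features varies per listing, any fixed index beyond 0 is unsafe without a length check.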
I tried to run your code, but I could not reproduce the problem because I do not have access to the pages behind baseurl.
However, you should be able to check the length of soup.findAll('h6', class_='mb-0 text-normal') before assigning individual items of that list to the k1, v1, k2, v2 (etc.) variables.
For example,
results = soup.findAll('h6', class_='mb-0 text-normal')
if len(results) >= 2:
    v1 = results[0].text.replace('\r\n ', '').strip().split(':')[1].strip()
    k2 = results[1].text.replace('\r\n ', '').strip().split(':')[0]
    v2 = results[1].text.replace('\r\n ', '').strip().split(':')[1].strip()

You may need to reorder or modify this to match the specific logic you are implementing, but code along these lines should work.
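The length check above can be sketched end to end on a small inline document. The two HTML snippets here are invented to stand in for a listing with two optional features and a listing with only one:

```python
from bs4 import BeautifulSoup

# Hypothetical listings: one with two optional features, one with a single feature
full = ('<h6 class="mb-0 text-normal">Quartos: 2</h6>'
        '<h6 class="mb-0 text-normal">Vagas: 1</h6>')
partial = '<h6 class="mb-0 text-normal">Quartos: 2</h6>'

for html in (full, partial):
    soup = BeautifulSoup(html, 'html.parser')
    results = soup.find_all('h6', class_='mb-0 text-normal')
    k1 = v1 = k2 = v2 = None          # defaults for missing features
    if len(results) >= 1:
        k1, v1 = (s.strip() for s in results[0].text.split(':'))
    if len(results) >= 2:
        k2, v2 = (s.strip() for s in results[1].text.split(':'))
    print(k1, v1, k2, v2)
```

The partial listing leaves k2 and v2 as None instead of raising IndexError, so the surrounding loop keeps running.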
Posted on 2022-07-31 05:00:51
This error occurs because some listings have fewer optional <h6> elements than the fixed indexes in your code assume.
A standalone function that extracts the dynamic fields from the page would be a better option (it avoids code duplication and unnecessary errors):
def dynamic_portion(soup):
    temp_data = {}
    for item in soup.findAll('h6', class_='mb-0 text-normal'):
        item = item.text.split(':')
        if len(item) == 2:
            key, val = map(str.strip, item)
            temp_data[key] = val
    return temp_data

You can integrate it into your code as follows:
productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1, 40):
    r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('li', class_='property-list__item')
    for item in productlist:
        for link in item.find_all('meta', itemprop='url'):
            productlinks.append(baseurl + link['content'])

for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find_all('h1', class_='mb-0 font-weight-600 fs-1-5')[0].text.strip()
    value = 1  # placeholder
    price = soup.find_all('small', class_='display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_='display-5 text-warning')[0].text.replace('m²', '').strip()
    valueperm2 = soup.find_all('small', class_='display-5 text-warning')[1].text.strip()
    data = {'name': name,
            'value': value,
            'area': area,
            'valueperm2': valueperm2
            }
    temp_data = dynamic_portion(soup)
    data.update(temp_data)
    break  # stop after the first link while testing

https://stackoverflow.com/questions/73180509
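To see what dynamic_portion produces, it can be run against a small inline document. The HTML here is invented to mimic the site's markup, including an element without a ':' separator, which the length check inside the function silently skips:

```python
from bs4 import BeautifulSoup

def dynamic_portion(soup):
    temp_data = {}
    for item in soup.findAll('h6', class_='mb-0 text-normal'):
        item = item.text.split(':')
        if len(item) == 2:
            key, val = map(str.strip, item)
            temp_data[key] = val
    return temp_data

# Hypothetical listing with two well-formed features and one malformed element
html = ('<h6 class="mb-0 text-normal">Quartos: 3</h6>'
        '<h6 class="mb-0 text-normal">Vagas na garagem: 1</h6>'
        '<h6 class="mb-0 text-normal">Sem separador</h6>')
soup = BeautifulSoup(html, 'html.parser')
print(dynamic_portion(soup))  # {'Quartos': '3', 'Vagas na garagem': '1'}
```

Whatever number of optional features a listing has, the returned dict contains exactly those, so data.update(temp_data) never hits an IndexError.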