My original question was:
"I have created a web scraper that picks data from listed houses. I ran into a problem when it comes to changing pages. I did make a loop that goes from 1 up to some number.
The problem is: on this site the last 'page' can always be different. Right now it is 70, but tomorrow it might be 68 or 72. If my range is (1-74), for example, it will print the last page several times, because the site always loads the last page once the requested number goes past the maximum."
Then Ricco D helped me out and wrote code that knows when to stop:
import requests
from bs4 import BeautifulSoup as bs

# Requesting an absurdly high page number (sivu=1000) makes the site fall back
# to its real last page, so the pagination buttons reveal the actual page count.
url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1000'
page = requests.get(url)
soup = bs(page.content, 'html.parser')

pages = []
buttons = soup.find_all('button', class_="Pagination__button__3H2wX")
for button in buttons:
    pages.append(button.text)
print(pages)

This works fine.
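For reference, a short sketch (my own addition) of turning those button texts into a usable page count; the digit filter is an assumption, since a pagination bar may also contain non-numeric previous/next buttons:

numeric = [int(p) for p in pages if p.strip().isdigit()]  # keep only the numbered buttons (assumption)
last_page = max(numeric) if numeric else 1
print(last_page)  # e.g. 70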
But when I tried to combine it with my original code, I ran into an error:
Traceback (most recent call last):
File "C:/Users/Käyttäjä/PycharmProjects/Etuoviscaper/etuovi.py", line 29, in <module>
containers = page_soup.find("div", {"class": "ListPage__cardContainer__39dKQ"})
File "C:\Users\Käyttäjä\PycharmProjects\Etuoviscaper\venv\lib\site-packages\bs4\element.py", line 2173, in __getattr__
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

This is the error I get.
Is there any way to get this working? Thanks.
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re
import requests

my_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'

filename = "asunnot.csv"
f = open(filename, "w")
headers = "Neliöt; Hinta; Osoite; Kaupunginosa; Kaupunki; Huoneistoselitelmä; Rakennusvuosi\n"
f.write(headers)

page = requests.get(my_url)
soup = soup(page.content, 'html.parser')

pages = []
buttons = soup.findAll("button", {"class": "Pagination__button__3H2wX"})
for button in buttons:
    pages.append(button.text)
last_page = int(pages[-1])

for sivu in range(1, last_page):
    req = requests.get(my_url + str(sivu))
    page_soup = soup(req.text, "html.parser")
    containers = page_soup.findAll("div", {"class": "ListPage__cardContainer__39dKQ"})
    for container in containers:
        size_list = container.find("div", {"class": "flexboxgrid__col-xs__26GXk flexboxgrid__col-md-4__2DYW-"}).text
        size_number = re.findall("\d+\,*\d+", size_list)
        size = ''.join(size_number)  # Asunnon koko neliöinä
        prize_line = container.find("div", {"class": "flexboxgrid__col-xs-5__1-5sb flexboxgrid__col-md-4__2DYW-"}).text
        prize_number_list = re.findall("\d+\d+", prize_line)
        prize = ''.join(prize_number_list[:2])  # Asunnon hinta
        address_city = container.h4.text
        address_list = address_city.split(', ')[0:1]
        address = ' '.join(address_list)  # osoite
        city_part = address_city.split(', ')[-2]  # kaupunginosa
        city = address_city.split(', ')[-1]  # kaupunki
        type_org = container.h5.text
        type = type_org.replace("|", "").replace(",", "").replace(".", "")  # asuntotyyppi
        year_list = container.find("div", {"class": "flexboxgrid__col-xs-3__3Kf8r flexboxgrid__col-md-4__2DYW-"}).text
        year_number = re.findall("\d+", year_list)
        year = ' '.join(year_number)

        print("pinta-ala: " + size)
        print("hinta: " + prize)
        print("osoite: " + address)
        print("kaupunginosa: " + city_part)
        print("kaupunki: " + city)
        print("huoneistoselittelmä: " + type)
        print("rakennusvuosi: " + year)

        f.write(size + ";" + prize + ";" + address + ";" + city_part + ";" + city + ";" + type + ";" + year + "\n")
f.close()

Posted on 2020-12-22 07:03:27
Your main issue is related to the way you use soup. You first import BeautifulSoup as soup, and then you overwrite this name when you create your first BeautifulSoup instance:
soup = soup(page.content, 'html.parser')
From this point on, the name soup no longer refers to the library class BeautifulSoup but to the object you just created. So when you try to create a new instance a few lines further down (page_soup = soup(req.text, "html.parser")), this fails, because soup no longer refers to BeautifulSoup.
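A minimal sketch (my own addition) that reproduces the error in isolation: calling a BeautifulSoup object is shorthand for its find_all() method, which is why page_soup silently became a ResultSet and the later .find() call raised the AttributeError shown in the traceback.

from bs4 import BeautifulSoup as soup
doc = soup("<p>first</p>", "html.parser")      # the name 'soup' now refers to this object, not the class
result = soup("<p>second</p>", "html.parser")  # actually calls doc.find_all(...), returning a ResultSet
print(type(result))                            # <class 'bs4.element.ResultSet'>
result.find("p")                               # AttributeError: ResultSet object has no attribute 'find'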
So the best approach is to import the library properly, like this: from bs4 import BeautifulSoup (or import and use it as bs, as in Ricco D's code), and then change the two instantiation lines as follows:
soup = BeautifulSoup(page.content, 'html.parser') # this is Python2.7-syntax btw
and
page_soup = BeautifulSoup(req.text, "html.parser") # this is Python3-syntax btw
Since you are using Python 3, the correct requests syntax is page.text rather than page.content, because in Python 3 .content returns bytes, which is not what you want here (BeautifulSoup expects a str). If you were on Python 2.7, you would probably change req.text to req.content instead.
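Putting the rename together with the page-count trick, a minimal sketch of the combined loop might look like this. Two assumptions of mine: the base URL should end without a page number (appending str(sivu) to a URL that already ends in sivu=1 would request sivu=11, sivu=12, ...), and the range should run to last_page + 1 so the final page is actually visited. The class names are copied from the question and may change whenever the site is rebuilt.

import requests
from bs4 import BeautifulSoup

# Base URL without the page number (assumption: the question's my_url already ended in sivu=1)
base_url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu='

# Request an oversized page number so the site serves its real last page,
# then read the highest numeric pagination button.
probe = BeautifulSoup(requests.get(base_url + '1000').text, 'html.parser')
buttons = probe.find_all('button', class_='Pagination__button__3H2wX')
last_page = max((int(b.text) for b in buttons if b.text.strip().isdigit()), default=1)

for sivu in range(1, last_page + 1):
    page_soup = BeautifulSoup(requests.get(base_url + str(sivu)).text, 'html.parser')
    containers = page_soup.find_all('div', class_='ListPage__cardContainer__39dKQ')
    print(sivu, len(containers))  # parse each container here, as in the original loop body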
Good luck.

Posted on 2020-12-22 02:44:07
Finding elements by class name doesn't seem like the best idea, because all the following elements use the same class names.

Because of the language, I don't know exactly what you are looking for. I suggest you go to the website, press F12, press Ctrl+F, and type an XPath to see which elements you get. If you don't know XPath, read this article: https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples
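To illustrate that suggestion, here is a minimal XPath sketch using lxml (my choice of library); keying on the stable prefix of the class name is an assumption, meant to survive the hashed suffix (the __39dKQ part) changing between site builds:

import requests
from lxml import html

url = 'https://www.etuovi.com/myytavat-asunnot/oulu?haku=M1582971026&sivu=1'
tree = html.fromstring(requests.get(url).text)
# starts-with() matches on the stable part of the generated class name (assumption)
cards = tree.xpath('//div[starts-with(@class, "ListPage__cardContainer")]')
print(len(cards))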
https://stackoverflow.com/questions/65398056