I have been trying to scrape an Airbnb page for the price, without much luck. I have successfully pulled in the other fields I am interested in (home description, home location, reviews, etc.). Below is what I have tried and where it fails. I believe the "price" on the page is a 'span' class while the other fields are 'div' classes, and that this is where my problem lies, but I am guessing.
The URL I am using is: foOVSAshSYvdbpbS
It can be supplied as input to the code below.
Any help would be greatly appreciated.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from bs4 import BeautifulSoup
import requests
from IPython.display import IFrame
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")
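(As an aside, the manual index bookkeeping in the numbered printout above can be avoided with `enumerate`; a minimal sketch with placeholder URLs, not part of the original question:)

```python
# same numbered printout as the while loop above, without manual indexing
input_string = "https://example.com/a,https://example.com/b"  # placeholder URLs
airbnb_list = [u.strip() for u in input_string.split(",")]
for i, url in enumerate(airbnb_list, start=1):
    print(i, '.) ', url)
```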
a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
"Location":'', "Home Type":'', "House Desc":''}])
for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    stars = soup.find(class_='_c7v1se').get_text()
    desc = soup.find(class_='_12nksyy').get_text()
    size = soup.find(class_='_jro6t0').get_text()
    #checkIn = soup.find(class_='_1acx77b').get_text()
    checkIn = soup.find(class_='_12aeg4v').get_text()
    #checkOut = soup.find(class_='_14tl4ml5').get_text()
    checkOut = soup.find(class_='_12aeg4v').get_text()
    Rules = soup.find(class_='cihcm8w dir dir-ltr').get_text()
    #location = soup.find(class_='_9ns6hl').get_text()
    location = soup.find(class_='_152qbzi').get_text()
    HomeType = soup.find(class_='_b8stb0').get_text()
    title = soup.title.string
    print('Stars: ', stars)
    print('')
    #Home Type
    print('Home Type: ', HomeType)
    print('')
    #Space Description
    print('Description: ', desc)
    print('')
    print('Rental size: ', size)
    print('')
    #CheckIn
    print('Check In: ', checkIn)
    print('')
    #CheckOut
    print('Check Out: ', checkOut)
    print('')
    #House Rules
    print('House Rules: ', Rules)
    print('')
    #print(soup.find("button", {"id":"#Id name of the button"}))
    #Home Location
    print('Home location: ', location)
    #Dates available
    #print('Dates available: ', soup.find(class_='_1yhfti2').get_text())
    print('===================================================================================')
    df = pd.DataFrame([{"Title":title, "Stars": stars, "Size":size, "Check In":checkIn, "Check Out":checkOut, "Rules":Rules,
                        "Location":location, "Home Type":HomeType, "House Desc":desc}])
    a = a.append(df)
#Attempting to print the price tag on the website
print(soup.find_all('span', {'class': '_tyxjp1'}))
print(soup.find(class_='_tyxjp1').get_text())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-2d9689dbc836> in <module>
      1 #print(soup.find_all('span', {'class': '_tyxjp1'}))
----> 2 print(soup.find(class_='_tyxjp1').get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'

Posted on 2022-05-13 10:47:16
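(Editorial note on the traceback: `soup.find(class_='_tyxjp1')` returns `None` when no element matches, and `None` has no `.get_text()`. A defensive lookup pattern, sketched here with made-up HTML rather than Airbnb's real markup, avoids the crash:)

```python
from bs4 import BeautifulSoup

def find_text(soup, cls, default='N/A'):
    """Return the matched element's text, or a default when the class is absent."""
    el = soup.find(class_=cls)
    return el.get_text(strip=True) if el is not None else default

# made-up HTML standing in for a fetched page; no price span is present
soup = BeautifulSoup('<div class="_c7v1se">4.9 stars</div>', 'html.parser')
print(find_text(soup, '_c7v1se'))   # prints "4.9 stars"
print(find_text(soup, '_tyxjp1'))   # prints "N/A" instead of raising AttributeError
```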
I see you are using the requests module to scrape Airbnb. That module is very versatile and works well on sites with static content. However, it has one major drawback: it does not render content created by javascript. This is a problem, because most websites nowadays use javascript to create additional html elements once the user lands on the page.
The airbnb price block is created exactly like that, with javascript.
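(You can see this effect in miniature: BeautifulSoup only parses the HTML the server sent and never executes any scripts in it, so anything a script would inject simply is not there. A self-contained illustration with made-up markup, not Airbnb's:)

```python
from bs4 import BeautifulSoup

# static markup as `requests` would receive it: the price span does not
# exist yet; a script would create it later, in the browser
html = """
<div class="_c7v1se">4.9 stars</div>
<script>
  // the browser would run this and insert <span class="_tyxjp1">$120</span>
</script>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(class_='_c7v1se'))  # found: it was in the raw HTML
print(soup.find(class_='_tyxjp1')) # None: bs4 never ran the script
```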
There are many ways to scrape this kind of content. My favourite is Selenium. It is basically a library that lets you launch a real browser and communicate with it from the programming language of your choice.
Here is how you can use Selenium for this, step by step.
First, set it up. Note the headless option: it can be toggled on and off. Turn it off if you want to watch the browser load the page.
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)
Then, navigate to the site.
# navigate to airbnb
driver.get(url)
Next, wait for the price block to load. To us it may seem nearly instant, but depending on your internet connection speed it can take several seconds.
# wait until the price block loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
price_element = WebDriverWait(driver, timeout).until(expectation)
Finally, print the price.
# print the price
print(price_element.get_attribute('innerHTML'))
I have added my code to your example so you can play with it.
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x = x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")
a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
"Location":'', "Home Type":'', "House Desc":''}])
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# set this to False if you want to see how the chrome window loads airbnb - useful for debugging
options.headless = True
driver = webdriver.Chrome(options=options)
for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # navigate to airbnb
    driver.get(url)
    # wait until the price block loads
    timeout = 10
    expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
    price_element = WebDriverWait(driver, timeout).until(expectation)
    # print the price
    print(price_element.get_attribute('innerHTML'))
Keep in mind that your IP may eventually get banned for scraping AirBnb. To work around that, it is always a good idea to use proxy IPs and rotate them. Follow this rotating-proxies tutorial to avoid getting blocked.
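(A minimal sketch of proxy rotation with `requests` and `itertools.cycle`, assuming you already have a list of proxy addresses; the addresses below are placeholders, not working proxies:)

```python
import itertools

# placeholder proxy addresses - substitute real ones from your provider
proxies = [
    'http://203.0.113.1:8080',
    'http://203.0.113.2:8080',
    'http://203.0.113.3:8080',
]
proxy_pool = itertools.cycle(proxies)

def next_proxy():
    """Return the proxies dict for the next request, cycling through the pool."""
    p = next(proxy_pool)
    return {'http': p, 'https': p}

# usage with requests (not executed here, since the proxies are placeholders):
# requests.get(url, proxies=next_proxy(), timeout=10)
```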
Hope this helps!
https://stackoverflow.com/questions/71990458