Question: Pagination web scraping in Python 3 with BS4 (while loop)

Stack Overflow user

Asked 2018-06-30 10:12:55

Answers: 2 | Views: 100 | Followers: 0 | Votes: 0

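I finished a scraper for one page and extracted the href for the next page.

I can't get the scraper to run in a loop over every following page. I tried a while True loop, but that kills my results for the first page.

This code works perfectly for the first page: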

import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)

DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")

def NextPage():
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    print(np)

for days in DayContainer: 
    shows = days.findAll("span", {"class":"concert_uitverkocht"})

    for soldout in shows:
        if shows:
            soldoutPlu = shows[0].parent.parent.parent

            artist = soldoutPlu.findAll("div", {"class":"td_2"})
            artist = artist[0].text.strip()

            venue = soldoutPlu.findAll("div", {"class":"td_3"})
            venue = venue[0].text

            city = soldoutPlu.findAll("div", {"class":"td_4"})
            city = city[0].text

            date = shows[0].parent.parent.parent.parent.parent
            date = date.findAll("section", {"class":"concert_agenda_date"})
            date = date[0].text
            date = date.strip().replace("\n", " ")
            print("Datum gevonden!")

            print("Artiest: " + artist)
            print("Locatie: " + venue)
            print("Stad: " + city) 
            print("Datum: " + date+ "\n")

            # write the fields in the same order as the header row (Artist, Venue, City, Date)
            f.write(artist + "," + venue + "," + city + "," + date + "\n")

        else: 
            pass

NextPage()

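I don't think I need the "page URL + number" method, because I can extract the correct URL from every page with findAll. I'm new to this, so the mistake is probably a silly one.

Thanks for helping!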

python-3.x

web-scraping

beautifulsoup

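2 Answers

Stack Overflow user

Accepted answer

Posted 2018-06-30 15:25:39

Try the script below to collect the required fields across the different pages and write them to a CSV file. I cleaned up the repetitive code and used a slightly tidier approach instead. Give it a go: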

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='

with open("output.csv","w",newline="",encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Artist','Venue','City'])

    pagenum = -1   # start at -1 so the first request fetches page "0", the first page in the link
    while True:
        pagenum+=1
        res = urlopen(link.format(pagenum)).read()
        soup = BeautifulSoup(res, "html.parser")
        container = soup.find_all("section",class_="concert_rows_info")
        if len(container)<=1:break  # as soon as there is no more content, break out of the loop

        for items in container:
            artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
            venue = items.find(class_="td_3").get_text(strip=True)
            city = items.find(class_="td_4").get_text(strip=True)
            writer.writerow([artist,venue,city])   # same column order as the header row
            print(f'{artist}\n{venue}\n{city}\n')
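
A side note on why csv.writer is worth using here instead of joining strings with "," by hand: it quotes any field that itself contains a comma. The small sketch below illustrates this with a made-up row (the artist, venue and city values are assumptions, not scraped data) and writes to an in-memory buffer rather than a file.

```python
import csv
import io

# Made-up sample row; the artist name contains a comma on purpose.
row = ["Earth, Wind & Fire", "Some Venue", "Amsterdam"]

# Manual concatenation: the comma inside the artist name looks like a separator.
manual_line = ",".join(row)

# csv.writer quotes the field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# Read both lines back to compare how many fields a CSV reader sees.
manual_fields = next(csv.reader([manual_line]))
csv_fields = next(csv.reader([csv_line]))

print(len(manual_fields))  # the hand-built line splits into one field too many
print(len(csv_fields))     # the csv.writer line round-trips to the original three fields
```

If none of the scraped values ever contain a comma, both approaches produce the same file, so this only matters for names like the one above.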

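Votes: 0

Stack Overflow user

Posted 2018-06-30 10:57:56

Your error

You have to actually fetch the URL that you find at the end of your file. You only call NextPage(), and all it does is print the URL.

That was your error :)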

import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

filename = "db.csv"
# the file is created here in 'w' (write) mode so the header is written once
# later it has to be opened in 'a' (append) mode, because 'w' would overwrite the file
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
f.close()

# url_fetcher downloads and parses the HTML of the given URL every time it is called
def url_fetcher(url):
    myurl = (url)
    uClient = ireq(myurl)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    DayContainer = page_soup.findAll("section",{"class":"overflow"})
    print("Days on page: " + str(len(DayContainer)) + "\n")
    get_artist(DayContainer, page_soup)

# here you have to pass the URL you found on to url_fetcher, otherwise it won't work
def NextPage(page_soup):
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    url_fetcher(np)

# get_artist still contains some repetition, but with a little tweaking it will work
def get_artist(DayContainer, page_soup):
    for days in DayContainer:
        shows = days.findAll("span", {"class":"concert_uitverkocht"})

        for soldout in shows:
            print(soldout)
            if shows:
                soldoutPlu = shows[0].parent.parent.parent

                artist = soldoutPlu.findAll("div", {"class":"td_2"})
                artist = artist[0].text.strip()

                venue = soldoutPlu.findAll("div", {"class":"td_3"})
                venue = venue[0].text

                city = soldoutPlu.findAll("div", {"class":"td_4"})
                city = city[0].text

                date = shows[0].parent.parent.parent.parent.parent
                date = date.findAll("section", {"class":"concert_agenda_date"})
                date = date[0].text
                date = date.strip().replace("\n", " ")
                print("Datum gevonden!")

                print("Artiest: " + artist)
                print("Locatie: " + venue)
                print("Stad: " + city)
                print("Datum: " + date+ "\n")
                with open(filename, "a") as f:
                    # same field order as the header row (Artist, Venue, City, Date)
                    f.write(artist + "," + venue + "," + city + "," + date + "\n")

            else:
                pass
    # follow the next-page link once per page, after all days on this page have been processed
    NextPage(page_soup)
url_fetcher('https://www.podiuminfo.nl/concertagenda/')
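
The chain of calls above (url_fetcher, then get_artist, then NextPage, then url_fetcher again) can also be written as a plain while loop that keeps following the "next" href until there is none. What follows is only a rough sketch of that idea, not the code from this answer: crawl, fetch_page, parse_page, max_pages and the example.com pages are all hypothetical names and stand-in data, chosen so the loop can run without touching the real site.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch_page, parse_page, max_pages=50):
    """Follow "next page" links iteratively and collect the rows of every page.

    fetch_page(url)  -> page content                (hypothetical helper)
    parse_page(html) -> (rows, next_href or None)   (hypothetical helper)
    """
    rows = []
    seen = set()
    url = start_url
    # Stop when there is no next link, when a URL repeats, or at the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_rows, next_href = parse_page(fetch_page(url))
        rows.extend(page_rows)
        # Resolve a relative href against the current page URL.
        url = urljoin(url, next_href) if next_href else None
    return rows


# Stand-in "site" so the sketch runs offline: URL -> (rows, href of next page).
FAKE_SITE = {
    "https://example.com/agenda/": (["show A", "show B"], "?page=1"),
    "https://example.com/agenda/?page=1": (["show C"], "?page=2"),
    "https://example.com/agenda/?page=2": (["show D"], None),
}

all_rows = crawl(
    "https://example.com/agenda/",
    fetch_page=lambda url: url,               # pretend the URL is the page body
    parse_page=lambda html: FAKE_SITE[html],  # look up the canned result
)
print(all_rows)
```

In a real scraper, fetch_page would do the urlopen/read step and parse_page would build the soup, collect the rows, and return the href from the next_news section, or None when that section is missing. The seen set and max_pages limit are there so the loop cannot run forever if the site links back to an earlier page.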

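Recap

I made it one big loop so it is easier to understand, but it works :)

You will need to make some adjustments so that db.csv does not end up with duplicate names and dates.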
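
One possible way to handle the duplicates is to collect the rows first and drop exact repeats before writing. This is a minimal sketch under that assumption; unique_rows and the sample rows are made up for illustration, and the output goes to an in-memory buffer instead of db.csv.

```python
import csv
import io


def unique_rows(rows):
    """Return the rows in their original order with exact duplicates removed."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


# Made-up sample data standing in for scraped rows.
scraped = [
    ["Artist A", "Venue X", "Utrecht", "za 30 juni"],
    ["Artist A", "Venue X", "Utrecht", "za 30 juni"],
    ["Artist B", "Venue Y", "Amsterdam", "zo 1 juli"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Artist", "Venue", "City", "Date"])
writer.writerows(unique_rows(scraped))
print(buffer.getvalue())
```

This only removes rows that are identical in every field. If the same show can appear with slightly different text, the key would need to be built from normalized values instead.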

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51113933
