我完成了一页的刮刀,为下一页提取了href。
我不能让刮刀在一个循环后的每一页。我尝试了一段时间的True循环,但是这会扼杀我第一页的结果。
这段代码非常适合第一页:
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup
myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")
def NextPage():
np = page_soup.findAll("section", {"class":"next_news"})
np = np[0].find('a').attrs['href']
print(np)
for days in DayContainer:
shows = days.findAll("span", {"class":"concert_uitverkocht"})
for soldout in shows:
if shows:
soldoutPlu = shows[0].parent.parent.parent
artist = soldoutPlu.findAll("div", {"class":"td_2"})
artist = artist[0].text.strip()
venue = soldoutPlu.findAll("div", {"class":"td_3"})
venue = venue[0].text
city = soldoutPlu.findAll("div", {"class":"td_4"})
city = city[0].text
date = shows[0].parent.parent.parent.parent.parent
date = date.findAll("section", {"class":"concert_agenda_date"})
date = date[0].text
date = date.strip().replace("\n", " ")
print("Datum gevonden!")
print("Artiest: " + artist)
print("Locatie: " + venue)
print("Stad: " + city)
print("Datum: " + date+ "\n")
f.write(artist + "," + date + "," + city + "," + venue + "\n")
else:
pass
NextPage()我认为不需要使用page url + number方法,因为我可以使用findAll从每个页面提取正确的url。我是新来的,所以这个错误一定很愚蠢。
谢谢你帮忙!
发布于 2018-06-30 15:25:39
尝试下面的脚本,让所需的字段遍历不同的页面,并相应地将它们写入csv文件。我试图清理您的重复编码,并采用稍微更干净的方法来代替它。试试看:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='
with open("output.csv","w",newline="",encoding="utf-8") as infile:
writer = csv.writer(infile)
writer.writerow(['Artist','Venue','City'])
pagenum = -1 #make sure to get the content of the first page as well which is "0" in the link
while True:
pagenum+=1
res = urlopen(link.format(pagenum)).read()
soup = BeautifulSoup(res, "html.parser")
container = soup.find_all("section",class_="concert_rows_info")
if len(container)<=1:break ##as soon as there is no content the scraper should break out of the loop
for items in container:
artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
venue = items.find(class_="td_3").get_text(strip=True)
city = items.find(class_="td_4").get_text(strip=True)
writer.writerow([artist,city,venue])
print(f'{artist}\n{venue}\n{city}\n')发布于 2018-06-30 10:57:56
你的错误
您必须获取您在文件末尾找到的url,您只需调用NextPage(),但它所做的只是打印出url。
那是你的错误:)
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup
filename = "db.csv"
#at the beginning of the document you create the file in 'w'-write mode
#but later you should open it in "A"-append mode because 'W'-write will rewrite the file
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
f.close()
#create a function url_fetcher that everytime will go and fetch the html
def url_fetcher(url):
myurl = (url)
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")
get_artist(DayContainer, page_soup)
#here you have to call the url otherwize it wont work
def NextPage(page_soup):
np = page_soup.findAll("section", {"class":"next_news"})
np = np[0].find('a').attrs['href']
url_fetcher(np)
#in get artist you have some repeatings but you can tweak alittle bit and it will work
def get_artist(DayContainer, page_soup):
for days in DayContainer:
shows = days.findAll("span", {"class":"concert_uitverkocht"})
for soldout in shows:
print(soldout)
if shows:
soldoutPlu = shows[0].parent.parent.parent
artist = soldoutPlu.findAll("div", {"class":"td_2"})
artist = artist[0].text.strip()
venue = soldoutPlu.findAll("div", {"class":"td_3"})
venue = venue[0].text
city = soldoutPlu.findAll("div", {"class":"td_4"})
city = city[0].text
date = shows[0].parent.parent.parent.parent.parent
date = date.findAll("section", {"class":"concert_agenda_date"})
date = date[0].text
date = date.strip().replace("\n", " ")
print("Datum gevonden!")
print("Artiest: " + artist)
print("Locatie: " + venue)
print("Stad: " + city)
print("Datum: " + date+ "\n")
with open(filename, "a") as f:
f.write(artist + "," + date + "," + city + "," + venue + "\n")
else:
pass
NextPage(page_soup)
url_fetcher('https://www.podiuminfo.nl/concertagenda/')重述
为了更容易理解,我做了一个很大的循环,但它有效:)
您需要对db.csv中的名称和日期进行一些调整,这样就不会有重复的名称和日期
https://stackoverflow.com/questions/51113933
复制相似问题