
Python scraping not working -- multi-page project

Stack Overflow user
Asked on 2022-10-21 23:03:10
1 answer · 24 views · 0 following · 0 votes

I'm fairly new to web scraping, and I've been working on a project that scrapes data from a job bank website. I'm wondering why my code doesn't work. It runs fine against a single page, but I don't know what I'm doing wrong with multiple pages.

Import libraries

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
from time import sleep
from random import randint
import datetime

Connect to the website and extract the data

#for page in range(37001458,37001470):
pages = np.arange (37001458, 37001470, 1)
data = []

for page in pages:
    
    URL = 'https://www.jobbank.gc.ca/jobsearch/jobposting/' + str(page)
    sleep(randint(1,5))
    
# To find Your User-Agent: https://httpbin.org/get
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"}


#for single page scraping - delete everything before this line except headers and readdress response
page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, "html.parser")

soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

#page = page + 1

try:
      job_title = soup2.find(property='title').get_text()
except:
      job_title = ''

try:
    date_posted = soup2.find(property='datePosted').get_text()
except:
    date_posted = ''

try:
    company = soup2.find(property='hiringOrganization').get_text()
except:
    company = ''

try:
    address = soup2.find(property='streetAddress').get_text()
except:
    address = ''

try:
    city = soup2.find(property='addressLocality').get_text()
except:
    city = ''

try:
    province = soup2.find(property='addressRegion').get_text()
except:
    province = ''

try:
    wage = soup2.find(property='minValue').get_text()
except:
    wage = ''

try:
    wage_reference = soup2.find(property='unitText').get_text()
except:
    wage_reference = ''

try:
    work_hours = soup2.find(property='workHours').get_text()
except:
    work_hours = ''

try:
    employment_type = soup2.find(property='employmentType').get_text()
except:
    employment_type = ''

try:
    language = soup2.find(property='qualification').get_text()
except:
    language = ''

try:
    required_education = soup2.find(property='educationRequirements qualification').get_text()
except:
    required_education = ''

try:
    required_experience = soup2.find(property='experienceRequirements qualification').get_text()
except:
    required_experience = ''

try:
    skills = soup2.find(property='experienceRequirements').get_text()
except:
    skills = ''

try:
    employment_groups = soup2.find(id='employmentGroup').get_text()
except:
    employment_groups = ''

Data cleaning

job_title = job_title.strip()

date_posted = date_posted.strip()[10:]

company = company.strip()

address = address.strip()

city = city.strip()

province = province.strip()

wage = wage.strip()

wage_reference = wage_reference.strip()

work_hours = work_hours.strip()

employment_type = employment_type.strip()

language = language.strip()

required_education = required_education.strip()

required_experience = required_experience.strip()

skills = skills.strip()

employment_groups = employment_groups.strip()[238:] 

print(job_title)
print(date_posted)
print(company)
print(address)
print(city)
print(province)
print(wage)
print(wage_reference)
print(work_hours)
print(employment_type)
print(language)
print(required_education)
print(required_experience)
print(skills)
print(employment_groups)

Output a timestamp to track when the data was collected

import datetime

today = datetime.date.today()

print(today)

Write the data to a CSV file (created previously)

import csv

header = ['Job Title', 'Date Posted', 'Company', 'Address', 'City', 'Province', 'Wage', 'Wage Reference', 'Work Hours', 'Employment Type', 'Language', 'Required Education', 'Required Experience', 'Skills', 'Employment Groups']
values = [job_title, date_posted, company, address, city, province, wage, wage_reference, work_hours, employment_type, language, required_education, required_experience, skills, employment_groups]


with open('CanadaJobBankWebScraperDataset.csv', 'a+', newline='', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(values)

Pandas

df = pd.read_csv(r'C:\Users\AM\CanadaJobBankWebScraperDataset.csv')

print(df)

1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-10-21 23:09:32

pages = np.arange (37001458, 37001470, 1)

for page in pages:
    
    URL = 'https://www.jobbank.gc.ca/jobsearch/jobposting/' + str(page)
    sleep(randint(1,5))

page = requests.get(URL, headers=headers)

This loop builds a URL string, then immediately rebuilds it with a different number at the end, and then again, over and over. When the loop finishes, only the final URL value actually survives.

The requests.get() call (and all of the related processing) needs to be inside the for loop.
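The corrected control flow can be sketched offline like this; `fetch` below is only a stand-in for the real `requests.get` + `BeautifulSoup` extraction (the name is illustrative, not from the original post). The point is that the request, the parsing, and the row append all happen inside the loop body, so every page's result is kept instead of only the last URL:

```python
# Offline sketch of the corrected structure. `fetch` is a placeholder
# for the real requests.get + BeautifulSoup field extraction.
BASE = "https://www.jobbank.gc.ca/jobsearch/jobposting/"

def fetch(url):
    # The real version, inside the loop, would do something like:
    #   page = requests.get(url, headers=headers)
    #   soup = BeautifulSoup(page.content, "html.parser")
    #   ... extract job_title, company, etc. ...
    return {"url": url}  # placeholder row

data = []
for page in range(37001458, 37001470):
    url = BASE + str(page)   # rebuilt each iteration, as in the original
    row = fetch(url)         # request + parsing now INSIDE the loop
    data.append(row)         # keep this page's result before moving on

print(len(data))  # 12 rows, one per posting id
```

With this shape, the per-page fields could also be collected as a list of rows and handed to pandas directly (e.g. `pd.DataFrame(data, columns=header)`) rather than appending to a CSV and re-reading it.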

0 votes
Page content provided by Stack Overflow. Original link:
https://stackoverflow.com/questions/74159918
