I want to use BeautifulSoup to search for jobs in my field in batch mode. I will have a list of URLs, each pointing to an employer's careers page. If the search finds the keyword GIS in a job title, I'd like it to return the link to the job posting.
Here are a couple of cases:
The first company site requires a keyword search. The results page is:
https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20
I would like it to return the following:
https://jobs-challp.icims.com/jobs/2432/gis-specialist/job
https://jobs-challp.icims.com/jobs/2369/gis-specialist/job
The second site does not require a keyword search:
https://www.smartrecruiters.com/SpectraForce1/
I would like it to return the following:
https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician
This is as far as I've gotten:
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.smartrecruiters.com/SpectraForce1/'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
text = soup.get_text()
if 'GIS ' in text:
    print 'Job Found!'

There are two problems: 1) This confirms that a job was found, of course, but it does not return the link to the job itself. 2) For the first company site, this method does not find the two relevant postings at all. I checked by scanning the output of soup.get_text() and found that the job titles are not included in the returned text.
Any help or other suggestions would be greatly appreciated.
Thanks!
Posted on 2014-01-15 00:56:39
Here you go!
This code will find all links whose text contains "GIS". I had to add &in_iframe=1 to make the first URL work.
import urllib2
from bs4 import BeautifulSoup
urls = ['https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1',
'https://www.smartrecruiters.com/SpectraForce1/']
for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url))
    print 'Scraping {}'.format(url)
    for link in soup.find_all('a'):
        if 'GIS' in link.text:
            print '--> TEXT: ' + link.text.strip()
            print '--> URL: ' + link['href']
    print ''

Output:
Scraping https://jobs-challp.icims.com/jobs/search?ss=1&searchKeyword=gis&searchCategory=&searchLocation=&latitude=&longitude=&searchZip=&searchRadius=20&in_iframe=1
--> TEXT: GIS Specialist
--> URL: https://jobs-challp.icims.com/jobs/2432/gis-specialist/job?in_iframe=1
--> TEXT: GIS Specialist
--> URL: https://jobs-challp.icims.com/jobs/2369/gis-specialist/job?in_iframe=1
Scraping https://www.smartrecruiters.com/SpectraForce1/
--> TEXT: Technical Specialist/ Research Analyst/ GIS/ Engineering Technician
--> URL: https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist
--> TEXT: GIS Specialist
--> URL: https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
--> TEXT: GIS Technician
--> URL: https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician

Posted on 2014-01-15 01:02:31
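For anyone on Python 3 (where urllib2 no longer exists), the same "keep the link text and its href together" idea can be sketched with only the standard library. The HTML snippet below is made up for illustration, not fetched from either site:

```python
from html.parser import HTMLParser

class GISLinkFinder(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose text contains 'GIS'."""
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments collected inside that tag
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            if 'GIS' in text:
                self.matches.append((text, self._href))
            self._href = None

# Hypothetical snippet standing in for a fetched careers page.
sample = '''
<a href="/jobs/2432/gis-specialist/job">GIS Specialist</a>
<a href="/jobs/9999/accountant/job">Accountant</a>
'''
finder = GISLinkFinder()
finder.feed(sample)
print(finder.matches)  # -> [('GIS Specialist', '/jobs/2432/gis-specialist/job')]
```

In real use you would feed it the page body from urllib.request.urlopen(url).read().decode() instead of the inline sample.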
Here's one way to do it:
from bs4 import BeautifulSoup
import urllib2
import re
url = 'https://www.smartrecruiters.com/SpectraForce1/'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
titles = [i.get_text() for i in soup.findAll('a', {'target': '_blank'})]
jobs = [re.sub(r'\s+', ' ', title) for title in titles]
links = [i.get('href') for i in soup.findAll('a', {'target': '_blank'})]
for i, j in enumerate(jobs):
    if 'GIS' in j:
        print links[i]

If you run this now, it will print:
https://www.smartrecruiters.com/SpectraForce1/74985505-technical-specialist
https://www.smartrecruiters.com/SpectraForce1/74966857-gis-specialist
https://www.smartrecruiters.com/SpectraForce1/74944180-gis-technician

Posted on 2014-01-15 01:13:17
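A side note on the approach above: the two findAll() calls build parallel lists that then have to be indexed together with enumerate(). zip() lets you normalize the whitespace and filter in a single pass. A small sketch, with made-up title/link data standing in for the findAll() results:

```python
import re

# Hypothetical data as the two findAll() calls would return it.
titles = ['GIS\n  Specialist', 'Payroll\n  Clerk', 'GIS\n  Technician']
links = ['/jobs/1-gis-specialist', '/jobs/2-payroll-clerk', '/jobs/3-gis-technician']

# Collapse runs of whitespace and keep only hrefs whose title mentions GIS.
matches = [href for title, href in zip(titles, links)
           if 'GIS' in re.sub(r'\s+', ' ', title)]
print(matches)  # -> ['/jobs/1-gis-specialist', '/jobs/3-gis-technician']
```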
Here is my attempt; it's much the same as the ones above:
from bs4 import BeautifulSoup
from urllib2 import urlopen
def work(url):
    soup = BeautifulSoup(urlopen(url).read())
    for i in soup.findAll("a", text=True):
        if "GIS" in i.text:
            print "Found link " + i["href"].replace("?in_iframe=1", "")

urls = ["https://jobs-challp.icims.com/jobs/search?pr=0&searchKeyword=gis&searchRadius=20&in_iframe=1",
        "https://www.smartrecruiters.com/SpectraForce1/"]

for i in urls:
    work(i)

It defines a function work() to do the actual work: it fetches the page from the remote server with urlopen(), since it looks like you want to use urllib2 (though I would recommend python-requests instead); it then uses findAll() to find all a elements (links), checks each one for "GIS" in the link's text, and if found prints the link's href attribute with the ?in_iframe=1 suffix stripped.
Then a list of URLs is defined (only two in this example), and work() is called on each URL in the list, passing it as an argument.
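One caveat on that answer: replace("?in_iframe=1", "") only works while in_iframe is the sole query parameter. In Python 3, urllib.parse can strip one parameter while leaving any others intact. strip_param below is a hypothetical helper, not part of BeautifulSoup or urllib:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, name):
    """Remove one query parameter from a URL, keeping any others intact."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != name]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(strip_param(
    'https://jobs-challp.icims.com/jobs/2432/gis-specialist/job?in_iframe=1',
    'in_iframe'))
# -> https://jobs-challp.icims.com/jobs/2432/gis-specialist/job
```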
https://stackoverflow.com/questions/21126768