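The link tag "a" has the following text: "Mino Games (YC W11) is hiring senior engineers in Montreal, QC (workable.com)"
I want to store "Mino Games", "senior engineers", "Montreal" and "workable.com" in sqlite3.
How should I do this? Any suggestions are welcome.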
Posted on 2019-05-27 15:05:34
Assuming you are scraping https://news.ycombinator.com/jobs, this should work:
import re
import sqlite3

conn = sqlite3.connect('jobs.db')
c = conn.cursor()
# All four columns hold strings, so declare each of them as text.
c.execute('''CREATE TABLE IF NOT EXISTS jobs
             (company text, position text, location text, source text)''')

company_pattern = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
source_pattern = re.compile(r'\(([^)]+)\)$')
location_pattern = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
position_pattern = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
clean_up_pattern = re.compile(r'\(([^)]+)\)| is | for | in |a ', re.IGNORECASE)

# Load up <a> nodes into elements here
for element in elements:
    element = element.text
    # The source site is the parenthesised text at the very end of the title.
    source = source_pattern.findall(element)[0].strip()
    element = element.replace('(' + source + ')', '')
    company = clean_up_pattern.sub('', company_pattern.findall(element)[0][0]).strip()
    try:
        # findall returns (place, 'remote') tuples; only one of the two is non-empty.
        location = ''.join(location_pattern.findall(element)[0]).strip()
    except IndexError:
        location = 'Not stated'
    element = element.replace(location, '')
    position = clean_up_pattern.sub('', position_pattern.findall(element)[0]).strip()
    # Bind the values with ? placeholders; variable names written into the SQL string are not substituted.
    c.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", (company, position, location, source))

conn.commit()
conn.close()

This will parse roughly 80% of the job postings there. If you need to capture more, adjust the regular expressions.
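For anyone who wants to try the idea without scraping first, here is a minimal, self-contained sketch of the same regex approach. It makes a few assumptions that are not in the answer above: `parse_title` and `store` are hypothetical helper names, the database is in-memory, the input is a hard-coded list of title strings, and titles that do not fit the patterns are skipped instead of raising an error.

```python
import re
import sqlite3

# Same regular expressions as in the answer above.
COMPANY = re.compile(r'(.+)(hiring|looking|wants|is )', re.IGNORECASE)
SOURCE = re.compile(r'\(([^)]+)\)$')
LOCATION = re.compile(r'in (.*)|(remote)', re.IGNORECASE)
POSITION = re.compile(r'(?:hiring|looking|wants) (.*)', re.IGNORECASE)
CLEAN_UP = re.compile(r'\(([^)]+)\)| is | for | in |a ', re.IGNORECASE)


def parse_title(title):
    """Return (company, position, location, source), or None if the title does not fit.

    Hypothetical helper, written for this illustration only.
    """
    try:
        source = SOURCE.findall(title)[0].strip()
        title = title.replace('(' + source + ')', '')
        company = CLEAN_UP.sub('', COMPANY.findall(title)[0][0]).strip()
        found = LOCATION.findall(title)
        # Take whichever alternative matched: "in <place>" or "remote".
        location = (found[0][0] or found[0][1]).strip() if found else 'Not stated'
        title = title.replace(location, '')
        position = CLEAN_UP.sub('', POSITION.findall(title)[0]).strip()
    except IndexError:
        return None
    return company, position, location, source


def store(conn, titles):
    """Parse each title, insert the rows that parsed, and return how many were stored."""
    conn.execute('CREATE TABLE IF NOT EXISTS jobs '
                 '(company TEXT, position TEXT, location TEXT, source TEXT)')
    rows = [row for row in map(parse_title, titles) if row is not None]
    # "?" placeholders let sqlite3 bind the Python values safely.
    conn.executemany('INSERT INTO jobs VALUES (?, ?, ?, ?)', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(':memory:')  # in-memory database, assumed for the demo
stored = store(conn, [
    'Mino Games (YC W11) is hiring senior engineers in Montreal, QC (workable.com)',
    'A title that does not follow the pattern',
])
print(stored, conn.execute('SELECT * FROM jobs').fetchall())
```

Note that the location comes out as "Montreal, QC" and not just "Montreal", because the location pattern captures everything after "in ". Splitting on the comma would be one way to trim it if only the city is wanted.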
https://stackoverflow.com/questions/56320203