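I have an original text that looks like this:
We are AMS. We are a global workforce solutions firm; we enable organisations to thrive in an age of constant change by building, re-shaping, and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment teams and provide professional interim and temporary resources.
We are currently working with our client, Royal London.
Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we are owned by our members and work for them, not for the benefit of shareholders. We are growing fast and have been recognised as one of the UK's top-rated workplaces.
Today, Royal London manages over £114 billion in funds and has around 3,500 employees working across 6 offices in the UK and Ireland. We have always worked hard to be experts in our specialist markets, building a trusted brand, and our teams have plenty of awards to show for it. Whatever team you want to join, and whatever role you take on, we will help you make a difference.
We are looking for a Business Analyst for a 6-month contract, based in London.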
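Purpose of the Role:
You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades.
As the Business Analyst, you will be responsible for:
Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.
The skills, attributes and capabilities we are seeking from you include:
Strong teamworking within the scrum team and with business users
If you are interested in applying for this position and meet the criteria outlined above, please click the link to apply and speak to one of our sourcing specialists.
AMS is a Recruitment Process Outsourcing company and, in delivering some of its services, may be deemed to be acting as an Employment Agency or an Employment Business.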
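I have used the code below to split and extract the text by subtitle from the original HTML using Beautiful Soup. Basically, the aim is to:
The code below illustrates this: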
import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent


def headers():
    # Build request headers with a Chrome User-Agent string
    ua = UserAgent()
    chrome_header = ua.chrome
    headers = {'User-Agent': chrome_header}
    return headers


headers = headers()
r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)
soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop': 'description'})
# Text of every <strong> tag in the description
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
# Subtitles are the <strong> texts that contain a colon
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]
total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])
# If the last subtitle is not the last <strong> tag, also split on the <strong> text that follows it.
# (Compare against length - 1, because list positions are 0-based.)
if Position_of_last_sub_title != total_length_of_strong_tags - 1:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}| {j_description_subtitles[Position_of_last_sub_title+1]}', j_description.text)[1:Position_of_last_sub_title]
else:
    text_after_sub_t = re.split(f'{sub_titles_in_description[0]}|{sub_titles_in_description[1]}|{sub_titles_in_description[-1]}', j_description.text)[1:]
# Map each subtitle to the text that follows it (hard-coded for three subtitles)
final_dict_with_sub_t_n_prec_txt = {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2]
}

The problem is the splitting of the text based on the subtitles. It is too manual, and the other approaches I have tried did not make it dynamic. How can I make this part dynamic, given that the number of subtitles will vary in future texts?
Posted on 2022-05-15 05:13:40
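You can simplify this, or make it more generic, by using CSS selectors to select the elements. For example, p:has(strong:-soup-contains(":")) selects every <p> that contains a <strong> whose text includes a colon. Then use find_next_sibling() to get the text that follows each subtitle:
dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))
Note: '|' is passed to get_text() as the separator, so in this case you can later split the list elements apart. You could also replace it with a space: get_text(' ',strip=True).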
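To try the selector without requesting the live page, here is a minimal sketch that runs on a small inline HTML snippet. The markup and the helper name sections_by_subtitle are made up for illustration and only approximate the structure of the real job description. The sketch also skips a subtitle that has no following sibling, which the one-liner above does not guard against.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real job description
html = """
<span itemprop="description">
  <p>We are looking for a <strong>Business Analyst</strong>.</p>
  <p><strong>Purpose of the Role:</strong></p>
  <p>Work with the data squad.</p>
  <p><strong>Skills:</strong></p>
  <ul><li>SQL</li><li>Agile</li></ul>
</span>
"""

SUBTITLE_SELECTOR = '[itemprop="description"] p:has(strong:-soup-contains(":"))'


def sections_by_subtitle(soup):
    """Map each subtitle paragraph to the text of the element that follows it."""
    result = {}
    for p in soup.select(SUBTITLE_SELECTOR):
        sibling = p.find_next_sibling()
        if sibling is None:
            # Subtitle at the very end of the description: nothing to collect
            continue
        result[p.get_text(strip=True)] = sibling.get_text("|", strip=True)
    return result


soup = BeautifulSoup(html, "html.parser")
sections = sections_by_subtitle(soup)
print(sections)
```

The first paragraph is not picked up because its <strong> text has no colon, so the number of subtitles never has to be known in advance.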
Example
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)
soup = BeautifulSoup(r.text, 'html.parser')
data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))
print(data)

Output
{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}

https://stackoverflow.com/questions/72245378
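The '|'-joined values in the output above can be turned into lists in a follow-up step. The sketch below uses an abbreviated sample dict in the same shape as that output; the helper name split_items is made up for illustration. One caveat: get_text('|', strip=True) puts a separator between every text node, so a paragraph containing inline tags, or text that itself contains '|', would also be split.

```python
# Abbreviated sample in the same shape as the output above
data = {
    "Purpose of the Role:": "You will be working with the internal data squad",
    "The skills, attributes and capabilities we are seeking from you include:":
        "Strong communication both verbal and written|Agile working experience",
}


def split_items(sections, sep="|"):
    """Return a new dict in which every value is a list of items split on sep."""
    return {title: text.split(sep) for title, text in sections.items()}


as_lists = split_items(data)
for title, items in as_lists.items():
    print(title, items)
```

Single-paragraph sections simply become one-element lists, so every value can be handled the same way afterwards.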