Q: How can I dynamically split text based on the multiple subtitles it contains, for each new text?

Stack Overflow user

Asked 2022-05-15 03:42:25

Answers: 1 · Views: 53 · Followers: 0 · Votes: 0

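I have an original text that looks like this:

We are AMS. We are a global workforce solutions company; we enable organisations to thrive in an age of constant change by building, re-shaping and optimising workforces. Our Contingent Workforce Solutions (CWS) is one of our service offerings; we act as an extension of our clients' recruitment team and provide professional interim and temporary resources.

We are currently working with our client, Royal London.

Royal London is a financial services company with a difference. As the UK's largest mutual life, pensions and investment company, we are owned by our members and work for them, not for the benefit of shareholders. We are growing fast and are recognised as one of the UK's top-rated places to work.

Today, Royal London manages over £114 billion in funds, and around 3,500 people work across our 6 offices in the UK and Ireland. We have always worked to be experts in our specialist markets and to build a trusted brand, and our teams have plenty of awards to show for it. Whichever team you want to join, and whatever role you take on, we will help you make a difference.

We are looking for a Business Analyst for a 6 month contract, based in London.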

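Purpose of the Role:

You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades.

As the Business Analyst, you will be responsible for:

Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.

The skills, attributes and capabilities we are seeking from you include: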

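Strong communication both verbal and written

Strong teamworking within the scrum team and with other BAs and directly with business users

Significant asset management experience

Working knowledge of the key data sets that are used by an asset manager

Experience of Master Data Management tools, ideally IHS Markit EDM

Agile working experience

Ability to write user stories to detail the requirements that both the development team and the QA team will use

Strong SQL skills, ideally using Microsoft SQL Server

Experience of managing data interface mapping documentation

Familiarity with data modelling concepts

Project experience based on ETL and Data Warehousing advantageous

Technical (development) background advantageous

Have an asset management background.

Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.

This client will only accept workers operating through an employed engagement model.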

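If you are interested in applying for this position and meet the criteria outlined above, please click the link to apply and speak to one of our sourcing specialists.

AMS is a Recruitment Process Outsourcing company and, in providing some of its services, may be deemed to be acting as an employment agency or an employment business.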

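I have used the code below to split the text and extract it by subtitle from the original HTML, using Beautiful Soup. Basically, the aim is to:

Separate the bold text out of the HTML extract.
From this list of bold text, extract the entries that are both bold and contain ":", which marks them as legitimate subtitles.
Find the positions of the first and last legitimate subtitles in the list of bold text.
Check whether any other bold text without a ":" appears below the last subtitle's text. That is, test whether the last subtitle really is the last element in the list of bold text; if it is not, split the text further to separate the subtitle's text from the other text.

The code below illustrates this: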

import re

import requests
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent


def headers():
    # Build request headers with a Chrome User-Agent string.
    ua = UserAgent()
    chrome_header = ua.chrome
    headers = {'User-Agent': chrome_header}
    return headers

# Note: this rebinds the name `headers` from the function to the returned dict.
headers = headers()

r5 = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup_description = BS(r5.text, 'html.parser')
j_description = soup_description.find('span', {'itemprop':'description'})
# All bold text first, then only the bold text that contains ":" (the real subtitles).
j_description_subtitles = [j.text for j in j_description.find_all('strong')]
sub_titles_in_description = [el for el in j_description_subtitles if ":" in el]

total_length_of_sub_titles = len(sub_titles_in_description)
total_length_of_strong_tags = len(j_description_subtitles)
Position_of_first_sub_title = j_description_subtitles.index(sub_titles_in_description[0])
Position_of_last_sub_title = j_description_subtitles.index(sub_titles_in_description[-1])

# Positions are zero-based, so the last strong tag sits at index total_length_of_strong_tags - 1.
# If the last subtitle is not the last strong tag, also split on the bold text that follows it.
# re.escape makes characters such as "(" in a subtitle match literally.
if Position_of_last_sub_title != total_length_of_strong_tags - 1:
    # Keep only the three pieces that follow the three hard-coded subtitles.
    text_after_sub_t= re.split(f'{re.escape(sub_titles_in_description[0])}|{re.escape(sub_titles_in_description[1])}|{re.escape(sub_titles_in_description[-1])}|{re.escape(j_description_subtitles[Position_of_last_sub_title+1])}',j_description.text)[1:4]
else:
    text_after_sub_t= re.split(f'{re.escape(sub_titles_in_description[0])}|{re.escape(sub_titles_in_description[1])}|{re.escape(sub_titles_in_description[-1])}',j_description.text)[1:]

# Still manual: this only works when there are exactly three subtitles.
final_dict_with_sub_t_n_prec_txt= {
    sub_titles_in_description[0]: text_after_sub_t[0],
    sub_titles_in_description[1]: text_after_sub_t[1],
    sub_titles_in_description[2]: text_after_sub_t[2]
}

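The problem is the splitting of the text by subtitle. It is too manual, and the other approaches I have tried did not make it dynamic. How can I make this part dynamic, given that the number of subtitles will vary in future texts?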

python

string

list

split

python-re

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-05-15 05:13:40

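You can simplify this, or make it more generic, by using CSS selectors to select the elements. For example, p:has(strong:-soup-contains(":")) selects every <p> that contains a <strong> whose text includes a ":". Then use find_next_sibling() to get the text that belongs to each subtitle: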

dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))
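To see what the selector picks up without fetching the live page, here is a minimal sketch run against a small inline HTML fragment. The fragment is invented for illustration and only mimics the layout described in the question (subtitles as bold paragraphs, each followed by a sibling element); the real page may differ. It assumes a Soup Sieve version that supports `:-soup-contains` (2.1 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the job description markup.
html = """
<span itemprop="description">
  <p>We are looking for a Business Analyst.</p>
  <p><strong>Purpose of the Role:</strong></p>
  <p>You will be working with the internal data squad.</p>
  <p><strong>Skills we are seeking:</strong></p>
  <ul><li>Strong SQL skills</li><li>Agile working experience</li></ul>
  <p><strong>Bold text without a colon</strong></p>
  <p>Closing paragraph.</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

sections = {}
# Select only the <p> elements whose <strong> text contains ":".
for heading in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'):
    # The next sibling tag holds the text that belongs to this subtitle.
    body = heading.find_next_sibling()
    sections[heading.get_text(strip=True)] = body.get_text("|", strip=True) if body else ""

print(sections)
```

The bold paragraph without a colon is skipped, and the number of subtitles no longer has to be known in advance.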

Note: | is passed to get_text() as the separator so that, in this case, you can split the list elements apart later. You could also use a space instead: get_text(' ', strip=True).
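As a small follow-up to the note, splitting on the separator afterwards could look like the sketch below. The `data` values here are shortened samples, not the full scraped text, and `as_lists` is just an illustrative name.

```python
# Shortened sample of the kind of dict the one-liner produces.
data = {
    "Purpose of the Role:": "You will be working with the internal data squad",
    "Skills we are seeking:": "Strong SQL skills|Agile working experience|Significant asset management experience",
}

# Turn every "|"-joined value back into a list of items.
as_lists = {title: body.split("|") for title, body in data.items()}

for title, items in as_lists.items():
    print(title, items)
```

A section that was a single paragraph simply becomes a one-element list, so every value has the same type.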

Example

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.reed.co.uk/jobs/business-analyst/46819093?source=searchResults&filter=%2fjobs%2fbusiness-jobs-in-london%3fagency%3dTrue%26direct%3dTrue", headers=headers, timeout=20)

soup = BeautifulSoup(r.text, 'html.parser')

data = dict((e.text,e.find_next_sibling().get_text('|',strip=True)) for e in soup.select('[itemprop="description"] p:has(strong:-soup-contains(":"))'))

print(data)

Output

{'Purpose of the Role:': 'You will be working with the internal data squad looking at new functionality within the business and associated reporting. Part of project will involve system upgrades',
 'As the Business Analyst, you will be responsible for:': 'Looking at data sets, extracting the information and be able to look at SQL scripts, write report sequences, analyse data. Be able to understand and deliver data, ask questions and challenge requirements, understand the data journey/mapping documents.',
 'The skills, attributes and capabilities we are seeking from you include:': 'Strong communication both verbal and written|Strong teamworking within the scrum team and with other BAs and directly with business users|Significant asset management experience|Working knowledge of the key data sets that are used by an asset manager|Experience of Master Data Management tools, ideally IHS Markit EDM|Agile working experience|Ability to write user stories to detail the requirements that both the development team and the QA team will use|Strong SQL skills, ideally using Microsoft SQL Server|Experience of managing data interface mapping documentation|Familiarity with data modelling concepts|Project experience based on ETL and Data Warehousing advantageous|Technical (development) background advantageous|Have an asset management background.|Thinkfolio and Murex would be ideal, EDM platform knowledge would be desirable.'}
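If you would rather stay with the `re.split` idea from the question, the pattern can be built from however many subtitles were found instead of hard-coding three. The following is only a sketch of that alternative, not part of the approach above: `split_by_subtitles` and `stop_markers` are hypothetical names, and the sample text is made up.

```python
import re


def split_by_subtitles(text, subtitles, stop_markers=()):
    """Map each subtitle to the text that follows it, for any number of subtitles.

    stop_markers are other strings (for example bold text without ":")
    that end a section but are not subtitles themselves.
    """
    if not subtitles:
        return {}
    markers = list(subtitles) + list(stop_markers)
    # Longest first, so a marker that is a prefix of another cannot shadow it.
    pattern = "|".join(re.escape(m) for m in sorted(markers, key=len, reverse=True))
    # The capturing group makes re.split keep the matched markers in the result:
    # [preamble, marker, body, marker, body, ...]
    parts = re.split(f"({pattern})", text)
    sections = {}
    for marker, body in zip(parts[1::2], parts[2::2]):
        if marker in subtitles:
            sections[marker] = body.strip()
    return sections


sample = (
    "Intro text. Purpose of the Role: Work with the data squad. "
    "You will be responsible for: Analysing data. Apply now Click the link."
)
result = split_by_subtitles(
    sample,
    ["Purpose of the Role:", "You will be responsible for:"],
    stop_markers=["Apply now"],
)
print(result)
```

Escaping each marker with `re.escape` matters here, because a subtitle containing characters such as "(" or "?" would otherwise be read as regex syntax.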

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/72245378
