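After extracting certain sentences from a paragraph, how can I wrap them in <mark> tags while keeping the same paragraph formatting in the final output?

Stack Overflow user
Asked 2019-06-20 13:48:25
3 answers · 127 views · 0 followers · score 0

I have an HTML file that contains only <p> and <a> tags, like the following: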

<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>

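What I want to do is extract the sentences that have a certain property, for example the sentences containing Britain or party, and then wrap each whole sentence in <mark> tags while keeping the paragraph formatting.

To achieve this:

  1. I first strip all the tags to get clean paragraphs with clean sentences.
  2. Then I extract the sentences with spaCy: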
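import spacy

with open('a.html') as f:
    given_text = f.read()    # Read from the file
# given_text = ''  # alternatively, paste the HTML above as a string
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut from spaCy 2 is not available in spaCy 3
doc = nlp(given_text)

  3. Finally, I iterate over the sentences with for sent in doc.sents and use a regex to decide whether each sentence should be marked.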

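The problem with this approach is that once I clean the text (strip all the <p> and <a> tags), I lose track of the individual paragraphs. So after wrapping the sentences in tags, I end up with one giant string.

How can I keep the <p> formatting while still being able to iterate over the sentences to mark them?

The idea is to produce output in the same form as the input, except that a few sentences are highlighted.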
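The loss of paragraph boundaries described above can be reproduced in a few lines. This is a minimal sketch, assuming Beautiful Soup is used for the tag stripping (the question does not say which tool was used); the short `html_doc` string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph input, much shorter than the article in the question.
html_doc = (
    "<p>First paragraph. It has two sentences.</p> "
    "<p>Second <a href='#'>paragraph</a> here.</p>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Stripping every tag at once yields one flat string: the <p> boundaries are gone.
flat_text = soup.get_text()

# Reading the text one <p> at a time keeps one string per paragraph,
# so each paragraph can be split into sentences and rebuilt separately.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

print(flat_text)
print(paragraph_texts)
```

Iterating per `<p>` element is what lets the output be reassembled paragraph by paragraph.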

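3 Answers

Stack Overflow user

Answered 2019-06-20 14:04:18

You can try something like this:

  1. Find the sentences that contain Britain or party. I use the re module for the regular expression.
  2. Replace each of them with the same sentence wrapped in <mark>, added at the beginning and the end of the sentence.

Here is the code: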

import re

text = """<p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current < a href = "https://www.theguardian.com/politics/conservative-leadership" title = "" > Conservative party leadership contest </a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method. < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
"""



# Note: [^.] treats every "." as a sentence boundary, including the dots inside URLs.
sentences_to_modify = re.findall(r"([^.]*?(party|Britain)[^.]*\.)", text)

for sentence in sentences_to_modify:
    # sentence[0] is the full match; close the wrapper with </mark>, not a second <mark>.
    text = text.replace(sentence[0], "<mark>" + sentence[0] + "</mark>")

print(text)
# <mark><p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations.</mark> For example, if the current < a href = "https://www.theguardian.<mark>com/politics/conservative-leadership" title = "" >
# Conservative party leadership contest < /a > proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</mark> < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the
# rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning
# of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
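In the output above, the first wrapper opens before the `<p>` tag and the second one starts in the middle of an `href`, because the pattern sees tags and URL dots as ordinary sentence text. One possible way around this, staying with regular expressions, is to consume tags as whole units. The following is only a sketch under stated assumptions: `mark_sentences`, `KEYWORDS` and the pattern names are hypothetical, keywords are matched as whole words, a sentence is assumed to end at the first "." outside a tag, and an `<a>` element that spans two sentences would still produce mis-nested markup.

```python
import re

KEYWORDS = ("Britain", "party")  # hypothetical configuration

TAG = r"<[^>]+>"                       # any tag, copied through unchanged
INLINE_TAG = r"<(?!/?p\b)[^>]+>"       # any tag except <p> and </p>
UNIT = "(?:" + INLINE_TAG + "|[^.<])"  # one inline tag or one non-period character
KEYWORD = r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b"

# Alternative 1 skips over a whole tag, so a match can never start inside one.
# Alternative 2 is a sentence that contains a keyword outside of any tag.
PATTERN = re.compile("(" + TAG + ")|(" + UNIT + "*?" + KEYWORD + UNIT + r"*\.)")


def mark_sentences(html):
    def wrap(match):
        if match.group(1) is not None:
            return match.group(1)  # a tag: leave it untouched
        sentence = match.group(2)
        body = sentence.lstrip()
        lead = sentence[: len(sentence) - len(body)]  # keep leading whitespace outside <mark>
        return lead + "<mark>" + body + "</mark>"

    return PATTERN.sub(wrap, html)


sample = (
    '<p>Britain is odd. See <a href="https://www.theguardian.com/x">'
    "the party site</a> now. Nothing here.</p>"
)
print(mark_sentences(sample))
```

Because `<p>` and `</p>` are excluded from `INLINE_TAG`, a match cannot cross a paragraph boundary, which keeps each `<mark>` inside its paragraph.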

Hope this helps!

Score: 0

Stack Overflow user

Answered 2019-06-20 15:40:09

Here is one option:

from bs4 import BeautifulSoup

html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>'''
src_soup = BeautifulSoup(html_doc, 'html.parser')
dst_soup = BeautifulSoup('', 'html.parser')

WORDS_TO_LOOK_FOR = ['Britain', 'party']


def mark_if_needed(text):
    # can be improved using regex
    for word in WORDS_TO_LOOK_FOR:
        if word in text:
            return '<mark>' + text + '</mark>'
    return text


p_elements = src_soup.find_all('p')
for p in p_elements:
    a_elements = p.find_all('a')
    p.string = mark_if_needed(p.text)
    dst_soup.append(p)
    for a in a_elements:
        a.string = mark_if_needed(a.text)
        p.append(a)

print(dst_soup.prettify())

Output:

<p>
 &lt;mark&gt;For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current Conservative party leadership contest proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.&lt;/mark&gt;
 <a href="https://www.theguardian.com/politics/conservative-leadership" title="">
  &lt;mark&gt;Conservative party leadership contest&lt;/mark&gt;
 </a>
</p>
<p>
 In 2016, Theresa May’s rivals withdrew before the final round. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.
 <a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title="">
  rivals withdrew before the final round
 </a>
</p>
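The `&lt;mark&gt;` in the output above comes from assigning markup to `.string`: Beautiful Soup stores it as text and escapes it when serializing. A minimal sketch of the difference, using a made-up one-sentence paragraph rather than the article text:

```python
from bs4 import BeautifulSoup

# Assigning a string: the markup is stored as plain text and escaped on output.
soup = BeautifulSoup("<p>Britain is odd.</p>", "html.parser")
p = soup.p
p.string = "<mark>" + p.get_text() + "</mark>"
escaped = str(soup)

# Building a Tag object instead yields a real <mark> element.
soup = BeautifulSoup("<p>Britain is odd.</p>", "html.parser")
p = soup.p
mark = soup.new_tag("mark")
mark.string = p.get_text()
p.clear()
p.append(mark)
real = str(soup)

print(escaped)
print(real)
```

Note that this sketch still wraps the whole paragraph, as the answer does; it only addresses the escaping, not the per-sentence marking asked for in the question.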
Score: 0

Stack Overflow user

Answered 2019-06-27 19:05:23

After a few days of trying, I finally figured out how to do it. Here is the complete example code:

import re    
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')

html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This sentence should not be marked.</p> <p> This sentence should not be marked. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This is an unmarked random sentence. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. Another unmarked random sentnce.</p>'''

src_soup = BeautifulSoup(html_doc, 'html.parser')
dst_soup = BeautifulSoup('', 'html.parser')

word_re = "Britain"

def mark_if_needed(text):
    # Split the paragraph text into sentences with spaCy and
    # yield (matched, sentence_text) for each of them.
    doc = nlp(text)
    for sent in doc.sents:
        yield (re.search(word_re, sent.text) is not None, sent.text)

p_elements = src_soup.find_all('p')
for p in p_elements:
    # Rebuild each paragraph as a fresh <p> so the paragraph boundaries survive.
    par = dst_soup.new_tag('p')

    # p.text discards the nested <a> tags, so links are not carried over.
    for matched, sentence in mark_if_needed(p.text):
        if matched:
            m = dst_soup.new_tag('mark')
            m.append(sentence)
            par.append(m)
        else:
            par.append(sentence)

    dst_soup.append(par)

print(dst_soup.prettify())
html = dst_soup.prettify("utf-8")
with open("output.html", "wb") as file:
    file.write(html)
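The code above works on `p.text`, so the `<a>` links from the input are not carried into the output. If the links need to survive, one possible variation is to compute sentence offsets on the paragraph text and then wrap only the overlapping part of each text node. This is a sketch, not the answerer's method: `sentence_spans` is a naive regex stand-in for spaCy's `doc.sents` (it would split on abbreviations such as "Gov."), and `mark_paragraph` is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def sentence_spans(text):
    """Naive stand-in for spaCy's doc.sents: (start, end) offsets of each sentence."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S[^.!?]*[.!?]+", text)]


def mark_paragraph(soup, p, keyword_re):
    """Wrap the text of matching sentences in <mark>, leaving nested tags in place."""
    text = p.get_text()
    marked = [(s, e) for s, e in sentence_spans(text) if re.search(keyword_re, text[s:e])]
    offset = 0
    for node in list(p.find_all(string=True)):
        start, end = offset, offset + len(node)
        offset = end
        pieces, cursor = [], start
        for s, e in marked:
            lo, hi = max(s, start), min(e, end)  # overlap of sentence and text node
            if lo >= hi:
                continue
            if lo > cursor:
                pieces.append(NavigableString(text[cursor:lo]))
            mark = soup.new_tag("mark")
            mark.string = text[lo:hi]
            pieces.append(mark)
            cursor = hi
        if not pieces:
            continue  # no marked sentence touches this text node
        if cursor < end:
            pieces.append(NavigableString(text[cursor:end]))
        anchor = node
        for piece in pieces:
            anchor.insert_after(piece)
            anchor = piece
        node.extract()  # drop the original text node, now replaced by the pieces


sample = '<p>Plain one. The <a href="https://x.example/a.b">party site</a> is up. Last.</p>'
soup = BeautifulSoup(sample, "html.parser")
for p in soup.find_all("p"):
    mark_paragraph(soup, p, r"Britain|party")
print(soup)
```

A sentence that runs through a link ends up as several adjacent `<mark>` elements, one per text node, which keeps the markup well nested at the cost of slightly noisier HTML.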
Score: 0

The original page content is provided by Stack Overflow; the translation was supported by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/56687443
