I have an HTML file that contains only <p> and <a> tags, like the following:
<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>
What I want to do is pick out sentences with a particular property, for example sentences that contain Britain or party, and wrap each such sentence in a <mark> tag while keeping the paragraph structure intact.
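To do this, I read the file and run it through spaCy:
import spacy

with open('a.html') as f:
    given_text = f.read()  # Read from the file
# given_text = ''  # or paste the HTML above as a string
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3
doc = nlp(given_text)
I then iterate over the sentences with for sent in doc.sents and use a regex to decide whether each sentence should be marked. The problem with this approach is that once I clean the text (remove all the <p> and <a> tags), I lose track of the individual paragraphs. So after wrapping the sentences in the tag, I end up with one huge string.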
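How can I preserve the <p> structure while still being able to iterate over the sentences to mark them?
Apart from a few highlighted sentences, the output should look exactly like the input.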
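Posted on 2019-06-20 14:04:18
You can try something like this:
Find the sentences that contain Britain or party; I use the re module for the regular expression. Then replace each of them with the same sentence wrapped in <mark>, added at its beginning and end. Here is the code: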
text = """<p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current < a href = "https://www.theguardian.com/politics/conservative-leadership" title = "" > Conservative party leadership contest </a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method. < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
"""
import re

# Each match runs from the previous period to the next one and contains a keyword.
sentences_to_modify = re.findall(r"([^.]*?(party|Britain)[^.]*\.)", text)
for sentence in sentences_to_modify:
    # sentence[0] is the whole match, sentence[1] is the keyword that was found
    text = text.replace(sentence[0], "<mark>" + sentence[0] + "</mark>")
print(text)
# Note: the pattern treats every "." as a sentence end, including the dots inside
# the URL, so the second <mark> opens in the middle of the href.
# <mark><p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations.</mark> For example, if the current < a href = "https://www.theguardian.<mark>com/politics/conservative-leadership" title = "" >
# Conservative party leadership contest </a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</mark> < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the
# rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning
# of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
Hope this helps!
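A possible variation on the regex idea, as a minimal sketch rather than a tested solution: work on one <p> at a time so the paragraphs survive, split only at whitespace that follows ., ! or ? outside of a tag, and test the keywords against the text with the tags removed. The names mark_sentences, mark_paragraph and the compiled patterns are made up for this example. It assumes the simple <p>/<a> markup from the question and will still split wrongly at abbreviations such as "Gov.", where a real sentence segmenter like spaCy does better.

```python
import re

KEYWORDS = re.compile(r"\b(?:Britain|party)\b")
# One paragraph: opening tag, inner HTML, closing tag.
PARAGRAPH = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL)
# Whitespace after sentence-ending punctuation that is not inside a tag.
SENTENCE_GAP = re.compile(r"(?<=[.!?])(\s+)(?![^<>]*>)")
TAG = re.compile(r"<[^>]+>")


def mark_paragraph(match):
    """Wrap every matching sentence of one paragraph in <mark>."""
    open_tag, body, close_tag = match.groups()
    out = []
    # split() with a capture group returns sentences and the gaps between them.
    for piece in SENTENCE_GAP.split(body):
        visible = TAG.sub("", piece)  # keyword test ignores href values
        if KEYWORDS.search(visible):
            piece = "<mark>" + piece + "</mark>"
        out.append(piece)
    return open_tag + "".join(out) + close_tag


def mark_sentences(html):
    """Apply mark_paragraph to every <p>...</p>; text outside paragraphs is untouched."""
    return PARAGRAPH.sub(mark_paragraph, html)


sample = (
    '<p>Britain is old. See the <a href="https://example.com/a.b">party list</a> now. '
    'Nothing here.</p> <p>No match. None.</p>'
)
print(mark_sentences(sample))
```

Because the keyword test runs on the tag-free text, a URL that happens to contain "party" does not trigger a highlight on its own.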
Posted on 2019-06-20 15:40:09
Here is one option:
from bs4 import BeautifulSoup
html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>'''
src_soup = BeautifulSoup(html_doc, 'html.parser')
dst_soup = BeautifulSoup('', 'html.parser')
WORDS_TO_LOOK_FOR = ['Britain', 'party']
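def mark_if_needed(text):
    # can be improved using regex
    for word in WORDS_TO_LOOK_FOR:
        if word in text:
            return '<mark>' + text + '</mark>'
    return text

p_elements = src_soup.find_all('p')
for p in p_elements:
    a_elements = p.find_all('a')
    # Replaces the paragraph's content with its plain text (marked if needed)
    p.string = mark_if_needed(p.text)
    dst_soup.append(p)
    # Re-attach the links at the end of the paragraph
    for a in a_elements:
        a.string = mark_if_needed(a.text)
        p.append(a)
# The <mark> tags were added as plain text, and the default output formatter
# would escape them as &lt;mark&gt;; formatter=None writes them out unchanged.
print(dst_soup.prettify(formatter=None))
Output: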
<p>
<mark>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current Conservative party leadership contest proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</mark>
<a href="https://www.theguardian.com/politics/conservative-leadership" title="">
<mark>Conservative party leadership contest</mark>
</a>
</p>
<p>
In 2016, Theresa May’s rivals withdrew before the final round. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.
<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title="">
rivals withdrew before the final round
</a>
</p>
Posted on 2019-06-27 19:05:23
After a few days of trying, I finally figured out how to do it. Here is the complete example code:
import re
import spacy
from bs4 import BeautifulSoup
nlp = spacy.load('en_core_web_sm')
html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This sentence should not be marked.</p> <p> This sentence should not be marked. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This is an unmarked random sentence. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. Another unmarked random sentnce.</p>'''
src_soup = BeautifulSoup(html_doc, 'html.parser')
dst_soup = BeautifulSoup('', 'html.parser')
word_re = "Britain"
def mark_if_needed(text):
    """Yield (flag, sentence) pairs; flag is 1 when the sentence matches word_re."""
    doc = nlp(text)
    for sent in doc.sents:
        check = re.search(word_re, sent.text)
        if check is None:
            yield (0, sent.text)
        else:
            yield (1, sent.text)

p_elements = src_soup.find_all('p')
for p in p_elements:
    # Rebuild each paragraph sentence by sentence inside a fresh <p> tag.
    # p.text drops the <a> tags, so links are not carried over.
    par = dst_soup.new_tag('p')
    for sent in mark_if_needed(p.text):
        if sent[0] == 1:  # compare by value, not identity
            m = dst_soup.new_tag('mark')
            m.append(sent[1])
            par.append(m)
        else:
            par.append(sent[1])
    dst_soup.append(par)
print(dst_soup.prettify())
html = dst_soup.prettify("utf-8")
with open("output.html", "wb") as file:
    file.write(html)
https://stackoverflow.com/questions/56687443
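If the <a> links need to survive as well, one option is to keep the original markup and only insert <mark> tags at the right character offsets. The following is a rough sketch of that idea, not part of the code above: text_with_offsets, regex_sentence_spans and mark_fragment are hypothetical helpers, and the regex segmenter is only a stand-in so the example runs without spaCy. With spaCy, the (start, end) spans could instead be taken from each sentence's start_char and end_char. Every run of text between tags is wrapped separately, so a <mark> never crosses an <a> boundary.

```python
import re

TAG_SPLIT = re.compile(r"(<[^>]+>)")


def text_with_offsets(fragment):
    """Return the visible text of an HTML fragment and, for every text
    character, its index in the original fragment."""
    chars, offsets, pos = [], [], 0
    for token in TAG_SPLIT.split(fragment):
        if not token.startswith("<"):  # text token, not a tag
            for i, ch in enumerate(token):
                chars.append(ch)
                offsets.append(pos + i)
        pos += len(token)
    return "".join(chars), offsets


def regex_sentence_spans(text):
    """Stand-in segmenter (assumption): yields (start, end) spans that end
    at ., ! or ? followed by whitespace or the end of the text."""
    for m in re.finditer(r"\S.*?(?:[.!?](?=\s|$)|$)", text, re.DOTALL):
        yield m.start(), m.end()


def mark_fragment(fragment, pattern, spans=regex_sentence_spans):
    """Wrap the text of every sentence matching `pattern` in <mark>,
    leaving all existing tags where they are."""
    text, offsets = text_with_offsets(fragment)
    inserts = []  # (index in fragment, tag to insert there)
    for start, end in spans(text):
        if not re.search(pattern, text[start:end]):
            continue
        run_start = offsets[start]
        for i in range(start, end):
            last = i == end - 1
            # A jump in the offsets means a tag sits between two characters.
            if last or offsets[i + 1] != offsets[i] + 1:
                inserts.append((run_start, "<mark>"))
                inserts.append((offsets[i] + 1, "</mark>"))
                if not last:
                    run_start = offsets[i + 1]
    # Insert from the back so earlier indices stay valid.
    for pos, tag in sorted(inserts, reverse=True):
        fragment = fragment[:pos] + tag + fragment[pos:]
    return fragment


fragment = ('Britain is old. See the <a href="https://example.com/a.b">party list</a> now. '
            'Nothing here.')
print(mark_fragment(fragment, r"Britain|party"))
```

In the loop over p_elements this would be applied to the inner HTML of each paragraph (for example p.decode_contents() in BeautifulSoup) instead of p.text, and the result parsed back into the new paragraph.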