I want to use spaCy to split the sentences in a document.
from spacy.lang.en import English

nlp = English()  # just the language, with no model
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Is it possible to make the sentencizer more reliable with bypass rules, e.g. so that it never splits a sentence after an abbreviation like "no."?
Come to think of it, I naturally have a whole set of very technical, domain-specific abbreviations. How would you go about this?
Posted on 2020-09-23 15:36:48
You can write a custom function that changes the default behaviour with a rule-based approach to sentence splitting. For example:
import spacy

text = "The formula is no. 45. This num. represents the chemical properties."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

def set_custom_boundaries(doc):
    # Never start a new sentence on the token that follows "no." or "num."
    pattern_a = ['no', 'num']
    for token in doc[:-1]:
        if token.text in pattern_a and doc[token.i + 1].text == '.' and token.i + 2 < len(doc):
            doc[token.i + 2].is_sent_start = False
    return doc

# The custom rule must run before the parser so the boundaries are pre-set (spaCy 2.x API)
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])

This gives you the sentence splits you want:
Before: ['The formula is no.', '45.', 'This num.', 'represents the chemical properties.']
After: ['The formula is no. 45.', 'This num. represents the chemical properties.']

https://stackoverflow.com/questions/64029623
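One thing to be aware of: passing a plain function to nlp.add_pipe like this only works in spaCy 2.x. In spaCy 3.x a custom component has to be registered by name with the @Language.component decorator and then added by that name. Here is a minimal sketch of the same rule under the 3.x API (the component name "custom_boundaries" is just an illustrative choice):

import spacy
from spacy.language import Language

@Language.component("custom_boundaries")
def custom_boundaries(doc):
    # Same rule as above: never start a new sentence after "no." or "num."
    pattern_a = ['no', 'num']
    for token in doc[:-1]:
        if token.text in pattern_a and doc[token.i + 1].text == '.' and token.i + 2 < len(doc):
            doc[token.i + 2].is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_boundaries", before="parser")  # components are added by name in spaCy 3.x
doc = nlp("The formula is no. 45. This num. represents the chemical properties.")
print([sent.text for sent in doc.sents])

In either version the component has to run before the parser: the parser respects sentence boundaries that are already set, so pre-setting is_sent_start = False is what prevents the unwanted split.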