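I'm using the benepar parser to parse sentences into trees. How can I prevent benepar from splitting a specific substring when it parses a string?
For example, benepar splits the token gonna into two tokens, gon and na, which I don't want.
Code example, with prerequisites: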
pip install spacy benepar
python -m nltk.downloader punkt benepar_en3
python -m spacy download en_core_web_md
If I run:
import benepar, spacy
import nltk
benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)
it outputs:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gon) (VP (TO na) (VP (VB be) (NP (NN fun)))))) (. .))
The problem is that the token gonna is split into two tokens, gon and na. How can I prevent this?
Posted on 2022-09-06 21:58:19
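Use nlp.tokenizer.add_special_case. The split happens in spaCy's tokenizer, before benepar runs: spaCy's English tokenizer exceptions break gonna into gon and na. Registering gonna as a special case keeps it as a single token: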
import benepar, spacy
import nltk
benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
from spacy.symbols import ORTH
nlp.tokenizer.add_special_case(u'gonna', [{ORTH: u'gonna'}])
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)
Output:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gonna) (VP (VB be) (NP (NN fun))))) (. .))
https://stackoverflow.com/questions/73628064
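To check the tokenization on its own, without loading benepar, something like the sketch below may help. It is not part of the original answer. It assumes that a blank English pipeline (spacy.blank("en"), used here in place of en_core_web_md so that no model download is needed) carries the same tokenizer exceptions, and the capitalized variant Gonna is added only for illustration.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline contains only the tokenizer,
# which is the component responsible for splitting "gonna".
nlp = spacy.blank("en")

text = "This is gonna be fun."

# Tokens produced by the default English tokenizer rules.
before = [token.text for token in nlp(text)]

# Register each word as a single-token special case.
# "Gonna" is an illustrative addition, not from the original answer.
for word in ("gonna", "Gonna"):
    nlp.tokenizer.add_special_case(word, [{ORTH: word}])

# Tokens produced after the special cases are registered.
after = [token.text for token in nlp(text)]

print("before:", before)
print("after: ", after)
```

If the second list contains gonna as one token, the same add_special_case call should carry over to the full pipeline, since benepar parses whatever tokens spaCy's tokenizer produces.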