首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何为spacy的Sence2vec实现标记句子

如何为spacy的Sence2vec实现标记句子
EN

Stack Overflow用户
提问于 2017-09-24 11:19:13
回答 1查看 415关注 0票数 0

SpaCy已经实现了一个sense2vec word嵌入包,他们记录了here

向量都是WORD|POS形式的。例如,这句话

代码语言:javascript
复制
Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble

需要转换为

代码语言:javascript
复制
Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT I|PRON think|VERB effects|NOUN computers|NOUN have|VERB on|ADP people|NOUN are|VERB great|ADJ learning|NOUN skills/affects|NOUN because|ADP they|PRON give|VERB us|PRON time|NOUN to|PART chat|VERB with|ADP friends/new|ADJ people|NOUN ,|PUNCT helps|VERB us|PRON learn|VERB about|ADP the|DET globe(astronomy|NOUN )|PUNCT and|CONJ keeps|VERB us|PRON out|ADP of|ADP trouble|NOUN !|PUNCT

以便能够被sense2vec预先训练的嵌入所解释,并且能够以sense2vec格式存在。

如何做到这一点?

EN

回答 1

Stack Overflow用户

发布于 2017-09-24 11:22:31

基于SpaCy's bin/merge.py实现,它做了需要做的事情:

代码语言:javascript
复制
from spacy.en import English
import re

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'ENT',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}



nlp = False;
def tag_words_in_sense2vec_format(passage):
    global nlp; 
    if(nlp == False): nlp = English()
    if isinstance(passage, str): passage = passage.decode('utf-8',errors='ignore');
    doc = nlp(passage);
    return transform_doc(doc);

def transform_doc(doc):
    for index, ent in enumerate(doc.ents):
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
        #if index % 100 == 0: print ("enumerating at entity index " + str(index));
    #for np in doc.noun_chunks:
    #    while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #        np = np[1:]
    #    np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for index, sent in enumerate(doc.sents):
        if sent.text.strip():
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
        #if index % 100 == 0: print ("converting at sentence index " + str(index));
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        return '%%URL|X'
    text = re.sub(r'\s', '_', word.text)
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag

哪里

代码语言:javascript
复制
print(tag_words_in_sense2vec_format("Dear local newspaper, ..."))

结果:

代码语言:javascript
复制
 Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/46386259

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档