Q: How to tag sentences for spaCy's sense2vec implementation

Stack Overflow user

Asked on 2017-09-24 11:19:13

Answers: 1 · Views: 415 · Followers: 0 · Votes: 0

spaCy has implemented a sense2vec word-embedding package, which they document here

The vectors are all keyed in the form WORD|POS. For example, the sentence

Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble

needs to be converted to

Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT I|PRON think|VERB effects|NOUN computers|NOUN have|VERB on|ADP people|NOUN are|VERB great|ADJ learning|NOUN skills/affects|NOUN because|ADP they|PRON give|VERB us|PRON time|NOUN to|PART chat|VERB with|ADP friends/new|ADJ people|NOUN ,|PUNCT helps|VERB us|PRON learn|VERB about|ADP the|DET globe(astronomy|NOUN )|PUNCT and|CONJ keeps|VERB us|PRON out|ADP of|ADP trouble|NOUN !|PUNCT

so that it is in sense2vec format and can be looked up in the pre-trained sense2vec embeddings.
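To make the target format concrete, here is a minimal sketch (not part of sense2vec) that reads a tagged line back into (word, tag) pairs. The helper name `parse_tagged` is hypothetical, and the sketch assumes tokens are separated by whitespace and that the tag follows the last `|` in each token.

```python
def parse_tagged(line):
    """Split a 'WORD|TAG WORD|TAG ...' line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST '|', so a word containing '|' stays intact.
        word, sep, tag = item.rpartition('|')
        if not sep:
            # No '|' present: treat the item as an untagged word.
            word, tag = item, '?'
        pairs.append((word, tag))
    return pairs


print(parse_tagged("Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT"))
```

Tokens such as `globe(astronomy|NOUN` show why the split has to be on the separator rather than on punctuation: the word part can contain almost any character.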

How can this be done?

sense2vec

python

nlp

spacy

1 Answer

Stack Overflow user

Answered on 2017-09-24 11:22:31

The following is based on spaCy's bin/merge.py implementation and does what is needed:

# Written against the spaCy 1.x API (spacy.en.English, Span.merge);
# later spaCy releases changed or removed these calls.
from spacy.en import English
import re

# Map spaCy entity labels to the coarser tags used in the output.
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'ENT',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

nlp = None  # the pipeline is loaded lazily on first use


def tag_words_in_sense2vec_format(passage):
    global nlp
    if nlp is None:
        nlp = English()
    # spaCy expects unicode text, so decode raw bytes if that is what was passed.
    if isinstance(passage, bytes):
        passage = passage.decode('utf-8', errors='ignore')
    doc = nlp(passage)
    return transform_doc(doc)


def transform_doc(doc):
    # Merge each named entity into a single token carrying its mapped label.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    # Optional: also merge noun chunks into single tokens.
    # for np in doc.noun_chunks:
    #     while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #         np = np[1:]
    #     np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        if sent.text.strip():
            # One output line per sentence, skipping whitespace-only tokens.
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
    if strings:
        return '\n'.join(strings) + '\n'
    return ''


def represent_word(word):
    # URLs collapse to a single placeholder token.
    if word.like_url:
        return '%%URL|X'
    # Whitespace inside a (merged) token becomes '_' so it stays one item.
    text = re.sub(r'\s', '_', word.text)
    # Prefer the mapped entity label; fall back to the coarse POS tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag

where

print(tag_words_in_sense2vec_format("Dear local newspaper, ..."))

results in:

 Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...
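The formatting step can be tried without spaCy installed. The sketch below is an illustration only, not part of the answer's code: `FakeToken` is a hypothetical stand-in exposing just the four attributes that `represent_word` reads (`text`, `pos_`, `ent_type_`, `like_url`), the `LABELS` table is abbreviated, and the sample tokens are made up.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes used below.
FakeToken = namedtuple('FakeToken', 'text pos_ ent_type_ like_url')

# Abbreviated copy of the LABELS table from the answer.
LABELS = {'PERSON': 'ENT', 'ORG': 'ENT', 'GPE': 'ENT', 'DATE': 'DATE'}


def represent_word(word):
    # Same logic as the answer's function.
    if word.like_url:
        return '%%URL|X'
    text = re.sub(r'\s', '_', word.text)
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag


tokens = [
    FakeToken('New York', 'PROPN', 'GPE', False),        # a merged entity
    FakeToken('newspaper', 'NOUN', '', False),           # an ordinary word
    FakeToken('http://example.com', 'X', '', True),      # a URL
]
print(' '.join(represent_word(t) for t in tokens))
# prints: New_York|ENT newspaper|NOUN %%URL|X
```

Replacing inner whitespace with `_` is what keeps a merged multi-word entity as a single space-delimited item in the output line.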

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/46386259
