The sense2vec documentation mentions three main files; the first is merge_text.py. Since merge_text.py tries to open a bzip2-compressed file, I have tried several input types: txt, csv, and bzip2 files.
The file is here: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What input format does this script expect? Also, could someone suggest how to train the model?
Posted on 2017-03-29 23:20:56
I extended and adapted the code samples from sense2vec.
You can start with the input text below:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."
The script turns it into output like this:
as|ADV far|ADV as|ADP saudi_arabia|ENT … saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CONJ arithmetic|NOUN …
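To read an output line like the one above back into (word, sense) pairs, here is a minimal sketch; the `split_token` helper is my own illustration, not part of sense2vec:

```python
def split_token(token):
    # Each token is "text|TAG". Assuming the text contains no "|",
    # splitting on the last "|" recovers (word, sense).
    word, _, sense = token.rpartition('|')
    return word, sense

line = 'saudi_arabia|ENT are|VERB good|ADJ at|ADP money|NOUN'
pairs = [split_token(t) for t in line.split()]
print(pairs[0])  # ('saudi_arabia', 'ENT')
```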
Here is the code. Let me know if you have any questions.
I may publish it on github.com/woltob soon.
import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Drop punctuation such as commas, and determiners like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first character, use the lemma for the rest,
        # then re-append the trailing whitespace if there was any.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
file = open(sense2vec_filename, 'w')
file.write(result)
file.close()
print(result)
You can visualize your model with Gensim in Tensorboard using: https://github.com/ArdalanM/gensim2tensorboard
I will also adapt this code to the sense2vec preprocessing (for example, words are lowercased in that preprocessing step; just comment it out in the code).
Happy coding, woltob
Posted on 2016-08-09 16:10:34
The input file should be bzip2-compressed JSON. To use a plain-text file instead, just edit merge_text.py as follows:

def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
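For reference, the bzipped JSON input the unmodified script expects (one JSON object per line, each with a 'body' field, as in the Reddit comment dumps) can be produced and read back with the standard library alone; the filename and sample comments here are made up:

```python
import bz2
import json

loc = 'comments.jsonl.bz2'

# Write: one JSON object per line, each with a "body" field.
with bz2.BZ2File(loc, 'w') as f:
    for body in ['First comment text.', 'Second comment text.']:
        f.write((json.dumps({'body': body}) + '\n').encode('utf-8'))

# Read: this mirrors the unmodified iter_comments (which uses ujson).
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for line in file_:
            yield json.loads(line)['body']

print(list(iter_comments(loc)))  # ['First comment text.', 'Second comment text.']
```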