Given the sentence "RoBERTa is a heavily optimized version of BERT.", I need to get the embedding of every word in this sentence using RoBERTa. I looked for example code online but could not find a clear answer.
My attempt is as follows:
tokens = roberta.encode(headline)
all_layers = roberta.extract_features(tokens, return_all_hiddens=True)
embedding = all_layers[0]
n = embedding.size()[1] - 1
embedding = embedding[:, 1:n, :]
Here embedding[:, 1:n, :] is meant to keep only the embeddings of the words in the sentence, dropping the start and end special tokens.
Is this correct?
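The slicing step in the last line can be checked on a plain nested list standing in for a [batch, seq_len, dim] feature tensor (a minimal sketch with made-up placeholder "vectors"; the real tensor comes from extract_features):

```python
# Stand-in for a [1, seq_len, dim] feature tensor: one "embedding" per token,
# with positions 0 and -1 holding the start and end special tokens.
features = [[["<s>-vec"], ["RoBERTa-vec"], ["is-vec"], ["optimized-vec"], ["</s>-vec"]]]

seq_len = len(features[0])   # 5 tokens including the two special tokens
n = seq_len - 1              # index of the end token
inner = features[0][1:n]     # drop start and end, keep word vectors only

print(inner)  # [['RoBERTa-vec'], ['is-vec'], ['optimized-vec']]
```

The slice [1:n] with n = seq_len - 1 is the same as [1:-1]: everything except the first and last position.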
Posted on 2021-07-26 20:00:22
from transformers import AutoTokenizer

TOKENIZER_PATH = "../input/roberta-transformers-pytorch/roberta-base"
ROBERTA_PATH = "../input/roberta-transformers-pytorch/roberta-base"
text = "How are you? I am good."
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)
## how the words are broken into tokens
print(tokenizer.tokenize(text))
## the format of an encoding
print(tokenizer.batch_encode_plus([text]))
## OP wants the input ids
print(tokenizer.batch_encode_plus([text])['input_ids'])
## OP wants the input ids without the first and last (special) token
print(tokenizer.batch_encode_plus([text])['input_ids'][0][1:-1])
Output:
{'input_ids': [[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
[[0, 6179, 32, 47, 116, 38, 524, 205, 4, 2]]
[6179, 32, 47, 116, 38, 524, 205, 4]
And don't worry about the "Ġ" character. It just indicates that the token is preceded by a space.
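To get readable words back from the remaining ids, each id can be mapped to its token string and the "Ġ" marker turned back into a space. A minimal sketch using a hard-coded id-to-token table for the ids above (in real code the mapping comes from tokenizer.convert_ids_to_tokens):

```python
# Hypothetical id->token table for illustration only; in practice use
# tokenizer.convert_ids_to_tokens(ids) from the transformers library.
id_to_token = {
    6179: "How", 32: "Ġare", 47: "Ġyou", 116: "?",
    38: "ĠI", 524: "Ġam", 205: "Ġgood", 4: ".",
}

ids = [6179, 32, 47, 116, 38, 524, 205, 4]
tokens = [id_to_token[i] for i in ids]

# "Ġ" marks "this token follows a space", so replacing it with a space
# and joining the tokens reconstructs the original text.
text = "".join(t.replace("Ġ", " ") for t in tokens)
print(text)  # How are you? I am good.
```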
https://stackoverflow.com/questions/60824589