
Getting sentence embeddings from BERT

Stack Overflow user
Asked on 2021-10-10 17:32:31
2 answers · 6.4K views · 0 followers · 4 votes

I am copying the code from this page. I have downloaded the BERT model to my local system and obtained the sentence embeddings.

I have about 500K sentences for which I need embeddings, and this is taking a very long time.

  1. Is there a way to speed up this process?
  2. Would sending a batch of sentences rather than one sentence at a time help?

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa=["i am a boy","i live in a city"]



storage = []  # list to store all embeddings

for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` is a tuple of 13 tensors (the input embeddings plus the
    # 12 layer outputs), each with shape [1 x seq_len x 768].

    # `token_vecs` has shape [seq_len x 768]; take the second-to-last layer.
    token_vecs = hidden_states[-2][0]

    # Average all token vectors to get a single sentence embedding.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    storage.append((text,sentence_embedding))

UPDATE 1

I modified my code based on the answer provided. It is still not processing the data as a full batch.

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)


storage = []  # list to store all embeddings
for i,text in enumerate(encoded_inputs['input_ids']):
    
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print (tokens_tensor)
    print (segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` is a tuple of 13 tensors (the input embeddings plus the
    # 12 layer outputs), each with shape [1 x seq_len x 768].

    # `token_vecs` has shape [seq_len x 768]; take the second-to-last layer.
    token_vecs = hidden_states[-2][0]

    # Average all token vectors to get a single sentence embedding.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print (sentence_embedding[:10])
    storage.append((text,sentence_embedding))

I can replace the first 2 lines inside the for loop with the 2 lines below. However, they only work when all the sentences have the same length after tokenization.

tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])

Moreover, in that case, outputs = model(tokens_tensor, segments_tensors) fails.

How can I perform the processing fully in batches in this case?


2 Answers

Stack Overflow user

Accepted answer

Answered on 2021-10-19 11:23:45

Regarding your original problem: there is not much you can do about it. BERT is computationally demanding. Your best bet is to use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference with a large amount of text.
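A minimal sketch of that swap, assuming the same bert-base-uncased checkpoint as in the question (only the tokenizer line changes; the rest of the code stays as it is):

from transformers import BertTokenizerFast, BertModel

# Drop-in replacement for BertTokenizer; the Rust-backed "fast" tokenizer
# is noticeably quicker on large corpora.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()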

Having said that, I have to warn you that averaging BERT word embeddings does not create good embeddings for the sentence. See this post. From your question I assume you want to do some kind of semantic similarity search. Try using one of the open-sourced models.
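The answer does not name a specific model, but a commonly used option for this kind of semantic search is the sentence-transformers package; a hedged sketch, assuming that package is installed and using 'all-MiniLM-L6-v2' purely as an illustrative model name:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
# encode() batches internally, so a large list of sentences can be passed in directly.
sentence_embeddings = st_model.encode(
    ["i am a boy", "i live in a city"],
    batch_size=64,
    show_progress_bar=True,
)
print(sentence_embeddings.shape)  # (num_sentences, embedding_dim)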

Votes: 2

Stack Overflow user

Answered on 2021-10-11 05:55:12

One of the easiest ways to speed up your workflow is to batch the data. In the current implementation you feed only one sentence per iteration, but there is the ability to use batched data!

Now, if you are willing to implement this part yourself, I highly recommend preparing your data with the tokenizer in this way:

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
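As an editorial aside (not part of the original answer), the same tokenizer call can also pad the batch and return PyTorch tensors directly, which sidesteps the equal-length problem from UPDATE 1:

# Reuses the bert-base-uncased tokenizer and model loaded in the question's code.
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded_inputs)
print(outputs.last_hidden_state.shape)  # [batch_size, longest_sequence_in_batch, 768]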

But there is an even easier approach, using the FeatureExtractionPipeline, with comprehensive documentation! It would look like this:

from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction(["Hello I'm a single sentence",
                               "And another sentence",
                               "And the very very last one"])

Actually, UPDATE 1 changed the code slightly, but it still passes only one example at a time rather than a batch. If we want to stick with your implementation, batching would look like this:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.eval()
sentences = [ 
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
            ]
batch_size = 4  
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)

    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
        
    
    print(outputs.last_hidden_state.size())

Output:

torch.Size([4, 50, 768]) # batch_size * max_length * hidden dim
torch.Size([4, 50, 768])
torch.Size([1, 50, 768]) 

UPDATE 2

Regarding the concern about padding the batched data up to the maximum length, there are two questions. First, does the padding distort the results? No, because during the training phase the model was fed variable-length input sentences in batches, and the designers introduced a specific parameter (the attention mask) to tell the model which positions it should pay attention to! Second, how do you get rid of this garbage (padded) data? Using the attention_mask parameter, you can perform the mean operation only over the relevant data!

So the code would be modified like this:

for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)

    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
    lhs = outputs.last_hidden_state
    # Broadcast the attention mask over the hidden dimension so that padded
    # positions are zeroed out before pooling.
    attention = encoded['attention_mask'].reshape((lhs.size()[0], lhs.size()[1], -1)).expand(-1, -1, 768)
    embeddings = torch.mul(lhs, attention)
    # Average only over the real (non-padded) token positions.
    denominator = torch.count_nonzero(embeddings, dim=1)
    summation = torch.sum(embeddings, dim=1)
    mean_embeddings = torch.div(summation, denominator)
Votes: 6
The original page content is provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/69517460