I fine-tuned my model on a GPU, but inference is very slow; I believe this is because inference defaults to the CPU. Here is my inference code:
txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens=True, truncation=True, padding="max_length", return_attention_mask=True, return_tensors="pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

Below is my second inference code, which uses a pipeline (for a different model):
classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)

How can I force the transformers library to do faster inference on a GPU? I have tried adding model.to(torch.device("cuda")), but it raises this error:
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

I think the problem is that the data is not being sent to the GPU. Here is a similar question: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu
How would I send the data to the GPU, both with and without a pipeline? Any advice is greatly appreciated.
Posted on 2022-02-09 15:28:00
You should also transfer the inputs to CUDA before running inference:
device = torch.device('cuda')
# transfer model
model.to(device)
# define input and transfer to device
encoding = tokenizer.encode_plus(txt,
add_special_tokens=True,
truncation=True,
padding="max_length",
return_attention_mask=True,
return_tensors="pt")
encoding = encoding.to(device)
# inference
output = model(**encoding)

Note that nn.Module.to works in place, while torch.Tensor.to does not (it makes a copy!).
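The in-place vs. copy distinction above is easy to verify on CPU. A minimal sketch (the tensor and layer here are just illustrative examples, not from your model):

```python
import torch
import torch.nn as nn

# torch.Tensor.to returns a new tensor; the original is left unchanged
t = torch.zeros(2)            # float32 tensor
t2 = t.to(torch.float64)      # t2 is a float64 copy; t is still float32

# nn.Module.to converts the module's parameters in place
# and returns the very same module object
layer = nn.Linear(2, 2)
same = layer.to(torch.float64)
# same is layer (identical object), and layer.weight is now float64
```

This is why `model.to(device)` alone is enough for the model, but for the inputs you must keep the return value, e.g. `encoding = encoding.to(device)`.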
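For the pipeline case from your question, you don't move tensors yourself: the pipeline constructor accepts a device argument and places both the model and the inputs there. A sketch, assuming a CUDA-capable PyTorch build (and falling back to CPU when no GPU is present):

```python
import torch
import transformers

# pipeline expects -1 for CPU, or the CUDA device index (0 for cuda:0)
device = 0 if torch.cuda.is_available() else -1

classifier = transformers.pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,  # model and inputs are placed on this device internally
)

result = classifier("This was nice place")
# result is a list of dicts like [{"label": ..., "score": ...}]
```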
https://stackoverflow.com/questions/71050697