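I am using BERT from transformers to embed my input. Without the pipeline I can get output of a constant size, but not with the pipeline, because I have not found a way to pass arguments to it.
How can I pass transformers-related (tokenizer) arguments to my pipeline?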
from transformers import AutoTokenizer, AutoModel, pipeline
# BERT model and tokenizer definitions
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
inputs = ['hello world']
# Normally I would do something like this to initialize the tokenizer and get the result with constant output
tokens = tokenizer(inputs,padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model(**tokens)[0].detach().numpy().shape
# using the pipeline
pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)
# or other option
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
nlp=pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)
# to call the pipeline
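nlp("hello world")
I have tried several approaches like the options listed above, but could not get results with a constant output size. A constant output size can be achieved by setting the tokenizer arguments, but I do not know how to pass those arguments to the pipeline.
Any ideas?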
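Posted on 2021-09-16 21:06:50
The max_length tokenization parameter is not supported by default (i.e. no padding to max_length is applied), but you can create your own class and override this behavior: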
from transformers import AutoTokenizer, AutoModel
from transformers import FeatureExtractionPipeline
from transformers.tokenization_utils import TruncationStrategy
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
inputs = ['hello world']
class MyFeatureExtractionPipeline(FeatureExtractionPipeline):
    def _parse_and_tokenize(
        self, inputs, max_length, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs
    ):
        """
        Parse arguments and tokenize
        """
        # Parse arguments
        if getattr(self.tokenizer, "pad_token", None) is None:
            padding = False
        inputs = self.tokenizer(
            inputs,
            add_special_tokens=add_special_tokens,
            return_tensors=self.framework,
            padding=padding,
            truncation=truncation,
            max_length=max_length
        )
        return inputs
mynlp = MyFeatureExtractionPipeline(model=model, tokenizer=tokenizer)
o = mynlp("hello world", max_length = 500, padding='max_length', truncation=True)
Let's compare the size of the output:
print(len(o))
print(len(o[0]))
print(len(o[0][0]))
Output:
1
500
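768
Please note: this only works with transformers 4.10.X and earlier versions. The team is currently refactoring the pipeline classes, and future releases will require different adjustments (i.e. this will stop working once the refactored pipelines are released).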
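To see why the second dimension above is always 500, here is a minimal pure-Python sketch of what padding='max_length' combined with truncation=True does to a single sequence. It is only an illustration, not the tokenizer's actual implementation: the helper name pad_or_truncate is hypothetical, the special-token ids are assumed values that match the usual BERT vocabularies (they are not read from Bio_ClinicalBERT), and the input ids are arbitrary.

```python
from typing import List

# Assumed special-token ids (typical for BERT vocabularies); illustrative only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids: List[int], max_length: int) -> List[int]:
    """Hypothetical helper: return exactly max_length ids (max_length >= 2)."""
    # Reserve two positions for [CLS] and [SEP], then cut whatever does not fit.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # Fill the remaining positions with the padding id.
    return ids + [PAD_ID] * (max_length - len(ids))


# A two-token input and a 1000-token input both end up with 500 ids.
short_ids = pad_or_truncate([7592, 2088], max_length=500)
long_ids = pad_or_truncate(list(range(1000, 2000)), max_length=500)
print(len(short_ids), len(long_ids))
```

Because every sequence reaches the model with the same length, the feature-extraction output has one hidden vector per position, which is where the 500 comes from; the 768 is the model's hidden size.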
https://stackoverflow.com/questions/69196995