
Sentence classification model APIs for GPT2 and T5?

Stack Overflow user
Asked on 2020-06-24 18:09:19
2 answers · 2.9K views · 0 followers · 3 votes

I have successfully used the BERTForSequenceClassification class and API to do sentence classification. I've used it for one-sentence sentiment analysis and two-sentence NLI.
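
For reference, here is a minimal sketch of the kind of usage I mean (hypothetical example texts; in the transformers library the BERT class is spelled BertForSequenceClassification):

Code language: python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One-sentence task (e.g. sentiment analysis)
single = tokenizer("a surprisingly touching film", return_tensors="pt")
# Two-sentence task (e.g. NLI): encode the sentence pair together
pair = tokenizer("A man is playing a guitar.", "A person is making music.", return_tensors="pt")

with torch.no_grad():
    single_logits = model(**single)[0]  # shape: (1, num_labels)
    pair_logits = model(**pair)[0]      # shape: (1, num_labels)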

I can see that other models have similar classes, e.g. XLNetForSequenceClassification and RobertaForSequenceClassification. This kind of sentence classification usually works by putting a classifier layer on top of a dense vector that represents the whole sentence.

Now I'm trying to use the GPT2 and T5 models. However, when I look at the classes and APIs available for each of them, there is no equivalent "ForSequenceClassification" class. For example, for GPT2 there are the GPT2Model, GPT2LMHeadModel, and GPT2DoubleHeadsModel classes. Maybe I'm not familiar enough with the research on GPT2 and T5, but I'm fairly sure both models are capable of sentence classification.

So my questions are:

  1. Which Huggingface classes should I use for one-sentence classification with GPT2 and T5?
  2. Which classes should I use for two-sentence (sentence-pair) classification, such as natural language inference?

Thanks for any help.


2 Answers

Stack Overflow user

Answered on 2020-07-01 21:06:43

You need to use the GPT2Model class to generate sentence embeddings of the text. Once you have the embeddings, feed them into a linear NN followed by a softmax to get the logits. Below is a component I use for text classification with GPT2 (still a work in progress, so I'm open to suggestions); it follows the logic I just described:

Code language: python
# TorchModelBase provides the generic training plumbing used below
# (optimizer, max_iter, batch_size, params); it comes from the
# torch_model_base.py module distributed alongside this code.
from torch_model_base import TorchModelBase
import torch
import torch.nn as nn
import torch.utils.data
from transformers import GPT2Tokenizer, GPT2Model
import random
from spacy.util import minibatch, compounding
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
import pandas as pd
from typing import List, Tuple


def mean_across_all_tokens(hidden_states):
    # Average the last layer's token embeddings into a single sentence vector
    return torch.mean(hidden_states[-1], dim=1)

def sum_all_tokens(hidden_states):
    # Sum the last layer's token embeddings into a single sentence vector
    return torch.sum(hidden_states[-1], dim=1)

def concat_all_tokens(hidden_states):
    # Concatenate all token embeddings of the last layer into one long vector
    batch_size, max_tokens, emb_dim = hidden_states[-1].shape
    return torch.reshape(hidden_states[-1], (batch_size, max_tokens * emb_dim))



class GPT2SequenceClassifierModel(nn.Module):
    def __init__(
            self,
            hidden_size: int,
            num_classes: int,
            gpt_model_name: str,
            max_seq_length: int = 280,
            embedding_func=mean_across_all_tokens,
            combine_sentence_tokens=True
    ):
        super(GPT2SequenceClassifierModel, self).__init__()
        self.hidden_size = hidden_size
        self.fc1 = nn.Linear(hidden_size, num_classes)
        self.model = GPT2Model.from_pretrained(
            gpt_model_name,
            output_hidden_states=True
        )
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt_model_name)
        self.combine_sentence_tokens = combine_sentence_tokens
        self.embedding_func = embedding_func
        self.model.eval()
        self.max_length = max_seq_length

    def _tokenize(self, text_list: List[str]) -> torch.Tensor:
        # Tokenize the text with the provided tokenizer.
        # GPT2 has no padding or CLS token by default, so add them here
        # (alternatively: self.tokenizer.pad_token = self.tokenizer.eos_token).
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        self.tokenizer.add_special_tokens({'cls_token': '[CLS]'})
        self.model.resize_token_embeddings(len(self.tokenizer))
        input_ids = self.tokenizer.batch_encode_plus(text_list,
                                                     add_special_tokens=True,
                                                     max_length=self.max_length,
                                                     pad_to_max_length=True
                                                     )["input_ids"]

        return torch.LongTensor(input_ids)

    def _tokenize_and_predict(self, text_list: List[str]) -> torch.Tensor:
        input_ids_tensor = self._tokenize(text_list)
        out = self.model(input_ids=input_ids_tensor)
        # With output_hidden_states=True, index 2 of the output tuple holds the
        # hidden states of every layer.
        hidden_states = out[2]
        if self.combine_sentence_tokens:
            return self.embedding_func(hidden_states)
        else:
            return hidden_states[-1]


    def forward(self, text_list: List[str]):
        """
        :param text_list: list of raw strings to classify
        :return: class logits of shape (batch_size, num_classes)
        """
        if isinstance(text_list, pd.Series):
            text_list = text_list.tolist()
        with torch.no_grad():
            # Fine-tuning the GPT2 backbone is too expensive, so it stays frozen
            # and only the linear head is trained.
            gpt_out = self._tokenize_and_predict(text_list)
        batch_size = len(text_list)
        assert gpt_out.shape == (batch_size, self.hidden_size)
        # Return raw logits: CrossEntropyLoss in GPT2Classifier.fit() expects
        # unnormalized scores, so softmax is applied only in predict_proba().
        logits = self.fc1(gpt_out)  # (batch_size, num_classes)
        return logits


class GPT2Classifier(TorchModelBase):
    """GPT2 + NN head for classification problems.
    The network will work for any kind of classification task.

    Parameters
    ----------
    embed_dim: dimension of the byte-pair/token embeddings produced by the model; check the model card (the n_embd property), since each model works with exactly one embedding dimension
    max_seq_length: maximum number of tokens in a sequence (the n_positions parameter in the Hugging Face model config); shorter sequences are padded
    """
    def __init__(self,
                 model_name="distilgpt2",
                 embed_dim=768,
                 max_seq_length=1024,
                 **kwargs
                 ):
        self.model_name = model_name
        self.embed_dim = embed_dim
        self.max_seq_length = max_seq_length
        self.model = None # call fit() to set this
        self.tokenizer = None  # call fit() to set this
        self.classes = None # call fit() to set this
        super(GPT2Classifier, self).__init__(**kwargs)
        self.params += ['model_name']

    def fit(self, X, y):
        """Standard `fit` method.

        Parameters
        ----------
        X : np.array
        y : array-like
        Returns
        -------
        self

        """
        self.classes = list(set(y))
        self.model = GPT2SequenceClassifierModel(
            hidden_size=self.embed_dim,
            num_classes=len(self.classes),
            gpt_model_name=self.model_name,
            max_seq_length=self.max_seq_length
        )
        self.opt = self.optimizer(
            self.model.parameters()
        )
        self.model.train()
        loss = nn.CrossEntropyLoss()
        print("Training... max iters: ", self.max_iter)
        for epoch in range(self.max_iter):
            print("epoch no: ", epoch)
            zipped_data = list(zip(X,y))
            random.shuffle(zipped_data)
            batches = minibatch(zipped_data, size=self.batch_size)
            for batch in batches:
                X_batch, y_batch = zip(*batch)
                batch_preds = self.model(X_batch)
                err = loss(batch_preds, torch.LongTensor(y_batch))
                # Backprop:
                self.opt.zero_grad()
                err.backward()
                self.opt.step()
        return self

    def predict_proba(self, X):
        """Predicted probabilities for the examples in `X`.

        Parameters
        ----------
        X : np.array

        Returns
        -------
        np.array with shape (len(X), len(self.classes))

        """
        self.model.eval()
        with torch.no_grad():
            logits = self.model(X)
            # forward() returns raw logits, so convert them to probabilities here
            probs = torch.softmax(logits, dim=1)
            return probs.numpy()

    def predict(self, X):
        """Predicted labels for the examples in `X`. These are converted
        from the integers that PyTorch needs back to their original
        values in `self.classes`.

        Parameters
        ----------
        X : np.array

        Returns
        -------
        list of length len(X)

        """
        probs = self.predict_proba(X)
        return [self.classes[i] for i in probs.argmax(axis=1)]
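
A minimal usage sketch of the classifier above (hypothetical texts and integer labels; it assumes torch_model_base.py is importable and that TorchModelBase accepts keyword arguments such as batch_size and max_iter):

Code language: python
texts = ["the movie was great", "terrible plot and acting",
         "a delightful surprise", "not worth watching"]
labels = [1, 0, 1, 0]  # integer class ids, as expected by CrossEntropyLoss

clf = GPT2Classifier(model_name="distilgpt2", embed_dim=768,
                     max_seq_length=128, batch_size=2, max_iter=3)
clf.fit(texts, labels)
print(clf.predict(["what a wonderful film"]))
print(clf.predict_proba(["what a wonderful film"]))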
Votes: 2

Stack Overflow user

Answered on 2020-07-02 21:28:53

Well, why not use the code of GPT2LMHeadModel itself as inspiration:

Code language: python
import torch.nn as nn
from transformers import GPT2Model, GPT2PreTrainedModel


class MyGPT2LMHeadModel(GPT2PreTrainedModel):
    def __init__(self, config, num_classes):
        super().__init__(config)
        self.transformer = GPT2Model.from_pretrained('gpt2')
        #self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head = nn.Linear(config.n_embd, num_classes, bias=False)

...

    def forward(...):
        hidden_states = self.transformer(...)[0]
        lm_logits = self.lm_head(hidden_states)
...
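
If it helps, here is that sketch fleshed out into a small self-contained variant (written as a plain nn.Module for brevity; the mean-pooling of the hidden states into a single sentence vector is an assumption, not something the snippet above prescribes):

Code language: python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class GPT2ClassificationHead(nn.Module):
    """GPT2 backbone with a linear classification head, in the spirit of the sketch."""
    def __init__(self, model_name: str = "gpt2", num_classes: int = 2):
        super().__init__()
        self.transformer = GPT2Model.from_pretrained(model_name)
        self.classifier = nn.Linear(self.transformer.config.n_embd, num_classes, bias=False)

    def forward(self, input_ids, attention_mask=None):
        # Last hidden states: (batch_size, seq_len, n_embd)
        hidden_states = self.transformer(input_ids, attention_mask=attention_mask)[0]
        # Pool the token states into one sentence vector, then project to class logits
        pooled = hidden_states.mean(dim=1)
        return self.classifier(pooled)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default
model = GPT2ClassificationHead("gpt2", num_classes=2)
batch = tokenizer(["a great movie", "a terrible movie"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])  # shape: (2, 2)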
Votes: 2
The original content of this page is provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/62561471
