I want to use T5 for sentiment analysis on IMDB. My dataset has the following format:
import pandas as pd

# train data
with open("train.csv", "r") as f:
    lines = f.readlines()
lines = [line.strip().split(",") for line in lines]
lines = [[line[0], line[1], ",".join(line[2:])] for line in lines]
train = pd.DataFrame(lines[1:])
train = train.drop(train.columns[0], axis=1) # drop first column
print("\ntrain set size:", train.shape)
print("\nNumber of positives: ", train[1].astype(int).sum())
train = train.rename(columns={1: 'sentiment', 2: 'review'})
imdb_reviews = train["review"]
sentiments = train["sentiment"]
sentiments = [int(v) for v in sentiments]
sentiments=pd.DataFrame(sentiments)
sentiments=sentiments.rename(columns={0:'sentiment'})
sentiments = sentiments["sentiment"].tolist()
# test data
with open("test.csv", "r") as f:
    lines = f.readlines()
lines = [line.strip().split(",") for line in lines]
lines = [[line[0], ",".join(line[1:])] for line in lines]
test = pd.DataFrame(lines[1:])
id_test = test[0]
print("\ntest set:", test.shape)
test = pd.DataFrame(test[1])
print("Number of test sentences: {:,}\n".format(test.shape[0]))
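test = test.rename(columns={1:'review'})

I found this code, but I don't know how to adapt it to my own data format. I would appreciate it if you could tell me how to do that. The training set contains 25,000 observations, 10% of which should be used as a validation set.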
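Posted on 2023-04-13 19:24:35
That code is a mix of plain Python and pandas. It would probably be more useful to use the Hugging Face libraries, which are designed for this task.
Something like: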
# Load the IMDB dataset
from datasets import load_dataset
imdb = load_dataset("imdb")
# Load T5 specific tokenizer
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small', model_max_length=512)
# Apply T5 specific tokenizer to the imdb dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

https://datascience.stackexchange.com/questions/120532
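If you need to keep your own train.csv rather than the hosted IMDB dataset, the following is a minimal sketch of one way to get from that file to T5-style text pairs with a 10% validation split. It uses only the standard library so it can be run as is. Several things here are assumptions rather than anything stated in the question: the helper names (`read_train_rows`, `train_val_split`, `to_text_pairs`) are hypothetical, the column order `id,sentiment,review` is inferred from the question's parsing code, the mapping 1 = positive is inferred from the "Number of positives" sum, and the `"sst2 sentence: "` prefix with `positive`/`negative` targets is borrowed from T5's SST-2 task format and may not be the best choice for IMDB.

```python
import random

# Assumed mapping: the question sums this column to count positives, so 1 = positive.
LABEL_WORDS = {0: "negative", 1: "positive"}


def read_train_rows(lines):
    """Parse lines of 'id,sentiment,review' (header first) into (review, label) pairs."""
    rows = []
    for line in list(lines)[1:]:  # skip the header row, as the question's code does
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # maxsplit=2 keeps any commas inside the review text intact
        _row_id, sentiment, review = line.split(",", 2)
        rows.append((review, int(sentiment)))
    return rows


def train_val_split(rows, val_fraction=0.1, seed=42):
    """Shuffle with a fixed seed and hold out val_fraction of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def to_text_pairs(rows, prefix="sst2 sentence: "):
    """Turn (review, label) pairs into (input text, target text) pairs."""
    return [(prefix + review, LABEL_WORDS[label]) for review, label in rows]


# Small in-memory stand-in for train.csv (made-up rows, for illustration only).
sample_lines = ["id,sentiment,review"] + [
    f"{i},{i % 2},Review number {i}, with a comma" for i in range(10)
]

rows = read_train_rows(sample_lines)
train_rows, val_rows = train_val_split(rows)
train_pairs = to_text_pairs(train_rows)
val_pairs = to_text_pairs(val_rows)
print(len(train_pairs), len(val_pairs))  # 9 1
```

With the real file, something like `with open("train.csv") as f: rows = read_train_rows(f)` should work in place of `sample_lines`, and with 25,000 rows the split would hold out 2,500 for validation. The resulting input strings can then be passed through the tokenizer shown above. The target strings would also need to be tokenized to serve as labels, and if you prefer to stay within the Hugging Face tooling, the lists could be wrapped with `datasets.Dataset.from_dict` before calling `map`.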