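I am trying to use the Hugging Face Transformers library and this dataset to train and evaluate a hate speech detection model. The model's performance is secondary; I just want to get the pipeline working. I have preprocessed and tokenized the data as follows: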
import pandas as pd
import numpy as np
from numpy.random import RandomState
import re
import preprocessor as p
from transformers import AutoTokenizer
# Loading raw data
original_data = pd.read_csv('../data/data.csv')
# Make a random test and train split
rng = RandomState()
train = original_data.sample(frac=0.7, random_state=rng)
test = original_data.loc[~original_data.index.isin(train.index)].copy()
# Preprocessing: remove special characters using RegEx
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"|()\[\]%$<>{}]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")
# Custom function to clean the datasets
def clean_tweets(df):
    tempArr = []
    for line in df:
        # send to tweet-preprocessor
        tmpL = p.clean(line)
        # convert to lower case and remove punctuation
        tmpL = REPLACE_NO_SPACE.sub("", tmpL.lower())
        tmpL = REPLACE_WITH_SPACE.sub(" ", tmpL)
        tempArr.append(tmpL)
    return tempArr
# Clean the training data and store the result in a new column.
# The list is assigned directly: a DataFrame would be aligned on its own
# 0..n-1 index, which does not match the index of the sampled rows.
train["clean_tweet"] = clean_tweets(train["tweet"])
# Clean the test data the same way
test["clean_tweet"] = clean_tweets(test["tweet"])
# Tokenisation so the inputs are ready for the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

The code above produces a table for the test data, as shown below:

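As far as I understand, the next step should be training the model and then extracting the percentage of hateful tweets. Any suggestions on how to do this?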
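For reference, the two steps asked about here (training, then computing the share of hateful tweets) might be sketched roughly as below. This is only an illustration under several assumptions that the post does not confirm: the data has an integer label column named `label`, the value 1 marks a hateful tweet, and the `datasets` package is installed alongside `transformers`. The function names `fine_tune_and_predict` and `hate_percentage`, the output directory `hate_model`, and the hyperparameters are hypothetical choices, not tuned values.

```python
def fine_tune_and_predict(train_df, test_df, text_col="clean_tweet", label_col="label"):
    """Fine-tune bert-base-cased on train_df and return predicted label ids for test_df.

    Assumption: label_col holds integer class ids (0 or 1); the post does not
    show the label column, so its name is hypothetical.
    """
    # Heavy dependencies are imported here so that hate_percentage below
    # can be used without them.
    import numpy as np
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize(batch):
        # Pad to a fixed length so the default collator can stack the examples.
        return tokenizer(batch[text_col], truncation=True,
                         padding="max_length", max_length=128)

    def to_dataset(df):
        # The Trainer looks for the target under the column name "labels".
        frame = df[[text_col, label_col]].rename(columns={label_col: "labels"})
        return Dataset.from_pandas(frame, preserve_index=False).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2)
    args = TrainingArguments(output_dir="hate_model", num_train_epochs=1,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=to_dataset(train_df))
    trainer.train()

    # predictions holds one row of logits per tweet; argmax picks the class id.
    logits = trainer.predict(to_dataset(test_df)).predictions
    return np.argmax(logits, axis=-1).tolist()


def hate_percentage(predicted_labels, hate_label=1):
    """Return the share of predictions equal to hate_label, as a percentage."""
    if not predicted_labels:
        return 0.0
    hits = sum(1 for label in predicted_labels if label == hate_label)
    return 100.0 * hits / len(predicted_labels)
```

With the `train` and `test` frames from the question, usage would look like `preds = fine_tune_and_predict(train, test)` followed by `hate_percentage(preds)`. Keeping the percentage helper separate from the training code makes it easy to check on a hand-written list of labels before running a full fine-tuning job.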
Posted on 2021-10-25 06:40:54
https://datascience.stackexchange.com/questions/103389