
ValueError when using SMOTE from imblearn.over_sampling

Stack Overflow user
Asked 2017-11-13 16:46:08
1 answer, 3.9K views, 0 followers, 3 votes

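My dataset is imbalanced, so I have been trying to oversample it. I am doing binary text classification and want a 1:1 ratio between my two classes. I am trying to use SMOTE to solve this.

I followed this tutorial: https://beckernick.github.io/oversampling-modeling/

However, I get an error that says:

ValueError: could not convert string to float

Here is my code: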

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score
from imblearn.over_sampling import SMOTE

data = pd.read_csv("dataset.csv")

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(train_text, train_y)

    nb_pipeline.fit(train_text, train_y)
    predictions = nb_pipeline.predict(test_text)

    nb_conf_mat += confusion_matrix(test_y, predictions)
    score1 = f1_score(test_y, predictions)
    nb_f1_scores.append(score1)

print("F1 Score: ", sum(nb_f1_scores)/len(nb_f1_scores))
print("Confusion Matrix: ")
print(nb_conf_mat)

Can anyone tell me where I went wrong? Without the two SMOTE lines, my program runs fine.
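
For context on where the message comes from: SMOTE builds synthetic minority samples by interpolating between numeric feature vectors, so its input has to be convertible to floats, and raw sentences are not. The sketch below reproduces the same message with plain Python. `interpolate` is a hypothetical helper that only illustrates the interpolation step; it is not imblearn's implementation.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between two numeric vectors.

    Hypothetical helper: it mimics only the interpolation step of SMOTE.
    """
    gap = rng.random()  # position along the segment, in [0, 1)
    return [float(a) + gap * (float(b) - float(a)) for a, b in zip(sample, neighbor)]


rng = random.Random(0)

# Numeric feature vectors (for example, rows of a TF-IDF matrix) interpolate fine.
synthetic = interpolate([0.0, 1.0], [1.0, 3.0], rng)

# Raw sentences cannot be cast to float, which is what the traceback reports.
try:
    interpolate(["some raw sentence"], ["another raw sentence"], rng)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(synthetic)
print(error_message)
```

In the code above, `train_text` is still an array of strings when it reaches `sm.fit_sample`, which is the same situation as the second call here.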


1 answer

Stack Overflow user

Accepted answer

Answered 2017-11-13 17:56:52

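You should oversample after vectorizing the text data and before fitting the classifier. That means splitting up the pipeline in your code. The relevant part of the code should look like this: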

nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 10))),
    ('tfidf_transformer', TfidfTransformer())
])

k_fold = KFold(n_splits = 10)
nb_f1_scores = []
nb_conf_mat = np.array([[0, 0], [0, 0]])

for train_indices, test_indices in k_fold.split(data):

    train_text = data.iloc[train_indices]['sentence'].values
    train_y = data.iloc[train_indices]['isRelevant'].values

    test_text = data.iloc[test_indices]['sentence'].values
    test_y = data.iloc[test_indices]['isRelevant'].values

    vectorized_text = nb_pipeline.fit_transform(train_text)

    sm = SMOTE(ratio = 1.0)
    train_text_res, train_y_res = sm.fit_sample(vectorized_text, train_y)

    clf = MultinomialNB()
    clf.fit(train_text_res, train_y_res)
    predictions = clf.predict(nb_pipeline.transform(test_text))
6 votes
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47269418
