文章/答案/技术大牛

发布

社区首页 >问答首页 >当决策树更平衡、更准确时，多类问题的神经网络分类器怎么能只在一类中分类呢？

问当决策树更平衡、更准确时，多类问题的神经网络分类器怎么能只在一类中分类呢？
EN

Data Science用户

提问于 2020-04-14 16:28:18

回答 1查看 71关注 0票数 1

我想为有四个类的数据框架创建一个分类器。每一行只能有一个类。我有两个预测模型:一个神经网络和一个树分类器。但他们把每个人都放在一个班，在训练期间，因此在考试期间。

神经网络只在一类

中进行分类

问题是，我的神经网络的分类是：

我在这里称这个模型为：

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

model = create_model(x_train.shape[1], y_train.shape[1])
epochs =  30
batch_sz = 64

print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))

checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)


# train the model
history = model.fit(x_train.to_numpy(),
                y_train.to_numpy(),
                validation_split=0.2,
                epochs=epochs,  
                batch_size=batch_sz, 
                verbose=2,
#                 class_weight = weights, # class_weight tells the model to "pay more attention" to samples from an under-represented grade class.
#                 callbacks=[checkpoint]
                   )

# revert to the best model encountered during training
model = load_model("lc_model.h5", compile=False)

下面是模型的体系结构：

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm
# from tensorflow.python.compiler.tensorrt import trt_convert as trt

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    print("sequential")
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))

    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=MaxNorm(3)))
    model.add(Dropout(0.2))

    # output layer
    model.add(Dense(output_dim, activation='softmax'))

    # Compile model
    model.compile(loss=focal_loss(alpha=1), loss_weights=None, optimizer='nadam', metrics=['accuracy'])

    return model

这里是x_train的一部分。

id reg  0.0_x   1.0_x   17.0    21.0    30.0    40.0    50.0    60.0    70.0    Célibataire     Divorcé(e)  Marié et j'ai des enfants à charge  Marié et je n'ai pas encore d'enfants à charge  Refus de répondre   Veuf (ve)   1er cycle universitaire / Licence   2e cycle universtaire / Master  3e cycle universtaire / Doctorat    BTS     Je n'ai jamais été à l'école    Niveau collège  Niveau lycée    Niveau primaire     Autre. Merci de préciser :@NS$  Infirme     J'ai une société    Je ne travaille pas     Je suis commerçant  Je suis encore étudiant     Je suis independent     Je suis journalier, je travaille de temps à autre   Je suis retraité    Je travaille dans le secteur privé  Je travaille dans le secteur public     0.0_y   250.0   3750.0  7500.0  8750.0  11250.0     11500.0     18750.0     25000.0     35000.0     45000.0     50000.0     0.0_x.1     1.0_y   0.0_y.1     1.0_x.1     Je ne suis pas d'accord     Je suis d'accord    False_x     True_y  False_y     True_x  False_x.1   True_y.1    False_y.1   True_x.1    False_x.2   True_y.2    False_y.2   True_x.2    False_x.3   True_y.3    False_y.3   True_x.3    False_x.4   True_y.4    False_y.4   True_x.4    False_x.5   True_y.5    False_y.5   True_x.5    0.0_x.2     1.0_y.1     0.0_y.2     1.0_x.2     0.0_x.3     1.0_y.2     0.0_y.3     1.0
0   NaN     0   1   0   0   1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0   1   1   0   1   0   0   1   1   0   1   0   1   0   1   0   1   0   1   0   1   0   0   1   1   0   0   1   0   1
1   NaN     0   1   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0   1   0   0   1   1   0   1   0   1   0   0   1   1   0   1   0   1   0   1   0   0   1   1   0   1   0   1   0   1   0   1   0   0   0
...

下面是y_train的一部分：

   Voting intention in 2021_Cast a blank vote   Voting intention in 2021_I know who I will be voting for in 2021    Voting intention in 2021_I won't vote   Voting intention in 2021_I'm going to vote in 2021 but don't know for who
0                0                                  1                                                                       0                                                                       0   
1                0                                  0                                                                       0                                                                       1
...

所以当我试着测试这个模型的时候，它并不比随机的好：

sequential
Beginning model training with batch size 64 and 30 epochs
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of samples seen.
Train on 768 samples, validate on 192 samples
Epoch 1/30
768/768 - 1s - loss: -inf - acc: 0.2448 - val_loss: -inf - val_acc: 0.2708
Epoch 2/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708
...
Epoch 30/30
768/768 - 0s - loss: -inf - acc: 0.2409 - val_loss: -inf - val_acc: 0.2708

事实上，准确度略低于25%，这是我从随机选择课程中可以预料到的结果。它似乎永远学不到任何东西，因为失去的总是-inf。

因此，我在测试集上计算了模型的精度，结果更加糟糕。实际上，使用以下代码：

import numpy as np
from sklearn.metrics import f1_score

y_pred = model.predict(x_test.to_numpy())

# Revert one-hot encoding to classes
y_pred_classes = pd.DataFrame((y_pred.argmax(1)[:,None] == np.arange(y_pred.shape[1])),
                              columns=y_test.columns,
                              index=y_test.index)

y_test_vals = y_test.idxmax(1)
y_pred_vals = y_pred_classes.idxmax(1)

# F1 score
# Use idxmax() to convert back from one-hot encoding
f1 = f1_score(y_test_vals, y_pred_vals, average='weighted')
print("Test Set Accuracy: {:.2%}   (But results would have been better if trained on the FULL dataset)".format(f1))

我不明白，这是我在另一个贷款分类问题上设计的一种架构。

我明白了：Test Set Accuracy: 10.92%

加权的

：

上述所有模型均未加权或无焦损。

我试着用不同的方法来处理班级的不平衡，比如重采样。没有重采样，我是用重量做的：

weights = df_en2['Voting intention in 2021'].value_counts(normalize=True)
weights = weights.sort_index().tolist()
weights = {0: 1 / weights[0],
           1: 1 / weights[1], 
           2: 1 / weights[2],
           3: 1 / weights[3]}

其中，dfen_2是为x_train、y_train、x_test、y_test提供spilt_data()函数的数据格式，您可以找到这里 (除了贷款分类问题外，基本上是相同的架构)。

多类树分类器

与之相比，如果使用树分类器，如果将max_depth留给None，则叶子会被展开，直到所有的叶子都是纯的，或者直到所有的叶子都包含少于min_samples_split的样本。我的平均准确率为42%。

# importing necessary libraries 
from sklearn import datasets 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 

# dividing X, y into train and test data 
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) 

# training a DescisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier 
dtree_model = DecisionTreeClassifier().fit(x_train, y_train) 
dtree_predictions = dtree_model.predict(x_test) 

# creating a confusion matrix 
cm = confusion_matrix(y_test.values.argmax(axis = 1), dtree_predictions.argmax(axis = 1))

它还会：

array([[0.19047619, 0.15873016, 0.45454545, 0.27118644],
       [0.15873016, 0.38095238, 0.2       , 0.30508475],
       [0.15873016, 0.15873016, 0.4       , 0.22033898],
       [0.19047619, 0.19047619, 0.21818182, 0.38983051]])

classification

decision-trees

multiclass-classification

class-imbalance

回答 1

Data Science用户

发布于 2020-04-14 20:33:37

对于多标签分类，您希望在最后一层使用binary_crossentropy丢失和sigmoid激活，因为每个类都可以取0到1之间的值。

票数 1

页面原文内容由Data Science提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://datascience.stackexchange.com/questions/72314

复制

相似问题

问当决策树更平衡、更准确时，多类问题的神经网络分类器怎么能只在一类中分类呢？
EN

神经网络只在一类

：

多类树分类器

回答 1

Data Science用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当决策树更平衡、更准确时，多类问题的神经网络分类器怎么能只在一类中分类呢？EN

神经网络只在一类

：

多类树分类器

回答 1

Data Science用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当决策树更平衡、更准确时，多类问题的神经网络分类器怎么能只在一类中分类呢？
EN