
categorical_crossentropy expects targets to be binary matrices

Stack Overflow user
Asked on 2020-01-23 20:18:42
1 answer · viewed 1.6K times · 0 followers · 0 votes

First of all, I am not a programmer, but I am teaching myself deep learning and taking on a real project with my own dataset. My situation breaks down as follows:

I am attempting a multi-class text classification project. I have a corpus of 1000 examples, each of which has one of 4 possible labels (A1, A2, B1, B2), and the labels are mutually exclusive. All examples sit in separate folders, one .txt file per example.

After a lot of effort and a few manly tears, I managed to put together the following code:

import os
import string
import keras
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import re
import numpy as np
import tensorflow as tf
from numpy import array
from sklearn.model_selection import KFold



from numpy.random import seed
seed(1)

tf.random.set_seed(1)

root="D:/bananaCorpus"
train_dir=os.path.join(root,"train")

texts=[]
labels=[]

for label in ["A1","A2","B1","B2"]:
     directory=os.path.join(train_dir,label)
     for fname in os.listdir(directory):
         if fname[-4:]==".txt":
             f = open(os.path.join(directory, fname),encoding="cp1252")
             texts.append(f.read())
             f.close()
             if label == 'A1':
                 labels.append(0)
             elif label=="A2":
                       labels.append(1)
             elif label=="B1":
                  labels.append(2)
             else:
                labels.append(3)

print(texts)
print(labels)
print("Corpus Length", len( root), "\n")
print("The total number of reviews in the train dataset is", len(texts),"\n")
stops = set(stopwords.words("english"))
print("The number of stopwords used in the beginning: ", len(stops),"\n")
print("The words removed from the corpus will be",stops,"\n")


## This adds new words or terms from the words_to_add list to the stopwords
## (stops is a set, so use update rather than append)
words_to_add=[]
stops.update(words_to_add)

## This removes the words or terms in the words_to_remove list,
## so that they are no longer included in the stopwords
words_to_remove=["i","having"]
for w in words_to_remove:
    stops.remove(w)

texts = [[w.lower() for w in word_tokenize(str(review))
          if w not in stops and w not in string.punctuation and len(w) > 2 and w.isalpha()]
         for review in texts]

print("costumized stopwords: ", stops,"\n")
print("count of costumized stopwords",len(stops),"\n")
print("**********",texts,"\n")

#vectorization
#tokenizing the raw data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

maxlen = 50
training_samples = 200
validation_samples = 10000
max_words = 10000

#delete?
tokens=keras.preprocessing.text.text_to_word_sequence(str(texts))
print("Sequence of tokens: ",tokens,"\n")

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

print("Tokens:", sequences,"\n")
word_index = tokenizer.word_index
print("Unique tokens:",word_index,"\n") 
print(' %s unique tokens in total.' % len(word_index,),"\n")
print("Unique tokens: ", word_index,"\n")
print("Dictionary of words and their count:", tokenizer.word_counts,"\n" )
print(" Number of docs/seqs used to fit the Tokenizer:", tokenizer.document_count,"\n")
print(tokenizer.word_index,"\n")
print("Dictionary of words and how many documents each appeared in:",tokenizer.word_docs,"\n")

data = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded data","\n")
print(data)

#checking the encoding with a new document
text2="I like to study english in the morning and play games in the afternoon"
text2=[w.lower() for w  in word_tokenize("".join(str(text2))) if  w not in stops and w not in string.punctuation
          and len(w)>2 and w.isalpha()]
sequences = tokenizer.texts_to_sequences([text2])
text2 = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded text2","\n")
print(text2)


#cross-validation
labels = np.asarray(labels)

print('Shape of data tensor:', data.shape,"\n")
print('Shape of label tensor:', labels.shape,"\n")
print("labels",labels,"\n")



kf = KFold(n_splits=4, random_state=None, shuffle=True)
kf.get_n_splits(data)

print(kf)
KFold(n_splits=4, random_state=None, shuffle=True)
for train_index, test_index in kf.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]

#Pretrained embedding
glove_dir = 'D:\glove'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'),encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print("Found %s words vectors fom GLOVE."% len(embeddings_index))

#Preparing the Glove word-embeddings matrix to pass to the embedding layer(max_words, embedding_dim)
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# define vocabulary size (largest integer value)


# define model
from keras.models import Sequential
from keras.layers import Embedding,Flatten,Dense
from keras import layers
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))#vocabulary size + the size of glove version +max len of input documents.
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

#Loading pretrained word embeddings and Freezing the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
history=model.fit(X_train, y_train, epochs=6,verbose=2)
# evaluate
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))

However, I get this error:

Traceback (most recent call last):
  File "D:/banana.py", line 177, in <module>
    history=model.fit(X_train, y_train, epochs=6,verbose=2)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training.py", line 1154, in fit
    batch_size=batch_size)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training.py", line 642, in _standardize_user_data
    y, self._feed_loss_fns, feed_output_shapes)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training_utils.py", line 284, in check_loss_and_target_compatibility
    ' while using as loss `categorical_crossentropy`. '
ValueError: You are passing a target array of shape (3, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils import to_categorical

y_binary = to_categorical(y_int)

Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
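(For context, a minimal sketch, not part of the original post, of what the error message's first suggestion would look like with the integer labels 0-3 built above; the variable names follow the question's code:)

from keras.utils import to_categorical

# One-hot encode the integer labels so the targets have shape (samples, 4),
# i.e. the (samples, classes) binary matrix that categorical_crossentropy expects
y_train_onehot = to_categorical(y_train, num_classes=4)
y_test_onehot = to_categorical(y_test, num_classes=4)

# The model's final layer would then also need 4 softmax units (see the answer below)
# before fitting on the one-hot targets:
# history = model.fit(X_train, y_train_onehot, epochs=6, verbose=2)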

I tried everything mentioned in the error message, but to no avail. After some research, I concluded that the model is not actually trying to predict multiple classes, which is why the categorical_crossentropy loss is not accepted. I then realized that if I change it to binary cross-entropy the error goes away, which really just confirms that this is not working as a multi-class classification model. What can I do to adjust my code so that it works as intended? Am I out of luck and do I have to start a completely different project?

Any kind of guidance would be a great help to me and my mental health.


1 Answer

Stack Overflow user

Answered on 2020-01-23 20:37:18

You should make two changes. First, the number of neurons in the output of your network should match the number of classes, and they should use a softmax activation:

model.add(Dense(4, activation='softmax'))

Then you should use the sparse_categorical_crossentropy loss, since you are not one-hot encoding the labels:

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The model should then be able to train without errors.
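Putting both changes together with the rest of the question's code, the model definition and training section would look roughly like this (a sketch only; everything else, including the integer labels, stays as in the question):

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(4, activation='softmax'))  # one output unit per class instead of Dense(1, 'sigmoid')

# load the pretrained GloVe weights and freeze the embedding layer, as before
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# integer labels (0-3) can be used directly with sparse_categorical_crossentropy
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=6, verbose=2)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc * 100))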

1 vote
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59878418