Q: Movie review classification with a recurrent network

Stack Overflow user

Asked on 2021-03-26 17:10:58

2 answers · 187 views · 0 followers · 5 votes

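As far as I know from my research, the sequences in a dataset can have different lengths; if every batch during training contains sequences of the same length, we do not need to pad or truncate them.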
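One way to read the idea above is to group sequences by length, so that each batch is rectangular without any padding. The following is only a minimal sketch in plain Python; the helper name `bucket_by_length` and the toy data are assumptions made for illustration and are not part of the original question.

```python
from collections import defaultdict


def bucket_by_length(sequences, labels):
    """Group (sequence, label) pairs so that each bucket holds one length only."""
    buckets = defaultdict(lambda: ([], []))
    for seq, label in zip(sequences, labels):
        seqs, labs = buckets[len(seq)]
        seqs.append(seq)
        labs.append(label)
    return dict(buckets)


# Toy data standing in for tokenized reviews (not real IMDB samples).
sequences = [[5, 8], [1, 2, 3], [7, 9], [4, 4, 4], [6]]
labels = [0, 1, 1, 0, 1]

batches = bucket_by_length(sequences, labels)
for length, (seqs, labs) in sorted(batches.items()):
    # Every sequence inside one bucket has the same length.
    print(length, seqs, labs)
```

With a batch size of 1, every batch trivially satisfies this condition, which is the special case the question goes on to try.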

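To put this into practice, I decided to set the batch size to 1 and train my RNN model on the IMDB movie review classification dataset. The code I wrote is below.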

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding

max_features = 10000
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", 
                  loss="binary_crossentropy", metrics=["acc"])

history = model.fit(x_train, y_train, 
                     batch_size=batch_size, epochs=10, 
                     validation_split=0.2)

acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]

epochs = range(1, len(acc) + 1)  # one x value per recorded epoch
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()

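The error I get says that the input cannot be converted to a tensor because the input NumPy array contains list elements. However, when I change them, I keep getting similar errors.

Error message: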

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

I cannot get past this problem. Can anyone help me with it?

recurrent-neural-network

python

tensorflow

keras

deep-learning

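2 Answers

Stack Overflow user

Answered on 2021-03-26 17:26:23

Using sequence padding

There are two issues. First, you need to apply pad_sequences to the text sequences. Second, the input_shape argument on SimpleRNN is not needed here, because the layer takes its input shape from the preceding Embedding layer. Try the following code: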

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)


model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size, 
                         epochs=10, validation_split=0.2)

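There is also an official code example that may help you.

Using a mask in the Embedding layer with sequence padding

Based on your comments and the information you provided, it does seem possible to use variable-length input sequences. Still, in most cases practitioners prefer to pad sequences to a uniform length, because it is the more convenient option. Choosing non-uniform or variable input sequence lengths is something of a special case, similar to wanting variable input image sizes for a vision model.

Here, however, we add some information about padding and about how to mask out the padded values at training time, which technically amounts to training on variable-length input. Hopefully this convinces you. Let us first look at what pad_sequences does. In sequence data it is very common for every training sample to have a different length. Consider the following input: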

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

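These 3 training samples have different lengths: 3, 5, and 6. The next step is to make them all the same length by adding some value (typically 0 or -1), either at the beginning or at the end of each sequence.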

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)

array([[   0,    0,    0,  711,  632,   71],
       [   0,   73,    8, 3215,   55,  927],
       [  83,   91,    1,  645, 1253,  927]], dtype=int32)

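We can set padding="post" to add the pad values at the end of the sequence instead. In fact, "post" padding is recommended when working with RNN layers, so that the CuDNN implementation of the layers can be used. Also note that the maxlen = 6 we set above is the length of the longest input sequence. It does not have to be: with a larger dataset, padding everything to the longest sequence can become computationally expensive. We could set it to 5, assuming the model can learn a feature representation within that length; it is a kind of hyperparameter. This brings in another argument, truncating.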

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)

array([[   0,    0,  711,  632,   71],
       [  73,    8, 3215,   55,  927],
       [  91,    1,  645, 1253,  927]], dtype=int32)
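To make the padding and truncating options concrete, here is a small plain-Python sketch that mimics the behaviour described above for a single sequence. The function `pad_or_truncate` is a hypothetical helper written for illustration; it is not the Keras implementation and it assumes maxlen is at least 1.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a copy of seq with exactly maxlen items (assumes maxlen >= 1)."""
    seq = list(seq)
    # Truncate first: "pre" drops items from the front, "post" from the back.
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    # Then fill the remaining positions with the pad value.
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

padded_pre = [pad_or_truncate(s, 5) for s in raw_inputs]
padded_post = [
    pad_or_truncate(s, 5, padding="post", truncating="post") for s in raw_inputs
]

for row in padded_pre:
    print(row)
for row in padded_post:
    print(row)
```

With "pre" settings the rows match the array shown above; with "post" settings the zeros move to the end and the last sample loses its final token instead of its first.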

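OK, now we have padded input sequences, and all inputs have a uniform length. We can now mask out those extra padded values at training time. We tell the model that some parts of the data are padding and should be ignored. This mechanism is called masking. It is a way to tell sequence-processing layers that certain timesteps in the input are missing and should therefore be skipped when processing the data. There are three ways to introduce input masks in Keras models:

Add a keras.layers.Masking layer.
Configure a keras.layers.Embedding layer with mask_zero=True.
Pass a mask argument manually when calling layers that support this argument.

Here we demonstrate only the approach that configures the Embedding layer. It has a parameter named mask_zero, which defaults to False. If we set it to True, timesteps whose index is 0 are skipped. In the resulting mask, a False entry indicates that the corresponding timestep should be ignored during processing.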

padd_input = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)

embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)

[[   0    0    0  711  632   71]
 [   0   73    8 3215   55  927]
 [  83   91    1  645 1253  927]]

tf.Tensor(
[[False False False  True  True  True]
 [False  True  True  True  True  True]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)

Here is how the mask is computed in the Embedding(Layer) class.

  def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
      return None

    return tf.not_equal(inputs, 0)
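The effect of such a mask on a recurrent layer can be sketched without TensorFlow. The toy recurrence below is an assumption made purely for illustration (it is not what SimpleRNN computes): the state is updated as `2 * state + token`, and on a masked timestep the previous state is simply carried over, which is the general idea behind skipping padded steps.

```python
def compute_mask(batch, pad_value=0):
    """True where a timestep holds real data, False where it is padding."""
    return [[token != pad_value for token in row] for row in batch]


def toy_rnn(row, mask_row=None):
    """Toy recurrence; masked timesteps leave the state unchanged."""
    state = 0
    for i, token in enumerate(row):
        if mask_row is not None and not mask_row[i]:
            continue  # skip the padded timestep, keep the previous state
        state = 2 * state + token
    return state


# Post-padded toy batch: the first row has two padded timesteps.
batch = [[3, 1, 0, 0], [2, 5, 4, 1]]
mask = compute_mask(batch)

unmasked = [toy_rnn(row) for row in batch]
masked = [toy_rnn(row, m) for row, m in zip(batch, mask)]
print(mask)
print(unmasked, masked)
```

Without the mask, the two padded steps keep changing the state of the first row; with the mask, the state after the last real token is what reaches the next layer.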

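There is one catch: if we set mask_zero to True, index 0 can no longer be used in the vocabulary. According to the documentation:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers, which may take variable-length input. If this is True, all subsequent layers in the model need to support masking, or an exception will be raised. If mask_zero is set to True, index 0 cannot be used in the vocabulary (input_dim should equal the size of the vocabulary + 1).

So we should use at least max_features + 1 as input_dim. The original answer links to a good explanation of this point.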
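The "+ 1" can be illustrated with a tiny hand-built vocabulary. This is a minimal sketch with made-up tokens; `build_vocab` is a hypothetical helper and has nothing to do with how the IMDB dataset was actually encoded.

```python
def build_vocab(tokens):
    """Map each distinct token to an index starting at 1; 0 is kept for padding."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


tokens = "the movie was great the plot was thin".split()
vocab = build_vocab(tokens)
encoded = [vocab[t] for t in tokens]

# Six distinct words use indices 1..6, and index 0 is the padding value,
# so an embedding table needs len(vocab) + 1 rows to cover indices 0..6.
input_dim = len(vocab) + 1
print(encoded, input_dim)
```

Every encoded index is then strictly smaller than input_dim, and 0 never collides with a real word.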

Below is a complete example that uses these pieces.

import tensorflow as tf
from tensorflow.keras.datasets import imdb

# hyperparameters, defined before they are used
max_features = 20000  # Only consider the top 20k words
maxlen = 350  # Only consider the first 350 words out of `max_list_length(x_train)`
batch_size = 512

# get the data 
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)

# check the highest sequence length 
max_list_length = lambda seqs: max(len(s) for s in seqs)
print(max_list_length(x_train))

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

# (1). padding with value 0 at the end of the sequence - padding="post", value=0.
# (2). truncate 'maxlen' words 
# out of `max_list_length(x_train)` at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, 
                                  maxlen=maxlen, dtype="int32", 
                                  padding="post", truncating="post", 
                                  value=0.)

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

Your model definition should now be

model = Sequential()
model.add(Embedding(
           input_dim=max_features + 1,
           output_dim=32, 
           mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, 
                    batch_size=256, 
                    epochs=1, validation_split=0.2)

639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036


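Votes: 2

Stack Overflow user

Answered on 2021-04-13 02:18:57

Without sequence padding

In sequence modeling, padding is not strictly required to handle variable-length input sequences. In TensorFlow, a tensor with a variable number of elements along some axis is called ragged, and we use tf.ragged.RaggedTensor to handle ragged data. For example: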

# variable length input sequences 
ragged_list = [
    [0, 1, 2, 3],
    [4, 5],
    [6, 7, 8],
    [9]]

# convert to ragged tensor that handle such variable length inputs 
tf.ragged.constant(ragged_list).shape
shape: [4, None]

So we can feed ragged input data to a sequence model and no longer need to pad the sequences to a uniform input length.
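One way to picture ragged data is as a flat list of values plus the offsets at which each row starts and ends, which is the idea behind the row_splits encoding that a RaggedTensor can be built from. The plain-Python sketch below only illustrates that layout; `to_row_splits` and `get_row` are hypothetical helpers, not TensorFlow functions.

```python
def to_row_splits(rows):
    """Flatten variable-length rows into (values, row_splits)."""
    values, row_splits = [], [0]
    for row in rows:
        values.extend(row)
        row_splits.append(len(values))  # offset where the next row begins
    return values, row_splits


def get_row(values, row_splits, i):
    """Recover row i from the flat representation."""
    return values[row_splits[i]:row_splits[i + 1]]


ragged_list = [
    [0, 1, 2, 3],
    [4, 5],
    [6, 7, 8],
    [9]]

values, row_splits = to_row_splits(ragged_list)
print(values)
print(row_splits)
print(get_row(values, row_splits, 2))
```

No pad values are stored at all; the row lengths are implied by the differences between consecutive offsets.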

Dataset

import tensorflow as tf 
import warnings, numpy as np 
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 

# maxlen = 200 # No maximum length but whatever 
batch_size = 256
max_features = 20000  # Only consider the top 20k words

(x_train, y_train), (x_test, y_test) = \
              tf.keras.datasets.imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")

25000 Training sequences
25000 Validation sequences

# quick check 
x_train[:3]

array([list([1, 14, 22, 16, 43, 53, ....]),
       list([....]),
       list([...]),

Convert to ragged

Now we convert the data to ragged tensors, which can hold variable-length sequences.

x_train = tf.ragged.constant(x_train)
x_test  = tf.ragged.constant(x_test)

# quick check 

x_train[:3]
<tf.RaggedTensor [[1, 14, 22, 16, 43, 53, ...] [...] [...]]

x_train.shape, x_test.shape
(TensorShape([25000, None]), TensorShape([25000, None]))

Model

# Input for variable-length sequences of integers
inputs = tf.keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = tf.keras.layers.Embedding(max_features, 128)(inputs)
x = tf.keras.layers.SimpleRNN(units=32)(x)
# Add a classifier
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 32)                5152      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
=================================================================
Total params: 2,565,185
Trainable params: 2,565,185
Non-trainable params: 0
_________________________________________________________________

Compile and train

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.fit(x_train, y_train, batch_size=batch_size, verbose=2, 
          epochs=10, validation_data=(x_test, y_test))

Epoch 1/10
113s 1s/step - loss: 0.6273 - acc: 0.6295 - val_loss: 0.4188 - val_acc: 0.8206
Epoch 2/10
109s 1s/step - loss: 0.4895 - acc: 0.8041 - val_loss: 0.4703 - val_acc: 0.8040
Epoch 3/10
109s 1s/step - loss: 0.3513 - acc: 0.8661 - val_loss: 0.3996 - val_acc: 0.8337
Epoch 4/10
110s 1s/step - loss: 0.2450 - acc: 0.9105 - val_loss: 0.3945 - val_acc: 0.8420
Epoch 5/10
109s 1s/step - loss: 0.1437 - acc: 0.9559 - val_loss: 0.4085 - val_acc: 0.8422
Epoch 6/10
109s 1s/step - loss: 0.0767 - acc: 0.9807 - val_loss: 0.4310 - val_acc: 0.8429
Epoch 7/10
109s 1s/step - loss: 0.0380 - acc: 0.9932 - val_loss: 0.4784 - val_acc: 0.8437
Epoch 8/10
110s 1s/step - loss: 0.0288 - acc: 0.9946 - val_loss: 0.5039 - val_acc: 0.8564
Epoch 9/10
110s 1s/step - loss: 0.0957 - acc: 0.9615 - val_loss: 0.5687 - val_acc: 0.8575
Epoch 10/10
109s 1s/step - loss: 0.1008 - acc: 0.9637 - val_loss: 0.5166 - val_acc: 0.8550
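In the log above, the training loss falls through epoch 8 while the validation loss bottoms out much earlier. As a rough sketch, the validation losses copied from the log can be scanned for the epoch with the lowest value; the variable names here are chosen for illustration only.

```python
# Validation losses copied from the training log above, epochs 1 to 10.
val_loss = [0.4188, 0.4703, 0.3996, 0.3945, 0.4085,
            0.4310, 0.4784, 0.5039, 0.5687, 0.5166]

# Index of the smallest validation loss, converted to a 1-based epoch number.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])
```

Training beyond that epoch mostly improves the fit to the training set, so stopping around there would be a reasonable choice for this run.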

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/66813950
