Keras IMDB Sentiment Analysis

Stack Overflow user
Asked on 2018-10-12 00:47:48
Answers: 1 · Views: 600 · Followers: 0 · Votes: 1

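I'm new to ML and I'm trying to use Keras to do sentiment analysis on the IMDB dataset, based on a tutorial I found. The code below runs and reaches about 90% accuracy on the test data. However, when I try to predict two simple sentences (one positive, one negative), it gives about 0.50 for the positive one and about 0.73 for the negative one, whereas the tutorial shows 0.71 for the positive sentence and less than 0.1 for the negative one.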

Do you know what the problem is?

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import *
from keras.layers import *
import numpy as np

top_words = 5000  # 5000
# first tuple is data and sentiment lists,
# the second is testing data with sentiment
# https://keras.io/datasets/
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=top_words)

# reverse lookup
word_to_id = imdb.get_word_index()
'''word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}'''
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# Truncate and pad the review sequences, to take only the first 500 words
max_review_length = 500
x_train = sequence.pad_sequences(x_train, maxlen=max_review_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_review_length)

# Build the model

# embedding translates the words in a n dimensional
# space so "hi" becomes (0.2,0.1,0.5) in a 3 dimensional space
# it is the first layer of the network
embedding_vector_length = 32  # dimensions

# https://keras.io/getting-started/sequential-model-guide/
model = Sequential()  # sequential is a linear stack of layers

# layer of 500 x 32
model.add(
    Embedding(
        top_words,  # how many words to consider based on count
        embedding_vector_length,  # dimensions
        input_length=max_review_length))  # input vector
model.add(LSTM(100))  # the argument is the number of LSTM memory units
# If you want you can replace LSTM by a flatten layer
# model.add(LSTM(100))
# model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))  # output 0<y<1 for every x
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])
print(model.summary())


# Train the model
model.fit(
    x_train,
    y_train,
    validation_data=(x_test, y_test),
    epochs=1)  # original epochs = 3, batch-size = 64

# Evaluate the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))

# predict sentiment from reviews
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"
for review in [good, bad]:
    tmp = []
    for word in review.split(" "):
        tmp.append(word_to_id[word])
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
    print("%s. Sentiment: %s" % (
        review, model.predict(np.array([tmp_padded[0]]))))
# i really liked the movie and had fun. Sentiment: 0.715537
# this movie was terrible and bad. Sentiment: 0.0353295

1 Answer

Stack Overflow user

Answered on 2018-10-12 13:18:26

"Do you know what the problem is?" There may be nothing inherently wrong. I have a few thoughts, in order of likely impact:

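  1. If your two sentences are not representative of IMDB reviews, you can expect the model to predict poorly and erratically.
  2. Your model was trained for only one epoch, so it may not have had enough opportunity to learn a robust mapping from reviews to sentiment (assuming such a mapping is possible).
  3. Neural networks have a stochastic element, so the model you trained may not predict identically to the model in the tutorial.
  4. With "an accuracy of approximately 90%", one would expect (depending on the class distribution) roughly one in ten predictions to be incorrect. A small number of instances (two, in your case) is generally not a good way to evaluate model performance.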
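
To put rough numbers on point 4, here is a minimal sketch. It assumes each prediction is wrong independently with probability 1 - accuracy, which is a simplification; the function names are made up for this illustration and are not part of Keras.

```python
import math


def prob_at_least_one_wrong(accuracy: float, n: int) -> float:
    """Chance that at least one of n predictions is wrong (independent errors assumed)."""
    return 1.0 - accuracy ** n


def accuracy_std_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimate measured on n examples (binomial model)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    # Compare a two-sentence spot check with the 25,000-review IMDB test set.
    for n in (2, 25000):
        print(n, prob_at_least_one_wrong(0.90, n), accuracy_std_error(0.90, n))
```

Under these assumptions, a model that is 90% accurate still gets at least one of two sentences wrong about 19% of the time, so a two-sentence check says very little on its own.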

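When I ran your code, I got about 80% training accuracy and about 85% test accuracy, with "i really liked the movie and had fun. Sentiment: [0.75149596]" and "this movie was terrible and bad. Sentiment: [0.93544275]".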
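
One further thing that may be worth checking, offered as a possibility rather than a confirmed diagnosis: `imdb.load_data` shifts every word index by `index_from` (3 by default) and prepends a start token (1), while `imdb.get_word_index()` returns the unshifted ranks. The question's code has the shifting line disabled inside a string literal, so the hand-encoded sentences may not use the same ids the model was trained on. The sketch below shows the shifted encoding on a toy vocabulary; `build_lookup`, `encode`, and the ranks in `toy_index` are hypothetical names and values for illustration only.

```python
INDEX_FROM = 3            # default index_from of imdb.load_data
PAD, START, UNK = 0, 1, 2  # reserved ids used by imdb.load_data defaults


def build_lookup(raw_index, index_from=INDEX_FROM):
    """Shift raw frequency ranks so they match the ids produced by load_data."""
    lookup = {word: rank + index_from for word, rank in raw_index.items()}
    lookup["<PAD>"] = PAD
    lookup["<START>"] = START
    lookup["<UNK>"] = UNK
    return lookup


def encode(review, lookup, top_words):
    """Encode a review: start token first, unknown or too-rare words become UNK."""
    ids = [START]
    for word in review.lower().split():
        idx = lookup.get(word, UNK)
        ids.append(idx if idx < top_words else UNK)
    return ids


if __name__ == "__main__":
    # Toy stand-in for imdb.get_word_index(); these ranks are made up.
    toy_index = {"the": 1, "and": 2, "movie": 17, "terrible": 391}
    lookup = build_lookup(toy_index)
    print(encode("the movie was terrible", lookup, top_words=5000))
```

With the real dataset, `raw_index` would be the dictionary returned by `imdb.get_word_index()`, and the encoded list would then go through `sequence.pad_sequences` as in the question.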

Votes: 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/52765201
