Q: Trying to understand Keras Tokenizer's texts_to_sequences

Stack Overflow user

Asked 2018-09-05 17:01:40

Answers: 2, views: 7.7K, followers: 0, votes: 1

I am using:

from keras.preprocessing.text import Tokenizer

max_words = 10000

text = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text)

print(sequences)

This produces the following result:

[[8], [2], [7], [12], [2], [5], [1], [2], [8], [], [14], [9], [16], [7], [6], [1], [2], [], [19], [], [17], [10], [6], [1], [17], [10], [5], [3], [2], [], [8], [2], [10], [15], [8], [12], [6], [14], [2], [11], [5], [1], [2], [], [5], [7], [3], [4], [13], [4], [3], [15], [], [5], [9], [6], [11], [14], [], [20], [4], [3], [10], [], [6], [21], [4], [8], [5], [3], [4], [13], [2], [], [1], [3], [12], [2], [1], [1], [], [5], [18], [18], [2], [7], [3], [1], [], [13], [4], [1], [16], [5], [9], [], [7], [6], [11], [3], [12], [5], [1], [3], [], [1], [2], [11], [1], [4], [3], [4], [13], [4], [3], [15], [], [4], [11], [], [5], [9], [7], [6], [10], [6], [9], [4], [7], [1], []]

What exactly does this mean? Why are there so many entries? I can see that there are 16 words, because Keras splits the text above like this:

{'oxidative', 'contrast', '6', 'affects', 'in', 'dehydrogenase', 'visual', 'stress', 'glucose', 'phosphate', 'along', 'activity', 'with', 'alcoholics', 'decreased', 'sensitivity'}

Incidentally, this split is wrong for my use case, because I want to keep glucose-6-phosphate as a single token, but I think I can handle that with:

tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n')

python-3.x

keras

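2 Answers

Stack Overflow user

Accepted answer

Answered 2018-09-05 17:16:25

tokenizer.fit_on_texts expects a list of texts, whereas you are passing it a single string. The same goes for tokenizer.texts_to_sequences(). Try passing a list to both methods: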

from keras.preprocessing.text import Tokenizer

max_words = 10000

text = 'Decreased glucose-6-phosphate dehydrogenase ...'

tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts([text])
sequences = tokenizer.texts_to_sequences([text])

This gives you a list of integer sequences, one per text, encoding the words of each sentence, which is probably what you want:

sequences

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]]
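The output above has 14 entries rather than 16 because the hyphen was removed from filters, so glucose-6-phosphate stays in one piece. The following is a minimal pure-Python sketch of that effect. The split_words helper is a hypothetical stand-in written for illustration (it is not the Keras implementation), and it assumes the default filter string is the one shown in the question plus the hyphen.

```python
# Hypothetical stand-in for Keras' word splitting; not the library code itself.
# Assumption: the default filter set is the question's string plus '-'.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
NO_HYPHEN_FILTERS = DEFAULT_FILTERS.replace('-', '')


def split_words(text, filters):
    """Lowercase, turn every filtered character into a space, then split."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()


text = ('Decreased glucose-6-phosphate dehydrogenase activity along with '
        'oxidative stress affects visual contrast sensitivity in alcoholics.')

default_words = split_words(text, DEFAULT_FILTERS)
custom_words = split_words(text, NO_HYPHEN_FILTERS)

print(len(default_words))  # 16: the hyphens split the compound into 3 words
print(len(custom_words))   # 14: the compound survives as one token
print(custom_words[1])     # glucose-6-phosphate
```

The trailing period is still filtered in both cases, so the last token is alcoholics without punctuation.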

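Votes: 5

Stack Overflow user

Answered 2018-09-05 17:16:06

This happens because the Tokenizer built a dictionary of characters rather than a dictionary of words. The dictionary looks like this: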

{'s': 1, 'e': 2, 't': 3, 'i': 4, 'a': 5, 'o': 6, 'c': 7, 'd': 8, 'l': 9, 'h': 10, 'n': 11, 'r': 12, 'v': 13, 'g': 14, 'y': 15, 'u': 16, 'p': 17, 'f': 18, '6': 19, 'w': 20, 'x': 21}
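One way to see where a character dictionary like this, and the empty lists in the question's output, can come from: iterating over a Python str yields one character at a time, so each character is handled as its own "text", and characters that are filtered out (spaces, punctuation) leave nothing behind. The sketch below imitates that with plain Python. The helpers fit and to_sequences are hypothetical simplifications, not the Keras API, and the frequency-ranked indexing is an assumption about how the real Tokenizer assigns numbers.

```python
from collections import Counter

# Assumed default filter set; simplified imitation, not the Keras implementation.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, replace filtered characters with spaces, then split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def fit(texts):
    """Build an index ranked by frequency (1 = most frequent) over all texts."""
    counts = Counter()
    for text in texts:  # if texts is a str, each "text" is a single character
        counts.update(split_words(text))
    ranked = sorted(counts, key=counts.get, reverse=True)  # stable for ties
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_sequences(texts, index):
    """Map every text to the list of indices of its known words."""
    return [[index[w] for w in split_words(text) if w in index]
            for text in texts]


sample = 'to be.'

char_index = fit(sample)        # string passed directly
word_index = fit([sample])      # string wrapped in a list

print(char_index)                          # {'t': 1, 'o': 2, 'b': 3, 'e': 4}
print(to_sequences(sample, char_index))    # [[1], [2], [], [3], [4], []]
print(word_index)                          # {'to': 1, 'be': 2}
print(to_sequences([sample], word_index))  # [[1, 2]]
```

The two empty lists in the character-level result correspond to the space and the period, which mirrors the [] entries in the question's output.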

The Tokenizer methods take a list of texts as input, not a single string. Do the following:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
max_words = 10000

text = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'
text = text_to_word_sequence(text)
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text)
print(sequences)

This is what your dictionary looks like now:

print(tokenizer.word_index)

{'decreased': 1, 'glucose': 2, '6': 3, 'phosphate': 4, 'dehydrogenase': 5, 'activity': 6, 'along': 7, 'with': 8, 'oxidative': 9, 'stress': 10, 'affects': 11, 'visual': 12, 'contrast': 13, 'sensitivity': 14, 'in': 15, 'alcoholics': 16}
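A note on the shape of the result, sketched in plain Python: because this approach passes a list of individual words, each word is presumably treated as a separate one-word text, so the printed sequences would be one single-element list per word rather than one list for the whole sentence. The snippet below only illustrates that difference with dictionary lookups on a shortened excerpt of the dictionary above. It does not call Keras, and the variable names are made up for the example.

```python
# Shortened excerpt of the word index shown above (illustration only).
word_index = {'decreased': 1, 'glucose': 2, '6': 3, 'phosphate': 4}
words = ['decreased', 'glucose', '6', 'phosphate']

# Each word handled as its own text: one single-element sequence per word.
per_word = [[word_index[w]] for w in words]

# The whole word list handled as one text: a single sequence.
per_sentence = [[word_index[w] for w in words]]

# Flattening the per-word result recovers the sentence-level sequence.
flattened = [idx for seq in per_word for idx in seq]

print(per_word)      # [[1], [2], [3], [4]]
print(per_sentence)  # [[1, 2, 3, 4]]
print(flattened)     # [1, 2, 3, 4]
```

If a single sequence per sentence is wanted, wrapping the original string in a list, as in the accepted answer, avoids the extra flattening step.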

Votes: 4

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/52181164
