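I'm trying to understand how to prepare paragraphs for ELMo vectorization.
The docs only show how to embed multiple sentences/words at once.
For example: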
sentences = [["the", "cat", "is", "on", "the", "mat"],
["dogs", "are", "in", "the", "fog", ""]]
elmo(
inputs={
"tokens": sentences,
"sequence_len": [6, 5]
},
signature="tokens",
as_dict=True
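)["elmo"]
As far as I understand, this returns two vectors, each representing a given sentence. How would I prepare the input data to vectorize a whole paragraph containing multiple sentences? Note that I want to use my own preprocessing.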
Can it be done like this?
sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
"<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]
Or like this?
sentences = [["the", "cat", "is", "on", "the", "mat", ".",
"dogs", "are", "in", "the", "fog", "."]]
Posted on 2018-12-01 19:42:31
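ELMo produces contextual word vectors. The vector for a word is therefore a function of both the word itself and the context it appears in, e.g. the sentence.
As in the example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So your second example. To get this format, you could use the spaCy tokenizer: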
import spacy
# you need to install the language model first. See spacy docs.
nlp = spacy.load('en_core_web_sm')
text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]
I don't think you need the extra padding "" in the second sentence, since sequence_len takes care of that.
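On the padding point: if the module does turn out to need a rectangular batch (an assumption worth checking against the ELMo module documentation, since a tensor built from nested lists of different lengths is generally not accepted), a small helper can pad the shorter sentences with "" and record the true lengths. This is only a sketch in plain Python; `pad_batch` is a hypothetical helper name, not part of any library.

```python
def pad_batch(sentences, pad_token=""):
    """Pad tokenized sentences to a common length and return the true lengths."""
    # True (unpadded) length of each sentence, to be used as sequence_len.
    lengths = [len(s) for s in sentences]
    max_len = max(lengths, default=0)
    # Append pad_token to every sentence shorter than the longest one.
    padded = [list(s) + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
tokens, sequence_len = pad_batch(sentences)
```

Here `tokens` and `sequence_len` would take the place of the hand-written values passed as "tokens" and "sequence_len" in the question's example.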
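Update
"As far as I understand, this returns two vectors, each representing a given sentence."
No, this returns a vector for each word in each sentence. If you want the whole paragraph to be the context (for each word), just change it to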
sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]
and
...
"sequence_len": [11]

https://stackoverflow.com/questions/53570918
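If the sentences were tokenized separately, the single-sequence input described in the update can be built by flattening them. A minimal sketch in plain Python, assuming the paragraph is already a list of token lists; `flatten_paragraph` is a hypothetical helper name.

```python
from itertools import chain


def flatten_paragraph(sentences):
    """Merge tokenized sentences into one sequence so the whole paragraph is the context."""
    tokens = list(chain.from_iterable(sentences))
    # A batch holding a single sequence, plus its length for sequence_len.
    return [tokens], [len(tokens)]


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
paragraph, sequence_len = flatten_paragraph(sentences)
```

With this input the module would still return one vector per token, now for a single sequence of 11 tokens.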