I want to set up a teacher-student system in which a teacher seq2seq model generates a top-k list of hypotheses that are used to train a student seq2seq model.
My plan is to batch the teacher hypotheses: the teacher outputs a tensor whose batch axis has length k·B, where B is the batch-axis length of the input. The output batch tensor then contains k hypotheses for every sequence in the input batch, ordered by the position of the associated input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensor still has a batch-axis length of B, so I use tf.repeat to repeat the sequences in the student encoder's output tensor k times before feeding that tensor into the student's decoder.
For debugging purposes I have simplified this to repeating the teacher's single best hypothesis for now, until I am ready to implement the top-k list selection.
Here is a summary of my config file:
[...]
# Variables:
student_target = "teacher_hypotheses_stack"
[...]
# Custom repeat function:
def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf
    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])
    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])
    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")

def repeat_s(source, **kwargs):
    return repeat(source, "student")
[...]
# Configuration of the teacher + repeating of its output
**teacher_network(), # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable
"teacher_stack": {
"class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
"trainable": False
# "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["teacher_stack"],
"trainable": False,
"register_as_extern_data": "teacher_hypotheses_stack"
}
[...]
# Repeating of the student's encoder output + configuration of its decoder
"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]}, # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["student_encoder_repeater"]
},
"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim}, # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]}, # (B, enc-T, H, D'/H)
"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
"end": {"class": "compare", "from": ["output"], "value": 0},
'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0}, # feedback_input
"model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
"model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
"model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads}, # (B, enc-T, H)
"model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]}, # (B, enc-T, H)
"model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
"eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
"model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"}, # (B, H, V)
"model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]}, # (B, H*V)
"model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3}, # transform
"model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3}, # merge + post_merge bias
"model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
"model1_output_prob": {
"class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
"target": student_target,
"loss": "ce", "loss_opts": {"label_smoothing": 0.1}
}
}, "target": student_target},
[...]

Running this config prints the following error message to the console:
[...]
Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
[[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[...]
Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
<tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
<tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
<tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
<tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
<tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION
[...]
  File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
    x = check_dim_equal(x, 0, seq_lens, 0)
[...]

So the network builds without errors, but crashes with an assertion error in the first training step. To me it looks like RETURNN or TensorFlow somehow validates the batch length against its original value. But I do not know where or why, so I do not know what to do about it.
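The mismatch in the log can be reproduced with plain shape arithmetic (a sketch using the numbers printed above, not RETURNN code):

```python
# Shape arithmetic behind the failing assertion, using the values from
# the log above (a plain-Python sketch, not RETURNN code).
B = 23   # batch size of the input (23 sequences in the feed dict)
k = 3    # repeats=3 in the custom repeat() function
T = 17   # target sequence length

# tf.repeat on axis=1 of the time-major [T, B] target turns [17, 23]
# into [17, 69], so the batch axis is now k * B = 69 ...
repeated_batch = k * B
# ... but the seq-length vector of the target was never repeated and
# still has B = 23 entries, which is what check_dim_equal compares:
seq_len_entries = B
assert repeated_batch != seq_len_entries  # 69 != 23 -> assertion fires
```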
What am I doing wrong? Is my idea even implementable with RETURNN this way?
Edit (10th June 2020): For clarification: my ultimate goal is for the teacher to generate a top-k list of hypotheses for every input sequence, which is then used to train the student. So for each of the student's input sequences there are k solutions/target sequences. To train the student, it has to predict the probability of each hypothesis, and then the cross-entropy loss is computed to determine the update gradients. But if there are k target sequences for each input sequence, the student has to decode the encoder states k times, each time targeting a different target sequence. That is why I want to repeat the encoder states k times, making the student decoder's data parallel, and then use RETURNN's default cross-entropy loss implementation:
input-seq-1 --- teacher-hyp-1-1;
input-seq-1 --- teacher-hyp-1-2;
...;
input-seq-1 --- teacher-hyp-1-k;
input-seq-2 --- teacher-hyp-2-1;
...

Is there a more appropriate way to achieve my goal?
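The intended pairing can be sketched in plain Python (names are illustrative); tf.repeat along the batch axis produces exactly this consecutive ordering:

```python
# Plain-Python sketch of the intended data-parallel pairing (names are
# illustrative): repeating each encoder state k times consecutively, as
# tf.repeat along the batch axis does, lines entry i*k + j up with
# teacher hypothesis j of input sequence i.
def repeat_batch(batch, k):
    """Repeat every batch entry k times, like tf.repeat(..., axis=0)."""
    return [entry for entry in batch for _ in range(k)]

encoder_states = ["enc-seq-1", "enc-seq-2"]         # B = 2
teacher_hyps = ["hyp-1-1", "hyp-1-2", "hyp-1-3",    # k = 3 hypotheses
                "hyp-2-1", "hyp-2-2", "hyp-2-3"]    # per input sequence

pairs = list(zip(repeat_batch(encoder_states, k=3), teacher_hyps))
assert pairs[0] == ("enc-seq-1", "hyp-1-1")
assert pairs[3] == ("enc-seq-2", "hyp-2-1")
```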
Edit (12th June 2020 #1): Yes, I know that the teacher's DecisionLayer already selects the best hypothesis, and that this way I am only repeating the best hypothesis k times. I am doing this as an intermediate step towards my ultimate goal. Later I want to take the top-k list from the teacher's ChoiceLayer, but I felt that this is a separate construction site.
But Albert, you say RETURNN would somehow extend the data on the batch dimension automatically? How can I picture that?
Edit (12th June 2020 #2): Okay, now I select the top-k (this time k=4) hypotheses list from the teacher's choice layer (or output layer):
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}但是,使用这些数据作为学生的培训目标会导致错误:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

So I assume that the student's target data, the hypotheses list, has a batch-axis length that is k=4 times that of the student's input data/encoder state data. Doesn't the student encoder state data need to be extended/repeated here to match the target data?
Edit (12th June 2020 #3): I consider the initial issue solved. The overall problem continues here: Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence
Posted on 2020-06-10 10:53:50
It does not just validate the batch length. It will collapse batch and time (it uses flatten_with_seq_len_mask, see the Loss.init code and that function) and then compute the loss on the flattened tensor. So the seq lengths need to match as well. This could be a problem, but I am not sure. Since the rec layer itself also has the same target, it should have the same seq length in training.
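To illustrate what flatten_with_seq_len_mask does, here is a simplified plain-Python sketch of the behavior described above (not the RETURNN implementation):

```python
# Simplified sketch of what flatten_with_seq_len_mask does: batch and
# time are collapsed into one axis, keeping only the positions inside
# each sequence's length. This is why the seq lengths must match the
# batch entries before the loss is computed on the flattened tensor.
def flatten_with_seq_len_mask(batch, seq_lens):
    # batch: B sequences padded to a common max length; seq_lens: B ints
    return [x for seq, n in zip(batch, seq_lens) for x in seq[:n]]

padded = [[1, 2, 0], [3, 4, 5]]  # B = 2 sequences, padded to T = 3
assert flatten_with_seq_len_mask(padded, [2, 3]) == [1, 2, 3, 4, 5]
```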
You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. check the Data (batch_shape_meta) output and whether the axes are all correct, as you expect them. (debug_print_layer_output_template can and should always be enabled. It does not make anything slower.) You can also temporarily enable debug_print_layer_output_shape, which will really print the shapes of all tensors. That way you can verify how it looks.
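In a RETURNN config, these are plain top-level flags (as named above):

```python
# Debug flags as named above, set at the top level of the RETURNN config:
debug_print_layer_output_template = True  # print Data templates per layer (cheap, keep enabled)
debug_print_layer_output_shape = True     # print actual runtime shapes (enable temporarily)
```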
Your use of ReinterpretDataLayer looks very wrong. You should never explicitly set axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing that at all? It could be the reason why things end up messed up.
Your repeat function is not very generic. You are also using hard-coded axis integers there. You should never do that. Instead, you would write something like this:
input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)

Do I understand correctly that this is what you want to do? Repeat in the batch axis? In that case, you also need to adapt the seq-length information of that layer's output. You cannot simply use that function as-is in an EvalLayer. You also need to define out_type as a function that correctly returns the right Data template. E.g.:
def repeat_out(out):
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out
...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}

Now you have the additional problem that every time you call this repeat_out, you will get new seq-length information. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time), and that will lead to errors or strange effects. To solve this, you should reuse the same seq length. E.g.:
"teacher_stack_": {
"class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
"class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}顺便说一句,你为什么要做这种重复呢?那背后的想法是什么?你把学生和老师都重复了三遍?因此,仅仅通过第3因子提高你的学习率也会有同样的效果吗?
Edit: It seems this is done to match the top-k list. In that case, it is all wrong, because RETURNN should already do such repetition automatically. You should not do this manually.
Edit: To understand how the repetition (and beam search resolving in general) works, the first thing to do is look at the log output (you must have debug_print_layer_output_template enabled, but you should have that anyway, all the time). You will see the output of each layer, especially its Data output object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). Note that this is only the static compile-time shape, so the batch dim there is just a marker. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (basically from any ChoiceLayer), has a beam, and the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).
Edit: Regarding the repetition: this is done automatically. You do not need to call copy_extend_with_beam manually for this.
If you expect to get the top-k list from the teacher, you are also likely doing it wrong, since I see that you use "teacher_decision" as input. I guess this comes from a DecisionLayer? In that case, it has already taken the first-best from the top-k beam.
Edit: Now I understand that you ignore this for now and instead only want to take the first-best and then repeat it. I would recommend not to do that, because you are making it unnecessarily complex, and you are kind of fighting RETURNN, which knows what the batch dim should be, and it will get confused. (You could make it work via what I wrote above, but really, it is just unnecessarily complex.)
Btw, setting "trainable": False on an EvalLayer is pointless. It has no effect. The eval layer has no parameters anyway.
https://stackoverflow.com/questions/62258957