I want to set up a teacher-student system in which a teacher seq2seq model generates a top-k list of hypotheses that are used to train a student seq2seq model.
My plan is to batch the teacher hypotheses: the teacher outputs a tensor whose batch axis has length k·B, where B is the batch-axis length of the input. The output batch tensor then contains k hypotheses for every sequence in the input batch, ordered by the position of the associated input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensor still has a batch-axis length of B, so I use tf.repeat to repeat the sequences in the student encoder's output tensor k times before feeding that tensor into the student's decoder.
For debugging purposes I have simplified this to repeating the teacher's single best hypothesis for now, until I am ready to implement the top-k list selection.
Here is a summary of my config file:
[...]
# Variables:
student_target = "teacher_hypotheses_stack"
[...]
# Custom repeat function:
def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf
    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])
    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])
    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")

def repeat_s(source, **kwargs):
    return repeat(source, "student")
[...]
# Configuration of the teacher + repeating of its output
**teacher_network(), # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable
"teacher_stack": {
"class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
"trainable": False
# "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["teacher_stack"],
"trainable": False,
"register_as_extern_data": "teacher_hypotheses_stack"
}
[...]
# Repeating of the student's encoder output + configuration of its decoder
"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]}, # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
"class": "reinterpret_data",
"set_axes": {"B": 1, "T": 0},
"enforce_time_major": True,
"from": ["student_encoder_repeater"]
},
"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim}, # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]}, # (B, enc-T, H, D'/H)
"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
"end": {"class": "compare", "from": ["output"], "value": 0},
'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0}, # feedback_input
"model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
"model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
"model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
"model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads}, # (B, enc-T, H)
"model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]}, # (B, enc-T, H)
"model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
"eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
"model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"}, # (B, H, V)
"model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]}, # (B, H*V)
"model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3}, # transform
"model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3}, # merge + post_merge bias
"model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
"model1_output_prob": {
"class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
"target": student_target,
"loss": "ce", "loss_opts": {"label_smoothing": 0.1}
}
}, "target": student_target},
[...]

Running this config prints the following error message to the console:
[...]
Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
[[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[...]
Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
<tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
<tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
<tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
<tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
<tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
<tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION
[...]
  File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
    x = check_dim_equal(x, 0, seq_lens, 0)
[...]

So the network builds without errors, but crashes with an assertion error in the first training step. To me it looks like RETURNN or TensorFlow somehow validates the batch length against its original value. But I do not know where or why, so I do not know what to do about it.
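The mismatch in the log can be reproduced with plain shape arithmetic (a sketch using the numbers printed above, not RETURNN code):

```python
# Shape arithmetic behind the failing assertion, using the values from
# the log above (a plain-Python sketch, not RETURNN code).
B = 23   # batch size of the input (23 sequences in the feed dict)
k = 3    # repeats=3 in the custom repeat() function
T = 17   # target sequence length

# tf.repeat on axis=1 of the time-major [T, B] target turns [17, 23]
# into [17, 69], so the batch axis is now k * B = 69 ...
repeated_batch = k * B
# ... but the seq-length vector of the target was never repeated and
# still has B = 23 entries, which is what check_dim_equal compares:
seq_len_entries = B
assert repeated_batch != seq_len_entries  # 69 != 23 -> assertion fires
```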
What am I doing wrong? Is my idea even implementable with RETURNN this way?
Edit (10th June 2020): For clarification: my ultimate goal is for the teacher to generate a top-k list of hypotheses for every input sequence, which is then used to train the student. So for each of the student's input sequences there are k solutions/target sequences. To train the student, it has to predict the probability of each hypothesis, and then the cross-entropy loss is computed to determine the update gradients. But if there are k target sequences for each input sequence, the student has to decode the encoder states k times, each time targeting a different target sequence. That is why I want to repeat the encoder states k times, making the student decoder's data parallel, and then use RETURNN's default cross-entropy loss implementation:
input-seq-1 --- teacher-hyp-1-1;
input-seq-1 --- teacher-hyp-1-2;
...;
input-seq-1 --- teacher-hyp-1-k;
input-seq-2 --- teacher-hyp-2-1;
...

Is there a more appropriate way to achieve my goal?
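The intended pairing can be sketched in plain Python (names are illustrative); tf.repeat along the batch axis produces exactly this consecutive ordering:

```python
# Plain-Python sketch of the intended data-parallel pairing (names are
# illustrative): repeating each encoder state k times consecutively, as
# tf.repeat along the batch axis does, lines entry i*k + j up with
# teacher hypothesis j of input sequence i.
def repeat_batch(batch, k):
    """Repeat every batch entry k times, like tf.repeat(..., axis=0)."""
    return [entry for entry in batch for _ in range(k)]

encoder_states = ["enc-seq-1", "enc-seq-2"]         # B = 2
teacher_hyps = ["hyp-1-1", "hyp-1-2", "hyp-1-3",    # k = 3 hypotheses
                "hyp-2-1", "hyp-2-2", "hyp-2-3"]    # per input sequence

pairs = list(zip(repeat_batch(encoder_states, k=3), teacher_hyps))
assert pairs[0] == ("enc-seq-1", "hyp-1-1")
assert pairs[3] == ("enc-seq-2", "hyp-2-1")
```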
Edit (12th June 2020 #1): Yes, I know that the teacher's DecisionLayer already selects the best hypothesis, and that this way I am only repeating the best hypothesis k times. I am doing this as an intermediate step towards my ultimate goal. Later I want to take the top-k list from the teacher's ChoiceLayer, but I felt that this is a separate construction site.
But Albert, you say RETURNN would somehow extend the data on the batch dimension automatically? How can I picture that?
Edit (12th June 2020 #2): Okay, now I select the top-k (this time k=4) hypotheses list from the teacher's choice layer (or output layer):
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}但是,使用这些数据作为学生的培训目标会导致错误:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

So I assume that the student's target data, the hypotheses list, has a batch-axis length that is k=4 times that of the student's input data/encoder state data. Doesn't the student encoder state data need to be extended/repeated here to match the target data?
Edit (12th June 2020 #3): I consider the initial issue solved. The overall problem continues here: Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence
Posted on 2020-06-10 10:53:50
It does not just validate the batch length. It will collapse batch and time (it uses flatten_with_seq_len_mask, see the Loss.init code and that function) and then compute the loss on the flattened tensor. So the seq lengths need to match as well. This could be a problem, but I am not sure. Since the rec layer itself also has the same target, it should have the same seq length in training.
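To illustrate what flatten_with_seq_len_mask does, here is a simplified plain-Python sketch of the behavior described above (not the RETURNN implementation):

```python
# Simplified sketch of what flatten_with_seq_len_mask does: batch and
# time are collapsed into one axis, keeping only the positions inside
# each sequence's length. This is why the seq lengths must match the
# batch entries before the loss is computed on the flattened tensor.
def flatten_with_seq_len_mask(batch, seq_lens):
    # batch: B sequences padded to a common max length; seq_lens: B ints
    return [x for seq, n in zip(batch, seq_lens) for x in seq[:n]]

padded = [[1, 2, 0], [3, 4, 5]]  # B = 2 sequences, padded to T = 3
assert flatten_with_seq_len_mask(padded, [2, 3]) == [1, 2, 3, 4, 5]
```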
You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. check the Data (batch_shape_meta) output and whether the axes are all correct, as you expect them. (debug_print_layer_output_template can and should always be enabled. It does not make anything slower.) You can also temporarily enable debug_print_layer_output_shape, which will really print the shapes of all tensors. That way you can verify how it looks.
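In a RETURNN config, these are plain top-level flags (as named above):

```python
# Debug flags as named above, set at the top level of the RETURNN config:
debug_print_layer_output_template = True  # print Data templates per layer (cheap, keep enabled)
debug_print_layer_output_shape = True     # print actual runtime shapes (enable temporarily)
```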
Your use of ReinterpretDataLayer looks very wrong. You should never explicitly set axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing that at all? It could be the reason why things end up messed up.
Your repeat function is not very generic. You are also using hard-coded axis integers there. You should never do that. Instead, you would write something like this:
input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)

Do I understand correctly that this is what you want to do? Repeat in the batch axis? In that case, you also need to adapt the seq-length information of that layer's output. You cannot simply use that function as-is in an EvalLayer. You also need to define out_type as a function that correctly returns the right Data template. E.g.:
def repeat_out(out):
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out
...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}

Now you have the additional problem that every time you call this repeat_out, you will get new seq-length information. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time), and that will lead to errors or strange effects. To solve this, you should reuse the same seq length. E.g.:
"teacher_stack_": {
"class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
"class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}顺便说一句,你为什么要做这种重复呢?那背后的想法是什么?你把学生和老师都重复了三遍?因此,仅仅通过第3因子提高你的学习率也会有同样的效果吗?
Edit: It seems this is done to match the top-k list. In that case, it is all wrong, because RETURNN should already do such repetition automatically. You should not do this manually.
Edit: To understand how the repetition (and beam search resolving in general) works, the first thing to do is look at the log output (you must have debug_print_layer_output_template enabled, but you should have that anyway, all the time). You will see the output of each layer, especially its Data output object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). Note that this is only the static compile-time shape, so the batch dim there is just a marker. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (basically from any ChoiceLayer), has a beam, and the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).
Edit: Regarding the repetition: this is done automatically. You do not need to call copy_extend_with_beam manually for this.
If you expect to get the top-k list from the teacher, you are also likely doing it wrong, since I see that you use "teacher_decision" as input. I guess this comes from a DecisionLayer? In that case, it has already taken the first-best from the top-k beam.
Edit: Now I understand that you ignore this for now and instead only want to take the first-best and then repeat it. I would recommend not to do that, because you are making it unnecessarily complex, and you are kind of fighting RETURNN, which knows what the batch dim should be, and it will get confused. (You could make it work via what I wrote above, but really, it is just unnecessarily complex.)
Btw, setting "trainable": False on an EvalLayer is pointless. It has no effect. The eval layer has no parameters anyway.
https://stackoverflow.com/questions/62258957