I want to train a Siamese network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column that is "1" if they are identical and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())
y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
print(y_test.value_counts())
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]

This returns:
v0 v1 v2 ... v397 v398 v399 class
0 0.003615 0.013794 0.030388 ... -0.093931 0.106202 0.034870 0.0
1 0.018988 0.056302 0.002915 ... -0.007905 0.100859 -0.043529 0.0
2 0.072516 0.125697 0.111230 ... -0.010007 0.064125 -0.085632 0.0
3 0.051016 0.066028 0.082519 ... 0.012677 0.043831 -0.073935 1.0
4 0.020367 0.026446 0.015681 ... 0.062367 -0.022781 -0.032091 0.0
1.0 1060
0.0 923
Name: class, dtype: int64
1.0 354
0.0 308
Name: class, dtype: int64

The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model

def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Epsilon is a small value that makes very little difference to the result,
    # but ensures that the argument of the square root is never exactly zero.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that semantically
    similar samples are embedded close together.
    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
    y_true = K.cast(y_true, y_pred.dtype)  # avoid float64/float32 dtype mismatch
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input(shape=(input_dim,), name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)

def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
    left_input = Input(shape=(input_dim,))
    right_input = Input(shape=(input_dim,))
    # Since this is a Siamese network, both sides share the same weights.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)
    # The Euclidean distance layer outputs a value close to 0 when the two
    # inputs are similar and a larger value otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    return siamese_net

model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)

I plotted the contrastive loss against the epoch and the model accuracy against the epoch:
[plot: contrastive loss and accuracy vs. epoch]
The validation curves are almost flat, which seems strange to me (overfitting?).
After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:
[plot: contrastive loss and accuracy vs. epoch with dropout = 0.5]
Somehow it looks better, but it also produces poor predictions.
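To put a number on how poor the test-set predictions are, the model's raw distances can be thresholded the same way the `accuracy` metric above does it. A minimal numpy sketch, where `pred_distances` stands in for `model.predict([X_left_test, X_right_test]).ravel()` (the sample values are made up for illustration):

```python
import numpy as np

def distances_to_labels(distances, threshold=0.5):
    """Map predicted pair distances to binary labels.

    A small distance (< threshold) means "same pair" -> 1,
    a large distance means "different pair" -> 0.
    Mirrors the thresholding in the `accuracy` metric above.
    """
    distances = np.asarray(distances)
    return (distances < threshold).astype(int)

# Stand-in for model.predict([X_left_test, X_right_test]).ravel()
pred_distances = np.array([0.1, 0.9, 0.4, 0.7])
y_true = np.array([1, 0, 1, 1])

y_pred = distances_to_labels(pred_distances)
test_accuracy = (y_pred == y_true).mean()
print(y_pred)         # [1 0 1 0]
print(test_accuracy)  # 0.75
```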
My questions are:
1. The base network is just a stack of Dense layers producing the embeddings. Is this the correct approach?
2. I chose contrastive loss because binary_crossentropy performed very poorly. Why would that be?
EDIT:
After following @PlzBePython's suggestion, I came up with the following base network:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
[plot: training results for the modified model]
Thank you for your help!
Answered on 2022-05-14 00:13:24
This is not so much an answer as me writing down my thoughts, in the hope that they help find one.
Overall, everything you are doing seems quite reasonable to me. Regarding your questions:
1:
An embedding or feature-extraction layer is never strictly necessary, but it almost always makes it easier to learn what is intended. You can think of it as handing your distance model a comprehensive summary of a sentence instead of its raw words. It also keeps your model from depending on the position of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course this can work too, and I don't even think it is a bad approach. However, I would probably increase the size of the network.
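The "increase the size of the network" suggestion could look like the sketch below, which widens the base network from the question; the 256/128 unit counts are my own guess, not tuned values:

```python
from keras.layers import Dense, Dropout, Input
from keras.models import Model

def create_larger_base_network(input_dim: int, dropout_rate: float = 0.1):
    """Wider variant of create_base_network; still ends in a 128-dim embedding."""
    inputs = Input(shape=(input_dim,), name="encoder")
    x = Dense(256, activation="relu")(inputs)
    x = Dropout(dropout_rate)(x)
    x = Dense(256, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(128, activation="relu", name="Embeddings")(x)
    return Model(inputs, x)
```

It would drop into `build_siamese_model` in place of the original `create_base_network` call.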
2:
In my opinion, those two loss functions are not that different. Binary cross-entropy is defined as:

L = -(y * log(p) + (1 - y) * log(1 - p))

while contrastive loss (with margin = 1, matching the contrastive_loss function above, where the prediction is a distance d) is:

L = y * d^2 + (1 - y) * max(1 - d, 0)^2
So you essentially exchange the log function for a hinge function. The only real difference comes from how the distance is computed. You might be advised to use some kind of L1 distance, since L2 distance is supposed to perform worse in higher dimensions (see e.g. here) and your dimensionality is 128. Personally, I would rather use L1 in your case, but I don't think it is a deal breaker.
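The two distances side by side in numpy, for illustration. Note that the `L1-Distance` Lambda in the question's edit returns the element-wise |x - y| vector and lets the final Dense(1) learn the weighting, rather than summing it itself:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=128)  # stand-ins for two 128-dim embeddings
y = rng.normal(size=128)

# Scalar L2 (Euclidean) distance, as computed by euclidean_distance above.
l2_distance = np.sqrt(np.sum((x - y) ** 2))

# Element-wise absolute difference, as returned by the L1-Distance Lambda;
# summing it gives the scalar L1 (Manhattan) distance.
l1_vector = np.abs(x - y)
l1_distance = l1_vector.sum()

print(round(float(l2_distance), 2))
print(round(float(l1_distance), 2))
```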
What I would like to add:
Finally, a 90% validation accuracy looks quite good to me. Keep in mind that when the validation accuracy is computed in the first epoch, the model has already gone through about 60 weight updates (batch_size = 32). That means that, especially in the first epochs, a validation accuracy that is higher than the training accuracy (which is computed during training) is to be expected. This can also sometimes create the false impression that the training loss is falling faster than the validation loss.
EDIT:
I suggested "linear" in the last layer because that is what TensorFlow recommends for binary cross-entropy with "from_logits"=True (which expects values in (-inf, inf)). In my experience it converges better.
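Concretely, that recommendation maps to something like the sketch below (assuming Keras 2.x / tf.keras as in the original script; `input_dim = 128` is a stand-in for the shared encoder's output size, and the head mirrors the L1-Distance edit from the question):

```python
import keras
import keras.backend as K
from keras.layers import Dense, Input, Lambda
from keras.models import Model

input_dim = 128  # stand-in for the embedding size produced by the shared encoder

left_input = Input(shape=(input_dim,))
right_input = Input(shape=(input_dim,))

# Element-wise L1 difference, then a linear head producing a raw logit in (-inf, inf).
distance = Lambda(lambda t: K.abs(t[0] - t[1]), name="L1-Distance")([left_input, right_input])
output = Dense(1, activation="linear")(distance)

siamese_head = Model([left_input, right_input], output)
siamese_head.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),  # expects raw logits
    optimizer="RMSprop",
    metrics=[keras.metrics.BinaryAccuracy(threshold=0.0)],   # logit > 0 -> class 1
)
```

With `from_logits=True` the sigmoid is folded into the loss, which is numerically more stable than a sigmoid activation followed by probability-space cross-entropy.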
https://stackoverflow.com/questions/72201150