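For example, suppose that out of 100 products on sale, the 10 most popular ones are as follows, with about 100,000 users buying products regularly every day.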
A = has been purchased by 100k users.
B = has been purchased by 30k users.
C = has been purchased by 20k users.
D = has been purchased by 18k users.
E = has been purchased by 10k users.
F = has been purchased by 8k users.
G = has been purchased by 7k users.
H = has been purchased by 4k users.
I = has been purchased by 3k users.
J = has been purchased by 1k users.
X = never bought by anyone.
Y = never bought by anyone.
Z = never bought by anyone.
Given this, the training data will contain more than 50 million rows like the following.
User Id | User Name | Item Id | Item Name | label | Purchase Date |
1 Thomas 1 A true 12, Mar 2019
1 Thomas 1 A true 13, Mar 2019
1 Thomas 1 A true 14, Mar 2019
1 Thomas 1 A true 15, Mar 2019
1 Thomas 2 B true 11, Mar 2019
1 Thomas 3 C true 09, Mar 2019
1 Thomas 4 D true 07, Mar 2019
2 Angelica 1 E true 12, Mar 2019
.
.
.
Users' preferences would look something like this. There could be countless patterns, but let's take one example.
Thomas bought A, B, C, D
Angelica bought A, B, C, D
Gloria bought A, B, C, D
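Jennifer bought A, B, C, D and I
With user-based collaborative filtering, it seems clear that Thomas, Angelica, and Gloria would likely get item I as a recommendation, because Jennifer liked item I and otherwise has exactly the same purchase pattern as they do.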
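The reasoning above can be sketched with a small neighborhood-based recommender. This is only an illustration under stated assumptions: Jaccard overlap of purchase sets is assumed as the similarity measure, and `jaccard` and `recommend` are hypothetical helper names, not part of any library or of the original setup.

```python
from collections import defaultdict

# Toy purchase sets taken from the example above.
purchases = {
    "Thomas": {"A", "B", "C", "D"},
    "Angelica": {"A", "B", "C", "D"},
    "Gloria": {"A", "B", "C", "D"},
    "Jennifer": {"A", "B", "C", "D", "I"},
}


def jaccard(a, b):
    """Overlap of two purchase sets, between 0.0 and 1.0 (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, purchases):
    """Score items the user has not bought by the similarity of the users who bought them."""
    own = purchases[user]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = jaccard(own, items)
        if similarity == 0.0:
            continue  # users with no shared purchases do not contribute
        for item in items - own:
            scores[item] += similarity
    # Highest score first; ties broken by item name for stable output.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


print(recommend("Thomas", purchases))
print(recommend("Jennifer", purchases))
```

In this sketch the only candidate for Thomas is item I, contributed by Jennifer, while Jennifer herself gets no candidates because her neighbors bought nothing she lacks.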
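With that in mind, I reasoned that if I had two more users who bought the unpopular items X, Y, and Z, the predictions for those users would end up recommending the unsold items.
So, before training the model, I manually added dummy data like this.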
User Id | User Name | Item Id | Item Name | label | Purchase Date |
1 Thomas 1 A true 12, Mar 2019
1 Thomas 1 A true 13, Mar 2019
1 Thomas 1 A true 14, Mar 2019
1 Thomas 1 A true 15, Mar 2019
1 Thomas 2 B true 11, Mar 2019
1 Thomas 3 C true 09, Mar 2019
1 Thomas 4 D true 07, Mar 2019
2 Angelica 1 E true 12, Mar 2019
.
.
.
100001 Andrew 24 X true 19, Mar 2019
100001 Andrew 25 Y true 19, Mar 2019
100002 Andy 24 X true 19, Mar 2019
100002 Andy 25 Y true 19, Mar 2019
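100002 Andy 26 Z true 19, Mar 2019
As I mentioned earlier, I expected Andrew to get Z as a recommendation, because Andrew and Andy share item preferences and Andy also bought Z, even though the purchase data for X, Y, and Z makes up a tiny fraction of the training data (only 5 records out of 10 million).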
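But the result was completely unexpected.
Every user had X, Y, and Z in their recommendation list, albeit with lower prediction scores than the others. Even more puzzling, Andrew and Andy did not get outstanding scores on the unpopular items, even though they actually bought them!
I have no idea why this happens. Have I misunderstood the concept of user-based collaborative filtering?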
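For comparison, here is a minimal sketch of what a plain neighborhood-based, user-based collaborative filter would do with the dummy records. It is not the trained model from the question. The records reuse the preference example (Thomas, Angelica, Gloria, Jennifer) plus the five dummy rows; `build_baskets`, `jaccard`, and `recommend` are hypothetical helper names, and Jaccard similarity is an assumption.

```python
from collections import defaultdict

# (user, item) purchase records; a repeated purchase appears more than once.
records = [
    ("Thomas", "A"), ("Thomas", "A"), ("Thomas", "B"), ("Thomas", "C"), ("Thomas", "D"),
    ("Angelica", "A"), ("Angelica", "B"), ("Angelica", "C"), ("Angelica", "D"),
    ("Gloria", "A"), ("Gloria", "B"), ("Gloria", "C"), ("Gloria", "D"),
    ("Jennifer", "A"), ("Jennifer", "B"), ("Jennifer", "C"), ("Jennifer", "D"),
    ("Jennifer", "I"),
    # The five dummy records added before training.
    ("Andrew", "X"), ("Andrew", "Y"),
    ("Andy", "X"), ("Andy", "Y"), ("Andy", "Z"),
]


def build_baskets(records):
    """Collapse the purchase log into one set of items per user."""
    baskets = defaultdict(set)
    for user, item in records:
        baskets[user].add(item)
    return dict(baskets)


def jaccard(a, b):
    """Overlap of two purchase sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, baskets):
    """Score unseen items by the similarity of the users who bought them."""
    own = baskets[user]
    scores = defaultdict(float)
    for other, items in baskets.items():
        similarity = jaccard(own, items) if other != user else 0.0
        if similarity == 0.0:
            continue  # skip the user themselves and users with no overlap
        for item in items - own:
            scores[item] += similarity
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


baskets = build_baskets(records)
print(recommend("Andrew", baskets))  # candidates for Andrew
print(recommend("Thomas", baskets))  # candidates for Thomas
```

Under these assumptions Andrew's only candidate is Z (via Andy), and Thomas never sees X, Y, or Z because he shares no purchases with Andrew or Andy. That matches the expectation stated in the question, which suggests the surprising scores come from how the trained model generalizes rather than from the user-based idea itself.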
Posted on 2019-03-19 01:23:36
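Items X, Y, and Z are predicted as false candidates for the existing users other than Andy and Andrew. Here is how to set the value. This is based on the ML.NET sample code.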
var pipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "userIdFeaturized", inputColumnName: nameof(MovieRating.userId))
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "movieIdFeaturized", inputColumnName: nameof(MovieRating.movieId))
        .Append(mlContext.Transforms.Concatenate(DefaultColumnNames.Features, "userIdFeaturized", "movieIdFeaturized"))
        .Append(mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(
            new string[] { DefaultColumnNames.Features },
            (e) => { e.latentDim = 200; }) // set custom value here.
        )
    );
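As far as I understand, when the matrix is factorized, SVD extracts hidden (latent) features from the training-data matrix. This latent layer sits between the users and the items and relates directly to each of them, which is known as dimensionality reduction. I think a latent dimension that is too small generalizes away the diversity of the recommended items. That, I think, is why increasing the value solved my problem.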
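The effect of the latent dimension can be illustrated on a toy matrix. The sketch below uses a plain truncated SVD computed by power iteration as a stand-in; it is not the field-aware factorization machine from the snippet above, and the purchase matrix, `top_singular_triplet`, `rank_k_scores`, and `score` are all assumptions made for illustration.

```python
import math

USERS = ["Thomas", "Angelica", "Gloria", "Jennifer", "Andrew", "Andy"]
ITEMS = ["A", "B", "C", "D", "I", "X", "Y", "Z"]
# 1 = purchased, 0 = not purchased (toy data based on the question's example).
M = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # Thomas
    [1, 1, 1, 1, 0, 0, 0, 0],  # Angelica
    [1, 1, 1, 1, 0, 0, 0, 0],  # Gloria
    [1, 1, 1, 1, 1, 0, 0, 0],  # Jennifer
    [0, 0, 0, 0, 0, 1, 1, 0],  # Andrew
    [0, 0, 0, 0, 0, 1, 1, 1],  # Andy
]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_singular_triplet(m, iters=500):
    """Power iteration on m^T m: (sigma, u, v) for the largest singular value."""
    mt = [list(col) for col in zip(*m)]
    v = [1.0] * len(m[0])
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))
    return sigma, [x / sigma for x in mv], v


def rank_k_scores(m, k):
    """Sum of the k strongest latent components, found one at a time by deflation."""
    residual = [[float(x) for x in row] for row in m]
    scores = [[0.0] * len(m[0]) for _ in m]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        for i, ui in enumerate(u):
            for j, vj in enumerate(v):
                component = sigma * ui * vj
                scores[i][j] += component
                residual[i][j] -= component
    return scores


def score(scores, user, item):
    return scores[USERS.index(user)][ITEMS.index(item)]


for k in (1, 2):
    s = rank_k_scores(M, k)
    print(k, round(score(s, "Andrew", "Z"), 3), round(score(s, "Thomas", "I"), 3))
```

In this toy setup, with a single latent dimension the factor is spent on the popular A to D pattern and Andrew's score for Z is effectively zero. With two latent dimensions the second factor picks up the X/Y/Z group and Andrew's score for Z becomes clearly positive, while Thomas's score for Z stays at zero. How closely this carries over to a factorization machine trained on featurized IDs is not something the sketch establishes.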
Any corrections or comments would be greatly appreciated. I definitely don't want to spread false beliefs.
https://datascience.stackexchange.com/questions/47499