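I wrote the following Ruby code to compute item-based collaborative filtering (using the Jaccard similarity coefficient).
I'd like feedback on any potential problems in the code, plus tips for bringing it closer to best practices.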

    require 'json'
    require 'set'

    def compute_jaccard_coefficients(users_for_activities)
      jaccard_coefficient_hash = {}
      all_activities = Set.new
      users_for_activities.each do |key1, array1|
        users_for_activities.each do |key2, array2|
          if key1 != key2
            all_activities.add key1
            all_activities.add key2
            intersected_users = array1 & array2
            unioned_users = array1 | array2
            if unioned_users.length > 0
              jaccard_coefficient_hash[[key1, key2]] = intersected_users.length.fdiv(unioned_users.length)
            else
              jaccard_coefficient_hash[[key1, key2]] = 0
            end
          end
        end
      end
      return [jaccard_coefficient_hash, all_activities]
    end

    data = '[
      {"user": "1", "activities": ["running", "swimming", "rowing"]},
      {"user": "2", "activities": ["tennis", "rowing"]},
      {"user": "3", "activities": ["swimming", "running"]},
      {"user": "4", "activities": ["tennis", "swimming"]}
    ]'

    parsed_data = JSON.parse(data)
    users_for_activities = Hash.new{|h,k| h[k] = []}
    parsed_data.each do |child|
      child["activities"].each do |activity|
        users_for_activities[activity] << child["user"]
      end
    end

    all_activities = Set.new
    jaccard_coefficient_hash, all_activities = compute_jaccard_coefficients(users_for_activities)

    parsed_data.each do |child|
      activities_to_recommend = {}
      child["activities"].each do |user_activity|
        all_activities.each do |generic_activity|
          if user_activity != generic_activity
            if activities_to_recommend.has_key?(generic_activity)
              if jaccard_coefficient_hash[[user_activity, generic_activity]] > activities_to_recommend[generic_activity]
                unless child["activities"].include?(generic_activity)
                  activities_to_recommend[generic_activity] = jaccard_coefficient_hash[[user_activity, generic_activity]]
                end
              end
            else
              unless child["activities"].include?(generic_activity)
                activities_to_recommend[generic_activity] = jaccard_coefficient_hash[[user_activity, generic_activity]]
              end
            end
          end
        end
      end
      print "User " + child["user"] + ": "
      print activities_to_recommend.sort_by {|k,v| v}.reverse
      print "\n"
    end

Output:

    User 1: [["tennis", 0.3333333333333333]]
    User 2: [["running", 0.3333333333333333], ["swimming", 0.25]]
    User 3: [["rowing", 0.3333333333333333], ["tennis", 0.25]]
    User 4: [["running", 0.6666666666666666], ["rowing", 0.3333333333333333]]

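As a sanity check on the numbers above, here is a minimal sketch that computes the Jaccard coefficient |A ∩ B| / |A ∪ B| for individual pairs of activities. The jaccard helper and the USERS_FOR_ACTIVITIES constant are names made up for this illustration; the mapping is the activity-to-users index derived by hand from the sample JSON.

```ruby
# Activity => users who do it, derived by hand from the sample data.
USERS_FOR_ACTIVITIES = {
  'running'  => %w[1 3],
  'swimming' => %w[1 3 4],
  'rowing'   => %w[1 2],
  'tennis'   => %w[2 4]
}.freeze

# Hypothetical helper: Jaccard coefficient of two arrays treated as sets.
def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty? # avoid dividing by zero when both are empty
  (a & b).length.fdiv(union.length)
end

# running and swimming share users 1 and 3 out of {1, 3, 4}, i.e. 2/3.
puts jaccard(USERS_FOR_ACTIVITIES['running'], USERS_FOR_ACTIVITIES['swimming'])
# running and tennis share no users at all.
puts jaccard(USERS_FOR_ACTIVITIES['running'], USERS_FOR_ACTIVITIES['tennis'])
```

The 2/3 for running/swimming is the score that shows up as the top recommendation (running) for user 4, who already swims.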
Posted on 2014-06-25 10:18:38
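Things I noticed:

- Use next to skip an each iteration, rather than wrapping everything in an if statement.
- You don't need Set. A hash's keys are already a set, so in your compute_jaccard_coefficients method you can just call all_activities = users_for_activities.keys and be done. From the original data, you can do all_activities = parsed_data.map { |user| user['activities'] }.flatten.uniq
- The compute_jaccard_coefficients method returns a tuple. As shown above, getting all the activities isn't difficult, and you can do it anywhere. The method itself doesn't use the set for anything; building it is just something the method happens to do, even though it shouldn't be its responsibility. Your code definitely shouldn't depend on a method doing something unrelated on the side.
- Use Array#combination. Give it an argument of 2 and it yields every pair of activities. Note that it only gives unique combinations: you'll get a combination like ["tennis", "swimming"], but it will skip ["swimming", "tennis"] because that is the same combination.
- Use a method like any? (or better yet empty?, as Naklion pointed out in the comments) rather than checking an array's length with arr.count > 0.

Given the above, the method could be written as
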
    def compute_jaccard_coefficients(hash)
      coefficients = hash.keys.combination(2).map do |a, b|
        union = hash[a] | hash[b]
        weight = union.empty? ? 0 : (hash[a] & hash[b]).count / union.count.to_f
        [[a, b], weight]
      end
      Hash[coefficients]
    end

Note that this implementation also gives you a cleaner result (no duplicate combinations) with fewer iterations.
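
The earlier suggestions about next and empty? can be sketched in isolation as follows. This is only an illustration: distinct_pairs is a hypothetical helper, not part of the code under review.

```ruby
# Hypothetical helper: collect ordered pairs of distinct keys, using `next`
# as a guard instead of wrapping the whole loop body in an `if`.
def distinct_pairs(keys)
  pairs = []
  keys.each do |a|
    keys.each do |b|
      next if a == b # skip self-pairs and move on to the next iteration
      pairs << [a, b]
    end
  end
  pairs
end

p distinct_pairs(%w[running swimming rowing])

# any? without a block looks for a truthy element, so it is not a pure
# emptiness check; empty? only looks at the number of elements.
p [nil, false].any?   # false, even though the array has two elements
p [nil, false].empty? # false
```

The guard keeps the interesting code at one indentation level, and empty? states the intent directly.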
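
I'm out of time right now, but I'll look at the rest of the code later.

Update: It's "later" now, and 中狮 has also posted a nice answer. Still, I'll give it a go too.

I used combination above, while Naklion used permutation. The latter gives the same result as the original code (that is, the coefficients hash will contain keys for both tennis/rowing and rowing/tennis). With combination you only get the unique combinations, which means fewer coefficient calculations, but lookups become more complicated (something I didn't consider when I was looking at the Jaccard method). Still, since Naklion used permutation, I'll stick with combination.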
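
To make the trade-off concrete, here is a small sketch comparing the two methods on the four sample activities. The constant names are made up for this illustration.

```ruby
ACTIVITIES = %w[running swimming rowing tennis].freeze

# Unordered pairs: each pair of activities appears exactly once.
COMBINATIONS = ACTIVITIES.combination(2).to_a
# Ordered pairs: each pair of activities appears in both orders.
PERMUTATIONS = ACTIVITIES.permutation(2).to_a

puts COMBINATIONS.length # 6
puts PERMUTATIONS.length # 12

# With array keys built from combinations, only one ordering is present,
# so a lookup with the reversed pair would miss.
puts COMBINATIONS.include?(%w[running swimming]) # true
puts COMBINATIONS.include?(%w[swimming running]) # false
```

Half as many coefficients are computed, at the cost of having to normalise the key order (or use an unordered key) when looking a pair up.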

One addition, though, is to use Set (after all) to create the keys of the Jaccard hash:

    def compute_jaccard_coefficients(hash)
      coefficients = hash.keys.combination(2).map do |a, b|
        union = hash[a] | hash[b]
        weight = union.empty? ? 0 : (hash[a] & hash[b]).count / union.count.to_f
        [Set[a, b], weight]
      end
      Hash[coefficients]
    end

This lets us look up a weight without having to worry about the order of the key's elements, since Set[a, b] == Set[b, a].
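
A minimal sketch of why that works as a hash key: sets with the same members compare equal and hash identically, whereas arrays are order-sensitive. WEIGHTS is a hypothetical constant holding a single hand-computed entry.

```ruby
require 'set'

# Hypothetical weights, keyed by an unordered pair of activities.
WEIGHTS = {
  Set['rowing', 'tennis'] => 1.fdiv(3)
}.freeze

# The lookup succeeds regardless of the order the members are given in.
puts WEIGHTS[Set['tennis', 'rowing']] == WEIGHTS[Set['rowing', 'tennis']]

# An array key, by contrast, only matches the exact same ordering.
array_keyed = { %w[rowing tennis] => 1.fdiv(3) }
puts array_keyed[%w[tennis rowing]].inspect
```

The first line prints true; the array lookup with the reversed pair finds nothing and prints nil.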

Naklion provided a nice way (no. 9 on the list) to turn parsed_data from activities indexed by user into users indexed by activity, so I'll skip that.
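
That approach isn't reproduced in this answer. For reference, the following is a minimal sketch of one way to do the inversion; it is not necessarily what Naklion wrote. The users_by_activity helper and the SAMPLE_USERS data are hypothetical names used only for this illustration.

```ruby
require 'json'

# Hypothetical helper: turn [{ "user" => ..., "activities" => [...] }, ...]
# into { activity => [users who do it] }.
def users_by_activity(parsed_data)
  parsed_data.each_with_object(Hash.new { |h, k| h[k] = [] }) do |entry, index|
    entry['activities'].each { |activity| index[activity] << entry['user'] }
  end
end

# A cut-down version of the sample data, for illustration only.
SAMPLE_USERS = JSON.parse('[
  {"user": "1", "activities": ["running", "rowing"]},
  {"user": "2", "activities": ["tennis", "rowing"]}
]')

p users_by_activity(SAMPLE_USERS)
```

The result maps each activity to the users who do it, which is the shape compute_jaccard_coefficients expects as its argument.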

So, starting from this (and the method above)

    all_activities = users_for_activities.keys
    coefficients = compute_jaccard_coefficients(users_for_activities)

you can do

    parsed_data.each do |user|
      possibles = all_activities - user['activities'] # activities the user doesn't already have
      recommendations = user['activities'].product(possibles).map do |a, b|
        # a is the user's existing activity, b is the candidate recommendation
        key = Set[a, b]
        [b, coefficients[key]] if coefficients[key] > 0
      end.compact.sort_by(&:last).reverse
      # JSON.parse produces string keys, so use 'user' rather than :user
      puts "User #{user['user']}: #{recommendations}"
    end

https://codereview.stackexchange.com/questions/55070