两个集合之间的Jaccard距离是它们相交的大小除以它们合并的大小。例如,{1, 2, 3}和{2, 3, 4}之间的距离是2 ({2,3}) /4 ({1,2,3,4}) = 0.5。
Jaccard距离可以通过将字符串分割成单词或字符n克来表示字符串的相似性。例如,具有两个字符的n-克:
"Pear" vs "Peach"
`{"Pe", "ea", "ar"}` vs `{"Pe", "ea", "ac", "ch"}` = 2/5 = 0.4我用Rust编写了一个实现,作为学习语言的一种尝试。
extern crate unidecode;
use unidecode::unidecode;
use std::collections::HashSet;
fn normalize(s: &str) -> String {
unidecode(s).to_lowercase()
}
fn shingles(s: String) -> HashSet<String> {
let mut xs = HashSet::new();
let it = s.chars();
let mut pk = it.peekable();
loop {
let cur = pk.next();
let nxt = pk.peek();
match (cur, nxt) {
(Some(i), Some(nxt_i)) => {
xs.insert(format!("{}{}", i, nxt_i));
}
_ => { break }
}
}
return xs
}
// Intersection of the sets divided by the size of the union of the
// sets.
fn jaccard_distance(s1: String, s2: String) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
s1_shingles.len() as f64;
let inter: HashSet<_> = s1_shingles.intersection(&s2_shingles).collect();
let union: HashSet<_> = s1_shingles.union(&s2_shingles).collect();
(inter.len() as f64) / (union.len() as f64)
}
fn comp_and_print(s1: &str, s2: &str) {
let normal_s1 = normalize(s1);
let normal_s2 = normalize(s2);
println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(normal_s1, normal_s2));
}
fn main() {
comp_and_print("Apple sauce", "Apple Trees");
comp_and_print("Apple pie", "Grandma's Apple pie");
comp_and_print("Pear", "Peach");
}带状疱疹的方法似乎特别糟糕。我希望它更灵活(例如,对于3或4个字符n克),而且在集合中使用format!而不是join (就像我在Ruby中所做的那样)是很难看的。
下面是用于shingles的Ruby代码,因为我可以在Ruby中更容易地表达我的意思。我的Rust实现只处理n=2,因为我很难弄清楚如何使它成为动态的。
def shingles(str, n = 2)
(0..(str.length - n)).map { |idx| str[idx..idx+(n-1)] }.to_set
end发布于 2015-11-02 10:45:18
shinglesshingles应该被命名为pairwise_with_repetitions,因为它就是这样做的。我将通过将字符串拉链本身减去第一个字符来实现它:
s.zip(s.remove_first()) // Pseudocode as I am not proficient in Rust我建议在实际情况下避免临时变量,打印函数可以内联函数调用。
我认为:
s1_shingles.len() as f64;什么也不做,应该移除。
https://codereview.stackexchange.com/questions/109461
复制相似问题