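I have a large string (the haystack) and a needle. I want to find the text in the string that is closest to the needle. However, both the string and the needle are Unicode (Bengali). I have found some solutions, but they only work for English; I have not found one that works for Unicode (Bengali). To make my problem clearer, see the Romanian examples below.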
Source: "Cei bătrâni fac o băutură toxică pentru regina joviană".
Needle: "băuturăpentruăă"
Output: "băuturăăpentru"
Source: "Cei bătrâni fac o băutură toxică pentru regina joviană".
Needle: "bătra pak o băuturărinan"
Output: "bătrâni fac o băutură"
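I found that I could do this with a similarity measure such as cosine or Manhattan. However, I think implementing such an algorithm would be difficult. Can you suggest a simple or fast way to do this? Perhaps with PHP library functions that handle Unicode characters? TIA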
Posted on 2018-10-29 07:34:28
I think the fastest way is the SphinxSearch engine:
http://sphinxsearch.com/
It has a MySQL-like client. You can do something like this:
mysql> SELECT * FROM test WHERE MATCH('băutură pentru toxică');
The output is a list of documents sorted by best match.
==============================================================
Or try building a word index table in PHP (this is a very simple algorithm and must be optimized for your needs):
function near($src, $needle) {
    // Index each lowercased source word by its sha1 hash, remembering its position.
    // A repeated word overwrites the earlier entry, so only its last position is kept.
    $hashIndexes = [];
    $words = mb_split(' ', $src);
    foreach ($words as $k => $w) {
        $w = mb_strtolower($w, 'utf-8');
        $hashIndexes[sha1($w)] = ['key' => $k, 'word' => $w];
    }
    // Collect the source positions of every needle word found in the index.
    $nWords = mb_split(' ', mb_strtolower($needle, 'utf-8'));
    $matches = [];
    foreach ($nWords as $k => $w) {
        $hash = sha1($w);
        if (isset($hashIndexes[$hash]) && $w === $hashIndexes[$hash]['word']) {
            $matches[] = $hashIndexes[$hash]['key'];
        }
    }
    if (empty($matches)) {
        return '';
    }
    // Return the span of source words from the first to the last matched position.
    sort($matches);
    $start = $matches[0];
    $last = end($matches);
    return implode(' ', array_slice($words, $start, $last - $start + 1));
}

$src = "Cei bătrâni fac o băutură some other toxică pentru regina joviană";
$needle = "băutură pentru another toxică";
echo near($src, $needle) . "\n";
==============================================================
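Optimizing this is a lot of work (Google it).
Symbols such as ., ,, ..., ? should be removed from the $words and $nWords arrays. $hashIndexes[sha1($w)] must be an array (because different words may share the same sha1), and $hashIndexes[sha1($w)]['key'] must also be an array holding the positions of equal words in the text. I really recommend installing SphinxSearch or a similar text search engine.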
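The improvements listed above could look roughly like the sketch below. It is only an illustration under some assumptions, not a tuned implementation: the function names normalizeWord and nearestSpan are made up for this example, the index is keyed by the normalized word itself rather than by its sha1 (which sidesteps the collision concern), and when a word occurs several times it picks the shortest span of the source that contains every needle word that was found. It still matches whole words exactly, so misspelled needle words are simply ignored, and it does not apply Unicode normalization, which may matter for Bengali text that mixes composed and decomposed forms.

```php
<?php
// Hypothetical helper: lowercase a token and strip punctuation and symbols.
// \p{P} and \p{S} are Unicode-aware, so combining marks (\p{M}) are kept.
function normalizeWord(string $w): string {
    $w = mb_strtolower($w, 'UTF-8');
    return preg_replace('/[\p{P}\p{S}]+/u', '', $w) ?? '';
}

// Hypothetical function: shortest span of $src containing every needle word
// that occurs in $src at least once. Returns '' when nothing matches.
function nearestSpan(string $src, string $needle): string {
    $words = preg_split('/\s+/u', $src, -1, PREG_SPLIT_NO_EMPTY);

    // Set of normalized needle words.
    $wanted = [];
    foreach (preg_split('/\s+/u', $needle, -1, PREG_SPLIT_NO_EMPTY) as $w) {
        $n = normalizeWord($w);
        if ($n !== '') {
            $wanted[$n] = true;
        }
    }

    // Index: normalized word => list of ALL its positions in the source.
    $index = [];
    foreach ($words as $pos => $w) {
        $n = normalizeWord($w);
        if ($n !== '' && isset($wanted[$n])) {
            $index[$n][] = $pos;
        }
    }
    if (empty($index)) {
        return '';
    }

    // Flatten to position => word, ordered by position.
    $hits = [];
    foreach ($index as $word => $positions) {
        foreach ($positions as $p) {
            $hits[$p] = $word;
        }
    }
    ksort($hits);
    $positions = array_keys($hits);

    // Sliding window over the hits: shrink from the left whenever the
    // window covers every distinct matched word, keeping the shortest span.
    $need = count($index);
    $counts = [];
    $have = 0;
    $left = 0;
    $best = null;
    foreach ($positions as $pos) {
        $w = $hits[$pos];
        $counts[$w] = ($counts[$w] ?? 0) + 1;
        if ($counts[$w] === 1) {
            $have++;
        }
        while ($have === $need) {
            $start = $positions[$left];
            if ($best === null || $pos - $start < $best[1] - $best[0]) {
                $best = [$start, $pos];
            }
            $lw = $hits[$start];
            if (--$counts[$lw] === 0) {
                $have--;
            }
            $left++;
        }
    }
    return implode(' ', array_slice($words, $best[0], $best[1] - $best[0] + 1));
}

$src = "Cei bătrâni fac o băutură toxică pentru regina joviană.";
echo nearestSpan($src, "băutură, pentru") . "\n"; // prints "băutură toxică pentru"
```

The returned text is cut from the original source words, so punctuation and letter case inside the span are preserved; only the comparison uses the normalized form.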
https://stackoverflow.com/questions/53040602