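I have a value, such as Color, and a list of strings: {Colour, Color, Theme, Brand, Subject, ... etc.}
I want to get the most similar string, excluding the searched string itself. In this example I expect to get Colour (not Color).
I rank the candidates using the following rules, in priority order (they appear as numbered comments in the code below):
Going through a list of 1,000 relevant candidates takes a long time. Moreover, I have many candidates to check.
Is there a more efficient way to do this?
Original code: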
public static List findSimilarity(String word, List candidates) {
    List recommendations = new ArrayList();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                //2. same tokens diff order
                else if (candidate.split(" ").length == word.split(" ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsIgnoreCase(candidatearr[i])))
                            status = false;
                    }
                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        //3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    //check for acronym
                    if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    int dist = 0;
                    String check = "";
                    if (word.length() > candidate.length()) {
                        check = candidate;
                    } else {
                        check = word;
                    }
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() > 3 && check.length() <= 5) {
                        dist = 1;
                    } else if (check.length() > 5) {
                        dist = 2;
                    }
                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if(Levenshtein.distance(word,candidate) <= dist){
                        recommendations.add(candidate);
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }//for
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        //4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        //5. low priority - contain word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}
Thank you,
M.
Posted on 2016-01-21 08:54:17
EDITED
To help you understand it better, I have wrapped the answer given by Bohemian in the context of your original code.
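The line .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) splits multi-word terms, sorts the words, and joins them again, which eliminates permutations of the same words. This answers the permutation-equality challenge for terms such as "Main Color" and "Color Main".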
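To make the normalization step concrete, here is a minimal sketch of the same idea as a standalone helper. The class TermKeys and its method names are hypothetical and not part of the original answer. It assumes that word order and case should not matter and, unlike rule 1 in the question, it does not strip whitespace inside a word. Building the index once per candidate list turns the "same tokens, different order" check into a map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the original answer.
class TermKeys {

    // Lowercases the term, splits it on runs of whitespace, sorts the words
    // and joins them with a single space, so case and word order are ignored.
    static String normalize(String term) {
        return Arrays.stream(term.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .sorted()
                .collect(Collectors.joining(" "));
    }

    // Groups the original candidates by their normalized key (built once).
    static Map<String, List<String>> buildIndex(List<String> candidates) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String candidate : candidates) {
            byKey.computeIfAbsent(normalize(candidate), k -> new ArrayList<>()).add(candidate);
        }
        return byKey;
    }

    // Returns the candidates sharing the word's key, excluding the word itself.
    static List<String> permutationsOf(String word, Map<String, List<String>> index) {
        return index.getOrDefault(normalize(word), Collections.emptyList()).stream()
                .filter(c -> !c.equals(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Main Color", "Color Main", "Theme", "Brand");
        Map<String, List<String>> index = buildIndex(candidates);
        System.out.println(permutationsOf("Main Color", index)); // prints [Color Main]
    }
}
```

Keeping the original strings as map values means a match can be reported in its original spelling rather than in its normalized form.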
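However, it makes no sense to capture all the business requirements of your task in the context of this question. This answer gives you an outline of a solution, and it addresses the efficiency concern. You may need more stages, but that is a different story. The advantage of this approach is that all stages are independent, so you can ask about each one and get help with it separately.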
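One way to picture "independent stages" is the sketch below, which runs a list of matching rules in priority order and returns the first non-empty result. It is an illustration under stated assumptions, not the answer's implementation: the class StagedMatcher, the Stage interface and the two sample stages are hypothetical names, only two of the question's rules are shown, and the "clear the result if more than 4 matches" rule from the question is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Hypothetical sketch: each matching rule is an independent, replaceable stage.
class StagedMatcher {

    // A stage receives the word and the candidates and returns its matches.
    interface Stage extends BiFunction<String, List<String>, List<String>> {}

    // Stage: same text once case and whitespace are ignored.
    static final Stage SAME_IGNORING_CASE_AND_SPACES = (word, candidates) -> {
        String key = word.replaceAll("\\s+", "");
        return candidates.stream()
                .filter(c -> c.replaceAll("\\s+", "").equalsIgnoreCase(key))
                .collect(Collectors.toList());
    };

    // Stage: one string is a prefix of the other (case-sensitive).
    static final Stage PREFIX = (word, candidates) -> candidates.stream()
            .filter(c -> c.startsWith(word) || word.startsWith(c))
            .collect(Collectors.toList());

    // Runs the stages in priority order and returns the first non-empty result.
    static List<String> findSimilar(String word, List<String> candidates, List<Stage> stages) {
        if (word.isEmpty()) {
            return Collections.emptyList();
        }
        // The searched string itself is never a valid recommendation.
        List<String> others = candidates.stream()
                .filter(c -> !c.equals(word))
                .collect(Collectors.toList());
        for (Stage stage : stages) {
            List<String> matches = stage.apply(word, others);
            if (!matches.isEmpty()) {
                return matches;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<Stage> stages = Arrays.asList(SAME_IGNORING_CASE_AND_SPACES, PREFIX);
        List<String> candidates = Arrays.asList("Color", "color", "Colorful", "Theme");
        System.out.println(findSimilar("Color", candidates, stages)); // prints [color]
        System.out.println(findSimilar("Col", candidates, stages));   // prints [Color, Colorful]
    }
}
```

Because the exclusion of the searched string happens once before the loop, each stage stays small and can be tested or replaced on its own.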
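import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
// Assumed source of getLevenshteinDistance: Apache Commons Lang 3
import static org.apache.commons.lang3.StringUtils.getLevenshteinDistance;

public static String findSimilarity(String word, List<String> candidatesList) {
    // Populating the set with distinct values of the input terms
    Set<String> candidates = candidatesList.stream()
            .map(String::toLowerCase)
            .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) // eliminates permutations
            .collect(Collectors.toSet());
    // Cache of distances, so each candidate is measured against the word only once
    Map<String, Integer> cache = new ConcurrentHashMap<>();
    return candidates.parallelStream()
            .map(String::trim)
            // add more mappers if needed
            .filter(s -> !s.equalsIgnoreCase(word))
            // add more filters if needed
            .min((a, b) -> Integer.compare(
                    cache.computeIfAbsent(a, k -> getLevenshteinDistance(word, k)),
                    cache.computeIfAbsent(b, k -> getLevenshteinDistance(word, k))))
            .get(); // get the closest match; throws NoSuchElementException if no candidate is left
}
https://stackoverflow.com/questions/34917913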