Here is my problem.
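I need a function that finds the most common string patterns (phrases) in a random text.
So, if the input is as follows:
my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name
the output, sorted by number of occurrences, should look like this (case-insensitive):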
Rank Freq Phrase
1 6 jane doe
2 3 my name
3 3 name is
4 2 doe doe
5 2 doe doe my
6 2 doe my
7 2 is jane
8 2 is jane doe
9 2 jane doe doe
10 2 jane doe doe my
11 2 my name is
12 2 name is jane
13 2 name is jane doe
etc...
In my case I only need phrases of two or more words. Any idea how to approach this?
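Posted on 2013-09-17 04:09:57
Original version. Because it builds every phrase with the string concatenation operator +, this version wastes a lot of CPU and memory: each + creates a new char[] and copies the data from the old one into it.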
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {
    @SuppressWarnings("unchecked")
    public static void main(String[] arg) {
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";
        String[] split = input.split(" ");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        // For every start word i, grow the phrase one word at a time and count it.
        for (int i = 0; i < split.length - 1; i++) {
            String phrase = split[i];
            for (int j = i + 1; j < split.length; j++) {
                phrase += " " + split[j];
                Integer count = counts.get(phrase);
                if (count == null) {
                    counts.put(phrase, 1);
                } else {
                    counts.put(phrase, count + 1);
                }
            }
        }
        // Sort by frequency, highest first. Ties keep whatever order the HashMap produced.
        Map.Entry<String, Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        int rank = 1;
        System.out.println("Rank Freq Phrase");
        for (Map.Entry<String, Integer> entry : entries) {
            int count = entry.getValue();
            // Only phrases that occur more than once are printed.
            if (count > 1) {
                System.out.printf("%4d %4d %s\n", rank++, count, entry.getKey());
            }
        }
    }
}
Output:
Rank Freq Phrase
1 6 jane doe
2 3 name is
3 3 my name
4 2 name is jane doe
5 2 jane doe doe
6 2 doe my
7 2 my name is
8 2 is jane doe
9 2 jane doe doe my
10 2 name is jane
11 2 is jane
12 2 doe doe
13 2 doe doe my
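Process finished with exit code 0
New version. It builds each phrase with String.substring instead of concatenation. On JDKs older than Java 7 update 6, all strings returned by substring share the original char[], which saves CPU and memory, so on those JDKs this should run much faster. From 7u6 onward substring copies the characters, so the gain is smaller and worth measuring.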
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {
    @SuppressWarnings("unchecked")
    public static void main(String[] arg) {
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";
        // The offset arithmetic below relies on words being separated by exactly one space.
        String[] split = input.split(" ");
        // Pre-size the map for the number of (start, end) word pairs to avoid rehashing.
        Map<String, Integer> counts = new HashMap<String, Integer>(split.length * (split.length - 1) / 2, 1.0f);
        int idx0 = 0; // offset in input where the current start word begins
        for (int i = 0; i < split.length - 1; i++) {
            int splitIpos = input.indexOf(split[i], idx0);
            int newPhraseLen = splitIpos - idx0 + split[i].length();
            String phrase = input.substring(idx0, idx0 + newPhraseLen);
            for (int j = i + 1; j < split.length; j++) {
                // Extend the phrase by one space plus the next word.
                newPhraseLen = phrase.length() + split[j].length() + 1;
                phrase = input.substring(idx0, idx0 + newPhraseLen);
                Integer count = counts.get(phrase);
                if (count == null) {
                    counts.put(phrase, 1);
                } else {
                    counts.put(phrase, count + 1);
                }
            }
            idx0 = splitIpos + split[i].length() + 1;
        }
        // Sort by frequency, highest first. Ties keep whatever order the HashMap produced.
        Map.Entry<String, Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        int rank = 1;
        System.out.println("Rank Freq Phrase");
        for (Map.Entry<String, Integer> entry : entries) {
            int count = entry.getValue();
            if (count > 1) {
                System.out.printf("%4d %4d %s\n", rank++, count, entry.getKey());
            }
        }
    }
}
Output:
Rank Freq Phrase
1 6 jane doe
2 3 name is
3 3 my name
4 2 name is jane doe
5 2 jane doe doe
6 2 doe my
7 2 my name is
8 2 is jane doe
9 2 jane doe doe my
10 2 name is jane
11 2 is jane
12 2 doe doe
13 2 doe doe my
Process finished with exit code 0
Posted on 2013-09-17 03:00:03
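Borrow the idea from Markov-style algorithms that count word neighbors to build up the relations between words: start with single words, then pairs of words, and so on.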
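One way to read this suggestion is to count phrases level by level: all two-word phrases first, then three-word phrases, and so on, stopping as soon as a level has no repeated phrase. That stop is safe because a phrase can only repeat if the shorter phrase it starts with also repeats. The sketch below is only an illustration under that reading, not the answerer's code; the class name `LevelwisePhraseCounter` and the method `repeatedPhrases` are hypothetical, and it assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LevelwisePhraseCounter {

    /** Returns every phrase of two or more words that occurs more than once, with its count. */
    static Map<String, Integer> repeatedPhrases(String text) {
        // Lower-case first so that counting is case-insensitive, then split on whitespace.
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<String, Integer> repeated = new HashMap<>();

        for (int n = 2; n <= words.length; n++) {
            // Count all phrases of exactly n words.
            Map<String, Integer> level = new HashMap<>();
            for (int i = 0; i + n <= words.length; i++) {
                String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                level.merge(phrase, 1, Integer::sum);
            }
            // Keep only the phrases that repeat at this length.
            boolean anyRepeated = false;
            for (Map.Entry<String, Integer> e : level.entrySet()) {
                if (e.getValue() > 1) {
                    repeated.put(e.getKey(), e.getValue());
                    anyRepeated = true;
                }
            }
            // If nothing repeats at length n, nothing can repeat at length n + 1.
            if (!anyRepeated) {
                break;
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = repeatedPhrases("my name is songxiao name is");
        for (Map.Entry<String, Integer> e : result.entrySet()) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

On long texts with few repeats this stops after a handful of levels instead of building every possible phrase, which is the main reason to count by length rather than by start position.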
Posted on 2013-09-17 04:02:01
// Needs java.util.HashMap and java.util.Map on the import list.
String txt = "my name is songxiao name is";
Map<String, Integer> map = new HashMap<String, Integer>();
String[] tmp = txt.split(" ");
for (int i = 0; i < tmp.length - 1; i++) {
    String key = tmp[i];
    // Extend the phrase starting at word i by one word per step and count it.
    for (int j = 1; j < tmp.length - i; j++) {
        key += " " + tmp[i + j];
        Integer count = map.get(key);
        map.put(key, count == null ? 1 : count + 1);
    }
}
// Print every phrase with its count (unsorted, including phrases seen only once).
for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}
You can paste this code into a main method and run it.
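The fragment above prints every phrase in hash-map order, while the question asks for a ranked table of phrases that occur more than once. A minimal sketch of that last step follows, assuming ties in frequency should be broken alphabetically (which is how the table in the question appears to be ordered). The names `RankedPhrases`, `countPhrases` and `rank` are hypothetical, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class RankedPhrases {

    /** Counts every phrase of two or more consecutive words, ignoring case. */
    static Map<String, Integer> countPhrases(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < words.length - 1; i++) {
            StringBuilder phrase = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length; j++) {
                phrase.append(' ').append(words[j]);
                counts.merge(phrase.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Keeps phrases seen more than once, ordered by count (descending), then alphabetically. */
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                rows.add(e);
            }
        }
        rows.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = rank(countPhrases("my name is songxiao name is"));
        System.out.println("Rank Freq Phrase");
        int rank = 1;
        for (Map.Entry<String, Integer> row : rows) {
            System.out.printf("%4d %4d %s%n", rank++, row.getValue(), row.getKey());
        }
    }
}
```

For the sample text used here, only "name is" occurs twice, so the table has a single row. The explicit tie-break also makes the output independent of HashMap iteration order.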
https://stackoverflow.com/questions/18840291