How can I generate n-grams of a string like the following:
String Input="This is my car."
I want to generate n-grams with the following input:
Input Ngram size = 3
The output should be:
This
is
my
car
This is
is my
my car
This is my
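is my car
Please give me some ideas on how to implement this in Java, or tell me whether a library is available for it.
I am trying to use this NGramTokenizer, but it gives n-grams of character sequences, and I want n-grams of word sequences.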
Posted on 2010-09-07 20:53:09
You are looking for ShingleFilter.
Update: the link points to version 3.0.2. In newer versions of Lucene, this class may be in a different package.
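For readers who want to see the idea without pulling in Lucene: a "shingle" is a word n-gram, and the sketch below is a plain-Java illustration of emitting every shingle between a minimum and maximum size. It does not use or reproduce the Lucene API; the class name `ShingleSketch` and the method `shingles` are hypothetical names chosen for this example, and splitting on whitespace is an assumption about tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper, not part of Lucene: returns every word n-gram
    // ("shingle") whose size is between minSize and maxSize, inclusive.
    // Shingles are emitted position by position, shortest first.
    public static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || minSize < 1) {
            return result;
        }
        // Assumption: words are separated by runs of whitespace.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String shingle : shingles("This is my car", 1, 2)) {
            System.out.println(shingle);
        }
    }
}
```

With sizes 1 to 2 this yields the single words interleaved with the two-word shingles that start at the same position. For real projects, the analyzer chain in Lucene handles tokenization and punctuation far more carefully than this sketch does.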
Posted on 2010-09-07 16:03:33
I believe this would do what you want:
import java.util.*;

public class Test {

    // Returns all word n-grams of size n, in order of appearance.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    // Joins words[start..end) with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the 1-grams, 2-grams and 3-grams, with a blank line after each group.
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
Output:
This
is
my
car.

This is
is my
my car.

This is my
is my car.

An "on-demand" solution implemented as an iterator:
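import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    // True while a full n-word window still fits at the current position.
    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    // Builds the n-gram starting at pos, then advances by one word.
    public String next() {
        if (!hasNext())
            throw new NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append((i > pos ? " " : "") + words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
Posted on 2010-09-07 16:07:00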
This code returns an array of all word n-grams of the given length:
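// Returns all word n-grams that are exactly len words long.
public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    // Math.max guards against len > parts.length, which would otherwise
    // request a negative array size.
    String[] result = new String[Math.max(0, parts.length - len + 1)];
    for (int i = 0; i < parts.length - len + 1; i++) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++) {
            if (k > 0) sb.append(' ');
            sb.append(parts[i + k]);
        }
        result[i] = sb.toString();
    }
    return result;
}
For example: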
System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
https://stackoverflow.com/questions/3656762
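The snippets in the answers keep whatever punctuation is attached to a word, so the input "This is my car." yields "car." rather than the "car" shown in the question. The following is a minimal sketch of one way to get the output the question lists: strip punctuation first, then emit the n-grams of every size from 1 up to the requested maximum. The class name `WordNgrams` and the method `allNgrams` are hypothetical, and treating everything matched by `\p{Punct}` as noise is an assumption that also removes apostrophes and hyphens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordNgrams {

    // Hypothetical helper: all word n-grams of sizes 1..maxN, grouped by size,
    // after removing ASCII punctuation from the input.
    public static List<String> allNgrams(String text, int maxN) {
        // Assumption: every \p{Punct} character is noise ("don't" becomes "dont").
        String cleaned = text.replaceAll("\\p{Punct}", "").trim();
        List<String> result = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return result;
        }
        String[] words = cleaned.split("\\s+");
        for (int n = 1; n <= maxN; n++) {
            // i + n <= words.length keeps the window inside the array,
            // so sizes larger than the word count add nothing.
            for (int i = 0; i + n <= words.length; i++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String ngram : allNgrams("This is my car.", 3)) {
            System.out.println(ngram);
        }
    }
}
```

Grouping by size in the outer loop reproduces the order in the question: single words first, then pairs, then triples. If punctuation inside words matters for your data, a narrower pattern than `\p{Punct}` would be the thing to change.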