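I am completely new to the concept of text extraction. While searching for an example, I found one implemented with Lucene. I tried to run it in Eclipse, but it fails with this error: "TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow." I took the code directly from an article published on the web and made only a few modifications, because I first wanted to make sure it runs without errors before going through it piece by piece. The original code fetched its text from a URL, but I changed it to read from a predefined string (it is in the main method). I also changed the version constant, because I am using Lucene 4.8.

I also searched for the error and made a few changes, but I still get it. My code is below. Can you help me get rid of this error? What should I modify to avoid it? This is the link I got the code from: http://pastebin.com/jNALz7DZ, and here is my modified code.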
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class KeywordsGuesser {

    /** Lucene version. */
    private static Version LUCENE_VERSION = Version.LUCENE_48;

    /**
     * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
     * terms for this stem.
     */
    public static class Keyword implements Comparable<Keyword> {

        /** The unique stem. */
        private String stem;

        /** The frequency of the stem. */
        private Integer frequency;

        /** The found corresponding terms for this stem. */
        private Set<String> terms;

        /**
         * Unique constructor.
         *
         * @param stem The unique stem this instance must hold.
         */
        public Keyword(String stem) {
            this.stem = stem;
            terms = new HashSet<String>();
            frequency = 0;
        }

        /**
         * Add a found corresponding term for this stem. If this term has been already found, it
         * won't be duplicated but the stem frequency will still be incremented.
         *
         * @param term The term to add.
         */
        private void add(String term) {
            terms.add(term);
            frequency++;
        }

        /**
         * Gets the unique stem of this instance.
         *
         * @return The unique stem.
         */
        public String getStem() {
            return stem;
        }

        /**
         * Gets the frequency of this stem.
         *
         * @return The frequency.
         */
        public Integer getFrequency() {
            return frequency;
        }

        /**
         * Gets the list of found corresponding terms for this stem.
         *
         * @return The list of found corresponding terms.
         */
        public Set<String> getTerms() {
            return terms;
        }

        /**
         * Used to reverse sort a list of keywords based on their frequency (from the most frequent
         * keyword to the least frequent one).
         */
        @Override
        public int compareTo(Keyword o) {
            return o.frequency.compareTo(frequency);
        }

        /**
         * Used to keep unicity between two keywords: only their respective stems are taken into
         * account.
         */
        @Override
        public boolean equals(Object obj) {
            return obj instanceof Keyword && obj.hashCode() == hashCode();
        }

        /**
         * Used to keep unicity between two keywords: only their respective stems are taken into
         * account.
         */
        @Override
        public int hashCode() {
            return Arrays.hashCode(new Object[] { stem });
        }

        /**
         * User-readable representation of a keyword: "[stem] x[frequency]".
         */
        @Override
        public String toString() {
            return stem + " x" + frequency;
        }
    }

    /**
     * Stemmize the given term.
     *
     * @param term The term to stem.
     * @return The stem of the given term.
     * @throws IOException If an I/O error occurred.
     */
    private static String stemmize(String term) throws IOException {
        // tokenize term
        TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
        // stemmize
        tokenStream = new PorterStemFilter(tokenStream);

        Set<String> stems = new HashSet<String>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        // for each token
        while (tokenStream.incrementToken()) {
            // add it in the dedicated set (to keep unicity)
            stems.add(token.toString());
        }

        // if no stem or 2+ stems have been found, return null
        if (stems.size() != 1) {
            return null;
        }

        String stem = stems.iterator().next();

        // if the stem has non-alphanumerical chars, return null
        if (!stem.matches("[\\w-]+")) {
            return null;
        }

        return stem;
    }

    /**
     * Tries to find the given example within the given collection. If it hasn't been found, the
     * example is automatically added in the collection and is then returned.
     *
     * @param collection The collection to search into.
     * @param example The example to search.
     * @return The existing element if it has been found, the given example otherwise.
     */
    private static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }

    /**
     * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
     *
     * @param url The URL to read.
     * @return A set of potential keywords. The first keyword is the most frequent one, the last the
     *         least frequent.
     * @throws IOException If an I/O error occurred.
     * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
     */
    public static List<Keyword> guessFromUrl(String url) throws IOException {
        // get textual content from url
        //Document doc = Jsoup.connect(url).get();
        //String content = doc.body().text();
        String content = url;
        // guess keywords from this content
        return guessFromString(content);
    }

    /**
     * Guesses keywords from given input string.
     *
     * @param input The input string.
     * @return A set of potential keywords. The first keyword is the most frequent one, the last the
     *         least frequent.
     * @throws IOException If an I/O error occurred.
     */
    public static List<Keyword> guessFromString(String input) throws IOException {
        // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but dashes and apostrophes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // replace most common English contractions
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

        // tokenize input
        TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
        // to lower case
        tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
        // remove dots from acronyms (and "'s" but already done manually above)
        tokenStream = new ClassicFilter(tokenStream);
        // convert any char to ASCII
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // remove english stop words
        tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

        List<Keyword> keywords = new LinkedList<Keyword>();
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        // for each token
        while (tokenStream.incrementToken()) {
            String term = token.toString();
            // stemmize
            String stem = stemmize(term);
            if (stem != null) {
                // create the keyword or get the existing one if any
                Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                // add its corresponding initial token
                keyword.add(term.replaceAll("-0", "-"));
            }
        }
        tokenStream.end();
        tokenStream.close();

        // reverse sort by frequency
        Collections.sort(keywords);
        return keywords;
    }

    public static void main(String[] args) throws IOException {
        String input = "Java is a computer programming language that is concurrent, "
                + "class-based, object-oriented, and specifically designed to have as few "
                + "implementation dependencies as possible. It is intended to let application developers "
                + "write once, run anywhere (WORA), "
                + "meaning that code that runs on one platform does not need to be recompiled "
                + "to run on another. Java applications are typically compiled to byte code (class file) "
                + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
                + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
                + "for client-server web applications, with a reported 9 million developers."
                + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
                + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
                + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
                + "from C and C++, but it has fewer low-level facilities than either of them."
                + "The original and reference implementation Java compilers, virtual machines, and "
                + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
                + "May 2007, in compliance with the specifications of the Java Community Process, "
                + "Sun relicensed most of its Java technologies under the GNU General Public License. "
                + "Others have also developed alternative implementations of these Sun technologies, "
                + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
                + "(standard libraries), and IcedTea-Web (browser plugin for applets).";
        System.out.println(KeywordsGuesser.guessFromString(input));
    }
}

This is the error Eclipse outputs:
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

After fixing the error, my output is:
[java x10, develop x5, sun x5, run x4, compil x4, languag x3, implement x3, applic x3, code x3, gnu x3, comput x2, program x2, specif x2, ha x2, on x2, platform x2, byte x2, class x2, ...]
Posted on 2014-06-25 09:07:12

You need to reset the TokenStream object before calling its incrementToken method, as the error indicates:
// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
    ....

https://stackoverflow.com/questions/24403993
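To illustrate why the missing call matters, here is a minimal sketch of the consuming workflow that the TokenStream Javadocs describe: reset(), then incrementToken() until it returns false, then end(), then close(). It is not Lucene code. `ContractSketch`, `CheckedStream`, `consume` and `failsWithoutReset` are hypothetical names invented for this illustration, and the stream just splits on whitespace; it only models how a stream can refuse to be read before reset() has been called. Note that `stemmize` in the question also calls incrementToken() without reset(), so it presumably needs the same treatment, along with end() and close().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Simplified model of a reset-before-read contract; an illustration, not Lucene's implementation. */
final class ContractSketch {

    /** Hypothetical stand-in for a token stream (whitespace tokens only). */
    static final class CheckedStream implements AutoCloseable {
        private final List<String> tokens;
        private Iterator<String> cursor; // stays null until reset() is called
        private String current;

        CheckedStream(String text) {
            this.tokens = Arrays.asList(text.trim().split("\\s+"));
        }

        /** Prepares the stream for consumption; must be called before incrementToken(). */
        void reset() {
            cursor = tokens.iterator();
        }

        /** Advances to the next token; throws if reset() was never called. */
        boolean incrementToken() {
            if (cursor == null) {
                throw new IllegalStateException("contract violation: reset() call missing");
            }
            if (!cursor.hasNext()) {
                return false;
            }
            current = cursor.next();
            return true;
        }

        /** Returns the token selected by the last successful incrementToken() call. */
        String term() {
            return current;
        }

        /** End-of-stream hook; nothing to do in this model. */
        void end() {
        }

        @Override
        public void close() {
            cursor = null;
        }
    }

    /** Correct workflow: reset, loop over incrementToken, end, close (via try-with-resources). */
    static List<String> consume(String text) {
        List<String> out = new ArrayList<>();
        try (CheckedStream stream = new CheckedStream(text)) {
            stream.reset();
            while (stream.incrementToken()) {
                out.add(stream.term());
            }
            stream.end();
        }
        return out;
    }

    /** Reproduces the mistake from the question: reading without calling reset() first. */
    static boolean failsWithoutReset(String text) {
        try (CheckedStream stream = new CheckedStream(text)) {
            stream.incrementToken();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(consume("write once run anywhere")); // prints [write, once, run, anywhere]
        System.out.println(failsWithoutReset("write once run anywhere")); // prints true
    }
}
```

In the question's code, the equivalent change would be a `tokenStream.reset();` call right before each `while (tokenStream.incrementToken())` loop, with end() and close() after the loop, as `guessFromString` already does.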