首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >使用lucene提取关键字时出错

使用lucene提取关键字时出错
EN

Stack Overflow用户
提问于 2014-06-25 08:42:50
回答 1查看 1.4K关注 0票数 1

我对文本提取的概念完全陌生。当我搜索一个示例时,我找到了一个使用Lucene实现的示例。我只是试图在eclipse中运行它,但是它出错了。这是我得到的错误:(TokenStream合同违约:/close()调用丢失,reset()多次调用,或者子类不调用super.reset() )。有关正确的消费工作流的更多信息,请参见TokenStream类的Javadocs )。我直接从网络上发布的一篇文章中获得了代码,并且做了很少的修改,因为首先我想确保代码在一个接一个地运行之前没有错误。最初的代码是从URL中获取文本,但我将其更改为从定义的字符串中获取文本(它位于主方法中)。我还更改了版本,因为我使用Lucene4.8版本。

我也搜索了错误并且做了很少的修改,但是我仍然得到了错误。我是这里的密码。你能帮我排除这个错误吗?我应该在哪里修改以避免错误。这是我得到代码http://pastebin.com/jNALz7DZ的链接,这里是我修改的代码。

代码语言:javascript
复制
public class KeywordsGuesser {

     /** Lucene version. */
     private static Version LUCENE_VERSION = Version.LUCENE_48;

     /**
      * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
      * terms for this stem.
      */
    public static class Keyword implements Comparable<Keyword> {

         /** The unique stem. */
         private String stem;

         /** The frequency of the stem. */
         private Integer frequency;

         /** The found corresponding terms for this stem. */
        private Set<String> terms;

         /**
          * Unique constructor.
          * 
          * @param stem The unique stem this instance must hold.
          */
         public Keyword(String stem) {
             this.stem = stem;
            terms = new HashSet<String>();
             frequency = 0;
         }

         /**
          * Add a found corresponding term for this stem. If this term has been already found, it
          * won't be duplicated but the stem frequency will still be incremented.
          * 
          * @param term The term to add.
          */
         private void add(String term) {
             terms.add(term);
             frequency++;
         }

         /**
          * Gets the unique stem of this instance.
          * 
          * @return The unique stem.
          */
         public String getStem() {
             return stem;
         }

         /**
          * Gets the frequency of this stem.
          * 
          * @return The frequency.
          */
         public Integer getFrequency() {
             return frequency;
         }

         /**
          * Gets the list of found corresponding terms for this stem.
          * 
          * @return The list of found corresponding terms.
          */
        public Set<String> getTerms() {
             return terms;
         }

         /**
          * Used to reverse sort a list of keywords based on their frequency (from the most frequent
          * keyword to the least frequent one).
          */
         @Override
         public int compareTo(Keyword o) {
             return o.frequency.compareTo(frequency);
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public boolean equals(Object obj) {
             return obj instanceof Keyword && obj.hashCode() == hashCode();
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public int hashCode() {
             return Arrays.hashCode(new Object[] { stem });
         }

         /**
          * User-readable representation of a keyword: "[stem] x[frequency]".
          */
         @Override
         public String toString() {
             return stem + " x" + frequency;
         }

     }

     /**
      * Stemmize the given term.
      * 
      * @param term The term to stem.
      * @return The stem of the given term.
      * @throws IOException If an I/O error occured.
      */
     private static String stemmize(String term) throws IOException {

         // tokenize term
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
         // stemmize
         tokenStream = new PorterStemFilter(tokenStream);

         Set<String> stems = new HashSet<String>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
         // for each token
         while (tokenStream.incrementToken()) {
             // add it in the dedicated set (to keep unicity)
             stems.add(token.toString());
         }

         // if no stem or 2+ stems have been found, return null
         if (stems.size() != 1) {
             return null;
         }

         String stem = stems.iterator().next();

         // if the stem has non-alphanumerical chars, return null
         if (!stem.matches("[\\w-]+")) {
             return null;
         }

         return stem;
     }

     /**
      * Tries to find the given example within the given collection. If it hasn't been found, the
      * example is automatically added in the collection and is then returned.
      * 
      * @param collection The collection to search into.
      * @param example The example to search.
      * @return The existing element if it has been found, the given example otherwise.
      */
     private static <T> T find(Collection<T> collection, T example) {
         for (T element : collection) {
             if (element.equals(example)) {
                 return element;
             }
         }
         collection.add(example);
         return example;
     }

     /**
      * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
      * 
      * @param The URL to read.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
      */
     public static List<Keyword> guessFromUrl(String url) throws IOException {
         // get textual content from url
         //Document doc = Jsoup.connect(url).get();
         //String content = doc.body().text();

       String content = url;
         // guess keywords from this content
         return guessFromString(content);
     }

     /**
      * Guesses keywords from given input string.
      * 
      * @param input The input string.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occured.
      */
     public static List<Keyword> guessFromString(String input) throws IOException {

         // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
         input = input.replaceAll("-+", "-0");
         // replace any punctuation char but dashes and apostrophes and by a space
         input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
         // replace most common English contractions
         input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

         // tokenize input
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
         // to lower case
         tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
         // remove dots from acronyms (and "'s" but already done manually above)
         tokenStream = new ClassicFilter(tokenStream);
         // convert any char to ASCII
         tokenStream = new ASCIIFoldingFilter(tokenStream);
         // remove english stop words
         tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

         List<Keyword> keywords = new LinkedList<Keyword>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

         // for each token
         while (tokenStream.incrementToken()) {
             String term = token.toString();
             // stemmize
             String stem = stemmize(term);
             if (stem != null) {
                 // create the keyword or get the existing one if any
                 Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                 // add its corresponding initial token
                 keyword.add(term.replaceAll("-0", "-"));
             }
         }



         tokenStream.end();
         tokenStream.close();


         // reverse sort by frequency
         Collections.sort(keywords);

         return keywords;
     }



     public static void main(String args[]) throws IOException{

       String input = "Java is a computer programming language that is concurrent, "
               + "class-based, object-oriented, and specifically designed to have as few "
               + "implementation dependencies as possible. It is intended to let application developers "
               + "write once, run anywhere (WORA), "
               + "meaning that code that runs on one platform does not need to be recompiled "
               + "to run on another. Java applications are typically compiled to byte code (class file) "
               + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
               + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
               + "for client-server web applications, with a reported 9 million developers."
               + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
               + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
               + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
               + "from C and C++, but it has fewer low-level facilities than either of them."
               + "The original and reference implementation Java compilers, virtual machines, and "
               + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
               + "May 2007, in compliance with the specifications of the Java Community Process, "
               + "Sun relicensed most of its Java technologies under the GNU General Public License. "
               + "Others have also developed alternative implementations of these Sun technologies, "
               + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
               + "(standard libraries), and IcedTea-Web (browser plugin for applets).";

       System.out.println(KeywordsGuesser.guessFromString(input));
     }



 }

这是eclipse输出的错误。

代码语言:javascript
复制
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
    at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
    at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

去掉错误后,我的输出是:

# x10,x5,sun x5,run x4,compil x4,languag x3,implement x3,applic x3,代码x3,gnu x3,comput x2,program x2,specif x2,ha x2,on x2,platform x2,字节x2,class x2,machin,大部分,原产地,微系统,ha,releas,1995,它,从,c#en28,图书管理,technolog #30#,同意r on 31,类#bas,对象#33#,设计#34,少数#en39#36#en39####en39#38号#编写x1、onc x1、anywher x1、wora x1、mean x1、doe x1、recompil x1、anoth x1、典型x1、file x1、can x1、ani x1、jvm x1,而不管x1、architectur x1、2014 x1、流行的x1、us x1、特别是en19#、客户端-serv、web、报告、9、百万、10、11、jame、gosl、其中、sinc en30、merg、oracl #en32、corpor ##33、核心#34#、en39#37号##en39#en39##37号#37号#x1#37号#37号#37号#低级x1,facil x1,x1,或者x1,x1,refer x1,also x1,1991 x1,first x1,mai x1,2007 x1,complianc x1,commun x1,process x1,relicens x1,下面是x1,gener x1,public x1,licens x1,其他,还有,altern,classpath,标准,iced茶-web,浏览器,plugin,applet

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2014-06-25 09:07:12

您需要在调用TokenStream对象上的incrementToken方法之前重置它,正如错误指出的那样:

代码语言:javascript
复制
// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
....
票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/24403993

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档