文章/答案/技术大牛

发布

社区首页 >问答首页 >使用lucene提取关键字时出错

问使用lucene提取关键字时出错
EN

Stack Overflow用户

提问于 2014-06-25 08:42:50

回答 1查看 1.4K关注 0票数 1

我对文本提取的概念完全陌生。当我搜索一个示例时，我找到了一个使用Lucene实现的示例。我只是试图在eclipse中运行它，但是它出错了。这是我得到的错误：(TokenStream合同违约:/close()调用丢失，reset()多次调用，或者子类不调用super.reset() )。有关正确的消费工作流的更多信息，请参见TokenStream类的Javadocs )。我直接从网络上发布的一篇文章中获得了代码，并且做了很少的修改，因为首先我想确保代码在一个接一个地运行之前没有错误。最初的代码是从URL中获取文本，但我将其更改为从定义的字符串中获取文本(它位于主方法中)。我还更改了版本，因为我使用Lucene4.8版本。

我也搜索了错误并且做了很少的修改，但是我仍然得到了错误。我是这里的密码。你能帮我排除这个错误吗？我应该在哪里修改以避免错误。这是我得到代码http://pastebin.com/jNALz7DZ的链接，这里是我修改的代码。

public class KeywordsGuesser {

     /** Lucene version. */
     private static Version LUCENE_VERSION = Version.LUCENE_48;

     /**
      * Keyword holder, composed by a unique stem, its frequency, and a set of found corresponding
      * terms for this stem.
      */
    public static class Keyword implements Comparable<Keyword> {

         /** The unique stem. */
         private String stem;

         /** The frequency of the stem. */
         private Integer frequency;

         /** The found corresponding terms for this stem. */
        private Set<String> terms;

         /**
          * Unique constructor.
          * 
          * @param stem The unique stem this instance must hold.
          */
         public Keyword(String stem) {
             this.stem = stem;
            terms = new HashSet<String>();
             frequency = 0;
         }

         /**
          * Add a found corresponding term for this stem. If this term has been already found, it
          * won't be duplicated but the stem frequency will still be incremented.
          * 
          * @param term The term to add.
          */
         private void add(String term) {
             terms.add(term);
             frequency++;
         }

         /**
          * Gets the unique stem of this instance.
          * 
          * @return The unique stem.
          */
         public String getStem() {
             return stem;
         }

         /**
          * Gets the frequency of this stem.
          * 
          * @return The frequency.
          */
         public Integer getFrequency() {
             return frequency;
         }

         /**
          * Gets the list of found corresponding terms for this stem.
          * 
          * @return The list of found corresponding terms.
          */
        public Set<String> getTerms() {
             return terms;
         }

         /**
          * Used to reverse sort a list of keywords based on their frequency (from the most frequent
          * keyword to the least frequent one).
          */
         @Override
         public int compareTo(Keyword o) {
             return o.frequency.compareTo(frequency);
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public boolean equals(Object obj) {
             return obj instanceof Keyword && obj.hashCode() == hashCode();
         }

         /**
          * Used to keep unicity between two keywords: only their respective stems are taken into
          * account.
          */
         @Override
         public int hashCode() {
             return Arrays.hashCode(new Object[] { stem });
         }

         /**
          * User-readable representation of a keyword: "[stem] x[frequency]".
          */
         @Override
         public String toString() {
             return stem + " x" + frequency;
         }

     }

     /**
      * Stemmize the given term.
      * 
      * @param term The term to stem.
      * @return The stem of the given term.
      * @throws IOException If an I/O error occured.
      */
     private static String stemmize(String term) throws IOException {

         // tokenize term
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(term));
         // stemmize
         tokenStream = new PorterStemFilter(tokenStream);

         Set<String> stems = new HashSet<String>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
         // for each token
         while (tokenStream.incrementToken()) {
             // add it in the dedicated set (to keep unicity)
             stems.add(token.toString());
         }

         // if no stem or 2+ stems have been found, return null
         if (stems.size() != 1) {
             return null;
         }

         String stem = stems.iterator().next();

         // if the stem has non-alphanumerical chars, return null
         if (!stem.matches("[\\w-]+")) {
             return null;
         }

         return stem;
     }

     /**
      * Tries to find the given example within the given collection. If it hasn't been found, the
      * example is automatically added in the collection and is then returned.
      * 
      * @param collection The collection to search into.
      * @param example The example to search.
      * @return The existing element if it has been found, the given example otherwise.
      */
     private static <T> T find(Collection<T> collection, T example) {
         for (T element : collection) {
             if (element.equals(example)) {
                 return element;
             }
         }
         collection.add(example);
         return example;
     }

     /**
      * Extracts text content from the given URL and guesses keywords within it (needs jsoup parser).
      * 
      * @param The URL to read.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occurred.
      * @see <a href="http://jsoup.org/">http://jsoup.org/</a>
      */
     public static List<Keyword> guessFromUrl(String url) throws IOException {
         // get textual content from url
         //Document doc = Jsoup.connect(url).get();
         //String content = doc.body().text();

       String content = url;
         // guess keywords from this content
         return guessFromString(content);
     }

     /**
      * Guesses keywords from given input string.
      * 
      * @param input The input string.
      * @return A set of potential keywords. The first keyword is the most frequent one, the last the
      *         least frequent.
      * @throws IOException If an I/O error occured.
      */
     public static List<Keyword> guessFromString(String input) throws IOException {

         // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific")
         input = input.replaceAll("-+", "-0");
         // replace any punctuation char but dashes and apostrophes and by a space
         input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
         // replace most common English contractions
         input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

         // tokenize input
         TokenStream tokenStream = new ClassicTokenizer(LUCENE_VERSION, new StringReader(input));
         // to lower case
         tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
         // remove dots from acronyms (and "'s" but already done manually above)
         tokenStream = new ClassicFilter(tokenStream);
         // convert any char to ASCII
         tokenStream = new ASCIIFoldingFilter(tokenStream);
         // remove english stop words
         tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());

         List<Keyword> keywords = new LinkedList<Keyword>();
         CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

         // for each token
         while (tokenStream.incrementToken()) {
             String term = token.toString();
             // stemmize
             String stem = stemmize(term);
             if (stem != null) {
                 // create the keyword or get the existing one if any
                 Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                 // add its corresponding initial token
                 keyword.add(term.replaceAll("-0", "-"));
             }
         }



         tokenStream.end();
         tokenStream.close();


         // reverse sort by frequency
         Collections.sort(keywords);

         return keywords;
     }



     public static void main(String args[]) throws IOException{

       String input = "Java is a computer programming language that is concurrent, "
               + "class-based, object-oriented, and specifically designed to have as few "
               + "implementation dependencies as possible. It is intended to let application developers "
               + "write once, run anywhere (WORA), "
               + "meaning that code that runs on one platform does not need to be recompiled "
               + "to run on another. Java applications are typically compiled to byte code (class file) "
               + "that can run on any Java virtual machine (JVM) regardless of computer architecture. "
               + "Java is, as of 2014, one of the most popular programming languages in use, particularly "
               + "for client-server web applications, with a reported 9 million developers."
               + "[10][11] Java was originally developed by James Gosling at Sun Microsystems "
               + "(which has since merged into Oracle Corporation) and released in 1995 as a core "
               + "component of Sun Microsystems' Java platform. The language derives much of its syntax "
               + "from C and C++, but it has fewer low-level facilities than either of them."
               + "The original and reference implementation Java compilers, virtual machines, and "
               + "class libraries were developed by Sun from 1991 and first released in 1995. As of "
               + "May 2007, in compliance with the specifications of the Java Community Process, "
               + "Sun relicensed most of its Java technologies under the GNU General Public License. "
               + "Others have also developed alternative implementations of these Sun technologies, "
               + "such as the GNU Compiler for Java (byte code compiler), GNU Classpath "
               + "(standard libraries), and IcedTea-Web (browser plugin for applets).";

       System.out.println(KeywordsGuesser.guessFromString(input));
     }



 }

这是eclipse输出的错误。

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
    at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:110)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.zzRefill(ClassicTokenizerImpl.java:431)
    at org.apache.lucene.analysis.standard.ClassicTokenizerImpl.getNextToken(ClassicTokenizerImpl.java:638)
    at org.apache.lucene.analysis.standard.ClassicTokenizer.incrementToken(ClassicTokenizer.java:140)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.standard.ClassicFilter.incrementToken(ClassicFilter.java:47)
    at org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter.incrementToken(ASCIIFoldingFilter.java:104)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at beehex.lucene.KeywordsGuesser.guessFromString(KeywordsGuesser.java:239)
    at beehex.lucene.KeywordsGuesser.main(KeywordsGuesser.java:288)

去掉错误后，我的输出是：

# x10，x5，sun x5，run x4，compil x4，languag x3，implement x3，applic x3，代码x3，gnu x3，comput x2，program x2，specif x2，ha x2，on x2，platform x2，字节x2，class x2，machin，大部分，原产地，微系统，ha，releas，1995，它，从，c#en28，图书管理，technolog #30#，同意r on 31，类#bas，对象#33#，设计#34，少数#en39#36#en39####en39#38号#编写x1、onc x1、anywher x1、wora x1、mean x1、doe x1、recompil x1、anoth x1、典型x1、file x1、can x1、ani x1、jvm x1，而不管x1、architectur x1、2014 x1、流行的x1、us x1、特别是en19#、客户端-serv、web、报告、9、百万、10、11、jame、gosl、其中、sinc en30、merg、oracl #en32、corpor ##33、核心#34#、en39#37号##en39#en39##37号#37号#x1#37号#37号#37号#低级x1，facil x1，x1，或者x1，x1，refer x1，also x1,1991 x1，first x1，mai x1,2007 x1，complianc x1，commun x1，process x1，relicens x1，下面是x1，gener x1，public x1，licens x1，其他，还有，altern，classpath，标准，iced茶-web，浏览器，plugin，applet

lucene

tokenize

feature-extraction

java

回答 1

Stack Overflow用户

回答已采纳

发布于 2014-06-25 09:07:12

您需要在调用TokenStream对象上的incrementToken方法之前重置它，正如错误指出的那样：

// add this line
tokenStream.reset();
while (tokenStream.incrementToken()) {
....

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/24403993

复制

相似问题

问使用lucene提取关键字时出错
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用lucene提取关键字时出错EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问使用lucene提取关键字时出错
EN