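I followed the example here to index all the PDFs in a directory. The process seems to run fine, but at the end, when I go to the Solr UI and click "Execute Query" (with q=**), I get only one entry.
Is there something missing in my code?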
...
String[] files = documentDir.list();
if (files != null)
{
for (String document : files)
{
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser autoDetectParser = new AutoDetectParser();
InputStream inputStream = null;
try
{
inputStream = new FileInputStream(new File(documentDir, document));
autoDetectParser.parse(inputStream, textHandler, metadata, context);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", document);
String content = textHandler.toString();
if (content != null)
{
doc.addField("fullText", content);
}
UpdateResponse resp = server.add(doc, 1);
server.commit(true, true, true);
if (resp.getStatus() != 0)
{
throw new IDSystemException(LOG, "Document could not be indexed. Status returned: " + resp.getStatus());
}
}
catch (FileNotFoundException fnfe)
{
throw new IDSystemException(LOG, fnfe.getMessage(), fnfe);
}
catch (IOException ioe)
{
throw new IDSystemException(LOG, ioe.getMessage(), ioe);
}
catch (SAXException se)
{
throw new IDSystemException(LOG, se.getMessage(), se);
}
catch (TikaException te)
{
throw new IDSystemException(LOG, te.getMessage(), te);
}
catch (SolrServerException sse)
{
throw new IDSystemException(LOG, sse.getMessage(), sse);
}
finally
{
if (inputStream != null)
{
try
{
inputStream.close();
}
catch (IOException ioe)
{
throw new IDSystemException(LOG, ioe.getMessage(), ioe);
}
}
}
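...
Answered on 2014-03-11 13:42:09
In solrconfig.xml I had bound "signatureField" to "uid", which is also the uniqueKey, so the uid was always the same and each newly added document replaced the previous one. I have now defined a separate field for "signatureField", and it works!
Before: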
...
<updateRequestProcessorChain name="deduplication">
<processor
class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="overwriteDupes">false</bool>
<str name="signatureField">uid</str> <!-- signature written into the uniqueKey field -->
<bool name="enabled">true</bool>
<str name="fields">content</str>
<str name="minTokenLen">10</str>
<str name="quantRate">.2</str>
<str name="signatureClass">solr.update.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...
...
<fields>
<field name="uid" type="string" indexed="true" stored="true" multiValued="false" />
<dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false" />
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="fullText" indexed="true" type="text" multiValued="true" />
</fields>
<uniqueKey>uid</uniqueKey>
...
After:
...
<updateRequestProcessorChain name="deduplication">
<processor
class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
<bool name="overwriteDupes">false</bool>
<str name="signatureField">signatureField</str> <!-- signature now written into its own field -->
<bool name="enabled">true</bool>
<str name="fields">content</str>
<str name="minTokenLen">10</str>
<str name="quantRate">.2</str>
<str name="signatureClass">solr.update.processor.TextProfileSignature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...
...
<fields>
<field name="uid" type="string" indexed="true" stored="true" multiValued="false" />
<field name="signatureField" type="string" indexed="true" stored="true" multiValued="false" /> <!-- new field -->
<dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false" />
<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="fullText" indexed="true" type="text" multiValued="true" />
</fields>
<uniqueKey>uid</uniqueKey>
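
To see why the first configuration leaves only one entry, here is a minimal plain-Java sketch that does not use Solr at all. It models the index as a map keyed by the uniqueKey value. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the `signature` helper is a trivial stand-in rather than Solr's TextProfileSignature, and each sample document is assumed to carry its own distinct "uid" (here a file name). Because the sample documents have "fullText" but no "content" field, the stand-in signature is identical for all of them, which is probably what happened in the configuration above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniqueKeyOverwriteSketch {

    // Hypothetical stand-in for a signature: hashes only the configured source field.
    // A document without that field hashes the empty string, so all such documents match.
    static String signature(Map<String, String> doc, String sourceField) {
        String value = doc.getOrDefault(sourceField, "");
        return Integer.toHexString(value.hashCode());
    }

    // Toy index keyed by the "uid" value: adding a document whose uid already
    // exists replaces the earlier document, as a uniqueKey does.
    static int countAfterIndexing(List<Map<String, String>> docs, String signatureField) {
        Map<String, Map<String, String>> index = new HashMap<>();
        for (Map<String, String> original : docs) {
            Map<String, String> doc = new HashMap<>(original);
            // If signatureField is "uid", this overwrites the document's own uid.
            doc.put(signatureField, signature(doc, "content"));
            index.put(doc.get("uid"), doc);
        }
        return index.size();
    }

    // Three sample documents with distinct uids and no "content" field (assumed data).
    static List<Map<String, String>> sampleDocs() {
        return List.of(
                Map.of("uid", "a.pdf", "fullText", "first text"),
                Map.of("uid", "b.pdf", "fullText", "second text"),
                Map.of("uid", "c.pdf", "fullText", "third text"));
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = sampleDocs();
        // Signature stored in the key field: every document collapses onto one key.
        System.out.println(countAfterIndexing(docs, "uid"));            // prints 1
        // Signature stored in its own field: the keys stay distinct.
        System.out.println(countAfterIndexing(docs, "signatureField")); // prints 3
    }
}
```

The sketch only illustrates the overwrite effect of a shared key; it says nothing about how Solr computes or compares signatures.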
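...
Answered on 2014-03-11 09:18:44
Maybe check your server.add line. Why is 1 passed as the second argument?
Have you considered using Nutch, which would handle the file crawling and indexing for you?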
https://stackoverflow.com/questions/22320927