Q: How to get the file path of a document in setFieldName of PDFTextStream?

Stack Overflow user

Asked 2014-06-13 15:30:06

Answers: 1 · Views: 106 · Followers: 0 · Votes: 3

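I'm working with PDFTextStream and Lucene to index PDF files. The text is built with the PDFTextStream methods, but I can't get the file path of the documents stored in the directory I'm indexing. I tried:

setFieldName(file.getPath(),"path");, but I can't get the file path. Any suggestions?

Here is my code: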

public class PDFDocument {
// Fields, followed by the no-argument constructor
 IndexWriter writer;
 File directorio;
public PDFDocument(){
    directorio= new File("C:/indexpdf");
}
/* Indexes every readable PDF file found in directorio
 * into a Lucene index through an IndexWriter.
 */

public void agregarPDFaIndex() throws IOException{
    writer= new IndexWriter(new File("C:/indexpdfsalida"), new StandardAnalyzer(), true);
    // Create and initialize a new DocumentFactoryConfig instance
    DocumentFactoryConfig config= new DocumentFactoryConfig();


    config.setCopyAllPDFAttrs(false);
    // The PDF document's attribute data is stored, tokenized, and indexed
    config.setPDFAttrSettings(true, true, true);
    /* Set the explicit names to use for the Fields
     * of each new Lucene Document instance.
     */
    File[] files= directorio.listFiles();

    for(File file: files){
        if(file.canRead() && !file.isDirectory() && file.getName().endsWith(".pdf")){
            System.out.println("Indexando el archivo: "+file.getAbsolutePath());
            config.setMainTextFieldName("content");
            Document doc= new Document();

            config.setTextSettings(false, true, true);
            config.setFieldName(PDFTextStream.ATTR_AUTHOR, "autor");
            config.setFieldName(PDFTextStream.ATTR_CREATION_DATE, "fecha_creacion");
            config.setFieldName(PDFTextStream.ATTR_MOD_DATE,"ultimo_mod");
            config.setFieldName(PDFTextStream.ATTR_TITLE,"titulo");
            config.setFieldName(DocumentFactoryConfig.DEFAULT_MAIN_TEXT_FIELD_NAME, "content");
            config.setFieldName(PDFTextStream.ATTR_CREATOR,"creador");
            config.setFieldName(PDFTextStream.ATTR_PRODUCER, "productor");
            config.setFieldName(PDFTextStream.ATTR_SUBJECT, "asunto");
            config.setFieldName(file.getPath(), "path");

            doc= PDFDocumentFactory.buildPDFDocument(file, config);

            System.out.println(doc.get("path"));
            writer.addDocument(doc);
        }
    }
    writer.optimize();
    writer.close();
    System.out.println("OK");

}
public static void main(String[]args) throws IOException{
    PDFDocument doc= new PDFDocument();
    doc.agregarPDFaIndex();
}
}

java

eclipse

lucene

indexing

filepath

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-06-13 18:06:48

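I don't see any constant in PDFTextStream that you could use for this. However, it is simple to add the path directly to the Lucene document produced by PDFDocumentFactory before adding that document to the index: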

doc= PDFDocumentFactory.buildPDFDocument(file, config);

doc.add(new StringField("path", file.getPath(), Field.Store.YES));

System.out.println(doc.get("path"));
writer.addDocument(doc);

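You may want a different field type depending on your needs: StringField for exact matching, TextField for full-text search, or StoredField if the field does not need to be searchable.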
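To illustrate why the original setFieldName call produces no "path" field, here is a minimal stdlib-only sketch. It assumes, as the usage in the question suggests, that setFieldName maps the name of a PDF attribute to the name of the Lucene field that receives it. The class FieldNameMappingSketch, its methods, and the sample metadata are hypothetical stand-ins and are not part of PDFTextStream or Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of an attribute-to-field-name mapping; not PDFTextStream code. */
final class FieldNameMappingSketch {

    /**
     * Copies each PDF attribute into the result under its configured field name.
     * A mapping key that matches no attribute in the PDF produces no field.
     */
    static Map<String, String> buildFields(Map<String, String> pdfAttributes,
                                           Map<String, String> fieldNames) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : fieldNames.entrySet()) {
            String value = pdfAttributes.get(mapping.getKey());
            if (value != null) {
                fields.put(mapping.getValue(), value);
            }
        }
        return fields;
    }

    /** Builds the fields, then adds the path explicitly, mirroring the answer's approach. */
    static Map<String, String> buildFieldsWithPath(Map<String, String> pdfAttributes,
                                                   Map<String, String> fieldNames,
                                                   String filePath) {
        Map<String, String> fields = buildFields(pdfAttributes, fieldNames);
        fields.put("path", filePath);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> pdfAttributes = new LinkedHashMap<>();
        pdfAttributes.put("Author", "Jane Doe"); // made-up metadata

        String filePath = "C:/indexpdf/example.pdf"; // made-up path

        Map<String, String> fieldNames = new LinkedHashMap<>();
        fieldNames.put("Author", "autor");
        // The question's attempt: the file path is used as if it were an attribute name.
        fieldNames.put(filePath, "path");

        // No attribute is named after the file path, so no "path" field exists.
        System.out.println(buildFields(pdfAttributes, fieldNames).get("path"));
        // Adding the value after the document is built does produce the field.
        System.out.println(buildFieldsWithPath(pdfAttributes, fieldNames, filePath).get("path"));
    }
}
```

Under that assumption, the first call prints null and the second prints the path. Note also that StringField was introduced in Lucene 4.0, while the IndexWriter constructor and optimize() call in the question belong to earlier releases. On such a release the equivalent would presumably be new Field("path", file.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED).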

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/24209003
