I tried searching for sentences with ICEpdf and got the correct result most of the time. But the problem I am facing now is
Posted on 2014-03-03 23:35:22
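Loop through all the lines in the document and build a list of sentences. Each sentence can be a list of WordText objects. Then search the list you built to find your sentence.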
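The search step can be sketched independently of ICEpdf. This is only an illustration under some assumptions: plain strings stand in for the text of each WordText, spaces are taken to be separate entries in a sentence (as the space check in the sample code suggests), and the class name `SentenceSearch` and its methods are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSearch {
    // Joins the word texts of one sentence into a single string.
    // Assumption: spaces are already separate entries, so no separator is added.
    static String join(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append(w);
        }
        return sb.toString();
    }

    // Returns the indices of all sentences whose text contains the query,
    // ignoring case and leading/trailing whitespace in the query.
    static List<Integer> find(List<List<String>> sentences, String query) {
        List<Integer> hits = new ArrayList<>();
        String needle = query.trim().toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return hits; // an empty query matches nothing
        }
        for (int i = 0; i < sentences.size(); i++) {
            String text = join(sentences.get(i)).toLowerCase(Locale.ROOT);
            if (text.contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

With real data, each returned index would point back at the matching list of WordText objects, which is where you would pick up their positions.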
Here is some sample code (which I have not tested so far) to build the lists of WordText objects.
    import java.util.ArrayList;
    import java.util.List;
    import org.icepdf.core.pobjects.Document;
    import org.icepdf.core.pobjects.graphics.text.LineText;
    import org.icepdf.core.pobjects.graphics.text.PageText;
    import org.icepdf.core.pobjects.graphics.text.WordText;

    List<List<WordText>> sentences = new ArrayList<>();
    List<WordText> currentSentence = new ArrayList<>();
    Document document = new Document();
    // Load a PDF into the document first, e.g. document.setFile(path)

    // Build sentences
    for (int pageNumber = 0, max = document.getNumberOfPages();
            pageNumber < max; pageNumber++) {
        PageText pageText = document.getPageText(pageNumber);
        List<LineText> pageLines = pageText.getPageLines();
        for (LineText pageLine : pageLines) {
            List<WordText> words = pageLine.getWords();
            for (WordText word : words) {
                // If this is a word, and the last word was not a space,
                // start a new sentence
                if (!word.getText().equals(" ") && currentSentence.size() > 0
                        && !currentSentence.get(currentSentence.size() - 1).getText().equals(" ")) {
                    sentences.add(currentSentence);
                    currentSentence = new ArrayList<>();
                }
                // Add word to current sentence
                currentSentence.add(word);
            }
        }
    }
    // Add the last sentence in (once, after every page has been processed)
    if (!currentSentence.isEmpty()) {
        sentences.add(currentSentence);
    }

If you need to sort a list of WordText objects, you can compare the y and x values of the WordText objects.
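Sorting by y and then x could look roughly like the sketch below. It rests on assumptions: `Word` is a hypothetical stand-in for WordText that carries only text and a position, and the coordinates are taken to follow PDF user space, where y grows upward, so a larger y means higher on the page. With ICEpdf the position would presumably come from each word's bounds, which is not verified here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WordSort {
    // Hypothetical stand-in for WordText: text plus the position of the word.
    static final class Word {
        final String text;
        final float x;
        final float y;

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Reading order under the stated assumption that y grows upward:
    // larger y first (top of the page), then smaller x first (left to right).
    static final Comparator<Word> READING_ORDER =
            Comparator.comparingDouble((Word w) -> w.y)
                    .reversed()
                    .thenComparingDouble(w -> w.x);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Word> sorted(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(READING_ORDER);
        return copy;
    }
}
```

Words on the same visual line can have slightly different y values, so an exact comparison like this one may need a small tolerance on y before it gives a reliable reading order on real documents.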
https://stackoverflow.com/questions/18372084