<a href="http://itextpdf.com/" rel="nofollow noreferrer">iText</a> now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.

PdfBox does not run on GAE because it uses Java classes that GAE does not allow.
(GAE only permits the classes listed at <a href="http://code.google.com/appengine/docs/java/jrewhitelist.html" rel="nofollow noreferrer">http://code.google.com/appengine/docs/java/jrewhitelist.html</a>)

I have partially modified a very old version of PdfBox (0.7.3) to make it GAE compliant. Now I'm able to extract text from a PDF (whole page or rectangular area). I only modified the minimum part needed for PDF text extraction, not the whole of PdfBox. :)
The idea was to remove references to java.awt.Rectangle and related classes, using my own "rectangle" class instead.
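
As a rough illustration of that idea (this is not the actual patch), a rectangle type with no AWT dependency can be a few lines of plain Java. The names `Region`, `Fragment` and `extract` below are hypothetical and are not part of PdfBox; the sketch only assumes that the text extractor yields pieces of text with a position.

```java
import java.util.ArrayList;
import java.util.List;

/** AWT-free stand-in for java.awt.Rectangle (hypothetical, not PdfBox code). */
final class Region {
    final float x, y, width, height;

    Region(float x, float y, float width, float height) {
        if (width < 0 || height < 0) {
            throw new IllegalArgumentException("width and height must be non-negative");
        }
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Left and top edges are inclusive, right and bottom edges are exclusive. */
    boolean contains(float px, float py) {
        return px >= x && px < x + width && py >= y && py < y + height;
    }

    /** Concatenates, in list order, the fragments whose origin lies inside this region. */
    String extract(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (contains(f.x, f.y)) {
                sb.append(f.text);
            }
        }
        return sb.toString();
    }

    /** A positioned piece of text (hypothetical stand-in for parser output). */
    static final class Fragment {
        final float x, y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(10f, 10f, "Hello "));
        page.add(new Fragment(200f, 10f, "outside"));
        page.add(new Fragment(50f, 40f, "world"));
        // Only fragments positioned inside the 100 x 50 region are kept.
        System.out.println(new Region(0f, 0f, 100f, 50f).extract(page));
    }
}
```

Because the class uses only `java.lang` and `java.util`, it avoids the `java.awt` classes that the GAE whitelist rejects.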

More info: <a href="http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html" rel="nofollow noreferrer">http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html</a>

I modified the latest (1.8.0-SNAPSHOT) version to run on Google App Engine. I had to disable one unit test, but it runs fine for simple text extraction.

Following a simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.

You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.

For the extra lazy, here are the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction); I have only used it to extract the text of a whole page.
<a href="https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit" rel="nofollow">https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit</a>

I know there is <a href="http://pdfbox.apache.org/index.html" rel="nofollow noreferrer">http://pdfbox.apache.org/index.html</a>

<blockquote>
 Apache PDFBox is an open source Java
 PDF library for working with PDF
 documents. This project allows
 creation of new PDF documents,
 manipulation of existing documents and
 the ability to extract content from
 documents.
</blockquote>

but I've never tested it.

Last month I finished extracting text from PDF files in my project. I used the Xpdf tool to get the text and the text coordinates, but I used it from Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether Xpdf will work from Java. Anyway, you can try this tool.

Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?

I've read about PDFJet, but it can't read PDF, can it?

Is there perhaps another way to extract text from PDF? I tried <a href="http://www.pdfdownload.org/" rel="nofollow noreferrer">http://www.pdfdownload.org/</a>, but unfortunately it doesn't handle non-English characters correctly.

Extract text from PDF (google app engine)
