Q: Why do my page contents get mixed up when reading a PDF?

Stack Overflow user

Asked 2020-01-06 16:09:45

Answers: 1, Views: 292, Followers: 0, Votes: 3

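I am using iText7 to read text from a PDF file. This works fine for the first page. After that, the contents of the pages somehow get mixed up. For example, on page 3 of the document I get lines containing content from both page 1 and page 3. The text of page 2 shows exactly the same lines as page 1 (although in reality they are completely different).

Page 1, actual: ~36 lines, result: 36 lines -> fine
Page 2, actual: >50 lines, result: 36 lines (== page 1)
Page 3, actual: ~16 lines, result: 47 lines (lines from page 1 added and mixed in)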

https://www.dropbox.com/s/63gy5cg1othy6ci/Dividenden_Microsoft.pdf?dl=0

To read the document, I use the following code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace StockMarket
{
    class PdfReader
    {
        /// <summary>
        /// Reads PDF file by a given path.
        /// </summary>
        /// <param name="path">The path to the file</param>
        /// <param name="pageCount">The number of pages to read (0=all, 1 by default) </param>
        /// <returns></returns>
        public static DocumentTree PdfToText(string path, int pageCount=1 )
        {
            var pages = new DocumentTree();
            using (iText.Kernel.Pdf.PdfReader reader = new iText.Kernel.Pdf.PdfReader(path))
            {
                using (iText.Kernel.Pdf.PdfDocument pdfDocument = new iText.Kernel.Pdf.PdfDocument(reader))
                {
                    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

                    // set up pages to read
                    int pagesToRead = 1;
                    if (pageCount > 0)
                    {
                        pagesToRead = pageCount;
                    }
                    if (pagesToRead > pdfDocument.GetNumberOfPages() || pageCount==0)
                    {
                        pagesToRead = pdfDocument.GetNumberOfPages();
                    }

                    // for each page to read...
                    for (int i = 1; i <= pagesToRead; ++i)
                    {
                        // get the page and save it
                        var page = pdfDocument.GetPage(i);
                        var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
                        pages.Add(txt);
                    }
                    pdfDocument.Close();
                    reader.Close();
                }
            }
            return pages;
        }

    }

    /// <summary>
    /// A class representing parts of a PDF document.
    /// </summary>
    class DocumentTree
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public DocumentTree()
        {
            Pages = new List<DocumentPage>();
        }

        private List<DocumentPage> _pages;
        /// <summary>
        /// The pages of the document
        /// </summary>
        public List<DocumentPage> Pages
        {
            get { return _pages; }
            set { _pages = value; }
        }

        /// <summary>
        /// Adds a <see cref="DocumentPage"/> to the document.
        /// </summary>
        /// <param name="page">The text of the <see cref="DocumentPage"/>.</param>
        public void Add(string page)
        {
            Pages.Add(new DocumentPage(page));
        }
    }

    /// <summary>
    /// A class representing a single page of a document
    /// </summary>
    class DocumentPage
    {
        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="pageContent">The pages content as text</param>
        public DocumentPage(string pageContent)
        {
            // set the content to the input
            CompletePage = pageContent;

            // split the content by lines
            var splitter = new string[] { "\n" };
            foreach (var line in CompletePage.Split(splitter, StringSplitOptions.None))
            {
                // add lines to the page if the content is not empty
                if (!string.IsNullOrWhiteSpace(line))
                {                    
                    _lines.Add(new Line(line));
                }
            }

        }

        private List<Line> _lines = new List<Line>();
        /// <summary>
        /// The lines of text of the <see cref="DocumentPage"/>
        /// </summary>
        public List<Line> Lines
        {
            get
            {
                return _lines;
            }            
        }

        /// <summary>
        /// The text of the complete <see cref="DocumentPage"/>.
        /// </summary>
        private string CompletePage;
    }

    /// <summary>
    /// A class representing a single line of text
    /// </summary>
    class Line
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public Line(string lineContent)
        {
            CompleteLine = lineContent;
        }

        /// <summary>
        /// The words of the <see cref="Line"/>.
        /// </summary>
        public List<string> Words
        {
            get
            {
                return CompleteLine.Split(" ".ToArray()).Where((word)=> { return !string.IsNullOrWhiteSpace(word); }).ToList();
            }
        }

        /// <summary>
        /// The complete text of the <see cref="Line"/>.
        /// </summary>
        private string CompleteLine;

        public override string ToString()
        {
            return CompleteLine;
        }
    }
}

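DocumentTree is a simple tree containing pages, which consist of lines (the page text split by "\n"), which in turn consist of words (split by " "). However, the txt variable in the loop already contains the mixed-up content, so my tree is not the cause of the problem.

Thanks for your help.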

.net

itext7

text-extraction

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-01-16 17:04:08

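Some parser event listeners, in particular most text extraction strategies, are not meant to be reused across multiple pages. Instead, you should create a new instance for each page.

As a rule of thumb, every such listener that collects information while a page is being parsed and then lets you access that data (just as text extraction strategies let you access the collected page text) most likely has to be instantiated separately for each page, unless you want the data of all pages to accumulate.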
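The accumulation effect can be illustrated without iText at all. The following is a minimal sketch, not iText code: AccumulatingListener, ListenerReuseDemo and their methods are hypothetical names made up for this illustration, and the listener simply remembers every chunk it has ever been shown. A real LocationTextExtractionStrategy additionally orders its collected chunks by their position, which presumably explains why lines from different pages end up interleaved rather than merely appended.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text extraction strategy (NOT an iText class):
// it collects every chunk it receives and never clears its state.
class AccumulatingListener
{
    private readonly StringBuilder _collected = new StringBuilder();

    public void OnText(string chunk)
    {
        _collected.Append(chunk).Append('\n');
    }

    public string GetResultantText()
    {
        return _collected.ToString();
    }
}

static class ListenerReuseDemo
{
    // Feeds the chunks of one page to the listener and returns what it reports.
    public static string Extract(IEnumerable<string> pageChunks, AccumulatingListener listener)
    {
        foreach (var chunk in pageChunks)
        {
            listener.OnText(chunk);
        }
        return listener.GetResultantText();
    }

    // One listener shared by all pages: later results include earlier pages.
    public static List<string> ExtractShared(List<string[]> pages)
    {
        var shared = new AccumulatingListener();
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(Extract(page, shared));
        }
        return results;
    }

    // A fresh listener per page: each result contains only that page.
    public static List<string> ExtractFresh(List<string[]> pages)
    {
        var results = new List<string>();
        foreach (var page in pages)
        {
            results.Add(Extract(page, new AccumulatingListener()));
        }
        return results;
    }
}
```

For two one-line pages containing "a" and "b", ExtractShared returns "a\n" followed by "a\nb\n", whereas ExtractFresh returns "a\n" followed by "b\n".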

So, in your code, move the strategy instantiation

var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

into the for loop:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
    pages.Add(txt);
}

Alternatively, you can shorten the loop to

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page);
    pages.Add(txt);
}

This PdfTextExtractor.GetTextFromPage overload internally creates a new LocationTextExtractionStrategy instance on each call.

Votes: 2
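Put in context, the whole reading method might look roughly like the sketch below. It assumes the iText 7 package is referenced and uses only calls that already appear in the question. PdfTextSketch, ReadPages and ResolvePagesToRead are hypothetical names introduced for this illustration, and the method returns plain strings instead of the asker's DocumentTree to stay short.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

static class PdfTextSketch
{
    // Hypothetical helper mirroring the question's page-count rules:
    // 0 means "all pages", requests beyond the document are clamped,
    // and negative values fall back to a single page.
    public static int ResolvePagesToRead(int requested, int total)
    {
        if (requested == 0 || requested > total)
        {
            return total;
        }
        return Math.Max(1, requested);
    }

    public static List<string> ReadPages(string path, int pageCount = 1)
    {
        var texts = new List<string>();
        // The using statements close the document and the reader,
        // so no explicit Close() calls are needed.
        using (var reader = new PdfReader(path))
        using (var pdfDocument = new PdfDocument(reader))
        {
            int pagesToRead = ResolvePagesToRead(pageCount, pdfDocument.GetNumberOfPages());
            for (int i = 1; i <= pagesToRead; ++i)
            {
                // A new strategy per page, so nothing collected on
                // earlier pages is carried over into this page's text.
                var strategy = new LocationTextExtractionStrategy();
                texts.Add(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
        }
        return texts;
    }
}
```

Each returned string can then be passed to the asker's DocumentTree.Add to build the page tree as before.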

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59615341
