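I am trying to parse a PDF document with the iTextSharp library. The final goal is to read all of the text and split it into individual lines.
To do this, I call Split on the extracted text. A string variable (testoEstrapolato) holds the complete text, and I split it as shown below.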
Dim RigheTesto As String()
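RigheTesto = testoEstrapolato.Split({vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)
The split works well: I get an array of strings such as "data type: value", with one element for each line of the original file.
However, when the split reaches a page change (in the original PDF), it does not treat the next line as a separate line and merges it with the previous one.
Do you know how to solve this?
Sorry for taking up your time!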
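Posted on 2021-09-16 14:44:02
The following shows how to use the NuGet package iTextSharp to extract text from a PDF file (tested with v5.5.13.2).
Download/install the NuGet package iTextSharp
Create a class (name: PdfPageInfo.vb)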
Public Class PdfPageInfo
Public Property PageNumber As Integer
Public Property Lines As List(Of String) = New List(Of String)
End Class
Create a module (name: HelperiTextSharp.vb)
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
Module HelperiTextSharp
Public Function ExtractText(filename As String) As List(Of PdfPageInfo)
Dim pageInfoList As List(Of PdfPageInfo) = New List(Of PdfPageInfo)
Using reader As PdfReader = New PdfReader(filename)
For i As Integer = 1 To reader.NumberOfPages Step 1
'create new instance
Dim pageInfo As PdfPageInfo = New PdfPageInfo()
'set value
pageInfo.PageNumber = i
'get text from PDF page
Dim pageText As String = PdfTextExtractor.GetTextFromPage(reader, i)
'split on newline and set value
pageInfo.Lines = pageText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries).ToList()
'add
pageInfoList.Add(pageInfo)
Next
End Using
Return pageInfoList
End Function
End Module
Usage
Dim ofd As OpenFileDialog = New OpenFileDialog()
ofd.Filter = "PDF files(*.pdf)|*.pdf"
If ofd.ShowDialog = DialogResult.OK Then
Dim pdfPageInfoList As List(Of PdfPageInfo) = HelperiTextSharp.ExtractText(ofd.FileName)
For Each pInfo As PdfPageInfo In pdfPageInfoList
Debug.WriteLine("Page Number: " & pInfo.PageNumber.ToString())
For i As Integer = 0 To pInfo.Lines.Count - 1 Step 1
Debug.WriteLine("[" & i & "]: " & pInfo.Lines(i))
Next
Debug.WriteLine("---------------------------------" & vbCrLf)
Next
End If
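The question asks for a single array of lines, so here is a minimal sketch of how the per-page results could be flattened back into one array without fusing lines at page boundaries. It assumes the merge happens because the text of consecutive pages is joined with no newline between them, which the question does not confirm. SplitPagesIntoLines and the sample page strings are hypothetical and not part of iTextSharp.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module PageLineHelper
    ' Same separators as in the question and the answer.
    Private ReadOnly Separators As String() = {vbCrLf, vbCr, vbLf}

    ' Hypothetical helper (not an iTextSharp API): splits each page's text
    ' on its own, so the last line of one page cannot fuse with the first
    ' line of the next page.
    Public Function SplitPagesIntoLines(pageTexts As IEnumerable(Of String)) As String()
        Dim allLines As New List(Of String)

        For Each pageText As String In pageTexts
            ' Skip pages with no extractable text.
            If String.IsNullOrEmpty(pageText) Then Continue For
            allLines.AddRange(pageText.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        Next

        Return allLines.ToArray()
    End Function

    Sub Main()
        ' Made-up sample data: two pages, neither ending in a newline.
        Dim pages As String() = {
            "Name: Alice" & vbLf & "Age: 30",
            "City: Rome" & vbLf & "Zip: 00100"
        }

        ' Assumed cause of the problem: joining pages with no separator
        ' produces the fused line "Age: 30City: Rome".
        Dim merged As String() = String.Concat(pages).Split(Separators, StringSplitOptions.RemoveEmptyEntries)
        Console.WriteLine(merged.Length) ' 3 lines, one of them fused

        ' Splitting page by page keeps all four lines apart.
        Dim lines As String() = SplitPagesIntoLines(pages)
        Console.WriteLine(lines.Length) ' 4 lines
    End Sub
End Module
```

With the ExtractText function above, the same result should be obtainable through LINQ, for example HelperiTextSharp.ExtractText(ofd.FileName).SelectMany(Function(p) p.Lines).ToArray(), because each page has already been split separately.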
https://stackoverflow.com/questions/69207028