Q: Parsing unstructured data from PDFs into structured data in R or Python

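Stack Overflow user

Asked 2020-02-05 01:33:01

Answers: 1 | Views: 1K | Followers: 0 | Votes: 0

I need to read 100 PDF documents, extract text information from them, and export it to Excel. Each PDF contains various pieces of text, and I need to build a data table from them. Below is part of one PDF from which I need to extract information.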

I am doing my job in the company(Employee Id : 12345678)
Name : XXXXX YYYYY
** Date of Birth : 12/12/2001**
** Place : AAAAAAAA**
** Address: 111, BLOCK 1,**
** XYZ LOCALITY**
** BANGKOK **
** Email id: xyz@yahoo.in**

I have to create columns and extract this information from all of the PDFs into Excel. I am trying to use tesseract and pdf_convert.

My output should be:

Date              Address         Place 
12/12/2001       XYZ Locality    AAAAAAA
                  bangkok

python

pdf

machine-learning

ocr

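1 Answer

Stack Overflow user

Answered 2022-09-26 22:16:29

Here is one approach to consider: join the lines into one string, split it on the ** markers, and treat any piece without a colon as a continuation of the previous field.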

library(stringr)

# Sample lines as they come out of the PDF text extraction
text_Vector <- c("** Date of Birth : 12/12/2001**",
                 "** Place : AAAAAAAA**",
                 "** Address: 111, BLOCK 1,**",
                 "** XYZ LOCALITY**",
                 "** BANGKOK **",
                 "** Email id: xyz@yahoo.in**")

# Join everything into one string, then split on the "**" markers
# ("****" appears where the end of one line meets the start of the next)
text_Vector_One_Line <- paste0(text_Vector, collapse = "")
text_Splitted <- stringr::str_split(text_Vector_One_Line, "(\\*\\*\\*\\*)|(\\*\\*)")[[1]]
text_Splitted <- text_Splitted[text_Splitted != ""]

list_Text <- list()
counter <- 1

for (i in seq_along(text_Splitted)) {
  if (stringr::str_detect(text_Splitted[i], ":")) {
    # A piece containing ":" starts a new "field : value" entry
    list_Text[[counter]] <- text_Splitted[i]
    counter <- counter + 1
  } else if (counter > 1) {
    # A piece without ":" continues the previous entry (e.g. a multi-line address)
    list_Text[[counter - 1]] <- paste0(list_Text[[counter - 1]], text_Splitted[i])
  }
}

# Split each entry into field name and value
# (a value that itself contains ":" would be cut at its first colon)
list_Text <- lapply(X = list_Text, FUN = function(x) strsplit(x, ":")[[1]])
first_Col <- unlist(lapply(X = list_Text, FUN = function(x) x[1]))
second_Col <- unlist(lapply(X = list_Text, FUN = function(x) x[2]))

cbind(first_Col, second_Col)

     first_Col         second_Col
[1,] " Date of Birth " " 12/12/2001"
[2,] " Place "         " AAAAAAAA"
[3,] " Address"        " 111, BLOCK 1, XYZ LOCALITY BANGKOK "
[4,] " Email id"       " xyz@yahoo.in"
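
Since the question also allows Python, the same idea can be sketched there. This is only a minimal sketch under some assumptions: the text has already been extracted from the PDF (for example by OCR), the fields are marked with ** as in the sample, and the function names parse_fields and records_to_csv are hypothetical helpers made up for this illustration. It writes CSV, which Excel can open, so that only the standard library is needed.

```python
import csv
import io


def parse_fields(lines):
    """Turn 'Key : value' lines into a dict (hypothetical helper).

    A line without ':' is treated as a continuation of the previous field,
    which is how the multi-line address in the sample is handled.
    """
    record = {}
    last_key = None
    for raw in lines:
        # Drop the surrounding ** markers and any padding
        text = raw.strip().strip("*").strip()
        if not text:
            continue
        # Split at the first ':' only, so values containing ':' stay intact
        key, sep, value = text.partition(":")
        if sep:
            last_key = key.strip()
            record[last_key] = value.strip()
        elif last_key is not None:
            record[last_key] = f"{record[last_key]} {text}".strip()
    return record


def records_to_csv(records, columns):
    """Write one row per record, keeping only the requested columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


sample_lines = [
    "** Date of Birth : 12/12/2001**",
    "** Place : AAAAAAAA**",
    "** Address: 111, BLOCK 1,**",
    "** XYZ LOCALITY**",
    "** BANGKOK **",
    "** Email id: xyz@yahoo.in**",
]

record = parse_fields(sample_lines)
csv_text = records_to_csv([record], ["Date of Birth", "Address", "Place"])
print(csv_text)
```

With 100 PDFs, one would call parse_fields once per document and pass the whole list of records to records_to_csv. Unlike the R version above, the values here are trimmed of surrounding whitespace.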

Votes: 0

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60067856
