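I need to read 100 PDF documents, extract text information from each of them, and export it to Excel. The PDFs contain various pieces of text, and I need to build a data table from that text. Below is part of one PDF from which I need to extract information.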
I am doing my job in the company(Employee Id : 12345678)
Name : XXXXX YYYYY
** Date of Birth : 12/12/2001**
** Place : AAAAAAAA**
** Address: 111, BLOCK 1,**
** XYZ LOCALITY**
** BANGKOK **
** Email id: xyz@yahoo.in**
I have to create columns and extract all of this information from all the PDFs into Excel. I am trying to use tesseract and pdf_convert.
My output should be:
Date        Address       Place
12/12/2001  XYZ Locality  AAAAAAA
            bangkok
Posted on 2022-09-26 22:16:29
Here is one approach you could consider:
library(stringr)

# Sample lines as they appear in the text extracted from the PDF
text_Vector <- c("** Date of Birth : 12/12/2001**",
                 "** Place : AAAAAAAA**",
                 "** Address: 111, BLOCK 1,**",
                 "** XYZ LOCALITY**",
                 "** BANGKOK **",
                 "** Email id: xyz@yahoo.in**")

# Join everything into one string, then split it on the "**" markers
text_Vector_One_Line <- paste0(text_Vector, collapse = "")
text_Splitted <- stringr::str_split(text_Vector_One_Line, "(\\*\\*\\*\\*)|(\\*\\*)")[[1]]
text_Splitted <- text_Splitted[text_Splitted != ""]

# A token containing ":" starts a new field; any other token continues
# the previous field (for example the extra address lines)
list_Text <- list()
counter <- 1
for (token in text_Splitted) {
  if (stringr::str_detect(token, ":")) {
    list_Text[[counter]] <- token
    counter <- counter + 1
  } else if (counter > 1) {
    # Guard: skip a continuation token that appears before any field exists
    list_Text[[counter - 1]] <- paste0(list_Text[[counter - 1]], token)
  }
}

# Split each field at ":" into its label and its value
list_Text <- lapply(X = list_Text, FUN = function(x) strsplit(x, ":")[[1]])
first_Col <- unlist(lapply(X = list_Text, FUN = function(x) x[1]))
second_Col <- unlist(lapply(X = list_Text, FUN = function(x) x[2]))
cbind(first_Col, second_Col)
     first_Col         second_Col
[1,] " Date of Birth " " 12/12/2001"
[2,] " Place "         " AAAAAAAA"
[3,] " Address"        " 111, BLOCK 1, XYZ LOCALITY BANGKOK "
[4,] " Email id"       " xyz@yahoo.in"
https://stackoverflow.com/questions/60067856
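To get from the label/value pairs to the one-row-per-PDF table the question asks for, something along these lines might work. It is only a sketch in base R, and it assumes the text of each PDF has already been read into a character vector of lines (for instance with `pdftools::pdf_text()` and a split on newlines, or with `tesseract::ocr()` for scanned pages). The helper `parse_fields`, the list `docs`, the second sample document and the output file name are all made up for illustration and do not come from the original post.

```r
# Hypothetical helper: turn the lines of one document into a named
# character vector, e.g. c("Date of Birth" = "12/12/2001", ...).
parse_fields <- function(lines) {
  # Drop the "**" markers and the surrounding whitespace
  lines <- trimws(gsub("*", "", lines, fixed = TRUE))
  lines <- lines[nzchar(lines)]
  keys <- character(0)
  values <- character(0)
  for (line in lines) {
    if (grepl(":", line, fixed = TRUE)) {
      # Split on the first colon only, so a value may itself contain ":"
      pos <- regexpr(":", line, fixed = TRUE)[1]
      keys <- c(keys, trimws(substr(line, 1, pos - 1)))
      values <- c(values, trimws(substr(line, pos + 1, nchar(line))))
    } else if (length(values) > 0) {
      # Continuation line: append it to the most recent field
      n <- length(values)
      values[n] <- trimws(paste(values[n], line))
    }
  }
  stats::setNames(values, keys)
}

# One entry per PDF. The first is the sample from the question; the
# second is invented here to show what happens when a field is missing.
docs <- list(
  question_sample = c("** Date of Birth : 12/12/2001**",
                      "** Place : AAAAAAAA**",
                      "** Address: 111, BLOCK 1,**",
                      "** XYZ LOCALITY**",
                      "** BANGKOK **",
                      "** Email id: xyz@yahoo.in**"),
  hypothetical_doc = c("** Date of Birth : 01/01/1990**",
                       "** Address: 5, SOME STREET**")
)

# Labels to keep, in the column order of the desired output
wanted <- c("Date of Birth", "Address", "Place")

# Pick the wanted fields from each document; missing fields become NA
rows <- lapply(docs, function(lines) parse_fields(lines)[wanted])
mat <- do.call(rbind, rows)
result <- as.data.frame(mat, stringsAsFactors = FALSE)
names(result) <- c("Date", "Address", "Place")

# Write a CSV file that Excel can open (written to a temp dir here)
out_path <- file.path(tempdir(), "extracted.csv")
write.csv(result, out_path, row.names = FALSE)
print(result)
```

Splitting on the first colon only keeps values such as times or URLs intact, which the `strsplit(x, ":")` step above would cut short. If a real `.xlsx` file is needed rather than a CSV, a package such as writexl (`write_xlsx()`) could replace the `write.csv()` call.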