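I am trying to extract tables from a .pdf file using R. I tried the tabulizer package, which extracts the tables into one large list. I would like to go a step further: clean up these tables (they are all different) and put them into a tibble (or data.frame).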
# In case you don't have the tabulizer package, the steps below are needed
install.packages("rJava")
library(rJava) # load and attach 'rJava' now
install.packages("devtools")
devtools::install_github("ropensci/tabulizer", args="--no-multiarch")
library(tabulizer)
# Set the path to the file
file <- "https://www.sdccu.com/CURates/HomeLoanRates.pdf"
# Extract the tables
mortgagerates <- extract_tables(file, encoding = "UTF-8")
# First table from the third page
mortgagerates[[7]]

This is the output of the last line of code:
> mortgagerates[[7]]
[,1]
[1,] "ADJUSTABLE RATE MORTGAGES: JUMBO LOANS $453,101 TO $1,500,000
(Purchase or Refinance)"
[2,] "Available for all counties:"
[3,] " Purchases or refinances up to 95% LTV with a maximum loan amount of
$679,650. Cash-out refinances up to 70% LTV."
[4,] ""
[5,] " Purchases or refinances up to 80% LTV with a maximum loan amount of
$1,500,000."
[6,] "Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR)
Points Per $1,000 Borrowed Estimated Payment"
[7,] "5/1 CMT 3.500% 4.394% 0.000 $4.49"
[8,] "7/1 CMT 3.750% 4.358% 0.000 $4.63"
[9,] "3.500% 4.322% 1.000 $4.49"

What is the best way to tidy this into a tibble that resembles the table in the actual PDF document?

[Image: the desired table]

Update: here is the output from dput(mortgagerates[7])
> file
[,1]
[1,] "ADJUSTABLE RATE MORTGAGES: JUMBO LOANS $453,101 TO $1,500,000
(Purchase or Refinance)"
[2,] "Available for all counties:"
[3,] " Purchases or refinances up to 95% LTV with a maximum loan amount of
$679,650. Cash-out refinances up to 70% LTV."
[4,] ""
[5,] " Purchases or refinances up to 80% LTV with a maximum loan amount of
$1,500,000."
[6,] "Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR)
Points
Per $1,000 Borrowed Estimated Payment"
[7,] "5/1 CMT 3.500% 4.394% 0.000 $4.49"
[8,] "7/1 CMT 3.750% 4.358% 0.000 $4.63"
[9,] "3.500% 4.322% 1.000 $4.49"

Answered 2018-04-26 00:12:13
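The layout of the tables in this file is too complex for them to be extracted automatically without more input. The way to handle this with tabulizer is to supply the area of the page that contains the table. For this particular table, you can do the following: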
file <- "https://www.sdccu.com/CURates/HomeLoanRates.pdf"
# Select the table region interactively on page 3
area <- locate_areas(file, pages = 3)
> area
[[1]]
top left bottom right
442.20975 30.50972 549.83752 592.01857
# Extract only that region, with column guessing turned off
mortgagerates <- extract_tables(file, pages = 3, area = area, guess = FALSE)

This gives:
> as.data.frame(mortgagerates[[1]])
V1 V2 V3 V4 V5
1 Annual Percentage Loans Amortized Over 30 Years. Rate Rate (APR) Points Estimated Payment Per $1,000 Borrowed
2 5/1 CMT 3.625% 4.439% 0.000 $4.56
3 7/1 CMT 3.875% 4.417% 0.000 $4.70
4 3.625% 4.381% 1.000 $4.56

Source: https://stackoverflow.com/questions/49921586
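To get from the extracted character matrix to the tidy data frame the question asks for, something along these lines might work. It is only a sketch: the column names (`product`, `rate`, `apr`, `points`, `payment_per_1000`) and the helper `clean_rates()` are made up for illustration, and the way the header text is split across the five columns in the example input is an assumption, since the printed output above does not show the column boundaries. The data rows are copied from that output.

```r
# Example input mimicking mortgagerates[[1]] from the answer above.
# ASSUMPTION: the split of the header row across columns is a guess.
raw <- matrix(
  c("Loans Amortized Over 30 Years.", "Rate", "Annual Percentage Rate (APR)",
    "Points", "Estimated Payment Per $1,000 Borrowed",
    "5/1 CMT", "3.625%", "4.439%", "0.000", "$4.56",
    "7/1 CMT", "3.875%", "4.417%", "0.000", "$4.70",
    "",        "3.625%", "4.381%", "1.000", "$4.56"),
  ncol = 5, byrow = TRUE
)

# Hypothetical helper: drops the header row, assigns short column names,
# and converts the "%" and "$" strings to numbers.
clean_rates <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- c("product", "rate", "apr", "points", "payment_per_1000")

  # The extraction leaves the product blank on some rows; mark those as NA
  # rather than guessing which product they belong to.
  df$product[df$product == ""] <- NA

  # Strip percent signs, dollar signs, and thousands separators, then convert.
  to_num <- function(x) as.numeric(gsub("[%$,]", "", x))
  num_cols <- c("rate", "apr", "points", "payment_per_1000")
  df[num_cols] <- lapply(df[num_cols], to_num)

  df
}

rates <- clean_rates(raw)
print(rates)
```

If a tibble is preferred, `tibble::as_tibble(rates)` converts the result. Since every table in the PDF is laid out differently, the column names and the number of header rows to drop would likely need adjusting per table.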