I used tabulizer in R to extract some tables from a PDF. Below is the code for one of the tables:
library(tabulizer)
# Build the URL with paste0() so the string contains no embedded line breaks
location <- paste0(
  "http://napic.jpph.gov.my/portal/web/guest/main-page?",
  "p_p_id=ViewPublishings_WAR_ViewPublishingsportlet&",
  "p_p_lifecycle=2&",
  "p_p_state=normal&",
  "p_p_mode=view&",
  "p_p_resource_id=fileDownload&",
  "p_p_cacheability=cacheLevelPage&",
  "p_p_col_id=column-2&",
  "p_p_col_pos=1&",
  "p_p_col_count=2&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_publishingId=433&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_action=renderReportPeriodScreen&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_language=&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_pageno=1&",
  "publishingId=4537"
)
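out <- extract_tables(location, pages = 3)

The extracted output has some oddities: for example, the table is split into two parts, and some of the data is not separated correctly.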
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] " Review " "States " "Single " "2 - 3 " "Single " "2 - 3 " "Detach " "Town " "Cluster " "Low " "Low " "Flat " "Condo- " "Total"
[2,] "Period " "" "Storey " "Storey " "Storey " "Storey " "" "House " "" "Cost " "Cost " "" "minium/" ""
[3,] "" "" "Terrace " "Terrace " "Semi- " "Semi- " "" "" "" "House " "Flat " "" "Apart-" ""
[4,] "" "" "" "" "Detach " "Detach " "" "" "" "" "" "" "ment" ""
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "EXISTING STOCK " "" "" "" "" "" "" "" "" "" "" "" ""
[2,] "" "" "" "" "" "" "" "" "" "" "" "" ""
[3,] "Q3 2016P WP Kuala Lumpur 21,574 " "" "66,286 " "466 " "5,968 " "7,098 " "4,671 " "4,248 " "3,786 " "95,647 " "50,156 " "163,119 " "423,019"
[4,] "WP Putrajaya 0 " "" "2,102 " "0 " "991 " "203 " "96 " "0 " "0 " "2,538 " "0 " "1,785 " "7,715"
[5,] "WP Labuan 835 " "" "1,044 " "70 " "944 " "5,686 " "11 " "0 " "966 " "680 " "1,300 " "225 " "11,761"

The output I am looking for should be close to the original table in the PDF.

I'm stumped at this point. If anyone could point me in the right direction, I would be grateful. Thanks in advance.
Posted on 2021-05-07 10:02:43
Try:
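locate_areas(file, pages = NULL, resolution = 60L, widget = c("shiny", "native", "reduced"), copy = FALSE)
to find the area you want to extract. After that, you need to process the data yourself to get what you want. This is currently the only way to do it with tabulizer. Regards.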
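A minimal sketch of both steps, offered as one possible approach rather than a definitive fix. All helper names (extract_selected_area, collapse_header, split_first_cell, fill_down, to_number) are hypothetical and not part of tabulizer. The sample matrices are a cut-down copy of the output in the question (first few columns only), and the regular expressions assume the "Q3 2016P" period format seen there. The interactive extraction is wrapped in a function that is defined but not called, since it needs tabulizer, Java and access to the PDF.

```r
# Interactive step (not run here): select the table region by hand, then
# extract only that region with column guessing turned off.
extract_selected_area <- function(file, page) {
  areas <- tabulizer::locate_areas(file, pages = page)
  tabulizer::extract_tables(file, pages = page, area = areas, guess = FALSE)
}

# Cut-down copy of the two matrices from the question.
header <- rbind(
  c(" Review ", "States ", "Single ",  "2 - 3 ",   "Single "),
  c("Period ",  "",        "Storey ",  "Storey ",  "Storey "),
  c("",         "",        "Terrace ", "Terrace ", "Semi- "),
  c("",         "",        "",         "",         "Detach ")
)
body <- rbind(
  c("Q3 2016P WP Kuala Lumpur 21,574 ", "", "66,286 ", "466 "),
  c("WP Putrajaya 0 ",                  "", "2,102 ",  "0 "),
  c("WP Labuan 835 ",                   "", "1,044 ",  "70 ")
)

# Join the non-empty pieces of each header column into a single label.
collapse_header <- function(m) {
  apply(m, 2, function(col) {
    parts <- trimws(col)
    paste(parts[nzchar(parts)], collapse = " ")
  })
}

# Convert strings such as "21,574 " to numbers.
to_number <- function(x) as.numeric(gsub(",", "", trimws(x)))

# Split the fused first cell into period, state and the first value.
split_first_cell <- function(x) {
  x <- trimws(x)
  value <- sub("^.* ([0-9,]+)$", "\\1", x)      # trailing number
  rest <- trimws(sub(" [0-9,]+$", "", x))       # everything before it
  quarter <- "^(Q[1-4] [0-9]{4}P?) +"           # assumed period format
  has_period <- grepl(quarter, rest)
  period <- ifelse(has_period,
                   sub(paste0(quarter, ".*$"), "\\1", rest),
                   NA_character_)
  state <- sub(quarter, "", rest)
  data.frame(period = period, state = state, value = to_number(value),
             stringsAsFactors = FALSE)
}

# Carry the last seen period down to the rows that lack one.
fill_down <- function(x) {
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}

first <- split_first_cell(body[, 1])
# Column 2 of the body is empty, so the remaining values start at column 3.
values <- cbind(first$value, apply(body[, -(1:2), drop = FALSE], 2, to_number))
tidy <- data.frame(fill_down(first$period), first$state, values,
                   stringsAsFactors = FALSE)
names(tidy) <- collapse_header(header)
print(tidy)
```

On the real page the same idea would apply to out[[1]] (header) and out[[2]] (body), after dropping the "EXISTING STOCK" and empty rows. Labels that are hyphenated across header rows, such as "Semi- Detach", are left as they come out and may need manual cleanup.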
https://stackoverflow.com/questions/42294770