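I used R to convert some .pdf files to .txt files, and I'm having a hard time finding a way to parse them so that I can eventually build a data frame. I'm new to text mining, so please forgive my ignorance.
This is the format of the .txt file; I'm mainly interested in the numbers and the headers. Any suggestions are greatly appreciated.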
Township of Buena Vista
General Election Results - November 2, 2010
Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk
Township Committee Public Count
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Ward Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
I would like the output to be in tabular form, such as
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
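D-6 77 70 109 108 188 188
except with each name and party affiliation combined into a single string. The goal is to merge this with other similar files to build a dataset.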
Answered 2021-07-13 22:31:51
It's always going to be ugly, but this should be automatable to some degree:
# read it in as individual lines
rl <- readLines(textConnection(txt))
# drop all the extra info at top
rl <- rl[-(1:9)]
# just keep header
dist <- which(rl == "District")
hd <- head(rl, dist - 1)
# make everything same length and split characters
hd <- lapply(strsplit(hd, ""), `length<-`, max(nchar(hd)))
hd <- lapply(hd, function(x) replace(x, is.na(x), " "))
# find where spaces are in common in all rows
wdths <- rle(Reduce(`&`, lapply(hd, `==`, " ")))$lengths
# read it all in, ignoring district row
out <- read.fwf(textConnection(rl[-dist]), widths=wdths )
# keep those columns that aren't all NA
out <- out[!sapply(out, function(x) all(is.na(x)) )]
# collapse the header
hdr <- sapply(head(out, dist - 1),
  function(x) trimws(gsub("\\s+", " ", paste(na.omit(x), collapse=" "))))
# finalise by joining
setNames(
  data.frame(lapply(tail(out, -(dist-1)), type.convert, as.is=TRUE)),
  hdr
)
Result:
#  Ward Peter C. Bylone, Sr. Democratic Richard Harlan Democratic
#1  D-1                             205                       195
#2  D-2                             202                       160
#3  D-3                             331                       346
#4  D-4                             215                       205
#5  D-5                             104                        95
#6  D-6                              77                        70
#  Mary Ann Micheletti- Levari Republican Henry L. Coia, Jr. Republican
#1                                    230                           223
#2                                    275                           261
#3                                     99                            87
#4                                    164                           152
#5                                    169                           166
#6                                    109                           108
#  Total Machine Count Total Vote By Mail Total Provisional Count
#1                 436                113                      16
#2                 459                 NA                      NA
#3                 457                 NA                      NA
#4                 377                 NA                      NA
#5                 271                 NA                      NA
#6                 188                 NA                      NA
#  Total Emergency Count Total Public Count
#1                    NA                565
#2                    NA                459
#3                    NA                457
#4                    NA                377
#5                    NA                271
#6                    NA                188
The example txt used was:
" Township of Buena Vista\n General Election Results - November 2, 2010\n Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk\n\n\n\n\n Township Committee Public Count\n\n Mary Ann\n Peter C. Richard Henry L. Total Total Total Total Total\n Micheletti-\n Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public\n Levari\nWard Democratic Democratic Republican Count Mail Count Count Count\n Republican\nDistrict\n D-1 205 195 230 223 436 113 16 565\n D-2 202 160 275 261 459 459\n D-3 331 346 99 87 457 457\n D-4 215 205 164 152 377 377\n D-5 104 95 169 166 271 271\n D-6 77 70 109 108 188 188"发布于 2021-07-13 22:12:42
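Perhaps you could generalize this approach, but I suspect it is quite fragile when used with data other than your example.
I put your example into a file named example.txt.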
library(tidyverse)
input <- read_lines("example.txt")
# keep the lines from "District" onward
input[as.logical(cumsum(input == "District"))] %>%
  tibble() %>%
  # drop the "District" line itself
  slice(-1) %>%
  # turn each run of 9 to 12 whitespace characters into a ";" delimiter
  mutate(count = str_replace_all(string = ., "\\s{9,12}", ";")) %>%
  select(-.) %>%
  separate(col = count, into = c("District", as.character(1:9)), sep = ";") %>%
  mutate(across(everything(), str_trim),
         across(as.character(1:9), as.integer))
Returns:
# A tibble: 6 x 10
  District   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`
  <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 D-1        205   195   230   223   436   113    16    NA   565
2 D-2        202   160   275   261   459    NA    NA    NA   459
3 D-3        331   346    99    87   457    NA    NA    NA   457
4 D-4        215   205   164   152   377    NA    NA    NA   377
5 D-5        104    95   169   166   271    NA    NA    NA   271
6 D-6         77    70   109   108   188    NA    NA    NA   188
Creating the column names (the candidates' names) is the tricky part. Depending on the counts, you may need to adjust the whitespace pattern that gets replaced with ";": \\s{9,12} means that each run of at least 9 and at most 12 whitespace characters is replaced.
https://stackoverflow.com/questions/68369401