Scraping .txt files for many similarly formatted files

Stack Overflow user
Asked 2021-07-13 21:14:11
Answers: 2 | Views: 51 | Followers: 0 | Votes: 1

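I used R to convert some .pdf files into .txt files, and I am having a hard time finding a way to scrape them so that I can eventually build a data frame. I am new to working with text, so please forgive my ignorance.

This is the format of the .txt files. I am mostly interested in the numbers and the headers. Any suggestions are much appreciated.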

   Township of Buena Vista
                                                                             General Election Results - November 2, 2010
                                                                 Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk




                           Township Committee                                                Public Count

                                       Mary Ann
            Peter C.      Richard                    Henry L.      Total         Total          Total          Total           Total
                                      Micheletti-
           Bylone, Sr.     Harlan                    Coia, Jr.    Machine       Vote By      Provisional     Emergency         Public
                                        Levari
Ward       Democratic    Democratic                 Republican     Count         Mail          Count           Count           Count
                                      Republican
District
 D-1          205           195          230           223          436           113            16                            565
 D-2          202           160          275           261          459                                                        459
 D-3          331           346          99            87           457                                                        457
 D-4          215           205          164           152          377                                                        377
 D-5          104           95           169           166          271                                                        271
 D-6          77            70           109           108          188                                                        188

I would like the output to be in tabular form, like

                               Mary Ann
            Peter C.      Richard                    Henry L.      Total         Total          Total          Total           Total
                                      Micheletti-
           Bylone, Sr.     Harlan                    Coia, Jr.    Machine       Vote By      Provisional     Emergency         Public
                                        Levari
           Democratic    Democratic                 Republican     Count         Mail          Count           Count           Count
                                      Republican
District
 D-1          205           195          230           223          436           113            16                            565
 D-2          202           160          275           261          459                                                        459
 D-3          331           346          99            87           457                                                        457
 D-4          215           205          164           152          377                                                        377
 D-5          104           95           169           166          271                                                        271
 D-6          77            70           109           108          188                                                        188

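except with each name and party affiliation combined into a single string. The goal is to merge this with other similar files to create a dataset.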


2 Answers

Stack Overflow user

Accepted answer

Answered 2021-07-13 22:31:51

It's always going to be ugly, but this should be somewhat automated:

# read it in as individual lines
rl <- readLines(textConnection(txt))
# drop all the extra info at top
rl <- rl[-(1:9)]

# just keep header
dist <- which(rl == "District")
hd <- head(rl, dist - 1)

# make everything same length and split characters
hd <- lapply(strsplit(hd, ""), `length<-`, max(nchar(hd)))
hd <- lapply(hd, function(x) replace(x, is.na(x), " "))

# find where spaces are in common in all rows
wdths <- rle(Reduce(`&`, lapply(hd, `==`, " ")))$lengths

# read it all in, ignoring district row
out <- read.fwf(textConnection(rl[-dist]), widths=wdths )
# keep those columns that aren't all NA 
out <- out[!sapply(out, function(x) all(is.na(x)) )]

# collapse the header
hdr <- sapply(head(out, dist - 1), 
       function(x) trimws(gsub("\\s+", " ", paste(na.omit(x), collapse=" "))))

# finalise by joining
setNames(
  data.frame(lapply(tail(out, -(dist-1)), type.convert, as.is=TRUE)),
  hdr
)

Result:

#  Ward Peter C. Bylone, Sr. Democratic Richard Harlan Democratic
#1  D-1                             205                       195
#2  D-2                             202                       160
#3  D-3                             331                       346
#4  D-4                             215                       205
#5  D-5                             104                        95
#6  D-6                              77                        70
#  Mary Ann Micheletti- Levari Republican Henry L. Coia, Jr. Republican
#1                                    230                           223
#2                                    275                           261
#3                                     99                            87
#4                                    164                           152
#5                                    169                           166
#6                                    109                           108
#  Total Machine Count Total Vote By Mail Total Provisional Count
#1                 436                113                      16
#2                 459                 NA                      NA
#3                 457                 NA                      NA
#4                 377                 NA                      NA
#5                 271                 NA                      NA
#6                 188                 NA                      NA
#  Total Emergency Count Total Public Count
#1                    NA                565
#2                    NA                459
#3                    NA                457
#4                    NA                377
#5                    NA                271
#6                    NA                188
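
To see the width detection from the code above in isolation, here is a minimal sketch on a toy two-line header (made up for illustration, not taken from the election file). It slices with `substring` rather than `read.fwf`, and it assumes both lines already have the same length, which the code above ensures by padding.

```r
# Toy header (hypothetical): two columns whose text spans two lines
hd <- c("Peter C.   Total",
        "Bylone     Count")

# TRUE at character positions that are blank in every header line
chars <- strsplit(hd, "")
blank <- Reduce(`&`, lapply(chars, `==`, " "))

# Alternating runs of text / shared-blank positions give the fixed widths
runs <- rle(blank)
runs$lengths
# 8 3 5

# Slice each line at the text runs and glue the pieces together per column
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
keep   <- !runs$values
pieces <- lapply(hd, substring, starts[keep], ends[keep])
header <- trimws(do.call(paste, pieces))
header
# "Peter C. Bylone" "Total Count"
```

The single space inside "Peter C." does not split the column, because that position is not blank in the second line. Only positions blank in every line count as separators.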

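The hard-coded `rl[-(1:9)]` fits this one file, but other files may have a different number of title lines. A possible way to locate the header block is sketched below. `find_header_start` is a hypothetical helper, and it rests on the assumption that the header block contains no blank lines and is preceded by one.

```r
# Hypothetical helper: first line of the header block, taken to be the line
# after the last blank line that precedes the "District" row (an assumption)
find_header_start <- function(rl) {
  dist <- which(trimws(rl) == "District")[1]
  if (is.na(dist)) stop("no 'District' line found")
  blank_before <- which(trimws(rl[seq_len(dist - 1)]) == "")
  if (length(blank_before) == 0) 1L else max(blank_before) + 1L
}

# Toy input (made up): title lines, then a two-line header and one data row
toy <- c("Township of Somewhere", "", "Township Committee", "",
         "Jane", "Doe", "District", " D-1    10")
start <- find_header_start(toy)
start
# 5
toy[start:length(toy)]
# "Jane" "Doe" "District" " D-1    10"
```

With something like this, `rl <- rl[-(1:9)]` could become `rl <- rl[find_header_start(rl):length(rl)]`. On the sample `txt` the last blank line before "District" is line 9, so the result should be the same.
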
Where the sample txt used was:

"   Township of Buena Vista\n                                                                             General Election Results - November 2, 2010\n                                                                 Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk\n\n\n\n\n                           Township Committee                                                Public Count\n\n                                       Mary Ann\n            Peter C.      Richard                    Henry L.      Total         Total          Total          Total           Total\n                                      Micheletti-\n           Bylone, Sr.     Harlan                    Coia, Jr.    Machine       Vote By      Provisional     Emergency         Public\n                                        Levari\nWard       Democratic    Democratic                 Republican     Count         Mail          Count           Count           Count\n                                      Republican\nDistrict\n D-1          205           195          230           223          436           113            16                            565\n D-2          202           160          275           261          459                                                        459\n D-3          331           346          99            87           457                                                        457\n D-4          215           205          164           152          377                                                        377\n D-5          104           95           169           166          271                                                        271\n D-6          77            70           109           108          188                                                        188"
Votes: 1

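Stack Overflow user

Answered 2021-07-13 22:12:42

Perhaps you can generalize this approach, but I think it is quite unstable when used with data other than the sample data.

I put your example into a file named example.txt.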

library(tidyverse)

input <- read_lines("example.txt")

input[as.logical(cumsum(input == "District"))] %>% 
  tibble() %>% 
  slice(-1) %>% 
  mutate(count = str_replace_all(string = ., "\\s{9,12}", ";")) %>%
  select(-.) %>% 
  separate(col = count, into = c("District", as.character(1:9)), sep = ";") %>% 
  mutate(across(everything(), str_trim),
         across(as.character(1:9), as.integer))

This returns

# A tibble: 6 x 10
  District   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`
  <chr>    <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 D-1        205   195   230   223   436   113    16    NA   565
2 D-2        202   160   275   261   459    NA    NA    NA   459
3 D-3        331   346    99    87   457    NA    NA    NA   457
4 D-4        215   205   164   152   377    NA    NA    NA   377
5 D-5        104    95   169   166   271    NA    NA    NA   271
6 D-6         77    70   109   108   188    NA    NA    NA   188

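Creating the column names (the candidate names) is the tricky part. Depending on the counts, it may be necessary to adjust the whitespace pattern that gets replaced with ";": \\s{9,12} means that runs of at least 9 and at most 12 whitespace characters are replaced.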
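
To illustrate why the bounds matter, here is a small sketch using base R `gsub` as a stand-in for `str_replace_all`. The gap widths are built with `strrep` so the space counts are explicit. The 10, 11 and 28 space gaps mirror the D-1 row, while the 8 space gap is a made-up case and `split_line` is a hypothetical helper.

```r
# Hypothetical helper: replace runs of 9 to 12 whitespace characters with
# ";", split on ";", and trim what is left
split_line <- function(x) {
  trimws(strsplit(gsub("\\s{9,12}", ";", x), ";", fixed = TRUE)[[1]])
}
gap <- function(n) strrep(" ", n)

# Gaps of 10 and 11 spaces each become one separator
line_ok <- paste0(" D-1", gap(10), "205", gap(11), "195")
split_line(line_ok)
# "D-1" "205" "195"

# 28 spaces are matched as 12 + 12 with 4 left over: one empty field
line_gap <- paste0("16", gap(28), "565")
split_line(line_gap)
# "16" "" "565"

# 8 spaces are too few to match, so the two values stay in one field
line_narrow <- paste0("99", gap(8), "87")
split_line(line_narrow)
# "99        87"
```

The middle case is how the empty columns end up as NA in the tibble above. The last case is the kind of failure that would call for lowering the lower bound.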

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68369401
