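I have a txt file with more than 100,000 lines of data. I want to turn it into a data frame, but I don't need every line. A sample of the input looks like this: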
FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Yang, Qiang
Liu, Yang
Chen, Tianjian
Tong, Yongxin
TI Federated Machine Learning: Concept and Applications
SO ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
VL 10
IS 2
AR 12
DI 10.1145/3298981
DT Article
PD FEB 2019
PY 2019
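AB Today's artificial intelligence still faces two major challenges (...) etc.
I only want the lines that start with TI, AU, PD, and AB, extracted into correspondingly named columns. This is what I have so far, and I am really struggling!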
read.table("groupprojectdatabase.txt", header = FALSE, sep = ",", quote = "",
dec = ".", numerals = c("allow.loss"),
row.names = c("TI", "AU", "PB","AB"), col.names = c('title_col','author_col','date_col','summary_col'), as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = FALSE,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = FALSE,
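fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
Any help would be greatly appreciated, even if it is just the name of a function I should look up, or confirmation that I am on the right track. I think the sep = argument is relevant, but I can't figure out how to tell it to skip everything except the TI, AU, PD, and AB lines.
In particular, I don't know how to get R to treat a whole sentence as a single value rather than splitting it into individual words.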
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 4 elements
Posted on 2022-12-02 18:41:31
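I made a file test.txt from your data above. After running into some problems with read.table, I switched to readr::read_delim from the tidyverse.
This reads the file line by line. Each line is then split at the first whitespace, i.e. right after the first two letters.
Because four lines belong together (the ones that follow the two letters AU), the last part of the code below merges them into one.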
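library(tidyverse)

# No ";" occurs in the file, so every line is read as a single column.
# With col_names = TRUE, the first line of the file becomes that column's name.
df <- read_delim("path_to_your/test.txt", delim = ";", col_names = TRUE)

ddf <- df |>
  # Split each line at the first space: two-letter tag vs. the rest of the line.
  separate(`FN Clarivate Analytics Web of Science`,
           into = c("first", "rest"),
           sep = " ", extra = 'merge') |>
  # Continuation lines (which start with whitespace) get an empty tag; fill it from above.
  mutate(first = ifelse(first == "", NA, first)) |>
  fill(first) |>
  # Merge all lines that share a tag into one string.
  group_by(first) |>
  mutate(rest = paste0(rest, collapse = "")) |>
  distinct(first, .keep_all = TRUE)

# Keep only the tags of interest.
ddf |>
  filter(first %in% c('TI', 'AU', 'PD', 'AB'))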
#> # A tibble: 4 × 2
#> # Groups: first [4]
#> first rest
#> <chr> <chr>
#> 1 AU Yang, Qiang Liu, Yang Chen, Tianjian Tong, Yongxin
#> 2 TI Federated Machine Learning: Concept and Applications
#> 3 PD FEB 2019
#> 4 AB Today's artificial intelligence still faces two major challenges
https://stackoverflow.com/questions/74644656
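The pipeline above groups by tag over the whole file, so with more than one record all AU lines (and all TI lines, and so on) would end up merged into a single row. Below is a minimal base R sketch of one way to get one row per record with the named columns from the question. It rests on assumptions that are not confirmed by the post: every record starts with a PT line (as in the sample), a tag line begins with a two-character uppercase tag followed by a space, and continuation lines are indented. The helper name parse_wos and the second sample record are made up for illustration.

```r
# Hypothetical helper (not from the original post): one row per record.
parse_wos <- function(lines, keep = c("TI", "AU", "PD", "AB")) {
  lines <- lines[nzchar(trimws(lines))]               # drop blank lines

  # Assumption: tag lines look like "AU ..." or a bare tag such as "ER";
  # everything else continues the previous tag.
  is_tag  <- grepl("^[A-Z][A-Z0-9]( |$)", lines)
  started <- cumsum(is_tag) > 0                       # ignore anything before the first tag
  lines   <- lines[started]
  is_tag  <- is_tag[started]

  # Carry each tag down onto its continuation lines.
  tag   <- substr(lines, 1, 2)[is_tag][cumsum(is_tag)]
  value <- trimws(ifelse(is_tag, substring(lines, 3), lines))

  # Assumption: each record begins with a PT line.
  record <- cumsum(is_tag & tag == "PT")

  wanted <- tag %in% keep & record > 0
  d <- data.frame(record = record[wanted],
                  tag    = factor(tag[wanted], levels = keep),
                  value  = value[wanted],
                  stringsAsFactors = FALSE)

  # One cell per record/tag pair; authors joined with "; ", other fields with a space.
  wide <- tapply(seq_len(nrow(d)), list(d$record, d$tag), function(i) {
    paste(d$value[i], collapse = if (d$tag[i[1]] == "AU") "; " else " ")
  })

  out <- as.data.frame(wide, stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

# First record taken from the question; the second one is a made-up placeholder.
sample_lines <- c(
  "FN Clarivate Analytics Web of Science",
  "VR 1.0",
  "PT J",
  "AU Yang, Qiang",
  "   Liu, Yang",
  "   Chen, Tianjian",
  "   Tong, Yongxin",
  "TI Federated Machine Learning: Concept and Applications",
  "SO ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY",
  "PD FEB 2019",
  "AB Today's artificial intelligence still faces two major challenges",
  "ER",
  "PT J",
  "AU Doe, Jane",
  "TI A Hypothetical Second Record",
  "PD JAN 2020",
  "AB Placeholder abstract.",
  "ER"
)

papers <- parse_wos(sample_lines)
names(papers) <- c("title_col", "author_col", "date_col", "summary_col")
print(papers)

# For the real file, something like this should work (file name from the question):
# papers <- parse_wos(readLines("groupprojectdatabase.txt"))
```

If the continuation lines in the real file are not indented, a wrapped line that happens to begin with two capitals and a space (for example "AI is ...") would be mistaken for a new tag, so it is worth checking a few records by hand.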