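I have a PDF with several hundred pages. The PDF contains press releases of varying length (from one page to several pages).
However, every press release has a similar structure at its beginning and end:
Example header of a press release: OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017
Example footer of the same press release: 141028 Dez 17
Reading the PDF file into R is easy: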
df <- readtext("ots.pdf", encoding = "UTF8")
Here is a sample file:
structure(list(doc_id = "ots.pdf", text = "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\nOTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\nOTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\nOTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\nOTS0018 5 AI 0206 MAA0002 So, 10.Dez 2017\n\nText of press release 3\n\n\nOTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"), row.names = c(NA,
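-1L), class = c("readtext", "data.frame"))
But how do I tell R to read each press release as a new observation with the following three variables: id, date, text?
id = the OTS number of the press release, OTS0071 in the example above
date = the date the press release was published, Do, 14.Dez 2017 in the example above (i.e. Thursday, 14 December 2017)
text = the rest of the text between the header and the footer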
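I managed to extract all press releases and save them to a list with the following command:
x <- str_extract_all(df$text, "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})")
But how do I convert x (a list) into a data frame and add the variables id, date and text?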
Posted on 2022-07-30 13:15:52
I think I finally solved it myself.
Required packages:
require(pacman)
p_load(readtext,  # read files
       lubridate, # work with date-times and time-spans
       plyr,      # splitting, applying and combining data
       tidyverse  # data manipulation and plotting
)
First, read the PDF:
df <- readtext("ots.pdf", encoding = "UTF8")
Alternatively, use the sample dataset:
df <- structure(list(doc_id = "ots.pdf", text = "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\nOTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\nOTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\nOTS0184 2017-12-12/15:46\n\n121546 Dez 17\n\n\n\n\nOTS0018 5 AI 0206 MAA0002 So, 10.Dez 2017\n\nText of press release 3\n\n\nOTS0018 2017-12-10/12:00\n\n101200 Dez 17\n"), row.names = c(NA,
-1L), class = c("readtext", "data.frame"))
Second, extract the individual press releases from the text:
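# Each match runs from an OTS number to the next footer timestamp,
# which has six digits (e.g. "141028 Dez 17").
x <- str_extract_all(df$text, "(OTS[0-9]{4})((.|\n)*?)([[:digit:]]{6} [[:alpha:]]{3} [[:digit:]]{2})")
Third, convert the resulting list to a tibble with a single named column (here "pressReleases"):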
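# str_extract_all() returns a list with one character vector per input document;
# take the first element and store it as a one-column tibble.
df_tibble <- tibble(pressReleases = x[[1]])
Fourth, create the variables and drop the "pressReleases" column: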
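df_tibble <- df_tibble %>%
  # header date, e.g. "14.Dez 2017"; the dot is escaped so it matches literally
  mutate(date = str_extract(pressReleases, "[[:digit:]]{2}\\.[[:alpha:]]{3} [[:digit:]]{4}")) %>%
  # OTS number, e.g. "OTS0071"
  mutate(ots = str_extract(pressReleases, "OTS[0-9]{4}")) %>%
  # everything from the header date to the end of the match (date and footer included)
  mutate(text = str_extract(pressReleases, "([[:digit:]]{2}\\.[[:alpha:]]{3} [[:digit:]]{4})((.|\n)*)")) %>%
  select(-pressReleases)
Finally, replace the "\n" line breaks with spaces and convert the date column to Date format: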
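df_tibble$text <- gsub("\n", " ", df_tibble$text)
# dmy() reads month names in the session's LC_TIME locale; outside a German
# locale "Dez" may come back as NA, in which case set dmy()'s locale argument.
df_tibble$date <- dmy(df_tibble$date)
https://stackoverflow.com/questions/73174021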
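As a possible alternative to the steps above, here is a minimal sketch that pulls id, date and text out in one pass with capture groups, so that text holds only what sits between the header and the footer. It assumes every header line ends with a date like "14.Dez 2017" and every footer starts by repeating the OTS number, as in the sample data. The names month_abbr and parse_german_date are hypothetical helpers written for this illustration, and the table of German month abbreviations is an assumption that may need adjusting for the real file (for example the Austrian "Jän").

```r
library(stringr)
library(tibble)

# Sample text in the same shape as the example data (two releases only).
txt <- paste0(
  "OTS0071 5 AI 0339 MAA0001 Do, 14.Dez 2017\n\nText of press release 1\n\n\n\n",
  "OTS0071 2017-12-14/10:28\n\n141028 Dez 17\n\n\n\n\n",
  "OTS0184 5 AI 0120 MAA0001 Di, 12.Dez 2017\n\nText of press release 2\n\n\n\n",
  "OTS0184 2017-12-12/15:46\n\n121546 Dez 17\n"
)

# One pattern with three capture groups: id, header date, body.
# (?s) lets "." match newlines; \\1 requires the footer to repeat the id.
pattern <- paste0(
  "(?s)(OTS[0-9]{4})[^\\n]*?",                              # group 1: id, then other header fields
  "([0-9]{1,2}\\.[[:alpha:]]{3} [0-9]{4})\\s*",             # group 2: header date
  "(.*?)\\s*",                                              # group 3: body (lazy)
  "\\1 [0-9]{4}-[0-9]{2}-[0-9]{2}/[0-9]{2}:[0-9]{2}\\s+",   # footer line 1
  "[0-9]{6} [[:alpha:]]{3} [0-9]{2}"                        # footer line 2
)

# str_match_all() returns one matrix per input string:
# column 1 is the full match, columns 2-4 are the capture groups.
m <- str_match_all(txt, pattern)[[1]]

# Assumed German month abbreviations (hypothetical lookup table).
month_abbr <- c("Jan", "Feb", "M\u00e4r", "Apr", "Mai", "Jun",
                "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Hypothetical helper: parse "14.Dez 2017" without relying on the session locale.
parse_german_date <- function(x) {
  parts <- str_match(x, "([0-9]{1,2})\\.([[:alpha:]]{3}) ([0-9]{4})")
  month <- match(parts[, 3], month_abbr)
  iso <- sprintf("%s-%02d-%02d", parts[, 4], month, as.integer(parts[, 2]))
  as.Date(iso, format = "%Y-%m-%d")  # NA for unknown month abbreviations
}

releases <- tibble(
  id   = m[, 2],
  date = parse_german_date(m[, 3]),
  text = str_squish(m[, 4])  # collapse line breaks and repeated spaces
)

print(releases)
```

With the real file, df$text would take the place of txt. Because the date is parsed through a lookup table rather than through the locale, the result should not depend on the machine's language settings, at the cost of maintaining the abbreviation list by hand.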