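I want to capture parts of a string using stringr::str_match and rebus::capture, but I can't work out the right pattern.
The text may contain special characters. Something like this: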

Data:
df <- structure(list(ID = c(1, 1, 1, 2, 2), TEXT = c("VERIFIED DATE/TIME: 24/11/2018 16:23, VERIFIED PERSON IN CHARGE: JOHN",
"HISTORY aaaAAA# 111 FINDINGS Bb123 CONCLUSION 987CCC ccc654",
"DIAGNOSIS abc def hij", "VERIFIED DATE/TIME: 25/10/2018 16:23, VERIFIED PERSON IN CHARGE: Mary",
"HISTORY eeeEEE@ 111 FINDINGS Bb321 CONCLUSION 987FFF ggg654"
)), .Names = c("ID", "TEXT"), row.names = c(NA, 5L), class = "data.frame")
# ID TEXT
# 1 1 VERIFIED DATE/TIME: 24/11/2018 16:23, VERIFIED PERSON IN CHARGE: JOHN
# 2 1 HISTORY aaaAAA# 111 FINDINGS Bb123 CONCLUSION 987CCC ccc654
# 3 1 DIAGNOSIS abc def hij
# 4 2 VERIFIED DATE/TIME: 25/10/2018 16:23, VERIFIED PERSON IN CHARGE: Mary
# 5 2 HISTORY eeeEEE@ 111 FINDINGS Bb321 CONCLUSION 987FFF ggg654

Desired output: I want to split the text into separate columns:
df_out <- structure(list(c(1, 2),
c("24/11/2018 16:23 ", "25/10/2018 16:23 "),
c("JOHN", "Mary"),
c("aaaAAA# 111", "eeeEEE@ 111"),
c("Bb123", "Bb321"),
c("987CCC ccc654", "987FFF ggg654"),
c("abc def hij", NA)),
.Names = c("ID", "VERIFIED DATE/TIME", "VERIFIED PERSON IN CHARGE", "HISTORY", "FINDINGS", "CONCLUSION", "DIAGNOSIS"),
row.names = 1:2, class = "data.frame")
Code:
I tried the following code, but it gives me NA:
library(stringr)
library(rebus)
str_match(df$TEXT, pattern = "VERIFIED DATE/TIME:" %R%
capture(one_or_more(ANY_CHAR)) %R%
"VERIFIED PERSON IN CHARGE:" %R%
capture(one_or_more(ANY_CHAR)))

Posted on 2019-04-30 08:55:11
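Combine the tm and stringr libraries. First build one complete text per ID, then insert a comma before FINDINGS and CONCLUSION so that every field is separated the same way.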
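To see the comma insertion in isolation, here is a minimal base R sketch. The string `x` is a made-up stand-in for one ID's pasted text, not taken from the pipeline itself; the `gsub` pattern is the one used in this answer.

```r
# Toy input (assumption): one ID's rows already pasted together with ", ".
x <- "HISTORY aaaAAA# 111 FINDINGS Bb123 CONCLUSION 987CCC ccc654, DIAGNOSIS abc def hij"

# Groups 2 and 3 keep their leading space, so "\\1,\\2,\\3" places a comma
# directly before " FINDINGS" and before " CONCLUSION".
y <- gsub("(.*)( FINDINGS.*)( CONCLUSION.*)", "\\1,\\2,\\3", x)
print(y)
# [1] "HISTORY aaaAAA# 111, FINDINGS Bb123, CONCLUSION 987CCC ccc654, DIAGNOSIS abc def hij"
```

Rows that lack either keyword do not match the pattern, so `gsub` returns them unchanged.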
library(tm)
library(stringr)
library(dplyr)
df = df%>%group_by(ID)%>%summarise(TEXT=paste(TEXT,collapse=", "))%>%mutate(TEXT=gsub("(.*)( FINDINGS.*)( CONCLUSION.*)","\\1,\\2,\\3",TEXT))
> df
# A tibble: 2 x 2
ID TEXT
<dbl> <chr>
1 1 VERIFIED DATE/TIME: 24/11/2018 16:23, VERIFIED PERSON IN CHARGE: JOHN, HISTORY aaaAAA# 111, FINDINGS Bb123, CONCLUSION 987CCC ccc654, DIAGN~
2 2 VERIFIED DATE/TIME: 25/10/2018 16:23, VERIFIED PERSON IN CHARGE: Mary, HISTORY eeeEEE@ 111, FINDINGS Bb321, CONCLUSION 987FFF ggg654

Then define the field names we are interested in (they will become the column names) and remove them from the string.
titles = c("VERIFIED DATE/TIME: ","VERIFIED PERSON IN CHARGE: ","HISTORY ","FINDINGS ","CONCLUSION ","DIAGNOSIS ")
df$TEXT = removeWords(df$TEXT,titles)
> df
# A tibble: 2 x 2
ID TEXT
<dbl> <chr>
1 1 24/11/2018 16:23, JOHN, aaaAAA# 111, Bb123, 987CCC ccc654, abc def hij
2 2 25/10/2018 16:23, Mary, eeeEEE@ 111, Bb321, 987FFF ggg654

Finally, split the text into columns on ", " and set the column names.
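One detail of the splitting step is worth a quick check: `str_split_fixed` pads rows that have fewer pieces with empty strings, not `NA`. The sketch below uses a shortened, made-up input and a hypothetical variable `m`. The final replacement step is not part of this answer; it is one possible way to get `NA` for a missing field, as the question's desired output shows for DIAGNOSIS.

```r
library(stringr)

# Shortened toy input (assumption): only two of three expected fields present.
m <- str_split_fixed("25/10/2018 16:23, Mary", ", ", 3)
print(m)  # third column is "" because there was nothing left to split

# Optional extra step, not in the answer: turn the padding into NA.
m[m == ""] <- NA
print(m)
```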
df_fin=str_split_fixed(df$TEXT, ", ",6)
colnames(df_fin)=titles
> df_fin
VERIFIED DATE/TIME: VERIFIED PERSON IN CHARGE: HISTORY FINDINGS CONCLUSION DIAGNOSIS
[1,] "24/11/2018 16:23" "JOHN" "aaaAAA# 111" "Bb123" "987CCC ccc654" "abc def hij"
[2,] "25/10/2018 16:23" "Mary" "eeeEEE@ 111" "Bb321" "987FFF ggg654" ""

Posted on 2019-05-01 05:52:18
Here is an approach using stringr:
library(tidyr)
library(dplyr)
library(stringr)
df2 <- df %>%
group_by(ID) %>%
summarise(conc_text = paste(TEXT, collapse = ", ")) %>%
mutate(verified_date = apply(str_match(conc_text, "VERIFIED DATE/TIME: (.*?),"), 1, FUN = function(x) x[2]),
verified_person = apply(str_match(conc_text, "VERIFIED PERSON IN CHARGE: (.*?),"), 1, FUN = function(x) x[2]),
history = apply(str_match(conc_text, "HISTORY (.*?[0-9]{3})"), 1, FUN = function(x) x[2]),
findings = apply(str_match(conc_text, "FINDINGS (.*?[0-9]{3})"), 1, FUN = function(x) x[2]),
conclusions = apply(str_match(conc_text, "CONCLUSION (.*[0-9]{3})"), 1, FUN = function(x) x[2]),
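diagnosis = apply(str_match(conc_text, "DIAGNOSIS (.*$)"), 1, FUN = function(x) x[2]))

First, the text is concatenated by ID.
This assumes that the HISTORY, FINDINGS, and CONCLUSION values each end in three digits, hence the [0-9]{3} expression. The apply function is used to pull the captured string out of each match.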
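As a side note, `str_match` returns a character matrix whose first column is the full match and whose second column is the first capture group, so indexing with `[, 2]` should give the same values as the `apply` call. A minimal sketch, with `conc_text` written out by hand here to mirror the concatenated text (an assumption for illustration, not output of the pipeline above):

```r
library(stringr)

# Hand-written stand-in for the per-ID concatenated text (assumption).
conc_text <- c(
  "VERIFIED DATE/TIME: 24/11/2018 16:23, VERIFIED PERSON IN CHARGE: JOHN, HISTORY aaaAAA# 111 FINDINGS Bb123 CONCLUSION 987CCC ccc654, DIAGNOSIS abc def hij",
  "VERIFIED DATE/TIME: 25/10/2018 16:23, VERIFIED PERSON IN CHARGE: Mary, HISTORY eeeEEE@ 111 FINDINGS Bb321 CONCLUSION 987FFF ggg654"
)

# Column 2 of the match matrix holds the first capture group.
history   <- str_match(conc_text, "HISTORY (.*?[0-9]{3})")[, 2]
findings  <- str_match(conc_text, "FINDINGS (.*?[0-9]{3})")[, 2]
diagnosis <- str_match(conc_text, "DIAGNOSIS (.*$)")[, 2]

print(history)    # "aaaAAA# 111" "eeeEEE@ 111"
print(findings)   # "Bb123" "Bb321"
print(diagnosis)  # "abc def hij" NA  (no DIAGNOSIS for ID 2)
```

Where a pattern does not match, `str_match` returns `NA` for that row, which is how the missing DIAGNOSIS for ID 2 ends up as `NA` without extra handling.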
https://stackoverflow.com/questions/55916399