
Q: Extract n words around a defined term (multicase)

Stack Overflow user

Asked 2018-02-11 02:03:17

Answers: 1 · Views: 113 · Followers: 0 · Votes: 1

I have a vector of text strings, such as:

Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough.Let me help you with your baggage.",
               "Everyone was busy, so I went to the movie alone. Two seats were vacant.",
               "TEST Rock music approaches at high velocity.",
               "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
               "A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
               "Rock music approaches at high velocity TEST.")

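I would like to extract n (e.g. 3) words around (i.e., before and after) a specific term (e.g. 'TEST'), where a word is a run of characters delimited by whitespace. In addition, multiple matches should be allowed: if the term occurs more than once in a string, the solution should capture every occurrence.

The result might look like this (the format can be improved):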

S1  <- c(before = "the promotion, but", after = "my attendance wasn’t")
S2  <- c(before = "",                   after = "")
S3  <- c(before = "",                   after = "Rock music approaches")
S4a <- c(before = "to take your",       after = "donation; any amount")
S4b <- c(before = "will be greatly",    after = "appreciated.")
S5a <- c(before = "a green donkey",     after = "flew a TEST")
S5b <- c(before = "TEST flew a",        after = "kite in the")
S6  <- c(before = "at high velocity",   after = "")

How can I do this? I have found other posts, but they either handle only single matches or rely on fixed sentence structures.

text

text-mining

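1 Answer

Stack Overflow user

Accepted answer

Answered 2018-02-11 09:50:33

The quanteda package has a great function for this: kwic() (keywords in context).

Out of the box, this works pretty well on your example: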

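library("quanteda")
names(Sentences) <- paste0("S", seq_along(Sentences))
# recent quanteda versions expect a tokens object here;
# older versions also accepted the character vector directly
(kw <- kwic(tokens(Sentences), "TEST", window = 3))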
# 
# [S1, 9]   promotion, but | TEST | my attendance wasn't 
# [S3, 1]                  | TEST | Rock music approaches
# [S4, 7]     to take your | TEST | donation; any        
# [S4, 15] will be greatly | TEST | appreciated.         
# [S5, 8]   a green donkey | TEST | flew a TEST          
# [S5, 11]     TEST flew a | TEST | kite in the          
# [S6, 7] at high velocity | TEST | .               

(kw2 <- as.data.frame(kw)[, c("docname", "pre", "post")])
#   docname              pre                  post
# 1      S1  promotion , but  my attendance wasn't
# 2      S3                  Rock music approaches
# 3      S4     to take your        donation ; any
# 4      S4  will be greatly         appreciated .
# 5      S5   a green donkey           flew a TEST
# 6      S5      TEST flew a           kite in the
# 7      S6 at high velocity                     .
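
Note that kwic() treats punctuation marks as separate tokens, which is why the windows read "promotion , but" rather than "the promotion, but" as in the question. If words strictly delimited by whitespace are needed, the same windowing idea can be sketched in base R. This is only an illustration, not part of quanteda: `words_around` is a hypothetical helper name, and it assumes a word counts as a match when it equals the term after punctuation is stripped (so "TEST." still matches).

```r
# Hypothetical helper (not from quanteda): n whitespace-delimited words
# before and after every occurrence of `term` in a single string.
words_around <- function(text, term, n = 3) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]          # drop empty pieces from leading whitespace
  # Assumption: a word matches if it equals the term once punctuation is removed
  hits <- which(gsub("[[:punct:]]", "", words) == term)
  lapply(hits, function(i) {
    before <- words[seq_len(i - 1)]                  # everything left of the match
    after  <- words[seq_len(length(words) - i) + i]  # everything right of the match
    c(before = paste(tail(before, n), collapse = " "),
      after  = paste(head(after, n), collapse = " "))
  })
}

res <- words_around(
  "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
  "TEST"
)
res[[1]]  # before = "to take your", after = "donation; any amount"
res[[2]]  # before = "will be greatly", after = "appreciated."
```

Applied to the whole vector, something like `lapply(Sentences, words_around, term = "TEST", n = 3)` would return one list of matches per sentence, with an empty list where the term does not occur.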

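The kw2 data frame is probably a better format than the separate objects you asked for in the question. But to get as close as possible to your target, you can transform it further as follows.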

# this picks up the empty matching sentence S2
(kw3 <- merge(kw2, 
              data.frame(docname = names(Sentences), stringsAsFactors = FALSE), 
              all.y = TRUE))
# replaces the NA with the empty string
kw4 <- as.data.frame(lapply(kw3, function(x) { x[is.na(x)] <- ""; x} ), 
                     stringsAsFactors = FALSE)
# renames pre/post to before/after
names(kw4)[2:3] <- c("before", "after")
# makes the docname unique
kw4$docname <- make.unique(kw4$docname)

kw4
#   docname           before                 after
# 1      S1  promotion , but  my attendance wasn't
# 2      S2                                       
# 3      S3                  Rock music approaches
# 4      S4     to take your        donation ; any
# 5    S4.1  will be greatly         appreciated .
# 6      S5   a green donkey           flew a TEST
# 7    S5.1      TEST flew a           kite in the
# 8      S6 at high velocity                     .
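
If separate objects in the style of S1, S4a, S4b from the question are really wanted, the rows of such a data frame can be turned into a named list of `c(before = , after = )` vectors. A minimal sketch, assuming a data frame with the same columns as kw4; `kw_demo` is a hand-typed, hypothetical stand-in for a few of its rows so that the block runs without quanteda.

```r
# Hypothetical stand-in for a few rows of kw4 (typed by hand for illustration)
kw_demo <- data.frame(
  docname = c("S3", "S4", "S4.1"),
  before  = c("", "to take your", "will be greatly"),
  after   = c("Rock music approaches", "donation ; any", "appreciated ."),
  stringsAsFactors = FALSE
)

# Build one named character vector per row
vecs <- Map(function(b, a) c(before = b, after = a),
            kw_demo$before, kw_demo$after)
# Key the list by the (unique) document names
names(vecs) <- kw_demo$docname

vecs[["S4.1"]]  # before = "will be greatly", after = "appreciated ."
```

Keeping the vectors in one list, rather than assigning them to separate variables, makes it easier to loop over all matches later.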

Votes: 0

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/48727546
