Q: Select phrases found in a dictionary and return a data frame of doc_id and phrase

Stack Overflow user

Asked 2020-03-19 23:11:59

Answers: 2 · Views: 111 · Followers: 0 · Votes: 2

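I have a dictionary file of medical phrases and a corpus of raw text. I am trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are n-grams of 1 to 5 words. In the end, I would like the selected phrases in a data frame with two columns: doc_id, phrase

I have been trying to do this with the quanteda package, without success. Below is some code that reproduces my latest attempt. I would appreciate any suggestions. I have tried various approaches but keep getting only single-word matches.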

version  R version 3.6.2 (2019-12-12)
os       Windows 10 x64              
system   x86_64, mingw32             
ui       RStudio 
Packages:
dbplyr   1.4.2 
quanteda 1.5.2

library(quanteda)
library(dplyr)
raw <- data.frame("doc_id" = c("1", "2", "3"), 
                  "text" = c("diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.", 
                             "magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.", 
                             "radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."))

term = c("diffuse intrinsic pontine glioma", "brain tumors", "brain", "pontine glioma", "mri", "medical imaging", "radiology", "anatomy", "physiological processes", "radiation therapy", "radiotherapy", "cancer treatment", "malignant cells")
medTerms = list(term = term)
dict <- dictionary(medTerms)

corp <- raw %>% group_by(doc_id) %>% summarise(text = paste(text, collapse=" "))
corp <- corpus(corp, text_field = "text")

dfm <- dfm(corp,
           tolower = TRUE, stem = FALSE, remove_punct = TRUE,
           remove = stopwords("english"))
dfm <- dfm_select(dfm, pattern = phrase(dict))

What I ultimately want is something like this:

doc_id        term
1       diffuse intrinsic pontine glioma
1       pontine glioma
1       brain tumors
1       brain
2       mri
2       medical imaging
2       radiology
2       anatomy
2       physiological processes
3       radiation therapy
3       radiotherapy
3       cancer treatment
3       malignant cells

dictionary

corpus

quanteda

2 Answers

Stack Overflow user

Accepted answer

Answered 2020-03-20 00:42:35

If you want to match multi-word patterns from a dictionary, you can do so by building your dfm from ngrams.

library(quanteda)
library(dplyr)
library(tidyr)

raw$text <- as.character(raw$text) # you forgot to use stringsAsFactors = FALSE while constructing the data.frame, so I convert your factor to character before continuing
corp <- corpus(raw, text_field = "text")

dfm <- tokens(corp) %>% 
  tokens_ngrams(1:5) %>% # This is the new way of creating ngram dfms. 1:5 means to construct all from unigram to 5-grams
  dfm(tolower = TRUE, 
      stem = FALSE,
      remove_punct = TRUE) %>% # I wouldn't remove stopwords for this matching task
  dfm_select(pattern = dict)
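
To see why the ngram step makes multi-word matches possible, here is a minimal sketch on a made-up three-word text (the sample string is an assumption, not part of the question's data). `tokens_ngrams()` joins adjacent tokens with "_" by default, which is why the phrase features further down appear as `brain_tumors` rather than `brain tumors`.

```r
library(quanteda)

# Hypothetical one-line text, used only to illustrate the ngram step
toks <- tokens("brain tumors found")

# Build all unigrams and bigrams; multi-word ngrams are joined with "_"
grams <- as.character(tokens_ngrams(toks, n = 1:2))
print(grams)
```

Once a phrase such as "brain tumors" exists as a single feature, the dfm can be filtered against the dictionary in the same way as single words.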

Now we just need to convert the dfm to a data.frame and reshape it into long format:

convert(dfm, "data.frame") %>% 
  pivot_longer(-document, names_to = "term") %>% 
  filter(value > 0)
#> # A tibble: 13 x 3
#>    document term                             value
#>    <chr>    <chr>                            <dbl>
#>  1 1        brain                                2
#>  2 1        pontine_glioma                       1
#>  3 1        brain_tumors                         1
#>  4 1        diffuse_intrinsic_pontine_glioma     1
#>  5 2        mri                                  1
#>  6 2        radiology                            1
#>  7 2        anatomy                              1
#>  8 2        medical_imaging                      1
#>  9 2        physiological_processes              1
#> 10 3        radiotherapy                         1
#> 11 3        radiation_therapy                    1
#> 12 3        cancer_treatment                     1
#> 13 3        malignant_cells                      1

You can drop the value column, but you might be interested in it later.
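
To get exactly the two columns the question asks for, the long table can be trimmed and the underscores turned back into spaces. This is a minimal sketch, not part of the original answer: `wide` is a hypothetical stand-in for the result of `convert(dfm, "data.frame")`, with a few values copied from the output above. It assumes the id column is called `document`, as in that output; I believe later quanteda releases name it `doc_id`, so adjust accordingly.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for convert(dfm, "data.frame")
wide <- data.frame(
  document = c("1", "2"),
  brain = c(2, 0),
  pontine_glioma = c(1, 0),
  mri = c(0, 1),
  medical_imaging = c(0, 1),
  stringsAsFactors = FALSE
)

result <- wide %>%
  pivot_longer(-document, names_to = "term") %>%  # one row per document/term pair
  filter(value > 0) %>%                           # keep only terms that occur
  transmute(
    doc_id = document,
    term = gsub("_", " ", term, fixed = TRUE)     # restore the spaces in phrases
  )

print(result)
```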

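Votes: 2

Stack Overflow user

Answered 2020-03-20 02:11:43

You could form all ngrams of length 1 to 5 and then select from them, but for large texts that would be very inefficient. Here is a more direct way. I have reproduced the whole problem here with a few modifications (such as stringsAsFactors = FALSE and skipping some unnecessary steps).

Of course, this does not double-count terms the way your expected output does, but I suspect you do not actually want that. Why count "brain" if it occurred within "brain tumors"? You are better off counting "brain tumors" when it occurs as that phrase, and "brain" only when it occurs without "tumors". The code below does that.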

library(quanteda)
## Package version: 2.0.1

raw <- data.frame(
  "doc_id" = c("1", "2", "3"),
  "text" = c(
    "diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
    "magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
    "radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."
  ),
  stringsAsFactors = FALSE
)

dict <- dictionary(list(
  term = c(
    "diffuse intrinsic pontine glioma",
    "brain tumors", "brain", "pontine glioma", "mri", "medical imaging",
    "radiology", "anatomy", "physiological processes", "radiation therapy",
    "radiotherapy", "cancer treatment", "malignant cells"
  )
))

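The key to the answer is this: first use the dictionary to select the tokens, then concatenate them into phrases, then reshape the result so that each dictionary match becomes its own new "document". The last step creates the data.frame you want.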

toks <- corpus(raw) %>%
  tokens() %>%
  tokens_select(dict) %>% # select just dictionary values
  tokens_compound(dict, concatenator = " ") %>% # turn phrase into single "tokens"
  tokens_segment(pattern = "*") # make one token per "document"

# make into data.frame
data.frame(
  doc_id = docid(toks), term = as.character(toks),
  stringsAsFactors = FALSE
)
##    doc_id                             term
## 1       1 diffuse intrinsic pontine glioma
## 2       1                     brain tumors
## 3       1                            brain
## 4       2                              mri
## 5       2                  medical imaging
## 6       2                        radiology
## 7       2                          anatomy
## 8       2          physiological processes
## 9       3                radiation therapy
## 10      3                     radiotherapy
## 11      3                 cancer treatment
## 12      3                  malignant cells
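
If the nested matches from the question's expected output are wanted after all (for example "brain" inside "brain tumors"), one possible alternative is `kwic()`, which reports every position where a pattern matches. This is a hedged sketch rather than part of the original answer: the shortened texts and the `terms` vector are made up for illustration, and it assumes a quanteda version in which `kwic()` accepts a tokens object and returns `docname` and `keyword` columns.

```r
library(quanteda)

# Hypothetical, shortened versions of the question's data
raw_small <- data.frame(
  doc_id = c("1", "2"),
  text = c(
    "brain tumors found at the base of the brain.",
    "magnetic resonance imaging (mri) is a medical imaging technique."
  ),
  stringsAsFactors = FALSE
)
terms <- c("brain tumors", "brain", "mri")

toks_small <- tokens(corpus(raw_small), remove_punct = TRUE)

# phrase() makes multi-word terms match as token sequences
kw <- kwic(toks_small, pattern = phrase(terms), window = 1)

# Keep one row per document/term pair
result <- unique(data.frame(
  doc_id = as.character(kw$docname),
  term = as.character(kw$keyword),
  stringsAsFactors = FALSE
))
print(result)
```

Unlike the compound-and-segment approach above, this counts "brain" even where it is part of "brain tumors", so choose whichever behavior fits the analysis.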

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60759983
