
Extracting the previous and next lines around a keyword in PDFs using R

Stack Overflow user
Asked 2017-04-14 23:25:37
Answers: 1 · Views: 326 · Followers: 0 · Votes: 0

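I want to use R to extract information related to the keyword "cancer" from a list of PDFs.

Specifically, for every line or paragraph that contains the word "cancer" in the converted text files, I want to extract that line together with the lines (or paragraphs) before and after it.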

abstracts <- lapply(mytxtfiles, function(i) {
    j <- paste0(scan(i, what = character()), collapse = " ")
    regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl = TRUE))
})

The regular expression above does not work.


1 Answer

Stack Overflow user

Answered 2017-04-19 11:30:58

Here is one approach:

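library(textreadr)  # read_pdf()
library(tidyverse)  # %>% and slice()

# Return the indices of the elements of `var` that match `regex`, plus `n`
# neighbours on each side, clipped to the valid index range of `var`.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    # Add the offsets -n..n to every matching index
    out <- sort(unique(rep(locs, each = 2 * n + 1) + (-n):n))
    out <- out[out > 0]
    out[out <= length(var)]
}

# Read the PDF into a data frame and keep the rows whose text matches
# 'cancer', together with the rows directly before and after them.
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc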

##    page_id element_id                                                                                                                  text
## 1       24         28                              Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2       24         29                              partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3       24         30                                stresses that, in order for them to work, they should be voluntary, and the government
## 4       25          8                         the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5       25          9                             while an average estimate of the value of drugs to treat the country's cancer patients is
## 6       25         10                             $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7       25         12                           because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8       25         13                                                                              excise exemptions for anti-cancer drugs.
## 9       25         14                       Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19                              Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20                               anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21                             December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
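
The same windowing idea also works on plain text files once the text is held as a character vector with one element per line, for example from `readLines()`. Reading line by line matters here: `scan(i, what = character())` splits the input on all whitespace and `paste0(..., collapse = " ")` then joins everything into a single string, so no line breaks are left for a line-based regex to match; in addition, `[cancer]` is a character class that matches one of those letters, not the word. The following is only a minimal base-R sketch under those assumptions, not part of the original answer; `context_lines` and the sample `lines` vector are made-up names for illustration.

```r
# Hypothetical helper (not from the original answer): return every line that
# matches `pattern`, plus `n` lines of context on each side.
context_lines <- function(lines, pattern, n = 1, ignore.case = TRUE) {
  hits <- grep(pattern, lines, ignore.case = ignore.case)
  # Add the offsets -n..n to every matching line number
  idx <- sort(unique(rep(hits, each = 2 * n + 1) + (-n):n))
  # Drop line numbers that fall outside the document
  idx <- idx[idx >= 1 & idx <= length(lines)]
  lines[idx]
}

# Made-up sample text; in practice this could come from readLines("file.txt")
lines <- c(
  "Introduction to the report.",
  "Spending on cancer drugs is rising.",
  "Prices remain a barrier.",
  "Unrelated paragraph.",
  "Another unrelated line.",
  "Anti-Cancer medicines may receive tax exemptions."
)

result <- context_lines(lines, "cancer", n = 1)
print(result)
```

For a list of files such as `mytxtfiles` in the question, something like `lapply(mytxtfiles, function(f) context_lines(readLines(f), "cancer", n = 4))` would apply the same helper per file, assuming each file is plain text with meaningful line breaks.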
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/43414570
