How do I extract text under a specific heading from PDFs imported with the tm package in R?

Stack Overflow user
Asked on 2018-08-30 22:27:58
Answers: 1, Views: 254, Followers: 0, Votes: 0

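I am importing several PDFs into R with the tm package, and I need to extract from their content the character vectors that contain the heading "CORPORATE INFORMATION". The problem is twofold. First, I have not managed to extract the vector with this heading. Second, the vector comes out in a very messy form: I cannot reliably link a person's name to their position in the company, which is the kind of dataset I am trying to build. An example is given below. Any help is welcome.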

vector_of_interest <- c("   CORPORATE INFORMATION\r\n   BOARD OF DIRECTORS                 REGISTERED OFFICE\r\n   Chuah Ah Bee                       Suite 12-02,12th Floor\r\n   Executive Chairman                 Menara Zurich\r\n   Chuah Hoon Phong                   170 Jalan Argyll, 10050 Penang\r\n   Group Managing Director            Telephone Number : 04-2296 318\r\n   Chan Kim Keow                      Facsimile Number : 04-2282 118\r\n   Executive Director\r\n   Loo Choo Gee\r\n   Executive Director                 COMPANY SECRETARIES\r\n   Chew Chee Khong\r\n   Executive Director                 Gunn Chit Geok\r\n   Ng Seng Bee                        (MAICSA 0673097)\r\n   Independent Non-Executive Director Chew Siew Cheng\r\n   Haji Ahmad Fazil Bin Haji Hashim   (MAICSA 7019191)\r\n   Independent Non-Executive Director\r\n   Goh Choon Aik\r\n   Independent Non-Executive Director SHARE REGISTRAR\r\n                                      Tricor Investor Services Sdn Bhd\r\n   AUDIT COMMITTEE                    Level 17, The Gardens North Tower\r\n                                      Mid Valley City\r\n   Ng Seng Bee                        Lingkaran Syed Putra\r\n   Chairman                           59200 Kuala Lumpur\r\n   Haji Ahmad Fazil Bin Haji Hashim   Telephone Number : 03-2264 3883\r\n   Member                             Facsimile Number : 03-2282 1886\r\n   Goh Choon Aik\r\n   Member\r\n                                      STOCK EXCHANGE LISTING\r\n   REMUNERATION COMMITTEE             Main Market of Bursa Malaysia Securities Berhad\r\n                                      Stock Code : 7174\r\n   Haji Ahmad Fazil Bin Haji Hashim   Stock Name : CAB\r\n   Chairman\r\n   Chuah Ah Bee\r\n   Member                             AUDITORS\r\n   Ng Seng Bee\r\n   Member                             Deloitte KassimChan\r\n                                      Chartered Accountants\r\n                                      4th Floor, Wisma Wang\r\n   
NOMINATION COMMITTEE               251-A Jalan Burma\r\n                                      10350 Penang\r\n   Haji Ahmad Fazil Bin Haji Hashim\r\n   Chairman\r\n   Ng Seng Bee                        PRINCIPAL BANKERS\r\n   Member\r\n   Goh Choon Aik                      Malayan Banking Berhad\r\n   Member                             Hong Leong Bank Berhad\r\n                                      United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")

#my attempt
 library(tm)
 library(tidyverse)
 library(stringr)

 Rpdf <- readPDF(control = list(text = "-layout")) # layout control in order to keep the original format as much as possible. I have also tried to add engine = "xpdf", before control

 docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf)) # upload documents
 document <- content(docs[[1]])
 corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))

The PDF can be found at the following link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/4372609 The information is on page 10.

1 Answer

Stack Overflow user

Answered on 2018-09-01 21:33:46

I found a solution:

First, I changed the default readPDF engine to xpdf:

# Use the xpdf engine; the "-layout" option keeps the original format as much as possible
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))

# Load the documents; cname is the path to the directory containing the files
docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf))

Second, I collapsed the text so that each document is held in a single character string:

 document <- content(docs[[1]])
 # paste() with collapse already returns a single string, so unlist() is not needed
 document <- paste(document, collapse = " ")
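
The collapse step, together with the page lookup described next, can be sketched in base R without a regular expression. This is only an illustration under stated assumptions: `doc.lines` is a made-up stand-in for `content(docs[[1]])`, the placeholder page text is invented, and the lines are collapsed with `"\n"` rather than a space (a choice made here so the lines of the page can be recovered afterwards). It relies on the form feed `"\f"` separating pages, as the answer notes.

```r
# Hypothetical stand-in for content(docs[[1]]): one element per line of text.
# "\f" marks a page break; the text of the first and last page is placeholder.
doc.lines <- c(
  "placeholder text for the previous page",
  "9 CAB Annual Report 2012",
  "\f   CORPORATE INFORMATION",
  "   BOARD OF DIRECTORS",
  "   Chuah Ah Bee",
  "10 CAB Annual Report 2012",
  "\fplaceholder text for the next page"
)

# Collapse to a single string (newline-separated so lines can be recovered).
document <- paste(doc.lines, collapse = "\n")

# Split the string into pages on the form feed character.
pages <- strsplit(document, "\f", fixed = TRUE)[[1]]

# Keep only the page(s) that contain the heading of interest.
hit <- grepl("CORPORATE INFORMATION", pages, fixed = TRUE)
corporate.page <- pages[hit]

# Recover the individual lines of the matching page.
corporate.lines <- strsplit(corporate.page, "\n", fixed = TRUE)[[1]]
print(corporate.lines)
```

Splitting on `"\f"` avoids depending on how `.` treats page and line breaks in a given regex engine, at the cost of keeping every page in memory at once.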

Third, I extracted the page containing the information I was looking for and used a regular expression to extract the names:

 corporate.info <- unlist(str_extract_all(document, "\\f+.+CORPORATE+.+INFORMATION+.+\\f"))

### "\f" marks the beginning and the end of a page
### ".+CORPORATE+.+INFORMATION+.+" matches the page containing the heading I was interested in

 corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}")) # extract three-word names
 corporate.info <- unique(corporate.info) # drop duplicates
 corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "") # blank out bank names; similar replacements clean up other unwanted matches
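
The question also asks how to link each name to a position, which the regular expression above does not do. A minimal base R sketch of one possible approach follows. It assumes the page is still available line by line (not collapsed with spaces), that the `-layout` output keeps the left-hand column within the first 38 characters (as appears to be the case in the sample from the question), and that names and positions alternate in that column. The names `left.col`, `right.col`, `page.lines` and `directors` are hypothetical, and the fixed-width lines are rebuilt with `sprintf()` from the first rows of the sample rather than read from the PDF.

```r
# First rows of the sample page, rebuilt as fixed-width lines:
# 3 leading spaces + a 35-character left column, then the right column.
left.col <- c("BOARD OF DIRECTORS", "Chuah Ah Bee", "Executive Chairman",
              "Chuah Hoon Phong", "Group Managing Director",
              "Chan Kim Keow", "Executive Director",
              "Loo Choo Gee", "Executive Director")
right.col <- c("REGISTERED OFFICE", "Suite 12-02,12th Floor", "Menara Zurich",
               "170 Jalan Argyll, 10050 Penang",
               "Telephone Number : 04-2296 318",
               "Facsimile Number : 04-2282 118", "", "", "COMPANY SECRETARIES")
page.lines <- sprintf("   %-35s%s", left.col, right.col)

# Keep only the left column (assumed width: 38 characters) and trim padding.
left <- trimws(substr(page.lines, 1, 38))
left <- left[nzchar(left)]

# Drop section headings, which are written entirely in upper case here.
left <- left[left != toupper(left)]

# Assumption: the remaining lines alternate name, position, name, position...
directors <- data.frame(
  name = left[c(TRUE, FALSE)],
  position = left[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
print(directors)
```

The column width and the strict name/position alternation are properties of this particular page; other reports would need the offset checked, and any line that breaks the alternation would have to be handled before pairing.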
Votes: 0
The original page content is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/52099489
