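I am importing several PDFs into R with the tm package, and I need to extract from their contents a character vector holding the corporate information that appears under a particular heading. The problem is twofold. First, I have not managed to extract the vector for this heading. Second, the vector comes out in a very messy form: I cannot reliably link a person's name to their position in the company, which is the kind of dataset I am trying to build. An example is given below. Any help is welcome.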
vector_of_interest <- c(" CORPORATE INFORMATION\r\n BOARD OF DIRECTORS REGISTERED OFFICE\r\n Chuah Ah Bee Suite 12-02,12th Floor\r\n Executive Chairman Menara Zurich\r\n Chuah Hoon Phong 170 Jalan Argyll, 10050 Penang\r\n Group Managing Director Telephone Number : 04-2296 318\r\n Chan Kim Keow Facsimile Number : 04-2282 118\r\n Executive Director\r\n Loo Choo Gee\r\n Executive Director COMPANY SECRETARIES\r\n Chew Chee Khong\r\n Executive Director Gunn Chit Geok\r\n Ng Seng Bee (MAICSA 0673097)\r\n Independent Non-Executive Director Chew Siew Cheng\r\n Haji Ahmad Fazil Bin Haji Hashim (MAICSA 7019191)\r\n Independent Non-Executive Director\r\n Goh Choon Aik\r\n Independent Non-Executive Director SHARE REGISTRAR\r\n Tricor Investor Services Sdn Bhd\r\n AUDIT COMMITTEE Level 17, The Gardens North Tower\r\n Mid Valley City\r\n Ng Seng Bee Lingkaran Syed Putra\r\n Chairman 59200 Kuala Lumpur\r\n Haji Ahmad Fazil Bin Haji Hashim Telephone Number : 03-2264 3883\r\n Member Facsimile Number : 03-2282 1886\r\n Goh Choon Aik\r\n Member\r\n STOCK EXCHANGE LISTING\r\n REMUNERATION COMMITTEE Main Market of Bursa Malaysia Securities Berhad\r\n Stock Code : 7174\r\n Haji Ahmad Fazil Bin Haji Hashim Stock Name : CAB\r\n Chairman\r\n Chuah Ah Bee\r\n Member AUDITORS\r\n Ng Seng Bee\r\n Member Deloitte KassimChan\r\n Chartered Accountants\r\n 4th Floor, Wisma Wang\r\n NOMINATION COMMITTEE 251-A Jalan Burma\r\n 10350 Penang\r\n Haji Ahmad Fazil Bin Haji Hashim\r\n Chairman\r\n Ng Seng Bee PRINCIPAL BANKERS\r\n Member\r\n Goh Choon Aik Malayan Banking Berhad\r\n Member Hong Leong Bank Berhad\r\n United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")
#my attempt
library(tm)
library(tidyverse)
library(stringr)
Rpdf <- readPDF(control = list(text = "-layout")) # layout control in order to keep the original format as much as possible. I have also tried to add engine = "xpdf", before control
docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf)) # upload documents
document <- content(docs[[1]])
corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))
The PDF can be found at the following link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/4372609
The information is on page 10.
Posted on 2018-09-01 21:33:46
I found a solution.
First, I changed the default readPDF engine to xpdf:
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
# layout control in order to keep the original format as much as possible
docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf))
# load the documents; cname is the path to the files
Second, I collapsed the text so that each document is held in a single string:
document <- content(docs[[1]])
document <- paste(document, collapse = " ")
Third, I extracted the page containing the information I was looking for and used a regular expression to pull out the names:
corporate.info <- unlist(str_extract_all(document, "\\f+.+CORPORATE+.+INFORMATION+.+\\f"))
### "\f" --> marks the beginning and end of a page
### "+.+CORPORATE+.+INFORMATION+.+" --> identifies the page with the heading I was interested in
corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}")) # extract names
corporate.info <- unique(corporate.info) # clean
corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "") # blank out matches containing "Bank"; other unwanted matches were cleaned in a similar way
https://stackoverflow.com/questions/52099489
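The page-isolation and name-extraction steps above can be tried on a small made-up string. This is only a sketch under stated assumptions: the toy text, its column spacing, and the names `pages`, `corporate_page`, `candidates`, and `role_words` are invented for illustration, and it swaps in base R (`strsplit`, `regmatches`) in place of the stringr calls so that it runs without extra packages. Splitting on the form feed is an alternative to the single regex that matches between two `\f` characters, not the method used in the answer.

```r
# Toy stand-in for the collapsed "-layout" text; spacing and page contents are invented.
lines <- c(
  "  CORPORATE INFORMATION",
  "  Chuah Ah Bee            Suite 12-02,12th Floor",
  "  Executive Chairman      Menara Zurich",
  "  Chuah Hoon Phong        170 Jalan Argyll, 10050 Penang",
  "  Group Managing Director   Telephone Number : 04-2296 318"
)
page10 <- paste(lines, collapse = " ")
document <- paste0("text of an earlier page", "\f", page10, "\f", "text of a later page")

# Split the document into pages at each form feed and keep the page with the heading.
pages <- strsplit(document, "\f", fixed = TRUE)[[1]]
corporate_page <- pages[grepl("CORPORATE\\s+INFORMATION", pages)]

# Same three-capitalised-words pattern as in the answer.
name_pattern <- "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}"
hits <- regmatches(corporate_page, gregexpr(name_pattern, corporate_page, perl = TRUE))[[1]]
candidates <- unique(hits)

# The pattern also catches three-word job titles, so drop matches containing role words.
# The word list here is an assumption; adjust it to the documents at hand.
role_words <- "Director|Chairman|Member|Bank"
names_only <- candidates[!grepl(role_words, candidates)]

print(candidates)  # "Chuah Ah Bee" "Chuah Hoon Phong" "Group Managing Director"
print(names_only)  # "Chuah Ah Bee" "Chuah Hoon Phong"
```

Dropping unwanted matches with a logical index, rather than replacing them with empty strings, avoids leaving `""` entries in the result. Note that the pattern only finds names of exactly three capitalised words separated by single whitespace characters, so longer names such as "Haji Ahmad Fazil Bin Haji Hashim" would need a different pattern.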