
Q: How to convert data from PDF files into a data frame

Stack Overflow user

Asked 2014-06-16 21:18:47

1 answer, 7.9K views, 0 followers, 2 votes

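I'm trying to convert the data from a large number of PDF files into data frames in R. I have been converting the PDF files to .txt files and reading them with read.fwf(), but the problem is that the column widths are not the same across the .txt files. Is there a way to determine the widths of the columns, or is there a way to use a function other than read.fwf()?

I have a large number of files to convert, and they all start out with different formats, so finding the specific column widths for each file is becoming very tedious. Is there a more efficient way to get the data from PDF files into a data frame in R?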

Tags: pdf, dataframe, text-files, column-width

1 Answer

Stack Overflow user

Answered 2014-06-17 06:48:56

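Here is one possible solution using regular expressions. You can use the readPDF function from the tm package to convert the PDF file to text, with each line of the PDF becoming one text string. Then use regular expressions to split the data into the appropriate column fields so it can be converted to a data frame.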
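To see the splitting step on its own, without needing a PDF, here is a minimal sketch. The input lines and the column names are made up for illustration; they only imitate the kind of layout-preserving text a PDF extraction might give, where fields are separated by two or more spaces and values contain at most single spaces.

```r
# Hypothetical lines imitating layout-preserving text extracted from a PDF.
# Fields are separated by runs of 2+ spaces; values contain single spaces only.
lines <- c(
  "1   ACE Limited   FIN          0A4848AC9   0.008",
  "2   Aetna Inc.    FIN          0A8985AC5   0.008",
  "3   Alcoa Inc.    INDU, HVOL   014B98AD5   0.008"
)

# Trim both ends, then turn every run of two or more spaces into a pipe
piped <- gsub(" {2,}", "|", trimws(lines))

# Parse the pipe-delimited strings into a data frame
# (column names are illustrative, not taken from any real file)
parsed <- read.table(text = piped, sep = "|", quote = "",
                     stringsAsFactors = FALSE,
                     col.names = c("RowNum", "Entity", "SubIndex",
                                   "CLIP", "Weighting"))
```

This only works when a run of two or more spaces reliably marks a field boundary. A file where two fields are separated by a single space needs an extra substitution first to widen that gap.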

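I've packaged this into a function so that you can read and parse all of your PDF files and combine them into a single data frame in one operation. If your other files have formatting idiosyncrasies that aren't present in the file you posted, you'll need to make some adjustments to get it to work properly.

The code also checks for some simple data format problems and saves "bad" rows to a separate text file for later inspection and processing. Again, you may need to adjust this if your other files have different format variations.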

# Use text-mining package to extract text from PDF files    
library(tm)

# Function to read a PDF file and turn it into a data frame
PDFtoDF = function(file) {
  ## Extract PDF text. Each line of PDF becomes one element of the string vector dat.
  dat = readPDF(control=list(text="-layout"))(elem=list(uri=file), 
                                              language="en", id="id1") 
  dat = c(as.character(dat))

  ## Keep only those strings that contain the data we want. 
  ## These are the ones that begin with a number.
  dat = dat[grep("^ {0,2}[0-9]{1,3}", dat)]

  ## Create separators so we can turn strings into a data frame. We'll use the 
  ## pipe "|" as a separator.

  # Add pipe after first number (the row number in the PDF file)
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)

  # Replace each instance of 2 or more spaces in a row with a pipe separator. This 
  # works because the company names have a single space between words, while data
  # fields generally have more than one space between them. 
  # (We just need to first add an extra space in a few cases where there's only one
  # space between two data fields.)
  dat = gsub("(, HVOL )","\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)

  ## Check for data format problems
  # Identify rows without the right number of fields (there should 
  # be six pipe characters per row) and save them to a file for 
  # later inspection and processing (in this case row 11 of the PDF file is excluded)
  excludeRows = sapply(gregexpr("\\|", dat), length) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append=TRUE)

  # Remove the excluded rows from the string vector
  dat = dat[!excludeRows]

  ## Convert string vector to data frame 
  dat = read.table(text=dat, sep="|", quote="", stringsAsFactors=FALSE)
  names(dat) = c("RowNum", "Reference Entity", "Sub-Index", "CLIP", 
                  "Reference Obligation", "CUSIP/ISIN", "Weighting")
  return(dat)
}

# Create vector of names of files to read
files = list.files(pattern="CDX.*\\.pdf")

# Read each file, convert it to a data frame, then rbind into single data frame
df = do.call("rbind", lapply(files, PDFtoDF))

# Sample of data frame output from your sample file
df
    RowNum    Reference Entity    Sub-Index      CLIP           Reference Obligation   CUSIP/ISIN Weighting
1        1         ACE Limited          FIN 0A4848AC9     ACE-INAHldgs 8.875 15Aug29    00440EAC1     0.008
2        2           Aetna Inc.         FIN 0A8985AC5     AET 6.625 15Jun36 BondCall    00817YAF5     0.008
3        3           Alcoa Inc.  INDU, HVOL 014B98AD5                AA 5.72 23Feb19    013817AP6     0.008
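The format check inside PDFtoDF can also be tried in isolation. The sketch below uses made-up pipe-delimited lines with five fields each, so the expected separator count is four here rather than six. It counts matches with regmatches() because gregexpr() returns a single -1 for a line with no match, so taking length() of its result reports 1 for a line that has no pipes at all. That does not matter when six separators are expected, but it could matter for a layout with only two fields.

```r
# Hypothetical pipe-delimited lines; the second one is missing a separator
piped <- c(
  "1|ACE Limited|FIN|0A4848AC9|0.008",
  "2|Aetna Inc. FIN|0A8985AC5|0.008",
  "3|Alcoa Inc.|INDU, HVOL|014B98AD5|0.008"
)
expected_seps <- 4  # five fields per row in this made-up layout

# regmatches() yields character(0) for a line with no match, so lengths()
# counts zero pipes as 0
n_seps <- lengths(regmatches(piped, gregexpr("|", piped, fixed = TRUE)))
bad <- n_seps != expected_seps

good_rows <- piped[!bad]  # safe to pass on to read.table()
bad_rows <- piped[bad]    # candidates for manual review
```

Setting the bad rows aside instead of stopping with an error means one malformed line does not block the rest of a large batch of files.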

7 votes

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/24244709
