I'm using pdfparser to extract text from PDF files. It works for older PDF versions, but the parser fails on newer ones. My PDF is version 1.7.
<?php
include 'vendor/autoload.php';

// Parse pdf file and build necessary objects.
$parser = new Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('sample.pdf');

// Retrieve all pages from the pdf file.
$pages = $pdf->getPages();

// Loop over each page to extract text.
$content = array();
foreach ($pages as $page) {
    $content[] = $page->getTextArray();
    echo "<pre>";
    print_r($content);
}

Posted on 2019-05-25 18:34:33
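I've run into the same behavior!
I now check the PDF version with a tool before trying to parse the file. If it isn't 1.4, I convert it to 1.4 and then parse it. If you need it, here is a PHP library for that: https://github.com/xthiago/pdf-version-converter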
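For a quick look at which version a file declares, the header can also be read directly. The following is only a minimal sketch: `readPdfHeaderVersion` is a hypothetical helper, not part of pdfparser or pdf-version-converter, and it assumes the `%PDF-x.y` marker appears within the first 1024 bytes of the file. It reads the header only, so it will not notice a version that the document catalog overrides.

```php
<?php
// Hypothetical helper for illustration; not part of any library mentioned here.
// Returns the version declared in the "%PDF-x.y" header, or null if none is found.
function readPdfHeaderVersion(string $path): ?string
{
    if (!is_file($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    // Assumption: the header sits within the first 1024 bytes of the file.
    $head = fread($handle, 1024);
    fclose($handle);
    if ($head === false) {
        return null;
    }
    // Capture the "major.minor" part that follows the "%PDF-" marker.
    if (preg_match('/%PDF-(\d+\.\d+)/', $head, $matches) === 1) {
        return $matches[1];
    }
    return null;
}
```

A caller could then decide whether to convert with something like `version_compare($version, '1.4', '>')`, after checking that the result is not null.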
Code example for the convert-then-parse approach:
// Class namespaces below follow the pdf-version-converter README.
use Symfony\Component\Filesystem\Filesystem;
use Xthiago\PDFVersionConverter\Converter\GhostscriptConverter;
use Xthiago\PDFVersionConverter\Converter\GhostscriptConverterCommand;
use Xthiago\PDFVersionConverter\Guesser\RegexGuesser;

function searchablePdfParser($systemPath) {
    // Work on a temporary copy because the file might need to be converted.
    // getPathWithIdAndTimestamp() is the answerer's own helper (not shown here).
    $tempPath = getPathWithIdAndTimestamp($systemPath) . 'tmp.pdf';
    copy($systemPath, $tempPath);

    // Check the version and convert the copy if required.
    $guesser = new RegexGuesser();
    $pdfVersion = $guesser->guess($tempPath); // returns something like '1.4'
    if ($pdfVersion != '1.4') {
        $command = new GhostscriptConverterCommand();
        $filesystem = new Filesystem();
        $converter = new GhostscriptConverter($command, $filesystem);
        $converter->convert($tempPath, '1.4');
    }

    // Parse the copy: unchanged if it was already 1.4, otherwise the converted file.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($tempPath);
    $text = $pdf->getText();
    unlink($tempPath);

    // Treat very short output as no usable text.
    if (strlen($text) < 30) {
        return '';
    }
    return $text;
}

https://stackoverflow.com/questions/40016313