Q: How do I use the Python pdf2image module on Google Cloud Functions?

Stack Overflow user

Asked on 2021-03-21 19:46:42

Answers: 2 · Views: 454 · Followers: 0 · Votes: 2

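I am trying to convert a PDF to JPEG on Google Cloud Functions using the Python module pdf2image. However, I don't know how to resolve these errors on Cloud Functions: No such file or directory: 'pdfinfo' and "Unable to get page count. Is poppler installed and in PATH?".

The error is very similar to the one in this question. pdf2image is a wrapper around "pdftoppm" and "pdftocairo". But how can I install the poppler package on Google Cloud Functions and add it to PATH? I couldn't find any relevant references. Is this possible at all? If not, what can be done instead?

There is also this question, but it was not helpful.

The code looks like the following. The entry point is process_image.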

import requests
from pdf2image import convert_from_path

def process_image(event, context):
    # Download sample pdf file
    url = 'https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url, allow_redirects=True)
    open('/tmp/sample.pdf', 'wb').write(r.content)

    # The error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save pages to /tmp
    for idx, page in enumerate(pages):
        output_file_path = f"/tmp/{idx}.jpg"
        page.save(output_file_path, 'JPEG')
        # To be saved to cloud storage

requirements.txt:

requests==2.25.1
pdf2image==1.14.0

This is the error I get:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 441, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 1706, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 149, in view_func
    function(data, context)
  File "/workspace/main.py", line 11, in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 97, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 467, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Thanks in advance for your help.

pdf

google-cloud-functions

poppler

python

image

2 Answers

Stack Overflow user

Accepted answer

Posted on 2021-07-12 01:48:33

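Cloud Functions does not support installing custom system-level packages (even though it supports third-party libraries for the relevant programming language through package managers such as npm and pip). As shown at https://cloud.google.com/functions/docs/reference/system-packages, there is no "poppler" package.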
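To confirm which command-line tools are actually present in a given runtime, one option is to look them up on PATH from inside the function. The following is a minimal sketch; the helper name find_tools and the default list of tool names are illustrative assumptions, not part of pdf2image or Cloud Functions.

```python
import shutil


def find_tools(names=("pdfinfo", "pdftoppm", "pdftocairo", "gs")):
    """Map each executable name to its resolved path, or None if it is not on PATH.

    The default names are the poppler tools mentioned in the question plus
    Ghostscript; adjust the tuple as needed.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found'}")
```

Logging this result once from a deployed function would show whether pdfinfo is missing while gs is available, which is what the system-packages page suggests.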

However, you can still use other preinstalled packages. ghostscript can be used to convert a PDF to images.

First, you should save the PDF file inside the cloud function (for example, from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).
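As a rough illustration of staying inside the writable area, the sketch below writes PDF bytes into a fresh directory created by Python's tempfile module. The helper name save_to_tmp and the default file name are hypothetical, and the sketch assumes the default temporary directory resolves to /tmp in the Cloud Functions runtime.

```python
import tempfile
from pathlib import Path


def save_to_tmp(data: bytes, filename: str = "input.pdf") -> Path:
    """Write data to a new private directory under the system temp dir.

    Assumption: on Cloud Functions the default temp dir is /tmp, the only
    writable location according to the answer above.
    """
    work_dir = Path(tempfile.mkdtemp())  # unique directory, avoids name clashes
    target = work_dir / filename
    target.write_bytes(data)
    return target
```

Using a unique directory per invocation keeps files from one request from colliding with those of another when an instance is reused.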

An example terminal command for converting a PDF to JPEG is shown below:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path
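The same command can be assembled as an argument list in Python, which avoids shell quoting issues. The sketch below uses only the flags from the command above; the function name build_gs_command and its parameters are illustrative. The example output name relies on Ghostscript's %d page-number placeholder in -sOutputFile, which writes one file per page; with a fixed output name, a multi-page PDF would typically leave only a single image.

```python
def build_gs_command(input_pdf, output_pattern, dpi=300, jpeg_quality=100):
    """Return the Ghostscript argument list for rendering a PDF to JPEG.

    output_pattern may contain a printf-style page placeholder such as
    "/tmp/page-%03d.jpg" so that each page goes to its own file.
    """
    return [
        "gs",
        "-dSAFER",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=jpeg",
        f"-dJPEGQ={jpeg_quality}",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        str(input_pdf),
    ]


# Hypothetical paths, both under /tmp
cmd = build_gs_command("/tmp/sample.pdf", "/tmp/page-%03d.jpg")
```

The resulting list can be passed directly to subprocess without splitting on spaces.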

Sample code for using this command in a Python environment:

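import logging
import os
import subprocess

from google.cloud import storage

# bucket_name, file_name, input_file_path and output_file_path are set by the
# caller; both paths must be under /tmp

# download the file from google cloud storage
gcs = storage.Client(project=os.environ['GCP_PROJECT'])
bucket = gcs.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(input_file_path)

# run ghostscript; the arguments are passed as a list so that paths containing
# spaces are not split and no literal quotes end up in the output file name
cmd = ['gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg', '-dJPEGQ=100',
       '-r300', f'-sOutputFile={output_file_path}', input_file_path]
p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
stdout, stderr = p.communicate()
error = stderr.decode('utf8')
if error:
    logging.error(error)
    return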
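The snippet above treats any stderr output as a failure, although command-line tools may also print warnings there on a successful run. An alternative is to check the exit code instead. The following is a sketch of that idea, not the answer author's method; the helper name run_checked is hypothetical.

```python
import subprocess


def run_checked(cmd):
    """Run cmd (a list of arguments) and raise RuntimeError on a non-zero exit.

    stdout and stderr are captured as text and returned on the
    CompletedProcess, so warnings can still be logged by the caller.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result
```

A caller could pass the Ghostscript argument list to run_checked and log result.stderr at warning level when it is non-empty.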

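Note: you may be tempted to use the imagemagick package, which itself uses ghostscript. However, as mentioned in Can't load PDF with Wand/ImageMagick in Google Cloud Function, reading PDFs with ImageMagick has been disabled because of security vulnerabilities in Ghostscript at the time of writing (2021-07-12). The solution given there is essentially another way of running ghostscript.

Reference: https://www.the-swamp.info/blog/google-cloud-functions-system-packages/

Votes: 0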

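Stack Overflow user

Posted on 2021-03-23 02:36:30

This error occurs because the poppler package cannot work in Cloud Functions, as it needs to write some files to the system. Unfortunately, you cannot write to the file system in serverless products such as Cloud Functions.

You may want to try the approach described in another thread, Cloud Functions for Firebase - Converting PDF to image, or consider using GCP Compute Engine, where you have access to the whole system.

Votes: 0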

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/66731910
