Q: pdfplumber's extract_text function also extracts text from tables. I only want to extract the text outside the tables

Stack Overflow user

Asked 2021-10-01 14:46:09

Answers: 1 | Views: 226 | Followers: 0 | Votes: 0

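I have a PDF that contains both text and tables. I want to extract both, but when I use the extract_text function it also extracts the content inside the tables. I only want extract_text to return the text outside the tables, and I can get the tables themselves with the extract_tables function.

I tested with a PDF that contains only a table, and extract_text still returned the table content, which I want to extract with the extract_tables function instead.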

Tags: python, pdf, pdfplumber

1 Answer

Stack Overflow user

Answered 2021-10-08 17:32:09

You can try the following code:

import pdfplumber

# Import the PDF.
pdf = pdfplumber.open("file.pdf")

# Load the first page.
p = pdf.pages[0]

# Table settings.
ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

# Get the bounding boxes of the tables on the page.
bboxes = [table.bbox for table in p.find_tables(table_settings=ts)]

def not_within_bboxes(obj):
    """Return True if the object's midpoint lies outside every table's bbox."""
    def obj_in_bbox(_bbox):
        """See https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L404"""
        v_mid = (obj["top"] + obj["bottom"]) / 2
        h_mid = (obj["x0"] + obj["x1"]) / 2
        x0, top, x1, bottom = _bbox
        return (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
    return not any(obj_in_bbox(__bbox) for __bbox in bboxes)

print("Text outside the tables:")
print(p.filter(not_within_bboxes).extract_text())

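I use the .filter() method provided by pdfplumber to drop every object that falls within the bounding box of any table, which creates a filtered version of the page, and then I extract the text from that filtered page.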
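To make the filtering rule easier to see, here is a minimal sketch of the same midpoint test on plain dictionaries, with no PDF involved. The helper names (midpoint_in_bbox, outside_all) and the sample coordinates are made up for illustration. The only thing taken from the code above is the layout: objects carry x0, x1, top and bottom keys, and a bbox is (x0, top, x1, bottom).

```python
def midpoint_in_bbox(obj, bbox):
    """Return True if the centre of obj falls inside bbox (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def outside_all(obj, bboxes):
    """Return True if the centre of obj is outside every bbox in bboxes."""
    return not any(midpoint_in_bbox(obj, bbox) for bbox in bboxes)


# Hypothetical data: one table region and two character-like objects.
table_bboxes = [(50, 100, 300, 200)]
char_in_table = {"x0": 60, "x1": 66, "top": 110, "bottom": 120}
char_in_body = {"x0": 60, "x1": 66, "top": 40, "bottom": 50}

print(outside_all(char_in_table, table_bboxes))  # False: dropped by the filter
print(outside_all(char_in_body, table_bboxes))   # True: kept by the filter
```

Because only the midpoint is tested, an object that straddles a table edge is kept or dropped depending on which side its centre falls.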

Since you have not shared your PDF, the table settings I used may not work for it, but you can change them as needed.
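If the "lines" strategies find no tables (for example, when the table has no ruling lines), one option is to try a few settings in turn. The sketch below is only an assumption about how you might organise that: first_table_bboxes and CANDIDATES are hypothetical names, and the "text" strategy is a standard pdfplumber option that can also mistake aligned body text for a table, so check the result on your own file.

```python
def first_table_bboxes(page, candidate_settings):
    """Return (settings, bboxes) for the first settings dict that detects a table."""
    for settings in candidate_settings:
        tables = page.find_tables(table_settings=settings)
        if tables:
            return settings, [table.bbox for table in tables]
    # No candidate detected a table on this page.
    return None, []


# Candidate settings, tried in order: ruled tables first, then text alignment.
CANDIDATES = [
    {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
    {"vertical_strategy": "text", "horizontal_strategy": "text"},
]

# Usage, assuming `p` is the page loaded in the code above:
# settings, bboxes = first_table_bboxes(p, CANDIDATES)
```

The returned bboxes can then replace the bboxes list used by not_within_bboxes.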

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/69407850
