
Breaking document sections into a list for export (Python)

Stack Overflow user
Asked on 2017-04-13 21:17:23
1 answer · 369 views · 2 votes

I'm very new to Python, and I'm trying to break some legal documents into sections for export into SQL. I need to do two things:

  1. Define the section numbers by the table of contents, and
  2. Break the document into sections based on the defined section numbers

The table of contents lists the section numbers: 1.1, 1.2, 1.3, etc.

The document itself is then subdivided by those section numbers: 1.1 "...Text.", 1.2 "...Text.", 1.3 "...Text.", and so on.

Similar to chapters in a book, but delimited by ascending decimal numbers.

I have parsed the document with Tika and have been able to create a list of the sections with some basic regex:

import tika
import re

from tika import parser
parsed = parser.from_file('test.pdf')
content = (parsed["content"])

headers = re.findall("[0-9]*[.][0-9]",content)
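As a side note, the pattern `[0-9]*[.][0-9]` also admits matches with no digits before the dot (the `*` allows zero) and stops after a single digit. A sketch of a tighter pattern, assuming each section number has digits on both sides of the dot and is followed by a capitalized title word (the sample text here is made up):

```python
import re

# Made-up sample: two section headers plus an ordinary decimal number.
sample = "1.1 Definitions. The rate is 3.14 percent. 1.2 Term of Agreement."

# Digits on both sides of the dot, plus a lookahead for a capitalized word,
# which filters out plain decimals such as "3.14 percent".
pattern = r"\b\d+\.\d+(?=\s+[A-Z])"

print(re.findall(pattern, sample))  # ['1.1', '1.2']
```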

Now I need to do something like this:

splitsections = content.split() by headers

var_string = ', '.join('?' * len(splitsections))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, splitsections)
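One possible shape for that split step, sketched with `re.split` on a made-up string (a capturing group in the pattern keeps the headers in the output so they can be paired with their bodies):

```python
import re

content = "1.1 First section text. 1.2 Second section text. 1.3 Third."

# Splitting on a capturing group keeps the matched headers in the result.
parts = re.split(r"(\d+\.\d+)", content)

# parts[0] is anything before the first header; after that, headers and
# bodies alternate, so pair them up.
sections = list(zip(parts[1::2], (p.strip() for p in parts[2::2])))

print(sections)
# [('1.1', 'First section text.'), ('1.2', 'Second section text.'), ('1.3', 'Third.')]
```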

Sorry if any of this is unclear. Still very new to this.

Any help you could provide would be greatly appreciated.


1 Answer

Stack Overflow user

Accepted answer

Posted on 2017-04-14 00:20:16

Everything is tested except the final DB part. The code could also be improved, but that is another task. The main task is done.

In the list split_content you have all the information you wanted (i.e. the text between 2.1 and 2.2, then the text between 2.2 and 2.3, and so on), excluding the section number and name themselves (i.e. excluding "2.1 Continuation", "2.2 Name", etc.).

I used PyPDF2 instead of tika, because tika did not provide the tools needed for this task (that is, I did not find a way to give it the number of a desired page and get that page's content).

import re

import PyPDF2


def get_pdf_content(pdf_path,
                    start_page_table_contents, end_page_table_contents,
                    first_parsing_page, last_phrase_to_stop):
    """
    :param pdf_path: Full path to the PDF file
    :param start_page_table_contents: The page where the "Contents table" starts
    :param end_page_table_contents:    The page where the "Contents Table" ends
                                      (i.e. the number of the page where Contents Table ENDs, i.e. not the next one)
    :param first_parsing_page:        The 1st page where we need to start data grabbing
    :param last_phrase_to_stop:       The phrase that tells the code where to stop grabbing.
                                      The phrase must match exactly what is written in PDF.
                                      This phrase will be excluded from the grabbed data.
    :return: 
    """

    # ======== GRAB TABLE OF CONTENTS ========
    start_page = start_page_table_contents
    end_page = end_page_table_contents

    table_of_contents_page_nums = range(start_page-1, end_page)

    sections_of_articles = []  # ['2.1 Continuation', '2.2 Name', ... ]

    open_file = open(pdf_path, "rb")
    pdf = PyPDF2.PdfFileReader(open_file)

    for page_num in table_of_contents_page_nums:
        page_content = pdf.getPage(page_num).extractText()

        page_sections = re.findall("[\d]+[.][\d][™\s\w;,-]+", page_content)

        for section in page_sections:
            cleared_section = section.replace('\n', '').strip()
            sections_of_articles.append(cleared_section)

    # ======== GRAB ALL NECESSARY CONTENT (MERGE ALL PAGES) ========
    total_num_pages = pdf.getNumPages()
    parsing_pages = range(first_parsing_page-1, total_num_pages)

    full_parsing_content = ''  # Merged pages

    for parsing_page in parsing_pages:
        page_content = pdf.getPage(parsing_page).extractText()
        cleared_page = page_content.replace('\n', '')

        # Remove page num from the start of "page_content"

        # Covers the case with the page 65, 71 and others when the "page_content" starts
        # with, for example, "616.6 Liability to Partners.  (a)  It is understood that"
        # i.e. "61" is the page num and "6.6 Liability ..." is the section data
        already_cleared = False
        first_50_chars = cleared_page[:50]

        for section in sections_of_articles:
            if section in first_50_chars:
                indx = cleared_page.index(section)
                cleared_page = cleared_page[indx:]

                already_cleared = True
                break

        # Covers all other cases
        if not already_cleared:
            page_num_to_remove = re.match(r'^\d+', cleared_page)
            if page_num_to_remove:
                cleared_page = cleared_page[len(str(page_num_to_remove.group(0))):]

        full_parsing_content += cleared_page

    # ======== BREAK ALL CONTENT INTO PIECES ACCORDING TO TABLE CONTENTS ========
    split_content = []

    num_sections = len(sections_of_articles)

    for num_section in range(num_sections):
        start = sections_of_articles[num_section]

        # Get the last piece, i.e. "11.16 FATCA" (as there is no any "end" section after "11.16 FATCA", so we cant use
        # the logic like "grab info between sections 11.1 and 11.2, 11.2 and 11.3 and so on")
        if num_section == num_sections-1:
            end = last_phrase_to_stop

        else:
            end = sections_of_articles[num_section + 1]

        # Escape the section strings, which may contain regex metacharacters
        content = re.search('%s(.*)%s' % (re.escape(start), re.escape(end)),
                            full_parsing_content).group(1)

        cleared_piece = content.replace('™', "'").strip()
        if cleared_piece[0:3] == '.  ':
            cleared_piece = cleared_piece[3:]

        # There are few appearances of "[Signature Page Follows]", as a "last_phrase_to_stop".
        # We need the text between "11.16 FATCA" and the 1st appearance of "[Signature Page Follows]"
        try:
            indx = cleared_piece.index(end)
            cleared_piece = cleared_piece[:indx]
        except ValueError:
            pass

        split_content.append(cleared_piece)

    # ======== INSERT TO DB ========
    # Did not test this section
    for piece in split_content:
        var_string = ', '.join('?' * len(piece))
        query_string = 'INSERT INTO table VALUES (%s);' % var_string
        cursor.execute(query_string, piece)

How to use it (one possible way):

1) Save the code above in a Python file (my_pdf_code.py, for example)

2) Import it and call the function:

import path.to.my_pdf_code as the_code
the_code.get_pdf_content('/home/username/Apollo_Investment_Fund_VIII_LPA_S1.pdf', 2, 4, 24, '[Signature Page Follows]')
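As for the untested DB step in the answer: as written, `'?' * len(piece)` builds one placeholder per character of the text rather than per column. A minimal sketch of a per-row parameterized insert with sqlite3, using a hypothetical `sections(header, body)` table and made-up data:

```python
import sqlite3

# Hypothetical data: (header, body) pairs from the splitting step.
split_content = [("1.1", "First section text."), ("1.2", "Second section text.")]

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE sections (header TEXT, body TEXT)")

# One parameterized INSERT per row; executemany binds each tuple safely.
cursor.executemany("INSERT INTO sections VALUES (?, ?)", split_content)
conn.commit()

count = cursor.execute("SELECT COUNT(*) FROM sections").fetchone()[0]
print(count)  # 2
```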
1 vote
Original link:

https://stackoverflow.com/questions/43401861
