Q: Find a section of information in a document and delete everything before and after it

Stack Overflow user

Asked 2020-02-26 10:01:29

Answers: 1, Views: 55, Followers: 0, Score: 1

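I have some .docx files that are formatted in a very specific way.

I have copied the file 5 times, one copy for each of the 5 different strings I need to "find", and in each copy I want to delete everything else.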

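#! python3
import docx
import os
import shutil
import readDocx as rD  # the asker's own helper module (not shown here)

def delete_paragraph(paragraph):
    # Detach the paragraph's XML element from the document body
    p = paragraph._element
    p.getparent().remove(p)
    # Clear the proxy object's references so the stale paragraph is not reused by mistake
    paragraph._p = paragraph._element = None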

# Select the file you want to work with
fP = rD.file

# Get the directory that contains the file
nfP = os.path.dirname(os.path.abspath(fP))
#print (nfP)

# Get the filename only
fileCode = os.path.basename(fP)
#print (fileCode)

# Separate the course code (the first space-delimited token of the filename)
nameSplit = fileCode.split(' ')
courseCode = nameSplit[0]
#print (courseCode)

#List of files that we need to create
a1 = "Assessment Summary"
a2 = "Back to Business project"
a3 = "Back to Business Checklist"
a4 = "Skills Demonstration"
a5 = "Skills Demonstration Checklist"
names = [a1, a2, a3, a4, a5]

# List that will hold the new filenames
newFiles = []
# Create one copy of the original file per entry in names
for name in names:
    fileName = os.path.join(nfP, courseCode + ' ' + name + ' Version 1.0.docx')
    shutil.copy(fP, fileName)
    #print(fileName)
    newFiles.append(fileName)

#print (newFiles)

#Need to iterate through the files and start deleting data.
h1 = "Learner Declaration"
h2 = "Back to Business Project"
h3 = "Assessor Observation Checklist / Marking Guide"
h4 = "Skills Demonstration"
h5 = "Assessor Observation Checklist / Marking Guide"

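This is where my limited skills start to fail me. The h1 to h5 variables are the headings of the document sections I want to keep. How can I iterate through the document, find a heading, and delete everything before and after that section? I don't necessarily need a full answer, just more of a "look in this direction" pointer.

Thanks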

python-docx

python

python-3.x

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-02-26 14:32:11

Try this. The comments in the code explain what each part does.

from docx import Document  # Requires the "python-docx" package
import pandas as pd

# Read the document into a python-docx Document object
document = Document('Path/to/your/input/.docx/document')

# Collect one record per paragraph: its text and its style name
rows = []
for para in document.paragraphs:
    # Extract paragraph style
    style = str(para.style.name)

    # Headings are sometimes typed in NORMAL style and simply made BOLD.
    # If every run of a non-empty paragraph is bold, treat the paragraph as a heading too.
    runboldtext = ''
    for run in para.runs:
        if run.bold:
            runboldtext = runboldtext + run.text
    if runboldtext == str(para.text) and runboldtext != '':
        print("Bold True for:", runboldtext)
        style = 'Heading'

    rows.append({'para_text': para.text, 'style': style})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
document_text_dataframe = pd.DataFrame(rows, columns=['para_text', 'style'])

# Headings of the sections to keep
h1 = "Learner Declaration"
h2 = "Back to Business Project"
h3 = "Assessor Observation Checklist / Marking Guide"
h4 = "Skills Demonstration"
h5 = "Assessor Observation Checklist / Marking Guide"

# h5 has the same text as h3, and the search below only finds the first occurrence of a heading
h_list = [h1, h2, h3, h4]

# List that will hold one dataframe of extracted paragraphs per "h" value
extracted_content = []

for h in h_list:
    # Find the first paragraph whose text matches the heading exactly
    start_index = None
    for index, row in document_text_dataframe.iterrows():
        if h == row['para_text']:
            print("Found match in document for: ", h)
            start_index = index
            print("Matching index=", index)
            break

    # Skip headings that do not occur in the document
    if start_index is None:
        continue

    # The section ends at the next heading, or at the end of the document
    end_index = len(document_text_dataframe)
    for i in range(start_index + 1, len(document_text_dataframe)):
        if 'Heading' in document_text_dataframe.loc[i, 'style']:
            end_index = i
            break

    # Keep the heading row and everything up to (not including) the next heading
    extracted_content.append(document_text_dataframe.iloc[start_index:end_index])


# "extracted_content" now holds one dataframe per "h" value that was found in the document
print(extracted_content)

Now, using extracted_content, you can use your own code to write each entry of the list extracted_content to a separate .docx document.
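As a rough illustration of that last step, here is a minimal sketch. The helper names `output_name` and `write_section` are hypothetical, and the file-name pattern is simply the one used in the question. It rebuilds each section in a fresh document from plain text, so tables, images and run-level formatting from the original file are not carried over.

```python
def output_name(course_code, label, version="1.0"):
    """Build a file name in the pattern used in the question (hypothetical helper)."""
    return f"{course_code} {label} Version {version}.docx"


def write_section(rows, path):
    """Write (text, style_name) pairs to a new .docx file (hypothetical helper).

    Paragraphs whose style name contains 'Heading' are written as level-1
    headings; everything else is written as a plain paragraph.
    """
    # python-docx is imported here so that output_name() can be used without it
    from docx import Document

    doc = Document()
    for text, style in rows:
        if 'Heading' in style:
            doc.add_heading(text, level=1)
        else:
            doc.add_paragraph(text)
    doc.save(path)
```

With the list of dataframes built above, each entry could be passed in as `write_section(zip(df['para_text'], df['style']), path)`, where `path` combines the target folder with `output_name(courseCode, name)`. Pairing entries with the question's `names` list only lines up if every heading was actually found.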

Cheers!
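For the in-place route the question describes (copy the file, then delete everything outside the wanted section), a sketch along the following lines might be a starting point. The function names `indices_to_delete` and `keep_only_section` are hypothetical. It assumes section boundaries can be recognised from style names containing 'Heading' alone, so the bold-text check above would have to be added for documents that use bold Normal paragraphs as headings. It also only touches paragraphs in the document body: tables are not part of `doc.paragraphs` and would be left in place.

```python
def indices_to_delete(texts, styles, heading):
    """Return the indices of paragraphs outside the section opened by `heading`.

    The section starts at the first paragraph whose text equals `heading` and
    ends just before the next paragraph whose style name contains 'Heading'.
    An empty list is returned if the heading is not found, so nothing is deleted.
    """
    if heading not in texts:
        return []
    start = texts.index(heading)
    end = len(texts)
    for i in range(start + 1, len(texts)):
        if 'Heading' in styles[i]:
            end = i
            break
    return [i for i in range(len(texts)) if i < start or i >= end]


def keep_only_section(path, heading):
    """Remove every body paragraph outside the section, saving the file in place."""
    from docx import Document  # python-docx

    doc = Document(path)
    paragraphs = list(doc.paragraphs)  # snapshot before removing anything
    texts = [p.text for p in paragraphs]
    styles = [p.style.name for p in paragraphs]
    for i in indices_to_delete(texts, styles, heading):
        element = paragraphs[i]._element
        element.getparent().remove(element)
    doc.save(path)
```

Each of the five copies made in the question could then be passed to `keep_only_section` together with its matching heading. Because h3 and h5 share the same text, the second "Assessor Observation Checklist / Marking Guide" section would need an extra rule to tell it apart from the first.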

Score: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60405741
