Q: How to capture the list numbering of each paragraph

Stack Overflow user

Asked 2019-12-25 09:34:56

1 answer · 230 views · 0 followers · 1 vote

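I am trying to use python-docx to read the contents of a Word file. For example, the attached demo Word file contains several paragraphs, and some of them carry heading numbers such as 1.3, 1.4.1, and so on.

My program opens the document and searches every paragraph for a keyword. If the keyword occurs in a paragraph, it should print that paragraph together with its heading number.

However, it fails to print the heading number. For example, when I search for the keyword "wall", it prints only the paragraph that contains "wall", without the heading number 1.4.1. I need the number as well.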

import re
from docx import Document

def search_word(filename, word):
    # Open the Word file
    document = Document(filename)
    # Match the keyword case-insensitively; escape it so it is treated literally
    pattern = re.compile(re.escape(word), re.I)
    # Keep every paragraph whose text contains the keyword
    result = []
    for paragraph in document.paragraphs:
        text = paragraph.text.strip()
        if pattern.search(text):
            result.append(text)
    print(filename + "=" * 30 + "Search Result" + "=" * 30)
    print("-" * 150)
    for text in result:
        print(text + "\n")
    print("-" * 150 + "\n" * 2)
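
A possible explanation, assuming the numbers come from Word's automatic heading numbering: such numbers are generated when the document is rendered and are not part of `paragraph.text`. One workaround is to rebuild them by counting heading levels. The sketch below is only an illustration: `number_headings` is a hypothetical helper, and it assumes the headings use the built-in "Heading N" styles with plain decimal multilevel numbering starting at 1. If the document uses a custom numbering scheme, the rebuilt numbers will not match what Word displays.

```python
import re

def number_headings(paragraphs):
    """Yield (number, text) for each (style_name, text) pair.

    Headings get a rebuilt number such as "1.4.1"; body paragraphs get
    the number of the most recent heading ("" before the first one).
    """
    counters = []   # counters[i] is the running count at heading level i + 1
    current = ""
    for style_name, text in paragraphs:
        match = re.fullmatch(r"Heading (\d+)", style_name)
        if match:
            level = int(match.group(1))
            counters = counters[:level]      # forget deeper levels
            while len(counters) < level:     # skipped levels are padded with 0
                counters.append(0)
            counters[-1] += 1
            current = ".".join(str(c) for c in counters)
        yield current, text

# With python-docx the pairs could be built like this (assumption: the
# document uses the built-in heading styles):
#   from docx import Document
#   pairs = [(p.style.name, p.text) for p in Document("demo.docx").paragraphs]
pairs = [
    ("Heading 1", "Scope"),
    ("Heading 2", "General"),
    ("Normal", "Some introductory text."),
    ("Heading 2", "Structure"),
    ("Heading 3", "Partitions"),
    ("Normal", "The wall shall be built from brick."),
]
for number, text in number_headings(pairs):
    if "wall" in text.lower():
        print(number, text)
```

The generator keeps the search logic separate from the numbering, so the keyword test from the question can be applied to the yielded text unchanged.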

python

docx

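1 Answer

Stack Overflow user

Posted 2020-01-18 09:32:34

In the end I found a clumsy way to capture the heading and content of each paragraph. I first convert the docx to HTML, then use Beautiful Soup and re to search for my keyword.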

import re
from bs4 import BeautifulSoup

output_content = ""   # accumulated report text, shared with print_result

def search_file(file, word):
    global output_content
    output_content = output_content + "\n" + "*" * 30 + file.split("\\")[-1] + " Search Result" + "*" * 30 + "\n" * 2
    # Read the HTML file that was exported from the docx
    with open(file, 'r', encoding='utf-8') as htmlfile:
        demo = htmlfile.read()
    soup = BeautifulSoup(demo, 'lxml')
    all_content = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'p'])
    new_list = []
    for item in all_content:
        if item.text not in new_list:   # skip text that was already collected
            new_list.append(item.text)
    dic1 = {}      # maps each heading line (clause no.) to the detail text under it
    Target = ""    # heading line currently being filled
    content = ""   # detail text collected under the current heading
    heading = re.compile(r"^[1-9].+")   # a line that starts with a heading number
    for line in new_list:
        line = line.replace("\n", " ")
        if heading.search(line):
            dic1[Target] = content   # save the content collected for the previous heading
            Target = line
            content = ""
        else:
            content = content + line + "\n"   # detail line: append to the current content
    dic1[Target] = content   # save the last section, which the loop itself never stores
    # Search every section; on a hit, record the heading (key) together with the matching lines
    pattern = re.compile(r".*%s.*" % re.escape(word), re.I | re.M)
    result = []
    for key, value in dic1.items():
        rel = pattern.findall(value)
        if rel:
            result.append(key)
            result.append(rel)
            result.append("\n")
    return print_result(file, result)

def print_result(file, nums):
    global output_content
    for i in nums:
        if isinstance(i, list):
            print_result(file, i)   # recurse into nested lists of matches
        else:
            output_content = output_content + i
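
The grouping idea in this answer can also be tried on plain strings, without any HTML file. The following is a minimal sketch rather than the answerer's code: `find_sections` is a hypothetical helper, and it keeps the answer's assumption that any line starting with a digit from 1 to 9 is a numbered heading.

```python
import re

def find_sections(lines, word):
    """Return (heading, matching_lines) for every section whose body contains word."""
    heading_re = re.compile(r"^[1-9]")
    word_re = re.compile(re.escape(word), re.I)
    # Each entry is [heading, body_lines]; text before the first heading
    # is collected under the empty heading "".
    sections = [["", []]]
    for line in lines:
        if heading_re.match(line):
            sections.append([line, []])
        else:
            sections[-1][1].append(line)
    hits = []
    for heading, body in sections:
        matches = [text for text in body if word_re.search(text)]
        if matches:
            hits.append((heading, matches))
    return hits

lines = [
    "1.4 Structure",
    "General notes.",
    "1.4.1 Partitions",
    "The wall shall be 200 mm thick.",
    "1.5 Roof",
    "No keyword here.",
]
for heading, matches in find_sections(lines, "wall"):
    print(heading)
    for text in matches:
        print("    " + text)
```

Returning the hits as a list instead of appending to a global string makes the function easier to test. The heading test stays as loose as in the answer, so a detail paragraph that happens to begin with a digit would be mistaken for a heading.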

0 votes

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/59474565
