
Word documents aren't plain text; they are documents that contain control information (such as formatting) alongside the text. If you ignore the control information, the text is pretty useless.

So you have to dig into the details: learn how to navigate the document's control structure to find the text you're interested in, then get the text content of those structures.

Note: You'll find that Word is very complex. If you can, consider these two approaches as well:

<ul>
<li>Save the Word document as HTML from within Word. It will lose some formatting, but lists will stay intact. HTML is much simpler to parse and understand than Word's native format.</li>
<li>Save the document as OOXML (the default format since Office 2007; the extension is <code>.docx</code>). This is a ZIP archive with XML documents inside. The XML is again easier to parse and understand than the binary Word format, but harder than the HTML version.</li>
</ul>
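
To make the second option concrete, here is a minimal sketch that reads paragraph text straight out of a <code>.docx</code> archive using only the standard library. The helper name <code>docx_paragraphs</code> is hypothetical, and the sketch assumes the usual layout in which the body is stored in <code>word/document.xml</code> under the transitional WordprocessingML namespace. It ignores formatting, headers, footnotes and list numbering.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file
# (assumption: the file uses the common "transitional" namespace).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        # The body of the document lives in word/document.xml.
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Calling <code>docx_paragraphs('yourword.docx')</code> would return one string per paragraph, including paragraphs that sit inside table cells.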


<blockquote>
Now I want to convert this text into a list which contains all its items. I used
content = &quot; &quot;.join(content.replace(u&quot;\xa0&quot;, &quot; &quot;).strip().split())
It's not a list. What is the problem?
</blockquote>
The .join method <a href="http://docs.python.org/library/string.html#string.join" rel="nofollow noreferrer">always returns a string</a>. It expects a list (or any other iterable) of strings and concatenates the items, separated by the given delimiter (&quot; &quot; in your case). The list you want is what <code>.split()</code> already returns, so drop the <code>.join</code> call.
Apart from that, see what Aaron Digulla said.
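
The difference between the two calls can be seen in a small sketch. The sample string is made up for illustration and only stands in for whatever <code>doc.Content.Text</code> returns.

```python
# Sample text standing in for doc.Content.Text (hypothetical data).
content = u"GENERAL\xa0DATA\r  Name:  Example \r\n"

# split() with no arguments splits on any run of whitespace and returns a list.
items = content.replace(u"\xa0", u" ").strip().split()

# " ".join(...) glues that list back into a single string.
joined = u" ".join(items)

print(type(items).__name__)   # list
print(type(joined).__name__)  # str in Python 3
```

So <code>items</code> is the list the question asks for, while <code>joined</code> is the single normalized string that the original line produced.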


See this post and its comments: <a href="http://code.activestate.com/recipes/279003-converting-word-documents-to-text/" rel="nofollow noreferrer">Converting Word documents to text (Python recipe)</a>

This post may also be useful: <a href="https://stackoverflow.com/questions/685533/python-convert-microsoft-office-docs-to-plain-text-on-linux">python convert microsoft office docs to plain text on linux</a>


You could just parse the Word document line by line. It isn't elegant and it certainly isn't pretty, but it works. Here's a snippet from something similar I've done in Python 3.3.

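<pre><code>import os

directory = 'your/path/to/file/'
file = 'yourword.doc'

# A .doc file is binary, so open it in binary mode and read raw bytes.
with open(os.path.join(directory, file), 'rb') as doc:
    for line in doc:
        # str() of a bytes object gives its repr, e.g. "b'...'".
        line2 = str(line)
        print(line2)
</code></pre>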

I used a regular expression to get just what I needed. This code reads each line of your Word document (formatting and all) and converts it to strings that you can work with. I'm not sure whether this is still helpful (this post is a couple of years old), but at least it gets at the contents of the Word document. After that, it's just a matter of getting rid of the strings you don't want before writing to a txt file.
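
The answer does not show its regular expression, so the following is only a sketch of one way such filtering could look, not the author's actual pattern. The helper <code>printable_runs</code> and its <code>min_length</code> parameter are hypothetical. It assumes the text you want is stored as runs of printable ASCII bytes, which is not guaranteed for <code>.doc</code> files (text may be stored in other encodings).

```python
import re


def printable_runs(data, min_length=4):
    """Return runs of printable ASCII found in raw bytes, as a list of strings."""
    # [ -~] matches the printable ASCII range (0x20 to 0x7e).
    pattern = re.compile(rb"[ -~]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up bytes standing in for the contents of a binary file.
sample = b"\x00\x01Hello world\x00\xffabc\x02Parse me\x00"
print(printable_runs(sample))
```

Runs shorter than <code>min_length</code> are dropped, which removes much of the binary noise at the cost of also dropping very short words.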

I wanted to convert a Word document to text, so I used this script:

<pre><code>import win32com.client 

app = win32com.client.Dispatch('Word.Application') 
doc = app.Documents.Open(r'C:\Users\SBYSMR10\Desktop\New folder (2)\GENERAL DATA.doc') 
content=doc.Content.Text
app.Quit()
print(content)
</code></pre>

I get the following result:

<img src="https://i.stack.imgur.com/smoWA.png" alt="enter image description here">

Now I want to convert this text into a list which contains all its items. I used

<pre><code>content = " ".join(content.replace(u"\xa0", " ").strip().split())
</code></pre>

EDIT

When I do that, I get:

<img src="https://i.stack.imgur.com/YWOAD.png" alt="enter image description here">

It's not a list. What is the problem? What is that big dot character?

Parse Word Document in Python
