
NLTK - Python: how to format raw text

Stack Overflow user
Asked 2019-01-11 01:51:29
Answers: 1, Views: 709, Followers: 0, Votes: 3

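Do you know whether it is possible to format raw text (text with no punctuation, no capitalization, and no line breaks between paragraphs) using NLTK (or any other NLP library) and Python?

I have looked through the documentation, but I could not find anything that would help with this task.

Example:

Input: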

python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader in the language community.

Thanks,


1 Answer

Stack Overflow user

Accepted answer

Posted 2019-01-11 16:12:57

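Interesting question. As for inserting the sentence boundaries, you could train NLTK's tokenizer (or rather, its sentence splitter); there is plenty of documentation if you google it. One thing you could try is to take some sentence-split text, remove the punctuation, train on that, and see what you get. Something like the code below. As mentioned, the algorithm probably relies heavily on punctuation, and in any case the code below does not work for your example sentence, but it may be worth trying with other, larger, or different-domain training text. I am not entirely sure whether this would also work for inserting commas and other punctuation that is not sentence-final or sentence-initial.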

import re

from nltk.corpus import gutenberg
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Requires the Gutenberg corpus: run nltk.download('gutenberg') once beforehand.
# Concatenate every Gutenberg text into one training string.
text = "".join(gutenberg.raw(file_id) for file_id in gutenberg.fileids())

# Remove sentence-final punctuation at line ends; you will probably want to
# include some other potential sentence-final punctuation here.
text = re.sub(r'[.?!]\n', '\n', text)

# Train Punkt on the stripped text and build a sentence tokenizer from the result.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

sentences = "python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community"
print(tokenizer.tokenize(sentences))
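
The comment on the `re.sub` line suggests covering more kinds of sentence-final punctuation. One possible way to do that is sketched below, using only the standard `re` module. The helper name `strip_sentence_final_punctuation` and the exact character sets are assumptions made for illustration; they are not part of NLTK or of the original answer. Like the original pattern, it only touches punctuation at the end of a line, so a period inside a line (for example after "Mr.") is left alone.

```python
import re

# Hypothetical helper (not an NLTK function): matches a run of sentence-final
# marks at the end of a line, even when closing quotes or brackets follow it.
# The lookahead keeps those closing characters in the text.
_SENTENCE_FINAL = re.compile(r'[.?!…]+(?=["\'”’)\]]*[ \t]*$)', re.MULTILINE)


def strip_sentence_final_punctuation(text: str) -> str:
    """Remove line-final '.', '?', '!' and '…' runs from every line of text."""
    return _SENTENCE_FINAL.sub("", text)


sample = 'Call me Ishmael.\nIs it so?\n"Aye, aye!"\nHe paused...\nMr. Smith left\n'
print(strip_sentence_final_punctuation(sample))
```

Assuming the rest of the training code stays as it is, the result of this helper could be passed to `trainer.train` in place of the output of the single `re.sub` call. Whether the extra coverage actually helps Punkt find boundaries in unpunctuated input is untested here.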
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/54139341
