Q: nltk wordpunct_tokenize vs word_tokenize

Stack Overflow user

Asked 2018-05-09 02:25:03

Answers: 1 | Views: 13.6K | Followers: 0 | Votes: 18

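Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4, and nothing in the docstring of wordpunct_tokenize explains the difference. I couldn't find this in the nltk documentation either (perhaps I didn't search in the right place!). I would have expected the first one to strip punctuation or something similar, but it doesn't.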

python

nltk

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-05-09 03:07:29

wordpunct_tokenize is based on simple regular-expression tokenization. It is defined as

wordpunct_tokenize = WordPunctTokenizer().tokenize

You can find it here. Basically, it uses the regular expression \w+|[^\w\s]+ to split the input.
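To see what that pattern does on its own, here is a minimal sketch that uses only Python's standard `re` module. The helper name `regex_wordpunct` is hypothetical and not part of NLTK; the sketch assumes that applying the quoted pattern with `findall` is a fair stand-in for what the tokenizer does.

```python
import re

# The pattern quoted in the answer: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_wordpunct(text):
    """Hypothetical helper: return every match of the pattern, in order."""
    return WORDPUNCT_PATTERN.findall(text)


print(regex_wordpunct("Don't tell her, 'Hey', she'll say!"))
```

Because `[^\w\s]+` is greedy, adjacent punctuation marks such as the closing quote and comma after Hey end up in a single token.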

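word_tokenize, on the other hand, is based on TreebankWordTokenizer; see the documentation here. It basically tokenizes text the way the Penn Treebank does. Here is a silly example that should show how the two differ.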

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
 'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
 'Hey', "',", 'she', "'", 'll', 'say', '!']

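As we can see, wordpunct_tokenize splits on every special symbol and treats each run of symbols as a separate unit. word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be that smart, though, since it fails to separate the initial single quote from 'Hey'.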
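The contraction handling can be imitated with a small extension of the same kind of regex. This is only a rough sketch under stated assumptions: the pattern and the helper name `clitic_tokenize` are made up for illustration, they are not the rules NLTK applies, and they do not reproduce its handling of quotes.

```python
import re

# Rough approximation only; NOT the rules NLTK actually applies.
CLITIC_PATTERN = re.compile(
    r"n't\b"                     # negative clitic, e.g. the n't of Don't
    r"|'(?:m|s|re|ll|ve|d)\b"    # clitics such as 'm, 's, 're, 'll
    r"|\w+?(?=n't\b)"            # the word stem directly before n't
    r"|\w+"                      # any other word
    r"|[^\w\s]+"                 # runs of punctuation
)


def clitic_tokenize(text):
    """Hypothetical helper: keep English clitics attached to their apostrophe."""
    return CLITIC_PATTERN.findall(text)


print(clitic_tokenize("I'm a dog and it's great! Don't tell her, you'll regret it!"))
```

The order of the alternatives matters: the clitic patterns come first so that an apostrophe is only emitted on its own when no known clitic follows it.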

Interestingly, if we write the sentence like this instead (single quotes as the string delimiter, double quotes around "Hey"):

>>> sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'

we get

>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
 ',', 'she', "'ll", 'say', '!']

So word_tokenize does separate double quotes, but it also converts them to `` and ''. wordpunct_tokenize doesn't do this:
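The quote conversion can be pictured with a short standard-library sketch. It assumes a simplified rule (a double quote at the start of the text, or right after whitespace or an opening bracket, counts as opening; every other double quote counts as closing). The function name `convert_double_quotes` is hypothetical, and this is not NLTK's own code.

```python
import re


def convert_double_quotes(text):
    """Hypothetical helper: rewrite " as `` (opening) or '' (closing)."""
    # Opening quote: at the start of the text, or right after whitespace
    # or an opening bracket.
    text = re.sub(r'(?:^|(?<=[\s(\[{<]))"', "``", text)
    # Any double quote still left is treated as a closing quote.
    return text.replace('"', "''")


print(convert_double_quotes('"Hey", she\'ll say!'))
```

This only rewrites the characters; it does not split the text into tokens.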

>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
 'Hey', '",', 'she', "'", 'll', 'say', '!']

Votes: 28

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50240029
