Question: Consequences of abusing nltk's word_tokenize(sent)

Stack Overflow user

Asked 2013-10-15 04:27:10

Answers: 2 | Views: 4K | Followers: 0 | Votes: 6

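I'm trying to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) on hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a paragraph instead, say at most 5 sentences? I've tried it on a few short paragraphs myself and it seems to work, but that is hardly conclusive proof.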

nltk

python

2 Answers

Stack Overflow user

Accepted answer

Posted 2013-10-15 04:46:20

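nltk.tokenize.word_tokenize(text) is simply a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regular expressions to parse a sentence.

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of a string, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The underlying tokenize method itself is very simple: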

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()

Basically, what the method normally does is tokenize the period as a separate token if it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

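Any period that falls inside the string is tokenized as part of the word it is attached to, on the assumption that it is an abbreviation: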

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behavior is acceptable, you should be fine.
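To experiment with this behavior without installing anything, here is a minimal standard-library sketch of the punctuation rules quoted above. It is not NLTK's code: the contraction rules are left out, and `treebank_like_tokenize`, `naive_sentence_split` and `tokenize_paragraph` are hypothetical names introduced only for illustration. The second half shows one possible workaround, assuming a crude sentence split is good enough for your text.

```python
import re

def treebank_like_tokenize(text):
    """Apply the punctuation rules from the quoted method (contractions omitted)."""
    # Separate most punctuation (anything that is not a word char or . ' - / , &)
    text = re.sub(r"([^\w\.\'\-\/,&])", r" \1 ", text)
    # Separate commas followed by whitespace (so 2,500 stays in one piece)
    text = re.sub(r"(,\s)", r" \1", text)
    # Separate single quotes followed by whitespace
    text = re.sub(r"('\s)", r" \1", text)
    # Separate periods that come before a newline or the end of the string
    text = re.sub(r"\. *(\n|$)", " . ", text)
    return text.split()

def naive_sentence_split(paragraph):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace.

    This is an assumption made for the sketch; it will wrongly split
    abbreviations such as "Mr. Smith".
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def tokenize_paragraph(paragraph):
    """Tokenize one sentence at a time, as the tokenizer expects."""
    tokens = []
    for sentence in naive_sentence_split(paragraph):
        tokens.extend(treebank_like_tokenize(sentence))
    return tokens

paragraph = "Hello, world. How are you?"
print(treebank_like_tokenize(paragraph))  # 'world.' stays glued together
print(tokenize_paragraph(paragraph))      # '.' becomes its own token
```

With real text, NLTK's own sent_tokenize is probably a better sentence splitter than the regex used here, since it is trained to handle abbreviations.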

Votes: 7

Stack Overflow user

Posted 2013-10-15 14:40:37

Try this hack:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
>>> # Add spaces around each punctuation character that occurs in the text
>>> for ch in set(sent):
...     if ch in punct:
...         sent = sent.replace(ch, " " + ch + " ")
...
>>> # Collapse the repeated spaces introduced by the step above
>>> sent = " ".join(sent.split())

Then, most probably, the following code is what you need to count frequencies as well =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])
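For comparison, the same idea can be sketched with the standard library alone. This is not the answerer's code: it swaps the character loop for a single `re.sub` and uses `collections.Counter` in place of `FreqDist`, and the names `space_out_punctuation` and `token_frequencies` are hypothetical. Note that, like the hack above, it also splits hyphenated words such as "in-depth".

```python
import re
from collections import Counter
from string import punctuation

def space_out_punctuation(text):
    """Put spaces around every ASCII punctuation character, then collapse whitespace."""
    spaced = re.sub("([" + re.escape(punctuation) + "])", r" \1 ", text)
    return " ".join(spaced.split())

def token_frequencies(text):
    """Count lowercased, whitespace-separated tokens after spacing out punctuation."""
    return Counter(tok.lower() for tok in space_out_punctuation(text).split())

sample = "We know the subject. We know it well."
freq = token_frequencies(sample)
for token, count in freq.most_common(3):
    print(token, count)
```

Because every period is separated here regardless of position, abbreviations such as "e.g." get broken apart, which is the opposite trade-off from the tokenizer described in the accepted answer.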

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/19373296
