Question: How can I output the tokens from the MWE Tokenizer?

Data Science user

Asked on 2019-03-18 19:36:02


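Does anyone know how to output the tokens produced by the MWE Tokenizer?

NLTK's multi-word expression tokenizer (MWETokenizer) provides a method, add_mwe(), that lets the user register multi-word expressions before running the tokenizer on text.

I currently have a file of phrases (multi-word expressions) that I want to use with the tokenizer. My concern is that I may not be passing the phrases to the method correctly, so the tokenizer does not end up with the set of expressions I intended when it tokenizes incoming text.

So my question is: does anyone know how to output the tokens that add_mwe() generates, so that I can verify I am passing the phrases to the method correctly?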

nlp

nltk

tokenization

1 Answer

Data Science user

Accepted answer

Answered on 2019-03-20 15:33:15

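You can check the exact input and output parameters of the add_mwe method in the NLTK documentation for the MWETokenizer class.

This is the expected input: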

>>> tokenizer.add_mwe(('in', 'spite', 'of'))

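So each phrase must be a tuple containing the words of that phrase, one word per element. If you provide the input in that form, you should get the expected output (in_spite_of). For convenience, I have added a complete working code snippet at the end of this answer that shows how to use the class as intended.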
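Since the question mentions a file of phrases, here is a minimal sketch of turning one-phrase-per-line text into the tuples that add_mwe expects. The list phrase_lines is a hypothetical stand-in for the lines of your file, and splitting on whitespace is an assumption: the words must be split the same way your word tokenizer splits the text.

```python
from nltk.tokenize.mwe import MWETokenizer

# Hypothetical stand-in for the lines read from a phrase file,
# one phrase per line (the empty string mimics a blank line).
phrase_lines = ["in spite of", "a little bit", ""]

mwe = MWETokenizer()
for line in phrase_lines:
    # Split into words first: passing the raw string to add_mwe would
    # store it character by character instead of word by word.
    words = tuple(line.split())
    if words:  # skip blank lines
        mwe.add_mwe(words)

result = mwe.tokenize("He won in spite of a little bit of rain".split())
print(result)
```

If a phrase contains punctuation (for example a hyphen), whitespace splitting will probably not match what a tokenizer such as wordpunct_tokenize produces, so in that case it may be safer to run the phrase through the same tokenizer you use on the text.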

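As for the output of add_mwe: the method does not return anything. Each call adds one new expression to the tokenizer's internal lexicon, and all expressions are stored in the _mwes attribute of the class. So, given mwe = MWETokenizer(), you can inspect that attribute (for example, print(mwe._mwes)) to see what the class has actually stored.

As the documentation notes, the attribute is actually a Trie containing all the terms, so it will not look exactly like the phrases you added (it is a more efficient representation). The documentation mentioned above has more details.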
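To make the Trie easier to read, the nested structure can be walked and flattened back into word tuples. This is only a sketch: stored_expressions is a helper name I made up, and it relies on the private _mwes attribute and on the Trie class in nltk.util marking the end of an expression with the Trie.LEAF key, which is an implementation detail that could change between NLTK versions.

```python
from nltk.tokenize.mwe import MWETokenizer
from nltk.util import Trie


def stored_expressions(trie, prefix=()):
    """Yield every expression stored in the trie as a tuple of words."""
    for key, child in trie.items():
        if key is Trie.LEAF:
            # Leaf marker: the words collected so far form a complete expression.
            yield prefix
        else:
            # Otherwise descend one level, extending the current prefix.
            yield from stored_expressions(child, prefix + (key,))


mwe = MWETokenizer()
mwe.add_mwe(('in', 'spite', 'of'))
mwe.add_mwe(('multi', '-', 'word'))
mwe.add_mwe(('multi', '-', 'word', 'expression'))

# _mwes is a private attribute, inspected here for debugging only.
recovered = sorted(stored_expressions(mwe._mwes))
print(recovered)
```

If the recovered tuples contain single characters instead of words, the phrases were most likely passed as plain strings rather than as tuples of words.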

Hope this helps!

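# Note: sent_tokenize needs the Punkt sentence models, downloadable once with
# nltk.download('punkt') (newer NLTK releases name the resource 'punkt_tab').
import nltk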

from nltk import (
    sent_tokenize as splitter,
    wordpunct_tokenize as tokenizer
)

from nltk.tokenize.mwe import MWETokenizer

test = """Anyone know how to output the tokens produced using MWE Tokenizer?

For a clearer explanation of what I am asking for those who did not understand my original brief question.

The multi-word expression tokenizer (MWETokenizer) provides a method/function (add_mwe()) that allows the user to enter multiple word expressions prior to using the tokenizer on text. Currently I have a file consisting of phrases / multi-word expression I want to use with the tokenizer. My concern is that the manner in which I am presenting the phrases to the function correctly and so not resulting in the desired set of tokens to be used in tokenizing the incoming text. So this leads me to ask if anyone knows how to output the token generated by this method/function so that I can verify that I am correctly passing the phrase to the function (add_mwe()).?"""

mwe = MWETokenizer()

phrases = [
    ('multi', '-', 'word'),
    ('expression', 'tokenizer'),
    ('word', 'expressions'),
    ('multi', '-', 'word', 'expression')
]

for phrase in phrases:
    mwe.add_mwe(phrase)


for sent in splitter(test):
    tokens = tokenizer(sent)
    print(' '.join(tokens))                # tokens before MWE merging
    print(' '.join(mwe.tokenize(tokens)))  # tokens after MWE merging
    print('---')



# Expected output:
#
# Anyone know how to output the tokens produced using MWE Tokenizer ?
# Anyone know how to output the tokens produced using MWE Tokenizer ?
# ---
# For a clearer explanation of what I am asking for those who did not understand my original brief question .
# For a clearer explanation of what I am asking for those who did not understand my original brief question .
# ---
# The multi - word expression tokenizer ( MWETokenizer ) provides a method / function ( add_mwe ()) that allows the user to enter multiple word expressions prior to using the tokenizer on text .
# The multi_-_word_expression tokenizer ( MWETokenizer ) provides a method / function ( add_mwe ()) that allows the user to enter multiple word_expressions prior to using the tokenizer on text .
# ---
# Currently I have a file consisting of phrases / multi - word expression I want to use with the tokenizer .
# Currently I have a file consisting of phrases / multi_-_word_expression I want to use with the tokenizer .
# ---
# ...
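As a quicker check than reading the before/after lines by eye, the merged tokens can be listed directly. This is a rough sketch: merged_expressions is a hypothetical helper, and it assumes that any output token not present in the input was created by joining an expression, which holds as long as the input does not already contain tokens joined with the separator.

```python
from nltk.tokenize.mwe import MWETokenizer


def merged_expressions(mwe, tokens):
    """Return the output tokens that were not in the input, i.e. the merged MWEs."""
    original = set(tokens)
    return [tok for tok in mwe.tokenize(tokens) if tok not in original]


# Expressions can also be passed to the constructor as a list of tuples.
mwe = MWETokenizer([('multi', '-', 'word'), ('word', 'expressions')])
tokens = ['a', 'multi', '-', 'word', 'tokenizer', 'joins', 'word', 'expressions']

merged = merged_expressions(mwe, tokens)
print(merged)
```

An empty list for a text that you know contains your phrases is a sign that the phrases were not registered in the form the tokenizer expects.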


The original page content is provided by Data Science.

Original link:

https://datascience.stackexchange.com/questions/47556
