Python mrjob - finding the 10 longest words, but mrjob returns duplicate words

Stack Overflow user
Asked 2021-10-28 10:51:33
Answers: 1 · Views: 120 · Followers: 0 · Votes: 1

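I am using Python mrjob to find the 10 longest words in a text file. I get a result, but it contains duplicate words. How can I get only unique words (that is, remove the duplicates)?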

%%file most_chars.py  
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")  # matches runs of word characters and apostrophes; used to extract words from lines below


class MostChars(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                  reducer=self.reducer_find_longest_words)
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):   
            yield None, (len(word), word.lower().strip())

    # discard the key; it is just None
    def reducer_find_longest_words(self, _, word_count_pairs):
        # each item of word_count_pairs is (length, word),
        # so yielding one results in key=length, value=word

        sorted_pair = sorted(word_count_pairs, reverse=True)
        
        for pair in sorted_pair[0:10]:
            yield pair
              
if __name__ == '__main__':
    MostChars.run()

Actual output:

18  "overcapitalization"
18  "overcapitalization"
18  "overcapitalization"
17  "uncomprehendingly"
17  "misunderstandings"
17  "disinterestedness"
17  "disinterestedness"
17  "disinterestedness"
17  "disinterestedness"
17  "conventionalities"

Expected output:

18  "overcapitalization"
17  "uncomprehendingly"
17  "misunderstandings"
17  "disinterestedness"
17  "conventionalities"

plus 5 more unique words.


1 Answer

Stack Overflow user

Accepted answer

Posted 2021-10-28 10:56:15

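Update reducer_find_longest_words so that it keeps only unique elements. Note the list(set()) idiom: each pair is converted to a tuple so it can be placed in a set, then converted back to a list.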

    def reducer_find_longest_words(self, _, word_count_pairs):
        # each item of word_count_pairs is (length, word),
        # so yielding one results in key=length, value=word

        unique_pairs = [list(x) for x in set(tuple(x) for x in word_count_pairs)]
        sorted_pair = sorted(unique_pairs, reverse=True)
        
        for pair in sorted_pair[0:10]:
            yield pair
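
The deduplication step above can also be sketched outside mrjob, which makes it easy to check in isolation. The helper name `longest_unique_words` and the sample data are hypothetical and not part of the original answer. The sketch assumes the pairs arrive as lists rather than tuples (which is what happens if values pass through a JSON-based protocol between mapper and reducer), and it swaps the full `sorted()` call for `heapq.nlargest`, which avoids sorting every pair when only the top few are needed.

```python
import heapq


def longest_unique_words(pairs, n=10):
    """Return up to n distinct (length, word) pairs, longest first.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Lists are unhashable, so convert each pair to a tuple before
    # putting it in a set. The set drops repeated (length, word) pairs.
    unique = {tuple(pair) for pair in pairs}
    # nlargest returns the n biggest items in descending order;
    # ties on length are broken by comparing the words themselves.
    return heapq.nlargest(n, unique)


# Made-up sample input with repeated words, shaped like the reducer's values.
sample = [
    [18, "overcapitalization"],
    [17, "disinterestedness"],
    [18, "overcapitalization"],
    [17, "uncomprehendingly"],
    [17, "disinterestedness"],
]

for length, word in longest_unique_words(sample, n=3):
    print(length, word)
```

Inside the reducer, the same helper could be used as `for pair in longest_unique_words(word_count_pairs): yield pair`. Note that either approach still collects every distinct word in memory in a single reducer, so it suits inputs whose vocabulary fits in memory.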
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/69752739
