I'm writing a script that takes in a number of sorted .dat files. I've included sample data below, but the real data sets are fairly large. The desired result is a single file containing an alphabetically ordered list of the words.
I'd like to know whether I'm missing anything with regard to handling a large number of words, chiefly around memory usage, error handling, logging, and testing, while still iterating over the files without interruption.
Using the data * 100000, I can sort roughly 11,000,000 lines without a problem. With larger sets, I want to sort them without crashing.
Is Python's sort() suited to this operation, or should I consider alternatives that could be faster or more efficient?
Is it worth using the multiprocessing module to help with these tasks? If so, what would the best implementation be? In my research I found this article, which might describe a similar process for handling large files, as opposed to sorting one large list at the end of a function.
The 'cost' of these operations matters a great deal for this task.
dat1 = ('allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
        'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks',
        'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
        'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks')
dat2 = ('abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside', 'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya', 'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow', 'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands', 'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam', 'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus', 'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce', 'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama', 'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo', 'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest', 'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros', 'unmistakably', 'unrepresentative', 'waviest')
dat3 = ('administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke', 'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers', 'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier', 'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying', 'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest', 'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel', 'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions', 'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic', 'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward', 'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder', 'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted', 'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving', 'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating', 'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining', 'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness')
# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()
import datetime
from functools import wraps
from time import time
datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()
# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3 # * 100000
results = []
log = () #TUPLE
# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper
@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)
                # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)
        # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)
merge_sort_lists(combined_dat)
# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len

com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert(res_len == com_length)

Posted 2018-02-14 05:22:48
It strikes me that your problem description doesn't match your code.
You describe the problem as taking in "a number of sorted .dat files" and needing "a single file with an alphabetically ordered list of words".
That's a merge; it's not even a merge sort. You also appear to be processing all of the data in memory, which is unnecessary. However, Timsort is terrifyingly fast, so loading everything into memory, sorting it, and writing it back out may well be the fastest option. That's likely true as long as the total data size doesn't exceed 1 GB, or the available RAM, whichever is smaller.
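The "1 GB or available RAM, whichever is smaller" cutoff can be checked up front before choosing an approach. A minimal sketch (the function name and the 1 GiB default budget are my own assumptions, not from the post) that picks a strategy from the combined on-disk size of the inputs:

```python
import os

def choose_strategy(filenames, budget_bytes=1 << 30):
    """Pick a merge strategy from the total input size (budget is assumed)."""
    # Sum the on-disk size of every input file.
    total = sum(os.path.getsize(fn) for fn in filenames)
    # Small enough: load everything and let Timsort do the work.
    # Otherwise: stream a k-way merge so memory stays bounded.
    return 'in-memory' if total <= budget_bytes else 'streaming'
```

A tighter budget would be the machine's free RAM at run time, but a fixed cap keeps the check dependency-free.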
Option 1: Timsort!
def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # NB: in-place sort! Don't call sorted(words)!
    with open(outfile, 'w') as out:
        out.writelines(words)  # not map(out.write, words): map() is lazy in Python 3

Option 2: ZOMG! How much data do you have??!
def merge(*iters):
    if len(iters) == 1:
        # In a generator, `return iters[0]` would discard the items.
        yield from iters[0]
        return
    iterlist = []
    values = []
    # Initialize values[] and also convert the iters tuple to a list (for del)
    for i, it in enumerate(iters):
        try:
            values.append(next(it))
            iterlist.append(it)  # Order matters: next() might throw.
        except StopIteration:
            pass
    iters = iterlist
    while values:
        nextvalue = min(values)
        yield nextvalue
        try:
            i = values.index(nextvalue)
            values[i] = next(iters[i])
        except StopIteration:
            del values[i], iters[i]

def merge_files(outfile, *filenames):
    iters = [iter(open(fn)) for fn in filenames]
    with open(outfile, 'w') as out:
        out.writelines(merge(*iters))  # unpack with *, and avoid lazy map()

Actually, this looks like it could make a nice function for itertools. It just needs a key= option, I think.
https://codereview.stackexchange.com/questions/187517