I'm writing a script that takes in a number of sorted .dat files. I've included sample data below, but the real data sets are fairly large. The desired result is a single file containing an alphabetically ordered list of the words.
I'd like to know whether I'm missing anything with regard to handling a large number of words, chiefly around memory usage, error handling, logging, and testing, while still iterating over the files without interruption.
Using the data * 100000, I can sort roughly 11,000,000 lines without a problem. With larger sets, I want to sort them without crashing.
Is Python's sort() suited to this operation, or should I consider alternatives that could be faster or more efficient?
Is it worth using the multiprocessing module to help with these tasks? If so, what would the best implementation be? In my research I found this article, which might describe a similar process for handling large files, as opposed to sorting one large list at the end of a function.
The 'cost' of these operations matters a great deal for this task.
dat1 = ('allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
        'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks',
        'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn',
        'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks')
dat2 = ('abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside', 'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya', 'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow', 'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands', 'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam', 'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus', 'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce', 'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama', 'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo', 'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest', 'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros', 'unmistakably', 'unrepresentative', 'waviest')
dat3 = ('administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke', 'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers', 'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier', 'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying', 'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest', 'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel', 'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions', 'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic', 'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward', 'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder', 'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted', 'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving', 'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating', 'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining', 'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness')
# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()
import datetime
from functools import wraps
from time import time
datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()
# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3 # * 100000
results = []
log = () #TUPLE
# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper
@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)
                # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)
        # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)
merge_sort_lists(combined_dat)
# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len

com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert(res_len == com_length)

Posted 2018-02-14 05:22:48
It strikes me that your problem description doesn't match your code.
You describe the problem as taking in "a number of sorted .dat files" and needing "a single file with an alphabetically ordered list of words".
That's a merge; it's not even a merge sort. You also appear to be processing all of the data in memory, which is unnecessary. However, Timsort is terrifyingly fast, so loading everything into memory, sorting it, and writing it back out may well be the fastest option. That's likely true as long as the total data size doesn't exceed 1 GB, or the available RAM, whichever is smaller.
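The "1 GB or available RAM, whichever is smaller" cutoff can be checked up front before choosing an approach. A minimal sketch (the function name and the 1 GiB default budget are my own assumptions, not from the post) that picks a strategy from the combined on-disk size of the inputs:

```python
import os

def choose_strategy(filenames, budget_bytes=1 << 30):
    """Pick a merge strategy from the total input size (budget is assumed)."""
    # Sum the on-disk size of every input file.
    total = sum(os.path.getsize(fn) for fn in filenames)
    # Small enough: load everything and let Timsort do the work.
    # Otherwise: stream a k-way merge so memory stays bounded.
    return 'in-memory' if total <= budget_bytes else 'streaming'
```

A tighter budget would be the machine's free RAM at run time, but a fixed cap keeps the check dependency-free.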
Option 1: Timsort!
def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # NB: in-place sort! Don't call sorted(words)!
    with open(outfile, 'w') as out:
        out.writelines(words)  # not map(out.write, words): map() is lazy in Python 3

Option 2: ZOMG! How much data do you have??!
def merge(*iters):
    if len(iters) == 1:
        # In a generator, `return iters[0]` would discard the items.
        yield from iters[0]
        return
    iterlist = []
    values = []
    # Initialize values[] and also convert the iters tuple to a list (for del)
    for i, it in enumerate(iters):
        try:
            values.append(next(it))
            iterlist.append(it)  # Order matters: next() might throw.
        except StopIteration:
            pass
    iters = iterlist
    while values:
        nextvalue = min(values)
        yield nextvalue
        try:
            i = values.index(nextvalue)
            values[i] = next(iters[i])
        except StopIteration:
            del values[i], iters[i]

def merge_files(outfile, *filenames):
    iters = [iter(open(fn)) for fn in filenames]
    with open(outfile, 'w') as out:
        out.writelines(merge(*iters))  # unpack with *, and avoid lazy map()

Actually, this looks like it could make a nice function for itertools. It just needs a key= option, I think.
https://codereview.stackexchange.com/questions/187517