I'm writing a plugin for Sublime Text. One of the open issues revolves around the fact that holding a tags file in memory can cause some form of memory or buffer overflow error (the stack in Sublime Text is apparently limited to 25 MB). I devised a simple external bucket sort to work around this:
#!/usr/bin/env python
#
# CSV external bucket sort
import codecs
import tempfile
import os

# column indexes
SYMBOL = 0
FILENAME = 1

TAG_FILE = 'tags'
OUT_FILE = ''.join([TAG_FILE, '_sorted_by_file'])


def sort():
    """External bucket sort of tab-delimited CTags files"""
    temp_files = {}

    def get_file(filename):
        """Get a file from the store of files"""
        if filename not in temp_files:
            temp_files[filename] = tempfile.NamedTemporaryFile(delete=False)
            # close and reopen using codecs to avoid problems described here:
            # http://stackoverflow.com/a/10490859/613428
            temp_files[filename].close()
            temp_files[filename] = codecs.open(
                temp_files[filename].name, 'w+', 'utf-8', 'ignore')
        return temp_files[filename]

    try:
        with codecs.open(TAG_FILE, 'r+', 'utf-8', 'ignore') as file_o:
            for _ in range(6):  # skip the header
                next(file_o)
            for line in file_o:
                temp_file_o = get_file(line.split('\t')[FILENAME])
                split = line.split('\t')
                split[FILENAME] = split[FILENAME].lstrip('.\\')
                temp_file_o.write('\t'.join(split))
        with codecs.open(OUT_FILE, 'w+', 'utf-8', 'ignore') as file_o:
            # we only need to sort the file names - the symbols were
            # already sorted!
            for key in sorted(temp_files):
                temp_files[key].seek(0)
                file_o.write(temp_files[key].read())
    finally:
        for key in temp_files:
            temp_files[key].close()
            os.remove(temp_files[key].name)
        os.remove(OUT_FILE)  # just for testing - remove when done

Here is what I intend it to do:
For each line (a.k.a. tag) in a tag file
Read line into memory
Get the filename from the line.
Use filename as key to a temp file and write this whole line to said file
Sort final list of keys (i.e. filenames)
For each file in sorted key list
Append contents of file to sorted tag file
Close all files and delete temp files

And the performance:
$ python -m timeit -n 100 'import external_sort; external_sort.sort()'
100 loops, best of 3: 12 msec per loop

My question is: is this as good as it can get? Can I improve it in any way, e.g. are there any corner cases I'm missing?
Note: I know it isn't really a true bucket sort, since the buckets are neither arbitrary nor equally sized (a project with many small files and a few large ones would produce unbalanced buckets). However, I'm counting on there being many average-sized files, none large enough to fill memory on its own.
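To see why concatenating the buckets in sorted filename order is enough, here is a minimal sketch of the same idea (the sample tags and helper names are hypothetical, not the real ctags format): because each bucket preserves the original symbol order, the concatenation comes out sorted by (filename, symbol).

```python
# tags already sorted by symbol, as a ctags file would emit them
tags = [('alpha', 'b.py'), ('beta', 'a.py'),
        ('gamma', 'b.py'), ('zeta', 'a.py')]

# bucket by filename, preserving the incoming (symbol) order per bucket
buckets = {}
for symbol, filename in tags:
    buckets.setdefault(filename, []).append((symbol, filename))

# concatenate buckets in sorted-filename order
merged = [tag for filename in sorted(buckets) for tag in buckets[filename]]

# equivalent to a full sort by (filename, symbol)
assert merged == sorted(tags, key=lambda t: (t[1], t[0]))
```

The `assert` holds precisely because bucketing is stable: only the filenames need sorting, never the symbols.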
Posted on 2014-03-15 09:30:09
split('\t') and lstrip('.\\') handle UTF-8 encoded strings just fine. file_o.write(temp_files[key].read()) reads an entire file into memory, though. To save memory, read in a loop and pass a chunk-size argument to read. https://codereview.stackexchange.com/questions/43760