Python (List) De/Serialization Performance

Stack Overflow user
Asked on 2014-04-18 19:46:25
Answers: 1 · Views: 906 · Followers: 0 · Votes: 6

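I am writing a script that needs to process a fairly large lexicon (620,000 words) on startup. The input lexicon is processed word by word into a defaultdict(list), where the keys are letter bi- and trigrams and the values are lists of the words that contain the key's letter n-gram.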

for word in lexicon_file:
    word = word.lower()
    for letter n-gram in word:
        lexicon[n-gram].append(word)

For example:

> lexicon["ab"]
["abracadabra", "abbey", "abnormal"]
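
The pseudocode above could be made concrete roughly as follows. This is only a minimal Python 3 sketch, not the asker's actual code: the helper name `build_lexicon`, the `sizes` parameter (n-gram lengths 2 and 3), and the use of a set so that a word is listed only once per n-gram are all assumptions made for illustration.

```python
from collections import defaultdict


def build_lexicon(words, sizes=(2, 3)):
    """Map each letter n-gram to the list of words containing it.

    `build_lexicon` and `sizes` are hypothetical names for illustration.
    """
    lexicon = defaultdict(list)
    for word in words:
        word = word.strip().lower()
        # Collect the distinct n-grams so a word is listed once per key.
        ngrams = {word[i:i + n] for n in sizes for i in range(len(word) - n + 1)}
        for ngram in ngrams:
            lexicon[ngram].append(word)
    return lexicon


lexicon = build_lexicon(["Abracadabra", "abbey", "abnormal"])
print(lexicon["ab"])  # ['abracadabra', 'abbey', 'abnormal']
```

To feed it a real file, the words could be read with something like `open(path, encoding="windows-1250")`, where `path` stands in for the lexicon file's location.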

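The resulting structure contains 25,000 keys, each holding a list of between 1 and 133,000 strings (average 500, median 20). All strings are encoded in windows-1250.

This processing takes a while (negligible compared with the script's expected real run time, but tedious during testing), and since the lexicon itself never changes, I thought it might be faster to serialize the resulting defaultdict(list) and deserialize it on every subsequent startup.

What I found is that even with cPickle, deserialization is about twice as slow as simply processing the lexicon, with average values close to: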

> normal lexicon creation
45 seconds
> cPickle deserialization
80 seconds

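I have no experience with serialization, but I expected deserialization to be faster than the normal processing, at least with the cPickle module.

My question is: is this result to be expected, and why? Is there any faster way to store and load my structure?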


1 Answer

Stack Overflow user

Accepted answer

Posted on 2014-04-18 21:49:58

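The best way to approach this kind of problem is to write a set of tests and use timeit to see which one is faster. I ran some tests below, but you should try them with your own lexicon dictionary, because your results may differ.

If you want more stable (more accurate) timings, you can increase the number argument to timeit; this only makes the tests take longer. Also note that the value timeit returns is the total execution time, not the time per run.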
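
As a small illustration of that last point, here is a Python 3 sketch that converts the total reported by `timeit` into a per-run figure. The `work` function is only a placeholder workload and is not part of the tests in this answer.

```python
from timeit import timeit


def work():
    # Placeholder workload standing in for a serialize/unserialize test.
    return sorted(str(i) for i in range(1000))


runs = 10
total = timeit(work, number=runs)  # total seconds across all `runs` calls
per_run = total / runs             # divide to get the average per call
print("total: %.6f s, per run: %.6f s" % (total, per_run))
```

The figures listed below are totals for `number=10`, so dividing each by 10 gives an approximate per-run time.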

testing with 10 keys...
serialize flat: 2.97198390961
serialize eval: 4.60271120071
serialize defaultdict: 20.3057091236
serialize dict: 20.2011070251
serialize defaultdict new pickle: 14.5152060986
serialize dict new pickle: 14.7755970955
serialize json: 13.5039670467
serialize cjson: 4.0456969738
unserialize flat: 1.29577493668
unserialize eval: 25.6548647881
unserialize defaultdict: 10.2215960026
unserialize dict: 10.208122015
unserialize defaultdict new pickle: 5.70747089386
unserialize dict new pickle: 5.69750404358
unserialize json: 5.34811091423
unserialize cjson: 1.50241613388
testing with 100 keys...
serialize flat: 2.91076397896
serialize eval: 4.72978711128
serialize defaultdict: 21.331786871
serialize dict: 21.3218340874
serialize defaultdict new pickle: 15.7140991688
serialize dict new pickle: 15.6440980434
serialize json: 14.3557379246
serialize cjson: 5.00576901436
unserialize flat: 1.6677339077
unserialize eval: 22.9142649174
unserialize defaultdict: 10.7773029804
unserialize dict: 10.7524499893
unserialize defaultdict new pickle: 6.13370203972
unserialize dict new pickle: 6.18057107925
unserialize json: 5.92281794548
unserialize cjson: 1.91151690483

Code:

import cPickle
import json
try:
    import cjson  # not Python standard library
except ImportError:
    cjson = False
from collections import defaultdict

dd1 = defaultdict(list)
dd2 = defaultdict(list)

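# Build two test dictionaries of one million strings each,
# spread over 10 and 100 keys respectively.
for i in xrange(1000000):
    dd1[str(i % 10)].append(str(i))
    dd2[str(i % 100)].append(str(i))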

dt1 = dict(dd1)
dt2 = dict(dd2)

from timeit import timeit

def testdict(dd, dt):
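    # Pickle files are opened in binary mode ('wb') so that the
    # binary protocol is also written correctly on Windows.
    def serialize_defaultdict():
        with open('defaultdict.pickle', 'wb') as f:
            cPickle.dump(dd, f)

    def serialize_p2_defaultdict():
        with open('defaultdict.pickle2', 'wb') as f:
            cPickle.dump(dd, f, -1)  # -1 selects the highest protocol

    def serialize_dict():
        with open('dict.pickle', 'wb') as f:
            cPickle.dump(dt, f)

    def serialize_p2_dict():
        with open('dict.pickle2', 'wb') as f:
            cPickle.dump(dt, f, -1)  # -1 selects the highest protocol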

    def serialize_json():
        with open('dict.json', 'w') as f:
            json.dump(dt, f)

    if cjson:
        def serialize_cjson():
            with open('dict.cjson', 'w') as f:
                f.write(cjson.encode(dt))

    def serialize_flat():
        with open('dict.flat', 'w') as f:
            f.write('\n'.join([' '.join([k] + v) for k, v in dt.iteritems()]))

    def serialize_eval():
        with open('dict.eval', 'w') as f:
            f.write('\n'.join([k + '\t' + repr(v) for k, v in dt.iteritems()]))

    # Pickle files are read back in binary mode ('rb').
    def unserialize_defaultdict():
        with open('defaultdict.pickle', 'rb') as f:
            assert cPickle.load(f) == dd

    def unserialize_p2_defaultdict():
        with open('defaultdict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dd

    def unserialize_dict():
        with open('dict.pickle', 'rb') as f:
            assert cPickle.load(f) == dt

    def unserialize_p2_dict():
        with open('dict.pickle2', 'rb') as f:
            assert cPickle.load(f) == dt

    def unserialize_json():
        with open('dict.json') as f:
            assert json.load(f) == dt

    if cjson:
        def unserialize_cjson():
            with open('dict.cjson') as f:
                assert cjson.decode(f.read()) == dt

    def unserialize_flat():
        with open('dict.flat') as f:
            dtx = {}
            for line in f:
                vals = line.split()
                dtx[vals[0]] = vals[1:]
            assert dtx == dt

    def unserialize_eval():
        with open('dict.eval') as f:
            dtx = {}
            for line in f:
                vals = line.split('\t')
                dtx[vals[0]] = eval(vals[1])
            assert dtx == dt

    print 'serialize flat:', timeit(serialize_flat, number=10)
    print 'serialize eval:', timeit(serialize_eval, number=10)
    print 'serialize defaultdict:', timeit(serialize_defaultdict, number=10)
    print 'serialize dict:', timeit(serialize_dict, number=10)
    print 'serialize defaultdict new pickle:', timeit(serialize_p2_defaultdict, number=10)
    print 'serialize dict new pickle:', timeit(serialize_p2_dict, number=10)
    print 'serialize json:', timeit(serialize_json, number=10)
    if cjson:
        print 'serialize cjson:', timeit(serialize_cjson, number=10)
    print 'unserialize flat:', timeit(unserialize_flat, number=10)
    print 'unserialize eval:', timeit(unserialize_eval, number=10)
    print 'unserialize defaultdict:', timeit(unserialize_defaultdict, number=10)
    print 'unserialize dict:', timeit(unserialize_dict, number=10)
    print 'unserialize defaultdict new pickle:', timeit(unserialize_p2_defaultdict, number=10)
    print 'unserialize dict new pickle:', timeit(unserialize_p2_dict, number=10)
    print 'unserialize json:', timeit(unserialize_json, number=10)
    if cjson:
        print 'unserialize cjson:', timeit(unserialize_cjson, number=10)

print 'testing with 10 keys...'
testdict(dd1, dt1)

print 'testing with 100 keys...'
testdict(dd2, dt2)
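
In the timings listed earlier, the flat text format was the fastest to load. Applied to a lexicon like the one in the question, that format might look roughly like the Python 3 sketch below. The function names `save_flat` and `load_flat`, the file name `lexicon.flat`, and the sample data are all hypothetical, and the approach assumes that neither keys nor words contain whitespace. The windows-1250 encoding is taken from the question.

```python
import os
import tempfile


def save_flat(lexicon, path, encoding="windows-1250"):
    """Write one line per key: the key followed by its words, space-separated."""
    with open(path, "w", encoding=encoding) as f:
        for key, words in lexicon.items():
            f.write(" ".join([key] + words) + "\n")


def load_flat(path, encoding="windows-1250"):
    """Rebuild the dict; the first token on each line is the key."""
    lexicon = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                lexicon[tokens[0]] = tokens[1:]
    return lexicon


# Hypothetical sample data, written to a temporary directory.
sample = {"ab": ["abracadabra", "abbey", "abnormal"], "bb": ["abbey"]}
path = os.path.join(tempfile.mkdtemp(), "lexicon.flat")
save_flat(sample, path)
restored = load_flat(path)
print(restored == sample)  # True
```

`load_flat` returns a plain dict; if the missing-key behaviour is needed, it can be wrapped as `defaultdict(list, restored)`. Whether this beats rebuilding the lexicon from scratch would still need to be measured on the real data.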
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/23161166
