I have a speed/efficiency question about Python:
I need to extract several fields from a nested JSON file and write them to .txt files (~64k lines each; the current snippet finishes in ~9 minutes), where every line can contain floats and strings.
Normally I would just put all the data into numpy and save it with np.savetxt().
I assembled each line into a string in the straightforward way, but that is quite slow. This is what I am doing so far,
and I have a couple of questions about it:
There are the file.write() calls, which also seem very slow (roughly 64k × 8 calls, for the 8 output files). So my question is:
How should I trade off speed against memory consumption to write to disk most efficiently? Should I raise DEFAULT_BUFFER_SIZE (currently 8192)? I have looked at "File I/O in every programming language" and the Python docs on io, but apart from learning that file objects should already be buffered in Python 3.6.x, they did not help much; I did find that my DEFAULT_BUFFER_SIZE is 8192.
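Since the question hinges on DEFAULT_BUFFER_SIZE, here is a minimal sketch of checking it and requesting a larger per-file buffer via the `buffering` argument of `open()`; the temp-file path is just a throwaway for illustration:

```python
import io
import os
import tempfile

# CPython's default buffer size for buffered binary streams:
print(io.DEFAULT_BUFFER_SIZE)  # 8192

# A larger buffer can be requested per file; whether it actually helps
# is workload-dependent, since writes are already buffered by default.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "w", encoding="utf-8", buffering=1 << 20) as f:  # 1 MiB buffer
    f.write("one line\n")
```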
Here is my snippet:
import json

def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result
def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    # It currently has 8 lines of file.write(), which is probably making it slow,
    # as writing to disk involves a lot of overhead as well
    features = ['meta_tags__twitter-data1', 'url', 'meta_tags__article-author', 'domain', 'title',
                'published__$date', 'content', 'meta_tags__twitter-description']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)), 'w', encoding='utf-8')
                     for feat in features]
    with open(os.path.join(PATH_TO_RAW_DATA, inp_filename),
              encoding='utf-8') as inp_json_file:
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            content = json_data['meta_tags']["twitter:data1"].replace('\n', ' ').replace('\r', ' ').split()[0]
            feature_files[0].write(content + '\n')
            content = json_data['url'].split('/')[-1].lower()
            feature_files[1].write(content + '\n')
            content = json_data['meta_tags']['article:author'].split('/')[-1].replace('@', '').lower()
            feature_files[2].write(content + '\n')
            content = json_data['domain']
            feature_files[3].write(content + '\n')
            content = json_data['title'].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[4].write(content + '\n')
            content = json_data['published']['$date']
            feature_files[5].write(content + '\n')
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content = strip_tags(content).lower()
            content = re.sub(r"[^a-zA-Z0-9]", " ", content)
            feature_files[6].write(content + '\n')
            content = json_data['meta_tags']["twitter:description"].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[7].write(content + '\n')

Posted on 2018-10-13 09:58:07
From the comments:
Why do you think that 8 writes result in 8 physical writes to your hard drive? The file object itself buffers what is to be written; if it decides to hand data to your operating system, the OS may well wait a while before actually writing anything, and even then your hard drive has its own buffers that can hold the file contents for some time before it really starts writing. See How often does Python flush to a file?
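The layering the comment describes can be observed directly: a small write sits in the Python-level buffer and does not reach the file until a flush. A minimal sketch using a throwaway temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = open(path, "w", encoding="utf-8")
f.write("hello")
size_before = os.path.getsize(path)   # 0: "hello" is still in Python's buffer
f.flush()
size_after = os.path.getsize(path)    # 5: the flush pushed it to the OS
f.close()
print(size_before, size_after)
```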
You should not use exceptions as control flow, nor use recursion where it is not needed. Each recursion sets up a new call-stack frame for the function call; that costs resources and time, and all of it has to be unwound again afterwards.
The best approach would be to clean your data before feeding it into json.loads(). The next best thing is to avoid the recursion... Try something like:
def read_json_line(line=None):
    result = None
    while result is None and line:  # empty line is falsy, avoids an endless loop
        try:
            result = json.loads(line)
        except Exception as e:
            result = None
            # Find the offending character index:
            idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
            # Slice away the offending character:
            line = line[:idx_to_replace] + line[idx_to_replace + 1:]
    return result

https://stackoverflow.com/questions/52791502
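As a quick sanity check of the loop version (the function is reproduced here so the snippet is self-contained), a line with trailing garbage after valid JSON is repaired in one pass: json.loads reports the offending index in its error message, and the loop slices that character away and retries.

```python
import json

def read_json_line(line=None):
    result = None
    while result is None and line:  # empty line is falsy, avoids an endless loop
        try:
            result = json.loads(line)
        except Exception as e:
            result = None
            # Error messages end with "(char N)"; pull out N:
            idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
            # Slice away the offending character and retry:
            line = line[:idx_to_replace] + line[idx_to_replace + 1:]
    return result

# '{"a": 1}x' fails with "Extra data: ... (char 8)"; stripping char 8 leaves valid JSON.
print(read_json_line('{"a": 1}x'))  # {'a': 1}
```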