Python process killed after running out of 256 GB of memory

Stack Overflow user
Asked on 2022-09-28 23:06:08
1 answer · 54 views · 0 following · 0 votes

The training dataset is a 42 GB JSON file. `mesh` is a Medical Subject Heading; treat it as an identifier or label. `neighbors_mesh` is a 28,000-dimensional list holding information about meshes that are very close to each other; we obtained it via KNN from the training mesh entries of 1.07 M records. The MLB (MultiLabelBinarizer) transform returns a 28,000-dimensional vector of 0s and 1s, but by default each element is an int64. I tried to shrink it with mask.astype(np.int_), which still leaves 32-bit elements.
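For a sense of scale, the per-element dtype matters a lot at 28,000 columns per article: int64 costs 8 bytes per element, uint8 costs 1. A minimal sketch (NumPy only; the variable names are illustrative, not from the asker's code):

```python
import numpy as np

n = 28000  # width of one binarized mesh-mask row

mask64 = np.zeros((1, n), dtype=np.int64)  # default integer dtype on most platforms
mask8 = mask64.astype(np.uint8)            # 1 byte per element instead of 8

print(mask64.nbytes)  # 224000 bytes per article
print(mask8.nbytes)   # 28000 bytes per article

# Note: calling .tolist() afterwards discards the compact dtype --
# every element becomes a full Python int object again.
```

So downcasting helps only while the data stays inside a NumPy array; converting each mask to a nested Python list undoes the saving.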

After roughly one million iterations, the loop has consumed the full 256 GB of memory and the process still gets killed.

My Python version is 3.9. The machine has 256 GB of RAM, 20 GB of swap, a 48-core CPU, and a GPU.

Code language: python
import sys

import ijson
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import MultiLabelBinarizer


def build_dataset(train_path, neighbors, journal_mesh, MeSH_id_pair_file, index_dic):

    mapping_id = {}
    with open(MeSH_id_pair_file, 'r') as f:
        for line in f:
            (key, value) = line.split('=')
            mapping_id[key] = value.strip()

    meshIDs = list(mapping_id.values())
    meshIDs = label2index(meshIDs, index_dic)
    meshIDs_str = [str(x) for x in meshIDs]

    print('Total number of labels %d' % len(meshIDs_str))
    mlb = MultiLabelBinarizer(classes=meshIDs_str)
    mlb.fit(meshIDs_str)

    pmid_neighbors, neighbors_mesh = read_neighbors(neighbors, index_dic)

    f = open(train_path, encoding="utf8")
    objects = ijson.items(f, 'articles.item')
    

    dataset = []
    print("Objects: ", type(objects))
    print("pmid neighboors: ", type(pmid_neighbors))

    for i, obj in enumerate(tqdm(objects)):
        data_point = {}
        try:
            ids = obj["pmid"]
            heading = obj['title'].strip()
            heading = heading.translate(str.maketrans('', '', '[]'))
            abstract = obj["abstractText"].strip()
            clean_abstract = abstract.translate(str.maketrans('', '', '[]'))
            if len(heading) == 0 or heading == 'In process':
                print('paper ', ids, ' does not have title!')
                continue
            elif len(clean_abstract) == 0:
                print('paper ', ids, ' does not have abstract!')
                continue
            else:
                mesh_id = obj['mesh']
                journal = obj['journal']
                year = obj['year']
                mesh_from_journal = journal_mesh[journal]
                mesh_from_neighbors = []
                if i < len(pmid_neighbors) and ids == pmid_neighbors[i]:
                    mesh_from_neighbors = neighbors_mesh[i]
                mesh_from_journal_str = [str(x) for x in mesh_from_journal]
                mesh_from_neighbors_str = [str(x) for x in mesh_from_neighbors]
                mesh = list(set(mesh_from_journal_str + mesh_from_neighbors_str))
                mask = mlb.transform([mesh])  # transform, not fit_transform: the classes are already fixed
                mask = mask.astype(np.int_)
                mask = mask.tolist()
                print("MEsh Size: ", sys.getsizeof(mask))
                print("Mesh content size: ", sys.getsizeof(mask[0][0]))
                print("Mesh content type: ", type(mask[0][0]))
                data_point['pmid'] = ids
                data_point['title'] = heading
                data_point['abstractText'] = clean_abstract
                data_point['meshID'] = mesh_id
                data_point['meshMask'] = mask
                data_point['year'] = year
                dataset.append(data_point)
                print("dataset Size: ", sys.getsizeof(dataset))
        

        except AttributeError as e:
            print(f'An exception occurred for pmid: {obj["pmid"].strip()}', e.args)


    pubmed = {'articles': dataset}
    return pubmed
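A possible memory-saving direction (an assumption on my part, not the asker's code): each article's meshMask stores 28,000 values even though only a handful of labels are set, so storing just the indices of the active labels, or passing `sparse_output=True` to `MultiLabelBinarizer` so `transform` returns a SciPy sparse matrix, would shrink each row drastically. A stdlib-only sketch of the index approach, with hypothetical names (`index_of`, `mesh_mask_indices`):

```python
# Hypothetical sketch: store only the indices of active labels
# instead of a dense 28,000-element 0/1 list per article.
classes = [str(i) for i in range(28000)]  # stand-in for meshIDs_str
index_of = {label: i for i, label in enumerate(classes)}

def mesh_mask_indices(mesh_labels):
    """Return the sorted label indices; unknown labels are skipped."""
    return sorted(index_of[m] for m in mesh_labels if m in index_of)

print(mesh_mask_indices(['3', '17', '4242']))  # [3, 17, 4242]
```

A dense mask can always be rebuilt from the indices on demand, so nothing is lost by storing the compact form in the dataset.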

1 Answer

Stack Overflow user

Answered on 2022-09-28 23:25:30

I got the code to run to completion by adding f.close() after the iteration finishes. The result is an 88 GB dataset. But I am still curious why it takes up so much space.
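One likely contributor to the confusion (my assumption, based on how `sys.getsizeof` works): it reports only the container object itself, not the nested lists or the int objects inside, so the sizes printed in the question's loop understate the real footprint. A small illustration:

```python
import sys

# shape matches one meshMask row after .tolist(): a list holding one inner list
mask = [[0] * 28000]

print(sys.getsizeof(mask))     # outer list only: a few dozen bytes (one pointer)
print(sys.getsizeof(mask[0]))  # inner list: ~8 bytes of pointer per element

# Neither number counts the int objects themselves; a true deep
# measurement must recurse through every nested container.
```

With a million articles, even the per-row pointer arrays alone add up to hundreds of gigabytes, which is consistent with the 88 GB on-disk dataset.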

0 votes
Original page content provided by Stack Overflow; translation supported by Tencent Cloud's dedicated IT-domain engine.
Original link:

https://stackoverflow.com/questions/73888672
