The training dataset is a 42 GB JSON file. `mesh` is a Medical Subject Heading; treat it as an identifier or label. `neighbors_mesh` is a 28,000-dimensional list holding the mesh information of items that are very close to one another; we obtained it by running KNN over the 1.07 M mesh items in the training set. The MLB transform returns a 28,000-dimensional vector of 0s and 1s, but by default every element is int64. I tried `mask.astype(np.int_)` to shrink it, but that still leaves 32 bits per element.
After roughly one million iterations, the loop has filled the 256 GB of memory and the process still gets killed.
My Python version is 3.9. The machine has 256 GB of RAM, 20 GB of swap, a 48-core CPU, and a GPU.
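As a quick back-of-the-envelope check (a minimal sketch, independent of the code below), here is what one 28,000-dimensional binary mask costs at different NumPy dtypes:

```python
import numpy as np

# one 28,000-dimensional binary mask, as produced by the MLB transform
mask = np.zeros(28000, dtype=np.int64)
print(mask.nbytes)                    # 224000 bytes (8 per element)

# downcasting to uint8 cuts it 8x
print(mask.astype(np.uint8).nbytes)   # 28000 bytes (1 per element)

# packing to one bit per label is the densest dense representation
print(np.packbits(mask.astype(np.uint8)).nbytes)  # 3500 bytes
```

Note that `np.int_` is platform-dependent (64-bit on most Linux builds, 32-bit on Windows), which may explain why `astype(np.int_)` did not shrink the array as expected; `np.uint8` is explicit and sufficient for a 0/1 mask.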
def build_dataset(train_path, neighbors, journal_mesh, MeSH_id_pair_file, index_dic):
    mapping_id = {}
    with open(MeSH_id_pair_file, 'r') as f:
        for line in f:
            (key, value) = line.split('=')
            mapping_id[key] = value.strip()
    meshIDs = list(mapping_id.values())
    meshIDs = label2index(meshIDs, index_dic)
    meshIDs_str = [str(x) for x in meshIDs]
    print('Total number of labels %d' % len(meshIDs_str))
    mlb = MultiLabelBinarizer(classes=meshIDs_str)
    mlb.fit(meshIDs_str)
    pmid_neighbors, neighbors_mesh = read_neighbors(neighbors, index_dic)
    f = open(train_path, encoding="utf8")
    objects = ijson.items(f, 'articles.item')
    dataset = []
    print("Objects: ", type(objects))
    print("pmid neighbors: ", type(pmid_neighbors))
    for i, obj in enumerate(tqdm(objects)):
        data_point = {}
        try:
            ids = obj["pmid"]
            heading = obj['title'].strip()
            heading = heading.translate(str.maketrans('', '', '[]'))
            abstract = obj["abstractText"].strip()
            clean_abstract = abstract.translate(str.maketrans('', '', '[]'))
            if len(heading) == 0 or heading == 'In process':
                print('paper ', ids, ' does not have title!')
                continue
            elif len(clean_abstract) == 0:
                print('paper ', ids, ' does not have abstract!')
                continue
            else:
                mesh_id = obj['mesh']
                journal = obj['journal']
                year = obj['year']
                mesh_from_journal = journal_mesh[journal]
                mesh_from_neighbors = []
                if i < len(pmid_neighbors) and ids == pmid_neighbors[i]:
                    mesh_from_neighbors = neighbors_mesh[i]
                mesh_from_journal_str = [str(x) for x in mesh_from_journal]
                mesh_from_neighbors_str = [str(x) for x in mesh_from_neighbors]
                mesh = list(set(mesh_from_journal_str + mesh_from_neighbors_str))
                mask = mlb.fit_transform([mesh])
                mask = mask.astype(np.int_)
                mask = mask.tolist()
                print("Mesh size: ", sys.getsizeof(mask))
                print("Mesh content size: ", sys.getsizeof(mask[0][0]))
                print("Mesh content type: ", type(mask[0][0]))
                data_point['pmid'] = ids
                data_point['title'] = heading
                data_point['abstractText'] = clean_abstract
                data_point['meshID'] = mesh_id
                data_point['meshMask'] = mask
                data_point['year'] = year
                dataset.append(data_point)
                print("dataset size: ", sys.getsizeof(dataset))
        except AttributeError as e:
            print(f'An exception occurred for pmid: {obj["pmid"].strip()}', e.args)
    pubmed = {'articles': dataset}
    return pubmed

Posted 2022-09-28 23:25:30
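One likely culprit for the RAM usage (a sketch, assuming 64-bit CPython) is the `mask.tolist()` call. A Python list stores an 8-byte object pointer per slot, so the container alone costs roughly 224 KB per article, no matter what dtype the array had before conversion (the 0 and 1 int objects themselves are cached singletons, so the pointers are essentially the whole cost). A million such lists held in `dataset` at once approach the machine's 256 GB:

```python
import sys
import numpy as np

mask = np.zeros(28000, dtype=np.int64)
as_list = mask.tolist()

# on 64-bit CPython, a list holds an 8-byte pointer per slot,
# so the container alone is ~224 KB per article
per_article = sys.getsizeof(as_list)
print(per_article)

# rough total for ~1M articles kept in memory simultaneously
print(per_article * 1_000_000 // 1024**3, "GB (rough estimate)")
```

Keeping the mask as a `uint8` NumPy array (or using `MultiLabelBinarizer(sparse_output=True)`, which returns a scipy sparse matrix) instead of calling `.tolist()` would avoid this blow-up.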
After the iteration finished, I got the code to run to completion by adding f.close(). The result is an 88 GB dataset, but I am still curious why it takes up so much space.
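As for the 88 GB on disk (a rough sketch, assuming the result is dumped as plain JSON): serialising a dense 28,000-element 0/1 list costs about three characters per element, i.e. roughly 84 KB of text per article before the title and abstract are even counted. Storing only the indices of the active labels would be orders of magnitude smaller:

```python
import json

# a dense mask with a single active label, as stored in 'meshMask'
mask = [0] * 28000
mask[5] = 1

dense = json.dumps(mask)
print(len(dense))   # 84000 characters: ~84 KB of text per article

# sparse alternative: keep only the indices of the 1s
sparse = json.dumps([i for i, v in enumerate(mask) if v])
print(len(sparse))  # 3 characters: "[5]"
```

At ~84 KB of mask text per article, a corpus of about a million articles would already account for roughly 80 GB of JSON, which would line up with the 88 GB observed.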
https://stackoverflow.com/questions/73888672