
Matlab 7.3 files written from Python with hdf5storage are inflated and slow to create

Stack Overflow user

Asked 2022-02-23 01:36:20

1 answer, 267 views, 0 followers, 2 votes

I am trying to write numpy data to a .mat file using hdf5storage.

import numpy as np
import hdf5storage

# For example: a list of one-element structured arrays, one per packet
packet_dtype = [("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")]
numpy_array = [
    np.array([(b"<detect>", 192, 1)], dtype=packet_dtype),
    np.array([(b"<detect>", 192, 2)], dtype=packet_dtype),
]

# The actual packet is 192 bytes, and the binary file I am trying to
# convert to a .mat file contains thousands of these packets.

data = {"data": numpy_array}

hdf5storage.savemat(file_name="data.mat", mdict=data, format="7.3")

Using this call, or equivalently

hdf5storage.write(data, '.', 'data.mat', matlab_compatible=True)

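The file expands to more than 10x the size of the binary file, and writing it takes on the order of an hour. The data is a Python list of numpy arrays built from basic C types. This seems off, but I am not very experienced with the HDF5 format, so it may be expected.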

When I test saving a similar variable from MATLAB,

save("test.mat", 'variable', '-v7.3')

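the file is still much larger than the binary. So, as @hpaulj pointed out, HDF5 is not a compact format. But the time it takes to save from Python is also unacceptable. In MATLAB the file saves in a few seconds; saving the same data with the hdf5storage library takes about an hour. Maybe the library just isn't performant?

While it is running, however, iotop shows disk writes of 2-3 M/s, while the file itself grows at only 0.5 MB/s.

I would like to avoid splitting the output into separate .mat files.

With the v5 savemat I can save files up to MATLAB's 2 GB limit, but we are generating more data than that and would like to use the MATLAB v7.3 format. So the question is whether the hdf5storage library is still valid.

Are there numpy type restrictions associated with the MATLAB v7.3 format?

Why do these files inflate? Are there options in hdf5storage that I have missed? I have gone through the documentation and partway through the code, without success.

Alternatively, I could try loading a plain HDF5 file into MATLAB.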

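import h5py
import numpy as np

# h5py cannot store the dict itself, so stack the per-packet arrays first
packets = np.concatenate(numpy_array)
with h5py.File("test.h5", "w") as hf:
    hf.create_dataset("data", data=packets)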

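Edit: I found that my trouble was probably caused by inconsistent data shapes. I can have variable-size packets. Apparently HDF5 does not handle this well, so structuring the data to be homogeneous is important.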
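One way to act on this edit is to pack every packet into a single structured array with a fixed-width layout, so the whole capture becomes one homogeneous array instead of a list of small ones. The following is a minimal sketch, not the asker's actual code: the `payload` field, the `MAX_PAYLOAD` width, and the `pack_packets` helper are hypothetical names introduced for illustration, and it assumes variable-length payloads can be zero-padded as long as their true length is kept in `n_bytes`.

```python
import numpy as np

# Hypothetical fixed-width layout: the payload field and its width are
# assumptions for illustration, not part of the original packet format.
MAX_PAYLOAD = 8
PACKET_DTYPE = np.dtype([
    ("packet_sync", "S8"),
    ("n_bytes", "<u4"),
    ("payload", "u1", (MAX_PAYLOAD,)),
])


def pack_packets(payloads):
    """Pack variable-length byte payloads into one homogeneous array."""
    packets = np.zeros(len(payloads), dtype=PACKET_DTYPE)
    packets["packet_sync"] = b"<detect>"
    for i, raw in enumerate(payloads):
        if len(raw) > MAX_PAYLOAD:
            raise ValueError(f"payload {i} exceeds {MAX_PAYLOAD} bytes")
        packets["n_bytes"][i] = len(raw)  # true length before padding
        packets["payload"][i, : len(raw)] = np.frombuffer(raw, dtype="u1")
    return packets


packets = pack_packets([b"\x01\x02\x03", b"\x04", b"\x05\x06\x07\x08\x09"])
print(packets.shape, packets.dtype.itemsize)
```

Every record now has the same size and dtype, at the cost of some padding. A reader can recover the original payload with `packets["payload"][i, : packets["n_bytes"][i]]`.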

python

matlab

hdf5

h5py

1 Answer

Stack Overflow user

Answered 2022-02-23 02:10:37

I don't have hdf5storage.

In [21]: numpy_array = np.array(
    ...:     [(b"<detect>", 192, 1)],
    ...:     dtype=[("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")],
    ...: ) 
In [22]: numpy_array
Out[22]: 
array([(b'<detect>', 192, 1)],
      dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')])
In [23]: numpy_array.nbytes
Out[23]: 16
In [24]: data = {"data": numpy_array}

But in the pre-7.3 format:

In [25]: from scipy import io
In [26]: io.savemat("test712.mat", data)
In [27]: io.loadmat("test712.mat")
Out[27]: 
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue Feb 22 18:02:29 2022',
 '__version__': '1.0',
 '__globals__': [],
 'data': array([[(array(['<detect>'], dtype='<U8'), array([[192]], dtype=uint32), array([[1]], dtype=uint32))]],
       dtype=[('packet_sync', 'O'), ('n_bytes', 'O'), ('n_detect', 'O')])}
In [28]: ll test712.mat
...
-rw-rw-r-- 1 paul 408 Feb 22 18:02 test712.mat

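You mention "inflation", but note that even here a 16-byte array is saved to a 408-byte file.

With numpy's native save the file is somewhat smaller. Most of it is the header block that specifies the shape and dtype: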

In [29]: np.save("test712.npy", numpy_array)
In [30]: ll test712.npy
-rw-rw-r-- 1 paul 208 Feb 22 18:05 test712.npy
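The split between header and data in a `.npy` file can be checked directly. This is a small sketch that assumes the same one-element structured array as above and writes to an in-memory buffer rather than to disk; the variable names are illustrative only.

```python
import io

import numpy as np

numpy_array = np.array(
    [(b"<detect>", 192, 1)],
    dtype=[("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")],
)

# Serialize to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
np.save(buffer, numpy_array)

total_size = buffer.getbuffer().nbytes
header_size = total_size - numpy_array.nbytes  # everything that is not raw data
print(f"total={total_size} data={numpy_array.nbytes} header={header_size}")
```

The exact header size can vary with the numpy version, but for an array this small it should outweigh the 16 bytes of data by a wide margin.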

Saving with the more basic h5py:

In [32]: f = h5py.File("test712.h5", "w")
In [33]: f.create_dataset("array", data=numpy_array)
Out[33]: <HDF5 dataset "array": shape (1,), type "|V16">
In [34]: f.close()
In [35]: %ll test712.h5
-rw-rw-r-- 1 paul 2064 Feb 22 18:08 test712.h5

In [37]: f = h5py.File("test712.h5", "r")
In [40]: f["array"][:]
Out[40]: 
array([(b'<detect>', 192, 1)],
      dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')])

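The HDF5 file format is not compact, so I don't think it makes sense to talk about inflation. In this case I suspect most of the size is due to layout and headers rather than the data itself. Saving a 4-element array (64 bytes instead of 16) probably would not change the file size much in any of these formats.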
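That guess is easy to test. The sketch below writes the same packet layout with 1, 4, and 1000 records using h5py and compares the file size to the raw data size. The `hdf5_size` helper and the file names are hypothetical, and the exact numbers will depend on the installed HDF5 and h5py versions.

```python
import os
import tempfile

import h5py
import numpy as np

PACKET_DTYPE = np.dtype(
    [("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")]
)


def hdf5_size(n_packets, directory):
    """Write n_packets records with h5py and return the file size in bytes."""
    packets = np.zeros(n_packets, dtype=PACKET_DTYPE)
    packets["packet_sync"] = b"<detect>"
    packets["n_bytes"] = 192
    path = os.path.join(directory, f"packets_{n_packets}.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("array", data=packets)
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    sizes = {n: hdf5_size(n, tmp) for n in (1, 4, 1000)}

for n, size in sizes.items():
    data_bytes = n * PACKET_DTYPE.itemsize
    print(f"{n:>5} packets: data={data_bytes:>6} B, file={size:>6} B")
```

If the difference between file size and data size stays roughly constant as the record count grows, the overhead is a fixed cost per file or per dataset, which would matter for many tiny arrays but not for one large homogeneous array.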

1 vote

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71230362
