Someone recently asked how to do a file slurp in Python, and the accepted answer suggested something like:

with open('x.txt') as x: f = x.read()

How would I go about doing this to read the file in and convert the endian representation of the data?
For example, I have a 1 GB binary file that is just a bunch of single-precision floats packed big-endian, and I want to convert it to little-endian and dump it into a numpy array. Below is the function I wrote to accomplish this, along with some real code that calls it. I use struct.unpack for the endian conversion and tried to speed everything up by using mmap.
My question, then: am I using the slurp correctly with mmap and struct.unpack? Is there a cleaner, faster way to do this? Right now what I have works, but I would really like to learn how to do it better.
Thanks in advance!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
    """
    We need to read in the asf internal file and convert it into a numpy array.
    It is stored as a single row, and is binary. The number of lines (rows), samples (columns),
    and channels all come from the .meta text file.
    Also, internal format files are packed big endian, but most systems use little endian, so we need
    to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit.
    """
    # memory-map the file; size 0 means the whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo * line_count * sample_count * arrayName.itemsize)
        for i in xrange(line_count * sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize))[0]
        # Same method as above, just more verbose for the maintenance programmer.
        # for i in xrange(line_count * sample_count):  # row
        #     be_float = map.read(arrayName.itemsize)  # arrayName.itemsize should be 4 for float32
        #     le_float = unpack('>f', be_float)[0]     # > for big endian, < for little endian
        #     arrayName[0, i] = le_float
        map.close()
    return arrayName
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)

Posted on 2009-10-28 04:19:50
with open(fileName, "rb") as f:
    arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)

Pretty hard to beat for speed and conciseness ;-). For byteswap, see here (the True argument means "do it in place"); for fromfile, see here.
This works on little-endian machines (the data is big-endian, so the byteswap is needed). You can test whether that is the case and do the byteswap conditionally, changing the last line from an unconditional call into, for example:
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
    arrayName.byteswap(True)

i.e., a call to byteswap conditional on a test for little-endianness.
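Putting the pieces of this answer together, here is a minimal runnable sketch (Python 3 syntax; the tiny file and its contents are made up to stand in for the real 1 GB .img file):

```python
import struct
import numpy as np

# Write a few big-endian float32 values to stand in for the real data file.
values = [1.0, 2.5, -3.25]
with open('sample_be.img', 'wb') as f:
    f.write(struct.pack('>%df' % len(values), *values))

# Slurp the whole file into a float32 array in one call.
with open('sample_be.img', 'rb') as f:
    arr = np.fromfile(f, np.float32)

# Swap bytes in place only on little-endian machines, i.e. when
# native ('=') and little-endian ('<') packing agree.
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
    arr.byteswap(True)

print(arr.tolist())
```

Either way, the array ends up holding the values in native byte order.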
Posted on 2009-10-28 04:42:18
A slight modification of @Alex Martelli's answer:
arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of the endianness of the machine

Posted on 2009-10-28 02:30:38
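The big-endian dtype variant above can be sketched as follows (the file name and contents are synthetic, for illustration). Declaring the on-disk byte order in the dtype lets numpy handle the conversion, so the same line works on any machine:

```python
import struct
import numpy as np

# A synthetic big-endian float32 file standing in for the real data.
with open('sample_be.img', 'wb') as f:
    f.write(struct.pack('>3f', 1.0, 2.5, -3.25))

# '>f4' tells numpy the data on disk is big-endian 4-byte floats,
# so no explicit byteswap is needed regardless of machine endianness.
arr = np.fromfile('sample_be.img', np.dtype('>f4'))

print(arr.tolist())
```

If downstream code wants a native-endian array, arr.astype(np.float32) makes a converted copy.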
You could put together an ASM-based solution using CorePy. I wonder, though, whether you might be able to gain enough performance from some other part of your algorithm. I/O and manipulation of 1 GB chunks of data are going to take a while no matter how you slice it.
One other thing you might find helpful is to switch to C once you have prototyped the algorithm in Python. I did this once for an operation on a whole-world DEM (elevation) dataset. The whole thing was much more tolerable once I got away from the interpreted script.
https://stackoverflow.com/questions/1632673