Q: Data compression

Stack Overflow user

Asked 2009-11-30 09:38:58

5 answers · 1.3K views · 0 followers · score 1

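I have a task to compress stock market data. The data is in a file with each day's stock value on its own line, and so on, so it is a very large file.

For example: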

123.45

234.75

345.678

889.56

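The question is how to compress the data (that is, reduce the redundancy) using a standard algorithm such as Huffman coding, arithmetic coding, or LZ coding. Which coding is preferable for this kind of data?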

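I noticed that if I take the first value and then consider the differences between consecutive values, the differences contain many repeated values. This makes me wonder whether taking the differences first, finding their frequencies and probabilities, and then applying Huffman coding would be a workable approach.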
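The idea in the paragraph above can be sketched as follows. This is a minimal illustration, not a production encoder: the price list is made up, prices are assumed to be already converted to integer cents, and the helper names `deltas` and `huffman_code` are hypothetical.

```python
import heapq
from collections import Counter

def deltas(cents):
    """Return the first value followed by consecutive differences."""
    return [cents[0]] + [b - a for a, b in zip(cents, cents[1:])]

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): '0'}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical prices, already scaled to integer cents.
prices = [12345, 12350, 12355, 12350, 12355, 12360, 12365, 12360]
diffs = deltas(prices)[1:]  # the first value would be stored separately
code = huffman_code(diffs)
bits = ''.join(code[d] for d in diffs)
print(Counter(diffs), code, len(bits))
```

A real file format would also have to store the code table (or the frequencies) next to the bit stream so that the decoder can rebuild the same code.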

Am I right? Can anyone give me some suggestions?

compression

huffman-code

entropy

data-compression

5 answers

Stack Overflow user

Accepted answer

Posted 2009-12-01 01:31:09

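I think your problem is more complex than simply subtracting stock prices. You also need to store the dates (unless the file covers a consistent time span that can be inferred from the file name).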

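The amount of data is not very large, though. Even if you had data for every second of every day of every year for the last 30 years for 300 stocks, you could still store all of it on a higher-end home computer (say, a Mac Pro), because that amounts to 5Tb uncompressed.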
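The estimate above can be checked with a little arithmetic. The record size is not given in the answer, so the 16 bytes per sample used here (for instance an 8-byte timestamp plus an 8-byte price) is purely an assumption, and leap days and market hours are ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# years * days per year * seconds per day * number of stocks
samples = 30 * 365 * SECONDS_PER_DAY * 300

BYTES_PER_SAMPLE = 16  # assumption, not stated in the answer
total_bytes = samples * BYTES_PER_SAMPLE

print(samples)             # number of samples
print(total_bytes / 1e12)  # size in terabytes (10**12 bytes)
```

Under that assumption the total comes out at a few terabytes, the same order of magnitude as the figure quoted in the answer.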

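I wrote a quick and dirty script that fetches the IBM stock price from Yahoo for every day and stores it both "normally" (only the adjusted close) and using the "difference method" you mention, then compresses both files with gzip. You do obtain savings: 16K vs. 10K. The problem is that I did not store the dates, so I do not know which value corresponds to which date; you would have to include this, of course.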
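One possible way to keep the dates without storing each one in full is to treat them like the prices: store the first date and then the gap in days to each following date. The sketch below only illustrates that idea; the function names and the sample dates are hypothetical and are not part of the script in this answer.

```python
from datetime import date, timedelta

def encode_dates(dates):
    """Return (first_date, gaps), where gaps are day counts between neighbours."""
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    return dates[0], gaps

def decode_dates(first, gaps):
    """Rebuild the full date list from the first date and the gaps."""
    out = [first]
    for g in gaps:
        out.append(out[-1] + timedelta(days=g))
    return out

# Hypothetical dates with a skipped day and a weekend in between.
days = [date(2009, 11, 23), date(2009, 11, 24), date(2009, 11, 25),
        date(2009, 11, 27), date(2009, 11, 30)]

first, gaps = encode_dates(days)
print(first, gaps)
assert decode_dates(first, gaps) == days
```

For daily data most gaps are 1 or 3, so the gap sequence is highly repetitive and should compress well with a general-purpose compressor.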

Good luck.

import urllib.parse
import urllib.request

# root URL (the Yahoo endpoint as it was used in 2009; it may no longer respond)
url = 'http://ichart.finance.yahoo.com/table.csv?%s'

# dictionary of options appended to URL (encoded)
opt = urllib.parse.urlencode({
    's': 'IBM',        # Stock symbol or ticker; IBM
    'a': '00',         # Month January; index starts at zero
    'b': '2',          # Day 2
    'c': '1978',       # Year 1978
    'd': '10',         # Month November; index starts at zero
    'e': '30',         # Day 30
    'f': '2009',       # Year 2009
    'g': 'd',          # Get daily prices
    'ignore': '.csv',  # CSV format
})

# get the data
data = urllib.request.urlopen(url % opt)

# get only the "Adjusted Close" (last column of every row; the 7th)
close = []
for entry in data:
    # the response is read as bytes, so decode each line before splitting
    close.append(entry.decode('ascii').strip().split(',')[6])

# get rid of the first element (it is only the string 'Adj Close')
close.pop(0)

# write the raw values to file, one per line
with open('raw.dat', 'w') as f1:
    for element in close:
        f1.write(element + '\n')

# simple function to convert a price string to an integer number of cents;
# round before int() so floating-point error cannot drop a cent
def scale(x):
    return int(round(float(x) * 100))

# apply the previously defined function to the list
close = [scale(x) for x in close]

# it is important to store the first element (it is the base value)
base = close[0]

# replace every value by its difference from the previous value
close = [close[k+1] - close[k] for k in range(len(close)-1)]

# put the base value back at the front of the data
close.insert(0, base)

# define a simple function to convert the list to a single string;
# every number carries an explicit sign, which doubles as the separator
def l2str(values):
    out = ''
    for item in values:
        if item >= 0:
            out += '+' + str(item)
        else:
            out += str(item)
    return out

# convert the list to a string
close = l2str(close)

with open('comp.dat', 'w') as f2:
    f2.write(close)

Now compare the "raw data" (raw.dat) with the "compressed format" you propose (comp.dat).

:sandbox jarrieta$ ls -lh
total 152
-rw-r--r--  1 jarrieta  staff    23K Nov 30 09:28 comp.dat
-rw-r--r--  1 jarrieta  staff    47K Nov 30 09:28 raw.dat
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
:sandbox jarrieta$ gzip --best *.dat
:sandbox jarrieta$ ls -lh
total 64
-rw-r--r--  1 jarrieta  staff    10K Nov 30 09:28 comp.dat.gz
-rw-r--r--  1 jarrieta  staff    16K Nov 30 09:28 raw.dat.gz
-rw-r--r--  1 jarrieta  staff   1.7K Nov 30 09:13 stock.py
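The comparison above can be reproduced without any download. The following sketch uses a synthetic random walk in place of the IBM prices (an assumption made only so the example is self-contained), writes the same signed-difference text format as comp.dat, and compresses both forms with `zlib`, which implements the DEFLATE method that gzip also uses. The `decode` function is a hypothetical helper showing that the format can be inverted.

```python
import random
import re
import zlib

def encode(cents):
    """First value followed by signed differences, e.g. '+12345+5-10+0'."""
    diffs = [cents[0]] + [b - a for a, b in zip(cents, cents[1:])]
    return ''.join('%+d' % d for d in diffs)

def decode(text):
    """Invert encode(): split on the signs and keep a running sum."""
    values, total = [], 0
    for token in re.findall(r'[+-]\d+', text):
        total += int(token)
        values.append(total)
    return values

# Synthetic prices in cents: a seeded random walk, NOT real market data.
random.seed(0)
cents = [10000]
for _ in range(5000):
    cents.append(max(1, cents[-1] + random.randint(-50, 50)))

raw = ''.join('%.2f\n' % (c / 100) for c in cents).encode('ascii')
comp = encode(cents).encode('ascii')

# The difference format must reproduce the original values exactly.
assert decode(comp.decode('ascii')) == cents

print('raw :', len(raw), '->', len(zlib.compress(raw, 9)))
print('diff:', len(comp), '->', len(zlib.compress(comp, 9)))
```

The sizes printed depend on the synthetic data, so they will not match the 16K and 10K figures in the listing above.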

Score: 2

Stack Overflow user

Posted 2009-11-30 09:49:54

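These days many compression tools combine several of these techniques to get good ratios on a wide variety of data. It may be worth starting with something fairly general and modern such as bzip2, which uses Huffman coding together with various tricks that shuffle the data around to bring out different kinds of redundancy (the linked page contains links to various implementations).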
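Trying a few general-purpose compressors is cheap with Python's standard library. The sketch below is only an illustration: the sample data simply repeats the four values from the question, so the ratios it prints say little about real price files, and the helper name `compressed_sizes` is hypothetical.

```python
import bz2
import lzma
import zlib

def compressed_sizes(data):
    """Return the size in bytes of data under several stdlib compressors."""
    return {
        'raw': len(data),
        'zlib': len(zlib.compress(data, 9)),  # DEFLATE: LZ77 plus Huffman coding
        'bz2': len(bz2.compress(data, 9)),    # bzip2 format
        'lzma': len(lzma.compress(data)),     # xz format
    }

# Unrealistically repetitive sample built from the values in the question.
sample = b'123.45\n234.75\n345.678\n889.56\n' * 1000

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(name, size)
```

Running the same function on both the raw file and the difference file would show which combination works best for a particular data set.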

Score: 2

Stack Overflow user

Posted 2009-11-30 10:28:58

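Calculate the differences between consecutive values, and then use run-length encoding (RLE).

You also need to convert the data to integers before calculating the differences.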
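A minimal sketch of these two steps, assuming prices with two decimal places that can be scaled to integer cents. The sample values and the helper names `to_cents`, `rle_encode`, and `rle_decode` are made up for illustration.

```python
from itertools import groupby

def to_cents(price_text):
    """Convert a price string to integer cents, rounding away float error."""
    return int(round(float(price_text) * 100))

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]

lines = ['10.00', '10.05', '10.10', '10.15', '10.15', '10.15', '10.10']
cents = [to_cents(s) for s in lines]
diffs = [b - a for a, b in zip(cents, cents[1:])]
runs = rle_encode(diffs)

print(diffs)
print(runs)
assert rle_decode(runs) == diffs
```

RLE only pays off when identical differences appear consecutively; if the differences repeat but are scattered, an entropy coder such as Huffman coding is likely the better fit.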

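Score: 0

The original page content is provided by Stack Overflow; the Chinese translation was supported by Tencent Cloud's Xiaowei IT-domain engine.

Original link: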

https://stackoverflow.com/questions/1817482
