I'm trying to compute 2-day returns for a list of companies. All the data is in a CSV file, structured as follows: the first column is the company name, the second the date, the third the price, and the fourth the return = p(t+2)/p(t).
(1) The CSV is 1.8 GB. Using `CSV.each_with_index` is very slow. If I use `CSV.foreach` instead, it doesn't let me look ahead to the price two days later.
(2) Prices have missing values, so even if I use `CSV.each_with_index`, row i+2 may not be the correct date.
Thanks for your help.
Input:
[
['a', '2014-6-1', '1'],
['a', '2014-6-2', '2'],
['a', '2014-6-4', '3'],
['a', '2014-6-5', '4'],
['b', '2014-6-1', '1'],
['b', '2014-6-2', '2'],
['b', '2014-6-3', '3'],
['b', '2014-6-4', '4'],
['b', '2014-6-5', '5']
]
Output:
[
['a', '2014-6-1', '1', ''], # Missing because no 2014-6-3 price for a
['a', '2014-6-2', '2', '1.5'], # p(a2014-6-4)/p(a2014-6-2) = 1.5
['a', '2014-6-4', '3', ''], # Missing because no 2014-6-6 price
['a', '2014-6-5', '4', ''], # Missing because no 2014-6-7 price
['b', '2014-6-1', '1', '3'],
['b', '2014-6-2', '2', '2'],
['b', '2014-6-3', '3', '1.7'],
['b', '2014-6-4', '4', ''],
['b', '2014-6-5', '5', '']
]
The logic I have in mind is below; it's the same as in the first comment. I haven't coded the second part, because I don't know a good way to merge a large CSV with itself in Ruby. I also considered scanning the following rows for the nth business day, but I want to avoid `each_with_index` because the CSV is so large. I don't know how to implement this logic in Ruby.
(1) Compute the nth business day after a given date. (2) Merge the dataset with itself so that I get the price on that nth business day.
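For step (2), one way to sketch the self-merge, assuming all (firm, date) → price pairs fit in memory, is a two-pass hash lookup. This is a minimal sketch with hypothetical names, using calendar days rather than business days:

```ruby
require 'date'

# Pass 1: index every price by [firm, date].
# Pass 2: look up the same firm's price n days later.
n = 2
rows = [['a', '2014-6-1', '1'], ['a', '2014-6-2', '2'], ['a', '2014-6-4', '3']]
prices = {}
rows.each { |firm, d, p| prices[[firm, Date.strptime(d, '%Y-%m-%d')]] = p.to_f }
out = rows.map do |firm, d, p|
  later = prices[[firm, Date.strptime(d, '%Y-%m-%d') + n]]
  [firm, d, p, later ? (later / p.to_f).round(2) : '']
end
# out => [["a", "2014-6-1", "1", ""], ["a", "2014-6-2", "2", 1.5], ["a", "2014-6-4", "3", ""]]
```

For a 1.8 GB file the index itself may be large, so this trades memory for a single pass over the data.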
require 'csv'
require 'time'
require 'business_time'

# 30/60/90/365 business days
# cdate ncusip prc permno firm
csvIn = 'in.csv'
csvOut = 'out.csv'
csv = CSV.open(csvOut, "w")
csv << ['cdate', 'ncusip', 'prc', 'permno', 'firm', 'day60']
CSV.foreach(csvIn, :headers => true) do |row|
  current_date = Time.parse(row['cdate'])
  day60 = 42.business_days.after(current_date)
  csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], day60]
end
csv.close

Posted on 2014-09-07 07:26:44
Your code introduces some new requirements, such as finding the nth business day, that aren't clearly defined in the question; it may be more appropriate to open a separate question on "the fastest way to find the nth business day in Ruby".
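(Until such a question exists, here is a dependency-free sketch that only skips weekends, with no holiday calendar; `business_days_after` is a hypothetical helper, not part of the business_time gem.)

```ruby
require 'date'

# Advance n business days (Mon-Fri only; holidays are not considered).
def business_days_after(date, n)
  d = date
  while n > 0
    d += 1
    n -= 1 if (1..5).cover?(d.wday)  # wday 1..5 = Monday..Friday
  end
  d
end

business_days_after(Date.new(2014, 9, 5), 1)  # Friday + 1 business day => Monday 2014-09-08
```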
So let's discuss only the requirements you spelled out in the comments of your sample output.
The gist of the requirement:
Basic benchmarks:
I repeated the sample data 40,000 times, which gave me a file of 360,000 records, about 10 MB.
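For reproducibility, a minimal sketch of how such a test file could be generated (the file name data.csv is assumed):

```ruby
require 'csv'

# Repeat the 9 sample rows 40,000 times => 360,000 records in data.csv
sample = [
  ['a', '2014-6-1', '1'], ['a', '2014-6-2', '2'], ['a', '2014-6-4', '3'],
  ['a', '2014-6-5', '4'], ['b', '2014-6-1', '1'], ['b', '2014-6-2', '2'],
  ['b', '2014-6-3', '3'], ['b', '2014-6-4', '4'], ['b', '2014-6-5', '5']
]
CSV.open('data.csv', 'w') do |csv|
  40_000.times { sample.each { |row| csv << row } }
end
```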
My first idea was a buffer class that holds records whose nth-day-later row hasn't been met yet. When a new record is pushed into the buffer, the buffer shifts out every record dated more than n days before the new one.
But first I need to know the cost of some basic operations this implementation might use, so that I can estimate a lower bound on the total processing time by choosing the more efficient operations:
I've heard that CSV is a very inefficient way to parse, so I'll also compare the parsing time of the two file-reading approaches:
Basic benchmark script:
require 'csv'
require 'date'
require 'benchmark'

Benchmark.bm{|x|
  epoch = Date.new(1970,1,1)
  date1 = Date.today
  date2 = Date.today.next
  date_str = '2014-6-1'
  res = []
  a = [[1,2,4,date2], 2, [1,2,4,date1]]
  # 1. convert a date-formatted string to a Date, 360,000 times
  x.report("1.string2date"){
    360000.times{ Date.strptime(date_str, "%Y-%m-%d") }
  }
  # 2. compare two dates, 360,000 times
  x.report("2.DateCompare"){
    360000.times{ date2 >= date1 }
  }
  # 3. get the date that is n days after another date, 360,000 times
  x.report("3.DateAdd2   "){
    360000.times{ date1 + 2 }
  }
  # 4. calculate the days between two dates, 360,000 times
  x.report("4.Date Differ"){
    360000.times{ date2 - date1 }
  }
  # 5. compare two dates stored in an array of arrays, 360,000 times
  x.report("5.ArrDateComp"){
    360000.times{ a.last[3] > a.first[3] }
  }
  # 6. push a row into a buffer and shift one out, 360,000 times
  x.report("6.array shift"){
    360000.times{ a << [1,2,3]; a.shift }
  }
  # 7. append a ratio or an empty string to a record, 360,000 times
  x.report("7.Add Ratio  "){
    360000.times{ res << (['1','2014-6-1','3'] << (2 == 2 ? (3.to_f/2.to_f).round(2) : "")) }
  }
  x.report('CSVparse     '){
    CSV.foreach("data.csv"){|row| }
  }
  x.report('IOread       '){
    data = IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|r| }
  }
}

Results:
user system total real
1.string2date 0.827000 0.000000 0.827000 ( 0.820001)
2.DateCompare 0.078000 0.000000 0.078000 ( 0.070000)
3.DateAdd2 0.109000 0.000000 0.109000 ( 0.110000)
4.Date Differ 0.359000 0.000000 0.359000 ( 0.360000)
5.ArrDateComp 0.109000 0.000000 0.109000 ( 0.110001)
6.array shift 0.094000 0.000000 0.094000 ( 0.090000)
7.Add Ratio 0.530000 0.000000 0.530000 ( 0.530000)
CSVparse 2.902000 0.016000 2.918000 ( 2.910005)
IOread 0.515000 0.015000 0.530000 ( 0.540000)
Analysis of the results
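The table suggests `Date.strptime` is by far the most expensive per-record operation. Since a daily price file contains only a few distinct date strings, memoizing the conversion (as the second answer below also does) removes most of that cost; a standalone sketch:

```ruby
require 'date'

# Cache parsed dates: each distinct string is parsed once, then looked up.
cache = Hash.new { |h, s| h[s] = Date.strptime(s, '%Y-%m-%d') }

cache['2014-6-1']  # parsed once
cache['2014-6-1']  # now a plain hash lookup
cache.size         # => 1
```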
Implementation of the buffer class and its push method
class Buff
  def initialize
    @buff = []
    @epoch = Date.new(1970,1,1)
    @n = 2
  end

  def push_date(row)
    # store in buff with two Date values appended:
    # ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
    # the last element is the date n days after the record's own date
    res = []
    @buff << (row << (row[3] + @n))
    while (@buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0])
      v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    return res
  end

  def tails
    # flush whatever is left in the buffer with empty returns
    @buff.inject([]) {|res,x| res << (x[0..2] << "")}
  end

  def clear
    @buff = []
  end
end

Benchmark:
buff = Buff.new
res = []
Benchmark.bm{|x|
  buff.clear
  res = []
  x.report("CSVdate"){
    CSV.foreach("data.csv"){|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|r| res << r}
    }
    buff.tails.each{|r| res << r}
  }
  buff.clear
  res = []
  x.report("IOdate "){
    IO.read("data.csv").split.inject([]){|memo,o| memo << o.split(',')}.each{|row|
      buff.push_date(row << Date.strptime(row[1],"%Y-%m-%d")).each{|r| res << r}
    }
    buff.tails.each{|r| res << r}
  }
}
puts "output result count: #{res.size}"
puts "Here are the first 12 sample outputs:"
res[0..11].each{|r| puts r.to_s}

Results:
user system total real
CSVdate 6.411000 0.047000 6.458000 ( 6.500009)
IOdate 3.557000 0.109000 3.666000 ( 3.710005)
output result count:360000
Here are the first 12 sample outputs:
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]
["a", "2014-6-5", "4", ""]
["b", "2014-6-1", "1", 3.0]
["b", "2014-6-2", "2", 2.0]
["b", "2014-6-3", "3", 1.67]
["b", "2014-6-4", "4", ""]
["b", "2014-6-5", "5", ""]
["a", "2014-6-1", "1", ""]
["a", "2014-6-2", "2", 1.5]
["a", "2014-6-4", "3", ""]
Conclusion
UPDATE 1:
Tuning
A faster push, obtained by swapping the order of the group comparison and the date comparison:
class Buff
  def push_fast(row)
    # same buffer layout as push_date:
    # ["a", "2014-6-1", "1", #<Date: 2014-06-01 ...>, #<Date: 2014-06-03 ...>]
    # the last element is the date n days after the record's own date
    res = []
    row << (row[3] + @n)
    # checking the group first reduces the number of date comparisons
    while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4])
      v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f/@buff.first[2].to_f).round(2) : "")
      res << (@buff.shift[0..2] << v)
    end
    @buff << row
    return res
  end
end

Benchmark results:
user system total real
IOdate 3.806000 0.031000 3.837000 ( 3.830005)
IOfast 3.323000 0.062000 3.385000 ( 3.390005)
That's a 0.480-second improvement. Comparing the group first saves many date comparisons: when the group changes, all buffered records are shifted out without any date comparison at all.
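The gain comes from `||` short-circuiting: with the firm-name check first, the date comparison on the right is skipped entirely whenever the firm differs. A tiny illustration (the counter and lambda are hypothetical, for counting only):

```ruby
date_compares = 0
date_check = lambda { |a, b| date_compares += 1; a >= b }

buffer_first = ['a', 1]
rows = [['b', 5], ['b', 6], ['a', 3]]
rows.each do |firm, day|
  # firm check first: date_check only runs for rows of the same firm
  firm != buffer_first[0] || date_check.call(day, buffer_first[1])
end
date_compares  # => 1
```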
Posted on 2014-09-07 17:49:29
Here is another possibility:
Assume you have an array named input as defined above; the approach would look like this:
# convert.rb
require 'date'

class Convert
  attr_reader :dates_hash

  def initialize
    @input_array = []
    @dates_hash = {}
    @n = 2
    @date_converter = {}
  end

  def add_to_hash(row)
    # build a hash of ids, dates and values, like this:
    # {"a" => {#<Date: 2014-06-01> => 1, #<Date: 2014-06-04> => 2}, ... }
    id = row[0]
    date = to_date(row[1])
    value = row[2].to_i
    # merge with a block, so that for a given id (like "a") the inner
    # hashes are combined rather than replaced
    @dates_hash.merge!({ id => { date => value } }) do |key, x, y|
      x.merge(y)
    end
  end

  def input(input_array)
    output = input_array.map do |row|
      id = row[0]
      date = to_date(row[1])
      value = row[2]
      # build an output row of id, date, original value, and the return:
      # price2 is the price @n days later; if it exists, do the calculation
      row[0..2] << ((price2 = @dates_hash[id][date + @n]) ? price2 / value.to_f : '')
    end
  end

  def to_date(date)
    # convert to Date and memoize
    if converted_date = @date_converter[date]
      converted_date
    else
      @date_converter[date] = Date.parse(date)
    end
  end
end

I studied Jaugar's example and benchmarked with the following:
require 'csv'
require 'benchmark'
require './convert'

buff = []
output = []
Benchmark.bm{|x|
  x.report("convert with csv"){
    buff = []  # reset, so the second report doesn't process the rows twice
    converter = Convert.new
    CSV.foreach("data.csv") do |row|
      buff << row
      converter.add_to_hash(row)
    end
    output = converter.input(buff)
  }
  x.report("convert with IO "){
    buff = []
    converter = Convert.new
    IO.readlines("data.csv").map{|row| row.split(',')}.each do |row|
      buff << row
      converter.add_to_hash(row)
    end
    output = converter.input(buff)
  }
}
puts "Here are the first 12 sample outputs:"
output[0..11].each{|r| puts r.to_s}

Jaugar's benchmark on my machine:
user system total real
CSVdate 11.270000 0.020000 11.290000 ( 11.302404)
IOdate 8.740000 0.020000 8.760000 ( 8.756997)
My benchmark:
user system total real
convert with csv 10.450000 0.090000 10.540000 ( 10.546727)
convert with IO 12.850000 0.120000 12.970000 ( 12.972962)
["a", " 2014-6-1", " 1", ""]
["a", " 2014-6-2", " 2", 1.5]
["a", " 2014-6-4", " 3", ""]
["a", " 2014-6-5", " 4", ""]
["b", " 2014-6-1", " 1", 3.0]
["b", " 2014-6-2", " 2", 2.0]
["b", " 2014-6-3", " 3", 1.6666666666666667]
["b", " 2014-6-4", " 4", ""]
["b", " 2014-6-5", " 5", ""]
["a", " 2014-6-1", " 1", ""]
["a", " 2014-6-2", " 2", 1.5]
["a", " 2014-6-4", " 3", ""]
Not sure why my IO version is slower, but it's in a similar range. My hash ends up very small, simply because the data in the csv keeps repeating. I'm not sure how it would behave in a more realistic case with a much larger hash; it shouldn't degrade badly, since hash lookups are very efficient.
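The nested hash in `add_to_hash` is built with `Hash#merge!` plus a block; a standalone illustration of that pattern:

```ruby
h = {}
rows = [['a', '2014-6-1', 1], ['a', '2014-6-2', 2], ['b', '2014-6-1', 1]]
rows.each do |id, date, value|
  # on a key collision the block combines the inner hashes instead of replacing
  h.merge!(id => { date => value }) { |_key, old, new| old.merge(new) }
end
h  # => {"a"=>{"2014-6-1"=>1, "2014-6-2"=>2}, "b"=>{"2014-6-1"=>1}}
```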
https://stackoverflow.com/questions/25695638