How can I take an Apache Common Log file and list all of the URLs in it in a neat histogram, like:
/favicon.ico ##
/manual/mod/mod_autoindex.html #
/ruby/faq/Windows/ ##
/ruby/faq/Windows/index.html #
/ruby/faq/Windows/RubyonRails #
/ruby/rubymain.html #
/robots.txt ########

Sample test file:
65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /~longa/geomed/ppa/doc/localg/localg.htm HTTP/1.0" 200 24834
65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /~longa/geomed/modules/sv/scen1.html HTTP/1.0" 200 1919
65.54.188.137 - - [03/Sep/2006:03:53:51 -0400] "GET /~longa/xlispstat/code/statistics/introstat/axis/code/axisDens.lsp HTTP/1.0" 200 15962
65.54.188.137 - - [03/Sep/2006:04:03:03 -0400] "GET /~longa/geomed/modules/cluster/lab/nm.pop HTTP/1.0" 200 66302
65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706
74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -

This is what I have so far (but I'm not sure how to make the histogram):
...
---
apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
parts = apache_line.match(file)
p parts[:ip_address], parts[:status], parts[:method], parts[:url]

def get_url(file)
  hits = Hash.new {|h,k| h[k]=0}
  File.read(file).to_a.each do |line|
    while p parts[:url]
      if k = k
        h[k]+=1
        puts "%-15s %s" % [k,'#'*h[k]]
      end
    end
  end
...
---
Here is the full problem: http://pastebin.com/GRPS6cTZ

Pseudocode is fine.
Answered 2011-04-06 10:46:35
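A `Hash` whose missing keys default to 0 can track the hits per URL:

    hits = Hash.new{ |h,k| h[k]=0 }
    ...
    hits["/favicon.ico"] += 1
    hits["/ruby/faq/Windows/"] += 1
    hits["/favicon.ico"] += 1
    p hits
    #=> {"/favicon.ico"=>2, "/ruby/faq/Windows/"=>1}

In case the log file is really big, don't store the whole file in memory; process it one line at a time. (Look through the methods of the `File` class.)

Since the Apache log format has no standard delimiter, I suggest using a regular expression to take each line and break it into the chunks you want. Assuming you are using Ruby 1.9, I'm going to use named captures so the pieces can be accessed cleanly later. For example:

    apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
    ...
    parts = apache_line.match(log_line)
    p parts[:ip_address], parts[:status], parts[:method], parts[:url]

If you are not reading all the lines into memory, you won't be filtering them with `Array#select`; instead, skip the lines you don't want during your loop.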
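A minimal sketch of the counting step, under some assumptions: `count_hits` is a hypothetical helper name, the regular expression is the one shown in the question, and the sample lines are taken from the question's test file (one of them repeated for illustration). It is one possible way to combine the hints, not the only one.

```ruby
# Regular expression from the question; named captures need Ruby 1.9 or later.
APACHE_LINE = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/

# Hypothetical helper: accepts anything that yields lines one at a time
# (an Array of strings, an open File, File.foreach(path), ...).
def count_hits(lines)
  hits = Hash.new { |h, k| h[k] = 0 }
  lines.each do |line|
    parts = APACHE_LINE.match(line)
    next unless parts            # skip lines that do not parse
    hits[parts[:url]] += 1
  end
  hits
end

sample = [
  '65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706',
  '74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -',
  '65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706'
]

count_hits(sample).each { |url, count| puts "#{url} #{count}" }
# /~longa/geomed/data/france/names.txt 2
# /~jbyoder/ambiguouslyyours/ambig.rss 1
```

For a real log, passing `File.foreach(path)` (with `path` being your own log file) instead of an array reads the file one line at a time rather than loading it all at once.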
1. `Hash#keys` can give you all the keys of the hash (the paths) at once. You probably want to write out all the paths with the same amount of whitespace, so you need to figure out which is the longest. Perhaps you want to `map` the paths to their lengths and then get the `max` element, or perhaps you want to use [`max_by`](http://ruby-doc.org/core/classes/Enumerable.html#M001507) to find the longest path and then find its length.
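2. Although geeky, using `sprintf` or `String#%` is a great way to lay out formatted reports. For example:

        puts "%-15s %s" % ["Hello", "####"]
        # prints: Hello           ####

3. Just as you need to find the longest name to get nice formatting, you might want to find the URL with the most hits so that you can scale the longest run of hashes to that value. `Hash#values` will give you an array of all the values. Alternatively, you may have a requirement that one `#` must always represent 100 hits.
4. Note that `String#*` lets you create a string by repetition:

        p '#'*10 #=> "##########"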
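The formatting hints above can be sketched roughly as follows. `histogram_lines` and its `max_bar` parameter are hypothetical names chosen for illustration, and the scaling rule (shrink bars only when the largest count exceeds `max_bar`) is an assumption rather than a requirement from the assignment.

```ruby
# Hypothetical helper: turns a { url => hit_count } hash into aligned
# histogram lines. max_bar is an assumed upper limit on bar length.
def histogram_lines(hits, max_bar = 40)
  return [] if hits.empty?

  width = hits.keys.max_by(&:length).length   # longest path sets the column width
  most  = hits.values.max                     # largest hit count

  hits.sort.map do |url, count|
    # One '#' per hit when everything fits; otherwise scale with integer
    # ceiling division so the busiest URL gets exactly max_bar characters
    # and every URL with at least one hit still shows one '#'.
    length = most > max_bar ? (count * max_bar + most - 1) / most : count
    "%-#{width}s %s" % [url, "#" * length]
  end
end

hits = { "/robots.txt" => 8, "/favicon.ico" => 2, "/ruby/rubymain.html" => 1 }
puts histogram_lines(hits)
```

Building the format string from the computed width avoids hard-coding a column size such as 15, which would misalign any path longer than that.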
If you have specific problems with your code, feel free to ask more questions!
Answered 2011-04-06 10:30:58
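Since this is homework, I won't give you the exact answer, but Simone Carletti has implemented a Ruby class for parsing Apache log files. You could start from there and see how he does things.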
https://stackoverflow.com/questions/5560796