I have some code that parses an Apache log file (start_search and end_search are date strings in the format Apache uses in its logs):
    import csv
    import re
    from itertools import takewhile, dropwhile

    with open("/var/log/apache2/access.log", 'r') as log:
        # Skip lines until start_search appears, then keep lines until end_search appears.
        s_log = dropwhile(lambda L: start_search not in L, log)
        e_log = takewhile(lambda L: end_search not in L, s_log)
        query = [line for line in e_log if re.search(r'GET /(.+veggies|.+fruits)', line)]
        query_dict = csv.DictReader(query, fieldnames=('ip', 'na-1', 'na-2', 'time', 'zone', 'url', 'refer', 'client'), quotechar='"', delimiter=" ")
        veggies = [x for x in query_dict if re.search('veggies', x['url'])]
        fruits = [x for x in query_dict if re.search('fruits', x['url'])]

The second list comprehension is always empty; that is, if I swap the order of the last two lines:
        fruits = [x for x in query_dict if re.search('fruits', x['url'])]
        veggies = [x for x in query_dict if re.search('veggies', x['url'])]

then the second list is again always empty.
Why? (And how can I populate both the fruits and veggies lists?)
Posted on 2013-10-26 00:23:36
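You can only loop over an iterator once; query_dict is an iterator, and once it has been consumed while scanning for veggies, it cannot be iterated again to search for fruits.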
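The exhaustion can be seen in isolation with a minimal sketch. The two input lines and the shortened field names ('ip', 'url') are made up for illustration and are not the asker's real log data:

```python
import csv

# Hypothetical rows standing in for the filtered log lines (not real data).
lines = ['1.1.1.1 "GET /a/veggies"', '2.2.2.2 "GET /b/fruits"']

reader = csv.DictReader(lines, fieldnames=('ip', 'url'), quotechar='"', delimiter=' ')

first_pass = [row['url'] for row in reader]   # consumes every row of the reader
second_pass = [row['url'] for row in reader]  # the reader is already exhausted

print(first_pass)
print(second_pass)
```

The second comprehension raises no error; it simply finds nothing left to read, which is why the problem is easy to miss.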
Don't use list comprehensions here. Loop over query_dict once and test each entry for both veggies and fruits:
    veggies = []
    fruits = []
    for x in query_dict:
        # Test each row once against both patterns.
        if re.search('veggies', x['url']):
            veggies.append(x)
        if re.search('fruits', x['url']):
            fruits.append(x)

Alternatives are:
Re-create the csv.DictReader() object for the fruits list:

    query_dict = csv.DictReader(query, fieldnames=('ip', 'na-1', 'na-2', 'time', 'zone', 'url', 'refer', 'client'), quotechar='"', delimiter=" ")
    veggies = [x for x in query_dict if re.search('veggies', x['url'])]
    # Build a fresh reader over the same query list before the second pass.
    query_dict = csv.DictReader(query, fieldnames=('ip', 'na-1', 'na-2', 'time', 'zone', 'url', 'refer', 'client'), quotechar='"', delimiter=" ")
    fruits = [x for x in query_dict if re.search('fruits', x['url'])]
This does double the work; you loop over the whole data set twice.

Use itertools.tee() to 'clone' the iterator:

    from itertools import tee
    veggies_query_dict, fruits_query_dict = tee(query_dict)
    veggies = [x for x in veggies_query_dict if re.search('veggies', x['url'])]
    fruits = [x for x in fruits_query_dict if re.search('fruits', x['url'])]
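As a self-contained sketch of the tee() approach, the following uses made-up rows and shortened field names ('ip', 'url'), both assumptions for illustration only. It also checks that the underlying reader is already exhausted after the first branch has been consumed, so the second branch can only be served from what tee() has buffered:

```python
import csv
import re
from itertools import tee

# Hypothetical rows standing in for the filtered log lines (not real data).
lines = [
    '1.1.1.1 "GET /a/veggies"',
    '2.2.2.2 "GET /b/fruits"',
    '3.3.3.3 "GET /c/veggies"',
]
reader = csv.DictReader(lines, fieldnames=('ip', 'url'), quotechar='"', delimiter=' ')

# Two independent iterators that share the single underlying reader.
veggies_rows, fruits_rows = tee(reader)

veggies = [r for r in veggies_rows if re.search('veggies', r['url'])]

# The first branch has drained the reader; nothing is left in it directly.
leftover = list(reader)

# The second branch still sees every row, replayed from tee's internal buffer.
fruits = [r for r in fruits_rows if re.search('fruits', r['url'])]

print([r['ip'] for r in veggies])
print([r['ip'] for r in fruits])
```

Because the first comprehension runs to completion before the second one starts, every row has to be held for the second branch in the meantime.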
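This ends up caching all of query_dict in the tee buffer, requiring twice the memory for the same task, until the fruits comprehension empties the buffer again.

https://stackoverflow.com/questions/19601355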