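I have written this code (it works; I tried it on a small batch of MBOX files). However, when I ran it on a 2.9 GB MBOX file containing roughly 50,000 emails, memory consumption shot up and made the computer unusable. What is wrong with the code in terms of memory consumption, and is there a way to fix it, for example by processing the file incrementally rather than all at once? The goal of the script is to produce a CSV file with the date as X and the number of messages received on each date as Y, so that they can be plotted to give a statistical picture of the emails. For the future: I plan to extend it to read the email messages and write them to a PDF in chronological order, which is why they need to be sorted (and that is where memory consumption spikes).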
import mailbox
from email.utils import parsedate
from dateutil.parser import parse
import itertools
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
import csv
from itertools import izip
path = 'mail.mbox'
mbox = mailbox.mbox(path)
def extract_date(email):
    date = email.get('Date')
    return parsedate(date)
#sort the email by a given date
sorted_mails = sorted(mbox, key=extract_date)
mbox.update(enumerate(sorted_mails))
mbox.flush()
#it finds all the dates within the MBOX and split
all_dates = []
mbox = mailbox.mbox(path)
for message in mbox:
    all_dates.append( str( parse( message['date'] ) ).split(' ')[0] )
#counts the number of emails per given date
email_count = [(g[0], len(list(g[1]))) for g in itertools.groupby(all_dates)]
email_count[0]
#makes a list of (x,y)
x = []
y = []
for date, count in email_count:
    x.append(date)
    y.append(count)
#produce a CSV file of X and Y, for plotting
with open('data.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(izip(x, y))
"""
data = Data([x, y])
plot_url = py.iplot(Data, filename='line-scatter' )
"""
py.iplot( Data([ Scatter( x=x, y=y ) ]) )
Answered on 2016-02-11 15:19:40
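I'm not very familiar with these libraries, but I think the main problem is that you are reading all of the messages into memory with the following line:
sorted_mails = sorted(mbox, key=extract_date)
What is the goal of this script? Do you really need to sort anything? If you only need to produce a CSV with a count for each date, try something like this: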
import mailbox
from email.utils import parsedate
from dateutil.parser import parse
import itertools
import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls
import csv
from itertools import izip
path = 'mail.mbox'
mbox = mailbox.mbox(path)
# map date to number of emails seen on that date
date_counts = {}
for message in mbox:
    date = str( parse( message['date'] ) ).split(' ')[0]
    try:
        date_counts[date] += 1
    except KeyError:
        date_counts[date] = 1
with open('data.csv', 'wb') as f:
    writer = csv.writer(f)
    for date in date_counts:
        writer.writerow([date, date_counts[date]])
https://stackoverflow.com/questions/35342465
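For readers on Python 3, where `izip` and the `'wb'` CSV mode used above no longer apply, the same incremental counting idea might look like the sketch below. It is not the code from the thread: the function names `message_date`, `count_by_date`, and `write_counts` are made up for illustration, it swaps `dateutil` for the standard library's `email.utils.parsedate_to_datetime`, and it assumes that messages with a missing or unparseable `Date` header can simply be skipped. Rows are written in sorted order, which is chronological for ISO dates.

```python
import csv
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def message_date(message):
    """Return the message's Date header as a datetime.date, or None.

    Assumption: messages without a usable Date header are ignored.
    """
    header = message.get('Date')
    if header is None:
        return None
    try:
        # str() also handles the case where the header is a Header object.
        return parsedate_to_datetime(str(header)).date()
    except (TypeError, ValueError):
        # Older Pythons raise TypeError, newer ones ValueError, on bad dates.
        return None


def count_by_date(path):
    """Count messages per calendar date, holding one message at a time."""
    counts = Counter()
    mbox = mailbox.mbox(path)
    try:
        for message in mbox:
            day = message_date(message)
            if day is not None:
                counts[day.isoformat()] += 1
    finally:
        mbox.close()
    return counts


def write_counts(counts, csv_path):
    """Write (date, count) rows to a CSV file, sorted by date string."""
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for day in sorted(counts):
            writer.writerow([day, counts[day]])
```

Usage would be along the lines of `write_counts(count_by_date('mail.mbox'), 'data.csv')`. Only the per-date counters stay in memory, so the footprint should depend on the number of distinct dates and the largest single message rather than on the size of the whole MBOX file.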