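I want to merge the raw files belonging to each user in a users list into one file per user, and do it faster if possible. I have this code, but it is very slow: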
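
    import glob

    for u in users:
        content = ""
        # One glob call per user: the directory is rescanned on every iteration
        contentfiles = glob.glob("raw_data/" + "*_" + str(u) + ".txt")
        for c in contentfiles:
            txt = open(c, "r").read()
            content += txt
        with open("docs/" + str(u) + ".txt", "w") as outfile:
            outfile.write(content)

Is there a faster way to achieve this? I have 400k users, and it runs at roughly one file per second = 18 hours.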
Edit: moving glob out of the loop gives faster results:

    datafiles = glob.glob("raw_data/*.txt")
    for u in users:
        content = ""
        filestring = "_" + str(u) + ".txt"
        # The full file list is still scanned once per user
        contentfiles = [i for i in datafiles if filestring in i]
        for c in contentfiles:
            txt = open(c, "r").read()
            content += txt

Answered 2020-05-20 07:49:54
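Assuming glob was the bottleneck, the list filtering in the edit is now the new bottleneck. Here is a proposal: move glob out of the loop so that it runs only once, group the file paths by user in a dictionary, and then write each user's files out in turn: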
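
    import glob

    datafiles = glob.glob("raw_data/*.txt")
    userfiles = {}  # Dictionary of "user: [file list]"

    # Prepare the file list
    for file in datafiles:
        # "raw_data/x_123.txt" -> "123"
        user = file.split('.')[-2].split('_')[-1]
        # setdefault creates the empty list the first time a user is seen
        userfiles.setdefault(user, []).append(file)

    # Loop over the list
    for user, ufiles in userfiles.items():
        with open("docs/{}.txt".format(user), "w") as outfile:
            for path in ufiles:
                # ufiles holds paths, so each one has to be opened before reading
                with open(path, "r") as infile:
                    outfile.write(infile.read())

You could even skip the per-user grouping entirely and simply iterate over the files in datafiles in whatever order they come. That means opening outfile in append mode ("a"), so that each new file for a user does not overwrite the content already written.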
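
The append-mode idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names user_from_path and merge_by_appending are made up for this example, file names are assumed to end in _<user>.txt as in the question, and a set of already-seen users is used so that the first write for a user truncates any output left over from an earlier run while later writes append.

```python
import glob
import os


def user_from_path(path):
    """Return the text between the last '_' and the extension (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.rsplit("_", 1)[-1]


def merge_by_appending(raw_dir="raw_data", out_dir="docs"):
    """Copy every raw file into its user's output file in a single pass."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # users whose output file has already been started in this run
    # Sorting is optional; it only makes the concatenation order reproducible.
    for path in sorted(glob.glob(os.path.join(raw_dir, "*.txt"))):
        user = user_from_path(path)
        # First file for a user: "w" discards stale output. Later files: "a" appends.
        mode = "a" if user in seen else "w"
        seen.add(user)
        out_path = os.path.join(out_dir, user + ".txt")
        with open(path, "r") as infile, open(out_path, mode) as outfile:
            outfile.write(infile.read())
```

Calling merge_by_appending() with the default arguments reads from raw_data/ and writes to docs/. The trade-off compared with the dictionary version is that an output file is reopened for every input file, in exchange for not holding the whole user-to-files mapping in memory.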
https://stackoverflow.com/questions/61898037