
Find duplicate files and delete them

Stack Overflow user
Asked 2009-04-15 01:48:22
10 answers · 81.7K views · 0 followers · 71 votes

I'm writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files as well. I am using the SHA-1 algorithm.

How can I find these duplicate files and remove them?

10 Answers

Stack Overflow user

Answered 2016-03-20 19:33:10

Fastest algorithm - a 100x performance increase over the accepted answer (really :))

The approaches in the other solutions are very cool, but they overlook an important property of duplicate files: they have the same file size. Computing the expensive hash only on files of the same size saves an enormous amount of CPU; the performance comparison at the end explains why.

Iterating on the solid answer given by @nosklo, and borrowing @Raffi's idea of taking a quick hash of just the beginning of each file and computing the full hash only on collisions in the quick hash, the steps are:

  1. Build a hash table of the files, keyed by file size.
  2. For files of the same size, build a hash table keyed by the hash of their first 1024 bytes; non-colliding entries are unique.
  3. For files with the same hash of the first 1k bytes, compute the hash of the full contents; files with matching hashes are NOT unique.
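The three steps above are really one narrowing pattern applied with progressively more expensive keys. As a minimal sketch of that pattern (the helper names `narrow`, `quick_hash`, `full_hash`, and `find_duplicate_groups` are mine, not from the answer's code below):

```python
import hashlib
import os
from collections import defaultdict


def quick_hash(path):
    """SHA-1 of just the first 1024 bytes - a cheap collision filter."""
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read(1024)).digest()


def full_hash(path):
    """SHA-1 of the entire file contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.digest()


def narrow(candidates, key_fn):
    """Regroup one candidate bucket by key_fn; drop singleton buckets,
    since a file alone in its bucket cannot have a duplicate."""
    groups = defaultdict(list)
    for path in candidates:
        groups[key_fn(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicate_groups(all_files):
    """Apply the three stages in order: size, first-1k hash, full hash."""
    groups = [list(all_files)]
    for key_fn in (os.path.getsize, quick_hash, full_hash):
        groups = [g for bucket in groups for g in narrow(bucket, key_fn)]
    return groups  # each remaining group holds paths to identical files
```

Each stage only spends work on files that still collide after the previous, cheaper stage - which is exactly where the 100x claim comes from.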

Code:

#!/usr/bin/env python
# if running in py3, change the shebang, drop the next import for readability (it does no harm in py3)
from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    file_object = open(filename, 'rb')

    if first_chunk_only:
        hashobj.update(file_object.read(1024))
    else:
        for chunk in chunk_reader(file_object):
            hashobj.update(chunk)
    hashed = hashobj.digest()

    file_object.close()
    return hashed


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue


    # For all files with the same file size, get their hash on the 1st 1024 bytes only
    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue    # this file size is unique, no need to spend CPU cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
                # the key is the hash on the first 1024 bytes plus the size - to
                # avoid collisions on equal hashes in the first part of the file
                # credits to @Futal for the optimization
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except (OSError,):
                # the file may have vanished or changed by the time execution got here
                continue

    # For all files with the hash on the 1st 1024 bytes, get their hash on the full file - collisions will be duplicates
    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue    # this hash of first 1k file bytes is unique, no need to spend CPU cycles on it

        for filename in files_list:
            try: 
                full_hash = get_hash(filename, first_chunk_only=False)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except (OSError,):
                # the file may have vanished or changed by the time execution got here
                continue


if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
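The script above only *reports* duplicates; since the question asks to delete them, here is a hedged sketch of the deletion step (the function name `delete_duplicates` and the `dry_run` parameter are mine). For brevity it hashes every file's full contents instead of staging by size and first-1k hash, so it trades the speedup for simplicity; the staged keys from the answer could be dropped in the same way.

```python
import hashlib
import os


def delete_duplicates(directory, dry_run=True):
    """Remove files whose full SHA-1 digest matches an earlier file.
    Returns the list of paths that were (or would be) deleted.
    With dry_run=True nothing is removed, so the result can be
    reviewed before doing anything destructive."""
    seen = {}      # digest -> first path kept with that content
    deleted = []
    for dirpath, _, filenames in os.walk(directory):
        for name in sorted(filenames):  # deterministic "first seen" order
            path = os.path.join(dirpath, name)
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            digest = h.digest()
            if digest in seen:
                deleted.append(path)
                if not dry_run:
                    os.remove(path)
            else:
                seen[digest] = path
    return deleted
```

Running once with `dry_run=True` and eyeballing the returned list before a second, destructive run is a cheap safety net.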

And here is the fun part - the performance comparison.

Baseline -

  • Directory with 1047 files: 32 mp4, 1015 jpg; total size 5445.998 MiB - i.e. my phone's camera auto-upload directory :)
  • Small (but fully functional) processor - 1600 BogoMIPS, 1.2 GHz, 32 KB L1 + 256 KB L2 cache:

Processor: Feroceon 88FR131 rev 1 (v5l) BogoMIPS: 1599.07

(i.e. my low-end NAS :), running Python 2.7.11.

So, the output of @nosklo's very handy solution:

root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py 
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

real    5m44.198s
user    4m44.550s
sys     0m33.530s

And here is the version with the size filter first, then the small hash, and finally the full hash if collisions are found:

root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

real    0m1.398s
user    0m1.200s
sys     0m0.080s

Both versions were run 3 times each, to get the average of the time required.

So v1 is (user+sys) 284s, the other one 2s; quite a difference, huh :) With this speedup one could move to SHA-512, or even fancier - the performance penalty would be mitigated by the smaller amount of computation needed.

Negatives:

  • More disk access than the other versions - every file is touched once for its size stats (that's cheap, but still disk IO), and every duplicate is opened twice (for the small first-1k-bytes hash, and for the full-contents hash)
  • More memory consumed, since the hash-table structures are kept around at runtime
109 votes

Stack Overflow user

Answered 2009-04-14 19:00:38

Recursive folder version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan all of them recursively and report all duplicates it finds.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                for chunk in chunk_reader(open(full_path, 'rb')):
                    hashobj.update(chunk)
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print("Duplicate found: %s and %s" % (full_path, duplicate))
                else:
                    hashes[file_id] = full_path

if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
47 votes

Stack Overflow user

Answered 2009-04-14 18:51:47

import hashlib
import os

def remove_duplicates(dir):
    unique = set()
    for filename in os.listdir(dir):
        full_path = os.path.join(dir, filename)  # listdir yields bare names
        if os.path.isfile(full_path):
            with open(full_path, 'rb') as f:
                filehash = hashlib.md5(f.read()).hexdigest()
            if filehash not in unique:
                unique.add(filehash)
            else:
                os.remove(full_path)

Edit:

For MP3s you may also be interested in this topic: Detect duplicate MP3 files with different bitrates and/or different ID3 tags?
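The linked topic is about MP3 copies that differ only in metadata, which a plain content hash like the one above treats as different files. A hedged sketch of one way to get closer (the function name is mine; this only strips a leading ID3v2 tag and a trailing 128-byte ID3v1 tag, not every tag variant found in the wild, and does nothing about different bitrates):

```python
import hashlib


def audio_fingerprint(path):
    """Hash an MP3's bytes while ignoring a leading ID3v2 tag and a
    trailing ID3v1 tag, so copies differing only in tags compare equal.

    ID3v2 starts with b'ID3' and a 10-byte header whose last 4 bytes
    are a syncsafe (7-bit) tag size; ID3v1 is the last 128 bytes,
    beginning with b'TAG'."""
    with open(path, 'rb') as f:
        data = f.read()
    # strip ID3v2 at the front
    if data[:3] == b'ID3' and len(data) >= 10:
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        data = data[10 + size:]
    # strip ID3v1 at the end
    if len(data) >= 128 and data[-128:-125] == b'TAG':
        data = data[:-128]
    return hashlib.sha1(data).hexdigest()
```

Plugging a fingerprint like this in as the hash key of any of the answers above would then group tag-only variants together.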

21 votes
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/748675
