文章/答案/技术大牛

发布

社区首页 >问答首页 >检查另一个文件夹/子文件夹中重复的文件夹/子文件夹，但不检查第二个文件夹本身中的副本

问检查另一个文件夹/子文件夹中重复的文件夹/子文件夹，但不检查第二个文件夹本身中的副本
EN

Software Recommendation用户

提问于 2017-08-20 19:04:30

回答 3查看 125关注 0票数 0

我需要检查A文件夹(及其子文件夹)中所包含的文件是否存在于B文件夹或其子文件夹中(作为副本)，但我不希望看到B文件夹/子文件夹本身存在重复文件！

为了更清楚地说明这一点，以下面的示例场景为例：

文件夹A树是：

A
|_ DIR A1
|  |_ file_a1_1
|
|_ DIR A2
|  |_ file_a2_1
|
|_ file_a_1
|_ file_a_2
|_ a_file_not_duplicated

文件夹B树是：

B
|_ DIR B1
|  |_ file_a1_1
|  |_ file_a_1
|
|_ DIR B2
|  |_ file_a2_1
|  |_ file_b_1   (<= only duplicated in B)
|  |_ file_b_2   (<= only duplicated in B)
|
|_ file_a_1
|_ file_a_2
|_ file_a1_1
|_ file_a2_1
|_ file_b_1
|_ file_b_2

因此，除了"a_file_not_duplicated“文件夹之外，A文件夹/子文件夹中的每个文件在B文件夹或其子文件夹中都有一个副本。现在，一个“传统”复制查找软件会列出我，除了这些，甚至那些文件是重复的B文件夹/子文件夹(在本例中是file_b_1和file_b_2 )，但这些文件不存在于A文件夹/子文件夹中，而这些文件是我不想被显示的，而且这些文件也不包括在我期望的重复列表的结果中。

一个免费的软件将是伟大的，但我准备考虑/评估付费的一个也。

有什么建议吗？谢谢!

在示例中，我使用相同的文件名来指出哪些文件在B文件夹/子文件夹中有重复项，但是，当然，重复标识方法必须主要基于文件内容/校验和(与重复查找器通常发生的情况一样)。

编辑:我需要一些可以在Windows或Linux上运行的东西(如果可能的话)。

第二次编辑：

为了更好地澄清我在这里要求的内容，我添加了两个图形表示，并附加了一些解释：

传奇：

A\b=A_(根)文件夹中包含的所有文件集加上它的所有子文件夹；
a/b n. d. = A/B设置的文件没有重复(唯一的)；
类型1 dupes =A\\B文件，该文件(至少)在B_集_
类型2 dupes =A\\B文件，其中(至少)有一个副本在B_
类型3a/b dupes = A/B文件，该文件仅在A/B集中有(至少)一个重复，而在B/A集中没有重复；

(总结类型1和类型2 dupes =文件(至少)在另一个“文件夹和子文件夹”文件中具有(至少)一个副本，并且可能(或不)在自己的“文件夹和子文件夹”文件集中也有重复的文件。)

另一种代表情况的方法是：

除非重复文件外，暗区表示我不想在结果中显示的重复文件:我不想要“3b”副本，即B+其子文件夹中包含的重复文件，以及A+其子文件夹中没有重复的文件。

gratis

file-synchronization

duplicate-files

duplicate-display

回答 3

Software Recommendation用户

发布于 2017-09-05 19:52:34

下面是在Python中可以在Linux和Windows中运行的另一个可能的解决方案：

#!/usr/bin/env python

#This script originates from the need to give an answer to this question on SoftwareRecs:
#https://softwarerecs.stackexchange.com/questions/45293/check-for-folder- subfolders-duplicates-in-another-folder-subfolder-but-not-check
#Credit goes to Kodiologist for having deeply inspired and motivated me with his answer: https://softwarerecs.stackexchange.com/a/45316
#(this script is based on his code)

#This uses scandir and requires Python >= 3.5
#
#USEFUL REFERENCES:
#http://benhoyt.com/writings/scandir/
#https://github.com/benhoyt/scandir

from os import scandir, path
import hashlib

class element_hashed(object):
    __slots__ = ('path', 'name')
    def __repr__(self):
        return "<element_hashed path:%s name:%s>" % (self.path, self.name)
    def __str__(self):
        return "element_hashed: path is %s, name is %s" % (self.path, self.name)

#Source dir data structure:
#dict[HASH] -> dict[id_1...n] -> element_hashed('path', 'name')

#Repository dir data structure:
#dict[HASH] -> dict[id_1...n] -> "a found item" # just a simple filler, this information isn't really important for the purpose

srcdir = 'A'
repdir = 'B'
dcount = 0

def hashes(topdir, repository=False):
    for entry in scandir(topdir):

        #I'm not going to follow symlinks, I want only "effective" files/directory
        #https://www.python.org/dev/peps/pep-0471/
        #there was basic consensus among the most involved participants, and this PEP's author [...] to warrant following symlinks by default
        #it's straightforward to call the relevant methods with follow_symlinks=False if the other behaviour is desired.

        if entry.is_dir(follow_symlinks=False):
            yield from hashes(entry.path)

        elif entry.is_file(follow_symlinks=False):
            with open(entry.path, "rb") as o:
                if repository:
                    eh = "a found item"
                else:
                    eh = element_hashed()
                    eh.path = topdir
                    eh.name = entry.name
                yield eh, hashlib.sha256(o.read()).hexdigest()


if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        sys.exit("usage: " + sys.argv[0] + " source_dir repository_dir \n(relative paths may be used)")
    else:
        if not (path.isdir(sys.argv[1]) and path.isdir(sys.argv[2])):
            sys.exit("ERROR: one or both non-existent directory!")
        srcdir = sys.argv[1]
        repdir = sys.argv[2]

source_hashes = dict()

for eh, key in hashes(path.abspath(srcdir)): #use source absolute path always
    if not key in source_hashes:
        source_hashes[key] = dict()
    source_hashes[key][path.join(eh.path, eh.name)] = eh

for _, h in hashes(path.abspath(repdir), True): #use repository absolute path always
    if h in source_hashes: # found _ 1b|2b type _ dup. item
        if len(source_hashes[h]) > 1:
            for eh in source_hashes[h]:
                print("Type 2a dup. :=> ", eh)
                dcount += 1
        else:
            print("Type 1a dup. :=> ", next(iter(source_hashes[h])))
            dcount += 1
        source_hashes.pop(h, None)

#now the remainders are all the "srcdir" files that don't have any duplicate in "repdir" = unique ones + type 3a dupes. I'm only interested in the latter
for fn in source_hashes.values():
    if len(fn) > 1:
        for eh in fn:
            print("Type 3a dup. :=> ", eh)
            dcount += 1

print("\nFound ", dcount, " dupes")

票数 1

Software Recommendation用户

发布于 2017-08-21 18:53:09

如果我正确理解你的问题，你想要验证A中的每个文件在B中至少有一个副本。这里有一个Python程序，它计算两个目录中每个文件的散列，并打印出出现在A而不是B中的文件的名称。这个程序是自由软件，运行在Windows和Linux上。

import os, hashlib

source = 'A'
destination = 'B'

def hashes(topdir):
    for dirpath, _, filenames in os.walk(topdir):
       for fn in filenames:
           fn = os.path.join(dirpath, fn)
           with open(fn, "rb") as o:
               yield fn, hashlib.sha256(o.read()).hexdigest()

source_hashes = dict()
for fn, h in hashes(source):
    source_hashes[h] = fn
for _, h in hashes(destination):
    source_hashes.pop(h, None)

for fn in source_hashes.values():
   print("Didn't find a copy of", fn)

票数 0

Software Recommendation用户

发布于 2017-09-04 19:06:51

我想我已经尝试过几乎所有可用的免费副本查找软件，但是没有人提供这样的“文件夹自扫描排除”功能，所以我开始尝试付费软件，然后看到了双杀手专业版的演示。

DK Pro允许您比较内容，选择哪些文件夹和子文件夹必须被视为“库”。这意味着，该文件夹中的元素不会单独进行比较(即目录自扫描排除)，下面是一个使用示例，首先，如果使用“经典”文件夹比较，会发生什么：

正如您所看到的，在B文件夹/子文件夹中有一些我不想列出的重复文件，因为它们实际上在A文件夹/子文件夹中没有副本。现在，下面是将B作为“库”目录列表的情况：

你会注意到不想要的重复文件现在不再出现，并且远离结果。如果我想更好地突出显示结果列表中存在哪些文件夹/子文件夹重复文件，我现在可以检查“根据标准.”(使用适当的路径匹配掩码)，然后按“路径”对结果进行排序，我将得到：

所以，最终，这成功了。如果我想删除所选的文件或导出条目，我将不得不购买许可证，尽管这不会是一项令人望而却步的费用(私人使用许可费用为14.95欧元/ 19.95美元--毕竟，如果值得的话，价格相当便宜和合理)，但在这种情况下，你只需要看到一个列表(当你不处理数百个文件，或者如果你不想删除任何东西，甚至不需要导出条目列表)，你甚至可以节省你的钱。

票数 0

页面原文内容由Software Recommendation提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://softwarerecs.stackexchange.com/questions/45293

复制

相似问题

问检查另一个文件夹/子文件夹中重复的文件夹/子文件夹，但不检查第二个文件夹本身中的副本
EN

第二次编辑：

传奇：

回答 3

Software Recommendation用户

Software Recommendation用户

Software Recommendation用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问检查另一个文件夹/子文件夹中重复的文件夹/子文件夹，但不检查第二个文件夹本身中的副本EN

第二次编辑：

传奇：

回答 3

Software Recommendation用户

Software Recommendation用户

Software Recommendation用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问检查另一个文件夹/子文件夹中重复的文件夹/子文件夹，但不检查第二个文件夹本身中的副本
EN