I am comparing the images contained in two folders (folder A and folder B) to identify duplicates in folder B, using OpenCV's SIFT.
My original code stored the SIFT analysis of every image in a dictionary, until I tried to compare two very large folders and the computer froze (presumably because too much data was held in the dictionary).
I rewrote the code with a loop so that only one subfolder of folder B is analyzed at a time. The problem: the code works, but it takes a very long time. I am looking for advice on how to reorder the nested elements so the code runs faster.
Here is what I have tried: I tried placing the code that reads the SIFT values of folder A so it runs only once, but if I put it outside the loop (i.e. before the line for folder in glob.glob...), the variable desc_1 is empty and the result is an empty spreadsheet. I also tried placing the block beginning with for a in glob.iglob and the block beginning with for b in glob.iglob at the same indentation level inside the for folder in glob.glob loop, but that too ended in an empty spreadsheet.
I also tried making the analysis of folder A the first step inside the loop, but that causes the spreadsheet to be overwritten on every pass through the loop.
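The restructuring being attempted can be sketched with file size as a cheap stand-in for the SIFT analysis (the helper name compare_once is illustrative, not part of the original code): folder A is scanned exactly once before the loop, and its cached values are reused for every subfolder of B.

```python
import glob
import os

def compare_once(folder_a, subfolders_of_b):
    """Scan folder A once, then reuse its cached values for every subfolder of B."""
    # Hoisted out of the loop: a single pass over folder A.
    sizes_a = {
        a: os.path.getsize(a)
        for a in glob.iglob(os.path.join(folder_a, "**"), recursive=True)
        if os.path.isfile(a)
    }
    results = {}
    for folder_b in subfolders_of_b:
        matches = []
        for b in glob.iglob(os.path.join(folder_b, "**"), recursive=True):
            if not os.path.isfile(b):
                continue
            size_b = os.path.getsize(b)
            # The cached dict replaces the repeated re-scan of folder A.
            matches.extend(a for a, size_a in sizes_a.items() if size_a == size_b)
        results[os.path.basename(os.path.normpath(folder_b))] = matches
    return results
```

The same shape applies to the SIFT version: build the dictionary of folder-A descriptors once, then run the per-subfolder matching against it.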
#Based on the tutorial provided by Sergio Canu (pysource) - https://pysource.com/2018/07/20/find-similarities-between-two-images-with-opencv-and-python/
from __future__ import division
import cv2
import os, os.path
import numpy as np
import glob
import pandas as pd
# Sift and Flann
sift = cv2.SIFT_create()
index_params = dict(algorithm=0, trees=5)
search_params = dict()
flann = cv2.FlannBasedMatcher(index_params, search_params)
Match = []
Match2 = []
listOfSimilarities = []
countInner = 0
countOuter = 0
#Identify the images
folder1 = "/home/oem/Desktop/Folder1/**"
folder1_count = sum(len(files) for _,_, files in os.walk('/home/oem/Desktop/Folder1/'))
folder2_count = sum(len(files) for _,_, files in os.walk('/home/oem/Desktop/_test'))
print(folder1_count)
print(folder2_count)
extensionsOnly = ('.jpeg','.jpg','.png','.tif','.tiff','.gif')
#Make a dictionary representing SIFT readings of each photo in folder one
for folder in glob.glob(r"/home/oem/Desktop/_test/*/", recursive=True):
    folderPrint = folder.split(os.sep)[-2]
    folder = folder + "**"
    print(folderPrint)
    siftOut1 = {}
    for a in glob.iglob(folder1, recursive=True):
        if not a.lower().endswith(extensionsOnly):
            continue
        image1 = cv2.imread(a)
        kp_1, desc_1 = sift.detectAndCompute(image1, None)
        siftOut1[a] = (kp_1, desc_1)
        siftOut2 = {}
        for b in glob.iglob(folder, recursive=True):
            if not b.lower().endswith(extensionsOnly):
                continue
            image2 = cv2.imread(b)
            kp_2, desc_2 = sift.detectAndCompute(image2, None)
            siftOut2[b] = (kp_2, desc_2)
            countOuter += 1
            # calculate the matches between the two SIFT analyses and store them in 'matches'
            matches = flann.knnMatch(desc_1, desc_2, k=2)
            good_points = []
            # for every match, check that the distance meets a threshold (a lower distance
            # suggests a higher-quality match); keep the matches that pass in good_points
            for m, n in matches:
                if m.distance < 0.6 * n.distance:
                    good_points.append(m)
            # the photo with the higher number of keypoints supplies the denominator
            if len(kp_1) >= len(kp_2):
                number_keypoints = len(kp_1)
            else:
                number_keypoints = len(kp_2)
            # similarity percentage: good points divided by the keypoint count
            percentage_similarity = int(float(len(good_points)) / number_keypoints * 100)
            # if the photos match, record them in Match, Match2, and listOfSimilarities
            if percentage_similarity > 16:
                Match.append(a)
                Match2.append(b)
                listOfSimilarities.append(percentage_similarity)
    # place the three lists side by side in a dataframe and write one CSV per subfolder
    zippedList = list(zip(Match, Match2, listOfSimilarities))
    dfObj = pd.DataFrame(zippedList, columns=['Original', 'Title', 'listofSimilarities'])
    dfObj.to_csv(r"/home/oem/Desktop/Results2/" + folderPrint + ".csv")
    Match = []
    Match2 = []
    listOfSimilarities = []  # reset alongside Match/Match2 so the zipped rows stay aligned
    zippedList = []

As requested, here is a minimal reproducible example:
import time
import glob
import pandas as pd
import os
startTime = time.time()
# Identify the images
folder1 = "C:\\Users\\Desktop\\folder1\\**"
Match = []
for folder2 in glob.glob("C:\\Users\\folder2\\*\\", recursive=True):
    for a in glob.iglob(folder1, recursive=True):
        h1 = os.path.getsize(a)
        for b in glob.iglob(folder2, recursive=True):
            h2 = os.path.getsize(b)
            if h1 >= h2:
                Match.append(a)
    dfObj = pd.DataFrame(Match)
    dfObj.to_csv("C:\\Users\\Desktop\\Results2\\" + folder2.split(os.sep)[-2] + ".csv")
    Match = []
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

Posted on 2022-08-18 17:14:20
If all you want is to identify files of identical size, just drop every path into a dictionary indexed by file size:
import collections
# defaultdict(list) creates a new empty list for any key that doesn't exist yet
by_size = collections.defaultdict(list)
for path in all_your_files:
    size = os.path.getsize(path)
    by_size[size].append(path)

same_size = {
    size: paths
    for (size, paths) in by_size.items()
    if len(paths) > 1
}

The same can be done with cryptographic hashes (MD5, the SHAs, ...).
If you really do want to compare by content, you should read up on Content-based image retrieval.
The usual approach is a database of "feature vectors" (plain numpy arrays), one feature vector per picture. Against that database you run nearest-neighbor queries: if the feature vector of a query image has a sufficiently close nearest neighbor, the two images are probably similar.
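A minimal sketch of that idea with plain numpy, using brute-force Euclidean distance (a real system would use an approximate nearest-neighbor index; the function names here are illustrative):

```python
import numpy as np

def nearest_neighbor(database, query):
    """Return (index, distance) of the database vector closest to the query.

    database: (n, d) array of feature vectors; query: (d,) vector.
    """
    dists = np.linalg.norm(database - query, axis=1)
    i = int(np.argmin(dists))
    return i, float(dists[i])

def is_probably_similar(database, query, threshold):
    """The query image is 'probably similar' if its nearest neighbor is close enough."""
    _, dist = nearest_neighbor(database, query)
    return dist <= threshold
```

The choice of threshold depends on how the feature vectors were produced; brute force is O(n) per query, which is why larger collections move to indexes like FLANN or FAISS.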
https://stackoverflow.com/questions/73396054