I am comparing the images contained in two folders (folder A and folder B) to identify duplicates in folder B, using OpenCV's SIFT.
My original code stored the SIFT analysis of every image in a dictionary, until I tried to compare two very large folders and the computer froze (presumably because too much data was held in the dictionary).
I rewrote the code with a loop so that only one subfolder of folder B is analyzed at a time. The problem: the code works, but it takes a very long time. I am looking for advice on how to reorder the nested elements so the code runs faster.
Here is what I have tried: I tried placing the code that reads the SIFT values of folder A so it runs only once, but if I put it outside the loop (i.e. before the line for folder in glob.glob...), the variable desc_1 is empty and the result is an empty spreadsheet. I also tried placing the block beginning with for a in glob.iglob and the block beginning with for b in glob.iglob at the same indentation level inside the for folder in glob.glob loop, but that too ended in an empty spreadsheet.
I also tried making the analysis of folder A the first step inside the loop, but that causes the spreadsheet to be overwritten on every pass through the loop.
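The restructuring being attempted can be sketched with file size as a cheap stand-in for the SIFT analysis (the helper name compare_once is illustrative, not part of the original code): folder A is scanned exactly once before the loop, and its cached values are reused for every subfolder of B.

```python
import glob
import os

def compare_once(folder_a, subfolders_of_b):
    """Scan folder A once, then reuse its cached values for every subfolder of B."""
    # Hoisted out of the loop: a single pass over folder A.
    sizes_a = {
        a: os.path.getsize(a)
        for a in glob.iglob(os.path.join(folder_a, "**"), recursive=True)
        if os.path.isfile(a)
    }
    results = {}
    for folder_b in subfolders_of_b:
        matches = []
        for b in glob.iglob(os.path.join(folder_b, "**"), recursive=True):
            if not os.path.isfile(b):
                continue
            size_b = os.path.getsize(b)
            # The cached dict replaces the repeated re-scan of folder A.
            matches.extend(a for a, size_a in sizes_a.items() if size_a == size_b)
        results[os.path.basename(os.path.normpath(folder_b))] = matches
    return results
```

The same shape applies to the SIFT version: build the dictionary of folder-A descriptors once, then run the per-subfolder matching against it.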
#Based on the tutorial provided by Sergio Canu (pysource) - https://pysource.com/2018/07/20/find-similarities-between-two-images-with-opencv-and-python/
from __future__ import division
import cv2
import os, os.path
import numpy as np
import glob
import pandas as pd
# Sift and Flann
sift = cv2.SIFT_create()
index_params = dict(algorithm=0, trees=5)
search_params = dict()
flann = cv2.FlannBasedMatcher(index_params, search_params)
Match = []
Match2 = []
listOfSimilarities = []
countInner = 0
countOuter = 0
#Identify the images
folder1 = "/home/oem/Desktop/Folder1/**"
folder1_count = sum(len(files) for _,_, files in os.walk('/home/oem/Desktop/Folder1/'))
folder2_count = sum(len(files) for _,_, files in os.walk('/home/oem/Desktop/_test'))
print(folder1_count)
print(folder2_count)
extensionsOnly = ('.jpeg','.jpg','.png','.tif','.tiff','.gif')
#Make a dictionary representing SIFT readings of each photo in folder one
for folder in glob.glob(r"/home/oem/Desktop/_test/*/", recursive=True):
    folderPrint = folder.split(os.sep)[-2]
    folder = folder + "**"
    print(folderPrint)
    siftOut1 = {}
    for a in glob.iglob(folder1, recursive=True):
        if not a.lower().endswith(extensionsOnly):
            continue
        image1 = cv2.imread(a)
        kp_1, desc_1 = sift.detectAndCompute(image1, None)
        siftOut1[a] = (kp_1, desc_1)
        siftOut2 = {}
        for b in glob.iglob(folder, recursive=True):
            if not b.lower().endswith(extensionsOnly):
                continue
            image2 = cv2.imread(b)
            kp_2, desc_2 = sift.detectAndCompute(image2, None)
            siftOut2[b] = (kp_2, desc_2)
            countOuter += 1
            # calculate the matches between the two SIFT analyses and store them in 'matches'
            matches = flann.knnMatch(desc_1, desc_2, k=2)
            good_points = []
            # for every match, check that the distance meets a threshold (a lower distance
            # suggests a higher-quality match); keep the matches that pass in good_points
            for m, n in matches:
                if m.distance < 0.6 * n.distance:
                    good_points.append(m)
            # the photo with the higher number of keypoints supplies the denominator
            if len(kp_1) >= len(kp_2):
                number_keypoints = len(kp_1)
            else:
                number_keypoints = len(kp_2)
            # similarity percentage: good points divided by the keypoint count
            percentage_similarity = int(float(len(good_points)) / number_keypoints * 100)
            # if the photos match, record them in Match, Match2, and listOfSimilarities
            if percentage_similarity > 16:
                Match.append(a)
                Match2.append(b)
                listOfSimilarities.append(percentage_similarity)
    # place the three lists side by side in a dataframe and write one CSV per subfolder
    zippedList = list(zip(Match, Match2, listOfSimilarities))
    dfObj = pd.DataFrame(zippedList, columns=['Original', 'Title', 'listofSimilarities'])
    dfObj.to_csv(r"/home/oem/Desktop/Results2/" + folderPrint + ".csv")
    Match = []
    Match2 = []
    listOfSimilarities = []  # reset alongside Match/Match2 so the zipped rows stay aligned
    zippedList = []

As requested, here is a minimal reproducible example:
import time
import glob
import pandas as pd
import os
startTime = time.time()
# Identify the images
folder1 = "C:\\Users\\Desktop\\folder1\\**"
Match = []
for folder2 in glob.glob("C:\\Users\\folder2\\*\\", recursive=True):
    for a in glob.iglob(folder1, recursive=True):
        h1 = os.path.getsize(a)
        for b in glob.iglob(folder2, recursive=True):
            h2 = os.path.getsize(b)
            if h1 >= h2:
                Match.append(a)
    dfObj = pd.DataFrame(Match)
    dfObj.to_csv("C:\\Users\\Desktop\\Results2\\" + folder2.split(os.sep)[-2] + ".csv")
    Match = []
executionTime = (time.time() - startTime)
print('Execution time in seconds: ' + str(executionTime))

Posted on 2022-08-18 17:14:20
If all you want is to identify files of identical size, just drop every path into a dictionary indexed by file size:
import collections
# defaultdict(list) creates a new empty list for any key that doesn't exist yet
by_size = collections.defaultdict(list)
for path in all_your_files:
    size = os.path.getsize(path)
    by_size[size].append(path)

same_size = {
    size: paths
    for (size, paths) in by_size.items()
    if len(paths) > 1
}

The same can be done with cryptographic hashes (MD5, the SHAs, ...).
If you really do want to compare by content, you should read up on Content-based image retrieval.
The usual approach is a database of "feature vectors" (plain numpy arrays), one feature vector per picture. Against that database you run nearest-neighbor queries: if the feature vector of a query image has a sufficiently close nearest neighbor, the two images are probably similar.
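A minimal sketch of that idea with plain numpy, using brute-force Euclidean distance (a real system would use an approximate nearest-neighbor index; the function names here are illustrative):

```python
import numpy as np

def nearest_neighbor(database, query):
    """Return (index, distance) of the database vector closest to the query.

    database: (n, d) array of feature vectors; query: (d,) vector.
    """
    dists = np.linalg.norm(database - query, axis=1)
    i = int(np.argmin(dists))
    return i, float(dists[i])

def is_probably_similar(database, query, threshold):
    """The query image is 'probably similar' if its nearest neighbor is close enough."""
    _, dist = nearest_neighbor(database, query)
    return dist <= threshold
```

The choice of threshold depends on how the feature vectors were produced; brute force is O(n) per query, which is why larger collections move to indexes like FLANN or FAISS.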
https://stackoverflow.com/questions/73396054