Q: Image deduplicator

Code Review user

Asked 2015-11-10 04:48:28

1 answer · 1.3K views · 0 followers · 5 votes

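I have a bunch of duplicated photos that I have accumulated over the years. I want to build a list of all of them so that I can finally delete some. My idea is simple: dump the hash and location of every image file into MongoDB, keyed by path, for later analysis. This is what I came up with: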

import com.david.mongodocs.ImageEntry;
import com.mongodb.MongoClient;
import org.apache.commons.codec.digest.DigestUtils;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MD5Deduplicator {
    private static Datastore datastore;

    public static void main(String[] args) throws Exception {
        long startTime = System.nanoTime();
        Morphia morphia = new Morphia();
        morphia.mapPackage("com.david.mongodocs");
        datastore = morphia.createDatastore(new MongoClient(), "md5Deduplicator");
        datastore.ensureIndexes();
        logDuplicates(Paths.get(args[0]));
        System.out.println("Completed scan in " + (System.nanoTime() - startTime )+ " nanosecs");
    }

    private static void logDuplicates(Path path) throws IOException {
        Files.walk(path).parallel()
                .filter(Files::isReadable)
                .filter(Files::isRegularFile)
                .forEach(filePath -> {
                    try {
                        String contentType = Files.probeContentType(filePath);
                        if (contentType != null && contentType.startsWith("image")) {
                            FileInputStream fis = new FileInputStream(filePath.toFile());
                            String md5 = DigestUtils.md5Hex(fis);
                            fis.close();
                            ImageEntry imageEntry = new ImageEntry(filePath.toAbsolutePath().toString(), md5);
                            datastore.save(imageEntry);

                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }

                });
    }
}

The ImageEntry class:

package com.david.mongodocs;

import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import org.mongodb.morphia.annotations.Indexed;

@Entity
public class ImageEntry {
    @Id
    public final String filePath;
    @Indexed
    public final String md5;

    public ImageEntry(String filePath, String md5) {
        this.filePath = filePath;
        this.md5 = md5;
    }
}

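In the VisualVM profiler, the slowest part appears to be the md5Hex function (I was a bit surprised; I expected the slowest part to be related to FileInputStream or the save call). Should I use another hash function or another MD5 implementation?

I am also a bit worried that Files.walk().parallel() may rely too much on Java's defaults and may not be the best way to parallelize this.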

java

mongodb

hashcode

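1 answer

Code Review user

Answered 2015-11-11 21:48:32

The problem and the topic are interesting. I recently needed to do some similar processing, and this question prompted me to follow through.

First, some observations on the original implementation.

As far as I can see, the logDuplicates(Path) method does not log duplicate files. It saves an ImageEntry object for every image file under the path to MongoDB. So tracking down the duplicates still requires running several queries against MongoDB.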
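The missing "find the duplicates" step can also be done in memory once the hashes are known. The following is a minimal sketch, not the asker's or the answerer's code: the class name DuplicateGroups and the path-to-hash map it takes as input are assumptions made for illustration, and it uses only the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups file paths by content hash.
final class DuplicateGroups {

    /**
     * Takes a map of file path -> hash and returns hash -> sorted list of
     * paths, keeping only the hashes that occur for two or more files.
     */
    static Map<String, List<String>> findDuplicates(Map<String, String> hashByPath) {
        Map<String, List<String>> pathsByHash = new TreeMap<>();
        for (Map.Entry<String, String> e : hashByPath.entrySet()) {
            pathsByHash.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                       .add(e.getKey());
        }
        // Drop hashes that belong to a single file: those are not duplicates.
        pathsByHash.values().removeIf(paths -> paths.size() < 2);
        // Sort each group so the output is stable across runs.
        pathsByHash.values().forEach(Collections::sort);
        return pathsByHash;
    }
}
```

Each remaining group lists files whose hashes are equal, which makes them candidates for deletion; whether equal hashes are enough proof of equal content depends on the hash function chosen.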

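There is also room for improvement in the method itself. Calling datastore.save(imageEntry) separately for each item looks rather suspicious. Saving all available items in one batch should make things faster. In fact, Datastore.save is overloaded with save(Iterable<T>) and save(T... entities). A slightly improved version of the method looks like this: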

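private void digestImages(Path path) throws IOException {
  // forEach runs on several threads here, so the list must be thread-safe:
  // a plain ArrayList can lose entries or throw under concurrent add().
  List<ImageEntry> images =
      Collections.synchronizedList(new ArrayList<>(APPROX_IMAGES_COUNT));
  // Files.walk keeps directory handles open; close the stream when done.
  try (Stream<Path> paths = Files.walk(path)) {
    paths.parallel()
         .filter(Files::isReadable)
         .filter(Files::isRegularFile)
         .forEach(filePath -> {
           if (isImage(filePath)) {
               ImageEntry img = digestAndBuildImageEntry(filePath);
               if (img != null) {
                 images.add(img);
               } else {
                 System.out.println(String.format("Failed to digest image: %1$s", filePath));
               }
           }});
  }
  // One batch save instead of one round trip per image.
  datastore.save(images);
}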

private boolean isImage(Path path) {
  try {
    String contentType = Files.probeContentType(path);
    return contentType != null && contentType.startsWith("image");
  } catch (IOException ex) {
    ex.printStackTrace();
    return false;
  }
}

private ImageEntry digestAndBuildImageEntry(Path filePath) {
  try (InputStream is = Files.newInputStream(filePath);
       BufferedInputStream buffered = new BufferedInputStream(is)) {
    String hash = DigestUtils.md5Hex(buffered);
    return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
  } catch (IOException ex) {
    ex.printStackTrace();
    return null;
  }
}
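
When a parallel stream feeds a result container, letting the stream build that container with collect avoids shared mutable state altogether. Below is a minimal sketch of that variant, not the code that was benchmarked: the class HashCollector and its method names are assumptions, it hashes with the JDK's MessageDigest instead of DigestUtils so that it stays self-contained, and it leaves out the image content-type filter and the MongoDB save.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: walks a directory tree and hashes every regular file.
final class HashCollector {

    /** Hex MD5 of a file, streamed through a fixed-size buffer. */
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            // Lambdas in a stream cannot throw checked exceptions.
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns absolute path -> MD5 for every readable regular file under root. */
    static Map<String, String> hashAll(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.parallel()
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    .collect(Collectors.toMap(
                            p -> p.toAbsolutePath().toString(),
                            HashCollector::md5Hex));
        }
    }
}
```

Unlike the version above, a file that fails to read aborts the whole walk here with an UncheckedIOException; catching inside md5Hex and skipping the file would restore the original "log and continue" behaviour.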

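Tests

I have a folder with about 900 JPG image files spread across numerous subfolders, which I used for testing. Introducing try-with-resources plus batch saving seems to improve overall performance by about 10% (see the results below).

Based on this SO post, which helped me recall the different ways and APIs for producing checksums and hashes, I ran several tests to compare the original implementation with the reviewed one and with Guava. The average results of 5 runs on my old i7 are as follows: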

impl               avg time (ms)   % of original
------------------------------------------------
original                6583           100.0
reviewed                5873            89.2
guava/sha1              8267           125.6
guava/md5               5865            89.1
guava/murmur3-128       3819            58.0
guava/murmur3-32        2689            40.9
guava/adler32           2432            36.9
------------------------------------------------
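
The two numeric columns can be produced with a very small timing helper. This is only a sketch of one way to do it, under the assumption that each run is timed end to end with System.nanoTime as in the question; the class Bench and its methods are hypothetical, and it ignores JVM warm-up and disk caching, both of which can skew numbers like these.

```java
// Hypothetical timing helper; not the harness used for the table.
final class Bench {

    /** Average wall-clock time of task over the given number of runs, in ms. */
    static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    /** A measured time expressed as a percentage of the baseline time. */
    static double percentOf(double baselineMillis, double millis) {
        return 100.0 * millis / baselineMillis;
    }
}
```

For example, Bench.percentOf(6583, 5873) is the calculation behind the "reviewed" row.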

For the Guava tests I used the following digestAndBuildImageEntry overload:

private ImageEntry digestAndBuildImageEntry(Path filePath, HashFunction hashFunc) {
  try {
    String hash = com.google.common.io.Files.hash(filePath.toFile(), hashFunc).toString();
    return new ImageEntry(filePath.toAbsolutePath().toString(), hash);
  } catch (IOException ex) {
    ex.printStackTrace();
    return null;
  }
}

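So we can see that, at least for my test case and without going into any theoretical discussion, Guava's adler32 hash runs almost three times as fast as the original implementation.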
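
One caveat worth keeping in mind when picking the fastest entry: Adler-32 is a 32-bit checksum, so two different files can share a value, and a match is better treated as a hint than as proof. A minimal sketch of that idea follows, using the JDK's own java.util.zip.Adler32 rather than Guava; the class CheapChecksum and its method names are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper: cheap checksum first, byte comparison to confirm.
final class CheapChecksum {

    /** Adler-32 of a file's bytes, updated as the stream is read. */
    static long adler32(Path file) throws IOException {
        try (CheckedInputStream in =
                 new CheckedInputStream(Files.newInputStream(file), new Adler32())) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Reading is enough: CheckedInputStream feeds the checksum.
            }
            return in.getChecksum().getValue();
        }
    }

    /** True only if sizes and checksums match and the bytes are identical. */
    static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b) || adler32(a) != adler32(b)) {
            return false;
        }
        // Loads both files into memory; acceptable for typical photo sizes.
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }
}
```

With this split, the fast checksum narrows the candidates and the full comparison only runs on files that already look identical.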

2 votes

Original page content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/110314
