How do you test whether two large video files are the same?

2022-01-25 00:00:00 compare md5 video java

I have a system where video files are ingested and then multiple CPU intensive tasks are started. As these tasks are computationally expensive I would like to skip processing a file if it has already been processed.

Videos come from various sources so file names etc are not viable options.

If I was using pictures I would compare the MD5 hash but on a 5GB - 40GB video this can take a long time to compute.

To compare the 2 videos I am testing this method:

  • check relevant metadata matches
  • check length of file with ffmpeg / ffprobe
  • use ffmpeg to extract frames at 100 predefined timestamps [1-100]
  • create MD5 hashes of each of those frames
  • compare the MD5 hashes to check for a match
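The hashing and comparison steps of the method above can be sketched in Java. The ffmpeg invocation itself is only shown as a comment; the byte arrays stand in for extracted frame images, and the class and method names are illustrative:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;

public class FrameHashes {
    // MD5 of one extracted frame's bytes, e.g. the PNG that
    // "ffmpeg -ss <timestamp> -i input.mp4 -frames:v 1 frame.png" wrote
    static String md5(byte[] frameBytes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return HexFormat.of().formatHex(md.digest(frameBytes));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always present in the JDK
        }
    }

    // Two videos are treated as a match only if every sampled frame hash agrees
    static boolean sameFingerprint(List<String> hashesA, List<String> hashesB) {
        return hashesA.equals(hashesB);
    }

    public static void main(String[] args) {
        // In the real pipeline these byte arrays come from the extracted frames
        byte[] frameA = new byte[] {1, 2, 3};
        byte[] frameB = new byte[] {1, 2, 3};
        System.out.println(FrameHashes.md5(frameA).equals(FrameHashes.md5(frameB))); // true
    }
}
```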

Does anyone know a more efficient way of doing this? Or a better way to approach the problem?

Solution

First, you need to properly define under which conditions two video files are considered the same. Do you mean exactly identical, as in byte-for-byte? Or identical in content? If the latter, you need to define a proper comparison method for that content.

I'm assuming the first (exactly identical files). This is independent of what the files actually contain. When you receive a file, always build a hash for it and store the hash along with the file.
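Building the hash at ingest can use a streaming digest so a 40GB file never has to fit in memory. A minimal sketch (the class name and buffer size are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class IngestHash {
    // Stream the file through the digest in chunks; constant memory regardless of file size
    static String sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 20]; // 1 MiB chunks
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```

The returned hex string is what you would store alongside the file for the later lookup.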

Checking for duplicates then is a multi-step process:

1.) Compare hashes: if you find no matching hash, the file is new. For a new file you can usually expect this step to be the only one needed; a good hash (SHA-1 or stronger) will have few collisions for any practical number of files.

2.) If you found other files with the same hash, check file length. If they don't match, the file is new.

3.) If both hash and file length match, you have to compare the entire file contents, stopping when you find the first difference. If the full comparison finds no difference, the file is the same.
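The three steps can be sketched as a single lookup routine. This assumes the hash index is an in-memory map from hash to known file paths (in practice it could be a DB table); the class name is hypothetical, and the full byte comparison uses `Files.mismatch` (Java 12+), which stops at the first differing byte:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

public class DuplicateCheck {
    static boolean isDuplicate(Path candidate, String candidateHash,
                               Map<String, List<Path>> hashIndex) throws IOException {
        List<Path> matches = hashIndex.get(candidateHash);
        if (matches == null) {
            return false;                                   // step 1: no hash match -> new file
        }
        for (Path known : matches) {
            if (Files.size(known) != Files.size(candidate)) {
                continue;                                   // step 2: lengths differ -> not this one
            }
            if (Files.mismatch(candidate, known) == -1L) {
                return true;                                // step 3: byte-for-byte identical
            }
        }
        return false;
    }
}
```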

In the worst case (files are identical) this should take no longer than the raw IO speed for reading the two files. In the best case (hashes differ) the test will only take as much time as the hash lookup (in a DB or HashMap or whatever you use).

EDIT: You are concerned about the IO needed to build the hash. You can partially avoid it by comparing file lengths first and skipping everything whose file length is unique. On the other hand, you then also need to keep track of which files you have already hashed. This would allow you to defer building the hash until you really need it: when a hash is missing, you can skip directly to comparing the two files, building the hashes in the same pass. It's a lot more state to keep track of, but it may be worth it depending on your scenario (you need solid data on how often duplicate files occur and their average size distribution to make that decision).
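The length-first part of that deferred scheme could look like the following sketch. The class and method names are hypothetical; the lazy hashing and compare-while-hashing pass are left out:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LengthIndex {
    // file length -> files already seen with that length
    private final Map<Long, List<Path>> byLength = new HashMap<>();

    /** Registers the file and returns the previously seen files it must be
     *  compared against; an empty list means the length alone proves it is new,
     *  so hashing can be deferred entirely. */
    List<Path> candidates(Path file) throws IOException {
        long len = Files.size(file);
        List<Path> sameLength = byLength.computeIfAbsent(len, k -> new ArrayList<>());
        List<Path> result = List.copyOf(sameLength);
        sameLength.add(file);
        return result;
    }
}
```

Only when `candidates` returns a non-empty list do you pay for any hashing or byte comparison, which is exactly the trade-off described above.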
