Duplicate File Detection

Any collection of digital media is likely to contain some duplicates, and the time and effort spent finding them makes duplicate detection a high priority for any digital asset manager.

File name matching

The most basic approach, filename matching, looks for two files with the same filename and flags them as duplicates. This method is unreliable, because many unrelated digital files share a filename (e.g. DSC0001.JPG is the first filename produced by a digital camera fresh from the factory, and your staff may have several such cameras). It also misses the many duplicates that arise when copies of the same file are saved under different filenames (for example, DSC0001.JPG might later be saved as Example.jpg).
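As a rough illustration of filename matching, the following Python sketch walks a folder and reports every filename that occurs more than once. The function name group_by_filename and the choice to ignore letter case are assumptions made for this example, not part of any particular asset management tool.

```python
import os
from collections import defaultdict


def group_by_filename(root):
    """Return {filename: [paths]} for filenames that occur more than once under root."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare names case-insensitively (an assumption for this sketch).
            groups[name.lower()].append(os.path.join(dirpath, name))
    # Keep only the names shared by two or more files.
    return {name: paths for name, paths in groups.items() if len(paths) > 1}
```

Two different photographs that both happen to be called DSC0001.JPG would be reported here, while a true copy renamed to Example.jpg would be missed, which is exactly the weakness described above.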


Hash checking

If two identical files are saved with different filenames, they can still be identified as duplicates by comparing a distinctive signature of their contents, known as a hash. A hash is easily obtained using MD5 (a widely available hashing algorithm) or a comparable method such as SHA-1 or a CRC checksum.

A hash is a "message digest" of a file: a very short string of fixed length, which is easy to compare against the hash of another file. A hash looks like this:

4b62753267da6995182dec1b7ff523a0

Although the hash is far shorter than the file itself, it is highly unlikely that two non-identical files will have the same hash, so it provides a very easy way to test for duplicate data. If two files have the same hash, they are almost certainly duplicates.
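To make the idea of a short digest concrete, here is a minimal Python sketch using the standard hashlib module. The two byte strings are invented for the example; the point is only that inputs of very different sizes both produce a 32-character MD5 digest.

```python
import hashlib

small = b"a short message"
large = b"x" * 1_000_000  # about 1 MB of data, invented for this example

small_digest = hashlib.md5(small).hexdigest()
large_digest = hashlib.md5(large).hexdigest()

# Both digests are 32 hexadecimal characters, whatever the input size.
print(small_digest, len(small_digest))
print(large_digest, len(large_digest))
```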

If you are using an Apple Mac, you can try this from the Terminal (open Applications > Utilities > Terminal). In this test, there are two files on the Desktop named DSC0001.JPG and Example.JPG.

  • macpro:Desktop user$ md5 DSC0001.JPG
    MD5 (DSC0001.JPG) = e8735480f03aeecfa21aec49e8f95d0a
  • macpro:Desktop user$ md5 Example.JPG
    MD5 (Example.JPG) = e8735480f03aeecfa21aec49e8f95d0a

As these two files result in the same hash, they are identical - even though the filenames are different.
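The same comparison can be automated across a whole folder. The sketch below is one possible approach in Python, using the standard hashlib module; the function names md5_of_file and find_duplicates are hypothetical, chosen for this example. It reads each file in chunks so that large media files do not need to fit in memory.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents share the same MD5 hash."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[md5_of_file(path)].append(path)
    # Only hashes shared by two or more files indicate duplicates.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling find_duplicates on the Desktop folder from the test above would be expected to return one group containing both DSC0001.JPG and Example.JPG.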

Tip: using a hash check is better than using a file size check, as two files can have the same file size in many different circumstances, but not be duplicates.
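A small Python sketch of why size alone is not enough: the two byte strings below are invented stand-ins for files that have the same size but different contents.

```python
import hashlib

# Two invented "files" of exactly 1024 bytes each, with different contents.
file_a = b"\x00" * 1024
file_b = b"\x01" * 1024

same_size = len(file_a) == len(file_b)
same_hash = hashlib.md5(file_a).digest() == hashlib.md5(file_b).digest()

# A size check would wrongly flag these as duplicates; the hash check does not.
print(same_size, same_hash)  # True False
```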

Unfortunately, there are some situations where using a hash is still insufficient. A hash check only detects files that are identical byte for byte. Each of the following changes produces a file with a different hash, so the result would no longer be detected as a duplicate of the original:

  • Adding metadata to the file
  • Changing the size of the image (even if only very slightly)
  • Saving the file in a different format (e.g. TIFF to JPEG)
  • Cropping or scaling the image
  • Making adjustments in Photoshop, however slight
  • Saving the image repeatedly as JPEG (resulting in compound compression losses)

Because of these concerns, hash checking has limited practical applications.
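The sensitivity of a hash to even a tiny change can be illustrated with a short Python sketch. The byte strings here are invented placeholders rather than real image data, and the appended caption is a hypothetical stand-in for metadata added to a file.

```python
import hashlib

# Placeholder bytes standing in for the contents of an image file.
original = b"pretend these bytes are the pixel data of DSC0001.JPG"
# The same content with a few extra bytes appended (hypothetical metadata).
with_metadata = original + b" caption=Harbour at dawn"

original_hash = hashlib.md5(original).hexdigest()
edited_hash = hashlib.md5(with_metadata).hexdigest()

# The two digests differ, so a hash check treats these as unrelated files.
print(original_hash)
print(edited_hash)
print(original_hash == edited_hash)
```

To a person the two versions are the same picture, but a hash comparison has no notion of visual similarity, only of exact byte equality.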