
SOLUTION: see the EDIT at the bottom of this post.

PROBLEM: I have a directory with a heap of images, named something like below:

  • image001.nef
  • image002.nef
  • image003.nef
  • image003 - 20170609.jpg
  • image004.nef
  • image005.nef
  • image006 - 20170609.nef
  • image007.nef
  • image007 - 20170609.jpg
  • image008.jpg
  • image008 - 20170609.nef

I want to find all images that share a duplicate base name (like imageXXX) AND have the extension JPG.

So from the list above, only three items match the criteria for deletion: image003 - 20170609.jpg, image007 - 20170609.jpg, and image008.jpg.

I have 2,500 images, so a Pythonic way is preferable to going through them manually.

I am having a hard time finding an example script to use; all the ones I have found compare a hash or something similar, which I don't believe is useful here, as the images are similar but not identical.

Cheers

EDIT: thanks to dawg I was able to get the output I desire... here is the final code that worked for me:

import os

directory = r'C:\temp'
out_directory = r'C:\temp\temp_usa_photos'
fns = os.listdir(directory)

# My real file names are fixed-width, so the first 15 characters identify each image
ref_nef = {fn[0:15] for fn in fns if fn.upper().endswith('.NEF')}

print(ref_nef)

# Keep only the JPGs whose base name also appears among the NEF files
out_list = [fn for fn in fns if fn.upper().endswith('.JPG') and fn[0:15] in ref_nef]

print(out_list)

# Move each duplicate JPG into the holding directory
for f in out_list:
    input_file = os.path.join(directory, f)
    output_file = os.path.join(out_directory, f)
    os.rename(input_file, output_file)
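If the file names are not all fixed-width, slicing a fixed number of characters is fragile. A more robust sketch (my assumption: the date suffix, when present, is always separated by " - ") strips the extension with os.path.splitext and then drops the suffix:

```python
import os

def base_name(fn):
    # drop the extension, then anything after the " - " date suffix
    stem = os.path.splitext(fn)[0]
    return stem.split(' - ')[0]

# the sample listing from the question
fns = ['image001.nef', 'image002.nef', 'image003.nef',
       'image003 - 20170609.jpg', 'image004.nef', 'image005.nef',
       'image006 - 20170609.nef', 'image007.nef',
       'image007 - 20170609.jpg', 'image008.jpg',
       'image008 - 20170609.nef']

# base names that have a NEF version
ref_nef = {base_name(fn) for fn in fns if fn.upper().endswith('.NEF')}

# JPGs whose base name also exists as a NEF
jpg_dupes = [fn for fn in fns
             if fn.upper().endswith('.JPG') and base_name(fn) in ref_nef]
print(jpg_dupes)
# ['image003 - 20170609.jpg', 'image007 - 20170609.jpg', 'image008.jpg']
```

This works regardless of how long the imageXXX prefix is, as long as the suffix convention holds.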
  • What have you done so far? Commented Jun 9, 2017 at 6:13
  • You have to delete them only based on the filename? I don't exactly understand what prevents you from looping over all images, extracting base names, writing them to a dict/list and then removing all further duplicates encountered. Commented Jun 9, 2017 at 6:21
  • @moritzg I have just added the code to the original post. Commented Jun 11, 2017 at 4:56

1 Answer


Given:

>>> fns
['image001.nef', 'image002.nef', 'image003.nef', 'image003 - 20170609.jpg', 'image004.nef', 'image005.nef', 'image006 - 20170609.nef', 'image007.nef', 'image007 - 20170609.jpg', 'image008.jpg', 'image008 - 20170609.nef']

(I can use that list as a proxy for a listing of file names; just use glob or os.listdir for real files.)

If your file names are all of the form imageXXX, you can first create a set of the first 8 letters of the .nef file names:

>>> ref_nef={fn[0:8] for fn in fns if fn.upper().endswith('.NEF')}
>>> ref_nef
set(['image008', 'image005', 'image004', 'image007', 'image006', 'image001', 'image003', 'image002'])

Then use that to filter the .jpg files to delete:

>>> filter(lambda e: e[0:8] in ref_nef, [fn for fn in fns if fn.upper().endswith('.JPG')])
['image003 - 20170609.jpg', 'image007 - 20170609.jpg', 'image008.jpg']
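Note that the transcript above is a Python 2 session; in Python 3, filter returns a lazy iterator rather than a list, so wrap it in list() or use a comprehension. A Python 3 equivalent over the same sample list:

```python
# the sample listing from the question
fns = ['image001.nef', 'image002.nef', 'image003.nef',
       'image003 - 20170609.jpg', 'image004.nef', 'image005.nef',
       'image006 - 20170609.nef', 'image007.nef',
       'image007 - 20170609.jpg', 'image008.jpg',
       'image008 - 20170609.nef']

# first 8 letters of every NEF file name
ref_nef = {fn[0:8] for fn in fns if fn.upper().endswith('.NEF')}

# JPGs whose 8-letter prefix matches a NEF
to_delete = [fn for fn in fns
             if fn.upper().endswith('.JPG') and fn[0:8] in ref_nef]
print(to_delete)
# ['image003 - 20170609.jpg', 'image007 - 20170609.jpg', 'image008.jpg']
```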

2 Comments

I am wondering if there is a simple solution to my new issue. Your solution fixed 99% of my problem, but I just found out there are some rogue NEF files. In this screenshot you can see some duplicate NEF files are present, and I would like to rid my folder of all the NEWER NEF files. In this case the top one needs to go; it will have a longer name AND be newer. Can you help with this one? Thanks heaps for your assistance!
If this does 99%, then use this. Afterwards, you can use a duplicate-finding approach where you actually read the files and compare their contents; an MD5 hash is useful for this. Good luck, and ask a new question if you get stuck.
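The MD5 approach suggested above could look something like the following. This is a sketch, not the answerer's code; it assumes the rogue NEF files are byte-identical copies, and it keeps the oldest copy by sorting on modification time so newer duplicates are the ones flagged:

```python
import hashlib
import os

def file_md5(path, chunk=65536):
    # hash in chunks so large NEF files are never fully loaded into memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

def find_exact_duplicates(directory):
    seen = {}    # md5 digest -> first path seen with that content
    dupes = []   # later paths whose bytes match an earlier file
    # sort by modification time so the OLDEST copy is the one kept
    paths = sorted((os.path.join(directory, f) for f in os.listdir(directory)),
                   key=os.path.getmtime)
    for p in paths:
        digest = file_md5(p)
        if digest in seen:
            dupes.append(p)   # newer byte-identical copy
        else:
            seen[digest] = p
    return dupes
```

You could then move or delete each path in the returned list, exactly as in the script in the question's EDIT.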
