
I am trying to iterate through directories and subdirectories to find duplicate files, but the script is giving this error:

Traceback (most recent call last):
  File "./fileDupchknew.py", line 29, in <module>
    dup_fileremove(dirname)
  File "./fileDupchknew.py", line 26, in dup_fileremove
    os.remove(filepath)
OSError: [Errno 21] Is a directory: '/tmp/rishabh-test/new-test'

Script:

#!/usr/bin/python
import os
import hashlib
import sys


dirname = sys.argv[1]
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
        if filehash not in duplicate:
            duplicate.add(filehash)
        else:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)
  • Thanks for providing the stack trace. But what was the actual error message? That should appear on the line just below the stack trace. Commented Aug 10, 2015 at 12:07
  • Is def dup_fileremove(dir): commented out in your original code, or is that a transcription error from writing this post? Commented Aug 10, 2015 at 12:09
  • What's up with your multiple import statements being all on one line? Commented Aug 10, 2015 at 12:10
  • My bad, complete stack trace: Traceback (most recent call last): File "./fileDupchknew.py", line 28, in <module> dup_fileremove(dirname) File "./fileDupchknew.py", line 25, in dup_fileremove os.remove(filepath) OSError: [Errno 21] Is a directory: '/tmp/rishabh-test/new-test' Commented Aug 10, 2015 at 12:13
  • BTW, dir is not a good variable name since it's the name of a built-in function. Commented Aug 10, 2015 at 12:15

2 Answers


Since you do not want to delete directories (as can be seen from the comments on the question) -

No, I don't want to delete directories

If that is the case, your issue occurs because you never create a filehash for directories. filehash is initialized to None and is only assigned for regular files, so the first directory encountered adds None to the duplicate set. For every subsequent directory, None is already present in the set, so the code falls through to os.remove() on a directory path, causing the error.
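The mechanics can be reproduced in isolation (a minimal sketch of the logic, not the full script):

```python
# Minimal repro of the bug: directories never get a hash, so
# filehash stays None; the second None looks like a "duplicate".
duplicate = set()
removed = []
for name, filehash in [("dir-a", None), ("f.txt", "abc123"), ("dir-b", None)]:
    if filehash not in duplicate:
        duplicate.add(filehash)   # first directory adds None to the set
    else:
        removed.append(name)      # second directory lands in this branch

print(removed)  # ['dir-b'] -- a directory would be "removed"
```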

A simple fix would be to check whether filehash is None before adding it to the set or trying to remove the file. Example -

#!/usr/bin/python
import os 
import hashlib
import sys


dirname = sys.argv[1] 
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
        if filehash is not None and filehash not in duplicate:
            duplicate.add(filehash)
        elif filehash is not None:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)

3 Comments

It's working as expected. I really want to thank you for this.
I am glad I could be helpful :) .
In addition to this, I also updated my script to check for duplication across different subdirectories, e.g. if a file is in the a-test directory and the same file is in the b-test directory within /tmp, it will remove one of them and the redundancy is eliminated. This is achieved by declaring the set outside the function.
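The cross-directory variant the commenter describes could be sketched like this (a hypothetical version, assuming a module-level set named seen and using open() rather than file()):

```python
import hashlib
import os

seen = set()  # declared outside the function, so hashes are shared
              # across sibling directories (e.g. a-test vs b-test)

def dup_fileremove(dirname):
    for filename in os.listdir(dirname):
        filepath = os.path.join(dirname, filename)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)          # recurse into subdirectory
        elif os.path.isfile(filepath):
            with open(filepath, 'rb') as f:   # open(), not file()
                filehash = hashlib.md5(f.read()).hexdigest()
            if filehash in seen:
                os.remove(filepath)           # duplicate anywhere in the tree
                print("removed:", filepath)
            else:
                seen.add(filehash)
```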

You're actually lucky you got that error message, otherwise your code would have deleted directories!

The problem is that after control returns from the recursive call to

dup_fileremove(filepath)

it then continues on to

if filehash not in duplicate:

You don't want that!

A simple way to fix it is to put a continue statement after dup_fileremove(filepath).

But a much better fix is to indent the if filehash not in duplicate: block so that it's aligned with the filehash = hashlib.md5(file(filepath).read()).hexdigest() line.

For example:

#!/usr/bin/python
import os 
import hashlib
import sys

def dup_fileremove(dirname):
    duplicate = set()
    os.chdir(dirname)
    path=os.getcwd()
    print ("The dirname is: ", path)
    for filename in os.listdir(dirname):
        filehash = None
        filepath=os.path.join(dirname, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
            if filehash not in duplicate:
                duplicate.add(filehash)
            else:
                os.remove(filepath)
                print("removed : ", filepath)

dirname = sys.argv[1] 
os.chdir(dirname)

dup_fileremove(dirname)

I haven't tested this modified version of your code. It looks ok, but I make no guarantees. :)

BTW, it is recommended not to use the file() class directly to open files. In Python 3, file() no longer exists, but even in Python 2 the docs have recommended the use of the open() function since at least Python 2.5, if not earlier.
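As a sketch of the recommended approach (md5_file is a hypothetical helper name, not from the original script), using open() in binary mode and feeding the hash in chunks also avoids reading the whole file into memory at once:

```python
import hashlib

def md5_file(filepath, chunk_size=65536):
    """Hash a file incrementally, without loading it all into memory."""
    h = hashlib.md5()
    with open(filepath, 'rb') as f:       # open(), not file()
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```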

