
I am trying to iterate through directories and subdirectories to find duplicate files, but the script is giving this error:

Traceback (most recent call last):
  File "./fileDupchknew.py", line 29, in <module>
    dup_fileremove(dirname)
  File "./fileDupchknew.py", line 26, in dup_fileremove
    os.remove(filepath)
OSError: [Errno 21] Is a directory: '/tmp/rishabh-test/new-test'

Script:

#!/usr/bin/python
import os
import hashlib
import sys


dirname = sys.argv[1]
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
        if filehash not in duplicate:
            duplicate.add(filehash)
        else:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)
  • Thanks for providing the stack trace. But what was the actual error message? That should appear on the line just below the stack trace. Commented Aug 10, 2015 at 12:07
  • Is def dup_fileremove(dir): commented out in your original code, or is that a transcription error from writing this post? Commented Aug 10, 2015 at 12:09
  • What's up with your multiple import statements being all on one line? Commented Aug 10, 2015 at 12:10
  • My bad, complete stack trace: Traceback (most recent call last): File "./fileDupchknew.py", line 28, in <module> dup_fileremove(dirname) File "./fileDupchknew.py", line 25, in dup_fileremove os.remove(filepath) OSError: [Errno 21] Is a directory: '/tmp/rishabh-test/new-test' Commented Aug 10, 2015 at 12:13
  • BTW, dir is not a good variable name since it's the name of a built-in function. Commented Aug 10, 2015 at 12:15

2 Answers


Since you do not want to delete directories (as can be seen from the comments on the question) -

No, I don't want to delete directories

If that is the case, your issue occurs because you never create a filehash for directories. filehash is initialized to None and is only assigned for regular files, so the first directory encountered adds None to the duplicate set. For every subsequent directory, None is already present in the set, so the code falls through to os.remove() on a directory path, causing the error.
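The mechanics can be reproduced in isolation (a minimal sketch of the logic, not the full script):

```python
# Minimal repro of the bug: directories never get a hash, so
# filehash stays None; the second None looks like a "duplicate".
duplicate = set()
removed = []
for name, filehash in [("dir-a", None), ("f.txt", "abc123"), ("dir-b", None)]:
    if filehash not in duplicate:
        duplicate.add(filehash)   # first directory adds None to the set
    else:
        removed.append(name)      # second directory lands in this branch

print(removed)  # ['dir-b'] -- a directory would be "removed"
```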

A simple fix would be to check whether filehash is None before adding it to the set or trying to remove the file. Example -

#!/usr/bin/python
import os 
import hashlib
import sys


dirname = sys.argv[1] 
os.chdir(dirname)

def dup_fileremove(dir):
    duplicate = set()
    os.chdir(dir)
    path=os.getcwd()
    print ("The dir is: ", path)
    for filename in os.listdir(dir):
        filehash = None
        filepath=os.path.join(dir, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
        if filehash is not None and filehash not in duplicate:
            duplicate.add(filehash)
        elif filehash is not None:
            os.remove(filepath)
            print("removed : ", filepath)

dup_fileremove(dirname)

3 Comments

It's working as expected. I really want to thank you for this.
I am glad I could be helpful :) .
In addition to this, I also updated my script to check for duplication across different subdirectories, e.g. if a file is in the a-test directory and the same file is in the b-test directory within /tmp, it will remove one of them and the redundancy is eliminated. This is achieved by declaring the set outside the function.
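The cross-directory variant the commenter describes could be sketched like this (a hypothetical version, assuming a module-level set named seen and using open() rather than file()):

```python
import hashlib
import os

seen = set()  # declared outside the function, so hashes are shared
              # across sibling directories (e.g. a-test vs b-test)

def dup_fileremove(dirname):
    for filename in os.listdir(dirname):
        filepath = os.path.join(dirname, filename)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)          # recurse into subdirectory
        elif os.path.isfile(filepath):
            with open(filepath, 'rb') as f:   # open(), not file()
                filehash = hashlib.md5(f.read()).hexdigest()
            if filehash in seen:
                os.remove(filepath)           # duplicate anywhere in the tree
                print("removed:", filepath)
            else:
                seen.add(filehash)
```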

You're actually lucky you got that error message, otherwise your code would have deleted directories!

The problem is that after control returns from the recursive call to

dup_fileremove(filepath)

it then continues on to

if filehash not in duplicate:

You don't want that!

A simple way to fix it is to put a continue statement after dup_fileremove(filepath).

But a much better fix is to indent the if filehash not in duplicate: block so that it's aligned with the filehash = hashlib.md5(file(filepath).read()).hexdigest() line.

For example:

#!/usr/bin/python
import os 
import hashlib
import sys

def dup_fileremove(dirname):
    duplicate = set()
    os.chdir(dirname)
    path=os.getcwd()
    print ("The dirname is: ", path)
    for filename in os.listdir(dirname):
        filehash = None
        filepath=os.path.join(dirname, filename)
        print("Current file path is: ", filepath)
        if os.path.isdir(filepath):
            dup_fileremove(filepath)
        elif os.path.isfile(filepath):
            filehash =hashlib.md5(file(filepath).read()).hexdigest()
            if filehash not in duplicate:
                duplicate.add(filehash)
            else:
                os.remove(filepath)
                print("removed : ", filepath)

dirname = sys.argv[1] 
os.chdir(dirname)

dup_fileremove(dirname)

I haven't tested this modified version of your code. It looks ok, but I make no guarantees. :)

BTW, it is recommended not to use the file() class directly to open files. In Python 3, file() no longer exists, but even in Python 2 the docs have recommended the use of the open() function since at least Python 2.5, if not earlier.
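As a sketch of the recommended approach (md5_file is a hypothetical helper name, not from the original script), using open() in binary mode and feeding the hash in chunks also avoids reading the whole file into memory at once:

```python
import hashlib

def md5_file(filepath, chunk_size=65536):
    """Hash a file incrementally, without loading it all into memory."""
    h = hashlib.md5()
    with open(filepath, 'rb') as f:       # open(), not file()
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```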

