
I wrote a file duplication processor which gets the MD5 hash of each file, adds it to a hashmap, then takes all of the files with the same hash and adds them to a hashmap called dupeList. But while scanning large directories such as C:\Program Files\ it throws the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Unknown Source)
at java.nio.file.Files.readAllBytes(Unknown Source)
at com.embah.FileDupe.Utils.FileUtils.getMD5Hash(FileUtils.java:14)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:43)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:68)
at ImgHandler.main(ImgHandler.java:14)

I'm sure it's due to the fact that it handles so many files, but I'm not sure of a better way to handle it. I'm trying to get this working so I can sift through all my kids' baby pictures and remove duplicates before I put them on my external hard drive for long-term storage. Thanks everyone for the help!

My code:

public class FileUtils {
public static String getMD5Hash(String path){
    try {
        byte[] bytes = Files.readAllBytes(Paths.get(path)); //LINE STACK THROWS ERROR
        byte[] hash = MessageDigest.getInstance("MD5").digest(bytes);
        bytes = null;
        String hexHash = DatatypeConverter.printHexBinary(hash);
        hash = null;
        return hexHash;
    } catch(Exception e){
        System.out.println("Having problem with file: " + path);
        return null;
    }
}
}

public class FileDupe {
public static Map<String, List<String>> getDuplicateFiles(String dirs){
    Map<String, List<String>> allEntrys = new HashMap<>(); //<hash, file loc>
    Map<String, List<String>> dupeEntrys = new HashMap<>();
    File fileDir = new File(dirs);
    if(fileDir.isDirectory()){
        ArrayList<File> nestedFiles = getNestedFiles(fileDir.listFiles());
        File[] fileList = new File[nestedFiles.size()];
        fileList = nestedFiles.toArray(fileList);

        for(File file:fileList){
            String path = file.getAbsolutePath();
            String hash = "";
            if((hash = FileUtils.getMD5Hash(path)) == null)
                continue;
            if(!allEntrys.containsValue(path))
                put(allEntrys, hash, path);
        }
        fileList = null;
    }
    allEntrys.forEach((hash, locs) -> {
        if(locs.size() > 1){
            dupeEntrys.put(hash, locs);
        }
    });
    allEntrys = null;
    return dupeEntrys;
}

public static Map<String, List<String>> getDuplicateFiles(String... dirs){
    ArrayList<Map<String, List<String>>> maps = new ArrayList<Map<String, List<String>>>();
    Map<String, List<String>> dupeMap = new HashMap<>();
    for(String dir : dirs){ //Get all dupe files
        maps.add(getDuplicateFiles(dir));
    }
    for(Map<String, List<String>> map : maps){ //merge each directory's dupe map into the combined map
        dupeMap.putAll(map);
    }
    return dupeMap;
}

protected static ArrayList<File> getNestedFiles(File[] fileDir){
    ArrayList<File> files = new ArrayList<File>();
    return getNestedFiles(fileDir, files);
}

protected static ArrayList<File> getNestedFiles(File[] fileDir, ArrayList<File> allFiles){
    for(File file:fileDir){
        if(file.isDirectory()){
            getNestedFiles(file.listFiles(), allFiles);
        } else {
            allFiles.add(file);
        }
    }
    return allFiles;
}

protected static <KEY, VALUE> void put(Map<KEY, List<VALUE>> map, KEY key, VALUE value) {
    map.compute(key, (s, strings) -> strings == null ? new ArrayList<>() : strings).add(value);
}
}


public class ImgHandler {
private static Scanner s = new Scanner(System.in);

public static void main(String[] args){
    System.out.print("Please enter locations to scan for duplicates\nSeparate locations via semi-colon(;)\nLocations: ");
    String[] locList = s.nextLine().split(";");
    Map<String, List<String>> dupes = FileDupe.getDuplicateFiles(locList);
    System.out.println(dupes.size() + " dupes detected!");
    dupes.forEach((hash, locs) -> {
        System.out.println("Hash: " + hash);
        locs.forEach((loc) -> System.out.println("\tLocation: " + loc));
    });
}
}
4 Comments
  • a) Why don't you increase your heap settings? b) Your data structure seems a little complex - a map of a list of a map. Commented Feb 14, 2018 at 0:31
  • Why not read each file one at a time, calculate the hash for that file, then move on to the next file? Commented Feb 14, 2018 at 0:31
  • Your md5hash method has very poor performance. There is no need to read the entire file into memory (which causes out of memory if the file is very large). You can read blocks of, say, 8192 bytes at a time and call the update method on the digest object (see the sketch below these comments). Commented Feb 14, 2018 at 0:51
  • I did try increasing my heap but I still received the error, just much later. And raz, that is what I'm doing: I'm calculating the hash of each file, then adding the file to a map. The map just contains the hash, then a list of the locations of the files associated with that hash. Commented Feb 14, 2018 at 0:55
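
A minimal sketch of what that last suggestion looks like in practice, reading the file in 8 KiB blocks and calling update on the digest so the whole file never has to fit in memory (the class and method names here are illustrative; DatatypeConverter is used for hex output as in the question's code and needs the JAXB dependency on Java 11+):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.xml.bind.DatatypeConverter;

public class ChunkedMD5 {
    // Reads the file in 8 KiB blocks and feeds each block to the digest,
    // so only one small buffer is ever held in memory.
    public static String md5Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buf = new byte[8192];
            int len;
            while ((len = in.read(buf)) != -1) {
                md.update(buf, 0, len); // hash only the bytes actually read
            }
        }
        return DatatypeConverter.printHexBinary(md.digest());
    }
}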

5 Answers


Reading the entire file into a byte array not only requires sufficient heap space, it's also limited to file sizes up to Integer.MAX_VALUE in principle (the practical limit for the HotSpot JVM is even a few bytes smaller).

The best solution is not to load the data into the heap memory at all:

public static String getMD5Hash(String path) {
    MessageDigest md;
    try { md = MessageDigest.getInstance("MD5"); }
    catch(NoSuchAlgorithmException ex) {
        System.out.println("FileUtils.getMD5Hash(): "+ex);
        return null;// TODO better error handling
    }
    try(FileChannel fch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        for(long pos = 0, rem = fch.size(), chunk; rem>pos; pos+=chunk) {
            chunk = Math.min(Integer.MAX_VALUE, rem-pos);
            md.update(fch.map(FileChannel.MapMode.READ_ONLY, pos, chunk));
        }
    } catch(IOException e){
        System.out.println("Having problem with file: " + path);
        return null;// TODO better error handling
    }
    return String.format("%032X", new BigInteger(1, md.digest()));
}

If the underlying MessageDigest implementation is a pure Java implementation, it will transfer data from the direct buffer to the heap, but that’s outside your responsibility then (and it will be a reasonable trade-off between consumed heap memory and performance).

The method above will handle files larger than 2 GiB without problems.


1 Comment

It requires API level >= 26.

Whatever implementation FileUtils has, it is trying to read whole files into memory to calculate the hash. This is not necessary: the calculation can be done by reading the content in smaller chunks. In fact it is rather poor design to require the whole file at once instead of simply reading the chunks that are needed (64 bytes?). So maybe you need to use a better library.
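
If you would rather not write the chunking loop against the digest yourself, the JDK's own DigestInputStream updates a MessageDigest as the stream is read. A minimal sketch under that assumption (class and method names are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingMD5 {
    public static String md5Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // The DigestInputStream feeds every byte that passes through it into md.
        try (InputStream in = new DigestInputStream(Files.newInputStream(Paths.get(path)), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // nothing to do here; reading is enough to update the digest
            }
        }
        return String.format("%032x", new BigInteger(1, md.digest()));
    }
}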



You have a few options:

  1. Don't read all of the bytes at once; use a BufferedInputStream and read a block of bytes at a time, but never the whole file:

    try (BufferedInputStream fileInputStream = new BufferedInputStream(
            Files.newInputStream(Paths.get("your_file_here"), StandardOpenOption.READ))) {

        byte[] buf = new byte[2048];
        int len;
        while ((len = fileInputStream.read(buf)) != -1) {
            // Add these bytes to your hash calculation; read() may return
            // fewer than 2048 bytes even before the end of the file, so
            // only the first len bytes of buf are valid.
            doSomethingWithBytes(buf, len);
        }

    } catch (IOException ex) {
        ex.printStackTrace();
    }
    
    
  2. Use C/C++ for this kind of task (though that is less safe, because you will manage the memory yourself).

2 Comments

Might just use your first option with a table lookup function; I'll just have to do some byte manipulation to keep it fast.
There is no sense in using a BufferedInputStream when you have your own (large enough) buffer array at the same time. All you achieve with that is forcing an unnecessary copy from the BufferedInputStream's array into your array. When you have a sufficiently sized array, just read from the source InputStream directly.

Consider using Guava :

    private final static HashFunction HASH_FUNCTION = Hashing.goodFastHash(32);

    // somewhere later

    final HashCode hash = Files.asByteSource(file).hash(HASH_FUNCTION);

Guava will buffer the reading of the file for you.
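
For completeness, a self-contained sketch of this approach (it assumes Guava is on the classpath; note that goodFastHash is seeded per JVM run, so its values are only comparable within a single run of the dedup tool, whereas Hashing.md5() would give stable hashes across runs):

import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;

import java.io.File;
import java.io.IOException;

public class GuavaHashExample {
    private static final HashFunction HASH_FUNCTION = Hashing.goodFastHash(32);

    // Files.asByteSource streams the file in buffered chunks,
    // so the whole file is never loaded into memory at once.
    static String hashOf(File file) throws IOException {
        HashCode hash = Files.asByteSource(file).hash(HASH_FUNCTION);
        return hash.toString(); // lower-case hex string
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hashOf(new File(args[0])));
    }
}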

1 Comment

Note: I've built a file deduplicator using this method, you can read it here: github.com/amir650/FileDeduplicator/tree/master/src

I had this Java heap space error on my Windows machine and I spent weeks searching online for a solution. I tried increasing my -Xmx value, but with no success. I even tried running my Spring Boot app with a parameter to increase the heap size at run time, with a command like the one below:

mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Xms2048m -Xmx4096m"

Still no success, until I figured out I was running a 32-bit JDK, which has a limited maximum heap size. I had to uninstall the 32-bit JDK and install the 64-bit one, which solved the issue for me. I hope this helps someone with a similar issue.
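
If you are not sure which JVM you are actually running, a quick check is to print a couple of system properties; the sun.arch.data.model property is HotSpot-specific, os.arch is standard, and java -version also reports "64-Bit" for a 64-bit VM:

public class JvmBitness {
    public static void main(String[] args) {
        // Typically prints "32" or "64" on HotSpot JVMs.
        System.out.println("sun.arch.data.model = " + System.getProperty("sun.arch.data.model"));
        // Standard property: "x86" usually means a 32-bit JVM, "amd64" or "x86_64" a 64-bit one.
        System.out.println("os.arch = " + System.getProperty("os.arch"));
    }
}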

