I've written a small utility program that identifies duplicate tracks in iTunes. The actual matching of tracks takes a long time, and I'd like to optimize it. I am storing track data in an NSMutableDictionary that stores individual track data in NSMutableDictionaries keyed by trackID. These individual track dictionaries have at least the following keys:
- TrackID
- Name
- Artist
- Duration (in milli ####.####)
To determine if any tracks match one another, I must check:
- If the duration of two tracks are within 5 seconds of each other
- Name matches
- Artist matches
The slow way for me to do it is using two for-loops:
-(void)findDuplicateTracks {
NSArray *allTracks = [tracks allValues];
BOOL isMatch = NO;
int numMatches = 0;
// outer loop
NSMutableDictionary *track = nil;
NSMutableDictionary *otherTrack = nil;
for (int i = 0; i < [allTracks count]; i++) {
track = [allTracks objectAtIndex:i];
NSDictionary *summary = nil;
if (![claimedTracks containsObject:track]) {
NSAutoreleasePool *aPool = [[NSAutoreleasePool alloc] init];
NSUInteger duration1 = (NSUInteger) [track objectForKey:kTotalTime];
NSString *nName = [track objectForKey:knName];
NSString *nArtist = [track objectForKey:knArtist];
// inner loop - no need to check tracks that have
// already appeared in i
for (int j = i + 1; j < [allTracks count]; j++) {
otherTrack = [allTracks objectAtIndex:j];
if (![claimedTracks containsObject:otherTrack]) {
NSUInteger duration2 = (NSUInteger)[otherTrack objectForKey:kTotalTime];
// duration check
isMatch = (abs(duration1 - duration2) < kDurationThreshold);
// match name
if (isMatch) {
NSString *onName = [otherTrack objectForKey:knName];
isMatch = [nName isEqualToString:onName];
}
// match artist
if (isMatch) {
NSString *onArtist = [otherTrack objectForKey:knArtist];
isMatch = [nArtist isEqualToString:onArtist];
}
// save match data
if (isMatch) {
++numMatches;
// claim both tracks
[claimedTracks addObject:track];
[claimedTracks addObject:otherTrack];
if (![summary isMemberOfClass:[NSDictionary class]]) {
[track setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];
summary = [self dictionarySummaryForTrack:track];
}
[otherTrack setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];
[[summary objectForKey:kMatches]
addObject:otherTrack];
}
}
}
[aPool drain];
}
}
}
This becomes quite slow for large music libraries, and only uses 1 processor. One recommended optimization was to use blocks and process the tracks in batches (of 100 tracks). I tried that. If my code originally took 9 hours to run, it now takes about 2 hours on a quad-core. That's still too slow. But (talking above my pay grade here) perhaps there is a way to store all the values I need in a C structure that "fits on the stack" and then I wouldn't have to fetch the values from slower memory. This seems too low-level for me, but I'm willing to learn if I had an example.
BTW, I profiled this in Instruments and [NSCFSet member:] takes up
86.6% percent of the CPU time.
Then I thought I should extract all the durations into a sorted array so I would not have to look up the duration value in the dictionary. I think that is a good idea, but when I started to implement it, I wondered how to determine the best batch size.
If I have the following durations:
2 2 3 4 5 6 6 16 17 38 59 Duration
0 1 2 3 4 5 6 7 8 9 10 Index
Then just by iterating over the array, I know that to find matching tracks of the song at index 0, I only need to compare it against songs up to index 6. That's great, I have my first batch. But now I have to start over at index 1 only to find that it's batch should also stop at index 6 and exclude index 0. I'm assuming I'm wasting a lot of processing cycles here determining what the batch should be/the duration matches. This seemed like a "set" problem, but we didn't do much of that in my Intro to Algorithms class.
My questions are:
1) What is the most efficient way to identify matching tracks? Is it something similar to what's above? Is it using disjoint and [unified] set operations that are slightly above my knowledge level? Is it filtering arrays using NSArray? Is there an online resource that describes this problem and solution?
I am willing to restructure the tracks dictionary in whatever way (datastructure) is most efficient. I had at first thought I needed to perform many lookups by TrackID, but that is no longer the case.
2) Is there a more efficient way to approach this problem? How do you rock stars go from paragraph 1 to an optimized solution?
I have searched for the answer, longer than I care to admit, and found these interesting, but unhelpful answers:
Find all duplicates and missing values in a sorted array
Thanks for any help you can provide, Lance