
I'm trying to deduplicate a list of names, and to do this I'm using the fuzzywuzzy library.

I run two nested for loops, both over all the names. If two names have a fuzzy match score between 90 and 100 (exclusive), I rewrite the second name with the first.
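As a minimal, self-contained sketch of that matching loop (using the stdlib difflib.SequenceMatcher as a stand-in for fuzz.ratio, so it runs without fuzzywuzzy installed; the names are borrowed from the sample below, with one artificial near-duplicate):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # stand-in for fuzzywuzzy's fuzz.ratio, on the same 0-100 scale
    return 100 * SequenceMatcher(None, a, b).ratio()

# one near-duplicate pair plus one unrelated name
names = ["VICTOR MORENO MORENO", "VICTOR MORENO MOREN", "IVAN ERNESTO SANCHEZ URROZ"]

for i in range(len(names)):
    for j in range(len(names)):
        if 90 < ratio(names[j], names[i]) < 100:
            names[j] = names[i]  # rewrite the close match with the canonical name

# the near-duplicate collapses, leaving 2 distinct names
print(len(set(names)))
```

Because the list is modified in place, the second pass sees a score of exactly 100 for the already-rewritten pair and skips it, just like the DataFrame version below.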

Here is an example of my dataset, data.

                              nombre
0               VICTOR MORENO MORENO
1         SERGIO HERNANDEZ GUTIERREZ
2       FRANCISCO JAVIER MUÑOZ LOPEZ
3     JUAN RAYMUNDO MORALES MARTINEZ
4         IVAN ERNESTO SANCHEZ URROZ

And here is my function:

from fuzzywuzzy import fuzz

def fuzz_analisis0(top_names):
    for name2 in top_names["nombre"]:
        for name in top_names["nombre"]:
            if 90 < fuzz.ratio(name, name2) < 100:
                top_names[top_names["nombre"] == name] = name2

When I run this with:

fuzz_analisis0(data)

Everything works fine. Here is an output that shows how it works.

print(len(data))
# 1400

data = data.drop_duplicates()
print(len(data))
# 1256

But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:

cores = mp.cpu_count()
df_split = np.array_split(data, cores, axis=0)
pool = Pool(cores)
df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
pool.close()
pool.join()
pool.clear()

The function ends faster than expected and does not find any duplicates.

print(len(data))
# 1400

data = data.drop_duplicates()
print(len(data))
# 1400

If anyone can help me figure out what is happening here and how to solve it, I'll be very grateful. Thanks in advance.

Edit:

I now have another function that works with the result of the previous one:

def fuzz_analisis(dataframe, top_names):
    for index in top_names['nombre'].index:
        name2 = top_names.loc[index, 'nombre']
        for index2 in dataframe["nombre"].index:
            name = dataframe.loc[index2, 'nombre']
            if 90 < fuzz.ratio(name, name2) < 100:
                dataframe.loc[index, 'nombre'] = name

The dataframe looks like this:

    nombre  foto1   foto2   sexo    fecha_hora_registro
folio                   
131     JUAN DOMINGO GONZALEZ DELGADO   131.jpg     131.jpg     MASCULINO   2008-08-07 15:42:25
132     FRANCISCO JAVIER VELA RAMIREZ   132.jpg     132.jpg     MASCULINO   2008-08-07 15:50:42
133     JUAN CARLOS PEREZ MEDINA    133.jpg     133.jpg     MASCULINO   2008-08-07 16:37:24
134     ARMANDO SALINAS SALINAS     134.jpg     134.jpg     MASCULINO   2008-08-07 17:18:12
135     JOSE FELIX ZAMBRANO AMEZQUITA   135.jpg     135.jpg     MASCULINO   2008-08-07 17:55:05
  • Isn't top_names["nombre"]==name always False? It's comparing an iterable to an element of that iterable. One would expect that to be False. Commented Apr 16, 2020 at 4:28
  • Your second question is not clear to me. It seems that the two functions are doing the exact same thing, except for the exemptions. But won't those have been deduplicated already? Commented Apr 16, 2020 at 4:59
  • The second one takes the name list we made in the first one and a bigger dataframe; the name list is about 1300 names, but the big dataframe is about 2 million rows. Commented Apr 16, 2020 at 5:01
  • I see, that makes sense. I believe you can use the same pattern as below. First, set that second method up to return the dataframe and also swap the parameter order so that goes second. Then you can proceed as below to array_split the big dataframe, pool.map your partial(fuzz_analisis, top_names) over those splits, and vstack them back together at the end. This may cause some np.ndarray vs pd.DataFrame confusion, in which case you would need a few other conversions probably to promote from numpy to pandas. Commented Apr 16, 2020 at 5:18
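Sketching the pattern that last comment describes — the worker takes top_names first (so partial can bind it), rewrites its chunk of the big data, and returns the chunk so the results can be merged. To keep this runnable without fuzzywuzzy or pandas, plain lists and difflib stand in for the DataFrames and fuzz.ratio; in the real code you would use np.array_split and pool.map as in the accepted answer:

```python
from difflib import SequenceMatcher
from functools import partial

def ratio(a, b):
    # stdlib stand-in for fuzz.ratio (0-100 scale)
    return 100 * SequenceMatcher(None, a, b).ratio()

def fuzz_analisis(top_names, chunk):
    # rewrite names in this chunk of the big list with their close match
    # from top_names, then return the chunk so results can be merged
    out = list(chunk)
    for name2 in top_names:
        for j, name in enumerate(out):
            if 90 < ratio(name, name2) < 100:
                out[j] = name2
    return out

top_names = ["VICTOR MORENO MORENO"]
big = ["VICTOR MORENO MOREN", "IVAN ERNESTO SANCHEZ URROZ", "VICTOR MORENO MORENOO"]
chunks = [big[:2], big[2:]]            # np.array_split(data, cores) in the real code
worker = partial(fuzz_analisis, top_names)
merged = sum(map(worker, chunks), [])  # pool.map(worker, chunks) in the real code
```

Each chunk only needs to see its own slice of the big data, because every worker still compares against the full top_names list — that is what makes this split safe, unlike splitting both sides.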

2 Answers


You are splitting the data up before entering the twice nested loop, so you are not comparing all combinations.

You can reorganize the code to split the first name, but still test all second names against it. The following modification worked for me on your test data, although it did not find any duplicates.

from functools import partial
from fuzzywuzzy import fuzz
import multiprocessing as mp
import numpy as np

def fuzz_analisis0_partial(top_names, partial_top_names): 
    for name2 in top_names["nombre"]: 
        for name in partial_top_names["nombre"]:  
            if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100: 
                partial_top_names[partial_top_names["nombre"] == name] = name2 
    return partial_top_names

cores = mp.cpu_count() 
df_split = np.array_split(data, cores, axis=0) 

pool = mp.Pool(cores)
processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
processed = np.vstack(list(processed_parts))

pool.close() 
pool.join()

3 Comments

OK, this is a little over my head. I think I superficially understand what you have done, and it worked in my case. I was wondering if I could edit the question a little to add another very similar problem that uses the data variable edited by this function, or, if you prefer, I can accept the answer and then ask another question. What do you say?
It would be best to create a separate question if it has a substantially different answer. I'm happy to answer that as well, and also update here if there is additional clarification I can give.
I've made the edit up above, thanks for the kindness.

When you find that your algorithm is slow, multiprocessing is one way to speed it up, but you should probably try to speed up the algorithm itself first. In fuzzywuzzy, fuzz.ratio computes a normalized Levenshtein distance, which is an O(N*M) operation, so you should try to minimise how often you call it. Here is an optimised version of mcskinner's multiprocessed solution:

from functools import partial
from fuzzywuzzy import fuzz
import multiprocessing as mp
import numpy as np

def length_ratio(s1, s2):
    s1_len = len(s1)
    s2_len = len(s2)
    distance = s1_len - s2_len if s1_len > s2_len else s2_len - s1_len
    lensum = s1_len + s2_len
    return 100 - 100 * distance / lensum

def fuzz_analisis0_partial(top_names, partial_top_names):
    for name2 in top_names["nombre"]:
        for name in partial_top_names["nombre"]:
            if length_ratio(name, name2) < 90:
                continue

            ratio = fuzz.ratio(name, name2)
            if 90 < ratio < 100:
                partial_top_names[partial_top_names["nombre"] == name] = name2
    return partial_top_names

cores = mp.cpu_count() 
df_split = np.array_split(data, cores, axis=0) 

pool = mp.Pool(cores)
processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
processed = np.vstack(list(processed_parts))

pool.close() 
pool.join()

First off, this solution executes fuzz.ratio once per pair instead of twice; since that call is what takes most of the time, this alone should give you about a 50% runtime improvement. As a second improvement, it checks a length-based ratio beforehand. This length-based ratio is always at least as big as fuzz.ratio, but can be calculated in constant time, so all names with a big length difference can be skipped much faster. Besides this, make sure you're using fuzzywuzzy with python-Levenshtein, since that is a lot faster than the fallback version using difflib. As an even faster alternative you could use RapidFuzz (I am the author of RapidFuzz). RapidFuzz already applies this length check when you pass it a cutoff score, fuzz.ratio(name, name2, score_cutoff=90), so the length_ratio function is not required when using it.
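The upper-bound claim can be checked with the stdlib: SequenceMatcher.ratio (what fuzzywuzzy's fuzz.ratio falls back to without python-Levenshtein) is 2*M/(len1+len2) with M matching characters, and M can never exceed min(len1, len2), so it can never exceed the length-based ratio. A quick sanity check on the sample names:

```python
from difflib import SequenceMatcher

def length_ratio(s1, s2):
    # O(1) upper bound on the similarity ratio, based only on lengths
    distance = abs(len(s1) - len(s2))
    lensum = len(s1) + len(s2)
    return 100 - 100 * distance / lensum

def difflib_ratio(s1, s2):
    # what fuzzywuzzy's fuzz.ratio computes without python-Levenshtein
    return 100 * SequenceMatcher(None, s1, s2).ratio()

pairs = [
    ("VICTOR MORENO MORENO", "VICTOR MORENO MOREN"),
    ("SERGIO HERNANDEZ GUTIERREZ", "FRANCISCO JAVIER MUÑOZ LOPEZ"),
    ("JUAN CARLOS PEREZ MEDINA", "ARMANDO SALINAS SALINAS"),
]
for s1, s2 in pairs:
    # the cheap bound is never smaller than the expensive ratio,
    # so filtering on it can only skip pairs that would fail anyway
    assert length_ratio(s1, s2) >= difflib_ratio(s1, s2)
```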

Using RapidFuzz the equivalent function fuzz_analisis0_partial can be programmed the following way:

from rapidfuzz import fuzz

def fuzz_analisis0_partial(top_names, partial_top_names): 
    for name2 in top_names["nombre"]: 
        for name in partial_top_names["nombre"]:
            ratio = fuzz.ratio(name, name2, score_cutoff=90)
            if ratio > 90 and ratio < 100: 
                partial_top_names[partial_top_names["nombre"] == name] = name2 
    return partial_top_names

2 Comments

This solution fails with NameError: name 'len_s1' is not defined. If that variable name is corrected, it fails to remove duplicates. The length ratio 1.0 - distance / lensum is always between 0 and 1, so this line will skip everything: if length_ratio(name, name2) < 90.
Ah yes, you're right. I corrected the variable name and now multiply the result by 100.
