I'm trying to deduplicate a list of names, and to do this I'm using the fuzzywuzzy library.
I run two nested for loops, both over all the names. If two names have a fuzzy match score greater than 90 and less than 100, I overwrite the second name with the first name.
Here is a sample of my dataset, data:
nombre
0 VICTOR MORENO MORENO
1 SERGIO HERNANDEZ GUTIERREZ
2 FRANCISCO JAVIER MUÑOZ LOPEZ
3 JUAN RAYMUNDO MORALES MARTINEZ
4 IVAN ERNESTO SANCHEZ URROZ
And here is my function:
from fuzzywuzzy import fuzz

def fuzz_analisis0(top_names):
    for name2 in top_names["nombre"]:
        for name in top_names["nombre"]:
            if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100:
                top_names[top_names["nombre"]==name] = name2
When I run this with:
fuzz_analisis0(data)
Everything works fine. Here is some output that shows the effect:
print(len(data))
# 1400
data = data.drop_duplicates()
print(len(data))
# 1256
But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:
cores = mp.cpu_count()
df_split = np.array_split(data, cores, axis=0)
pool = Pool(cores)
df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
pool.close()
pool.join()
pool.clear()
The function finishes faster than expected and does not find any duplicates:
print(len(data))
# 1400
data = data.drop_duplicates()
print(len(data))
# 1400
If anyone can help me figure out what is happening here and how to solve it, I'd be very grateful. Thanks in advance.
Edit:
I now have another function that works with the result of the previous one:
def fuzz_analisis(dataframe, top_names):
    for index in top_names['nombre'].index:
        name2 = top_names.loc[index,'nombre']
        for index2 in dataframe["nombre"].index:
            name = dataframe.loc[index2,'nombre']
            if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100:
                dataframe.loc[index,'nombre'] = name
The dataframe looks like this:
                              nombre    foto1    foto2       sexo  fecha_hora_registro
folio
131    JUAN DOMINGO GONZALEZ DELGADO  131.jpg  131.jpg  MASCULINO  2008-08-07 15:42:25
132    FRANCISCO JAVIER VELA RAMIREZ  132.jpg  132.jpg  MASCULINO  2008-08-07 15:50:42
133         JUAN CARLOS PEREZ MEDINA  133.jpg  133.jpg  MASCULINO  2008-08-07 16:37:24
134          ARMANDO SALINAS SALINAS  134.jpg  134.jpg  MASCULINO  2008-08-07 17:18:12
135    JOSE FELIX ZAMBRANO AMEZQUITA  135.jpg  135.jpg  MASCULINO  2008-08-07 17:55:05
Is top_names["nombre"] == name always False? It's comparing an iterable to an element of that iterable. One would expect that to be False.
[...] dataframe and also swap the parameter order so that it goes second. Then you can proceed as below to array_split the big dataframe, pool.map your partial(fuzz_analisis, top_names) over those splits, and vstack them back together at the end. This may cause some np.ndarray vs pd.DataFrame confusion, in which case you would probably need a few other conversions to promote from numpy to pandas.
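The split / pool.map / partial / rejoin idea from the comment above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses plain Python lists instead of DataFrames, the helper names (similarity, canonicalize_chunk, split_evenly, canonicalize_parallel) are made up for this sketch, and similarity is a difflib-based stand-in for fuzz.ratio, so scores may differ slightly from fuzzywuzzy's. The two points it tries to show are that each chunk is compared against the full reference list (names that land in different chunks would otherwise never be compared), and that the worker returns its result, since changes made inside a worker process do not reach the parent's data.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def similarity(a, b):
    """Stand-in for fuzz.ratio (assumption): a 0-100 score from difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def canonicalize_chunk(reference_names, chunk):
    """Return a new list in which each name of `chunk` is replaced by the
    first reference name scoring above 90 but below 100 against it."""
    result = []
    for name in chunk:
        replacement = name
        for ref in reference_names:
            if 90 < similarity(name, ref) < 100:
                replacement = ref
                break
        result.append(replacement)
    # Returned rather than modified in place: a worker process only sees
    # a copy of the data, so in-place edits would be lost.
    return result


def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous pieces; the first pieces get
    the extra elements when the length is not divisible."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def canonicalize_parallel(reference_names, names, workers=2):
    """Map canonicalize_chunk over the chunks in a process pool, then
    join the per-chunk results back into one list."""
    worker = partial(canonicalize_chunk, reference_names)
    with Pool(workers) as pool:
        parts = pool.map(worker, split_evenly(names, workers))
    return [name for part in parts for name in part]
```

To try it, canonicalize_parallel(reference, names) would be called from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start worker processes by spawning. With DataFrames, the worker would return the modified chunk and the results would be rejoined with something like pd.concat instead of the list join used here.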