Large Scale Fuzzy Name Matching
ING Wholesale Banking Advanced Analytics
06/06/2018, Zhe Sun & Daniel van der Ende
#MLSAIS17
ING in a nutshell
• Worldwide financial institution
• Active in over 40 countries
• Approximately 50k employees
• Almost 40M customers
Wholesale Banking Advanced Analytics (WBAA)
Data Engineers, Data Scientists, Business Developers, Product Owners, UX Designers and Software Developers, working on internal and external products
What is Name Matching?

Ground truth (ING Customer Records):

ID | Name
1  | Daniel Dutch
2  | Daniel Irish
3  | General Zhe

Names to be matched (Moody’s Credit Ratings):

ID | Name              | Moody’s Rating
?  | Daniel Dutch B.V. | AA
?  | Zhe General Ltd.  | AAA

Matching result:

ID | Name              | Ground truth name | Similarity score | Moody’s Rating
1  | Daniel Dutch B.V. | Daniel Dutch      | 0.7              | AA
2  | Zhe General Ltd.  | General Zhe       | 0.8              | AAA

Compute pairwise similarity between ground truth names (GT) and names to be matched (NM), and pick the most similar ones as the matching result.
From business problem to data science problem

Given ground truth names GT = {GT_1, GT_2, GT_3} and names to be matched NM = {N_1, N_2}:

Pairwise similarity computation:

NM  | GT   | Similarity score
N_1 | GT_1 | f(N_1, GT_1)
N_1 | GT_2 | f(N_1, GT_2)
N_1 | GT_3 | f(N_1, GT_3)
N_2 | GT_1 | f(N_2, GT_1)
N_2 | GT_2 | f(N_2, GT_2)
N_2 | GT_3 | f(N_2, GT_3)

Best match selection:

NM  | GT   | Similarity score
N_1 | GT_1 | f(N_1, GT_1)
N_2 | GT_3 | f(N_2, GT_3)
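Conceptually, these two steps look like the following minimal plain-Python sketch (illustrative only, not the production implementation; f stands for any pairwise similarity function):

# Minimal sketch of pairwise scoring and best-match selection.
# f is a placeholder similarity function (the talk uses token-based
# cosine similarity); all names here are illustrative.
def best_matches(nm_names, gt_names, f):
    results = []
    for nm in nm_names:
        # Pairwise similarity computation: score nm against every GT name
        scored = [(gt, f(nm, gt)) for gt in gt_names]
        # Best match selection: keep the most similar ground truth name
        best_gt, best_score = max(scored, key=lambda pair: pair[1])
        results.append((nm, best_gt, best_score))
    return results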
Name Matching model: token-based cosine similarity

The scale of the problem at ING:
• Match 160 million names to 10 million names ≈ 1.6 × 10^15 similarity computations (160 × 10^6 × 10 × 10^6 = 1.6 × 10^15)
• Popular approaches such as Levenshtein distance would be far too slow at this scale!

Pipeline: Preprocessing → Vectorization → Cosine similarity → Candidate selection

We chose token-based cosine similarity for the sake of both speed and accuracy.
Name Matching Pipeline

Step 1: Preprocessing
GT names: Daniel Dutch, Daniel Irish, Zhe General
NM name: Daniel Dutch B.V.

Step 2: Vectorization

GT: Names        | Sparse vector
    Daniel Dutch | [0, 0.2, …, 0.8, …]
    Daniel Irish | [0, 0.2, 0.6, …, 0]
    Zhe General  | [0.6, 0, …, 0, 0.9]

NM: Names             | Sparse vector
    Daniel Dutch B.V. | [0, 0.3, …, 0.7, …]

Step 3: Cosine similarity

Name              | GT name      | Score
Daniel Dutch B.V. | Daniel Dutch | 0.8
Daniel Dutch B.V. | Daniel Irish | 0.7

Step 4: Candidate selection

Name              | GT name      | Score
Daniel Dutch B.V. | Daniel Dutch | 0.8
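The same four steps can be prototyped compactly outside Spark; a sketch using scikit-learn (the production system implements the equivalent stages in Spark ML):

# Sketch of the four pipeline steps with scikit-learn (prototype only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

gt_names = ["daniel dutch", "daniel irish", "zhe general"]   # preprocessed GT
nm_names = ["daniel dutch bv"]                               # preprocessed NM

# Step 2: vectorization into sparse TF-IDF vectors (word tokens here;
# the talk uses n-gram tokens)
vectorizer = TfidfVectorizer(analyzer="word")
gt_vectors = vectorizer.fit_transform(gt_names)
nm_vectors = vectorizer.transform(nm_names)

# Step 3: cosine similarity between every NM vector and every GT vector
scores = cosine_similarity(nm_vectors, gt_vectors)

# Step 4: candidate selection: pick the best GT name per NM name
best = scores.argmax(axis=1)
for i, nm in enumerate(nm_names):
    print(nm, "->", gt_names[best[i]], round(float(scores[i, best[i]]), 2))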
Scaling things up: distributed sparse matrix multiplication

[Diagram: the ground truth matrix GT (M × K) is broadcast to every executor. The names-to-match matrix NM (P × K) is split into chunks NM_1 … NM_x; in a map step each executor multiplies its chunk against the broadcast GT and keeps the top-N most similar names per row; a reduce step combines the per-chunk top-N results.]
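Each map task boils down to a sparse matrix product plus a row-wise top-N selection. A local, non-distributed sketch with SciPy, assuming rows are L2-normalized so the dot product equals cosine similarity:

# Local sketch of one map task: multiply a chunk of NM against GT^T and
# keep the top-N scores per row. Assumes rows of nm_chunk and gt are
# L2-normalized, so the dot product is the cosine similarity.
import numpy as np

def top_n_per_row(nm_chunk, gt, n=10):
    scores = nm_chunk.dot(gt.T)            # sparse (chunk_size x M) similarity matrix
    results = []
    for i in range(scores.shape[0]):
        row = scores.getrow(i).toarray().ravel()
        top = np.argsort(row)[::-1][:n]    # indices of the N best GT names
        results.append([(j, row[j]) for j in top if row[j] > 0])
    return results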
Scaling things up: hacking the SciPy implementation

• Combined multiplication and top-N selection in a single operation
• Implemented in C++ and Cython
• Uses less memory and is roughly 40% faster
• Blog post: Boosting the selection of the most similar entities in large scale datasets (https://medium.com/p/450b3242e618)
• https://github.com/ing-bank/sparse_dot_topn
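For reference, a minimal usage sketch of the open-sourced package (interface as released around the time of this talk; see the repository for the current API):

# Minimal usage sketch of sparse_dot_topn.
from scipy.sparse import random as sparse_random
from sparse_dot_topn import awesome_cossim_topn

A = sparse_random(1000, 5000, density=0.01, format='csr')
B = sparse_random(5000, 800, density=0.01, format='csr')

# Multiply A and B, keeping only the top 10 results per row with a
# score of at least 0.8: multiplication and top-N in a single pass
C = awesome_cossim_topn(A, B, ntop=10, lower_bound=0.8)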
Customized stage: easily wrap complex tasks
class CosSimMatcherModel():
    def __init__(self, spark, gt_features):
        # Broadcast the transposed ground truth feature matrix to the executors
        self.gt_features_bc = spark.sparkContext.broadcast(gt_features.T)

    def _transform(self, names_df):
        matched_rdd = (names_df
                       .select(col1, col2, col3)
                       .rdd
                       .mapPartitions(match_chunk_of_names)
                       .flatMap(lambda x: x))
        # Make the output a DataFrame again
        return matched_rdd.toDF(output_schema)
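The slide does not show the body of match_chunk_of_names; a hypothetical sketch, assuming each row carries (index, name, sparse feature vector) and that the broadcast variable is in scope:

# Hypothetical sketch of match_chunk_of_names (not shown in the talk).
# Each partition multiplies its chunk of name vectors against the
# broadcast ground truth matrix and yields the best candidate per name.
import numpy as np
from scipy.sparse import vstack

def match_chunk_of_names(rows):
    rows = list(rows)                  # assumed (index, name, features) tuples
    if not rows:
        return
    chunk = vstack([row[2] for row in rows])
    # gt_features_bc.value is the broadcast, transposed GT matrix (K x M);
    # in the real class it would be reached via self or a closure
    scores = chunk.dot(gt_features_bc.value)
    best = np.asarray(scores.argmax(axis=1)).ravel()
    for i, row in enumerate(rows):
        j = int(best[i])
        yield [(row[0], row[1], j, float(scores[i, j]))]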
Final Spark ML pipeline: elegant and easily maintainable
stages += [Preprocessor(params['preprocessor'], input_col=params['name_col'],
                        output_col='preprocessed')]
stages += [RegexTokenizer(inputCol='preprocessed', outputCol='tokens', pattern=r"\w+", gaps=False)]
stages += [NGram(inputCol='tokens', outputCol='ngram_tokens', n=params['ngram'])]
stages += [CountVectorizer(inputCol='ngram_tokens', outputCol='tf', vocabSize=2<<24)]
stages += [NormalizedTfidf(count_col="tf", token_col="ngram_tokens", output_col="features")]
stages += [CosSimMatcher(num_candidates=params['num_candidates'],
                         cos_sim_lower_bound=params['cos_sim_lower_bound'],
                         index_col=params['index_col'],
                         name_col=params['name_col'],
                         chunk_size=params['chunk_size'])]
snm = load_pickle(params['supervised_model_filename'], params['supervised_model_path'])
stages += [SupervisedNMTransformer(snm)]
self.pipeline = Pipeline(stages=stages)
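Once assembled, the pipeline is used like any other Spark ML pipeline (illustrative variable names):

# Illustrative usage: fit on the ground truth names, then transform
# the names to be matched
model = self.pipeline.fit(ground_truth_df)
matches_df = model.transform(names_to_match_df)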
160M names matched to 10M ground truth names on a 10-node cluster in 5 hours: approximately 8,000 names matched per second.
Name Matching can be applied to multiple problems

Matching ‘new’ names to existing ones is not a problem specific to this use case. Exposing this capability via an API adds value for other products, departments, and perhaps companies.

This use case changes the underlying design, however:
• The number of names to match will be significantly lower (thousands instead of millions)
• Near-real-time results are appreciated, especially for small datasets
• The ground truth may change, depending on the needs of the user
Structured Streaming

“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming”
Our setup

[Diagram: names arrive via Kafka, flow through Spark Structured Streaming into the Spark ML name matching pipeline, and the results are written back to Kafka.]
Structured Streaming
nm_obj = NameMatching({
    'streaming': True,
    'supervised_on': False
})
nm_obj.fit(ground_truth_df)

lines = (spark
         .readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", servers)
         .option("subscribe", topic_in)
         .option("failOnDataLoss", "false")
         .load())
Structured Streaming (2)
names_to_match = (lines
                  .select(lines.value.cast('string').alias("name")))

(nm_obj
 .transform(names_to_match)
 .selectExpr("extract_json(candidates, name) AS value")
 .writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", servers)
 .option("topic", kafka_topic_out)
 .start()
 .awaitTermination())
A small, yet big change needed for Structured Streaming

Remove all actions:
• Actions are not allowed in Structured Streaming

matched_rdd = (names_df
               .select(col1, col2, col3)
               .rdd
               .mapPartitions(match_chunk_of_names)
               .flatMap(lambda x: x))
# Make output a dataframe again
return matched_rdd.toDF(output_schema)
No more actions

PySpark does not support map/mapPartitions on a (streaming) dataframe, so we need to use a UDF:
match_name_udf = udf(match_chunk_of_names, candidate_list_schema)
matched_df = names_df \
    .withColumn('candidates', match_name_udf(names_df.features))
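The slide does not define candidate_list_schema; a hypothetical definition, assuming each name yields a list of (ground truth name, score) candidates:

# Hypothetical definition of candidate_list_schema (not shown in the talk):
# each matched name carries a list of (gt_name, score) candidate structs.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, FloatType

candidate_list_schema = ArrayType(StructType([
    StructField("gt_name", StringType()),
    StructField("score", FloatType()),
]))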
However…

[Chart: transform time per name (seconds) for varying ground truth sizes (1M, 2M, 4M, 6M, 12M), comparing the ‘RDD-method’ and the ‘UDF-method’.]
Python UDFs

[Diagram: the driver program (Python) talks to a SparkSession (JVM); on each executor, rows move between the Spark DataFrame (JVM) and the UDF (Python).]

Every row processed by a Python UDF must be serialized from the JVM to a Python worker process and back, which adds significant overhead.
Possible workarounds

• Scala UDF with a Java Native Interface connection to the C++ matrix multiplication (blog post: https://medium.com/p/b70033dd69b9)
• mapPartitions for Python (streaming) dataframes
• Sparse matrix multiplication for Scala/Spark
• Compute the cosine similarity outside of Spark
To broadcast or not to broadcast

[Diagram: the driver broadcasts the full ground truth (GT) to every executor; incoming names (Name1 … Name4) and tracer records are matched against the local copy on each executor.]
Wrapping up

• We built a large scale fuzzy name matching system in batch and streaming
• Spark ML is an elegant, powerful, easy-to-use abstraction for data science pipelines/models on Spark
• Combining Spark ML with Structured Streaming is easy, but optimizing and tweaking it can be hard and take a lot of time
• Monitoring Spark ML stages in Structured Streaming is challenging; we haven’t found a satisfactory solution yet
What to broadcast?

[Diagram: alternative to broadcasting the full ground truth: partition it instead. Executor 1 holds Ground Truth partition A, executor 2 holds partition B; each incoming name (Name1, Name2) is sent to every executor and the partial matches are combined in a reduce step.]
Spark ML pipeline: standard + customized stages

Example input: HANS Investment B.V., Willem Barentszstraat

Step | Stage               | Customized | Description                                                                                               | Example output
1    | Preprocessing       | Y          | Strip punctuation; accents to unicode; all characters to lower case; shorthand and abbreviation replacement | hans investment bv willem barentszstr
2    | Tokenizer           | N          | Splits the input string by white spaces                                                                   | [hans, investment, bv, willem, barentszstr]
3    | NGram               | N          | Converts the input tokens into an array of n-grams                                                        | [hans, investment, bv, willem, barentszstr]
4    | CountVectorizer     | N          | Extracts a vocabulary from document collections                                                           | (5, [0, 1, 2, 3, 4], [1.0, 1.0, 1.0, 1.0, 1.0])
5    | Normalized TFIDF    | Y          | Computes Term Frequency-Inverse Document Frequency (TF-IDF); a custom stage is needed to handle previously unseen tokens | (5, [0, 1, 2, 3, 4], [0.5, 0.1, 0.01, 0.2, 0.8])
6    | Cosine similarity   | Y          | Computes cosine similarity between input and ground truth                                                 |
7    | Candidate selection | Y          | Picks the most similar names via a supervised model                                                       |
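To make step 1 concrete, a sketch of the listed normalizations (illustrative only; the ABBREVIATIONS map and exact punctuation handling are assumptions, not the Preprocessor from the talk):

# Sketch of the preprocessing step (illustrative, not the exact
# Preprocessor implementation from the talk).
import re
import unicodedata

ABBREVIATIONS = {"straat": "str"}   # hypothetical shorthand replacements

def preprocess(name):
    # Accented characters to their plain ASCII equivalents
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = name.lower()                       # all characters to lower case
    name = re.sub(r"[^\w\s]", " ", name)      # strip punctuation
    for long_form, short_form in ABBREVIATIONS.items():
        name = name.replace(long_form, short_form)
    return " ".join(name.split())             # collapse whitespace

print(preprocess("HANS Investment B.V., Willem Barentszstraat"))
# -> "hans investment b v willem barentszstr"  (roughly; "b v" vs "bv"
#    depends on the punctuation handling chosen)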
