Large Scale Fuzzy Name Matching
ING Wholesale Banking Advanced Analytics
06/06/2018, Zhe Sun & Daniel van der Ende
#MLSAIS17
ING in a nutshell
• Worldwide financial institution
• Active in over 40 countries
• Approximately 50k employees
• Almost 40M customers
Wholesale Banking Advanced Analytics (WBAA)
Data Engineers, Data Scientists, Business Developers, Product Owners, UX Designers and Software Developers, working on internal and external products
What is Name Matching?

Ground truth (ING Customer Records):

ID | Name
1  | Daniel Dutch
2  | Daniel Irish
3  | General Zhe

Names to be matched (Moody’s Credit Ratings):

ID | Name              | Moody’s Rating
?  | Daniel Dutch B.V. | AA
?  | Zhe General Ltd.  | AAA

Matching result:

ID | Name              | Ground truth name | Similarity score | Moody’s Rating
1  | Daniel Dutch B.V. | Daniel Dutch      | 0.7              | AA
2  | Zhe General Ltd.  | General Zhe       | 0.8              | AAA

Compute pairwise similarity between ground truth names (GT) and names to be matched (NM), and pick the most similar ones as the matching result.
From business problem to data science problem

Given ground truth names GT = {GT_1, GT_2, GT_3} and names to be matched NM = {N_1, N_2}:

Pairwise similarity computation:

NM  | GT   | Similarity score
N_1 | GT_1 | f(N_1, GT_1)
N_1 | GT_2 | f(N_1, GT_2)
N_1 | GT_3 | f(N_1, GT_3)
N_2 | GT_1 | f(N_2, GT_1)
N_2 | GT_2 | f(N_2, GT_2)
N_2 | GT_3 | f(N_2, GT_3)

Best match selection:

NM  | GT   | Similarity score
N_1 | GT_1 | f(N_1, GT_1)
N_2 | GT_3 | f(N_2, GT_3)
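Conceptually, these two steps look like the following minimal plain-Python sketch (illustrative only, not the production implementation; f stands for any pairwise similarity function):

# Minimal sketch of pairwise scoring and best-match selection.
# f is a placeholder similarity function (the talk uses token-based
# cosine similarity); all names here are illustrative.
def best_matches(nm_names, gt_names, f):
    results = []
    for nm in nm_names:
        # Pairwise similarity computation: score nm against every GT name
        scored = [(gt, f(nm, gt)) for gt in gt_names]
        # Best match selection: keep the most similar ground truth name
        best_gt, best_score = max(scored, key=lambda pair: pair[1])
        results.append((nm, best_gt, best_score))
    return results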
Name Matching model: token-based cosine similarity

The scale of the problem at ING:
• Match 160 million names to 10 million names ≈ 1.6 × 10^15 similarity computations (160 × 10^6 × 10 × 10^6 = 1.6 × 10^15)
• Popular approaches such as Levenshtein distance would be far too slow at this scale!

Pipeline: Preprocessing → Vectorization → Cosine similarity → Candidate selection

We chose token-based cosine similarity for the sake of both speed and accuracy.
Name Matching Pipeline

Step 1: Preprocessing
GT names: Daniel Dutch, Daniel Irish, Zhe General
NM name: Daniel Dutch B.V.

Step 2: Vectorization

GT: Names        | Sparse vector
    Daniel Dutch | [0, 0.2, …, 0.8, …]
    Daniel Irish | [0, 0.2, 0.6, …, 0]
    Zhe General  | [0.6, 0, …, 0, 0.9]

NM: Names             | Sparse vector
    Daniel Dutch B.V. | [0, 0.3, …, 0.7, …]

Step 3: Cosine similarity

Name              | GT name      | Score
Daniel Dutch B.V. | Daniel Dutch | 0.8
Daniel Dutch B.V. | Daniel Irish | 0.7

Step 4: Candidate selection

Name              | GT name      | Score
Daniel Dutch B.V. | Daniel Dutch | 0.8
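The same four steps can be prototyped compactly outside Spark; a sketch using scikit-learn (the production system implements the equivalent stages in Spark ML):

# Sketch of the four pipeline steps with scikit-learn (prototype only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

gt_names = ["daniel dutch", "daniel irish", "zhe general"]   # preprocessed GT
nm_names = ["daniel dutch bv"]                               # preprocessed NM

# Step 2: vectorization into sparse TF-IDF vectors (word tokens here;
# the talk uses n-gram tokens)
vectorizer = TfidfVectorizer(analyzer="word")
gt_vectors = vectorizer.fit_transform(gt_names)
nm_vectors = vectorizer.transform(nm_names)

# Step 3: cosine similarity between every NM vector and every GT vector
scores = cosine_similarity(nm_vectors, gt_vectors)

# Step 4: candidate selection: pick the best GT name per NM name
best = scores.argmax(axis=1)
for i, nm in enumerate(nm_names):
    print(nm, "->", gt_names[best[i]], round(float(scores[i, best[i]]), 2))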
Scaling things up: distributed sparse matrix multiplication

[Diagram: the ground truth matrix GT (M × K) is broadcast to every executor. The names-to-match matrix NM (P × K) is split into chunks NM_1 … NM_x; in a map step each executor multiplies its chunk against the broadcast GT and keeps the top-N most similar names per row; a reduce step combines the per-chunk top-N results.]
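Each map task boils down to a sparse matrix product plus a row-wise top-N selection. A local, non-distributed sketch with SciPy, assuming rows are L2-normalized so the dot product equals cosine similarity:

# Local sketch of one map task: multiply a chunk of NM against GT^T and
# keep the top-N scores per row. Assumes rows of nm_chunk and gt are
# L2-normalized, so the dot product is the cosine similarity.
import numpy as np

def top_n_per_row(nm_chunk, gt, n=10):
    scores = nm_chunk.dot(gt.T)            # sparse (chunk_size x M) similarity matrix
    results = []
    for i in range(scores.shape[0]):
        row = scores.getrow(i).toarray().ravel()
        top = np.argsort(row)[::-1][:n]    # indices of the N best GT names
        results.append([(j, row[j]) for j in top if row[j] > 0])
    return results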
Scaling things up: hacking the SciPy implementation

• Combined multiplication and top-N selection in a single operation
• Implemented in C++ and Cython
• Uses less memory and is roughly 40% faster
• Blog post: Boosting the selection of the most similar entities in large scale datasets (https://medium.com/p/450b3242e618)
• https://github.com/ing-bank/sparse_dot_topn
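For reference, a minimal usage sketch of the open-sourced package (interface as released around the time of this talk; see the repository for the current API):

# Minimal usage sketch of sparse_dot_topn.
from scipy.sparse import random as sparse_random
from sparse_dot_topn import awesome_cossim_topn

A = sparse_random(1000, 5000, density=0.01, format='csr')
B = sparse_random(5000, 800, density=0.01, format='csr')

# Multiply A and B, keeping only the top 10 results per row with a
# score of at least 0.8: multiplication and top-N in a single pass
C = awesome_cossim_topn(A, B, ntop=10, lower_bound=0.8)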
Customized stage: easily wrap complex tasks
class CosSimMatcherModel():
    def __init__(self, spark, gt_features):
        # Broadcast the transposed ground truth feature matrix to the executors
        self.gt_features_bc = spark.sparkContext.broadcast(gt_features.T)

    def _transform(self, names_df):
        matched_rdd = (names_df
                       .select(col1, col2, col3)
                       .rdd
                       .mapPartitions(match_chunk_of_names)
                       .flatMap(lambda x: x))
        # Make the output a DataFrame again
        return matched_rdd.toDF(output_schema)
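The slide does not show the body of match_chunk_of_names; a hypothetical sketch, assuming each row carries (index, name, sparse feature vector) and that the broadcast variable is in scope:

# Hypothetical sketch of match_chunk_of_names (not shown in the talk).
# Each partition multiplies its chunk of name vectors against the
# broadcast ground truth matrix and yields the best candidate per name.
import numpy as np
from scipy.sparse import vstack

def match_chunk_of_names(rows):
    rows = list(rows)                  # assumed (index, name, features) tuples
    if not rows:
        return
    chunk = vstack([row[2] for row in rows])
    # gt_features_bc.value is the broadcast, transposed GT matrix (K x M);
    # in the real class it would be reached via self or a closure
    scores = chunk.dot(gt_features_bc.value)
    best = np.asarray(scores.argmax(axis=1)).ravel()
    for i, row in enumerate(rows):
        j = int(best[i])
        yield [(row[0], row[1], j, float(scores[i, j]))]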
Final Spark ML pipeline: elegant and easily maintainable
stages += [Preprocessor(params['preprocessor'], input_col=params['name_col'],
                        output_col='preprocessed')]
stages += [RegexTokenizer(inputCol='preprocessed', outputCol='tokens', pattern=r"\w+", gaps=False)]
stages += [NGram(inputCol='tokens', outputCol='ngram_tokens', n=params['ngram'])]
stages += [CountVectorizer(inputCol='ngram_tokens', outputCol='tf', vocabSize=2<<24)]
stages += [NormalizedTfidf(count_col="tf", token_col="ngram_tokens", output_col="features")]
stages += [CosSimMatcher(num_candidates=params['num_candidates'],
                         cos_sim_lower_bound=params['cos_sim_lower_bound'],
                         index_col=params['index_col'],
                         name_col=params['name_col'],
                         chunk_size=params['chunk_size'])]
snm = load_pickle(params['supervised_model_filename'], params['supervised_model_path'])
stages += [SupervisedNMTransformer(snm)]
self.pipeline = Pipeline(stages=stages)
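Once assembled, the pipeline is used like any other Spark ML pipeline (illustrative variable names):

# Illustrative usage: fit on the ground truth names, then transform
# the names to be matched
model = self.pipeline.fit(ground_truth_df)
matches_df = model.transform(names_to_match_df)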
160M names matched to 10M ground truth names on a 10-node cluster in 5 hours: approximately 8,000 names matched per second.
Name Matching can be applied to multiple problems

Matching ‘new’ names to existing ones is not a problem specific to this use case. Exposing this capability via an API adds value for other products, departments, and perhaps companies.

This use case changes the underlying design, however:
• The number of names to match will be significantly lower (thousands instead of millions)
• Near-real-time results are appreciated, especially for small datasets
• The ground truth may change, depending on the needs of the user
Structured Streaming

“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming”
Our setup

[Diagram: names arrive via Kafka, flow through Spark Structured Streaming into the Spark ML name matching pipeline, and the results are written back to Kafka.]
Structured Streaming
nm_obj = NameMatching({
    'streaming': True,
    'supervised_on': False
})
nm_obj.fit(ground_truth_df)

lines = (spark
         .readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", servers)
         .option("subscribe", topic_in)
         .option("failOnDataLoss", "false")
         .load())
Structured Streaming (2)
names_to_match = (lines
                  .select(lines.value.cast('string').alias("name")))

(nm_obj
 .transform(names_to_match)
 .selectExpr("extract_json(candidates, name) AS value")
 .writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", servers)
 .option("topic", kafka_topic_out)
 .start()
 .awaitTermination())
A small, yet big change needed for Structured Streaming

Remove all actions:
• Actions are not allowed in Structured Streaming

matched_rdd = (names_df
               .select(col1, col2, col3)
               .rdd
               .mapPartitions(match_chunk_of_names)
               .flatMap(lambda x: x))
# Make output a dataframe again
return matched_rdd.toDF(output_schema)
No more actions

PySpark does not support map/mapPartitions on a (streaming) dataframe, so we need to use a UDF:
match_name_udf = udf(match_chunk_of_names, candidate_list_schema)
matched_df = names_df \
    .withColumn('candidates', match_name_udf(names_df.features))
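The slide does not define candidate_list_schema; a hypothetical definition, assuming each name yields a list of (ground truth name, score) candidates:

# Hypothetical definition of candidate_list_schema (not shown in the talk):
# each matched name carries a list of (gt_name, score) candidate structs.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, FloatType

candidate_list_schema = ArrayType(StructType([
    StructField("gt_name", StringType()),
    StructField("score", FloatType()),
]))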
However…

[Chart: transform time per name (seconds) for varying ground truth sizes (1M, 2M, 4M, 6M, 12M), comparing the ‘RDD-method’ and the ‘UDF-method’.]
Python UDFs

[Diagram: the driver program (Python) talks to a SparkSession (JVM); on each executor, rows move between the Spark DataFrame (JVM) and the UDF (Python).]

Every row processed by a Python UDF must be serialized from the JVM to a Python worker process and back, which adds significant overhead.
Possible workarounds

• Scala UDF with a Java Native Interface connection to the C++ matrix multiplication (blog post: https://medium.com/p/b70033dd69b9)
• mapPartitions for Python (streaming) dataframes
• Sparse matrix multiplication for Scala/Spark
• Compute the cosine similarity outside of Spark
To broadcast or not to broadcast

[Diagram: the driver broadcasts the full ground truth (GT) to every executor; incoming names (Name1 … Name4) and tracer records are matched against the local copy on each executor.]
Wrapping up

• We built a large scale fuzzy name matching system in batch and streaming
• Spark ML is an elegant, powerful, easy-to-use abstraction for data science pipelines/models on Spark
• Combining Spark ML with Structured Streaming is easy, but optimizing and tweaking it can be hard and take a lot of time
• Monitoring Spark ML stages in Structured Streaming is challenging; we haven’t found a satisfactory solution yet
What to broadcast?

[Diagram: alternative to broadcasting the full ground truth: partition it instead. Executor 1 holds Ground Truth partition A, executor 2 holds partition B; each incoming name (Name1, Name2) is sent to every executor and the partial matches are combined in a reduce step.]
Spark ML pipeline: standard + customized stages

Example input: HANS Investment B.V., Willem Barentszstraat

Step | Stage               | Customized | Description                                                                                               | Example output
1    | Preprocessing       | Y          | Strip punctuation; accents to unicode; all characters to lower case; shorthand and abbreviation replacement | hans investment bv willem barentszstr
2    | Tokenizer           | N          | Splits the input string by white spaces                                                                   | [hans, investment, bv, willem, barentszstr]
3    | NGram               | N          | Converts the input tokens into an array of n-grams                                                        | [hans, investment, bv, willem, barentszstr]
4    | CountVectorizer     | N          | Extracts a vocabulary from document collections                                                           | (5, [0, 1, 2, 3, 4], [1.0, 1.0, 1.0, 1.0, 1.0])
5    | Normalized TFIDF    | Y          | Computes Term Frequency-Inverse Document Frequency (TF-IDF); a custom stage is needed to handle previously unseen tokens | (5, [0, 1, 2, 3, 4], [0.5, 0.1, 0.01, 0.2, 0.8])
6    | Cosine similarity   | Y          | Computes cosine similarity between input and ground truth                                                 |
7    | Candidate selection | Y          | Picks the most similar names via a supervised model                                                       |
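To make step 1 concrete, a sketch of the listed normalizations (illustrative only; the ABBREVIATIONS map and exact punctuation handling are assumptions, not the Preprocessor from the talk):

# Sketch of the preprocessing step (illustrative, not the exact
# Preprocessor implementation from the talk).
import re
import unicodedata

ABBREVIATIONS = {"straat": "str"}   # hypothetical shorthand replacements

def preprocess(name):
    # Accented characters to their plain ASCII equivalents
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = name.lower()                       # all characters to lower case
    name = re.sub(r"[^\w\s]", " ", name)      # strip punctuation
    for long_form, short_form in ABBREVIATIONS.items():
        name = name.replace(long_form, short_form)
    return " ".join(name.split())             # collapse whitespace

print(preprocess("HANS Investment B.V., Willem Barentszstraat"))
# -> "hans investment b v willem barentszstr"  (roughly; "b v" vs "bv"
#    depends on the punctuation handling chosen)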
