Leveraging NLP and Deep Learning for Document Recommendations in the Cloud

Guoqiong Song, Intel
#UnifiedAnalytics #SparkAISummit

Agenda
• Job/Resume Search Challenges and Opportunity
• Analytics Zoo and BigDL Overview
• Resume Search Analytics Zoo Solution
• Takeaways

Job search

Personalize Results Value
• Job seekers: find the right job faster (from a resume)
• Employers: find the right person (from a job description)

Traditional Information Retrieval Sufferings
Solution challenges: stemming, synonyms, ontologies, sensitivity
Warehouse
Warehouse

Stemming
Accountant ≠ Accounting

Stemming Solution
Accountant = Accounting

Stemming Sufferings
Account Representative =
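The stemming slides above can be illustrated with a minimal sketch. The suffix list and the `StemmingSketch` object are hypothetical and far cruder than a real stemmer such as Porter's. The sketch shows how stripping suffixes makes "Accountant" match "Accounting", and how the same rule can also make it match "Account Representative", which is one plausible reading of the "Stemming Sufferings" slide.

```scala
object StemmingSketch {
  // Toy suffix list (an assumption for illustration only).
  private val suffixes = Seq("ing", "ant", "s")

  // Strip the first matching suffix, keeping at least four characters of stem.
  def stem(word: String): String = {
    val w = word.toLowerCase
    suffixes
      .find(s => w.endsWith(s) && w.length - s.length >= 4)
      .map(s => w.dropRight(s.length))
      .getOrElse(w)
  }

  // Set of stems for every whitespace-separated word in a title.
  def stems(text: String): Set[String] =
    text.split("\\s+").filter(_.nonEmpty).map(stem).toSet

  // Two titles "match" when they share at least one stem.
  def matches(query: String, title: String): Boolean =
    stems(query).intersect(stems(title)).nonEmpty
}
```

Under these assumptions, `StemmingSketch.matches("Accountant", "Accounting")` is true, and so is `StemmingSketch.matches("Accountant", "Account Representative")`, which is probably not what the job seeker wanted.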

Synonyms
Registered Nurse ≠ RN

Synonyms Solution
Registered Nurse ≠ RN
Registered Nurse = RN → registered nurse

Synonym Sufferings
Registered Nurse ≠ RN
Registered Nurse = RN → registered nurse
. . .
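A minimal sketch of the synonym approach, assuming a hand-maintained dictionary. The `SynonymSketch` object and its single entry are hypothetical. It shows why the list on the slide trails off into ". . .": every abbreviation that is missing from the dictionary still fails to match.

```scala
object SynonymSketch {
  // Hypothetical hand-maintained dictionary: abbreviation -> canonical title.
  val synonyms: Map[String, String] = Map("rn" -> "registered nurse")

  // Lower-case the title and replace it with its canonical form if one is known.
  def normalize(title: String): String = {
    val t = title.trim.toLowerCase
    synonyms.getOrElse(t, t)
  }

  // Titles are considered equal when their normalized forms are equal.
  def sameTitle(a: String, b: String): Boolean = normalize(a) == normalize(b)
}
```

With this dictionary, `SynonymSketch.sameTitle("RN", "Registered Nurse")` is true, while `SynonymSketch.sameTitle("LPN", "Licensed Practical Nurse")` is false until someone adds that entry.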

Ontologies
Dishwasher ≠ Back of House

Ontologies Solution
Restaurant > Dishwasher = Restaurant > Back of House

Ontologies Sufferings
Restaurant > Dishwasher = Restaurant > Back of House
. . .
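The ontology idea can be sketched as a lookup over a small hierarchy. The `OntologySketch` object and its two edges are hypothetical, and a production ontology would need many more, which is the maintenance burden the ". . ." hints at. Two terms are treated as related when they share an ancestor.

```scala
object OntologySketch {
  // Hypothetical child -> parent edges taken from the slide's example.
  val parent: Map[String, String] = Map(
    "dishwasher" -> "restaurant",
    "back of house" -> "restaurant"
  )

  // The term itself followed by all of its ancestors.
  def ancestors(term: String): List[String] = {
    val t = term.toLowerCase
    t :: parent.get(t).map(p => ancestors(p)).getOrElse(Nil)
  }

  // Two terms are related when their ancestor chains overlap.
  def related(a: String, b: String): Boolean =
    ancestors(a).intersect(ancestors(b)).nonEmpty
}
```

Here `OntologySketch.related("Dishwasher", "Back of House")` is true because both sit under "restaurant"; a term that is absent from the map is related only to itself.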

Unifying Analytics + AI on Apache Spark
BigDL: Distributed, High-Performance Deep Learning Framework for Apache Spark
https://github.com/intel-analytics/bigdl
Analytics Zoo: Distributed TensorFlow, Keras and BigDL on Spark
Reference Use Cases, AI Models, High-level APIs, Feature Engineering, etc.
https://github.com/intel-analytics/analytics-zoo

Unified Big Data Analytics Platform
Apache Hadoop & Spark Platform
• Data input: Flume, Kafka
• Storage: HBase, HDFS
• File formats: Parquet, Avro
• Resource management & co-ordination: ZooKeeper, YARN
• Data processing & analysis: MR, Storm, Flink, Giraph, Spark Core (SQL, Streaming, MLlib, GraphX, DataFrame, ML Pipelines, SparkR)
• Workloads: batch, streaming, interactive, machine learning, graph analytics, SQL
• Interfaces: R, Python, Java, notebook, spreadsheet

Chasm b/w Deep Learning and Big Data Communities
Deep learning experts | The Chasm | Average users (big data users, data scientists, analysts, etc.)

Bridging the Chasm
Make deep learning more accessible to big data and data
science communities
• Continue the use of familiar SW tools and HW infrastructure to build deep learning
applications
• Analyze “big data” using deep learning on the same Hadoop/Spark cluster where the
data are stored
• Add deep learning functionalities to large-scale big data programs and/or workflows
• Leverage existing Hadoop/Spark clusters to run deep learning applications
• Shared, monitored and managed with other workloads (e.g., ETL, data warehouse, feature
engineering, traditional ML, graph analytics, etc.) in a dynamic and elastic fashion

BigDL
Bringing Deep Learning To Big Data Platform
https://github.com/intel-analytics/BigDL
https://bigdl-project.github.io/
• Distributed deep learning framework for Apache Spark*
• Make deep learning more accessible to big data users
and data scientists
• Write deep learning applications as standard Spark programs
• Run on existing Spark/Hadoop clusters (no changes needed)
• Feature parity with popular deep learning frameworks
• E.g., Caffe, Torch, TensorFlow, etc.
• High performance (on CPU)
• Powered by Intel MKL and multi-threaded programming
• Efficient scale-out
• Leveraging Spark for distributed training & inference
Spark stack: SQL, SparkR, Streaming, MLlib, GraphX, ML Pipeline and DataFrame on top of Spark Core

BigDL Runs as Standard Spark Programs
Standard Spark jobs
• No changes to the Spark or Hadoop clusters needed
Iterative
• Each iteration of the training runs as a Spark job
Data parallel
• Each Spark task runs the same model on a subset of the data (batch)

Distributed Training in BigDL
Peer-2-Peer All-Reduce Synchronization
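
The iterative, data-parallel scheme described above can be sketched in plain Scala without Spark. This is a conceptual, single-process illustration, not BigDL's implementation: the `DataParallelSketch` object, the one-weight linear model and the plain average standing in for all-reduce are all assumptions. Each "partition" computes a gradient on its own batch using the same weight, and the averaged gradient is applied once per iteration.

```scala
object DataParallelSketch {
  // Gradient of mean squared error for the model y ≈ w * x on one batch.
  def gradient(w: Double, batch: Seq[(Double, Double)]): Double =
    batch.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / batch.size

  // One synchronous iteration: every partition sees the same weight,
  // gradients are averaged (a stand-in for all-reduce), then applied once.
  def step(w: Double, partitions: Seq[Seq[(Double, Double)]], lr: Double): Double = {
    val grads = partitions.map(batch => gradient(w, batch)) // one task per partition
    val avg = grads.sum / grads.size
    w - lr * avg
  }

  // Run a fixed number of iterations starting from w = 0.
  def train(partitions: Seq[Seq[(Double, Double)]], iterations: Int, lr: Double): Double =
    (1 to iterations).foldLeft(0.0)((w, _) => step(w, partitions, lr))
}
```

For data generated by y = 3x and split across two partitions, `DataParallelSketch.train` with a small learning rate converges towards a weight of 3.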

Analytics Zoo
Unified Analytics + AI Platform for Big Data
Reference Use Cases
• Anomaly detection, sentiment analysis, fraud detection, image
generation, chatbot, etc.
Built-In Deep Learning Models
• Image classification, object detection, text classification, text matching,
recommendations, sequence-to-sequence, anomaly detection, etc.
Feature Engineering
Feature transformations for
• Image, text, 3D imaging, time series, speech, etc.
High-Level Pipeline APIs
• Distributed TensorFlow and Keras on Spark
• Native support for transfer learning, Spark DataFrame and ML Pipelines
• Model serving API for model serving/inference pipelines
Backends: Spark, TensorFlow, Keras, BigDL, OpenVINO, MKL-DNN, etc.
https://github.com/intel-analytics/analytics-zoo/ https://analytics-zoo.github.io/
Distributed TensorFlow, Keras and BigDL on Spark

Analytics Zoo
Build end-to-end deep learning applications for big data
• Distributed TensorFlow on Spark
• Keras-style APIs (with autograd & transfer learning support)
• nnframes: native DL support for Spark DataFrames and ML Pipelines
• Built-in feature engineering operations for data preprocessing
Productionize deep learning applications for big data at scale
• Model serving APIs (w/ OpenVINO support)
• Support Web Services, Spark, Storm, Flink, Kafka, etc.
Out-of-the-box solutions
• Built-in deep learning models and reference use cases

What Can You Do with Analytics Zoo?
Anomaly Detection
• Using LSTM network to detect anomalies in time series data
Fraud Detection
• Using feed-forward neural network to detect fraud in credit card
transaction data
Recommendation
• Use Analytics Zoo Recommendation API (i.e., Neural Collaborative
Filtering, Wide and Deep Learning) for recommendations on data with
explicit feedback.
Sentiment Analysis
• Sentiment analysis using neural network models (e.g. CNN, LSTM, GRU,
Bi-LSTM)
Variational Autoencoder (VAE)
• Use VAE to generate faces and digits
https://github.com/intel-analytics/analytics-zoo/tree/master/apps

Building and Deploying with BigDL/Analytics Zoo
http://software.intel.com/bigdl/build
Not a Full List

Word Embeddings and GloVe Vectors
https://nlp.stanford.edu/projects/glove/
• Words or phrases from the vocabulary are mapped to vectors of real numbers.
• GloVe is an unsupervised learning algorithm built on a global log-bilinear regression model.
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus such as Wikipedia.
• Vector representations showcase meaningful linear substructures of the word vector space.
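One common way to turn word vectors into a document representation is to average them and compare documents by cosine similarity. The sketch below assumes that approach, and the slides do not confirm it is the one used here. The 3-dimensional vectors are made up for illustration, whereas real GloVe vectors are loaded from the pretrained files, and `EmbeddingSketch` is a hypothetical name.

```scala
object EmbeddingSketch {
  type Vec = Array[Double]

  // Toy vectors standing in for pretrained GloVe vectors (values are invented).
  val vectors: Map[String, Vec] = Map(
    "registered" -> Array(0.9, 0.1, 0.0),
    "nurse" -> Array(0.8, 0.3, 0.1),
    "rn" -> Array(0.85, 0.2, 0.05),
    "warehouse" -> Array(0.0, 0.2, 0.9)
  )

  // Average the vectors of the known words; None if no word is known.
  def embed(text: String): Option[Vec] = {
    val found = text.toLowerCase.split("\\s+").toList.flatMap(w => vectors.get(w))
    if (found.isEmpty) None
    else {
      val dim = found.head.length
      Some(Array.tabulate(dim)(i => found.map(v => v(i)).sum / found.size))
    }
  }

  // Cosine similarity between two vectors of equal length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norm
  }
}
```

With these toy values, the averaged vector for "Registered Nurse" is much closer to "rn" than to "warehouse", without any stemming rule or synonym entry.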

Analytics Zoo Recommender Model
• Neural collaborative filtering, Wide and Deep
• Answer the question using classification methodologies
• Implicit feedback and explicit feedback
• APIs
• recommendForUser
• recommendForItem
• predictUserItemPair
https://github.com/intel-analytics/analytics-zoo/ https://analytics-zoo.github.io/
He, 2015
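
What per-user and per-item recommendation amounts to can be sketched with plain Scala collections. This is not the Analytics Zoo implementation or its signatures: the `Scored` case class and the `topItemsPerUser` and `topUsersPerItem` helpers are hypothetical stand-ins. Given a model score for each (user, item) pair, they keep the highest-scoring entries for each user or each item.

```scala
object RecommendSketch {
  // One scored (user, item) pair; field names are illustrative only.
  final case class Scored(userId: Int, itemId: Int, probability: Double)

  // Keep the maxItems highest-scoring items for each user.
  def topItemsPerUser(scored: Seq[Scored], maxItems: Int): Map[Int, Seq[Scored]] =
    scored.groupBy(_.userId).map { case (user, rows) =>
      user -> rows.sortBy(r => -r.probability).take(maxItems)
    }

  // Keep the maxUsers highest-scoring users for each item.
  def topUsersPerItem(scored: Seq[Scored], maxUsers: Int): Map[Int, Seq[Scored]] =
    scored.groupBy(_.itemId).map { case (item, rows) =>
      item -> rows.sortBy(r => -r.probability).take(maxUsers)
    }
}
```

In the job search setting, a "user" could be a resume and an "item" a job posting, so the two helpers correspond to the two directions of matching on the earlier slides.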

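Recommender model
import com.intel.analytics.bigdl.nn._
import com.intel.analytics.bigdl.numeric.NumericFloat
// 100 input features in, log-probabilities over 2 classes out.
val model = Sequential[Float]()
model.add(Linear(100, 40)).add(ReLU())
  .add(Linear(40, 20)).add(ReLU())
  .add(Linear(20, 10)).add(ReLU()) // Linear3 in the diagram; the ReLU is assumed
  .add(Linear(10, 2)).add(LogSoftMax()) // Linear4 in the diagram
Network diagram: Resume GloVe vectors + Job GloVe vectors → Linear1 (40 outputs) → Linear2 (20 outputs) → Linear3 (10 outputs) → Linear4 (2 outputs) → LogSoftMax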

End to End Flow
https://software.intel.com/en-us/articles/talroo-uses-analytics-zoo-and-aws-to-leverage-deep-learning-for-job-recommendations

Evaluation Results
Metrics shown: precision and MRR (mean reciprocal rank)
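For reference, the two metrics can be computed as follows. This is a minimal sketch: the slide does not state the cutoff used for precision, so `k` is left as a parameter, and the `RankingMetrics` object and its integer item ids are assumptions.

```scala
object RankingMetrics {
  // Fraction of the top-k ranked items that are relevant.
  def precisionAtK(ranked: Seq[Int], relevant: Set[Int], k: Int): Double =
    if (k <= 0) 0.0 else ranked.take(k).count(id => relevant.contains(id)).toDouble / k

  // Reciprocal of the 1-based rank of the first relevant item, or 0 if none appears.
  def reciprocalRank(ranked: Seq[Int], relevant: Set[Int]): Double = {
    val idx = ranked.indexWhere(id => relevant.contains(id))
    if (idx < 0) 0.0 else 1.0 / (idx + 1)
  }

  // Mean reciprocal rank over several (ranking, relevant set) queries.
  def mrr(queries: Seq[(Seq[Int], Set[Int])]): Double =
    if (queries.isEmpty) 0.0
    else queries.map { case (ranked, rel) => reciprocalRank(ranked, rel) }.sum / queries.size
}
```

Precision rewards having many relevant results near the top, while MRR rewards placing the first relevant result as high as possible.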

Takeaways
• Analytics Zoo/BigDL integrates well into our existing AWS Databricks Spark ETL and machine learning platform
• Analytics Zoo/BigDL scales with our data and business
• Jobs and resumes can be effectively modeled and processed through embeddings
• Ensembling multiple models with GloVe embedding features proved to be very effective for rich content
• More information available at https://analytics-zoo.github.io/

Unified Analytics + AI Platform
Distributed TensorFlow, Keras and BigDL on Apache Spark
https://github.com/intel-analytics/analytics-zoo

Legal Disclaimers
• Intel technologies’ features and benefits depend on system configuration and may require enabled
hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
• No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in
hardware, software, or configuration will affect actual performance. Consult other sources of
information to evaluate performance as you consider your purchase. For more complete information
about performance and benchmark results, visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, Xeon Phi, Lake Crest, etc. are trademarks of Intel Corporation in the U.S.
and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2019 Intel Corporation
