Guoqiong Song, Intel
Leveraging NLP and Deep Learning for Document Recommendation in the Cloud
#UnifiedAnalytics #SparkAISummit
Agenda
• Job/Resume Search Challenges and Opportunity
• Analytics Zoo and BigDL Overview
• Resume Search Analytics Zoo Solution
• Takeaways
Job/Resume Search Challenges and Opportunity
Job search
Personalize Results Value
• Job seekers (resume): find the right job faster
• Employers (job description): find the right person
Traditional Information Retrieval Sufferings
Solution challenges: stemming, synonyms, ontologies, sensitivity
Example pair: Warehouse, Warehouse
Stemming
Accountant ≠ Accounting
Stemming Solution
Accountant = Accounting
Stemming Sufferings
Accountant = Accounting = Account Representative
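The stemming trade-off above can be sketched in a few lines of Scala. This is only an illustration: the suffix list and the `stem` and `matches` helpers are hypothetical and are not taken from the talk or from any production stemmer.

```scala
object StemmingSketch {
  // Hypothetical suffix list for illustration only; real stemmers use far richer rules.
  private val suffixes = Seq("ing", "ant", "s")

  /** Strip the first matching suffix, keeping at least three leading characters. */
  def stem(word: String): String = {
    val w = word.toLowerCase
    suffixes
      .find(s => w.endsWith(s) && w.length - s.length >= 3)
      .map(s => w.dropRight(s.length))
      .getOrElse(w)
  }

  /** Two phrases match when they share at least one stemmed token. */
  def matches(query: String, title: String): Boolean = {
    val q = query.split("\\s+").map(stem).toSet
    val t = title.split("\\s+").map(stem).toSet
    (q intersect t).nonEmpty
  }
}
```

With these rules `StemmingSketch.matches("Accountant", "Accounting")` is true, which is the intended fix, but `StemmingSketch.matches("Accountant", "Account Representative")` is also true, which is the false match shown on the slide.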
Synonyms
Registered Nurse ≠ RN
Synonyms Solution
Registered Nurse = RN → registered nurse
Synonym Sufferings
. . .
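A minimal sketch of dictionary-based synonym expansion, assuming a hand-maintained map. The map contents and the `normalize` and `matches` helpers are hypothetical; they only illustrate why the list of entries never ends.

```scala
object SynonymSketch {
  // Hypothetical hand-maintained dictionary: every abbreviation needs its own entry.
  val synonyms: Map[String, String] = Map("rn" -> "registered nurse")

  /** Lower-case a term and expand it if the dictionary knows it. */
  def normalize(term: String): String = {
    val t = term.trim.toLowerCase
    synonyms.getOrElse(t, t)
  }

  /** Two terms match when their normalized forms are identical. */
  def matches(a: String, b: String): Boolean = normalize(a) == normalize(b)
}
```

`SynonymSketch.matches("RN", "Registered Nurse")` is true, but a pair such as "LPN" and "Licensed Practical Nurse" stays unmatched until someone adds it to the dictionary.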
Ontologies
Dishwasher ≠ Back of House
Ontologies Solution
Restaurant > Dishwasher = Restaurant > Back of House
Ontologies Sufferings
. . .
Specificity Suffering
Analytics Zoo and BigDL Overview
Unifying Analytics + AI on Apache Spark
BigDL: distributed, high-performance deep learning framework for Apache Spark
https://github.com/intel-analytics/bigdl
Analytics Zoo: distributed TensorFlow, Keras and BigDL on Spark; reference use cases, AI models, high-level APIs, feature engineering, etc.
https://github.com/intel-analytics/analytics-zoo
Unified Big Data Analytics Platform
Apache Hadoop & Spark Platform
• Data input: Flume, Kafka
• Storage: HBase, HDFS
• Data formats: Parquet, Avro
• Resource management & coordination: ZooKeeper, YARN
• Data processing & analysis: MR, Storm, Flink, Giraph, Spark Core (SQL, Streaming, MLlib, GraphX, DataFrame, ML Pipelines, SparkR)
• Workloads: batch, streaming, interactive, machine learning, graph analytics, SQL
• Interfaces: R, Python, Java, notebook, spreadsheet
Chasm Between Deep Learning and Big Data Communities
Deep learning experts | The Chasm | Average users (big data users, data scientists, analysts, etc.)
Bridging the Chasm
Make deep learning more accessible to big data and data science communities
• Continue the use of familiar SW tools and HW infrastructure to build deep learning applications
• Analyze “big data” using deep learning on the same Hadoop/Spark cluster where the data are stored
• Add deep learning functionalities to large-scale big data programs and/or workflows
• Leverage existing Hadoop/Spark clusters to run deep learning applications
• Shared, monitored and managed with other workloads (e.g., ETL, data warehouse, feature engineering, traditional ML, graph analytics, etc.) in a dynamic and elastic fashion
BigDL
Bringing Deep Learning To Big Data Platform
https://github.com/intel-analytics/BigDL
https://bigdl-project.github.io/
• Distributed deep learning framework for Apache Spark*
• Make deep learning more accessible to big data users and data scientists
• Write deep learning applications as standard Spark programs
• Run on existing Spark/Hadoop clusters (no changes needed)
• Feature parity with popular deep learning frameworks
• E.g., Caffe, Torch, TensorFlow, etc.
• High performance (on CPU)
• Powered by Intel MKL and multi-threaded programming
• Efficient scale-out
• Leveraging Spark for distributed training & inference
Diagram: BigDL on Spark Core, alongside SQL, SparkR, Streaming, MLlib, GraphX, ML Pipeline and DataFrame
BigDL Run as Standard Spark Programs
Standard Spark jobs
• No changes to the Spark or Hadoop clusters needed
Iterative
• Each iteration of the training runs as a Spark job
Data parallel
• Each Spark task runs the same model on a subset of the data (batch)
Distributed Training in BigDL
Peer-to-Peer All-Reduce Synchronization
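The data-parallel scheme above (each task runs the same model on its own batch, then gradients are synchronized) can be sketched in plain Scala without Spark. This is a toy illustration, not BigDL's implementation: the one-weight model y ≈ w · x, the helper names and the in-memory "partitions" are all assumptions made for the example.

```scala
object DataParallelSketch {
  /** Local gradient of 0.5 * (w * x - y)^2, averaged over one worker's batch. */
  def localGradient(w: Double, batch: Seq[(Double, Double)]): Double =
    batch.map { case (x, y) => (w * x - y) * x }.sum / batch.size

  /** Emulated all-reduce: every worker ends up with the mean of all local gradients. */
  def allReduceMean(localGrads: Seq[Double]): Double =
    localGrads.sum / localGrads.size

  /** One synchronous iteration: same weight on every partition, one shared update. */
  def step(w: Double, partitions: Seq[Seq[(Double, Double)]], lr: Double): Double = {
    val grads = partitions.map(batch => localGradient(w, batch))
    w - lr * allReduceMean(grads)
  }

  /** Run a fixed number of iterations starting from w = 0. */
  def train(partitions: Seq[Seq[(Double, Double)]], iterations: Int, lr: Double): Double =
    (1 to iterations).foldLeft(0.0)((w, _) => step(w, partitions, lr))
}
```

Because every worker applies the same averaged gradient, all replicas of the weight stay identical after each iteration, which is what lets each iteration run as an ordinary job over the partitions.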
Analytics Zoo
Unified Analytics + AI Platform for Big Data
Reference Use Cases
• Anomaly detection, sentiment analysis, fraud detection, image generation, chatbot, etc.
Built-In Deep Learning Models
• Image classification, object detection, text classification, text matching, recommendations, sequence-to-sequence, anomaly detection, etc.
Feature Engineering
• Feature transformations for image, text, 3D imaging, time series, speech, etc.
High-Level Pipeline APIs
• Distributed TensorFlow and Keras on Spark
• Native support for transfer learning, Spark DataFrame and ML Pipelines
• Model serving API for model serving/inference pipelines
Backends
• Spark, TensorFlow, Keras, BigDL, OpenVINO, MKL-DNN, etc.
https://github.com/intel-analytics/analytics-zoo/
https://analytics-zoo.github.io/
Distributed TensorFlow, Keras and BigDL on Spark
Analytics Zoo
Build end-to-end deep learning applications for big data
• Distributed TensorFlow on Spark
• Keras-style APIs (with autograd & transfer learning support)
• nnframes: native DL support for Spark DataFrames and ML Pipelines
• Built-in feature engineering operations for data preprocessing
Productionize deep learning applications for big data at scale
• Model serving APIs (w/ OpenVINO support)
• Support Web Services, Spark, Storm, Flink, Kafka, etc.
Out-of-the-box solutions
• Built-in deep learning models and reference use cases
What Can You Do with Analytics Zoo?
Anomaly Detection
• Using an LSTM network to detect anomalies in time series data
Fraud Detection
• Using a feed-forward neural network to detect fraud in credit card transaction data
Recommendation
• Use the Analytics Zoo Recommendation API (i.e., Neural Collaborative Filtering, Wide and Deep Learning) for recommendations on data with explicit feedback
Sentiment Analysis
• Sentiment analysis using neural network models (e.g., CNN, LSTM, GRU, Bi-LSTM)
Variational Autoencoder (VAE)
• Use a VAE to generate faces and digits
https://github.com/intel-analytics/analytics-zoo/tree/master/apps
Building and Deploying with BigDL/Analytics Zoo
http://software.intel.com/bigdl/build
Not a Full List
Resume Search Analytics Zoo Solution
Word Embeddings and GloVe Vectors
https://nlp.stanford.edu/projects/glove/
• Words or phrases from the vocabulary are mapped to vectors of real numbers.
• GloVe is an unsupervised learning algorithm based on a global log-bilinear regression model.
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus such as Wikipedia.
• The resulting vector representations showcase meaningful linear substructures of the word vector space.
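One common way to turn word vectors into a document vector is to average them and compare documents by cosine similarity. The sketch below assumes that approach; the talk does not say how its resume and job vectors were pooled. The 3-dimensional vectors are invented for illustration and are not real GloVe values, and `docVector` and `cosine` are hypothetical helper names.

```scala
object EmbeddingSketch {
  type Vec = Array[Double]

  // Toy vectors invented for illustration; real GloVe vectors have 50 to 300 dimensions.
  val toyVectors: Map[String, Vec] = Map(
    "nurse"      -> Array(0.9, 0.1, 0.0),
    "hospital"   -> Array(0.8, 0.2, 0.1),
    "accountant" -> Array(0.0, 0.9, 0.2),
    "ledger"     -> Array(0.1, 0.8, 0.3)
  )

  /** Average the vectors of known words; unknown words are skipped. */
  def docVector(text: String, dim: Int = 3): Vec = {
    val vecs = text.toLowerCase.split("\\W+").flatMap(w => toyVectors.get(w))
    if (vecs.isEmpty) Array.fill(dim)(0.0)
    else vecs.transpose.map(column => column.sum / vecs.length)
  }

  /** Cosine similarity; defined as 0.0 when either vector is all zeros. */
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }
}
```

With these toy values a "nurse hospital" document scores much closer to "nurse" than to "accountant ledger", without any stemming rule, synonym list or ontology.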
Analytics Zoo Recommender Model
• Neural collaborative filtering, Wide and Deep
• Answer the question using classification methodologies
• Implicit feedback and explicit feedback
• APIs: recommendForUser, recommendForItem, predictUserItemPair
https://github.com/intel-analytics/analytics-zoo/
https://analytics-zoo.github.io/
He, 2015
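Conceptually, per-user and per-item recommendation amounts to ranking scored (user, item) pairs and keeping the top few. The sketch below shows that idea on in-memory collections. It does not reproduce the Analytics Zoo method signatures, which work with a trained model on Spark; `Scored`, `topItemsPerUser` and `topUsersPerItem` are hypothetical names.

```scala
object TopKSketch {
  // Hypothetical record: the predicted probability that itemId suits userId.
  final case class Scored(userId: Int, itemId: Int, probability: Double)

  /** For each user, keep the maxItems highest-scoring items. */
  def topItemsPerUser(scores: Seq[Scored], maxItems: Int): Map[Int, Seq[Scored]] =
    scores.groupBy(_.userId).map { case (user, rows) =>
      user -> rows.sortBy(r => -r.probability).take(maxItems)
    }

  /** The symmetric view: for each item, keep the maxUsers highest-scoring users. */
  def topUsersPerItem(scores: Seq[Scored], maxUsers: Int): Map[Int, Seq[Scored]] =
    scores.groupBy(_.itemId).map { case (item, rows) =>
      item -> rows.sortBy(r => -r.probability).take(maxUsers)
    }
}
```

In the job search setting a "user" would be a job seeker's resume and an "item" a job description, so the two functions correspond to the two sides of the marketplace.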
Recommender model
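import com.intel.analytics.bigdl.nn.{Linear, LogSoftMax, ReLU, Sequential}
import com.intel.analytics.bigdl.numeric.NumericFloat

// Feed-forward classifier: 100 input features (resume and job GloVe vectors), 2 output classes
val model = Sequential[Float]()
model.add(Linear(100, 40)).add(ReLU())
  .add(Linear(40, 20)).add(ReLU())
  .add(Linear(20, 10)).add(ReLU())
  .add(Linear(10, 2)).add(ReLU())
  .add(LogSoftMax()) // log-probabilities over the 2 classes
Diagram: resume GloVe vectors + job GloVe vectors → Linear1 (40 outputs) → Linear2 (20 outputs) → Linear3 (10 outputs) → Linear4 (2 outputs) → LogSoftMax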
End to End Flow
https://software.intel.com/en-us/articles/talroo-uses-analytics-zoo-and-aws-to-leverage-deep-learning-for-job-recommendations
Evaluation Results
Charts: Precision, MRR
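For reference, the two metrics named on this slide can be computed as follows. This is a minimal sketch that assumes precision is measured at a cutoff k and that each query comes with a set of relevant items; the slide does not give the exact evaluation protocol, and the function names are hypothetical.

```scala
object RankingMetrics {
  /** Fraction of the top-k ranked items that are relevant. */
  def precisionAtK(ranked: Seq[String], relevant: Set[String], k: Int): Double = {
    require(k > 0, "k must be positive")
    ranked.take(k).count(item => relevant.contains(item)).toDouble / k
  }

  /** Reciprocal rank of the first relevant item, or 0.0 if none is retrieved. */
  def reciprocalRank(ranked: Seq[String], relevant: Set[String]): Double = {
    val idx = ranked.indexWhere(item => relevant.contains(item))
    if (idx < 0) 0.0 else 1.0 / (idx + 1)
  }

  /** Mean reciprocal rank over a collection of (ranking, relevant set) queries. */
  def meanReciprocalRank(queries: Seq[(Seq[String], Set[String])]): Double =
    if (queries.isEmpty) 0.0
    else queries.map { case (ranked, rel) => reciprocalRank(ranked, rel) }.sum / queries.size
}
```

Precision rewards rankings whose top results are mostly relevant, while MRR rewards placing the first relevant result as high as possible.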
Takeaways
• Analytics Zoo/BigDL integrates well into the existing AWS Databricks Spark ETL and machine learning platform
• Analytics Zoo/BigDL scales with our data and business
• Jobs and resumes can be effectively modeled and processed through embeddings
• Ensembling multiple models with GloVe embedding features proved to be very effective for rich content
• More information available at https://analytics-zoo.github.io/
Unified Analytics + AI Platform
Distributed TensorFlow, Keras and BigDL on Apache Spark
https://github.com/intel-analytics/analytics-zoo
Legal Disclaimers
• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
• No computer system can be absolutely secure.
• Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Intel, the Intel logo, Xeon, Xeon Phi, Lake Crest, etc. are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2019 Intel Corporation
