Data Science
Learning from experiments
About Me
~15 years | ~12 products | Various roles

Name Gaurav Marwaha
Current Associated with Nucleus Software, with complete ownership of new product
development for a loan origination product for banks/ NBFCs.
Driving the technology teams to deliver an internally reusable product
development framework.
Past Successfully led & contributed to multiple product teams in different
domains (GIS/ Health/ e-Governance)

Technology Java/ big-data/ analytics/ Spring/ ESB (Camel)/ Mobile/ Social
Product Conceptualization, Design, Development, Maintenance, EOL, Strategy &
roadmap
Soft Team building, coaching, mentoring
LinkedIn http://in.linkedin.com/in/gauravmarwaha/
Table of Contents
› Introduction

› Assumptions
› Experiment 1: Inferring written text
› Experiment 2: Scoring public data

› Experiment 3: Discovering cross sell opportunities
› Learnings
› Tools & References
Introduction
We generated more digital data in 2013 than ever before.
Everyone wants to know more about me, from my bank to the
places I shop, from Google to the mall store owner. Everyone
wants to know what I want before I know it myself.
Quants have tried to predict stock movements from the
history of trades for years now.
Businesses can leverage the abundant data from smart
phones, desktops, etc. to make critical CAPEX/ marketing
decisions. Knowing how to derive value out of this data is
more important today than ever.
Assumptions
This short presentation will only focus on problems I
worked on; it will avoid the theoretical aspects of data science.
› Assuming viewers of this have read about:
– Language processing: stemming, tokenization, parts-of-speech tagging
– Basics of machine learning clustering/ classification techniques
– Point clouds and dimensional analysis on data using them
– Java/ J2EE based web application development

› To my knowledge none of these experiments became part of a
commercial product.

› I have purposefully kept the presentation focused on learnings,
avoiding the nuts & bolts to keep it short.
Experiment 1: Inferring written text
Scenarios
Text analysis refers to inferring valuable
knowledge from a given piece of text, which
may help in further actions/ decisions.

Example applications:
– Customer Support: auto-respond bots for text, auto-respond IVR bots,
automatic responses to email queries
– Text Mining: legal text, medical records
– Social Analysis: Facebook page analyzer, Twitter stream analysis,
other sentiment analysis
– Computer Games: AI games, betting games
– Decision Support: military use, email analysis

Challenges:
1. Slang – we use a lot of phrases which deviate
from the defined grammar of a language.
2. Ambiguity – there is a lot of ambiguity in some
sentences, where the speaker may be making a
pun or a sarcastic remark.
3. Language – English and other European
languages are not the only ones spoken; some
users may mix languages, like English + Spanish.
Customer Support
I will limit the discussion to this topic: a user writes in to
customer support during off hours and, instead of a
standard reply, the query first goes to a bot which tries to
answer it.
There can be numerous other use cases for this service; the
key elements are:
1. The calling application – the consumer of the service,
which passes the user query
2. Text parser – the engine which receives and parses the text
3. Dictionary – a list of phrases/ words of interest, used to map the
query to something that the machine understands.
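A minimal sketch of the third element, the dictionary, as a map from phrases of interest to intents the machine understands. The phrases, intent names and method are all illustrative assumptions, not the deck's actual implementation:

```java
import java.util.Locale;
import java.util.Map;

public class DictionaryLookup {

    // Toy dictionary; a real one would be curated per product/ domain.
    static final Map<String, String> DICTIONARY = Map.of(
            "refund", "BILLING_REFUND",
            "password", "ACCOUNT_RESET",
            "delivery", "ORDER_STATUS");

    // Map a raw user query to the first known intent, or UNKNOWN.
    public static String mapToIntent(String query) {
        String normalized = query.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : DICTIONARY.entrySet()) {
            if (normalized.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(mapToIntent("I forgot my password")); // ACCOUNT_RESET
    }
}
```

A production dictionary would sit behind the text parser and use lemmas rather than raw substrings, but the mapping idea is the same.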
Customer Support - How
The user keys in a query on a simple contact-us page. It is
first sent to the parser; if a low-score response is received,
it is discarded in favour of a pre-decided "we will get back to
you" response.

Components: web application, security shell (OAuth), text parser, dictionary.

1. Web application – a standard Spring based web application
2. Security Shell – an OAuth provider shell to help with
REST based security
3. Text Parser – the Stanford NLP Parser:
http://nlp.stanford.edu/software/lex-parser.shtml and the
CoreNLP package
4. Notes – dictionary maintenance and finding nouns/ subjects
are all covered by the standard documentation/ tutorials.
The tool also supports languages other than English.
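The low-score fallback described above can be sketched as follows. The threshold value, keyword list and the crude token-overlap score are assumptions for illustration; the real system scored queries via the Stanford parser and dictionary:

```java
public class BotResponder {

    static final double THRESHOLD = 0.5; // assumed cut-off, tuned per deployment
    static final String FALLBACK = "We will get back to you.";
    static final String[] KNOWN = {"refund", "invoice", "password", "delivery"};

    // Crude confidence: fraction of query tokens found in the dictionary.
    public static double score(String query) {
        String[] tokens = query.toLowerCase().split("\\s+");
        int hits = 0;
        for (String t : tokens) {
            for (String k : KNOWN) {
                if (t.contains(k)) { hits++; break; }
            }
        }
        return tokens.length == 0 ? 0.0 : (double) hits / tokens.length;
    }

    // Route to the bot only when the parser is confident enough.
    public static String respond(String query) {
        return score(query) >= THRESHOLD ? "Routing to bot: " + query : FALLBACK;
    }
}
```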
Learnings and Possible Uses
Learnings:
1. The dictionary is a very critical element; a well defined dictionary
will help identify subjects more easily, with the right scores.
2. Quality of data is the second key element; spelling mistakes,
ambiguous sentences and the emotions of the writer all play
different roles. A quick example is Porch/ Porsche: just a couple
of letters, but the meaning changes a lot.

Uses:
Other than customer support, a parser like this can also be used in
sentiment analysis or general text analysis.
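The Porch/ Porsche remark suggests guarding dictionary lookups with a fuzzy-match step, so near-miss spellings still resolve. A standard way to measure that is Levenshtein edit distance (this is my addition, not something the deck prescribes):

```java
public class EditDistance {

    // Classic Levenshtein distance via dynamic programming:
    // d[i][j] = edits needed to turn a[0..i) into b[0..j).
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

"porch" and "porsche" are only two insertions apart, close enough that a small distance threshold could flag them as the same intended word.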
Experiment 2: Scoring public data
Scenarios
All of us generate tons of public data, and
businesses can use it to profile us both as
existing and prospective customers. A better
profiled customer is better served, which can
lead to a longer term relationship.

Public data sources and signals:
– LinkedIn: employment verification, type of connections, recommendations
– Facebook: personal nature, interests
– Twitter: following and followed by, tweet sentiment/ text analysis, location data
– Blogs: text analysis, knowledge

Challenges:
1. Privacy – the user has to authorize access to
such data
2. Authenticity – people may have fake accounts
3. Volume – the sheer volume of such data may
make it difficult to analyze in a given time.
Social/ Public Scores
The experiment is simple: score an individual from
LinkedIn and Twitter data, which is further used in employability
checks.
There can be numerous other use cases for this service; the key
elements are:
1. Social networks – access to an account/ user's personal data
2. A learning database that allows the machine to create good/ bad/
neutral clusters from existing data
3. Choosing the right algorithm to identify the cluster

Data:
• LinkedIn: experience, connections, degrees used for scoring
• Twitter: tweets, followers etc. used for personal scoring
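The two per-network scores were ultimately combined by a final score aggregator (shown on the next slide). A minimal sketch of such an aggregator, with assumed weights and an assumed [0, 1] normalization, since the deck does not give the actual formula:

```java
public class FinalScoreAggregator {

    // Assumed weights: the professional (LinkedIn) signal dominates.
    static final double LINKEDIN_WEIGHT = 0.7;
    static final double TWITTER_WEIGHT  = 0.3;

    // Combine the two engine scores, each assumed normalized to [0, 1].
    public static double aggregate(double linkedInScore, double twitterScore) {
        return LINKEDIN_WEIGHT * linkedInScore + TWITTER_WEIGHT * twitterScore;
    }
}
```

In practice the weights themselves would be tuned against the training data set rather than fixed by hand.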
Social/ Public Scores - How
Components: web application, dictionary, Spring Social, Twitter
parser and score engine, LinkedIn parser and score engine,
training data set, final score aggregator.

1. Spring Social – a standard module from Spring that helps us
get data from social networks into Java applications very easily.
2. Parsers – once the data is in, we can write parsers/ formatters
to cleanse the data or move it into application defined standard
structures.
3. Twitter Score Engine – an extension of the textual analysis
tool, with the dictionary defining words that bring out substance
abuse/ gambling and other socially unacceptable characteristics.
4. LinkedIn Score Engine – the machine was pre-trained on some
sample data using standard dimensions provided by LinkedIn.
We used Encog and Weka.
5. Algorithm – we experimented with some basic machine
learning algorithms including Bayesian and K-Means, and also
tried fuzzy K-Means.
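For readers who have not seen the clustering step spelled out, here is a minimal one-dimensional K-Means, illustrating only the assign-then-update loop; the actual experiment ran Encog/ Weka on multi-dimensional LinkedIn features, so treat this as a didactic toy:

```java
import java.util.Arrays;

public class KMeans1D {

    // Assign each point to its nearest centroid, recompute centroids as
    // cluster means, and repeat until assignments stop changing.
    public static double[] cluster(double[] points, double[] initialCentroids) {
        double[] centroids = initialCentroids.clone();
        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // assignment step: nearest centroid wins
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(points[i] - centroids[c])
                            < Math.abs(points[i] - centroids[best])) best = c;
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // update step: centroid moves to the mean of its members
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] == c) { sum += points[i]; n++; }
                }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        Arrays.sort(centroids);
        return centroids;
    }
}
```

Fuzzy K-Means, also tried in the experiment, replaces the hard assignment with per-cluster membership weights.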
Learnings and Possible Uses
Learnings:
1. Privacy laws across countries do not allow access to such data, but
companies are circumventing this by launching mobile apps which
have access to everything on your smart phone.
2. To make a machine take sane decisions it is critical to have the right
training data; this data becomes all the more critical for qualitative
attributes.
3. If you do not have a data scientist/ statistician, you can play with
different algorithms. Genetic and neural algorithms may sound cool
but may not give the desired results.
4. Weka is a good tool to visualize the execution, and it can also be
used to select the right algorithm.

Uses:
This is a very generic public data profiling application; it can have uses in
banks, HR departments and many other places.
Experiment 3: Discovering cross sell opportunities
Scenarios
This is the most complicated of the three scenarios.
Large corporations have hundreds of different
products, millions of customers and thousands of
salesmen across geographies. What is it that an
existing customer will buy next, especially in an
enterprise product environment?

"Say a sales person is visiting a customer and he/ she
quickly wants to see what can be sold to this
customer."

Signals considered:
– Customer contact: inclination, connections, common friends,
decision authority, personal goals
– Customer support: current escalations, last change request,
service history
– Market/ region: licenses, products, installations, data on
customers in this market/ region, market maturity and state
– Location and price: where? cheap? average? luxury?

Challenges:
1. Aggregation – data is being aggregated from
public and private data stores
2. Time – the opportunity presentment window is
very short and a lot of data has to be crunched
3. Availability – at any time, any one of the
services may be down
Cross Selling
This is not a simple experiment; it is an aggregation of multiple public and
private data sources.
The key elements being:
1. Speed of decision/ suggestion
2. Availability of and access to multiple API based services (paid/ free)
3. Availability of enough data for the machine to have built up the
knowledge to take correct decisions

Data:
• LinkedIn: common connects
• Twitter: tweets, followers etc. used for personal profiling
• Jigsaw: company data
• Yahoo Finance API: market information
• Customer support: analysis of tickets
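With this many sources, each connector needs to land its payload in one standard structure before scoring. A hypothetical sketch of that unifying shape, with a stub in place of a real Yahoo Finance client; all names here are illustrative, not the real API:

```java
import java.util.List;
import java.util.Map;

// One contract for every source: fetch raw data, emit standard records.
interface SourceConnector {
    List<Map<String, String>> fetchAndFormat(String customerId);
}

public class ConnectorDemo {

    // Stub standing in for a real Yahoo Finance (or Jigsaw, Twitter...) client.
    static class StubFinanceConnector implements SourceConnector {
        @Override
        public List<Map<String, String>> fetchAndFormat(String customerId) {
            return List.of(Map.of(
                    "source", "finance",
                    "customer", customerId,
                    "marketCap", "1.2B"));
        }
    }

    public static void main(String[] args) {
        SourceConnector c = new StubFinanceConnector();
        System.out.println(c.fetchAndFormat("ACME").get(0).get("source"));
    }
}
```

The aggregator then only ever sees `List<Map<String, String>>` records, regardless of which paid or free API they came from.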
Cross Sell - How
Components: web application, dictionary, Spring Social, Twitter
parser and score engine, LinkedIn parser and score engine,
Yahoo connector & formatter, Jigsaw connector & formatter,
customer support data, training data set, final list of suggestions.

1. Previous modules – refer to the previous slides for a
description of the repeated modules.
2. Yahoo Connector – fetches data from the Yahoo Finance
API and formats some structured/ unstructured data
into more structured data which can be analyzed.
3. Jigsaw Connector – fetches Jigsaw company information
over API calls. Note: this API now looks to have moved
to data.com.
4. Final Suggestions – basically a quick aggregator of data with
inbuilt custom logic for scoring and location analysis; that is,
once we have the final list of contacts we overlay the sales
rep's location.
5. Algorithms – Text: a combination of noun & knowledge
extraction from free text using SOLR & NLP. Jigsaw: company
match to indicate closeness to the selected customer.
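The "overlay the sales rep's location" step in item 4 can be sketched as a distance-based re-ranking of the scored contacts. Plain Euclidean distance on coordinates is used here for brevity (a real system would use geodesic distance), and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SuggestionRanker {

    // A scored cross-sell contact with a location.
    record Contact(String name, double x, double y, double score) {}

    // Nearest contact first; ties broken by higher cross-sell score.
    public static List<Contact> rank(List<Contact> contacts,
                                     double repX, double repY) {
        List<Contact> out = new ArrayList<>(contacts);
        out.sort(Comparator
                .comparingDouble((Contact c) -> Math.hypot(c.x - repX, c.y - repY))
                .thenComparing(Comparator
                        .comparingDouble((Contact c) -> c.score).reversed()));
        return out;
    }
}
```

The design choice here is to keep scoring and location separate: the engines produce scores once, and the cheap geometric sort runs per visit, which suits the short presentment window.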
Learnings and Possible Uses
Learnings:
1. Data quality: leaving aside the complexity of integration and multiple
data sources, the quality of data and its importance in decision
making, especially in the enterprise world, was the critical learning.
2. No single solution: in most complicated real-world scenarios, there is
no one solution which will fit.
3. Agile: breaking the problem into several smaller problems made life
simpler.
4. Human judgment: whatever the machine may show to the sales rep,
if he/ she ignores it and decides to cross sell something else, that
has to go back to the machine as learning, else the intelligence will
slowly die away.

Uses:
Multiple; left to the imagination of the reader.
Learnings
Big Picture – Data Quality
Enterprise/ B2B World

Data entry is a cost center and also a cornerstone of
enterprise applications. The data that we use for
machines to learn from has mostly been captured by
humans over the years. Data entry is not the most
rewarding career, and people tend to make mistakes;
wrong addresses, figures and names are very
common. Focusing on the quality of data entry
reduces speed, which means reduced volumes.

Public/ B2C World

Imagine Amazon: when you buy a book, what data
does it capture about you? Clicks, geo-IP, browser,
products viewed/ liked/ bought/ searched, etc.,
some data from cookies, your past searches and
your profile. To place the order, most of us will give
the right address and phone number with the payment
information. As you notice, a lot of this data is machine
generated, which makes analysis more accurate.

Conclusion
• Curating data is possible, but it is important to balance quality, quantity and cost of data entry by designing
applications which strike the right balance between these.
• Master data management, data quality programs and data curation are all costly affairs if done late in the
enterprise.
• The aggregation of public and private data sets is a reality in today's world, and "identity", that is, identifying an
individual across these data sets, is also a real challenge.
Big Picture – Others
Machine Learning
How much, and what, is required to solve the
problem at hand? Reusing what is already done and
applying it to the business problem is good.

Big Data
Is not the same as data analysis; it can speed up
the analysis and may or may not be applicable to
your problem.

Integrations
Are the way of the future; all these mountains of
data will soon integrate.

Agility
Hit smaller chunks of doable work items and slowly
take down the larger beast.

Data
Data & data quality are tremendously important; a
few hundred bad apples can spoil a lot more.

Data Scientists
An important position in the overall picture;
complicated scoring/ analysis requires specialized
skills.
Tools & References
› Tools:
– The normal Spring JEE stack with many Spring modules has been
used to develop these applications
– Eclipse was used as the source code editor
– The other tools, like Stanford NLP, Encog and Weka, are listed with
links on the individual slides

› References:
– There are good courses on Coursera
– The Stanford, Weka and Encog websites also have a lot of reading
material
– Presentation template & graphics provided by Microsoft
Thank You
