Machine Learning
Algorithms
WALAA HAMDY ASSY
SOFTWARE DEVELOPER
Machine Learning Definition
Building a model from example inputs to make data-driven
predictions, rather than following strictly static program
instructions.
Machine learning is also often referred to as predictive
analytics or predictive modelling: the “computer’s ability
to learn without being explicitly programmed”.
Machine learning logic
• Gathering the data
• Format it
• Pass it to an algorithm
• The algorithm analyzes the data
• Create a model with a solution to solve the problem (see the sketch below)
[Diagram: Data → Algorithm → Data Analysis → Model]
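As a rough illustration of this flow, here is a minimal Spark ML sketch (the CSV path and the column names f1, f2 and label are hypothetical, and LogisticRegression stands in for whatever algorithm fits the problem):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

// Gather: load raw data (hypothetical path; assumes a numeric "label" column).
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/example.csv")

// Format: assemble the numeric columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Pass to an algorithm; it analyzes the data and creates a model.
val pipeline = new Pipeline()
  .setStages(Array(assembler, new LogisticRegression()))
val model = pipeline.fit(raw)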
Types of Machine Learning
Algorithms
• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement
SUPERVISED
• Value prediction
• Needs training data containing the value being predicted
• Trained model predicts the value in new data
UNSUPERVISED
• Identifies clusters of like data
• We do not have values; we try to figure them out
• Data does not have cluster membership
• Model provides access to data by cluster
Supervised
• In supervised learning, the machine is taught by example.
• The operator provides the machine learning algorithm with a known dataset that includes desired inputs and outputs,
• and the algorithm must find a method to arrive at those outputs from the inputs.
• The operator knows the correct answers to the problem.
• The algorithm identifies patterns in the data.
• The algorithm makes predictions and is corrected by the operator.
• This process continues until the algorithm achieves a high level of accuracy/performance.
Types of supervised learning
• Classification:
the machine learning program must draw a conclusion from observed values. For example, when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing observational data and filter the emails accordingly.
• Regression:
the machine learning program must estimate the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables, making it particularly useful for prediction and forecasting.
• Forecasting:
the process of making predictions about the future based on past and present data; it is commonly used to analyse trends.
Unsupervised learning
There is no answer key or human operator to provide instruction.
Instead, the machine determines correlations and relationships by analysing the available data.
• Clustering:
involves grouping sets of similar data (based on defined criteria). It is useful for segmenting data into several groups and performing analysis on each group to find patterns.
• Dimensionality reduction:
reduces the number of variables being considered, to find the exact information required (a minimal sketch follows this list).
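Dimensionality reduction does not appear in the code examples later in this deck, so here is a minimal sketch using Spark ML's PCA; the three 5-dimensional points are made up for illustration:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

// Three hypothetical 5-dimensional points.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 1.0, 8.0, 9.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

// Keep only the two directions with the most variance.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

pca.transform(df).select("pcaFeatures").show(false)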
Clusters of data: the algorithm analyzes the input data and identifies groups that share the same traits.
Workflow Guidelines
50% - 80% of the time is spent preparing the data.
Tidy datasets are easy to manipulate, model and visualize,
and have a specific structure:
• Each variable is a column
• Each observation is a row
• Each type of observational unit is a table
(Hadley Wickham)
Reinforcement learning
• The algorithm is provided with a set of actions, parameters and end values. Given these rules, it then tries to explore different options and possibilities, monitoring and evaluating each result to determine which one is optimal.
• Considered an approach to AI in its own right, so it is basically out of our scope.
Semi-supervised
• Some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
• It can be expensive or time-consuming to label data, as it may require experts, whereas unlabeled data is cheap and easy to collect and store.
Algorithm Decision Factors
Choosing the right machine learning algorithm depends on
several factors:
• data size
• quality and diversity
• what we want to derive from the data
• accuracy
• training time
• parameters
• learning type
• complexity
• result: classification or regression (some algorithms enable both)
• basic vs enhanced
In practice, the choice is a combination of business need, specification, experimentation and time available.
Algorithms
There are more than 50 algorithms, only 28 of which are supervised.
Naïve Bayes Classifier
Algorithm (Supervised Learning)
• The Naïve Bayes classifier is based on Bayes’ theorem and treats every feature as independent of every other feature. It allows us to predict a class/category, based on a given set of features, using probability.
• Simple and easy to understand
• Fast (reportedly up to 100x faster than more complex classifiers)
• Stable to data changes
• Every feature has the same weight.
Some real-world examples are:
• Mark an email as spam or not spam
• Classify a news article as technology, politics, or sports
• Check whether a piece of text expresses positive or negative emotions
• Also used in face recognition software
How does the Naive Bayes algorithm work?
• Let’s understand it using an example: a training data set of weather conditions and a corresponding target variable ‘Play’ (indicating whether playing is possible). We need to classify whether players will play or not based on the weather condition. Follow the steps below (a worked calculation follows this list).
• Step 1: Convert the data set into a frequency table.
• Step 2: Create a likelihood table by finding the probabilities, e.g.
• overcast probability = 0.29 and
• probability of playing = 0.64.
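As a worked sketch, assuming the classic 14-day weather data set these figures appear to come from (9 of the 14 days are ‘Play’, and 4 of those 9 play days are overcast), Bayes’ theorem gives:

$$P(\text{Play} \mid \text{Overcast}) = \frac{P(\text{Overcast} \mid \text{Play}) \, P(\text{Play})}{P(\text{Overcast})} \approx \frac{0.44 \times 0.64}{0.29} \approx 0.98$$

so an overcast day would be classified as ‘Play’.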
Applications of the Naive Bayes
Algorithm
• Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
• Multi-class prediction: this algorithm is also well known for its multi-class prediction feature; it predicts the probability of multiple classes of the target variable.
• Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
• Recommendation systems: a Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Scala Code Example
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
// Train a NaiveBayes model.
val model = new NaiveBayes().fit(trainingData)
// Select example rows to display.
val predictions = model.transform(testData)
predictions.show()
// Select (prediction, true label) and compute test accuracy.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
Support Vector Machine Algorithm
(Supervised Learning)
• SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data, and then, based on these transformations, it finds an optimal boundary between the possible outputs.
• It essentially filters data into categories. This is achieved by providing a set of training examples, each marked as belonging to one or the other of two categories; the algorithm then builds a model that assigns new values to one category or the other.
• In other words, it works by classifying the data into different classes by finding a separating line (hyperplane).
• SVMs can not only make reliable predictions but can also reduce redundant information.
Applications of SVM in the Real World
• Face detection – SVMs classify parts of an image as face or non-face and create a square boundary around the face.
• Text and hypertext categorization – SVMs use training data to classify documents into different categories, categorizing on the basis of a generated score compared against a threshold value.
• Classification of images – SVMs provide better search accuracy for image classification than traditional query-based searching techniques.
• Bioinformatics – includes protein classification and cancer classification. SVMs are used to classify genes, and patients on the basis of their genes, among other biological problems.
• Handwriting recognition – SVMs are widely used to recognize handwritten characters.
• Generalized predictive control (GPC)
The kernel trick is used to map non-linearly separable data into a higher-dimensional space where it becomes linearly separable.
[Figure: Kernel SVM vs. Linear SVM]
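As a small sketch of the idea: the polynomial kernel $K(x, z) = (x^\top z)^2$ on $\mathbb{R}^2$ corresponds to the implicit feature map

$$\phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right), \qquad \phi(x)^\top \phi(z) = (x^\top z)^2,$$

so data that is not linearly separable in the original space can become linearly separable in the mapped space, and the kernel lets the SVM work there without ever computing $\phi$ explicitly.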
Scala Code
import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)
// Print the coefficients and intercept for linear svc
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")
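A quick usage sketch (scoring the same training DataFrame purely for illustration; Spark appends a "prediction" column):

// Score data with the fitted model.
val predictions = lsvcModel.transform(training)
predictions.select("label", "prediction").show(5)

Note that LinearSVC fits a linear boundary only; Spark ML does not ship a kernelized SVM, so the kernel-trick variant described above would require a different library.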
Linear Regression - Supervised
Linear regression is the most basic type of regression.
• Simple linear regression allows us to understand the relationship between two continuous variables.
• If the dependent variable is not continuous but categorical, linear regression can be transformed into logistic regression.
It involves finding the line that best fits the training data (see the sketch below).
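As a sketch of what "best fit" means here: simple linear regression models the prediction as $\hat{y} = \beta_0 + \beta_1 x$ and chooses the coefficients that minimize the sum of squared residuals,

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2.$$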
Regularized regression models
• A technique used to improve regression
• Penalizes complex models; we use it to keep models simple so that they do not overfit
• Eliminates unimportant features
• Forces the model to stay simple
• Forces coefficients towards zero if they are not significant (see the objective below)
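As a sketch, the elastic-net objective used by Spark's LinearRegression (where regParam is $\lambda$ and elasticNetParam is $\alpha$, as set in the code that follows) is

$$\min_{\beta}\ \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \lambda\left(\alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2\right);$$

the L1 term is what drives insignificant coefficients to exactly zero.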
Applications
• Time series
• Economics
• Environmental science
• Financial forecasting
• Software cost prediction, effort prediction and software quality assurance
• Restructuring the budget of an organization or a country
• Predicting the crime rate of a state based on drug usage, number of gangs, human trafficking, and killings
import org.apache.spark.ml.regression.LinearRegression
// Load training data
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
Logistic Regression (Supervised
Learning – Classification)
• Simple, but performs well in many classification problems
• Logistic regression focuses on estimating the probability of an event occurring, based on the previously provided data.
• It is used to model a binary dependent variable, i.e. where only two values, 0 and 1, represent the outcomes.
• The name is confusing: “regression” suggests continuous values, but the result is binary (see the formula below).
• Relationships between features are weighted based on their impact on the result.
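As a sketch of why the output is continuous but the result is binary: logistic regression passes a weighted combination of the features through the sigmoid function,

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}},$$

which yields a probability in (0, 1) that is then thresholded (typically at 0.5) to produce the binary outcome.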
Applications of Logistic Regression
• Image segmentation and categorization
• Geographic image processing
• Handwriting recognition
• Healthcare
• Depression prediction
• Sentiment analysis, such as classifying good reviews from bad ones
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")
val mlrModel = mlr.fit(training)
// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")
Decision Trees (Supervised Learning
– Classification/Regression)
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
• It is one way to display an algorithm that only contains conditional control statements.
• It is a binary tree of nodes where every node contains a decision, so it is easy to visualize.
• Applications: its main purpose is classification.
PS: in MLlib it can be used as a regressor or a classifier (a minimal sketch follows).
The main hyperparameter to tune in a decision tree is the depth of the tree.
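A minimal sketch of a decision tree in Spark ML, in the same style as the other examples in this deck (the depth value and the 70/30 split are illustrative choices):

import org.apache.spark.ml.classification.DecisionTreeClassifier

// Load the same sample data used in the earlier examples.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// Train a tree, tuning its depth (the hyperparameter discussed above).
val dt = new DecisionTreeClassifier().setMaxDepth(5)
val dtModel = dt.fit(train)

dtModel.transform(test).select("label", "prediction").show(5)
// Printing the learned tree shows the decision at every node.
println(dtModel.toDebugString)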
Random Forests (Supervised Learning –
Classification/Regression): an extremely
powerful technique
• An ensemble learning method, combining multiple algorithms to generate better results for classification, regression and other tasks.
• Each individual classifier is weak, but when combined with others it can produce excellent results.
• The algorithm starts with a ‘decision tree’, and an input is entered at the top. It then travels down the tree, with the data being segmented into smaller and smaller sets, based on specific variables.
How does it work?
• Random forests are built on the basics of decision trees.
• The algorithm trains many decision trees at the same time; every tree has its own parameters and a different sample of the same dataset, and the outputs of all these trees are then combined:
• (a) Randomly select “K” features from the total “m” features, where K << m
• (b) Among the “K” features, calculate the node “d” using the best split point
• (c) Split the node into daughter nodes using the best split
• (d) Repeat steps (a) to (c) until “l” nodes have been reached
• (e) Build the forest by repeating steps (a) to (d) “n” times to create “n” trees
• Random forests perform well when the individual trees that make up the ensemble are as different as possible:
• they should be trained on different data and should have different parameters (a sketch follows this list).
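A minimal Spark ML sketch, mapping the knobs to the steps above (the specific values are illustrative):

import org.apache.spark.ml.classification.RandomForestClassifier

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

val rf = new RandomForestClassifier()
  .setNumTrees(20)                  // "n": how many trees to grow
  .setFeatureSubsetStrategy("sqrt") // pick K random features per split, K << m
  .setSubsamplingRate(0.8)          // each tree trains on a different sample

val rfModel = rf.fit(train)
rfModel.transform(test).select("label", "prediction").show(5)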
Applications
• Banking
• E-commerce
• Medicine
• Stock market
K Means Clustering Algorithm
(Unsupervised Learning - Clustering)
• Used to categorise unlabelled data.
• It works by finding groups within the data, with the number of groups represented by the variable K.
• It then works iteratively to assign each data point to one of the K groups based on the features provided.
• The results of the K-means clustering algorithm are:
• the centroids of the K clusters, which can be used to label new data, and
• labels for the training data (each data point is assigned to a single cluster).
K-means Applications
• Behavioural segmentation:
  • Segment by purchase history
  • Segment by activities on an application, website, or platform
  • Define personas based on interests
  • Create profiles based on activity monitoring
• Inventory categorization:
  • Group inventory by sales activity
  • Group inventory by manufacturing metrics
• Sorting sensor measurements:
  • Detect activity types in motion sensors
  • Group images
  • Separate audio
  • Identify groups in health monitoring
• Detecting bots:
  • Separate valid activity groups from bots
  • Group valid activity to clean up outlier detection
Scala code example
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Make predictions
val predictions = model.transform(dataset)
// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
Collaborative Filtering - Alternating
Least Squares (ALS)
• Used for recommender systems and personalized ranking
• The DataFrame-based API for ALS currently only supports integers for the user and item ids
• Not all of its functionality is available in the DataFrame-based ML API yet
• It handles two main kinds of feedback:
• Explicit feedback:
  • asking a user to rank a collection of items from favorite to least favorite, or similar
• Implicit feedback:
  • observing the items that a user views in an online store
  • analyzing item/user viewing times
  • keeping a record of the items that a user purchases online
A minimal sketch follows.
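A minimal sketch of ALS in Spark ML, modeled on the official example (the sample file path and the "::" separator are assumptions):

import org.apache.spark.ml.recommendation.ALS
import spark.implicits._

case class Rating(userId: Int, movieId: Int, rating: Float)

// Parse lines of the form userId::movieId::rating::timestamp.
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
  .map { line =>
    val f = line.split("::")
    Rating(f(0).toInt, f(1).toInt, f(2).toFloat)
  }.toDF()

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(false) // set to true for implicit feedback (views, clicks)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(ratings)

// Recommend the top 3 items for every user.
model.recommendForAllUsers(3).show(false)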
References
• https://spark.apache.org/docs/latest/ml-guide.html (ML guide – DataFrame-based API)
• https://spark.apache.org/docs/latest/mllib-guide.html (MLlib guide – RDD-based API)
• https://archive.ics.uci.edu/ml/index.php (repository with more than 440 datasets)
• https://data-flair.training/blogs/machine-learning-algorithm/
• http://www.cubicsol.com/machine-learning-algorithms/
• https://en.wikipedia.org