Unit 1: Introduction to Machine Learning

Lecture   Topic                                                      Hours
1         Classic and adaptive machines, Machine learning matters   01
2         Beyond machine learning: deep learning                    01
3         Bio-inspired adaptive systems                             01
4         Machine learning and Big Data                             01
5         Important elements of Machine Learning: data formats,     01
          learnability
6         Statistical learning approaches                           01
7         Elements of information theory                            01

Total hours = 07
Introduction to Machine Learning
Life before Machine Learning
Why ML: Perks
Classic and adaptive machines
• A machine is never efficient or appealing without a concrete possibility of using it pragmatically.
• A machine is immediately considered useful, and destined to be continuously improved, if its users can easily understand which tasks can be completed with less effort or completely automatically.
Generic representation of a classical system
Adaptive system
Machine learning matters
• Learning is the ability to change according to external stimuli, remembering most previous experiences.
• Machine learning is an engineering approach that gives maximum importance to every technique that increases or improves the propensity for changing adaptively.
Goal of ML
• To study, engineer, and improve mathematical models that can be trained (once or continuously) with context-related data (provided by a generic environment), in order to infer the future and make decisions without complete knowledge of all influencing elements (external factors).
What is Machine Learning?
Goal of ML
• In other words, an agent (a software entity that receives information from an environment, picks the best action to reach a specific goal, and observes the results) adopts a statistical learning approach, trying to determine the right probability distributions and using them to compute the action (value or decision) that is most likely to be successful (with the least error).
Types of Machine Learning
Supervised learning
• A supervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with the output values).
• In a supervised scenario, the goal is to train a system that must also work with samples never seen before. It's therefore necessary to let the model develop a generalization ability and avoid a common problem called overfitting, which causes over-learning due to excessive capacity. A minimal sketch of this workflow follows below.
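A minimal sketch of the supervised workflow in Python with scikit-learn: a labeled dataset, a held-out test split, and an error measure on samples never seen during training. The dataset and model here are illustrative assumptions, not part of the original slides.

```python
# Hypothetical illustration: train on labeled samples, then measure the
# error on samples never seen during training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples to estimate the generalization ability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # the "supervisor" provides y_train

# A large gap between the two scores would suggest overfitting
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```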
Supervised Learning
Common supervised learning applications include:
• Predictive analysis based on regression or
categorical classification
• Spam detection
• Pattern detection
• Natural Language Processing
• Sentiment analysis
• Automatic image classification
• Automatic sequence processing (for example,
music or speech)
Unsupervised learning
• This approach is based on the absence of any supervisor, and therefore of absolute error measures; it's useful when it's necessary to learn how a set of elements can be grouped (clustered) according to their similarity (or a distance measure). A minimal clustering sketch follows the application list below.
Unsupervised Learning
Common unsupervised applications include:
• Object segmentation (for example, users,
products, movies, songs, and so on)
• Similarity detection
• Automatic labeling
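A minimal clustering sketch, assuming scikit-learn and synthetic data (both illustrative choices, not from the slides): the algorithm groups samples by distance alone, with no labels and no supervisor.

```python
# Hypothetical illustration: group unlabeled points by similarity.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic, unlabeled data drawn from three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# No target values are passed to fit_predict(): the grouping relies
# only on the Euclidean distance between samples.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", kmeans.cluster_centers_)
```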
Reinforcement Learning
Reinforcement Learning
• Reinforcement learning is also based on feedback provided by the environment.
• However, the information is more qualitative and doesn't help the agent determine a precise measure of its error. In reinforcement learning, this feedback is usually called a reward (sometimes a negative one is defined as a penalty), and it's useful for understanding whether a certain action performed in a state is positive or not.
Reinforcement Learning
• The sequence of the most useful actions is a policy that the agent has to learn, so as to always be able to make the best decision in terms of the highest immediate and cumulative reward. In other words, an action can be imperfect, yet in terms of the global policy it has to offer the highest total reward.
• Reinforcement learning is particularly effective when the environment is not completely deterministic, when it's often very dynamic, and when it's impossible to have a precise error measure. A toy sketch of reward-driven learning follows this list.
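A toy sketch of reward-driven policy learning, assuming a tiny corridor world and tabular Q-learning (both illustrative assumptions, not from the slides): the agent never receives an error measure, only rewards, and gradually learns which action is best in each state.

```python
# Hypothetical illustration: tabular Q-learning on a 1-D corridor.
# States 0..4; reaching state 4 yields reward +1, every step costs -0.01.
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                      # move left or right
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit the current policy, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else -0.01      # qualitative feedback only
        # Q-learning update: move the estimate toward reward + discounted future
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned policy: the action with the highest Q-value in each state
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print(policy)   # expected: +1 (move right) in every non-goal state
```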
Reinforcement Learning: Example
Machine Learning algorithms
Learning types
Beyond machine learning: deep learning and bio-inspired adaptive systems
• Complex (deep) neural architectures: deep learning
– Rosenblatt invented the first perceptron, sparking interest in neural networks
• The idea behind these techniques is to create algorithms that work like a brain
• Common deep learning applications include:
– Image classification
– Real-time visual tracking
– Autonomous car driving
– Logistics optimization
– Bioinformatics
– Speech recognition
Machine learning and Big Data
• The amount of information managed in different business contexts has grown exponentially.
• The opportunity to use it for machine learning purposes arose.
• It's possible to asynchronously train several local models, periodically share the updates, and re-synchronize them all with a master model.
• Not every machine learning problem is suitable for big data, and not all big datasets are really useful when training models.
Parametric learning
• When a vector-valued function depends on an internal parameter vector that determines the actual instance of a generic predictor, the approach is called parametric learning:
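The expression itself is missing from the extracted slide; a standard reconstruction of a generic parametric predictor, where the internal parameter vector θ̄ selects the instance, is:

```latex
\tilde{y} = f(\bar{x}; \bar{\theta}), \qquad \bar{\theta} \in \mathbb{R}^{k}
```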
Non-parametric learning
• Doesn't make initial assumptions about the family of predictors.
• Also called instance-based learning: it makes real-time predictions (without pre-computing parameter values) based on hypotheses determined only by the training samples (the instance set).
• The concept of neighborhoods applies to classification: in a classification problem, a new sample is automatically surrounded by classified training elements, and the output class is determined by considering the preponderant one in the neighborhood (see the sketch below).
• E.g., kernel-based support vector machines.
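A minimal sketch of neighborhood-based classification, assuming scikit-learn's k-nearest-neighbors classifier (an illustrative choice; the slides cite kernel-based SVMs as another non-parametric example): no parameter vector is learned, and predictions come directly from the stored training instances.

```python
# Hypothetical illustration: instance-based (non-parametric) classification.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# fit() only stores the instance set; there is no parameter optimization.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Each prediction is the preponderant class among the 5 nearest
# training samples surrounding the new point.
print("Test accuracy:", knn.score(X_test, y_test))
```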
• A generic parametric training process must find the best parameter vector that minimizes the regression/classification error on a specific training dataset, and it should also generate a predictor that can correctly generalize when unknown samples are provided.
• Supervised learning
– Regression and classification
• Single value / single label
• Multi-value: multi-label classification and multi-output regression
• Unsupervised learning
– We only have an input set X of m-length vectors, and we define a clustering function (with n target clusters) with the following expression:
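The expression itself was an image on the original slide; a standard reconstruction consistent with the text, mapping each m-dimensional sample to one of the n cluster indices, is:

```latex
k_{n}(\bar{x}) : \mathbb{R}^{m} \rightarrow \{1, 2, \dots, n\}
```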
IMPORTANT ELEMENTS IN MACHINE LEARNING
two possibilities
• If we expect future data to be exactly distributed as the training samples, a more complex model can be a good choice, to capture small variations that a lower-level one will discard. In this case, a linear (or lower-level) model will drive to underfitting, because it won't be able to capture an appropriate level of expressivity.
• If we think that future data can be locally distributed differently but keeps a global trend, it's preferable to have a higher residual misclassification error as well as a more precise generalization ability. Using a bigger model focused only on the training data can drive to overfitting.
Underfitting and overfitting
• The purpose of a machine learning model is to approximate an unknown function that associates input elements with output ones (classes).
• Underfitting: the model isn't able to capture the dynamics shown by the training set itself (probably because its capacity is too limited).
• Overfitting: the model has an excessive capacity and is no longer able to generalize considering the original dynamics provided by the training set. It can associate almost perfectly all the known samples with the corresponding output values, but when an unknown input is presented, the corresponding prediction error can be very high.
• Low capacity (underfitting):
– easier to detect by considering the prediction error on the training set itself
• Normal capacity (normal fitting):
– may prove more difficult to discover, as it could initially be considered the result of a perfect fitting
• Excessive capacity (overfitting):
– training performance looks almost perfect, so it is typically discovered only by checking predictions on samples never seen during training (see the sketch below)
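A sketch of the three regimes, assuming scikit-learn and a noisy sine toy dataset (both illustrative choices, not from the slides): a degree-1 polynomial underfits, a moderate degree fits normally, and a very high degree drives the training error toward zero while the test error grows.

```python
# Hypothetical illustration: model capacity vs. fitting regime.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)  # noisy sine

X_train, y_train = X[::2], y[::2]      # even samples for training
X_test,  y_test  = X[1::2], y[1::2]    # odd samples held out

for degree in (1, 4, 15):              # low / normal / excessive capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d} "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.4f} "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.4f}")
```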
Data formats
• In a supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each:
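The set itself was an image on the slide; a standard reconstruction consistent with the text is:

```latex
X = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n\}, \qquad \bar{x}_i \in \mathbb{R}^{m}
```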
• Since our approach is always probabilistic, we consider each X as drawn from a statistical multivariate distribution D.
• Given the dataset X, we expect all samples to be independent and identically distributed (i.i.d.).
• This means all variables belong to the same distribution D and, considering an arbitrary subset of m values, it happens that:
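The factorization itself was an image on the slide; the standard i.i.d. property it refers to is:

```latex
p(\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m) = \prod_{i=1}^{m} p(\bar{x}_i)
```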
• The corresponding output values can be either numerical-continuous or categorical.
• In the first case, the process is called regression, while in the second, it is called classification.
Examples of numerical (continuous) outputs and categorical labels
• Generic regressor: a vector-valued function that associates an input value with a continuous output.
• Generic classifier: a vector-valued function whose predicted output is categorical (discrete).
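Both signatures were images on the slides; standard reconstructions consistent with the definitions are:

```latex
\text{regressor: } \tilde{y} = r(\bar{x}),\; \tilde{y} \in \mathbb{R}
\qquad
\text{classifier: } \tilde{y} = c(\bar{x}),\; \tilde{y} \in \{c_1, c_2, \dots, c_k\}
```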
Multiclass strategies
• Used when the number of output classes is greater than one.
• There are two main possibilities to manage a multiclass classification problem:
– One-vs-all
– One-vs-one
• The choice is transparent, and the output returned to the user will always be the final value or class.
One-vs-one
• Trains a model for each pair of classes.
• Complexity is O(n²).
• The right class is determined by a majority vote.
• This choice is more expensive and should be adopted only when a full dataset comparison is not preferable.
One-vs-all
• Widely adopted by scikit-learn.
• If there are n output classes, n classifiers are trained in parallel, each separating one actual class from all the remaining ones.
• A lightweight approach (O(n) complexity); a sketch of both strategies follows below.
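A minimal sketch of both strategies, assuming scikit-learn's OneVsRestClassifier and OneVsOneClassifier meta-estimators from sklearn.multiclass; the base model and dataset are illustrative choices.

```python
# Hypothetical illustration: one-vs-all vs. one-vs-one multiclass wrappers.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One-vs-all: n binary classifiers, one per class (O(n))
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("one-vs-all models: ", len(ovr.estimators_))   # 3

# One-vs-one: one classifier per pair of classes (O(n^2)), majority vote
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("one-vs-one models: ", len(ovo.estimators_))   # 3*(3-1)/2 = 3 here

# Either way, the user transparently receives the final class
print(ovr.predict(X[:1]), ovo.predict(X[:1]))
```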
Learnability
• A parametric model can be split into two parts: a static structure and a dynamic set of parameters.
• The static structure is determined by the choice of a specific algorithm and is normally immutable.
• The dynamic set of parameters is the objective of our optimization.
Learnability
• Considering n unbounded parameters, they generate an n-dimensional space where each point, together with the immutable part of the estimator function, represents a learning hypothesis H.
• The goal of a parametric learning process is to find the best hypothesis, whose corresponding prediction error is minimal and whose residual generalization ability is enough to avoid overfitting.
• The dataset X is linearly separable (without transformations) if there exists a hyperplane that divides the space into two subspaces, each containing only elements belonging to the same class.
• Cross-validation and other techniques easily show how our model works with test samples never seen during the training phase; a minimal sketch follows below.
• A generic rule of thumb says that a residual error is always necessary to guarantee a good generalization ability, while a model that shows an accuracy of 99.999... percent on the training samples is almost surely overfitted and will likely be unable to predict correctly on never-seen input samples.
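A minimal cross-validation sketch, assuming scikit-learn's cross_val_score helper from sklearn.model_selection; the model and dataset are illustrative choices.

```python
# Hypothetical illustration: estimate generalization with k-fold CV.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is scored on samples never used to fit that fold
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```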
Error measures
• In general, when working with a supervised scenario, we define a non-negative error measure e_m which takes two arguments (expected and predicted output) and allows us to compute a total error value over the whole dataset (made up of n samples):
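The summation itself was an image on the slide; a standard reconstruction, with ỹᵢ the predicted and yᵢ the expected output, is:

```latex
E = \sum_{i=1}^{n} e_m\!\left(\tilde{y}_i, y_i\right), \qquad e_m(\cdot, \cdot) \ge 0
```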
Loss function
• It's useful to consider the mean square error (MSE), reconstructed after this list.
• A generic training algorithm has to find the global minimum, or a point quite close to it (there's always a tolerance, to avoid an excessive number of iterations and a consequent risk of overfitting).
• This measure is also called a loss function, because its value must be minimized through an optimization problem.
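The MSE expression itself was an image; the standard form consistent with the surrounding text is:

```latex
MSE = \frac{1}{n} \sum_{i=1}^{n} \left(\tilde{y}_i - y_i\right)^2
```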
Zero-one loss
• Efficient for binary classifications (also with the one-vs-rest multiclass strategy); see the reconstruction and sketch below.
• Adopted in loss functions based on the probability of misclassification.
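The zero-one loss assigns a unit penalty to each misclassified sample and nothing otherwise; the standard per-sample form (the slide's expression was an image) is:

```latex
L_{0/1}(\tilde{y}, y) =
\begin{cases}
0 & \text{if } \tilde{y} = y \\
1 & \text{if } \tilde{y} \neq y
\end{cases}
```

A quick sketch computing both measures with NumPy (an illustrative choice, not from the slides):

```python
# Hypothetical illustration: MSE and zero-one loss over a small dataset.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])            # two mistakes

mse = np.mean((y_pred - y_true) ** 2)         # mean square error
zero_one = np.mean(y_pred != y_true)          # fraction misclassified

print("MSE:", mse)            # 0.4
print("zero-one:", zero_one)  # 0.4
```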
Statistical learning approaches
• Imagine that you need to design a spam-filtering algorithm, starting from this initial (over-simplistic) classification based on two parameters, p1 and p2.
• We have collected 200 email messages (X), with p1 and p2 mutually exclusive.
• The goal is to find a couple of probabilistic hypotheses that determine the probability that a message is spam given the observed parameters (see the sketch below).
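A minimal sketch of this statistical approach, using Bayes' rule with invented counts for the 200 messages (the slide's actual figures are not in the extracted text):

```python
# Hypothetical illustration: estimate P(spam | parameter) from counts.
# All counts below are invented for the example; the slide's real
# numbers are not available in the extracted text.
n_total = 200
n_spam = 80                   # messages labeled spam
n_p1_given_spam = 60          # spam messages where p1 is observed
n_p1_given_ham = 12           # non-spam messages where p1 is observed

p_spam = n_spam / n_total
p_ham = 1 - p_spam
p_p1_spam = n_p1_given_spam / n_spam
p_p1_ham = n_p1_given_ham / (n_total - n_spam)

# Bayes' rule: P(spam | p1) = P(p1 | spam) * P(spam) / P(p1)
p_p1 = p_p1_spam * p_spam + p_p1_ham * p_ham
p_spam_given_p1 = p_p1_spam * p_spam / p_p1
print("P(spam | p1) = %.3f" % p_spam_given_p1)   # ~0.833 with these counts
```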
