Transfer learning aims to improve learning in a target domain by leveraging knowledge from a related source domain. It is useful when the target domain has limited labeled data. There are several approaches, including instance-based approaches that reweight or resample source instances, and feature-based approaches that learn a transformation to align features across domains. Spectral feature alignment is one technique that builds a graph of correlations between pivot features shared across domains and domain-specific features, then applies spectral clustering to derive new shared features.
A psychological point of view
• Transfer of Learning in Education and Psychology
– The study of dependency of human conduct,
learning or performance on prior experience.
– [Thorndike and Woodworth, 1901] explored how individuals transfer learning from one context to another context that shares similar characteristics.
• E.g.
– C++ → Java
– Math/Physics → Computer Science/Economics
Transfer Learning
In the machine learning community
• The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks that share some commonality.
• Given a new (target) domain/task, how can knowledge from previous domains/tasks be transferred to it?
• Key:
– Representation Learning, Change of Representation
Why Transfer Learning?
[Diagram: labeled training data from the source domain and unlabeled data (or a few labeled examples) from the target domain are fed into transfer learning algorithms to build predictive models, which are then tested on target domain data.]
[Example domain pairs: Electronics vs. DVD reviews; Time Period A vs. Time Period B; Device A vs. Device B.]
Transfer Learning
Different fields
• Transfer learning for
reinforcement learning.
[Taylor and Stone, Transfer
Learning for Reinforcement
Learning Domains: A Survey,
JMLR 2009]
• Transfer learning for classification and regression problems.
[Pan and Yang, A Survey on
Transfer Learning, IEEE TKDE
2010]
Focus!
Difference in Representation
Electronics (reviews 1, 3, 5) vs. Video Games (reviews 2, 4, 6)
(1) Compact; easy to operate;
very good picture quality;
looks sharp!
(2) A very good game! It is
action packed and full of
excitement. I am very much
hooked on this game.
(3) I purchased this unit from
Circuit City and I was very
excited about the quality of the
picture. It is really nice and
sharp.
(4) Very realistic shooting
action and good plots. We
played this and were hooked.
(5) It is also quite blurry in
very dark settings. I will never
buy HP again.
(6) The game is so boring. I
am extremely unhappy and will
probably never buy UbiSoft
again.
Instance-based Approaches
Case I (cont.)
Correcting Sample Selection Bias / Covariate Shift
[Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
Instance-based Approaches
Correcting sample selection bias (cont.)
• The distribution of the selector variable maps the target distribution onto the source distribution.
! Label instances from the source domain with label 1
! Label instances from the target domain with label 0
! Train a binary classifier
[Zadrozny, ICML-04]
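The following minimal sketch (not the exact procedure of [Zadrozny, ICML-04]) illustrates the idea: train a probabilistic classifier to separate the two domains and turn its output into importance weights for the source instances. The function name and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def domain_classifier_weights(X_src, X_tgt):
    """Estimate importance weights w(x) ~ P_tgt(x) / P_src(x) for source instances
    by training a classifier that distinguishes source (label 1) from target (label 0)."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_src = clf.predict_proba(X_src)[:, 1]              # P(domain = source | x)
    # By Bayes' rule, P_tgt(x)/P_src(x) = [P(target|x)/P(source|x)] * [n_src/n_tgt]
    return (1.0 - p_src) / np.clip(p_src, 1e-6, None) * (len(X_src) / len(X_tgt))
```

The resulting weights can then be plugged into any learner that accepts per-instance weights (e.g., `sample_weight` in scikit-learn).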
Instance-based Approaches
Kernel mean matching (KMM)
Maximum Mean Discrepancy (MMD)
[Alex Smola, Arthur Gretton and Kenji Kukumizu, ICML-08 tutorial]
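For intuition, here is a small sketch of the empirical (biased) MMD estimate with an RBF kernel; KMM additionally solves a quadratic program for the source-instance weights, which is omitted here. The bandwidth `gamma` is an assumed hyperparameter.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(X_src, X_tgt, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two samples under an RBF kernel:
    || mean_phi(source) - mean_phi(target) ||^2 in the kernel feature space."""
    k_ss = rbf_kernel(X_src, X_src, gamma=gamma)
    k_tt = rbf_kernel(X_tgt, X_tgt, gamma=gamma)
    k_st = rbf_kernel(X_src, X_tgt, gamma=gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()
```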
Instance-based Approaches
Direct density ratio estimation
– KL-divergence loss: [Sugiyama et al., NIPS-07]
– Least-squares loss: [Kanamori et al., JMLR-09]
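A rough sketch of the least-squares variant, in the spirit of [Kanamori et al., JMLR-09] (uLSIF): model the density ratio as a linear combination of Gaussian basis functions and fit the coefficients in closed form. The basis count, bandwidth, and regularisation strength are assumed hyperparameters.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def ulsif_weights(X_src, X_tgt, n_basis=100, gamma=1.0, lam=1e-3, seed=0):
    """Direct density-ratio estimation: w(x) = sum_l alpha_l * k(x, c_l),
    fitted by minimizing a regularized squared error against the true ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_tgt), size=min(n_basis, len(X_tgt)), replace=False)
    centers = X_tgt[idx]                                  # basis centres on target points
    phi_src = rbf_kernel(X_src, centers, gamma=gamma)
    phi_tgt = rbf_kernel(X_tgt, centers, gamma=gamma)
    H = phi_src.T @ phi_src / len(X_src)                  # approx. E_src[phi phi^T]
    h = phi_tgt.mean(axis=0)                              # approx. E_tgt[phi]
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(phi_src @ alpha, 0.0)               # estimated ratios on source points
```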
Instance-based Approaches
Case II
• Intuition: part of the labeled data in the source domain can be reused in the target domain after re-weighting.
Instance-based Approaches
Case II (cont.)
– TrAdaBoost [Dai et al., ICML-07]
– For each boosting iteration:
  • Use the same strategy as AdaBoost to update the weights of target domain data.
  • Use a new mechanism to decrease the weights of misclassified source domain data (a sketch of this weight update follows below).
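A sketch of the weight update inside one boosting round, following the description above; this is an illustrative reading of [Dai et al., ICML-07], with the weak learner and the final ensemble vote omitted.

```python
import numpy as np

def tradaboost_weight_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost-style weight update.
    err_src / err_tgt are 0/1 error indicators of the current weak learner
    on the source and target instances, respectively."""
    eps = np.clip(np.sum(w_tgt * err_tgt) / np.sum(w_tgt), 1e-10, 0.499)
    beta_t = eps / (1.0 - eps)                                    # AdaBoost factor (target)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(w_src)) / n_rounds))
    w_tgt = w_tgt * beta_t ** (-err_tgt)    # target: boost misclassified instances up
    w_src = w_src * beta ** err_src         # source: shrink misclassified instances
    return w_src, w_tgt
```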
Feature-based Transfer Learning
Approaches
When the source and target domains only have some overlapping features (many features have support only in the source domain or only in the target domain).
Feature-based Transfer Learning
Approaches (cont.)
How to learn the transformation?
! Solution 1: Encode application-specific
knowledge to learn the transformation.
! Solution 2: General approaches to learning the
transformation.
Feature-based Approaches
Encode application-specific knowledge
Electronics Video Games
(1) Compact; easy to operate;
very good picture quality;
looks sharp!
(2) A very good game! It is
action packed and full of
excitement. I am very much
hooked on this game.
(3) I purchased this unit from
Circuit City and I was very
excited about the quality of the
picture. It is really nice and
sharp.
(4) Very realistic shooting
action and good plots. We
played this and were hooked.
(5) It is also quite blurry in
very dark settings. I will
never_buy HP again.
(6) The game is so boring. I
am extremely unhappy and will
probably never_buy UbiSoft
again.
Feature-based Approaches
Encode application-specific knowledge (cont.)
Training (Electronics):
compact | sharp | blurry | hooked | realistic | boring
   1    |   1   |   0    |   0    |     0     |   0
   0    |   1   |   0    |   0    |     0     |   0
   0    |   0   |   1    |   0    |     0     |   0

Learned classifier: y = f(x) = sgn(w · x), with w = [1, 1, -1, 0, 0, 0]^T

Prediction (Video Games):
compact | sharp | blurry | hooked | realistic | boring
   0    |   0   |   0    |   1    |     0     |   0
   0    |   0   |   0    |   1    |     1     |   0
   0    |   0   |   0    |   0    |     0     |   1
Feature-based Approaches
Encode application-specific knowledge (cont.)
! Three different types of features
! Source domain (Electronics) specific features, e.g.,
compact, sharp, blurry
! Target domain (Video Game) specific features, e.g.,
hooked, realistic, boring
! Domain independent features (pivot features), e.g.,
good, excited, nice, never_buy
Feature-based Approaches
Encode application-specific knowledge (cont.)
– How to identify pivot features?
  • Term frequency in both domains
  • Mutual information between features and labels (source domain)
  • Mutual information between features and domains
  (A small sketch of computing these signals follows below.)
– How to utilize pivots to align features across domains?
  • Structural Correspondence Learning (SCL) [Blitzer et al., EMNLP-06]
  • Spectral Feature Alignment (SFA) [Pan et al., WWW-10]
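As a small illustration of the identification criteria above (a sketch only; the actual pivot selection in SCL/SFA involves further details), the ranking signals can be computed directly from dense binary document-term matrices. The function name and the dense-array assumption are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def pivot_signals(X_src, X_tgt, y_src):
    """Per-feature signals for pivot selection:
    document frequency in each domain and mutual information with the source labels.
    X_src, X_tgt: dense binary document-term matrices; y_src: source labels."""
    df_src = X_src.sum(axis=0)                       # term frequency in the source domain
    df_tgt = X_tgt.sum(axis=0)                       # term frequency in the target domain
    mi = np.array([mutual_info_score(y_src, X_src[:, j] > 0)
                   for j in range(X_src.shape[1])])  # MI(feature, label) on the source
    return df_src, df_tgt, mi
```

How these signals are combined into a final pivot list (thresholds, top-k, etc.) is method-specific.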
Feature-based Approaches
Spectral Feature Alignment (SFA)
! Intuition
# Use a bipartite graph to model the correlations
between pivot features and other features
# Discover new shared features by applying
spectral clustering techniques on the graph
– If two domain-specific words have connections to more common pivot words in the graph, they tend to be aligned or clustered together with a higher probability.
! If two pivot words have connections to more common domain-specific words in
the graph, they tend to be aligned together with a higher probability.
Spectral Feature Alignment (SFA)
High level idea
[Bipartite graph example: pivot features (exciting, good, never_buy) on one side; domain-specific features on the other (sharp, compact, blurry from Electronics; hooked, realistic, boring from Video Games); edges carry co-occurrence counts.]
Spectral Feature Alignment (SFA)
Derive new features (cont.)

Training (Electronics):
sharp/hooked | compact/realistic | blurry/boring
      1      |         1         |       0
      1      |         0         |       0
      0      |         0         |       1

Learned classifier: y = f(x) = sgn(w · x), with w = [1, 1, -1]^T

Prediction (Video Game):
sharp/hooked | compact/realistic | blurry/boring
      1      |         0         |       0
      1      |         1         |       0
      0      |         0         |       1
Spectral Feature Alignment (SFA)
1. Identify P pivot features.
2. Construct a bipartite graph between the pivot and remaining features.
3. Apply spectral clustering on the graph to derive new features.
4. Train classifiers on the source using augmented features (original features + new features).
(A simplified code sketch of steps 2-3 follows below.)
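A simplified sketch of steps 2-3, using the standard bipartite spectral co-clustering trick as a stand-in for the exact construction in [Pan et al., WWW-10]; `k` and the co-occurrence matrix construction are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def sfa_embedding(M, k=20, eps=1e-12):
    """Spectral alignment on a bipartite graph.
    M: (n_domain_specific x n_pivot) co-occurrence counts between domain-specific
    features (rows) and pivot features (columns).
    Returns a k-dimensional embedding of the domain-specific features; a document's
    new shared features can then be obtained as x_domain_specific @ U."""
    d1 = np.maximum(M.sum(axis=1), eps)                         # degrees of domain-specific nodes
    d2 = np.maximum(M.sum(axis=0), eps)                         # degrees of pivot nodes
    M_norm = (M / np.sqrt(d1)[:, None]) / np.sqrt(d2)[None, :]  # D1^{-1/2} M D2^{-1/2}
    U, s, Vt = svds(M_norm, k=k)                                # top-k singular vectors
    return U / np.sqrt(d1)[:, None]                             # spectral embedding of the rows
```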
Feature-based Approaches
Transfer Component Analysis (TCA) [Pan et al., IJCAI-09, TNN-11]
Motivation
[Diagram: the source and target domain data are generated from shared latent factors such as temperature, signal properties, building structure, and power of access points (APs).]
Transfer Component Analysis (cont.)
[Diagram (cont.): some of the latent factors differ between the two domains, which causes the data distributions of the two domains to be different.]
Transfer Component Analysis (cont.)
[Diagram (cont.): among the latent factors, shared ones such as signal properties and building structure act as principal components to keep, while noisy components should be discarded.]
Transfer Component Analysis (cont.)
Main idea: the learned components should map the source and target domain data to the latent space spanned by the factors that reduce the domain difference and preserve the original data structure.
High level optimization problem
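The optimization problem in [Pan et al., IJCAI-09, TNN-11] has roughly the following form (notation paraphrased): with $K$ the kernel matrix over the combined source and target data, $L$ the MMD coefficient matrix, $H$ the centering matrix, and $W$ the mapping onto the transfer components,

$$\min_{W}\;\operatorname{tr}\!\left(W^{\top} K L K W\right) + \mu\,\operatorname{tr}\!\left(W^{\top} W\right)
\quad \text{s.t.} \quad W^{\top} K H K W = I,$$

where $L_{ij} = 1/n_s^2$ if $x_i, x_j$ are both from the source, $1/n_t^2$ if both are from the target, and $-1/(n_s n_t)$ otherwise, and $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}$. The first term is the MMD between the mapped source and target data, the regularizer controls complexity, and the constraint preserves the data variance; the solution is given by the leading eigenvectors of $(K L K + \mu I)^{-1} K H K$.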
Transfer Component Analysis (cont.)
An illustrative example
[Figure: data shown in the original feature space vs. latent features learned by PCA vs. latent features learned by TCA.]
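A compact sketch of the eigen-solution above (a simplified dense linear-algebra version, not the authors' implementation); the kernel choice, `mu`, and the number of components are assumed hyperparameters.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def tca(X_src, X_tgt, n_components=10, gamma=1.0, mu=1.0):
    """Transfer Component Analysis (simplified): embed source and target data
    in a shared latent space that reduces the MMD between them."""
    ns, nt = len(X_src), len(X_tgt)
    n = ns + nt
    K = rbf_kernel(np.vstack([X_src, X_tgt]), gamma=gamma)
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                        # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    A = K @ L @ K + mu * np.eye(n)
    B = K @ H @ K
    vals, vecs = eigh(B, A)                   # generalized eigenproblem B w = lambda A w
    W = vecs[:, np.argsort(-vals)[:n_components]]
    Z = K @ W                                 # embeddings of all points
    return Z[:ns], Z[ns:]
```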
Feature-based Approaches
Self-taught Feature Learning (Andrew Ng et al.)
! Intuition: Useful higher-level features can be learned from
unlabeled data.
! Steps:
1) Learn higher-level features from a lot of unlabeled data.
2) Use the learned higher-level features to represent the data of the
target task.
3) Train models from the new representations of the target task
(supervised)
– How to learn higher-level features?
  • Sparse Coding [Raina et al., 2007] (a minimal sketch follows below)
  • Deep learning [Glorot et al., 2011]
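A minimal sketch of steps 1)-2) with sparse coding, using scikit-learn's dictionary learning as a stand-in for the procedure in [Raina et al., 2007]; the number of atoms and the sparsity penalty are assumed hyperparameters.

```python
from sklearn.decomposition import DictionaryLearning

def self_taught_features(X_unlabeled, X_target, n_atoms=256, alpha=1.0):
    """1) Learn a dictionary of higher-level features from unlabeled data.
    2) Re-represent the target-task data as sparse codes over that dictionary.
    Step 3) would train an ordinary supervised model on the returned codes."""
    dico = DictionaryLearning(n_components=n_atoms, alpha=alpha,
                              transform_algorithm="lasso_lars", max_iter=200)
    dico.fit(X_unlabeled)              # step 1: learn the basis from unlabeled data
    return dico.transform(X_target)    # step 2: new representation of the target data
```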
Multi-task Learning
Assumption:
If tasks are related, they may share similar parameter vectors; for example, [Evgeniou and Pontil, KDD-04]:
w_t = w_0 + v_t, where w_0 is the common part shared by all tasks and v_t is the part specific to the individual task t.
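Up to constants and the choice of loss $\ell$, the regularized objective of [Evgeniou and Pontil, KDD-04] can be written as

$$\min_{w_0,\,\{v_t\}}\;\sum_{t=1}^{T}\sum_{i=1}^{n_t} \ell\!\left((w_0 + v_t)^{\top} x_{ti},\, y_{ti}\right)
\;+\; \lambda_1 \sum_{t=1}^{T} \lVert v_t \rVert^2 \;+\; \lambda_2 \lVert w_0 \rVert^2,$$

so that each task's weight vector $w_t = w_0 + v_t$ is pulled toward the shared vector $w_0$.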
Results
Unsupervised domain adaptation, Amazon→Webcam, over time
[Plot: multi-class accuracy of TCA, GFK, CNN, DDC, DLID, DAN, and BA by publication year (2011-2015); the legend groups methods as TL without DL, DL without TL, and TL with DL.]
With Deep Learning, Transfer Learning improves.
Applying Transfer Learning techniques outperforms directly applying Deep Learning
models trained on the source.
Overview
[Taxonomy of deep transfer learning methods along two axes: Are there labelled target data? (supervised vs. unsupervised) and Are the dimensions of the source and target domains equal? (single modality vs. multiple modalities).]
– Single modality, supervised: DASH-N [1], Finetuning [2, 3], SCNN [4]
– Single modality, unsupervised: Glorot et al. [5], DLID [6], DDC [7], DAN [8], BA [9]
– Multiple modalities, supervised: SHL-MDNN [10], ST [11]
– Multiple modalities, unsupervised: DBN [12], DBM [13], MDRNN [14]
(Full citations for [1]-[14] are given in the References.)
Single Modality
• Transferability of layer-wise features
[Setup: the 1000 ImageNet classes are randomly split into two halves, A (500 classes, the source domain) and B (500 classes, the target domain).]
– baseA: train all layers with A; baseB: train all layers with B
– BnB: initialize the first n layers from baseB and fix them; randomly initialize the other layers and train with B
– BnB+: initialize the first n layers from baseB; randomly initialize the other layers; train all layers with B
– AnB: initialize the first n layers from baseA and fix them; randomly initialize the other layers and train with B
– AnB+: initialize the first n layers from baseA; randomly initialize the other layers; train all layers with B
Single Modality
• Transferability of layer-wise features
[3]
Conclusion 1: lower-layer features are more general and transferable, while higher-layer features are more specific and less transferable.
Conclusion 2: transferring features + fine-tuning always improves generalization.
Open questions: What if we do not have any labelled data to fine-tune with in the target domain? What happens if the source and target domains are very dissimilar?
– For the dissimilar-domain case, ImageNet is not split randomly but into A = {man-made classes} and B = {natural classes}.
MIR-Flickr Dataset
• 1 million images with user-generated tags
• 25,000 images are labelled with 24 categories
• 10,000 for training, 5,000 for validation, 10,000 for testing
[Example images with their category labels: "baby, female, portrait, people"; "plant life, river, water"; "clouds, sea, sky, transport, water"; "animals, dog, food". Each image comes with two modalities: domain 1 = images, domain 2 = text (tags).]
Results
Mean Average Precision (MAP) obtained by applying logistic regression (LR) to different layers of DBN [12] and DBM [13].
Transferring either one of the two domains to the other (joint hidden layer) outperforms using that domain's input alone (image_input or text_input).
References
[1] Nguyen, Hien V., et al. "Joint hierarchical domain adaptation and feature learning." PAMI. 2013.
[2] Oquab, Maxime, et al. "Learning and transferring mid-level image representations using convolutional neural
networks." CVPR. 2014.
[3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?." NIPS. 2014.
[4] Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." CVPR. 2015.
[5] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification:
A deep learning approach." ICML. 2011.
[6] Chopra, Sumit, Suhrid Balakrishnan, and Raghuraman Gopalan. "Dlid: Deep learning for domain adaptation by
interpolating between domains." ICML. 2013.
[7] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint
arXiv:1412.3474. 2014.
[8] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation
networks." arXiv preprint arXiv:1502.02791. 2015.
[9] Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised Domain Adaptation by Backpropagation."
ICML. 2015.
[10] Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with
shared hidden layers." ICASSP. 2013.
[11] Gupta, Saurabh, Judy Hoffman, and Jitendra Malik. "Cross Modal Distillation for Supervision
Transfer." arXiv preprint arXiv:1507.00448. 2015.
[12] Ngiam, Jiquan, et al. "Multimodal deep learning." ICML. 2011.
[13] Srivastava, Nitish, and Ruslan Salakhutdinov. "Multimodal learning with deep Boltzmann machines." JMLR.
2014
[14] Sohn, Kihyuk, Wenling Shang, and Honglak Lee. "Improved multimodal deep learning with variation of
information." NIPS. 2014.
Query Classification and Online Advertisement
• ACM KDDCUP 05
Winner
• SIGIR 06
• ACM Transactions on
Information Systems
Journal 2006
– Joint work with Dou
Shen, Jiantao Sun and
Zheng Chen
QC as Machine Learning
Inspired by the KDDCUP’05 competition
– Classify a query into a ranked list of categories
– Queries are collected from real search engines
– Target categories are organized in a tree with each
node being a category
Target-transfer Learning in QC
• Classifier, once trained, stays constant
– Target classes before:
  • Sports, Politics (European, US, China)
– Target classes now:
  • Sports (Olympics, Football, NBA), Stock Market (Asian, Dow, Nasdaq), History (Chinese, World)
– How to allow the target to change?
• Application:
– advertisements come and go,
– but our query→target mapping need not be retrained!
• We call this the target-transfer learning problem
Solutions: Query Enrichment + Staged Classification
Solution: bridging classifier
[Flowchart. Phase I (the training phase): from the target categories, construct synonym-based classifiers and a statistical classifier. Phase II (the testing phase): a query is sent to a search engine; the labels and the text of the returned pages are classified by the two classifiers, and the classified results are combined into the final result.]
Step 1: Query enrichment
• Textual information: title, snippet, full text of the returned pages
• Category information: category labels of the returned pages
Step 2: Bridging Classifier
• Wish to avoid:
– when the target is changed, training has to be repeated!
• Solution:
– Connect the target taxonomy and queries by
taking an intermediate taxonomy as a bridge
Bridging Classifier (cont.)
• How to connect? The target category C_i^T and the query q are linked through the intermediate categories C_j^I:

    p(C_i^T | q) ∝ Σ_j p(C_i^T | C_j^I) · p(q | C_j^I) · p(C_j^I)

where p(C_j^I) is the prior probability of C_j^I, p(C_i^T | C_j^I) captures the relation between C_j^I and C_i^T, p(q | C_j^I) captures the relation between C_j^I and q, and the left-hand side is the desired relation between C_i^T and q.
Category Selection for the Intermediate Taxonomy
– Category Selection for Reducing Complexity
• Total Probability (TP)
• Mutual Information
Result of Bridging Classifiers
– Using the bridging classifier allows the target classes to change freely
  • no need to retrain the classifier!
[Figure: performance of the bridging classifier with different granularities of the intermediate taxonomy.]
Cross-Domain AR: Performance

Dataset                               | Mean accuracy with cross-domain transfer | # Activities (source domain) | # Activities (target domain) | Baseline (random guess)
MIT Dataset (Cleaning to Laundry)     | 58.9% | 13 | 8 | 12.5%
MIT Dataset (Cleaning to Dishwashing) | 53.2% | 13 | 7 | 14.3%
Intel Research Lab Dataset            | 63.2% | 5  | 6 | 16.7%
– Activities in the source domain and the target domain are generated from ten random trials; mean accuracies are reported.
Transitive Transfer Learning (TTL) with intermediate data
• Designing the objective function of TTL
[Diagram: feature engineering and predictive-model learning are coupled with intermediate-domain weighting/selection; the weights for the intermediate domains are learned from data, and the intermediate data help find a better hidden layer.]
Reinforcement Transfer Learning via
Sparse Coding
• Slow learning speed remains a fundamental problem for reinforcement learning in complex environments.
• Main problem: the numbers of states and actions in the
source and target domains are different.
– Existing works: hand-coded inter-task mapping between state-
action pairs
• Tool: new transfer learning based on sparse coding
Ammar, Tuyls, Taylor, Driessens, Weiss: Reinforcement Learning
Transfer via Sparse Coding. AAMAS, 2012.
Reference
– [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901]
– [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
– [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
– [Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
– [Blitzer et al., Domain Adaptation with Structural Correspondence Learning, EMNLP 2006]
– [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature Alignment, WWW 2010]
– [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI 2008]
Reference (cont.)
– [Pan et al., Domain Adaptation via Transfer Component Analysis, IJCAI 2009]
– [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004]
– [Zhang and Yeung, A Convex Formulation for Learning Task Relationships in Multi-Task Learning, UAI 2010]
– [Agarwal et al., Learning Multiple Tasks using Manifold Regularization, NIPS 2010]
– [Argyriou et al., Multi-Task Feature Learning, NIPS 2007]
– [Ando and Zhang, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, JMLR 2005]
– [Ji et al., Extracting Shared Subspace for Multi-label Classification, KDD 2008]
Reference (cont.)
– [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled Data, ICML 2007]
– [Dai et al., Boosting for Transfer Learning, ICML 2007]
– [Glorot et al., Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, ICML 2011]
– [Davis and Domingos, Deep Transfer via Second-order Markov Logic, ICML 2009]
– [Mihalkova et al., Mapping and Revising Markov Logic Networks for Transfer Learning, AAAI 2007]
– [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons, ACL 2012]
Reference (cont.)
– [Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007]
– [Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009]
– [Cristianini et al., On Kernel Target Alignment, NIPS 2002]
– [Huang et al., Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006]
– [Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004]
Transfer Learning in Convolutional Neural Networks
• Convolutional neural networks (CNNs) achieve outstanding image-classification performance.
• Learning CNNs requires a very large number of annotated image samples
– millions of parameters, so many that it prevents applying CNNs to problems with limited training data.
• Key Idea:
– the internal layers of the CNN can act as a generic extractor of mid-level image representations
– Model-based Transfer Learning (a minimal fine-tuning sketch follows below)
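A minimal PyTorch-style sketch of this model-based transfer recipe: keep the pretrained internal layers as a mid-level feature extractor and retrain only a new classifier head on the target task. The choice of backbone and of which layers to freeze are assumptions for illustration.

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(n_target_classes, freeze_backbone=True):
    """Reuse an ImageNet-pretrained CNN as a generic mid-level feature extractor
    and replace its classifier head for the target task."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False          # keep the generic internal layers fixed
    model.fc = nn.Linear(model.fc.in_features, n_target_classes)  # new trainable head
    return model
```

Fine-tuning all layers instead of freezing them corresponds to the AnB+ style setting discussed earlier.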