Transfer Learning: An Overview

Transfer Learning: Theory and Applications
Qiang Yang, HKUST
Thanks:
Sinno Jialin Pan, NTU, Singapore
Ying Wei, HKUST, Hong Kong
Ben Tan, HKUST, Hong Kong

A psychological point of view
•  Transfer of Learning in
Education and Psychology
–  The study of the dependency of human conduct,
learning, or performance on prior experience.
–  [Thorndike and Woodworth, 1901] explored how individuals
transfer learning from one context to another context that shares
similar characteristics.
•  E.g.
–  C++ → Java
–  Math/Physics → Computer Science/Economics

Transfer Learning
In the machine learning community
•  The ability of a system to recognize and apply
knowledge and skills learned in previous domains/
tasks to novel tasks/domains, which share some
commonality.
•  Given a target domain/task, how can knowledge from
previous (source) domains/tasks be transferred to it?
•  Key:
–  Representation Learning, Change of Representation

Why Transfer?
•  Build every model from scratch?
–  Time consuming and expensive
–  Expense:
•  Data collection/labeling
•  Privacy
•  Time to train
•  Reuse common knowledge extracted from
existing systems?
–  More practical

Why Transfer Learning?
[Figure: predictive models are trained on labeled source domain data; transfer learning algorithms adapt them with unlabeled data or a few labeled data from the target domain; the adapted models are tested on target domain data. Example domain pairs: Electronics vs. DVD, Time Period A vs. Time Period B, Device A vs. Device B.]

Transfer Learning
Different fields
•  Transfer learning for
reinforcement learning.

[Taylor and Stone, Transfer
Learning for Reinforcement
Learning Domains: A Survey,
JMLR 2009]
•  Transfer learning for
classification and
regression problems (the focus of this talk).

[Pan and Yang, A Survey on
Transfer Learning, IEEE TKDE
2010]

Motivating Example I:
Indoor WiFi localization
[Figure: a mobile device receives different signal strengths from several access points, e.g. -30dBm, -70dBm, -40dBm]

Indoor WiFi Localization (cont.)
[Figure: a localization model maps a signal-strength vector S to a location L. Training records look like S=(-37dbm, .., -77dbm), L=(1, 3); test records contain only S, e.g. S=(-37dbm, .., -77dbm).]
•  Training and test data from the same device (Device A): average error distance ~1.5 meters
•  Training and test data from different devices (Device A and Device B): average error distance ~10 meters (drop!)

Difference between Domains
[Figure: signal-strength distributions differ between Time Period A and Time Period B, and between Device A and Device B]

Motivating Example II:
Sentiment classification

Sentiment Classification (cont.)
[Figure: classification accuracy of a sentiment classifier]
•  Training and test data from the same domain (Electronics): ~84.6%
•  Training and test data from different domains (Electronics and DVD): ~72.65% (drop!)

Difference in Representation
Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.
Video Games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

A Major Assumption in Traditional
Machine Learning
•  Training and future (test) data come from the
same domain, which implies that they are
–  represented in the same feature space;
–  drawn from the same data distribution.

Machine Learning: Yesterday, Today
and Tomorrow
•  Yesterday: Deep Learning (features). Lots of data, only the rich.
•  Today: Reinforcement Learning (rewards). Lots of data, only the rich.
•  Tomorrow: Transfer Learning (adaptation). Few data, everyone.

Different Scenarios
•  Training and testing data may come from
different domains:
–  Different feature spaces or marginal
distributions
–  Different conditional distributions or different
label spaces

Transfer Learning Approaches
•  Instance-based Approaches
•  Feature-based Approaches
•  Parameter/Model-based Approaches
•  Relational Approaches

Instance-based Transfer Learning
Approaches
General Assumption: source and target domains
have a lot of overlapping
features

Instance-based Transfer Learning
Approaches
Case I: Unlabeled Target
Case II: Some Labels in Target
[Figure: problem setting and assumption for each case]

Instance-based Approaches
Case I
Given a target task,

Case I (cont.)
Assumption:

Correcting Sample Selection Bias / Covariate Shift
[Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]

Correcting sample selection bias (cont.)
•  The distribution of the selector variable maps
the target onto the source distribution
–  Label instances from the source domain with label 1
–  Label instances from the target domain with label 0
–  Train a binary classifier
[Zadrozny, ICML-04]
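The three steps above can be sketched as follows. This is a minimal illustration, not the cited method's reference implementation: it assumes a plain logistic regression trained by gradient descent as the binary classifier, and the function name `domain_classifier_weights` and its parameters are hypothetical. The classifier's output is turned into importance weights p_target(x) / p_source(x) with Bayes' rule.

```python
import numpy as np

def domain_classifier_weights(X_src, X_tgt, lr=0.1, n_iter=2000):
    """Estimate importance weights p_target(x) / p_source(x) for source points.

    Source instances get label 1 and target instances get label 0;
    a logistic regression (an assumed choice) is fit by gradient descent.
    """
    X = np.vstack([X_src, X_tgt])
    X = np.hstack([X, np.ones((X.shape[0], 1))])        # bias column
    s = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt))])
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))            # P(s = 1 | x)
        theta -= lr * X.T @ (p - s) / len(s)            # gradient step
    p_src = 1.0 / (1.0 + np.exp(-X[: len(X_src)] @ theta))
    # Bayes' rule: p_T(x) / p_S(x) = [P(s=0|x) / P(s=1|x)] * (n_S / n_T)
    return (1.0 - p_src) / p_src * (len(X_src) / len(X_tgt))
```

The resulting weights can be passed as per-instance weights to any learner trained on the labeled source data.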

Kernel mean matching (KMM)
Maximum Mean Discrepancy (MMD)

[Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial]
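As a rough illustration of MMD, the sketch below computes the biased empirical estimate of the squared MMD between two samples. The RBF kernel and the bandwidth parameter `gamma` are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel values exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(X_src, X_tgt, gamma=1.0):
    """Biased estimate of squared MMD: distance between kernel mean embeddings."""
    return (rbf_kernel(X_src, X_src, gamma).mean()
            + rbf_kernel(X_tgt, X_tgt, gamma).mean()
            - 2.0 * rbf_kernel(X_src, X_tgt, gamma).mean())
```

Kernel mean matching then searches for source instance weights that make this distance small between the re-weighted source sample and the target sample.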

Direct density ratio estimation
•  KL divergence loss [Sugiyama et al., NIPS-07]
•  Least squares loss [Kanamori et al., JMLR-09]
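A minimal sketch of the least-squares variant, under stated assumptions: the density ratio p_target(x) / p_source(x) is modelled as a non-negative combination of RBF kernels centred on the target points, and the coefficients come from a ridge-regularised linear system. The kernel choice, `gamma`, `lam`, and the function names are illustrative assumptions, not values from the cited papers.

```python
import numpy as np

def _rbf(A, B, gamma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def least_squares_ratio(X_src, X_tgt, gamma=1.0, lam=0.1):
    """Estimate w(x) = p_target(x) / p_source(x) at the source points."""
    centers = X_tgt                                   # kernel centres (assumed choice)
    Phi_src = _rbf(X_src, centers, gamma)             # basis values on source data
    Phi_tgt = _rbf(X_tgt, centers, gamma)             # basis values on target data
    H = Phi_src.T @ Phi_src / len(X_src)              # second moment under the source
    h = Phi_tgt.mean(axis=0)                          # first moment under the target
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(alpha, 0.0)                    # keep the ratio non-negative
    return Phi_src @ alpha
```

Unlike the classifier-based approach, no density or posterior is estimated as an intermediate step; the ratio is fitted directly.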

Case II
•  Intuition: Part of the labeled data in the source
domain can be reused in the target domain
after re-weighting

Case II (cont.)
•  TrAdaBoost [Dai et al., ICML-07]
–  For each boosting iteration,
•  Use the same strategy as AdaBoost to
update the weights of target domain data.
•  Use a new mechanism to decrease the
weights of misclassified source domain
data.
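The two weight-update rules can be sketched for a single boosting iteration as below. This follows the update rules as commonly stated for TrAdaBoost and should be read as an illustration; the function name and argument layout are hypothetical, and training the weak learner itself is left out.

```python
import math

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost-style weight update.

    err_src / err_tgt hold 1 where the current weak learner misclassifies
    an instance and 0 otherwise; n_rounds is the total number of iterations.
    """
    n = len(w_src)
    # Fixed factor (<= 1) that shrinks misclassified source weights.
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    # Weighted error measured on the target data only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    if not 0.0 < eps < 0.5:
        raise ValueError("weak learner error on target data must be in (0, 0.5)")
    beta_t = eps / (1.0 - eps)
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]        # decrease
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]   # increase (AdaBoost)
    return new_src, new_tgt
```

Misclassified source instances are treated as likely to conflict with the target distribution, so their influence fades over the iterations, while misclassified target instances gain weight as in AdaBoost.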

Feature-based Transfer Learning
Approaches
When source and target
domains have only some
overlapping features (many
features have support in
only the source or only the
target domain)

Feature-based Transfer Learning
Approaches (cont.)
How to learn the transformation?
•  Solution 1: Encode application-specific
knowledge to learn the transformation.
•  Solution 2: General approaches to learning the
transformation.

Feature-based Approaches
Encode application-specific knowledge
[Figure: the Electronics and Video Game review examples, with the phrase "never buy" encoded as the single feature never_buy]

Encode application-specific knowledge (cont.)
Electronics (training):
compact sharp blurry hooked realistic boring
1 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
$y = f(x) = \mathrm{sgn}(w \cdot x), \quad w = [1, 1, -1, 0, 0, 0]^T$
Video Game (prediction):
compact sharp blurry hooked realistic boring
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 0 0 1
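A short worked example of why the classifier learned on Electronics does not carry over. It uses the feature vectors and weight vector shown on this slide; only the helper names are made up for illustration.

```python
FEATURES = ["compact", "sharp", "blurry", "hooked", "realistic", "boring"]
w = [1, 1, -1, 0, 0, 0]  # learned on Electronics reviews only

electronics = [[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
video_game = [[0, 0, 0, 1, 0, 0], [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 1]]

def sgn(v):
    """Sign function returning -1, 0 or 1."""
    return (v > 0) - (v < 0)

def predict(x):
    """y = sgn(w . x)"""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))

print([predict(x) for x in electronics])  # [1, 1, -1]
print([predict(x) for x in video_game])   # [0, 0, 0]
```

Every Video Game review scores exactly zero, because the weights on hooked, realistic and boring were never trained: the classifier has no opinion on the target domain.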

•  Three different types of features
–  Source domain (Electronics) specific features, e.g.,
compact, sharp, blurry
–  Target domain (Video Game) specific features, e.g.,
hooked, realistic, boring
–  Domain independent features (pivot features), e.g.,
good, excited, nice, never_buy

•  How to identify pivot features?
–  Term frequency in both domains
–  Mutual information between features and labels (source
domain)
–  Mutual information between features and domains
•  How to utilize pivots to align features across domains?
–  Structural Correspondence Learning (SCL) [Blitzer et al.,
EMNLP-06]
–  Spectral Feature Alignment (SFA) [Pan et al., WWW-10]

Feature-based Approaches: Spectral
Feature Alignment (SFA)
•  Intuition
–  Use a bipartite graph to model the correlations
between pivot features and other features
–  Discover new shared features by applying
spectral clustering techniques on the graph

Spectral Feature Alignment (SFA)
High level idea
•  If two domain-specific words have connections to more common pivot words in
the graph, they tend to be aligned or clustered together with a higher probability.
•  If two pivot words have connections to more common domain-specific words in
the graph, they tend to be aligned together with a higher probability.
[Figure: bipartite graph with weighted edges between pivot features (exciting, good, never_buy) and domain-specific features (Electronics: sharp, blurry, compact; Video Game: boring, hooked, realistic)]

Derive new features
[Figure: spectral clustering on the bipartite graph groups domain-specific features across the two domains: sharp with hooked, compact with realistic, blurry with boring]

Derive new features (cont.)
Electronics (training):
sharp/hooked compact/realistic blurry/boring
1 1 0
1 0 0
0 0 1
$y = f(x) = \mathrm{sgn}(w \cdot x), \quad w = [1, 1, -1]^T$
Video Game (prediction):
sharp/hooked compact/realistic blurry/boring
1 0 0
1 1 0
0 0 1

1.  Identify P pivot features
2.  Construct a bipartite graph between the pivot and
remaining features.
3.  Apply spectral clustering on the graph to derive
new features
4.  Train classifiers on the source using augmented
features (original features + new features)
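Steps 2 to 4 can be sketched as follows, assuming the pivot features have already been chosen. The co-occurrence matrix `M` in the usage note is hypothetical, and the normalised-affinity eigen-decomposition used here is one common way to do spectral clustering on a bipartite graph; it is an illustration rather than the exact published procedure.

```python
import numpy as np

def sfa_mapping(M, k):
    """M: (n_specific, n_pivot) co-occurrence counts between domain-specific
    features and pivot features. Returns U of shape (n_specific, k), which
    maps domain-specific features to k aligned features."""
    n_spec, n_piv = M.shape
    A = np.zeros((n_spec + n_piv, n_spec + n_piv))   # bipartite affinity matrix
    A[:n_spec, n_spec:] = M
    A[n_spec:, :n_spec] = M.T
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # normalised affinity
    eigvals, eigvecs = np.linalg.eigh(L)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # k leading eigenvectors
    return top[:n_spec]

def augment(X_all, X_specific, U, gamma=1.0):
    """Step 4: original features plus the new aligned features."""
    return np.hstack([X_all, gamma * (X_specific @ U)])
```

For instance, with rows ordered (sharp, hooked, compact, realistic, blurry, boring) and a made-up `M` in which sharp and hooked co-occur only with the same pivot, the rows of `U` for sharp and hooked point in the same direction, so an Electronics review containing "sharp" and a Video Game review containing "hooked" receive similar new features.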

Develop general approaches
[Figure: signal-strength distributions across Time Period A vs. Time Period B and Device A vs. Device B]

Transfer Component Analysis [Pan et al., IJCAI-09, TNN-11]
Motivation
[Figure: source and target data are generated by shared latent factors: temperature, signal properties, building structure, power of APs]

Transfer Component Analysis (cont.)
[Figure: the same latent factors; some of them cause the data distributions between the two domains to be different]

[Figure: principal components of source and target: signal properties, building structure, and a noisy component]

Learning by only minimizing the distance between
distributions

High level optimization problem
Main idea: the learned transformation should map the source and
target domain data to a latent space spanned by the
factors that reduce the domain difference and
preserve the original data structure.

Recall: Maximum Mean Discrepancy (MMD)
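A compact sketch of how TCA combines the two goals: keep the data variance while keeping the MMD between the embedded source and target samples small. It assumes a linear kernel and a trade-off parameter `mu`, both chosen here for illustration, and follows the eigen-decomposition form usually given for TCA; the function name is hypothetical.

```python
import numpy as np

def tca(X_src, X_tgt, m=2, mu=1.0):
    """Return m-dimensional embeddings of the source and target data."""
    X = np.vstack([X_src, X_tgt])
    n1, n2 = len(X_src), len(X_tgt)
    n = n1 + n2
    K = X @ X.T                                        # linear kernel (assumed)
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    L = np.outer(e, e)                                 # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    # Leading eigenvectors of (K L K + mu I)^{-1} K H K
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)[:m]
    W = eigvecs[:, order].real
    Z = K @ W                                          # embedded data
    return Z[:n1], Z[n1:]
```

In the embedded space, tr(W^T K L K W) equals the squared distance between the source and target means, which is the empirical MMD under this kernel.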

An illustrative example
Latent features learned by PCA and TCA
[Figure: three panels: original feature space, PCA, TCA]

Self-taught Feature Learning (Andrew Ng et al.)
•  Intuition: Useful higher-level features can be learned from
unlabeled data.
•  Steps:
1)  Learn higher-level features from a lot of unlabeled data.
2)  Use the learned higher-level features to represent the data of the
target task.
3)  Train models from the new representations of the target task
(supervised)
•  How to learn higher-level features
–  Sparse Coding [Raina et al., 2007]
–  Deep learning [Glorot et al., 2011]

Multi-task Feature Learning
General Multi-task Learning Setting
•  Assumption: If tasks are related, they should
share some good common features.
•  Goal: Learn a low-dimensional representation
shared across related tasks.

Multi-task Learning
Assumption:
If tasks are related, they may share similar parameter vectors.
For example, [Evgeniou and Pontil, KDD-04] split each task's parameter vector into a
common part and a specific part for the individual task.

Multi-task Feature Learning
[Argyriou etal., NIPS-07]
[Ando and Zhang, JMLR-05]
[Ji etal, KDD-08]

Deep Learning in Transfer Learning

Transfer Learning with Deep Learning
Transfer Learning Perspective:
Why do we need Deep Learning?
•  Deep neural networks learn
nonlinear representations
–  that are hierarchical;
–  that disentangle different
explanatory factors of
variation behind data
samples;
–  that manifest invariant
factors underlying different
populations.
Deep Learning Perspective:
Why do we need Transfer Learning?
•  Transfer Learning alleviates
–  the difficulty of learning on
a dataset that may not be
large enough to train an
entire deep neural network
from scratch

Benchmark Dataset: Office
•  Description: leverage source images to
improve classification of target images
•  3 domains, 31 categories (e.g. backpack, bike)
–  amazon (2,817 images): object images from Amazon
–  webcam (957 images): low-resolution images taken from a web camera
–  dslr (795 images): high-resolution images taken from a digital SLR camera

Results
Unsupervised domain adaptation Amazon→Webcam over time
[Figure: multi-class accuracy (axis from 10 to 70) by year (2011 to 2015) for TCA, GFK, CNN, DDC, DLID, DAN and BA, grouped as TL without DL, DL without TL, and TL with DL]
•  With Deep Learning, Transfer Learning improves.
•  Applying Transfer Learning techniques outperforms directly applying Deep Learning
models trained on the source.

Overview
[Figure: deep transfer learning methods organized by two questions: are there labelled target data (supervised vs. unsupervised), and are the dimensions of the source and target domains equal (single modality vs. multiple modalities)?]
•  Single modality, supervised: DASH-N [1], Finetuning [2,3], SCNN [4]
•  Single modality, unsupervised: [5], DLID [6], DDC [7], DAN [8], BA [9]
•  Multiple modalities, supervised: SHL-MDNN [10], ST [11]
•  Multiple modalities, unsupervised: DBN [12], DBM [13], MDRNN [14]

Single Modality
•  Directly applying the model parameters (deep
neural network weights) from the source to the
target
[Figure: source domain network (input → output) and target domain network (input → output) with shared weights]
Are the features transferable?

Single Modality
•  Transferability of layer-wise features
[Figure: the 1000 ImageNet classes are randomly split into A (500 classes, source domain) and B (500 classes, target domain)]
–  baseA: train all layers with A
–  baseB: train all layers with B
–  BnB: initialize the first n layers with baseB and fix them, randomly
initialize the other layers and train with B
–  BnB+: initialize the first n layers with baseB, randomly initialize the
other layers, and train all layers with B
–  AnB: initialize the first n layers with baseA and fix them, randomly
initialize the other layers and train with B
–  AnB+: initialize the first n layers with baseA, randomly initialize
the other layers, and train all layers with B
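The four transfer settings differ only in which base network the first n layers are copied from and whether those copied layers are frozen. The sketch below makes that explicit with plain Python lists; `init_transfer` and `new_layer` are hypothetical names, and the per-layer parameters can be any objects (for example weight arrays in a deep learning framework).

```python
import copy
import random

def init_transfer(base_params, n, fine_tune, new_layer, seed=0):
    """Build the initial state of an XnB (fine_tune=False) or XnB+ (fine_tune=True) network.

    base_params: per-layer parameters of the trained base network (baseA or baseB).
    n:           number of leading layers copied from the base network.
    new_layer:   callable (layer_index, rng) -> freshly initialized parameters.
    Returns (params, trainable), two lists with one entry per layer.
    """
    rng = random.Random(seed)
    params, trainable = [], []
    for i, p in enumerate(base_params):
        if i < n:
            params.append(copy.deepcopy(p))   # transferred layer
            trainable.append(fine_tune)       # frozen unless the "+" variant
        else:
            params.append(new_layer(i, rng))  # randomly re-initialized layer
            trainable.append(True)
    return params, trainable
```

Training then proceeds on dataset B, updating only the layers marked trainable.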

Single Modality
•  Transferability of layer-wise features [3]
Conclusion 1: lower layer features are more general and transferable, and higher
layer features are more specific and non-transferable.
Conclusion 2: transferring features + fine-tuning always improves generalization.
•  What if we do not have any labelled data to fine-tune in the target domain?
•  What happens if the source and target domains are very dissimilar?
–  ImageNet is not randomly split, but into A = {man-made classes} and
B = {natural classes}

Single Modality
•  General framework of unsupervised transfer
[Figure: source and target networks (input → output) with shared weights and a domain distance loss between their higher layers]
–  For lower level features (more general and transferable),
the source transfers to the target directly.
–  For higher level features (more domain specific and not transferable), the source
transfers to the target by minimizing domain distances.
–  If some labelled target data are available, the transfer can be improved further.

Single Modality
•  Overall training objective: source domain classification loss + domain distance loss
•  Domain distance losses
–  Maximum Mean Discrepancy [7], computed on a particular representation, e.g. the representation after the 5th
layer

Single Modality
•  Domain distance losses
–  MK-MMD (multi-kernel variant of MMD) [8]: learns a more flexible distance metric than MMD by adjusting the kernel weights
–  Domain classifier [4, 9]: a distribution-free metric computed on an embedding, which maximizes the domain classification error

Single Modality
•  Other factors to improve transfer
–  At which layers should the domain distance loss be applied?
•  By learning, pinpoint the layer that minimizes the domain distance
among all specific layers, say the fourth. [7]
•  All the specific layers, say the last two layers. [8]
[Figure: source and target networks with the domain distance loss attached to the selected layers]

Single Modality
•  Other factors to improve transfer
–  What if we have some training data in the target
domain?
•  Soft label supervision [4]: categories without any
labeled target data are still updated to output non-zero
probabilities
[Figure: source and target domain networks (input → output) trained with soft labels]

Multiple Modalities
•  The source domain and target domain could have different feature
spaces, i.e., dimensionality.
–  Multimedia on the web
•  Images
•  Text documents
•  Audio
•  Video
–  Recommender systems
•  Douban
•  Taobao
•  Xiami Music
–  Robotics
•  Vision
•  Audio
•  Sensors
How to deal with multi-modal transfer with Deep Learning?

Multiple Modalities
•  Key: shared concepts
[Figure: an image and the text "The cat is sitting on a sofa with ears cocking." are linked through shared concepts such as cat, ears, kitten, eyes]

Multiple Modalities
•  General framework of unsupervised transfer
[Figure: the source domain input and the target domain input are each encoded into a common space and decoded again by a reconstruction layer]
–  Reconstruction errors: each domain's input is reconstructed from the common space.
–  Paired loss: the similarity of a pair of source and target instances is
preserved in the common space.

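Multiple Modalities
•  General framework of supervised transfer
[Figure: the source domain input and the target domain input are encoded into a common space; each branch has an output layer trained with a classification loss, and the two branches are tied by a paired loss]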

MIR-Flickr Dataset
•  1 million images with user-generated tags
•  25,000 images are labelled with 24 categories
•  10,000 for training, 5,000 for validation, 10,000 for testing
[Figure: example images (domain 1) with their text tags (domain 2) and categories, e.g. "baby, female, portrait, people"; "plant life, river, water"; "clouds, sea, sky, transport, water"; "animals, dog, food"]

Results
Mean Average Precision (MAP) by applying LR to different layers [13]
The joint hidden representation, which transfers knowledge between the two domains, outperforms
either domain on its own (image_input or text_input).
DBN [12]
DBM [13]

References
[1] Nguyen, Hien V., et al. "Joint hierarchical domain adaptation and feature learning." PAMI. 2013.
[2] Oquab, Maxime, et al. "Learning and transferring mid-level image representations using convolutional neural
networks." CVPR. 2014.
[3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?" NIPS. 2014.
[4] Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." ICCV. 2015.
[5] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification:
A deep learning approach." ICML. 2011.
[6] Chopra, Sumit, Suhrid Balakrishnan, and Raghuraman Gopalan. "DLID: Deep learning for domain adaptation by
interpolating between domains." ICML. 2013.
[7] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint
arXiv:1412.3474. 2014.
[8] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation
networks." arXiv preprint arXiv:1502.02791. 2015.
[9] Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised Domain Adaptation by Backpropagation."
ICML. 2015.
[10] Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with
shared hidden layers." ICASSP. 2013.
[11] Gupta, Saurabh, Judy Hoffman, and Jitendra Malik. "Cross Modal Distillation for Supervision
Transfer." arXiv preprint arXiv:1507.00448. 2015.
[12] Ngiam, Jiquan, et al. "Multimodal deep learning." ICML. 2011.
[13] Srivastava, Nitish, and Ruslan Salakhutdinov. "Multimodal learning with deep Boltzmann machines." JMLR. 2014.
[14] Sohn, Kihyuk, Wenling Shang, and Honglak Lee. "Improved multimodal deep learning with variation of
information." NIPS. 2014.

Simultaneous Deep Transfer Across Domains
and Tasks
Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko,
ICCV 2015

Oquab, Bottou, Laptev, Sivic: Learning and Transferring
Mid-Level Image Representations using Convolutional
Neural Networks. CVPR 2014.

Transfer Learning in Convolutional
Neural Networks
•  Source Domain: ImageNet
–  1000 classes, 1.2 million images
•  Target Domain: Pascal VOC 2007 object classification
–  20 classes, about 5000 images
•  PRE-1000C: the proposed method

DeCAF: A Deep Convolutional Activation Feature
for Generic Visual Recognition
•  Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric
Tzeng, Trevor Darrell. ICML 2014
•  Questions:
–  How to transfer features to tasks with different labels?
–  Do features extracted from the CNN generalize to other datasets?
–  How does performance vary with network depth?
•  Algorithm:
–  A deep convolutional model is first trained in a fully supervised setting
using a state-of-the-art method, Krizhevsky et al. (2012).
–  Various features are then extracted from this network, and the efficacy of
these features is evaluated on generic vision tasks.

Comparison: DeCAF to others

Relational Transfer Learning
Approaches
•  Motivation:
–  If two logically described domains (relational,
data is non-i.i.d.) are related, they must share
similar relations among objects.
–  These relations can be used for transfer learning

Relational Transfer Learning
Approaches (cont.)
[Figure: Academic domain (source): Student (B), Professor (A) and Paper (T), linked by AdvisedBy and Publication. Movie domain (target): Actor (A), Director (B) and Movie (M), linked by WorkedFor and MovieMember.]
•  Academic domain: AdvisedBy (B, A) ∧ Publication (B, T)
=> Publication (A, T)
•  Movie domain: WorkedFor (A, B) ∧ MovieMember (A, M)
=> MovieMember (B, M)
•  Shared template: P1(x, y) ∧ P2 (x, z) => P2 (y, z)
[Mihalkova et al., AAAI-07; Davis and Domingos, ICML-09]
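The shared template can be instantiated mechanically in either domain by plugging in concrete predicates. The sketch below represents each predicate as a set of argument pairs; the function name and the example constants (bob, alice, paper1 and so on) are made up for illustration.

```python
def apply_template(p1_facts, p2_facts):
    """Instantiate P1(x, y) ∧ P2(x, z) => P2(y, z).

    Each argument is a set of (first_arg, second_arg) tuples.
    Returns the newly inferred P2 facts.
    """
    inferred = {(y, z) for (x, y) in p1_facts for (x2, z) in p2_facts if x == x2}
    return inferred - set(p2_facts)

# Academic domain: P1 = AdvisedBy(student, professor), P2 = Publication(person, paper)
advised_by = {("bob", "alice")}
publication = {("bob", "paper1")}
print(apply_template(advised_by, publication))   # {('alice', 'paper1')}

# Movie domain: P1 = WorkedFor(actor, director), P2 = MovieMember(person, movie)
worked_for = {("actor1", "director1")}
movie_member = {("actor1", "movie1")}
print(apply_template(worked_for, movie_member))  # {('director1', 'movie1')}
```

The rule structure learned in the academic domain is reused unchanged in the movie domain; only the predicate bindings differ.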

TRANSFER LEARNING
APPLICATIONS

Query Classification and Online
Advertisement
•  ACM KDDCUP 05 Winner
•  SIGIR 06
•  ACM Transactions on Information Systems, 2006
–  Joint work with Dou Shen, Jiantao Sun and Zheng Chen

QC as Machine Learning
Inspired by the KDDCUP’05 competition
–  Classify a query into a ranked list of categories
–  Queries are collected from real search engines
–  Target categories are organized in a tree with each
node being a category

Target-transfer Learning in QC
•  Classifier, once trained, stays constant
–  Target Classes Before
•  Sports, Politics (European, US, China)
–  Target Classes Now
•  Sports (Olympics, Football, NBA), Stock Market (Asian, Dow,
Nasdaq), History (Chinese, World)
•  How to allow the target to change?
•  Application:
–  advertisements come and go,
–  but our query → target mapping need not be retrained!
•  We call this the target-transfer learning problem

Solutions: Query Enrichment
+ Staged Classification
Solution: Bridging classifier
[Figure: Phase I (training): synonym-based classifiers and a statistical classifier are constructed from the target categories. Phase II (testing): each query is sent to a search engine; the labels of the returned pages and the text of the returned pages are fed to the classifiers, and the two sets of classified results are combined into the final results.]

Step 1: Query enrichment
•  Textual information: title, snippet, full text
•  Category information: category

Step 2: Bridging Classifier
•  Wish to avoid:
–  When target is changed, training needs to repeat!
•  Solution:
–  Connect the target taxonomy and queries by
taking an intermediate taxonomy as a bridge

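Bridging Classifier (Cont.)
•  How to connect? For a query $q$, a target category $C_i^T$ and an intermediate category $C_j^I$, the bridging classifier combines:
–  the relation between $q$ and $C_i^T$ (the quantity to estimate)
–  the relation between $C_i^T$ and $C_j^I$
–  the relation between $q$ and $C_j^I$
–  the prior probability of $C_j^I$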
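One way to read the connection is as a sum over intermediate categories: the score of a target category is proportional to the sum over j of p(C_i^T | C_j^I) p(q | C_j^I) p(C_j^I). The sketch below implements that reading; the exact formula is an assumption reconstructed from the slide's labels, and the function name, dictionary layout and all numbers in the usage note are hypothetical.

```python
def bridge_scores(p_q_given_ci, p_ct_given_ci, p_ci):
    """Score target categories for one query through an intermediate taxonomy.

    p_q_given_ci:  {intermediate: p(q | intermediate)}
    p_ct_given_ci: {target: {intermediate: p(target | intermediate)}}
    p_ci:          {intermediate: prior p(intermediate)}
    Returns normalized scores {target: p(target | q)}.
    """
    scores = {
        ct: sum(rel.get(ci, 0.0) * p_q_given_ci.get(ci, 0.0) * prior
                for ci, prior in p_ci.items())
        for ct, rel in p_ct_given_ci.items()
    }
    total = sum(scores.values())
    return {ct: s / total for ct, s in scores.items()} if total > 0 else scores
```

For example, with intermediate categories Basketball and Stocks (priors 0.5 each), p(q | Basketball) = 0.8, p(q | Stocks) = 0.2, and target categories NBA and Nasdaq related to them with probabilities 0.9/0.1 and 0.1/0.9, the query scores 0.74 for NBA and 0.26 for Nasdaq. Changing the target taxonomy only requires new target-to-intermediate relations, not retraining on queries.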

Category Selection for Intermediate
Taxonomy
–  Category Selection for Reducing Complexity
•  Total Probability (TP)
•  Mutual Information

Result of Bridging Classifiers
–  Using the bridging classifier allows the target
classes to change freely
•  no need to retrain the classifier!
[Figure: performance of the bridging classifier with different
granularity of the intermediate taxonomy]

Cross Domain Activity Recognition
[Zheng, Hu, Yang, Ubicomp 2009]
•  Challenges:
–  A new domain of
activities without
labeled data
•  Cross-domain activity
recognition
–  Transfer some available
labeled data from
source activities to help
train the recognizer
for the target activities.
[Figure: example activity domains: Cleaning Indoor, Laundry, Dishwashing]

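How to use the similarities?
[Figure: pipeline from source domain labeled data, through a similarity measure mined from the Web, to target domain pseudo labeled data and a weighted SVM classifier]
•  Source domain labeled data: <Sensor Reading, Activity Name>, e.g. <SS, “Make Coffee”>
•  Similarity measure (from the Web): sim(“Make Coffee”, “Make Tea”) = 0.6
•  Target domain pseudo labeled data: <SS, “Make Tea”, 0.6>
•  Train a weighted SVM classifier on the pseudo labeled data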
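The pseudo-labeling step can be sketched in a few lines: every labeled source reading is relabeled with each target activity and weighted by the activity similarity. The function name and the similarity lookup table below are hypothetical; the single example entry is the one from the slide.

```python
def make_pseudo_training_data(source_data, target_activities, sim):
    """source_data: list of (sensor_reading, source_activity) pairs.
    sim: callable (source_activity, target_activity) -> similarity in [0, 1].
    Returns a list of (sensor_reading, target_activity, weight) triples."""
    pseudo = []
    for reading, src_activity in source_data:
        for tgt_activity in target_activities:
            weight = sim(src_activity, tgt_activity)
            if weight > 0:                      # skip unrelated activities
                pseudo.append((reading, tgt_activity, weight))
    return pseudo

similarity = {("Make Coffee", "Make Tea"): 0.6}
pseudo = make_pseudo_training_data(
    [("SS", "Make Coffee")],
    ["Make Tea"],
    lambda a, b: similarity.get((a, b), 0.0),
)
print(pseudo)  # [('SS', 'Make Tea', 0.6)]
```

The weights are then used as per-instance confidence values when training the weighted SVM classifier.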

Calculating Activity Similarities
•  How similar are two
activities?
–  Use Web search results
–  TF-IDF: traditional IR
similarity metrics
(cosine similarity)
–  Example: mined similarity between
the activity “sweeping”
and “vacuuming”,
“making the bed”,
“gardening”
[Figure: calculated similarity with the activity "Sweeping"]

Cross-Domain AR: Performance
Dataset | Mean accuracy with cross-domain transfer | # activities (source domain) | # activities (target domain) | Baseline (random guess)
MIT Dataset (Cleaning to Laundry) | 58.9% | 13 | 8 | 12.5%
MIT Dataset (Cleaning to Dishwashing) | 53.2% | 13 | 7 | 14.3%
Intel Research Lab Dataset | 63.2% | 5 | 6 | 16.7%
•  Activities in the source domain and the target domain are
generated from ten random trials; mean accuracies are reported.

Transferring knowledge from social to
physical
•  Ubiquitous physical sensors motivate extensive
research on ubiquitous computing.
Which activity is this person performing?

Transferring from social to physical
Example social media posts:
•  "I am on a business trip in New York. The Metropolitan Museum of Art is fantastic!"
•  "Brilliant night at Chilli Food, wine, hospitality all excellent. Bristol's top restaurant."
•  "Back in the #gym after 3.5 weeks :) feeling good #exercise"

Can we transfer
knowledge from social
media to the physical
world?

Transfer from social to physical
Cellphone Sensor Dataset
•  232 sensor records
•  10 volunteers
•  time, GPS, tri-axial
accelerometer, location
POI info
Sina Weibo
•  10,791 tweets
[Figure: distribution of labels; distribution of the top 9 labels]

Transfer from social to physical
•  Results
–  A naive combination of sensor and social features
performs better than sensor features only (Combined
vs. Sensor), which validates the necessity of instilling
social knowledge into physical sensor data.
–  Heterogeneous transfer learning methods show
improvement over Combined: employing social
messages to enrich the sensor readings’ feature
representation in a latent space is more effective than
naive combination.
–  Our method needs only 50% of the labelled data used by other methods to
achieve the same performance.

Transfer Learning for Collaborative Filtering
[Figure: IMDB Database and Amazon.com]

Transfer Learning in Collaborative Filtering
•  Source (Dense): Encode cluster-level rating patterns
•  Target (Sparse): Map users/items to the encoded prototypes
[Figure: the MOVIES (auxiliary, dense) and BOOKS (target, sparse) rating matrices are each permuted by rows and columns and reduced to user groups (I, II, III) and item groups (A, B, C); the resulting cluster-level rating patterns are matched, and missing ratings (?) in the target are filled in from the matched pattern]
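A small sketch of the last step, filling missing target ratings from a cluster-level rating pattern once users and items have been mapped to groups. The codebook values, the group assignments and the function name are hypothetical; how the groups are found is not shown.

```python
def fill_from_codebook(ratings, user_group, item_group, codebook):
    """ratings: list of rows, with None for a missing rating.
    user_group[u] / item_group[i]: index of the matched user / item cluster.
    codebook[g][h]: cluster-level rating of user group g for item group h.
    Returns a copy of ratings with every missing entry filled in."""
    return [
        [codebook[user_group[u]][item_group[i]] if r is None else r
         for i, r in enumerate(row)]
        for u, row in enumerate(ratings)
    ]

codebook = [[3, 1], [2, 3]]                 # hypothetical 2 x 2 rating pattern
books = [[3, None, 1], [None, 2, 3]]        # sparse target matrix
filled = fill_from_codebook(books, user_group=[0, 1], item_group=[0, 0, 1],
                            codebook=codebook)
print(filled)  # [[3, 3, 1], [2, 2, 3]]
```

Observed ratings are kept as they are; only the unknown entries take the value of the matching cluster-level prototype.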

Source-Free
Transfer Learning
Evan Wei Xiang, Sinno Jialin Pan, Weike Pan, Jian
Su and Qiang Yang. Source-Selection-Free
Transfer Learning. In Proceedings of the 22nd
International Joint Conference on Artificial
Intelligence (IJCAI-11), Barcelona, Spain, July
2011.

Transfer Learning
•  Supervised Learning: a lack of labeled
training data is common
•  Transfer Learning: applies when we have
some related source domains

Where are the “right” source data?
•  We may have an extremely large number of choices of
potential sources to use.

SFTL – Building base models
[Figure: many pairwise ("vs.") base classifiers built between categories]
From the taxonomy of the online information
source, we can “compile” a lot of base
classification models

Source Free Transfer Learning
•  For each target instance, we
can obtain a combined result
on the label space by
aggregating the predictions
from all the base classifiers.
•  Then we can use the projection matrix V
to transform such combined results from
the label space to a latent space.
•  However, do we need to call the base classifiers during the
prediction phase? The answer is No!
[Figure: a target instance is mapped to a probability vector over the label space (size q), which the projection matrix V maps to the latent space (size m)]

Compilation: Learning a projection matrix W to
map the target instance to the latent space
•  For each target instance (labeled or unlabeled), we first aggregate
its prediction in the base label space, and
then project it onto the latent space with the projection matrix V.
•  Our regression model learns the projection matrix W with a loss on the labeled data and a loss on the unlabeled data.
[Figure: target domain data (size d) is mapped by the learned projection matrix W to the latent space (size m), alongside the projection by V of the aggregated base predictions (size q)]

SFTL – Predictions for the incoming test data
•  With the parameter matrix W, we
can make predictions on any incoming
test data based on the distance to
the label prototypes, without calling
the base classification models.
•  No need to use the base models
explicitly!
[Figure: incoming target domain test data (size d) is mapped by the learned projection matrix W to the latent space (size m), where it is compared with the label prototypes given by the projection matrix V]
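Prediction at test time can be sketched as a projection followed by a nearest-prototype lookup. The shapes are assumptions read off the figure labels (W is d × m, V is q × m, with one row of V per label), Euclidean distance is an assumed choice, and the function name is hypothetical.

```python
import numpy as np

def sftl_predict(x, W, V):
    """x: (d,) test instance; W: (d, m) learned projection matrix;
    V: (q, m) label prototypes in the latent space.
    Returns the index of the nearest label prototype."""
    z = x @ W                                  # project into the latent space
    dists = np.linalg.norm(V - z, axis=1)      # distance to each prototype
    return int(np.argmin(dists))
```

Only W and V are needed here, so the base classifiers never have to be evaluated on incoming data.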

Transitive Transfer
Learning
with intermediate domains
Qiang Yang
Hong Kong University of Science and
Technology
http://www.cse.ust.hk/~qyang

Far Transfer vs. Near Transfer

Problem definition
•  Given distant source and target domains, and a set of
intermediate domains, can we find one or more
intermediate domains to enable transfer learning
between source and target?
[Figure: the source and target are not directly transferable; intermediate domain 1 shares common factor 1 with them]

Previous work and TTL
•  Traditional machine learning (ML: same domain)
–  training and test data should be from the same problem domain.
•  Transfer learning (TL: similar domains)
–  training and test data should be from similar problem domains.
•  Transitive transfer learning (TTL: distant domains)
–  training and test data could be from distant problem domains.

Text-to-Image Classification
Source and target domains have few overlaps
•  Text-to-image classification with co-occurrence data as
the intermediate domain
•  Accelerometer-to-gyroscope activity recognition with
data from intelligent devices as intermediate
domains

TTL: single intermediate domain
Intermediate domain selection, then propagate knowledge
•  Crawl a lot of images with annotations from the Internet
•  Use a domain distance, such as A-distance, to identify the intermediate domain
•  Transitive transfer through shared hidden factors in row by matrix tri-
factorization
Matrix tri-factorization for clustering/classification

TTL: shared hidden factors in row by matrix
tri-factorization

Experiments: NUS-WIDE data set
•  The NUS-WIDE data set is used
•  45 text-to-image tasks
•  Each task is composed of 1200 text documents, 600
images, and 1600 co-occurring text-image pairs.

Supervised Learning w/ auto-encoder
[Figure: text classification on the labeled source domain; feature engineering and predictive model learning share the same network]

Designing Objective Function of TTL
Transitive Transfer Learning with intermediate data
•  Intermediate domain weighting/selection: the weights for the intermediate domains are learned from data.
•  The intermediate data help find a better hidden layer.
[Figure: feature engineering and predictive model learning stages of the network]

TTL with supervised auto-encoder
[Figure: source, intermediate and target data share the feature engineering layers, followed by predictive model learning]
•  The NUS-WIDE data
•  45 text-to-image tasks
•  Each task is composed of 1200 text
documents, 600 images, and 1600
co-occurring text-image pairs. In each task,
1600*45 co-occurring text-image pairs are
used for knowledge transfer.

TTL with supervised auto-encoder
Diagram labels: Source; Target; Intermediates; Shared; Feature engineering; Predictive model learning
Text-to-image w/ intermediate data

Reinforcement Transfer Learning via
Sparse Coding
•  Slow learning speed remains a fundamental problem for
reinforcement learning in complex environments.
•  Main problem: the numbers of states and actions in the
source and target domains are different.
–  Existing works: hand-coded inter-task mapping between state-
action pairs
•  Tool: new transfer learning based on sparse coding
Ammar, Tuyls, Taylor, Driessens, Weiss: Reinforcement Learning
Transfer via Sparse Coding. AAMAS, 2012.

Reinforcement Learning Transfer via
Sparse Coding
The authors measured performance as the number of steps in an episode during which the pole was kept in an upright position, given a fixed number of samples.

Reinforcement Transfer Learning via
Sparse Coding
•  Given state-action-state triplets in the source task, learn a dictionary by sparse coding.
•  Using the coefficient matrix from the first step, learn the dictionary in the target task.
•  Then, for each triplet in the target task, sparse projection is used to find its coefficients.
•  As a result, the inter-task mapping can be learned!
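The sparse projection step can be sketched with ISTA (iterative soft-thresholding), which minimizes 0.5‖x − Da‖² + λ‖a‖₁ over the code a for a fixed dictionary D. This is a generic solver chosen for illustration, not necessarily the one used by Ammar et al. The nearest-neighbour lookup in code space is likewise only an assumed way of reading off an inter-task mapping, and the dictionary and triplet below are synthetic.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_project(D, x, lam=0.01, n_iter=1000):
    """ISTA: minimize 0.5 * ||x - D @ a||^2 + lam * ||a||_1 over a."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - step * grad, step * lam)
    return a

def nearest_source_triplet(target_code, source_codes):
    """Index of the source triplet whose code is closest in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(source_codes - target_code, axis=1)))

# Synthetic dictionary with unit-norm atoms and a triplet built from two of them.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(10)
a_true[2], a_true[7] = 1.5, -2.0
x = D @ a_true

a_hat = sparse_project(D, x)
support = set(np.flatnonzero(np.abs(a_hat) > 0.5).tolist())

# Hypothetical source-task codes; row 1 equals the true code.
source_codes = np.vstack([np.zeros(10), a_true, -a_true])
match = nearest_source_triplet(a_hat, source_codes)
print(support, match)
```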

Reference
•  [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901]
•  [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
•  [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
•  [Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
•  [Blitzer et al., Domain Adaptation with Structural Correspondence Learning, EMNLP 2006]
•  [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature Alignment, WWW 2010]
•  [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI 2008]

Reference (cont.)
•  [Pan et al., Domain Adaptation via Transfer Component Analysis, IJCAI 2009]
•  [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004]
•  [Zhang and Yeung, A Convex Formulation for Learning Task Relationships in Multi-Task Learning, UAI 2010]
•  [Agarwal et al., Learning Multiple Tasks using Manifold Regularization, NIPS 2010]
•  [Argyriou et al., Multi-Task Feature Learning, NIPS 2007]
•  [Ando and Zhang, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, JMLR 2005]
•  [Ji et al., Extracting Shared Subspace for Multi-label Classification, KDD 2008]

Reference (cont.)
•  [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled Data, ICML 2007]
•  [Dai et al., Boosting for Transfer Learning, ICML 2007]
•  [Glorot et al., Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, ICML 2011]
•  [Davis and Domingos, Deep Transfer via Second-Order Markov Logic, ICML 2009]
•  [Mihalkova et al., Mapping and Revising Markov Logic Networks for Transfer Learning, AAAI 2007]
•  [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons, ACL 2012]

Reference (cont.)
•  [Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007]
•  [Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009]
•  [Cristianini et al., On Kernel Target Alignment, NIPS 2002]
•  [Huang et al., Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006]
•  [Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004]

Transfer Learning in Convolutional
Neural Networks
•  Convolutional neural networks (CNNs): outstanding
image-classification performance.
•  Learning CNNs requires a very large number of
annotated image samples
–  Millions of parameters: too many, which prevents applying
CNNs to problems with limited training data.
•  Key Idea:
–  the internal layers of the CNN can act as a generic
extractor of mid-level image representations
–  Model-based Transfer Learning
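The key idea can be sketched as "freeze the internal layers, retrain only the output layer on the small target set". In the sketch below a fixed random projection with ReLU stands in for the pretrained internal layers of a CNN; it is not a convolutional network, and the ridge-regression head, the sizes, and the synthetic target data are all assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for internal layers pretrained on a large source task (hypothetical weights).
W_frozen = rng.normal(size=(5, 64))

def extract_features(X):
    """Frozen 'internal layers': fixed projection + ReLU, never updated."""
    return np.maximum(X @ W_frozen, 0.0)

def fit_head(features, labels, n_classes, ridge=1e-3):
    """Train only the new output layer by ridge regression onto one-hot targets."""
    Y = np.eye(n_classes)[labels]
    A = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(A, features.T @ Y)

def predict(X, W_head):
    return np.argmax(extract_features(X) @ W_head, axis=1)

# Small labeled target set: two Gaussian blobs, 20 examples per class.
X_target = np.vstack([rng.normal(-2.0, 1.0, size=(20, 5)),
                      rng.normal(2.0, 1.0, size=(20, 5))])
y_target = np.array([0] * 20 + [1] * 20)

W_before = W_frozen.copy()
W_head = fit_head(extract_features(X_target), y_target, n_classes=2)
train_acc = float(np.mean(predict(X_target, W_head) == y_target))
print("target training accuracy:", train_acc)
```

Only `W_head` (64 x 2 values) is fitted on the target data, which is what makes the approach workable when labeled target examples are scarce.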
