Transfer Learning: Theory and Applications
Transfer Learning: An Overview
Qiang Yang, HKUST
Thanks:
Sinno Jialin Pan, NTU, Singapore
Ying Wei, HKUST, Hong Kong
Ben Tan, HKUST, Hong Kong
A psychological point of view
•  Transfer of Learning in Education and Psychology
–  The study of the dependency of human conduct, learning or performance on prior experience.
–  [Thorndike and Woodworth, 1901] explored how individuals transfer learning from one context to another context that shares similar characteristics.
•  E.g.
–  C++ → Java
–  Math/Physics → Computer Science/Economics
Transfer Learning
In the machine learning community
•  The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel tasks/domains that share some commonality.
•  Given a target domain/task, how do we transfer knowledge to the new (target) domain/task?
•  Key:
–  Representation Learning, Change of Representation
Why Transfer?
•  Build every model from scratch?
–  Time consuming and expensive
–  Expense:
•  Data collection/labeling
•  Privacy
•  Time to train
•  Reuse common knowledge extracted from existing systems?
–  More practical
Why Transfer Learning?
[Figure: labeled training data from the source domain, plus unlabeled data or a few labeled data from the target domain for adaptation, feed the transfer learning algorithms; the resulting predictive models are tested on target domain data.]
Example source → target pairs: Electronics → DVD, Time Period A → Time Period B, Device A → Device B
Transfer Learning
Different fields
•  Transfer learning for reinforcement learning.
[Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
•  Transfer learning for classification and regression problems (the focus of this talk).
[Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
Motivating Example I:
Indoor WiFi localization
[Figure: a device reads signal strengths from several WiFi access points, e.g. -30dBm, -70dBm, -40dBm.]
Indoor WiFi Localization (cont.)
Labeled training data pair a signal-strength vector S with a location L; test data contain S only:
S=(-37dBm, .., -77dBm), L=(1, 3)
S=(-41dBm, .., -83dBm), L=(1, 4)
…
S=(-49dBm, .., -34dBm), L=(9, 10)
S=(-61dBm, .., -28dBm), L=(15, 22)

Training    Test        Average Error Distance
Device A    Device A    ~1.5 meters
Device B    Device A    ~10 meters (drop!)

Difference between Domains
[Figure: signal distributions differ between Time Period A and Time Period B, and between Device A and Device B.]
Motivating Example II:
Sentiment classification
Sentiment Classification (cont.)

Training       Test           Classification Accuracy
Electronics    Electronics    ~84.6%
DVD            Electronics    ~72.65% (drop!)
Difference in Representation
Electronics
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.
Video Games
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.
A Major Assumption in Traditional Machine Learning
•  Training and future (test) data come from the same domain, which implies that they
–  are represented in the same feature space, and
–  follow the same data distribution.
Machine Learning: Yesterday, Today and Tomorrow
•  Yesterday: Deep Learning. Focus: features. Lots of data; only the rich.
•  Today: Reinforcement Learning. Focus: rewards. Lots of data; only the rich.
•  Tomorrow: Transfer Learning. Focus: adaptation. Few data; everyone.
Different Scenarios
•  Training and testing data may come from different domains:
–  Different feature spaces or marginal distributions
–  Different conditional distributions or different label spaces
Transfer Learning Approaches
•  Instance-based Approaches
•  Feature-based Approaches
•  Parameter/Model-based Approaches
•  Relational Approaches
Instance-based Transfer Learning Approaches
General Assumption: source and target domains have a lot of overlapping features.
•  Case I: Unlabeled Target
•  Case II: Some Labels in Target
Instance-based Approaches: Case I
Given a target task, [problem setting and assumption: formulas not recovered]
Correcting Sample Selection Bias / Covariate Shift
[Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
Instance-based Approaches
Correcting sample selection bias (cont.)
•  The distribution of the selector variable maps the target onto the source distribution
–  Label instances from the source domain with label 1
–  Label instances from the target domain with label 0
–  Train a binary classifier
[Zadrozny, ICML-04]
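A minimal sketch of the selector-variable idea, assuming NumPy and a plain logistic regression as the binary classifier (the slide does not say which classifier is used). The function name `fit_selector_weights`, the toy Gaussian data and the step settings are all hypothetical. The classifier's odds, corrected by the ratio of sample sizes, give an estimate of p_target(x) / p_source(x) that can re-weight source instances.

```python
import numpy as np

def fit_selector_weights(source_x, target_x, steps=2000, lr=0.1):
    """Return a function estimating p_target(x) / p_source(x) from a domain classifier."""
    x = np.vstack([source_x, target_x])
    # Source instances get label 1 and target instances get label 0.
    d = np.concatenate([np.ones(len(source_x)), np.zeros(len(target_x))])
    xb = np.hstack([x, np.ones((len(x), 1))])  # add a bias column
    theta = np.zeros(xb.shape[1])
    for _ in range(steps):
        p_source = 1.0 / (1.0 + np.exp(-xb @ theta))
        theta -= lr * xb.T @ (p_source - d) / len(d)  # gradient step on the log loss

    prior_ratio = len(source_x) / len(target_x)

    def weight(points):
        pb = np.hstack([points, np.ones((len(points), 1))])
        p_src = 1.0 / (1.0 + np.exp(-pb @ theta))
        # Bayes' rule: p_T(x) / p_S(x) = [P(d=0|x) / P(d=1|x)] * [P(d=1) / P(d=0)]
        return prior_ratio * (1.0 - p_src) / p_src

    return weight

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 1))   # toy source sample
target = rng.normal(1.0, 1.0, size=(200, 1))   # toy target sample, shifted right
weight = fit_selector_weights(source, target)
source_weights = weight(source)                # importance weights for training
```

Source instances that look like target data receive larger weights, so a weighted learner trained on the source focuses on the region the target occupies.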
Instance-based Approaches
Kernel mean matching (KMM)
Maximum Mean Discrepancy (MMD)
[Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial]
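Maximum Mean Discrepancy compares the kernel mean embeddings of two samples. The sketch below computes the biased empirical estimate of the squared MMD, assuming NumPy and a Gaussian (RBF) kernel with a hand-picked bandwidth `sigma`; the function names and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=(200, 1))
sample_b = rng.normal(3.0, 1.0, size=(200, 1))   # same shape, shifted mean
same = mmd2(sample_a, sample_a)                  # identical samples
shifted = mmd2(sample_a, sample_b)               # clearly different samples
```

Kernel mean matching goes one step further and chooses source instance weights that minimize this quantity between the weighted source sample and the target sample.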
Instance-based Approaches
Direct density ratio estimation
•  KL divergence loss [Sugiyama et al., NIPS-07]
•  Least-squares loss [Kanamori et al., JMLR-09]
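As an illustration of the least-squares variant, the sketch below fits the density ratio p_target(x) / p_source(x) directly as a non-negative combination of Gaussian kernels centred on target points, in the style of unconstrained least-squares importance fitting. It assumes NumPy; the bandwidth `sigma`, the regularizer `lam`, the number of centres and the toy data are hand-picked, hypothetical values rather than settings from the cited papers.

```python
import numpy as np

def fit_density_ratio(source_x, target_x, sigma=1.0, lam=0.1, n_centers=50):
    """Least-squares fit of r(x) ~ p_target(x) / p_source(x) with Gaussian basis functions."""
    centers = target_x[:n_centers]

    def design(points):
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma**2))

    phi_source = design(source_x)                  # denominator samples
    phi_target = design(target_x)                  # numerator samples
    H = phi_source.T @ phi_source / len(source_x)
    h = phi_target.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(alpha, 0.0)                 # keep the estimated ratio non-negative
    return lambda points: design(points) @ alpha

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(300, 1))
target = rng.normal(1.0, 0.3, size=(300, 1))
ratio = fit_density_ratio(source, target)
ratios = ratio(np.array([[1.0], [-2.0]]))          # near the target mode vs. far from it
```

No density is estimated for either domain on its own; only the ratio is modelled, which is the point of the direct approach.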
Instance-based Approaches
Case II
•  Intuition: part of the labeled data in the source domain can be reused in the target domain after re-weighting
Instance-based Approaches
Case II (cont.)
•  TrAdaBoost [Dai et al., ICML-07]
–  For each boosting iteration,
•  Use the same strategy as AdaBoost to update the weights of target domain data.
•  Use a new mechanism to decrease the weights of misclassified source domain data.
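The two weight updates can be sketched as a single re-weighting step. This is a minimal sketch of one boosting round, assuming the update rules as they are usually stated for TrAdaBoost; the function name `tradaboost_update`, the clipping constants and the toy inputs are hypothetical, and the base learner that produces the errors is left out.

```python
import math

def tradaboost_update(weights, errors, is_source, n_rounds):
    """One re-weighting step.

    weights:   current instance weights
    errors:    |h(x) - y| in [0, 1] for each instance under the current hypothesis
    is_source: True for source-domain instances, False for target-domain instances
    n_rounds:  total number of boosting iterations
    """
    n_source = sum(is_source)
    # Weighted error of the current hypothesis on the target data only.
    target_total = sum(w for w, s in zip(weights, is_source) if not s)
    eps = sum(w * e for w, e, s in zip(weights, errors, is_source) if not s) / target_total
    eps = min(max(eps, 1e-10), 0.499)          # keep beta_t well defined
    beta_t = eps / (1.0 - eps)                 # AdaBoost-style factor for target data
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))
    # Misclassified source instances shrink; misclassified target instances grow.
    return [w * beta ** e if s else w * beta_t ** (-e)
            for w, e, s in zip(weights, errors, is_source)]

weights = [1.0] * 6
errors = [1, 0, 0, 1, 0, 0]                    # first source and first target instance are wrong
is_source = [True, True, True, False, False, False]
updated = tradaboost_update(weights, errors, is_source, n_rounds=10)
```

Source instances that keep being misclassified are treated as dissimilar to the target and gradually lose influence, while hard target instances gain weight as in AdaBoost.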
Feature-based Transfer Learning Approaches
When source and target domains have only some overlapping features (many features have support in only the source or only the target domain)
Feature-based Transfer Learning Approaches (cont.)
How to learn the transformation?
•  Solution 1: Encode application-specific knowledge to learn the transformation.
•  Solution 2: General approaches to learning the transformation.
Feature-based Approaches
Encode application-specific knowledge
Electronics
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.
Video Games
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
Feature-based Approaches
Encode application-specific knowledge (cont.)
Training (Electronics):
compact  sharp  blurry  hooked  realistic  boring
   1       1      0       0        0         0
   0       1      0       0        0         0
   0       0      1       0        0         0
$y = f(x) = \operatorname{sgn}(w \cdot x),\quad w = [1, 1, -1, 0, 0, 0]^{T}$
Prediction (Video Game):
compact  sharp  blurry  hooked  realistic  boring
   0       0      0       1        0         0
   0       0      0       1        1         0
   0       0      0       0        0         1
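A short worked example of why the classifier learned on Electronics fails on Video Game reviews. It uses the weight vector and the binary feature rows from the slide; only the helper names (`sgn`, `predict`) are introduced here.

```python
FEATURES = ["compact", "sharp", "blurry", "hooked", "realistic", "boring"]
W = [1, 1, -1, 0, 0, 0]          # weights learned on Electronics reviews

def sgn(value):
    return (value > 0) - (value < 0)

def predict(x):
    """y = sgn(w . x); 0 means the classifier has no evidence either way."""
    return sgn(sum(w_i * x_i for w_i, x_i in zip(W, x)))

electronics = [[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
video_games = [[0, 0, 0, 1, 0, 0], [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 1]]

electronics_predictions = [predict(x) for x in electronics]   # positive, positive, negative
video_game_predictions = [predict(x) for x in video_games]    # all zero: no overlap with W
```

Every Video Game review activates only features whose weight is zero, so the source classifier cannot separate them. This is the gap that feature alignment is meant to close.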
Feature-based Approaches
Encode application-specific knowledge (cont.)
•  Three different types of features
–  Source domain (Electronics) specific features, e.g., compact, sharp, blurry
–  Target domain (Video Game) specific features, e.g., hooked, realistic, boring
–  Domain independent features (pivot features), e.g., good, excited, nice, never_buy
Feature-based Approaches
Encode application-specific knowledge (cont.)
•  How to identify pivot features?
–  Term frequency on both domains
–  Mutual information between features and labels (source domain)
–  Mutual information between features and domains
•  How to utilize pivots to align features across domains?
–  Structural Correspondence Learning (SCL) [Blitzer et al., EMNLP-06]
–  Spectral Feature Alignment (SFA) [Pan et al., WWW-10]
Feature-based Approaches
Spectral Feature Alignment (SFA)
•  Intuition
–  Use a bipartite graph to model the correlations between pivot features and other features
–  Discover new shared features by applying spectral clustering techniques on the graph
Spectral Feature Alignment (SFA)
High level idea
•  If two domain-specific words are connected to many of the same pivot words in the graph, they tend to be aligned or clustered together with a higher probability.
•  If two pivot words are connected to many of the same domain-specific words in the graph, they tend to be aligned together with a higher probability.
[Figure: a bipartite graph with weighted edges links the pivot features (exciting, good, never_buy) to the domain-specific features of Electronics (sharp, compact, blurry) and Video Game (hooked, realistic, boring). Spectral clustering on the graph derives new features that group domain-specific words across the two domains: sharp/hooked, compact/realistic, blurry/boring.]
Spectral Feature Alignment (SFA)
Derive new features (cont.)
Training (Electronics):
sharp/hooked  compact/realistic  blurry/boring
     1               1                 0
     1               0                 0
     0               0                 1
$y = f(x) = \operatorname{sgn}(w \cdot x),\quad w = [1, 1, -1]^{T}$
Prediction (Video Game):
sharp/hooked  compact/realistic  blurry/boring
     1               0                 0
     1               1                 0
     0               0                 1
Spectral Feature Alignment (SFA)
1.  Identify P pivot features
2.  Construct a bipartite graph between the pivot and remaining features.
3.  Apply spectral clustering on the graph to derive new features
4.  Train classifiers on the source using augmented features (original features + new features)
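Steps 2 to 4 can be sketched with NumPy as follows. This is a simplified illustration rather than the published implementation: the co-occurrence counts are invented, the pivot features are assumed to be already chosen, and the names `sfa_mapping`, `augment` and `gamma` are hypothetical. The graph is encoded as a normalized affinity matrix, and its leading eigenvectors give each domain-specific word a coordinate in a shared space.

```python
import numpy as np

def sfa_mapping(cooccurrence, k):
    """Rows: domain-specific features; columns: pivot features. Returns an (m, k) map."""
    m, p = cooccurrence.shape
    affinity = np.zeros((m + p, m + p))        # bipartite graph as a symmetric matrix
    affinity[:m, m:] = cooccurrence
    affinity[m:, :m] = cooccurrence.T
    degree = affinity.sum(axis=1)
    scale = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    normalized = scale[:, None] * affinity * scale[None, :]   # D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(normalized)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]           # k leading eigenvectors
    return top[:m, :]

def augment(x_specific, mapping, gamma=1.0):
    """Original domain-specific features plus the new aligned features."""
    return np.hstack([x_specific, gamma * (x_specific @ mapping)])

specific = ["sharp", "compact", "blurry", "hooked", "realistic", "boring"]
pivots = ["exciting", "good", "never_buy"]
# Hypothetical co-occurrence counts between domain-specific words and pivots.
counts = np.array([[3.0, 5.0, 0.0],
                   [2.0, 4.0, 0.0],
                   [0.0, 0.0, 4.0],
                   [6.0, 2.0, 0.0],
                   [1.0, 7.0, 0.0],
                   [0.0, 0.0, 4.0]])
mapping = sfa_mapping(counts, k=2)
reviews = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 1]], dtype=float)
augmented = augment(reviews, mapping)
```

In this toy graph "blurry" and "boring" connect to the same pivot with the same weight, so they land on the same coordinates even though they never occur in the same domain.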
Feature-based Approaches
Develop general approaches
[Figure: WiFi data distributions differ between Time Period A and Time Period B, and between Device A and Device B.]
Feature-based Approaches
Transfer Component Analysis [Pan et al., IJCAI-09, TNN-11]
Motivation
•  Source and target data are generated by shared latent factors: temperature, signal properties, building structure, power of APs.
•  Some of these factors cause the data distributions of the two domains to differ.
•  Principal components: signal properties and building structure, alongside a noisy component.
Transfer Component Analysis (cont.)
•  Learning the transformation by only minimizing the distance between distributions
•  Main idea: the learned transformation should map the source and target domain data to the latent space spanned by the factors that reduce the domain difference and preserve the original data structure.
•  High level optimization problem [formula not recovered]
•  Recall: Maximum Mean Discrepancy (MMD)
An illustrative example
[Figure: latent features learned by PCA and TCA; panels: original feature space, PCA, TCA.]
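A compact sketch of the TCA idea, assuming NumPy, a linear kernel and a trade-off parameter `mu`; these choices, the function name `tca` and the toy data are hypothetical, and the code follows the commonly stated eigenproblem rather than any particular reference implementation. The matrix `L` encodes the MMD between the two domains, and the centering matrix `H` encodes data variance; the leading eigenvectors balance a small MMD against a large variance.

```python
import numpy as np

def tca(source_x, target_x, n_components=2, mu=1.0):
    """Project both domains onto transfer components learned from a linear kernel."""
    x = np.vstack([source_x, target_x])
    ns, nt = len(source_x), len(target_x)
    n = ns + nt
    K = x @ x.T                                    # linear kernel on all data
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                             # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    # Leading eigenvectors of (K L K + mu I)^{-1} K H K.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)[:n_components]
    W = eigvecs[:, order].real
    Z = K @ W                                      # embedded source and target data
    return Z[:ns], Z[ns:]

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(30, 3))
target = rng.normal(1.0, 1.0, size=(20, 3))
z_source, z_target = tca(source, target)
```

A classifier trained on `z_source` with the source labels can then be applied to `z_target`, since both now live in the same low-dimensional space.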
Feature-based Approaches
Self-taught Feature Learning (Andrew Ng et al.)
•  Intuition: Useful higher-level features can be learned from unlabeled data.
•  Steps:
1)  Learn higher-level features from a lot of unlabeled data.
2)  Use the learned higher-level features to represent the data of the target task.
3)  Train models from the new representations of the target task (supervised)
•  How to learn higher-level features
–  Sparse Coding [Raina et al., 2007]
–  Deep learning [Glorot et al., 2011]
Feature-based Approaches
Multi-task Feature Learning
•  Assumption: If tasks are related, they should share some good common features.
•  Goal: Learn a low-dimensional representation shared across related tasks.
General Multi-task Learning Setting
Multi-task Learning
Assumption: if tasks are related, they may share similar parameter vectors.
For example, [Evgeniou and Pontil, KDD-04] split each task's parameter vector into a common part and a specific part for the individual task: $w_t = w_0 + v_t$.
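The common-plus-specific decomposition can be illustrated with a small least-squares version, assuming NumPy. Each task's parameter vector is the sum of a shared vector `w0` and a task-specific vector `v[t]`, both fitted by plain gradient descent with separate L2 penalties. The squared loss, the penalty values, the step size and the toy data are hypothetical choices made for the sketch; the cited work uses an SVM formulation.

```python
import numpy as np

def fit_multitask(tasks, lam_common=0.01, lam_specific=0.1, lr=0.1, steps=2000):
    """tasks: list of (X, y) pairs. Returns the common part w0 and the specific parts v."""
    n_features = tasks[0][0].shape[1]
    w0 = np.zeros(n_features)
    v = [np.zeros(n_features) for _ in tasks]
    for _ in range(steps):
        grad_w0 = lam_common * w0
        new_v = []
        for (x, y), v_t in zip(tasks, v):
            grad_t = x.T @ (x @ (w0 + v_t) - y) / len(y)   # squared-loss gradient for task t
            grad_w0 = grad_w0 + grad_t / len(tasks)
            new_v.append(v_t - lr * (grad_t + lam_specific * v_t) / len(tasks))
        w0 = w0 - lr * grad_w0
        v = new_v
    return w0, v

def mean_squared_error(tasks, w0, v):
    return float(np.mean([np.mean((x @ (w0 + v_t) - y) ** 2)
                          for (x, y), v_t in zip(tasks, v)]))

rng = np.random.default_rng(0)
common = np.array([1.0, -2.0, 0.5])            # shared by all toy tasks
tasks = []
for offset in ([0.2, 0.0, 0.0], [0.0, -0.3, 0.0], [0.0, 0.0, 0.4]):
    x = rng.normal(size=(50, 3))
    y = x @ (common + np.array(offset)) + 0.1 * rng.normal(size=50)
    tasks.append((x, y))

w0, v = fit_multitask(tasks)
```

A larger `lam_specific` pushes the tasks toward one shared model, while a larger `lam_common` lets each task go its own way.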
Multi-task Feature Learning
[Argyriou et al., NIPS-07]
[Ando and Zhang, JMLR-05]
[Ji et al., KDD-08]
Deep Learning in Transfer Learning
Transfer Learning with Deep Learning
Transfer Learning Perspective: Why need Deep Learning?
•  Deep neural networks learn nonlinear representations
–  that are hierarchical;
–  that disentangle different explanatory factors of variation behind data samples;
–  that manifest invariant factors underlying different populations.
Deep Learning Perspective: Why need Transfer Learning?
•  Transfer Learning alleviates
–  the difficulty of learning on a dataset that may not be large enough to train an entire deep neural network from scratch
Benchmark Dataset: Office
•  Description: leverage source images to improve classification of target images
•  3 domains, 31 categories (e.g., backpack, bike)
–  amazon: 2,817 images; object images in Amazon
–  webcam: 957 images; low-resolution images taken from a web camera
–  dslr: 795 images; high-resolution images taken from a digital SLR camera
Results
Unsupervised domain adaptation Amazon→Webcam over time
[Figure: multi-class accuracy (axis from 10 to 70) by year, 2011 to 2015, for TCA, GFK, CNN, DDC, DLID, DAN and BA, grouped as TL without DL, TL with DL, and DL without TL.]
•  With Deep Learning, Transfer Learning improves.
•  Applying Transfer Learning techniques outperforms directly applying Deep Learning models trained on the source.
Overview
Two questions organize the methods:
•  Are there labelled target data? (supervised vs. unsupervised)
•  Are dimensions of source and target domains equal? (single modality vs. multiple modalities)

                     supervised                                unsupervised
single modality      DASH-N [1], Finetuning [2,3], SCNN [4]    [5], DLID [6], DDC [7], DAN [8], BA [9]
multiple modalities  SHL-MDNN [10], ST [11]                    DBN [12], DBM [13], MDRNN [14]
Single Modality
•  Directly applying the model parameters (deep neural network weights) from the source to the target
[Figure: source domain and target domain networks, each mapping input to output, with shared weights.]
Are the features transferable?
Single Modality
•  Transferability of layer-wise features
–  ImageNet (1000 classes) is randomly split into A (500 classes, source domain) and B (500 classes, target domain)
–  baseA: train all layers with A
–  baseB: train all layers with B
–  BnB: initialize the first n layers with baseB and fix them, randomly initialize the other layers and train with B
–  BnB+: initialize the first n layers with baseB, randomly initialize the other layers, and train all layers with B
–  AnB: initialize the first n layers with baseA and fix them, randomly initialize the other layers and train with B
–  AnB+: initialize the first n layers with baseA, randomly initialize the other layers, and train all layers with B
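The difference between the frozen (AnB) and fine-tuned (AnB+) settings comes down to which layers are copied and which are allowed to train. The sketch below shows only that bookkeeping, assuming NumPy arrays stand in for layer weights; the function name, the layer shapes and the initialization scale are hypothetical, and no actual training is performed.

```python
import numpy as np

def transfer_first_layers(base_layers, n, fine_tune, rng):
    """Build an AnB network (fine_tune=False) or an AnB+ network (fine_tune=True).

    Returns the initial weights and a per-layer flag saying whether training may update them.
    """
    layers, trainable = [], []
    for index, weights in enumerate(base_layers):
        if index < n:
            layers.append(weights.copy())          # copied from the base network
            trainable.append(fine_tune)            # frozen unless fine-tuning
        else:
            layers.append(rng.normal(0.0, 0.01, size=weights.shape))  # random re-initialization
            trainable.append(True)
    return layers, trainable

rng = np.random.default_rng(0)
base_a = [rng.normal(size=(4, 4)) for _ in range(4)]   # stands in for a network trained on A

frozen_layers, frozen_flags = transfer_first_layers(base_a, n=2, fine_tune=False, rng=rng)
tuned_layers, tuned_flags = transfer_first_layers(base_a, n=2, fine_tune=True, rng=rng)
```

Passing a network trained on B instead of A gives the BnB and BnB+ control settings with the same code.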
Single Modality
•  Transferability of layer-wise features [3]
Conclusion 1: lower layer features are more general and transferable, and higher layer features are more specific and non-transferable.
Conclusion 2: transferring features + fine-tuning always improves generalization.
What if we do not have any labelled data to fine-tune in the target domain?
What happens if the source and target domains are very dissimilar?
–  Tested by splitting ImageNet not randomly, but into A = {man-made classes} and B = {natural classes}
Single Modality
•  General framework of unsupervised transfer
[Figure: source and target networks (input to output) with shared weights in the lower layers and a domain distance loss between the higher layers.]
–  For lower level features (more general and transferable), the source transfers to the target directly.
–  For higher level features (more domain specific and not transferable), the source transfers to the target by minimizing domain distances.
–  If some labelled target data are available, they improve the transfer further.
Single Modality
•  Overall training objective
–  Sum of the source domain classification loss and the domain distance loss
•  Domain distance losses
–  Maximum Mean Discrepancy [7], computed on a particular representation, e.g. the representation after the 5th layer
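One common way to write such an objective, in the form used by Deep Domain Confusion [7], is sketched below. The notation is assumed here rather than taken from the slide: $X_S, y_S$ are the labeled source data, $X_T$ the target data, $\phi$ the chosen layer's representation, and $\lambda$ a trade-off weight.

```latex
\mathcal{L} = \mathcal{L}_C(X_S, y_S) + \lambda \, \mathrm{MMD}^2(X_S, X_T),
\qquad
\mathrm{MMD}(X_S, X_T) =
\left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s)
      - \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\|
```

The first term keeps the representation discriminative on the source labels, and the second pulls the source and target representations together.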
Single Modality
•  Domain distance losses
–  MK-MMD (multi-kernel variant of MMD) [8]: learns a more flexible distance metric than MMD by adjusting …
–  Domain classifier [4, 9]: a distribution-free metric on an embedding; it maximizes the domain classification error
Single Modality
•  Other factors to improve transfer
–  To which layers should the domain distance loss be applied?
•  By learning, pinpoint the layer that minimizes the domain distance among all specific layers, say the fourth. [7]
•  All the specific layers, say the last two layers. [8]
[Figure: source network (input to output) and target network (input) linked by a domain distance loss.]
Single Modality
•  Other factors to improve transfer
–  What if we have some training data in the target domain?
•  Soft label supervision [4]: categories without any labeled target data are still updated to output non-zero probabilities
[Figure: source domain and target domain networks, input to output.]
Multiple Modalities
•  The source domain and target domain could have different feature spaces, i.e., different dimensionality.
–  Multimedia on the web
•  Images
•  Text documents
•  Audio
•  Video
–  Recommender systems
•  Douban
•  Taobao
•  Xiami Music
–  Robotics
•  Vision
•  Audio
•  Sensors
How to deal with multi-modal transfer with Deep Learning?
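Multiple Modalities
•  Key: a shared concept links the modalities
[Figure: an image of a cat and the text "The cat is sitting on a sofa with its ears cocked." map to a shared concept; related terms: cat, ears, kitten, eyes.]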
Multiple Modalities
•  General framework of unsupervised transfer
[Figure: source domain input and target domain input are encoded into a common space, with a reconstruction layer for each domain and a paired loss between the two encodings.]
–  Reconstruction errors: [formula not recovered]
–  Paired loss: the similarity of a pair of source and target instances is preserved in the common space.
Multiple Modalities
•  General framework of supervised transfer
[Figure: source domain input and target domain input are encoded into a common space with a paired loss; each domain has an output layer.]
–  Classification loss: [formula not recovered]
MIR-Flickr Dataset
•  1 million images with user-generated tags
•  25,000 images are labelled with 24 categories
•  10,000 for training, 5,000 for validation, 10,000 for testing
•  Domain 1: images; domain 2: text
•  Example category sets: {baby, female, portrait, people}, {plant life, river, water}, {clouds, sea, sky, transport, water}, {animals, dog, food}
Results
Mean Average Precision (MAP) by applying LR to different layers [13]
Methods compared: DBN [12], DBM [13]
Transferring either one of the two domains to the other (joint hidden) outperforms using that domain alone (image_input or text_input).
References
[1] Nguyen, Hien V., et al. "Joint hierarchical domain adaptation and feature learning." PAMI. 2013.
[2] Oquab, Maxime, et al. "Learning and transferring mid-level image representations using convolutional neural
networks." CVPR. 2014.	
[3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?." NIPS. 2014.
[4] Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." ICCV. 2015.
[5] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification:
A deep learning approach." ICML. 2011.
[6] Chopra, Sumit, Suhrid Balakrishnan, and Raghuraman Gopalan. "Dlid: Deep learning for domain adaptation by
interpolating between domains." ICML. 2013.
[7] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint
arXiv:1412.3474. 2014.
[8] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation
networks." arXiv preprint arXiv:1502.02791. 2015.
[9] Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised Domain Adaptation by Backpropagation."
ICML. 2015.
[10] Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with
shared hidden layers." ICASSP. 2013.
[11] Gupta, Saurabh, Judy Hoffman, and Jitendra Malik. "Cross Modal Distillation for Supervision
Transfer." arXiv preprint arXiv:1507.00448. 2015.
[12] Ngiam, Jiquan, et al. "Multimodal deep learning." ICML. 2011.
[13] Srivastava, Nitish, and Ruslan Salakhutdinov. "Multimodal learning with deep Boltzmann machines." JMLR.
2014
[14] Sohn, Kihyuk, Wenling Shang, and Honglak Lee. "Improved multimodal deep learning with variation of
information." NIPS. 2014.
Simultaneous Deep Transfer Across Domains and Tasks
Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko, ICCV 2015
Tzeng et al.: Architecture
[Figure: network architecture, shown on two slides.]
Oquab, Bottou, Laptev, Sivic: Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. CVPR 2014.
Transfer Learning in Convolutional Neural Networks
•  Source Domain: ImageNet
–  1000 classes, 1.2 million images
•  Target Domain: Pascal VOC 2007 object classification
–  20 classes, about 5000 images
•  PRE-1000C: the proposed method
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
•  Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. ICML 2014
•  Questions:
–  How to transfer features to tasks with different labels?
–  Do features extracted from the CNN generalize to other datasets?
–  How does performance vary with network depth?
•  Algorithm:
–  A deep convolutional model is first trained in a fully supervised setting using a state-of-the-art method, Krizhevsky et al. (2012).
–  Extract various features from this network, and evaluate the efficacy of these features on generic vision tasks.
Comparison: DeCAF to others
[Figure: comparison results not recovered.]
Relational Transfer Learning Approaches
•  Motivation:
–  If two logically described domains (relational; data are non-i.i.d.) are related, they must share similar relations among objects.
–  These relations can be used for transfer learning
Relational Transfer Learning Approaches (cont.)
Academic domain (source): Student (B), Professor (A), Paper (T); relations AdvisedBy, Publication
  AdvisedBy (B, A) ∧ Publication (B, T) => Publication (A, T)
Movie domain (target): Actor (A), Director (B), Movie (M); relations WorkedFor, MovieMember
  WorkedFor (A, B) ∧ MovieMember (A, M) => MovieMember (B, M)
Shared second-order template:
  P1(x, y) ∧ P2(x, z) => P2(y, z)
[Mihalkova et al., AAAI-07; Davis and Domingos, ICML-09]
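The shared template can be applied to either domain by plugging in that domain's predicates. The sketch below treats each predicate as a set of argument pairs; the function name `apply_template` and the individual names in the facts are hypothetical.

```python
def apply_template(p1_facts, p2_facts):
    """Instantiate P1(x, y) AND P2(x, z) => P2(y, z) on sets of (arg1, arg2) facts."""
    return {(y, z) for (x, y) in p1_facts for (x2, z) in p2_facts if x == x2}

# Academic domain (source): P1 = AdvisedBy(student, professor), P2 = Publication(person, paper)
advised_by = {("bob_student", "alice_professor")}
publication = {("bob_student", "paper_1")}
derived_publications = apply_template(advised_by, publication)

# Movie domain (target): P1 = WorkedFor(actor, director), P2 = MovieMember(person, movie)
worked_for = {("carol_actor", "dave_director")}
movie_member = {("carol_actor", "movie_1")}
derived_members = apply_template(worked_for, movie_member)
```

The rule structure is learned once in the source domain; only the binding of P1 and P2 to concrete predicates changes in the target domain.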
TRANSFER LEARNING APPLICATIONS
Query Classification and Online Advertisement
•  ACM KDDCUP 05 Winner
•  SIGIR 06
•  ACM Transactions on Information Systems Journal 2006
–  Joint work with Dou Shen, Jiantao Sun and Zheng Chen
QC as Machine Learning
Inspired by the KDDCUP’05 competition
–  Classify a query into a ranked list of categories
–  Queries are collected from real search engines
–  Target categories are organized in a tree with each node being a category
Target-transfer Learning in QC
•  Classifier, once trained, stays constant
–  Target Classes Before
•  Sports, Politics (European, US, China)
–  Target Classes Now
•  Sports (Olympics, Football, NBA), Stock Market (Asian, Dow, Nasdaq), History (Chinese, World)
•  How to allow the target to change?
•  Application:
–  advertisements come and go,
–  but our query → target mapping need not be retrained!
•  We call this the target-transfer learning problem
Solutions: Query Enrichment + Staged Classification
Solution: Bridging classifier
[Figure: Phase I, the training phase: construction of synonym-based classifiers and construction of a statistical classifier from the target categories. Phase II, the testing phase: each query is sent to a search engine; the labels of the returned pages and the text of the returned pages are classified separately, and the two sets of classified results are combined into the final results.]
Step 1: Query enrichment
•  Textual information: Title, Snippet, Full
•  Category information: Category
Step 2: Bridging Classifier
•  Wish to avoid:
–  When the target is changed, training needs to repeat!
•  Solution:
–  Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge
Bridging Classifier (Cont.)
•  How to connect?
[Formula not recovered; its annotated terms are:]
–  the relation between $C_i^T$ and $q$
–  the relation between $C_j^I$ and $C_i^T$
–  the relation between $C_j^I$ and $q$
–  the prior probability of $C_j^I$
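One plausible reading of how these four terms combine is to marginalize over the intermediate categories: the score of target category $C_i^T$ for query $q$ is the sum over $j$ of p($C_i^T$ | $C_j^I$) · p($q$ | $C_j^I$) · p($C_j^I$), normalized over the target categories. The sketch below implements that reading; it is an assumption, since the slide's formula was not recovered, and the function name, category names and probabilities are all hypothetical.

```python
def bridge_scores(p_target_given_inter, p_query_given_inter, p_inter):
    """Score target categories for one query by summing over intermediate categories."""
    scores = {}
    for target, row in p_target_given_inter.items():
        scores[target] = sum(row[c] * p_query_given_inter[c] * p_inter[c] for c in p_inter)
    total = sum(scores.values())
    return {target: score / total for target, score in scores.items()}

p_inter = {"I1": 0.5, "I2": 0.5}                       # prior of each intermediate category
p_query_given_inter = {"I1": 0.8, "I2": 0.2}           # relation between the query and C^I
p_target_given_inter = {                               # relation between C^I and C^T
    "Sports": {"I1": 0.9, "I2": 0.1},
    "Politics": {"I1": 0.1, "I2": 0.9},
}
ranking = bridge_scores(p_target_given_inter, p_query_given_inter, p_inter)
```

Only `p_target_given_inter` depends on the target taxonomy, so a new set of target classes requires a new mapping table but no retraining of the query-side classifier.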
Category Selection for Intermediate Taxonomy
–  Category Selection for Reducing Complexity
•  Total Probability (TP)
•  Mutual Information
Result of Bridging Classifiers
–  Using a bridging classifier allows the target classes to change freely
•  no need to retrain the classifier!
•  Performance of the Bridging Classifier with Different Granularity of Intermediate Taxonomy
Cross Domain Activity Recognition
[Zheng, Hu, Yang, Ubicomp 2009]
•  Challenges:
–  A new domain of activities without labeled data
•  Cross-domain activity recognition
–  Transfer some available labeled data from source activities to help train the recognizer for the target activities.
•  Example activity domains: Cleaning, Indoor, Laundry, Dishwashing
How to use the similarities?
•  Source domain labeled data: <Sensor Reading, Activity Name>, e.g. <SS, “Make Coffee”>
•  Similarity measure, mined from the Web: sim(“Make Coffee”, “Make Tea”) = 0.6
•  Target domain pseudo labeled data: <SS, “Make Tea”, 0.6>
•  Weighted SVM classifier trained on the pseudo labeled data
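The pseudo-labeling step can be sketched in a few lines. Each labeled source reading is relabeled with every sufficiently similar target activity, and the similarity becomes the instance weight. The function name, the `min_sim` threshold and the data layout are hypothetical; the example values are the ones on the slide.

```python
def make_pseudo_data(source_data, similarity, target_activities, min_sim=0.0):
    """Turn <reading, source activity> pairs into <reading, target activity, weight> triples."""
    pseudo = []
    for reading, source_activity in source_data:
        for target_activity in target_activities:
            sim = similarity.get((source_activity, target_activity), 0.0)
            if sim > min_sim:
                pseudo.append((reading, target_activity, sim))
    return pseudo

source_data = [("SS", "Make Coffee")]
similarity = {("Make Coffee", "Make Tea"): 0.6}
pseudo_data = make_pseudo_data(source_data, similarity, ["Make Tea"])
```

The resulting triples are what a weighted classifier would consume, with the third element used as the per-instance weight.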
Calculating Activity Similarities
•  How similar are two activities?
–  Use Web search results
–  TF-IDF: traditional IR similarity metrics (cosine similarity)
–  Example: mined similarity between the activity “sweeping” and “vacuuming”, “making the bed”, “gardening”
[Figure: calculated similarity with the activity "Sweeping".]
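A minimal sketch of the TF-IDF and cosine similarity computation, using only the Python standard library. Each activity is represented by one text document standing in for its Web search results; the three documents are invented, and the smoothed IDF formula is one common variant chosen for the sketch, not necessarily the one used in the cited work.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: {activity name: text}. Returns {activity name: {term: tf-idf weight}}."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

documents = {                                   # hypothetical stand-ins for search results
    "sweeping": "broom floor dust clean floor",
    "vacuuming": "vacuum floor dust clean carpet",
    "gardening": "plant soil water garden flower",
}
vectors = tfidf_vectors(documents)
sim_vacuuming = cosine(vectors["sweeping"], vectors["vacuuming"])
sim_gardening = cosine(vectors["sweeping"], vectors["gardening"])
```

Activities whose search results share vocabulary score higher, which is the signal used to decide how much source data to reuse for each target activity.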
Cross-Domain AR: Performance

Dataset                                 Mean accuracy    # Activities    # Activities    Baseline
                                        with transfer    (source)        (target)        (random guess)
MIT Dataset (Cleaning to Laundry)       58.9%            13              8               12.5%
MIT Dataset (Cleaning to Dishwashing)   53.2%            13              7               14.3%
Intel Research Lab Dataset              63.2%            5               6               16.7%

•  Activities in the source domain and the target domain are generated from ten random trials; mean accuracies are reported.
Transferring knowledge from social to physical
•  Ubiquitous physical sensors motivate extensive research on ubiquitous computing.
•  Which activity is this person performing?
Transferring from social to physical
Example social media posts:
–  "I am on a business trip in New York. The Metropolitan Museum of Art is fantastic!"
–  "Brilliant night at Chilli Food, wine, hospitality all excellent. Bristol's top restaurant."
–  "Back in the #gym after 3.5 weeks :) feeling good #exercise"
Can we transfer knowledge from social media to the physical world?
Transfer from social to physical
•  Cellphone Sensor Dataset
–  232 sensor records
–  10 volunteers
–  time, GPS, tri-axial accelerometer, location POI info
•  Sina Weibo
–  10,791 tweets
–  [Figures: distribution of labels; distribution of top 9 labels]
Transfer from social to physical
•  Results
–  A naive combination of sensor and social features performs better than sensor features only (Combined vs. Sensor), which validates the necessity of instilling social knowledge into physical sensor data.
–  Heterogeneous transfer learning methods show improvement over Combined: employing social messages to enrich the sensor readings’ feature representation in a latent space is more effective than naive combination.
–  Our method needs only 50% of the labelled data used by the other methods to achieve the same performance.
Transfer Learning for Collaborative Filtering
[Figure: IMDB Database and Amazon.com]
Transfer Learning in Collaborative Filtering
•  Source (Dense): Encode cluster-level rating patterns
•  Target (Sparse): Map users/items to the encoded prototypes
[Figure: the dense MOVIES (auxiliary) rating matrix and the sparse BOOKS (target) rating matrix are each permuted by rows and columns and reduced to groups (user groups I to III, item groups A to C); the cluster-level rating patterns are then matched to fill in the missing target ratings, marked "?".]
ADVANCED DEVELOPMENTS
Source-Free Transfer Learning
Evan Wei Xiang, Sinno Jialin Pan, Weike Pan, Jian Su and Qiang Yang. Source-Selection-Free Transfer Learning. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11), Barcelona, Spain, July 2011.
Transfer Learning
•  Supervised Learning: lack of labeled training data always happens
•  Transfer Learning: when we have some related source domains
Where are the “right” source data?
•  We may have an extremely large number of choices of potential sources to use.
SFTL – Building base models
•  From the taxonomy of the online information source, we can “compile” a lot of base classification models, each a binary "vs." model
Source Free Transfer Learning
•  For each target instance, we can obtain a combined result on the label space by aggregating the predictions from all the base classifiers
•  Then we can use the projection matrix V (dimensions q and m) to transform such combined results from the label space to a latent space
•  However, do we need to call the base classifiers during the prediction phase? The answer is No!
Compilation: Learning a projection matrix W to map the target instance to the latent space
•  For each target instance, we first aggregate its prediction in the base label space, and then project it onto the latent space with the projection matrix V
•  Our regression model learns the projection matrix W (dimensions d and m) from labeled and unlabeled target domain data
–  Loss on labeled data
–  Loss on unlabeled data
SFTL – Predictions for the incoming test data
•  With the parameter matrix W, we can make predictions on any incoming test data based on the distance to the label prototypes, without calling the base classification models
•  No need to use base models explicitly!
Transitive Transfer Learning
with intermediate domains
Qiang Yang
Hong Kong University of Science and
Technology
http://www.cse.ust.hk/~qyang
Far Transfer vs. Near Transfer
Problem definition
•  Given distant source and target domains, and a set of
intermediate domains, can we find one or more
intermediate domains to enable the transfer learning
between source and target?
[Figure labels: Not directly transferable; Intermediate domain 1; Common factor 1]
Previous work and TTL
•  Traditional machine learning
–  training and test data should be from the same problem domain.
•  Transfer learning
–  training and test data should be from similar problem domains.
•  Transitive transfer learning
–  training and test data could be from distant problem domains.
ML: Same domain
TL: Similar domains
TTL: Distant domains
Text-to-Image Classification
Source and target domains have little overlap
•  Text-to-image classification with co-occurrence data as
intermediate domain
•  Accelerometer-to-gyroscope activity recognition with
data from intelligent devices as intermediate domains
TTL: single intermediate domain
Intermediate domain selection, then propagate knowledge
•  Crawl a lot of images with annotations from the Internet
•  Use a domain distance, such as A-distance, to identify a suitable intermediate domain
•  Transitive transfer through shared hidden factors in row by matrix tri-factorization
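
One common way to estimate an A-distance in practice is the proxy A-distance, 2(1 - 2ε), where ε is the held-out error of a classifier trained to tell the two domains apart. The sketch below assumes that proxy and uses a nearest-centroid domain classifier purely for brevity; the slides do not say which estimator or classifier was used, and all function names are hypothetical.

```python
def centroid(points):
    """Mean vector of a list of equally sized feature vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def domain_classifier_error(train_a, train_b, test_a, test_b):
    """Held-out error of a nearest-centroid classifier separating domain A from B.

    Ties are assigned to domain A.
    """
    ca, cb = centroid(train_a), centroid(train_b)
    mistakes = sum(1 for p in test_a if sq_dist(p, ca) > sq_dist(p, cb))
    mistakes += sum(1 for p in test_b if sq_dist(p, cb) >= sq_dist(p, ca))
    return mistakes / (len(test_a) + len(test_b))


def proxy_a_distance(error):
    """Proxy A-distance: close to 2 for easily separated domains, close to 0 for similar ones."""
    return 2.0 * (1.0 - 2.0 * error)
```

A candidate intermediate domain with a small distance to both the source and the target would be a natural choice under this measure, although the selection rule itself is an assumption here.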
Matrix tri-factorization for clustering/classification
TTL: shared hidden factors in row by matrix
tri-factorization
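
For readers unfamiliar with the technique, nonnegative matrix tri-factorization in its generic form approximates a data matrix by a product of three nonnegative factors. The symbols below (X, F, S, G and their sizes) are generic notation assumed here and are not taken from the slides; the TTL objective itself, which shares hidden factors across domains, is not reproduced.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \left\| X - F S G^{\top} \right\|_F^2,
\qquad
X \in \mathbb{R}^{n \times d},\quad
F \in \mathbb{R}^{n \times k},\quad
S \in \mathbb{R}^{k \times l},\quad
G \in \mathbb{R}^{d \times l}
```

In this generic form, F groups the n instances into k clusters, G groups the d features into l clusters, and S links instance clusters to feature clusters.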
Experiments: NUS-WIDE data set
•  The NUS-WIDE data set is used
•  45 text-to-image tasks
•  Each task is composed of 1200 text documents, 600
images, and 1600 co-occurred text-image pairs.
Supervised Learning w/ auto-encoder
[Diagram labels: Labeled Source Domain; Feature Engineering; Predictive Model Learning; Shared; Text Classification]
Designing Objective Function of TTL
Transitive Transfer Learning with intermediate data
Intermediate domain weighting/selection
•  The weights for the intermediate domains are learned from data.
•  The intermediate data help find a better hidden layer.
[Diagram labels: Feature Engineering; Predictive Model Learning]
TTL with supervised auto-encoder
[Diagram labels: Source; Target; Intermediates; Shared; Feature Engineering; Predictive Model Learning]
•  The NUS-WIDE data
•  45 text-to-image tasks
•  Each task is composed of 1200 text documents, 600
images, and 1600 co-occurred text-image pairs. In each task,
1600*45 co-occurred text-image pairs will be used for
knowledge transfer.
TTL with supervised auto-encoder
[Diagram labels: Source; Target; Intermediates; Shared; Feature Engineering; Predictive Model Learning]
Text-to-image w/ intermediate data
Reinforcement Transfer Learning via Sparse Coding
•  Slow learning speed remains a fundamental problem for
reinforcement learning in complex environments.
•  Main problem: the numbers of states and actions in the
source and target domains are different.
–  Existing works: hand-coded inter-task mapping between state-
action pairs
•  Tool: new transfer learning based on sparse coding
Ammar, Tuyls, Taylor, Driessens, Weiss: Reinforcement Learning
Transfer via Sparse Coding. AAMAS, 2012.
Reinforcement Learning Transfer via Sparse Coding
The authors measured performance as the number of
steps in an episode during which the pole is kept in an
upright position, given a fixed number of samples.
•  Given State-Action-State triplets in the source task, learn a dictionary
as [equation not preserved]
•  Using the coefficient matrix from the first step, we can learn the
dictionary in the target task as [equation not preserved]
•  Then for each triplet in the target task, sparse projection is used to
find its coefficients
•  As a result, the inter-task mapping can be learned!
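
The sparse-projection step can be illustrated with matching pursuit, a standard greedy sparse-coding routine. The slide does not show which sparse projection the authors use, so matching pursuit is swapped in here as an assumption, and the nearest-code rule for pairing target and source triplets is likewise hypothetical. Dictionary atoms are assumed to have unit norm.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(x, atoms, n_nonzero):
    """Greedy sparse code of x over unit-norm dictionary atoms.

    At each step, pick the atom most correlated with the residual,
    add that correlation to its coefficient, and subtract its contribution.
    """
    residual = list(x)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        correlations = [dot(residual, atom) for atom in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(correlations[k]))
        if correlations[best] == 0.0:
            break  # residual is orthogonal to every atom
        coeffs[best] += correlations[best]
        residual = [r - correlations[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs


def nearest_source_triplet(target_code, source_codes):
    """Index of the source triplet whose code is closest to the target code (assumed rule)."""
    return min(
        range(len(source_codes)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(target_code, source_codes[i])),
    )
```

Pairing each target triplet with its nearest source triplet in coefficient space gives one possible inter-task mapping between state-action pairs.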
Reference
•  [Thorndike and Woodworth, The Influence of Improvement in one
mental function upon the efficiency of the other functions, 1901]
•  [Taylor and Stone, Transfer Learning for Reinforcement Learning
Domains: A Survey, JMLR 2009]
•  [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
•  [Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press
2009]
•  [Blitzer et al., Domain Adaptation with Structural Correspondence
Learning, EMNLP 2006]
•  [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature
Alignment, WWW 2010]
•  [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI
2008]
Reference (cont.)
•  [Pan et al., Domain Adaptation via Transfer Component Analysis,
IJCAI 2009]
•  [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004]
•  [Zhang and Yeung, A Convex Formulation for Learning Task
Relationships in Multi-Task Learning, UAI 2010]
•  [Agarwal et al., Learning Multiple Tasks using Manifold
Regularization, NIPS 2010]
•  [Argyriou et al., Multi-Task Feature Learning, NIPS 2007]
•  [Ando and Zhang, A Framework for Learning Predictive Structures
from Multiple Tasks and Unlabeled Data, JMLR 2005]
•  [Ji et al., Extracting Shared Subspace for Multi-label Classification,
KDD 2008]
Reference (cont.)
•  [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled
Data, ICML 2007]
•  [Dai et al., Boosting for Transfer Learning, ICML 2007]
•  [Glorot et al., Domain Adaptation for Large-Scale Sentiment
Classification: A Deep Learning Approach, ICML 2011]
•  [Davis and Domingos, Deep Transfer via Second-order Markov Logic,
ICML 2009]
•  [Mihalkova et al., Mapping and Revising Markov Logic Networks for
Transfer Learning, AAAI 2007]
•  [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic
Lexicons, ACL 2012]
Reference (cont.)
•  [Sugiyama et al., Direct Importance Estimation with Model Selection
and Its Application to Covariate Shift Adaptation, NIPS 2007]
•  [Kanamori et al., A Least-squares Approach to Direct Importance
Estimation, JMLR 2009]
•  [Cristianini et al., On Kernel Target Alignment, NIPS 2002]
•  [Huang et al., Correcting Sample Selection Bias by Unlabeled Data,
NIPS 2006]
•  [Zadrozny, Learning and Evaluating Classifiers under Sample
Selection Bias, ICML 2004]
Transfer Learning in Convolutional
Neural Networks
•  Convolutional neural networks (CNNs): outstanding
image-classification performance.
•  Learning CNNs requires a very large number of
annotated image samples
–  Millions of parameters: too many, which prevents applying
CNNs to problems with limited training data.
•  Key Idea:
–  The internal layers of the CNN can act as a generic
extractor of mid-level image representations
–  Model-based Transfer Learning
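
The key idea (reuse internal layers as a fixed feature extractor and train only a small new model on top) can be sketched without any deep-learning library. The code below is a toy stand-in, not a CNN: `FROZEN_WEIGHTS` plays the role of pretrained internal layers and is a hypothetical constant, there is no convolution, and the new head is a plain logistic regression trained on the frozen features with limited target data.

```python
import math

# Stand-in for pretrained internal layers (hypothetical weights, never updated).
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]


def frozen_features(x):
    """ReLU(W x): toy substitute for a pretrained mid-level representation."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def train_head(samples, labels, lr=0.5, epochs=200):
    """Train only a logistic-regression head on top of the frozen features."""
    features = [frozen_features(x) for x in samples]
    weights, bias = [0.0] * len(FROZEN_WEIGHTS), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            score = sum(w * v for w, v in zip(weights, f)) + bias
            error = 1.0 / (1.0 + math.exp(-score)) - y  # gradient of the log loss
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Classify a new input using the frozen extractor plus the trained head."""
    f = frozen_features(x)
    return 1 if sum(w * v for w, v in zip(weights, f)) + bias > 0 else 0
```

Because only the head's few parameters are learned, a handful of labeled target examples can be enough, which is the practical appeal of model-based transfer.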
Thank You