Matching Networks for One Shot Learning

1. Introduction



In this work, inspired by metric learning based on deep neural features and by memory-augmented neural networks, the authors propose Matching Networks, which map a small labelled support set and an unlabelled example to its label. They then define one-shot learning problems on vision and language tasks and obtain improved one-shot accuracy on ImageNet and Omniglot. The novelty of their work is twofold: a contribution at the modeling level and a contribution at the level of the training procedure.

2. Model



Their non-parametric approach to one-shot learning rests on two components. First, the model architecture follows recent advances in neural networks augmented with memory: given a support set $S$, the model defines a function $c_S$ (a classifier) for each $S$. Second, they employ a training strategy tailored for one-shot learning from the support set $S$.

2.1 Model Architecture

Matching Networks are able to produce sensible test labels for unobserved classes without any changes to the network. We wish to map from a support set of $k$ image-label pairs $S=\{(x_i,y_i)\}_{i=1}^k$ to a classifier $c_S(\hat{x})$ which, given a test example $\hat{x}$, defines a probability distribution over outputs $\hat{y}$. Furthermore, we define the mapping $S\rightarrow c_S(\hat{x})$ to be $P(\hat{y} \mid \hat{x},S)$, where $P$ is parameterised by a neural network. Thus, when given a new support set of examples $S'$ from which to one-shot learn, we simply use the parametric neural network defined by $P$ to make predictions about the appropriate label $\hat{y}$ for each test example $\hat{x}$: $P(\hat{y} \mid \hat{x},S')$. In general, the predicted output class for a given unseen input example $\hat{x}$ and a support set $S$ becomes $\arg\max_y P(y\mid \hat{x},S)$. The model in its simplest form computes $\hat{y}$ as follows:

$$ \hat{y}=\sum_{i=1}^k a(\hat{x},x_i)y_i $$

where $x_i, y_i$ are the samples and labels from the support set $S=\{(x_i,y_i)\}_{i=1}^k$, and $a$ is an attention mechanism. Here, the attention kernel is the softmax over the cosine distance: $$ a(\hat{x},x_i)=\frac{e^{c(f(\hat{x}),g(x_i))}}{\sum_{j=1}^k e^{c(f(\hat{x}),g(x_j))}} $$ where the embedding functions $f$ and $g$ are appropriate neural networks that embed $\hat{x}$ and $x_i$, respectively.
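The read-out above can be sketched in a few lines. This is a minimal illustration, assuming the embedded query $f(\hat{x})$ and embedded support set $g(x_i)$ are already available as tensors; the sizes and the one-hot labels are hypothetical, not the authors' setup.

```python
import torch
import torch.nn.functional as F

def attention_kernel(f_xhat, g_xs):
    """Softmax over cosine similarities between the query and the support embeddings."""
    sims = F.cosine_similarity(f_xhat.unsqueeze(0), g_xs, dim=-1)  # (k,)
    return F.softmax(sims, dim=0)                                  # a(x_hat, x_i)

def predict(f_xhat, g_xs, y_onehot):
    """y_hat = sum_i a(x_hat, x_i) * y_i  -- a distribution over the support labels."""
    a = attention_kernel(f_xhat, g_xs)        # (k,)
    return a @ y_onehot                       # (num_classes,)

# Toy usage: k = 5 support examples, 5 classes, 64-dim embeddings (illustrative sizes).
k, num_classes, dim = 5, 5, 64
g_xs = torch.randn(k, dim)                    # embedded support set g(x_i)
y_onehot = torch.eye(num_classes)             # one label per support example
f_xhat = torch.randn(dim)                     # embedded query f(x_hat)
print(predict(f_xhat, g_xs, y_onehot))        # sums to 1; argmax gives the predicted class
```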

2.2 Training Strategy

Let us define a task $T$ as a distribution over possible label sets $L$. To form an "episode" to compute gradients and update the model, we first sample $L$ from $T$ (e.g., $L$ could be the label set {cats, dogs}). We then use $L$ to sample the support set $S$ and a batch $B$ (i.e., both $S$ and $B$ are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error in predicting the labels in the batch $B$ conditioned on the support set $S$. This is a form of meta-learning, since the training procedure explicitly learns to learn from a given support set so as to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

$$ \theta = \arg\max_{\theta} E_{L\sim T}\Big[E_{S\sim L,B\sim L}\Big[\sum_{(x,y)\in B}\log P_{\theta}(y\mid x,S)\Big]\Big] $$

Training $\theta$ with this objective yields a model that works well when sampling $S'\sim T'$ from a different distribution of novel labels.
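The episodic procedure above can be sketched as follows. This is a hedged illustration, assuming a dataset given as a dict `{label: [examples]}` and a `matching_net(x, support_x, support_y)` callable that returns log-probabilities; the function names, episode sizes, and interface are illustrative, not the authors' code.

```python
import random
import torch

def sample_episode(dataset, n_way=5, k_shot=1, batch_per_class=2):
    """Sample a label set L ~ T, then a support set S and a batch B labelled with L."""
    labels = random.sample(list(dataset.keys()), n_way)          # L, e.g. {cats, dogs, ...}
    support, batch = [], []
    for cls_idx, lbl in enumerate(labels):
        examples = random.sample(dataset[lbl], k_shot + batch_per_class)
        support += [(x, cls_idx) for x in examples[:k_shot]]     # S
        batch += [(x, cls_idx) for x in examples[k_shot:]]       # B
    return support, batch

def episode_loss(matching_net, support, batch):
    """Negative log-likelihood of the batch labels conditioned on the support set."""
    sx = torch.stack([x for x, _ in support])
    sy = torch.tensor([y for _, y in support])
    loss = 0.0
    for x, y in batch:
        log_p = matching_net(x, sx, sy)                          # log P_theta(. | x, S)
        loss = loss - log_p[y]
    return loss / len(batch)
```

Minimising this per-episode loss over many sampled episodes is what realises the objective above.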

3. Appendix


3.1 The Fully Conditional Embedding $f$

The embedding function for an example $\hat{x}$ in the batch $B$ is as follows:

$$ f(\hat{x},S)=attLSTM(f'(\hat{x}),g(S),K) $$

where $f'$ is a neural network and $K$ is the number of "processing" steps, following prior work on processing sets with attention. $g(S)$ denotes the embedding function $g$ applied to each element $x_i$ of the set $S$. Thus, the state after $k$ processing steps is as follows:

$$ \hat{h}_k,c_k = LSTM(f'(\hat{x}),[h_{k-1},r_{k-1}],c_{k-1}) $$

$$ h_k = \hat{h}_k+f'(\hat{x}) $$

$$ r_{k-1}=\sum_{i=1}^{|S|}a(h_{k-1},g(x_i))g(x_i) $$

$$ a(h_{k-1},g(x_i))=softmax(h_{k-1}^Tg(x_i)) $$
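A sketch of this $K$-step read-attention loop is below. For runnability it feeds the readout $r$ into the LSTM *input* (concatenated with $f'(\hat{x})$) rather than into the hidden state as written in the equations, which is a common simplification in re-implementations; all sizes and module choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fully_conditional_f(f_prime_xhat, g_S, lstm_cell, K):
    """f(x_hat, S) = attLSTM(f'(x_hat), g(S), K)."""
    dim = f_prime_xhat.size(-1)
    h = torch.zeros(1, dim)
    c = torch.zeros(1, dim)
    r = torch.zeros(1, dim)
    for _ in range(K):
        inp = torch.cat([f_prime_xhat.unsqueeze(0), r], dim=-1)  # f'(x_hat) together with r_{k-1}
        h_hat, c = lstm_cell(inp, (h, c))
        h = h_hat + f_prime_xhat.unsqueeze(0)                    # skip connection: h_k = h_hat_k + f'(x_hat)
        a = F.softmax(g_S @ h.squeeze(0), dim=0)                 # a(h_k, g(x_i)) over the support set
        r = (a @ g_S).unsqueeze(0)                               # readout r_k
    return h.squeeze(0)

# Toy usage (illustrative sizes): 5 support embeddings of dimension 64, K = 3 steps.
dim, K = 64, 3
cell = nn.LSTMCell(input_size=2 * dim, hidden_size=dim)
print(fully_conditional_f(torch.randn(dim), torch.randn(5, dim), cell, K).shape)
```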

3.2 The Fully Conditional Embedding $g$

The encoding function for the elements of the support set $S$ is $g(x_i,S)$, implemented as a bidirectional LSTM. Let $g'(x_i)$ be a neural network; then we define $g(x_i,S)=\vec{h}_i+h_i^{\leftarrow}+g'(x_i)$ with:

$$ \vec{h}_i,\vec{c}_i=LSTM(g'(x_i),\vec{h}_{i-1},\vec{c}_{i-1}) $$

$$ h_i^{\leftarrow},c_i^{\leftarrow}=LSTM(g'(x_i),h_{i+1}^{\leftarrow},c_{i+1}^{\leftarrow}) $$
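A minimal sketch of this embedding, using `nn.LSTM` in bidirectional mode over the sequence $g'(x_1),\dots,g'(x_k)$: the split into forward/backward states and the residual sum follow the equations above, while the sizes and the ordering of the support set are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fully_conditional_g(g_prime_S, bilstm):
    """g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i) for every support element."""
    out, _ = bilstm(g_prime_S.unsqueeze(0))              # (1, k, 2*dim)
    dim = g_prime_S.size(-1)
    h_fwd, h_bwd = out[0, :, :dim], out[0, :, dim:]      # forward and backward hidden states
    return h_fwd + h_bwd + g_prime_S                     # residual connection to g'(x_i)

# Toy usage (illustrative sizes): k = 5 support embeddings of dimension 64.
dim = 64
bilstm = nn.LSTM(input_size=dim, hidden_size=dim, bidirectional=True, batch_first=True)
g_prime_S = torch.randn(5, dim)
print(fully_conditional_g(g_prime_S, bilstm).shape)      # torch.Size([5, 64])
```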

Reference: https://arxiv.org/abs/1606.04080
