Notes on Training recurrent networks online without backtracking

Link: http://arxiv.org/abs/1507.07680

Summary

This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the forward mode of automatic differentiation; but since that requires maintaining a large Jacobian matrix (number of hidden units times number of parameters), they propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing it over time.
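To make the forward-mode picture concrete, here is a minimal sketch, assuming a vanilla tanh RNN with weight matrices W and U (the names, shapes and update rule are my own illustration, not the paper's algorithm), of carrying the full Jacobian G(t) = dh(t)/dθ forward in time. This is the expensive object that NoBackTrack avoids storing exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                          # hidden / input sizes (illustrative)
W = 0.1 * rng.standard_normal((n_h, n_h))
U = 0.1 * rng.standard_normal((n_h, n_x))
n_params = W.size + U.size

h = np.zeros(n_h)
G = np.zeros((n_h, n_params))            # G(t) = dh(t)/dtheta, the large Jacobian

def forward_step(h, G, x):
    """One RNN step h -> tanh(W h + U x), plus the forward-mode update of G."""
    h_new = np.tanh(W @ h + U @ x)
    d = 1.0 - h_new ** 2                 # tanh'(.) at the pre-activation

    # Partial derivatives of the transition F(h, x, theta) = tanh(W h + U x)
    dF_dh = d[:, None] * W               # (n_h, n_h)
    dF_dtheta = np.zeros((n_h, n_params))
    for i in range(n_h):
        dF_dtheta[i, i * n_h:(i + 1) * n_h] = d[i] * h                    # wrt row i of W
        dF_dtheta[i, W.size + i * n_x:W.size + (i + 1) * n_x] = d[i] * x  # wrt row i of U

    # Chain rule, carried forward in time instead of backward
    return h_new, dF_dh @ G + dF_dtheta

# With G(t) in hand, the gradient of an instantaneous loss L(h(t)) with respect
# to all parameters is just (dL/dh(t)) @ G(t); no backward pass through time.
for t in range(10):
    h, G = forward_step(h, G, rng.standard_normal(n_x))
```

The catch, and the reason the paper needs the stochastic estimate, is that G has (number of hidden units) × (number of parameters) entries, which is far too large to maintain for realistic networks.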

My two cents

Online training of RNNs is a big, unsolved problem. The current approach people use is to truncate backprop through time to only a few steps in the past, which is more of a heuristic than a principled solution.

This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!
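For context, my reading of that trick is: a sum of outer products A = Σᵢ vᵢ wᵢᵀ can be replaced by the single outer product (Σᵢ εᵢ vᵢ)(Σᵢ εᵢ wᵢ)ᵀ, with independent random signs εᵢ = ±1, which is rank one yet unbiased because E[εᵢ εⱼ] = δᵢⱼ. A quick numerical check of that property (my own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, m = 5, 4, 6
V = rng.standard_normal((k, n))          # the vectors v_i, stored as rows
Wm = rng.standard_normal((k, m))         # the vectors w_i, stored as rows
A = V.T @ Wm                             # A = sum_i v_i w_i^T

# Average many independent rank-one estimates (sum_i eps_i v_i)(sum_i eps_i w_i)^T
est = np.zeros_like(A)
n_samples = 50_000
for _ in range(n_samples):
    eps = rng.choice([-1.0, 1.0], size=k)
    est += np.outer(eps @ V, eps @ Wm)
est /= n_samples

print(np.max(np.abs(est - A)))           # shrinks as n_samples grows: unbiased, but noisy
```

As far as I understand, this is what lets the method keep its running estimate of G(t) as an outer product of two vectors, so the per-step cost stays comparable to that of the forward pass itself.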

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work.

Also, I don’t think I buy their argument that the "theory of stochastic gradient descent applies". Here’s the reason. The method tracks the Jacobian of the hidden state with respect to the parameters, which they denote G(t). It is updated into G(t+1) using a recursion based on the chain rule. However, between computing G(t) and G(t+1), a gradient step is performed during training. This means that G(t) is now slightly stale: it corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that G(t+1) (more specifically, its stochastic estimate as proposed in this paper) isn’t unbiased anymore. So, unless I’m missing something (which I might!), I don’t think we can invoke the theory of SGD as they suggest.
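To spell the concern out in my own notation (which may differ from the paper's): writing the transition as h(t+1) = F(h(t), x(t+1), θ_t), the recursion being run each step is

$$G(t+1) = \frac{\partial F}{\partial h}\big(h(t), x(t+1), \theta_t\big)\, G(t) + \frac{\partial F}{\partial \theta}\big(h(t), x(t+1), \theta_t\big),$$

but the G(t) on the right-hand side was accumulated under θ_{t-1}, θ_{t-2}, and so on. Once the parameters move at every step, the left-hand side is therefore only an approximation of ∂h(t+1)/∂θ_t, not the exact Jacobian.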

But frankly, that last issue seems pretty unavoidable in the online setting. I suspect it will never be fully solved, and future research will somehow have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn’t one).

So overall, kudos to the authors, and I’m really looking forward to reading more about where this research goes!

The Fine Print: I write these notes sometimes hastily, and thus they might not always perfectly reflect what’s in the paper. They are mostly meant to provide a first impression of the paper’s topic, contribution and achievements. If your appetite is whetted, I’d recommend you dive into the paper and check for yourself. Oh, and do let me know if you think I got things wrong :-)
