Notes on Training recurrent networks online without backtracking

Link: http://arxiv.org/abs/1507.07680

Summary

This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the forward mode of automatic differentiation; but since that requires maintaining a large Jacobian matrix (number of hidden units times number of parameters), they propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing it over time.
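To make the forward-mode picture concrete, here is a minimal sketch, assuming a vanilla tanh RNN with weight matrices W and U (the names, shapes and update rule are my own illustration, not the paper's algorithm), of carrying the full Jacobian G(t) = dh(t)/dθ forward in time. This is the expensive object that NoBackTrack avoids storing exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                          # hidden / input sizes (illustrative)
W = 0.1 * rng.standard_normal((n_h, n_h))
U = 0.1 * rng.standard_normal((n_h, n_x))
n_params = W.size + U.size

h = np.zeros(n_h)
G = np.zeros((n_h, n_params))            # G(t) = dh(t)/dtheta, the large Jacobian

def forward_step(h, G, x):
    """One RNN step h -> tanh(W h + U x), plus the forward-mode update of G."""
    h_new = np.tanh(W @ h + U @ x)
    d = 1.0 - h_new ** 2                 # tanh'(.) at the pre-activation

    # Partial derivatives of the transition F(h, x, theta) = tanh(W h + U x)
    dF_dh = d[:, None] * W               # (n_h, n_h)
    dF_dtheta = np.zeros((n_h, n_params))
    for i in range(n_h):
        dF_dtheta[i, i * n_h:(i + 1) * n_h] = d[i] * h                    # wrt row i of W
        dF_dtheta[i, W.size + i * n_x:W.size + (i + 1) * n_x] = d[i] * x  # wrt row i of U

    # Chain rule, carried forward in time instead of backward
    return h_new, dF_dh @ G + dF_dtheta

# With G(t) in hand, the gradient of an instantaneous loss L(h(t)) with respect
# to all parameters is just (dL/dh(t)) @ G(t); no backward pass through time.
for t in range(10):
    h, G = forward_step(h, G, rng.standard_normal(n_x))
```

The catch, and the reason the paper needs the stochastic estimate, is that G has (number of hidden units) × (number of parameters) entries, which is far too large to maintain for realistic networks.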

My two cents

Online training of RNNs is a big, unsolved problem. The current approach people use is to truncate backprop through time to only a few steps in the past, which is more of a heuristic than a principled solution.

This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!
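For context, my reading of that trick is: a sum of outer products A = Σᵢ vᵢ wᵢᵀ can be replaced by the single outer product (Σᵢ εᵢ vᵢ)(Σᵢ εᵢ wᵢ)ᵀ, with independent random signs εᵢ = ±1, which is rank one yet unbiased because E[εᵢ εⱼ] = δᵢⱼ. A quick numerical check of that property (my own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, m = 5, 4, 6
V = rng.standard_normal((k, n))          # the vectors v_i, stored as rows
Wm = rng.standard_normal((k, m))         # the vectors w_i, stored as rows
A = V.T @ Wm                             # A = sum_i v_i w_i^T

# Average many independent rank-one estimates (sum_i eps_i v_i)(sum_i eps_i w_i)^T
est = np.zeros_like(A)
n_samples = 50_000
for _ in range(n_samples):
    eps = rng.choice([-1.0, 1.0], size=k)
    est += np.outer(eps @ V, eps @ Wm)
est /= n_samples

print(np.max(np.abs(est - A)))           # shrinks as n_samples grows: unbiased, but noisy
```

As far as I understand, this is what lets the method keep its running estimate of G(t) as an outer product of two vectors, so the per-step cost stays comparable to that of the forward pass itself.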

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work.

Also, I don’t think I buy their argument that the "theory of stochastic gradient descent applies". Here’s the reason. The method tracks the Jacobian of the hidden state with respect to the parameters, which they denote G(t). It is updated into G(t+1) using a recursion based on the chain rule. However, between computing G(t) and G(t+1), a gradient step is performed during training. This means that G(t) is now slightly stale: it corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that G(t+1) (more specifically, its stochastic estimate as proposed in this paper) isn’t unbiased anymore. So, unless I’m missing something (which I might!), I don’t think we can invoke the theory of SGD as they suggest.
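To spell the concern out in my own notation (which may differ from the paper's): writing the transition as h(t+1) = F(h(t), x(t+1), θ_t), the recursion being run each step is

$$G(t+1) = \frac{\partial F}{\partial h}\big(h(t), x(t+1), \theta_t\big)\, G(t) + \frac{\partial F}{\partial \theta}\big(h(t), x(t+1), \theta_t\big),$$

but the G(t) on the right-hand side was accumulated under θ_{t-1}, θ_{t-2}, and so on. Once the parameters move at every step, the left-hand side is therefore only an approximation of ∂h(t+1)/∂θ_t, not the exact Jacobian.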

But frankly, that last issue seems pretty unavoidable in the online setting. I suspect it will never be fully solved, and future research will somehow have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn’t one).

So overall, kudos to the authors, and I’m really looking forward to reading more about where this research goes!

The Fine Print: I write these notes sometimes hastily, and thus they might not always perfectly reflect what’s in the paper. They are mostly meant to provide a first impression of the paper’s topic, contribution and achievements. If your appetite is whetted, I’d recommend you dive into the paper and check for yourself. Oh, and do let me know if you think I got things wrong :-)
