77. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen
[in German] Diploma thesis, T.U. Munich (1991).

78. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term
dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5,
157–166 (1994).


79. Hochreiter, S. & Schmidhuber, J. Long short-term memory.
Neural Comput. 9, 1735–1780 (1997).

This paper introduced LSTM
recurrent networks, which have become a crucial ingredient in recent advances
with recurrent networks because they are good at learning long-range
dependencies.

80. ElHihi, S. & Bengio, Y. Hierarchical recurrent neural
networks for long-term dependencies. In Proc. Advances in Neural Information
Processing Systems 8


81. Sutskever, I. Training Recurrent Neural Networks. PhD thesis,
Univ. Toronto (2012).

82. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of
training recurrent neural networks. In Proc. 30th International Conference on
Machine Learning 1310–1318 (2013).


83. Sutskever, I., Martens, J. & Hinton, G. E. Generating text
with recurrent neural networks. In Proc. 28th International Conference on
Machine Learning 1017–1024 (2011).


75. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J.
Distributed representations of words and phrases and their compositionality. In
Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).


17. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence
learning with neural networks. In Proc. Advances in Neural Information
Processing Systems 27 3104–3112 (2014).

This paper showed
state-of-the-art machine translation results with the architecture introduced in
ref. 72, with a recurrent network trained to read a sentence in one language,
produce a semantic representation of its meaning, and generate a translation in
another language.

72. Cho, K. et al. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. In Proc. Conference on
Empirical Methods in Natural Language Processing 1724–1734 (2014).

76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine
translation by jointly learning to align and translate. In Proc. International
Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).


84. Lakoff, G. & Johnson, M. Metaphors We Live By (Univ. Chicago
Press, 2008).

85. Rogers, T. T. & McClelland, J. L. Semantic Cognition: A
Parallel Distributed Processing Approach (MIT Press, 2004).


86. Xu, K. et al. Show, attend and tell: Neural image caption
generation with visual attention. In Proc. International Conference on Learning
Representations http://arxiv.org/abs/1502.03044 (2015).


LSTM networks can have several layers
for each time step.

87. Graves, A., Mohamed, A.-R. & Hinton, G. Speech recognition
with deep recurrent neural networks. In Proc. International Conference on
Acoustics, Speech and Signal Processing 6645–6649 (2013).


88. Graves, A., Wayne, G. &
Danihelka, I. Neural Turing machines. http://arxiv.org/abs/1410.5401 (2014).


89. Weston, J., Chopra, S. & Bordes, A. Memory networks.
http://arxiv.org/abs/1410.3916 (2014).


90. Weston, J., Bordes, A., Chopra, S. & Mikolov, T. Towards
AI-complete question answering: a set of prerequisite toy tasks.
http://arxiv.org/abs/1502.05698 (2015).


91. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The
wake-sleep algorithm for unsupervised neural networks. Science 268, 1158–1161
(1995).

92. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In
Proc. International Conference on Artificial Intelligence and Statistics
448–455 (2009).

93. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A.
Extracting and composing robust features with denoising autoencoders. In Proc.
25th International Conference on Machine Learning 1096–1103 (2008).

94. Kavukcuoglu, K. et al. Learning convolutional feature
hierarchies for visual recognition. In Proc. Advances in Neural Information
Processing Systems 23 1090–1098 (2010).

95. Gregor, K. & LeCun, Y. Learning fast approximations of
sparse coding. In Proc. International Conference on Machine Learning 399–406
(2010).

96. Ranzato, M., Mnih, V., Susskind, J. M. & Hinton, G. E.
Modeling natural images using gated MRFs. IEEE Trans. Pattern Anal. Machine
Intell. 35, 2206–2222 (2013).

97. Bengio, Y., Thibodeau-Laufer, E., Alain, G. & Yosinski, J.
Deep generative stochastic networks trainable by backprop. In Proc. 31st
International Conference on Machine Learning 226–234 (2014).

98. Kingma, D., Rezende, D., Mohamed, S. & Welling, M.
Semi-supervised learning with deep generative models. In Proc. Advances in
Neural Information Processing Systems 27 3581–3589 (2014).


99. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object
recognition with visual attention. In Proc. International Conference on
Learning Representations http://arxiv.org/abs/1412.7755 (2014).


100. Mnih, V. et al. Human-level control through deep reinforcement
learning. Nature 518, 529–533 (2015).

RNNs learn strategies for
selectively attending to one part at a time.

76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine
translation by jointly learning to align and translate. In Proc. International
Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).

86. Xu, K. et al. Show, attend and tell: Neural image caption
generation with visual attention. In Proc. International Conference on Learning
Representations http://arxiv.org/abs/1502.03044 (2015).


101. Bottou, L. From machine learning to machine reasoning. Mach.
Learn. 94, 133–149 (2014).


102. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and
tell: a neural image caption generator. In Proc. International Conference on
Machine Learning http://arxiv.org/abs/1411.4555 (2014).


103. van der Maaten, L. & Hinton, G. E. Visualizing data using
t-SNE. J. Mach. Learn. Research 9, 2579–2605 (2008).

