Paper Reading - Attention Is All You Need ( NIPS 2017 )

Link of the Paper: https://arxiv.org/abs/1706.03762

Motivation:

The inherently sequential nature of Recurrent Models precludes parallelization within training examples.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

Innovation:

The first sequence transduction model, the Transformer, relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or Convolutions. The Transformer follows the overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
- Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The authors employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm (x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
- Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, they employ residual connections around each of the sub-layers, followed by layer normalization. They also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Scaled Dot-Product Attention and Multi-Head Attention.

General Points:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

原文地址：https://www.cnblogs.com/zlian2016/p/9580587.html

时间： 2024-11-09 03:40:53

Paper Reading - Attention Is All You Need ( NIPS 2017 )的相关文章

Paper Reading: Stereo DSO

开篇第一篇就写一个paper reading吧,用markdown+vim写东西切换中英文挺麻烦的,有些就偷懒都用英文写了. Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras Abstract Optimization objectives: intrinsic/extrinsic parameters of all keyframes all selected pixels' depth Inte

[Paper Reading] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

论文链接:https://arxiv.org/pdf/1502.03044.pdf 代码链接:https://github.com/kelvinxu/arctic-captions & https://github.com/yunjey/show-attend-and-tell & https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow 主要贡献在这篇文章中,作者将“注意力机制(Attention Mechanism

Paper Reading: Perceptual Generative Adversarial Networks for Small Object Detection

Perceptual Generative Adversarial Networks for Small Object Detection 2017-07-11 19:47:46 CVPR 2017 This paper use GAN to handle the issue of small object detection which is a very hard problem in general object detection. As shown in the followin

【Paper Reading】Object Recognition from Scale-Invariant Features

Paper: Object Recognition from Scale-Invariant Features Sorce: http://www.cs.ubc.ca/~lowe/papers/iccv99.pdf SIFT 即Scale Invariant Feature Transfrom, 尺度不变变换,由David Lowe提出.是CV最著名也最常用的特征.在图像目标识别的应用中,常常要求图像的特征有很好的roboust即不容易受到平移,旋转,尺度缩放,光照,仿射的英雄.SIFT算子具有

Paper Reading: Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking

Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual TrackingECCV 2016 The key point of KCF is the ability to efficiently exploit available negative data by including all shifted versions of a training sample, in anthor w

CVPR 2016 paper reading (6)

1. Neuroaesthetics in fashion: modeling the perception of fashionability, Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, Raquel Urtasun, in CVPR 2015. Goal: learn and predict how fashionable a person looks on a photograph, and suggest subtle

paper reading in 1/1/2016~1/3/2016

CVPR15:Person Count Localization in Videos from Noisy Foreground and Detections paper主要的contribution是定义了person count localization及其周边,不过虽然之前提过的person count问题常用结果评估标准是只看最后给出的counts,但其实之前的文章也并不完全是只给出global counts的.文中可能是更加重视这个localization的问题并且确实是利用这个信息去

Paper Reading 1 - Playing Atari with Deep Reinforcement Learning

来源:NIPS 2013 作者:DeepMind 理解基础: 增强学习基本知识深度学习特别是卷积神经网络的基本知识创新点:第一个将深度学习模型与增强学习结合在一起从而成功地直接从高维的输入学习控制策略详细是将卷积神经网络和Q Learning结合在一起.卷积神经网络的输入是原始图像数据(作为状态)输出则为每一个动作相应的价值Value Function来预计未来的反馈Reward 实验成果:使用同一个网络学习玩Atari 2600 游戏.在測试的7个游戏中6个超过了以往的方法而且好几个超

Paper Reading: In Defense of the Triplet Loss for Person Re-Identification

In Defense of the Triplet Loss for Person Re-Identification 2017-07-02 14:04:20 This blog comes from: http://blog.csdn.net/shuzfan/article/details/70069822 Paper: https://arxiv.org/abs/1703.07737 Github: https://github.com/VisualComputingInstitu