(Repost) The Evolved Transformer - Enhancing Transformer with Neural Architecture Search


2019-03-26 19:14:33

Paper:"The Evolved Transformer." So, David R., Chen Liang, and Quoc V. Le.  arXiv preprint arXiv:1901.11117 (2019).

This blog is copied from https://towardsdatascience.com/the-evolved-transformer-enhancing-transformer-with-neural-architecture-search-f0073a915aca

Neural architecture search (NAS) is the process of algorithmically searching for new neural network designs. Though researchers have developed sophisticated architectures over the years, our ability to find the most efficient ones is limited, and NAS has recently reached the point where it can outperform human-designed models.

A new paper by Google Brain presents the first NAS study to improve the Transformer, one of the leading architectures for many Natural Language Processing tasks. The paper uses an evolution-based algorithm, with a novel approach to speed up the search process, to mutate the Transformer architecture and discover a better one: The Evolved Transformer (ET). The new architecture performs better than the original Transformer, especially when comparing small mobile-friendly models, and requires less training time. The concepts presented in the paper, such as the use of NAS to evolve human-designed models, have the potential to help researchers improve their architectures in many other areas.

Background

Transformers, first suggested in 2017, introduced an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words. A Transformer includes two parts: an encoder that reads the text input and generates a latent representation of it (e.g. a vector for each word), and a decoder that produces the translated text from that representation. The design has proven to be very effective, and many of today’s state-of-the-art models (e.g. BERT, GPT-2) are based on Transformers. An in-depth review of Transformers can be found here.

While the Transformer architecture was hand-crafted by talented researchers, an alternative is to use search algorithms. Their goal is to find the best architecture in a given search space: a space that defines the constraints of any model in it, such as the number of layers, the maximum number of parameters, etc. A well-known search algorithm is the evolution-based Tournament Selection, in which the fittest architectures survive and mutate while the weakest die. Its advantage is simplicity while remaining efficient. The paper relies on a version presented in Real et al. (see pseudo-code in Appendix A):

  1. The first pool of models is initialized by randomly sampling the search space or by using a known model as a seed.
  2. These models are trained for the given task and randomly sampled to create a subpopulation.
  3. The best models are mutated by randomly changing a small part of their architecture, such as replacing a layer or changing the connection between two layers.
  4. The mutated models (child models) are added to the pool while the weakest model from the subpopulation is removed from the pool.

Defining the search space is an additional challenge when solving a search problem. If the space is too broad and undefined, the algorithm might not converge and find a better model in a reasonable amount of time. On the other hand, a space that is too narrow reduces the probability of finding an innovative model that outperforms the hand-crafted ones. The NASNet search architecture approaches this challenge by defining “stackable cells”. A cell can contain a set of operations on its input (e.g. convolution) from a predefined vocabulary and the model is built by stacking the same cell architecture several times. The goal of the search algorithm is only to find the best architecture of a cell.

An example of the NASNet search architecture for an image classification task, containing two types of stackable cells (Normal and Reduction Cells). Source: Zoph et al.
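
To make the stackable-cell idea concrete, here is a minimal TensorFlow sketch of building a model by stacking one searched cell. The build_model_from_cell helper and the toy example_cell are illustrative stand-ins, not the actual NASNet or ET cells.

```python
import tensorflow as tf

def build_model_from_cell(cell_fn, num_stacks=6, num_classes=10):
    # Stack the same searched cell several times; only cell_fn's architecture is searched.
    inputs = tf.keras.Input(shape=(32, 32, 3))
    prev, cur = inputs, inputs
    for _ in range(num_stacks):
        nxt = cell_fn(cur, prev)  # the searched cell, reused at every level of the stack
        prev, cur = cur, nxt
    x = tf.keras.layers.GlobalAveragePooling2D()(cur)
    outputs = tf.keras.layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

def example_cell(cur, prev):
    # A toy cell standing in for a searched one: two convolutions combined by addition.
    a = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(cur)
    b = tf.keras.layers.Conv2D(32, 1, padding="same", activation="relu")(prev)
    return tf.keras.layers.add([a, b])

model = build_model_from_cell(example_cell)
```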

How Evolved Transformer (ET) works

As the Transformer architecture has proven itself numerous times, the goal of the authors was to use a search algorithm to evolve it into an even better model. As a result, the model frame and the search space were designed to fit the original Transformer architecture in the following way:

  1. The algorithm searches for two types of cells: one for the encoder with six copies (blocks) and another for the decoder with eight copies.
  2. Each block includes two branches of operations, as shown in the chart below. For example, each branch's input can be any output of the previous layers (blocks), its layer can be a standard convolution, an attention head (see Transformer), etc., and its activation can be ReLU or Leaky ReLU. Some elements can also be an identity operation or a dead-end.
  3. Each cell can be repeated up to six times.

In total, the search space adds up to ~7.3 * 10^115 possible models. A detailed description of the space can be found in the appendix of the paper.

ET Stackable Cell format. Source: ET
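
To make the block format more tangible, here is an illustrative encoding of one search-space block in Python. The field names and vocabulary entries only approximate the paper's description; the exact vocabularies are listed in its appendix.

```python
from dataclasses import dataclass

# Illustrative vocabularies; the exact lists are in the paper's appendix.
LAYERS = ["standard_conv_1x1", "standard_conv_3x1", "depthwise_separable_conv_9x1",
          "attention_head", "gated_linear_unit", "identity", "dead_branch"]
ACTIVATIONS = ["relu", "leaky_relu", "none"]

@dataclass
class Branch:
    input_index: int  # which earlier block's output feeds this branch
    layer: str        # one entry from LAYERS
    activation: str   # one entry from ACTIVATIONS

@dataclass
class Block:
    left: Branch
    right: Branch
    combiner: str     # how the two branches are merged, e.g. "add" or "concat"

# One cell is a list of blocks, and the model repeats the same cell several times.
example_block = Block(
    left=Branch(input_index=0, layer="attention_head", activation="none"),
    right=Branch(input_index=1, layer="standard_conv_3x1", activation="relu"),
    combiner="add",
)
```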

Progressive Dynamic Hurdles (PDH)

Searching the entire space might take too long if the training and evaluation of each model are prolonged. In image classification this problem can be overcome by performing the search on a proxy task, such as training on a smaller dataset (e.g. CIFAR-10) before testing on a bigger dataset such as ImageNet. However, the authors couldn’t find an equivalent solution for translation models and therefore introduced an upgraded version of the tournament selection algorithm.

Instead of training each model in the pool on the entire dataset, a process that takes ~10 hours on a single TPU, the training is done gradually and only for the best models in the pool. The models in the pool are trained on a given number of samples, and more models are created according to the original tournament selection algorithm. Once there are enough models in the pool, a “fitness” threshold is calculated and only the models with better results (fitness) continue to the next step. These models are trained on another batch of samples, and the next models are created and mutated based on them. As a result, PDH significantly reduces the training time spent on failing models and increases search efficiency. The downside is that “slow starters”, models that need more samples to achieve good results, might be missed.

An example of the tournament selection training process. The models above the fitness threshold are trained on more samples and therefore reach better fitness. The fitness threshold increases in steps as new models are created. Source: ET
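
The sketch below captures the core PDH idea in simplified form: train the surviving models a little, raise a fitness hurdle, and let only the models that clear every hurdle so far keep training. train_for_steps and evaluate_fitness are hypothetical callbacks, and the hurdle computation is simplified relative to the paper.

```python
def progressive_dynamic_hurdles(population, train_for_steps, evaluate_fitness,
                                steps_per_hurdle, num_hurdles):
    # Simplified sketch of PDH: a model must clear every hurdle established so far
    # to receive further training; hurdles rise as the population improves.
    hurdles = []
    for _ in range(num_hurdles):
        survivors = []
        for model in population:
            train_for_steps(model, steps_per_hurdle)  # hypothetical training callback
            fitness = evaluate_fitness(model)         # hypothetical evaluation callback
            if all(fitness >= h for h in hurdles):
                survivors.append((model, fitness))
        if not survivors:
            break
        # New hurdle: here, the mean fitness of the models that are still in the race.
        hurdles.append(sum(f for _, f in survivors) / len(survivors))
        population = [m for m, _ in survivors]
    return population, hurdles
```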

To “help” the search achieve high-quality results, the authors initialized the search with the Transformer model instead of a completely random model. This step is necessary due to computing resource constraints. The table below compares the performance of the best model (using the perplexity metric, the lower the better) of different search techniques: Transformer vs. random initialization, and PDH vs. regular tournament selection (with a given number of training steps per model).

Comparison of different search techniques. Source: ET

The authors kept the total training time of each technique fixed, so the number of models differs: more training steps per model means fewer models can be searched in total, and vice versa. The PDH technique achieves the best results on average while being more stable (low variance). When the number of training steps is reduced (30K), the regular technique performs almost as well on average as PDH. However, it suffers from a higher variance as it is more prone to mistakes in the search process.

Results

The paper uses the described search space and PDH to find a translation model that performs well on known datasets such as WMT’18. The search algorithm ran on 15,000 models using 270 TPUs for a total of almost 1 billion steps, while without PDH the total required steps would have been 3.6 billion. The best model found was named The Evolved Transformer (ET) and achieved better results compared to the original Transformer (perplexity of 3.95 vs 4.05) and required less training time. Its encoder and decoder block architectures are shown in the following chart (compared to the original ones).

Transformer and ET encoder and decoder architectures. Source: ET

While some of the ET components are similar to the original ones, others are less conventional, such as depth-wise separable convolutions, which are more parameter-efficient but less powerful than normal convolutions. Another interesting example is the use of parallel branches (e.g. two convolution and ReLU layers applied to the same input) in both the decoder and the encoder. The authors also discovered in an ablation study that the superior performance cannot be attributed to any single mutation of ET compared to the Transformer.
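
To see why a depth-wise separable convolution is more parameter-efficient, the following Keras sketch compares parameter counts. The layer widths and kernel size are arbitrary illustrations, not the actual ET dimensions.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 256))  # (sequence length, channels); arbitrary sizes
standard = tf.keras.layers.Conv1D(256, 9, padding="same")
separable = tf.keras.layers.SeparableConv1D(256, 9, padding="same")
standard(inputs)
separable(inputs)

# A full convolution mixes channels and positions jointly; the separable version
# factorizes it into a cheap per-channel convolution plus a 1x1 pointwise mix.
print("standard conv params: ", standard.count_params())   # 9*256*256 + 256 = 590,080
print("separable conv params:", separable.count_params())  # 9*256 + 256*256 + 256 = 68,096
```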

Both ET and the Transformer are heavy models with over 200 million parameters. Their size can be reduced by changing the input embedding (i.e. word vector) size and scaling the rest of the layers accordingly. Interestingly, the smaller the model, the bigger the advantage of ET over the Transformer. For example, for the smallest model with only 7 million parameters, ET outperforms the Transformer by 1 perplexity point (7.62 vs 8.62).

Comparison of Transformer and ET for different model sizes (according to embedding size). FLOPS represents the training duration of the model. Source: ET

Implementation details

As mentioned, the search algorithm used over 200 of Google's TPUs in order to train thousands of models in a reasonable time. Training the final ET model itself is faster than training the original Transformer but still takes hours with a single TPU on the WMT'14 En-De dataset.

The code is open-source and available for TensorFlow here.

Conclusion

The Evolved Transformer shows the potential of combining hand-crafted designs with neural architecture search to create architectures that are consistently better and faster to train. As computing resources are still limited (even for Google), researchers still need to carefully design the search space and improve the search algorithms to outperform human-designed models. However, this trend will undoubtedly grow stronger over time.

To stay updated with the latest Deep Learning research, subscribe to my newsletter on LyrnAI

Appendix A - Tournament Selection Algorithm

The paper is based on the tournament selection algorithm from Real et al., except for the aging process of discarding the oldest models from the population:
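
Below is a minimal Python sketch of that variant; train_and_eval and mutate are hypothetical helpers, and the hyperparameter values are placeholders rather than the paper's settings.

```python
import copy
import random

def tournament_selection(seed_model, train_and_eval, mutate,
                         population_size=100, sample_size=30, num_rounds=1000):
    # Sketch of the tournament selection loop described above. Unlike Real et al.,
    # the weakest model of the sampled subpopulation is removed, not the oldest one.
    population = []
    # 1. Initialize the pool, here by mutating a seed model (e.g. the Transformer).
    for _ in range(population_size):
        candidate = mutate(copy.deepcopy(seed_model))
        population.append({"model": candidate, "fitness": train_and_eval(candidate)})

    for _ in range(num_rounds):
        # 2. Randomly sample a subpopulation from the pool.
        sample = random.sample(population, sample_size)
        # 3. Mutate the fittest model of the sample to create a child model.
        parent = max(sample, key=lambda m: m["fitness"])
        child = mutate(copy.deepcopy(parent["model"]))
        population.append({"model": child, "fitness": train_and_eval(child)})
        # 4. Remove the weakest model of the sample from the pool.
        weakest = min(sample, key=lambda m: m["fitness"])
        population.remove(weakest)
    return max(population, key=lambda m: m["fitness"])
```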

Original post: https://www.cnblogs.com/wangxiaocvpr/p/10602764.html
