Convolutional Nets and CIFAR-10: An Interview with Yann LeCun

Recently Kaggle hosted a competition on the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes. This dataset was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
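
For readers who want a feel for the data, here is a minimal loading sketch (it assumes the torchvision package is available; any CIFAR-10 loader would do):

```python
# Minimal sketch: download and inspect CIFAR-10 (assumes torchvision is installed).
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
image, label = train_set[0]
print(len(train_set))            # 50000 training images (the test split has 10000 more)
print(image.shape)               # torch.Size([3, 32, 32]) -- a 32x32 colour image
print(train_set.classes[label])  # one of the 10 class names, e.g. 'frog'
```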

Many contestants used convolutional nets to tackle this competition. Some achieved scores that beat human performance on this classification task. In this blog series we will interview three contestants and also a founding father of convolutional nets: Yann LeCun.

Example images from the CIFAR-10 dataset.

Yann LeCun

Yann LeCun is currently Director of AI Research at Facebook and a professor at NYU.

Which other scientists should be named and celebrated for the successes of convolutional nets?

Certainly, Kunihiko Fukushima’s work on the Neocognitron was an inspiration. Although the early forms of convnets owed little to the Neocognitron, the version we settled on (with pooling layers) did.

A schematic diagram illustrating the interconnections between layers in the Neocognitron. From Fukushima K. (1980) "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position".

Can you recount an aha!-moment or theoretical breakthrough during your early research into convolutional nets?

Not really. It was the logical thing to do. I had been playing with multi-layer nets with local connections since 1982 or so (though I didn’t have the right learning algorithm; backprop didn’t exist yet). I started experimenting with shared-weight nets while I was a postdoc in Toronto in 1988.

The reason for not trying earlier is simply that I didn’t have the software or the data. Once I arrived at Bell Labs, I had access to a large dataset and fast computers (for the time). So I could try full-size convnets, and it worked amazingly well (though it required 2 weeks of training).

What is your opinion on the recent popularity of convolutional nets for object recognition? Did you expect it?

Yes. I knew it had to happen. It was only a matter of time before the datasets became large enough and the computers powerful enough for deep learning algorithms to become better than human engineers at designing vision systems.

There was a symposium entitled "Frontiers in Computer Vision" at MIT in August 2011. The title of my talk was “5 years from now, everyone will learn their features (you might as well start now)”. David Lowe (the inventor of SIFT) said the same thing.

A slide from the talk LeCun Y. (2011) “5 years from now, everyone will learn their features (you might as well start now)”.

Still, I was surprised by how fast the revolution happened and how much better convnets are, compared to other approaches. I would have expected the transition to be more gradual. Also, I would have expected unsupervised learning to play a greater role.

The character recognition model at AT&T was more than a simple classifier: it was a complete pipeline. Can you tell us more about the implementation problems your team faced?

We had to implement our own programming language and write our own compiler to build this. Leon Bottou and I had written a neural net simulator called SN back in 1987/1988. It was a Lisp interpreter with a numerical library (multidimensional arrays, neural net graphs…). We used this at Bell Labs to develop the first convnets.

Then in the early 90’s, we wanted to use our code in products. Initially, we hired a team of developers to convert our Lisp code to C/C++. But the resulting system could not be improved easily (it wasn’t a good platform for R&D). So Leon, Patrice Simard and I wrote a compiler for SN, which we used to develop the next generation OCR engine.

That system integrated a segmenter, a convnet, and a graphical model on top. The whole thing was trained end to end.

The graphical model was called a “graph transformer network”. It was conceptually similar to what we now call a conditional random field or a structured perceptron (both of which it predates), but it allowed for non-linear scoring functions (CRFs and structured perceptrons can only have linear scoring functions).
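
As a rough illustration of that distinction (a toy sketch only, not the original SN/GTN implementation; the feature map and dimensions are made up for the example), here is how a linear scoring function differs from a non-linear one over the same joint features:

```python
# Illustrative sketch: linear vs. non-linear scoring of an (input, structured output) pair.
import numpy as np

rng = np.random.default_rng(0)

def joint_features(x, y):
    """Toy joint feature vector for an (input, structured output) pair."""
    return np.concatenate([x * y.mean(), x, y])

def linear_score(x, y, w):
    """CRF / structured-perceptron style: the score is linear in the features."""
    return w @ joint_features(x, y)

def nonlinear_score(x, y, W1, w2):
    """GTN style: the score passes through a non-linearity (a tiny MLP here)."""
    return w2 @ np.tanh(W1 @ joint_features(x, y))

x = rng.normal(size=4)                        # a toy input
y = rng.integers(0, 2, size=3).astype(float)  # a toy structured output
d = joint_features(x, y).size                 # 4 + 4 + 3 = 11 features

w = rng.normal(size=d)
W1, w2 = rng.normal(size=(8, d)), rng.normal(size=8)
print(linear_score(x, y, w), nonlinear_score(x, y, W1, w2))
```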

The whole infrastructure was written in SN and compiled. This is the system that was deployed in ATMs and check-reading machines in 1996, and it was reading 10 to 20% of all the checks in the US by the late 90’s.

An animation showing LeNet-5 in action. From "Invariance and multiple characters with SDNN (multiple characters demo)".

In comparison with other methods, training convnets is pretty slow. How do you deal with the trade-off between experimentation and increased model training times? What does a typical development iteration look like?

In my experience, the best large-scale learning systems always take 2 or 3 weeks to train, regardless of the task, the method, the hardware, or the data.

I don’t know if convnets are “pretty slow”. Compared to what? They may be slow to train, but the alternative to “slow learning” is months of engineering effort that doesn’t work as well in the end. Also, convnets are actually pretty fast to run (after training).

In a real application, no one really cares how long it takes to train. But people care a lot about how long it takes to run.

Which recent papers on convolutional nets are you most excited about? Any papers or ideas we should look out for?

There are lots and lots of ideas surrounding convnets and deep learning that have lived in relative obscurity for the last 20 years or so. No one cared about them, and getting papers published was always a struggle. So, lots of ideas were never properly tried, never published, or were tried and published but soundly ignored and quickly forgotten. Who remembers that the first learning-based face detector that actually worked was a convolutional net (back in 1993, eight years before Viola-Jones)?

A figure with predictions from Vaillant R., Monrocq C., LeCun Y. (1993) "An Original Approach for the Localisation of Objects in Images".

Today, it’s really amazing to see so many young and bright people devoting so much creative energy to the topic and coming up with new ideas and new applications. The hardware/software infrastructure is getting better, and it’s becoming possible to train large networks in a few hours or a few days. So people can explore many more ideas than in the past.

One thing I’m excited about is the idea of “spectral convolutional nets”. This was a paper at ICLR 2014 by folks from my NYU lab about a generalization of convolutional nets that can be applied to arbitrary graphs (regular convnets can only be applied to 1D, 2D or 3D arrays, which can be seen as regular grids in graph terms). There are practical issues, but it opens the door to many more applications of convnets to unstructured data.
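
The core trick, very roughly, is that convolution on a graph can be defined as filtering in the eigenbasis of the graph Laplacian. The toy sketch below (numpy only, a made-up 5-node graph; not the paper’s code) shows that idea:

```python
# Toy spectral graph convolution: filter a node signal in the Laplacian eigenbasis.
import numpy as np

# Toy undirected graph: adjacency matrix of a 5-node cycle.
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1

D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # (unnormalized) graph Laplacian
eigvals, U = np.linalg.eigh(L)   # U: the graph "Fourier" basis

def spectral_conv(signal, spectral_filter):
    """Transform to the spectral domain, scale each frequency, transform back."""
    return U @ (spectral_filter * (U.T @ signal))

x = np.random.default_rng(0).normal(size=5)  # a signal on the graph's nodes
theta = np.exp(-eigvals)                     # would be a learnable filter in practice
print(spectral_conv(x, theta))
```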

MNIST digits on a sphere. From Bruna J., Zaremba W., Szlam A., LeCun Y. (2013) "Spectral Networks and Deep Locally Connected Networks on Graphs".

I’m very excited about the application of convnets (and recurrent nets) to natural language understanding (following the seminal work of Collobert and Weston).

Since the error rate of a human is estimated to be around 6%, and Dr. Graham showed results of 4.47%, do you consider CIFAR-10 to be a solved problem?

It’s a solved problem in the same sense as MNIST is a solved problem. But frankly, people are more interested in ImageNet than in CIFAR-10 nowadays. In that sense, CIFAR-10 is not a “real” problem. But it’s not a bad benchmark for a new algorithm.

What would it take for convnets to see a much wider adoption in the industry? Will training convnets and the software to set them up become less challenging?

What are you talking about? Convnets are absolutely everywhere now (or about to be everywhere) in industry: Facebook, Google, Microsoft, IBM, Baidu, NEC, Twitter, Yahoo!, …

That said, it’s true that all of these companies have significant R&D resources and that training convnets can still be challenging for smaller companies or companies that are less technically advanced.

It still requires quite a bit of experience and time investment to train a convnet if you don’t have prior training. Soon, however, there will be several simple-to-use open-source packages with efficient back-ends for that.

Are we close to the limit for convnets? Or could CIFAR-100 be "solved" next?

I don’t think it’s a good test. ImageNet is a much better test.

Shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional architectures. Do deep nets really need to be deep?

Yes, deep nets need to be deep. Try to train a shallow net to emulate a deep convnet trained on ImageNet. Come back when you have done that. In theory, a deep net can be approximated by a shallow one. But on complex tasks, the shallow net will have to be ridiculously large.
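
As a toy version of that experiment (a hedged sketch; `deep_model` below is just a random stand-in for a real trained convnet, and PyTorch is an assumption), one can fit a shallow net to the deep net’s outputs and see how wide it must be before the fit becomes good:

```python
# Fit a single-hidden-layer "student" to the outputs of a stand-in "teacher".
import torch
import torch.nn as nn

deep_model = nn.Sequential(      # stand-in teacher (a real one would be a trained convnet)
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
shallow_model = nn.Sequential(   # shallow student
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

opt = torch.optim.SGD(shallow_model.parameters(), lr=0.1)
x = torch.randn(1024, 32)        # unlabeled inputs
with torch.no_grad():
    targets = deep_model(x)      # the teacher's scores are the regression targets

for step in range(100):
    loss = ((shallow_model(x) - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```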

Most of your academic work is highly practical in nature. Is this something you purposefully aim for, or is it an artefact of being employed by companies? Can you tell us about the distinction between theory and practice?

Hey, I’ve been in academia since 2003, and I’m still a part-time professor at NYU. I do theory when it helps me understand things. Theory often helps us understand what’s possible and what’s not possible. It helps suggest proper ways to do things.

But sometimes theory restricts our thinking. Some people will not work with some models because the theory about them is too difficult. But often, a technique works well before the reasons for it working well are fully understood theoretically.

By restricting yourself to work on stuff you fully understand theoretically, you are condemned to using conceptually simple methods.

Also, sometimes theory blinds us. For example, some people were dazzled by kernel methods because of the cute math that goes with them. But, as I’ve said in the past, in the end, kernel machines are shallow networks that perform “glorified template matching”. There is nothing wrong with that (SVM is a great method), but it has dire limitations that we should all be aware of.

A slide from LeCun Y. (2013) "Learning Hierarchies from Invariant Features".

What is your opinion on a well-performing convnet without any theoretical justifications for why it should work so well? Do you generally favor performance over theory? Where do you place the balance?

I don’t think there is a choice to make between performance and theory. If there is performance, there will be theory to explain it.

Also, what kind of theory are we talking about? Is it a generalization bound? Convnets have a finite VC dimension, hence they are consistent and admit the classical VC bounds. What more do you want? Do you want a tighter bound, like what you get for SVMs? No theoretical bound that I know of is tight enough to be useful in practice. So I really don’t understand the point. Sure, generic VC bounds are atrociously non-tight, but non-generic bounds (like for SVMs) are only slightly less atrociously non-tight.
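
For reference, the classical VC bound he is alluding to has, in one common statement (constants differ across textbooks), the form

$$R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}},$$

which holds with probability at least $1-\delta$ over the draw of the $n$ training samples, where $R$ is the true risk, $\hat{R}_n$ the training error, and $h$ the VC dimension; for realistic values of $h$ and $n$ the square-root term is far too large to say anything useful.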

If what you desire are convergence proofs (or guarantees), that’s a little more complicated. The loss function of multi-layer nets is non convex, so the easy proofs that assume convexity are out the window. But we all know that in practice, a convnet will almost always converge to the same level of performance, regardless of the starting point (if the initialization is done properly). There is theoretical evidence that there are lots and lots of equivalent local minima and a very small number of “bad” local minima. Hence convergence is rarely a problem.

What is your opinion on AI hype? Which practices do you think are detrimental to the field (AI in general, and convnets specifically)?

AI hype is extremely dangerous. It killed various approaches to AI at least 4 times in the past. I keep calling out hype whenever I see it, whether it’s from the press, from startups looking for investors, from large companies looking for PR, or from academics looking for grants.

There is certainly quite a bit of hype around deep learning at the moment. I don’t see a particularly high level of hype around convnets specifically. There is more hype around “cortical this”, “spiking that”, and “neuromorphic blah”. Unlike many of these things, convnets actually yield good results on useful tasks and are widely deployed in industrial applications.

Any interesting projects at Facebook involving convnets that you could talk a little more about? Some basic stats about the size?

DeepFace: a convnet for face recognition. There are also convnets for image tagging. They are big.

A figure describing the architecture, from the presentation Taigman Y., Yang M., Ranzato M., Wolf L. (2014) "DeepFace for Unconstrained Face Recognition".

Recently you posted about 4 types of serious researchers. How would you label yourself?

I’m a 3, with a bit of 1 and 4.

  1. "People who want to explain/understand learning (and perhaps intelligence) at the fundamental/theoretical level.
  2. People who want to solve practical problems and have no interest in neuroscience.
  3. People who want to understand intelligence, build intelligent machines, and have a side interest in understanding how the brain works.
  4. People whose primary interest is to understand how the brain works, but feel they need to build computer models that actually work in order to do so."

Anything you wish to say to the top contestants in the CIFAR-10 challenge? Anything you wish to say to (hobbyist) researchers studying convnets? Anything in general you wish to say about the CIFAR dataset/problem?

I’m impressed by how much creativity and engineering knack went into this. It’s nice that people have pushed the technique as far as it will go on this dataset.

But it’s going to get easier and easier for independent researchers and hobbyists to play with these things and apply them to larger datasets. I think the successor to CIFAR-10 should be ImageNet-1K-128x128. This would be a version of the 1000-category ImageNet classification task where the images have been normalized to 128x128. I see several advantages:

  1. the networks are small enough to be trainable in a reasonable amount of time on a high-end gamer rig;
  2. the network you get at the end can actually be used for useful applications (like robot vision);
  3. the network can be run in real time on embedded platforms, like smartphones or the NVIDIA Jetson TK1.
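
For concreteness, the normalization such a benchmark implies could look like the sketch below (assumes torchvision and an ImageNet-style folder layout; the path is a placeholder, not a real location):

```python
# Hypothetical preprocessing for an "ImageNet-1K-128x128" benchmark.
from torchvision import datasets, transforms

to_128 = transforms.Compose([
    transforms.Resize(128),       # scale the shorter side to 128 pixels
    transforms.CenterCrop(128),   # then crop to a 128x128 square
    transforms.ToTensor(),
])

# "path/to/imagenet/train" is a placeholder for a local copy of the data.
dataset = datasets.ImageFolder("path/to/imagenet/train", transform=to_128)
```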

Predictions on ImageNet. From Krizhevsky A., Sutskever I., Hinton G.E. (2012) "ImageNet Classification with Deep Convolutional Neural Networks".

The need to have large amounts of labeled data can be a problem. What is your opinion on nets trained on unlabeled data, or the automatic labeling of data through image search engines?

There are tasks like video understanding and natural language understanding where we are going to have to use unsupervised learning. But these modalities have a temporal dimension that changes how we can approach the problem.

Clearly, we need to devise algorithms that can learn the structure of the perceptual world without being told the name of everything. Many of us have been working on this for years (if not decades), but none of us has a perfect solution.

What is your latest research focusing on?

There are two answers to this question:

  1. Projects I’m personally involved in (enough that I would be a co-author on the papers);
  2. Projects that I set the stage for, encourage others to work on, and advise at the conceptual level, but in which I am not involved enough to be a co-author on a paper.

A lot of (1) is at NYU and a lot of (2) is at Facebook.

The general areas are:

unsupervised learning that discovers “invariant” features, the marriage of deep learning and structured prediction, the unification of supervised and unsupervised learning, solving the problem of learning long-term dependencies, building learning systems with short-term/scratchpad memory, learning plans and sequences of actions, ways to optimize functions other than following the gradient, the integration of representation learning with reasoning (read Leon Bottou’s excellent position paper “From Machine Learning to Machine Reasoning”), the use of learning to perform inference efficiently, and many other topics.

from: http://blog.kaggle.com/2014/12/22/convolutional-nets-and-cifar-10-an-interview-with-yan-lecun/
