Deep Learning and the Triumph of Empiricism

By Zachary Chase Lipton, July 2015

Deep learning is now the standard-bearer for many tasks in supervised machine learning. It could also be argued that deep learning has yielded the most practically useful algorithms in unsupervised machine learning in a few decades. The excitement stemming from these advances has provoked a flurry of research and sensational headlines from journalists. While I am wary of the hype, I too find the technology exciting, and recently joined the party, issuing a 30-page critical review on Recurrent Neural Networks (RNNs) for sequence learning.

But many in the machine learning research community are not fawning over the deepness. In fact, for many who fought to resuscitate artificial intelligence research by grounding it in the language of mathematics and protecting it with theoretical guarantees, deep learning represents a fad. Worse, to some it might seem to be a regression.[1]

In this article, I'll try to offer a high-level and even-handed analysis of the usefulness of theoretical guarantees and why they might not always be as practically useful as they are intellectually rewarding. More to the point, I'll offer arguments to explain why, after so many years of increasingly statistically sound machine learning, many of today's best-performing algorithms offer no theoretical guarantees.

Guarantee What?

A guarantee is a statement that can be made with mathematical certainty about the behavior, performance or complexity of an algorithm. All else being equal, we would love to say that given sufficient time our algorithm A can find a classifier H from some class of models {H1, H2, ...} that performs no worse than H*, where H* is the best classifier in the class. This is, of course, with respect to some fixed loss function L. Short of that, we'd love to bound the difference or ratio between the performance of H and H* by some constant. Short of such an absolute bound, we'd love to be able to prove that with high probability H and H* yield similar values of L after running our algorithm for some fixed period of time.
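One way to write down the kinds of guarantee described above is sketched here. The notation is an assumption introduced for illustration and does not come from any particular paper: R(H) stands for the expected loss of H under L, and epsilon, c and delta are generic constants.

```latex
% R(H): expected loss of classifier H under the fixed loss function L (assumed notation)
% H: classifier returned by algorithm A;  H^*: best classifier in the class
\begin{align*}
\text{no worse than the best:} \quad & R(H) \le R(H^*) \\
\text{bounded difference:} \quad & R(H) - R(H^*) \le \epsilon \\
\text{bounded ratio:} \quad & R(H) \le c \cdot R(H^*), \quad c \ge 1 \\
\text{high-probability bound:} \quad & \Pr\bigl[\, R(H) - R(H^*) \le \epsilon \,\bigr] \ge 1 - \delta
\end{align*}
```

Each line is weaker than the one before it, and every one of them compares H only to H*, the best member of the chosen class.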

Many existing algorithms offer strong statistical guarantees. Linear regression admits an exact solution. Logistic regression is guaranteed to converge over time. Deep learning algorithms, generally, offer nothing in the way of guarantees. Given an arbitrarily bad starting point, I know of no theoretical proof that a neural network trained by some variant of SGD will necessarily improve over time and not be trapped in a local minimum. There is a flurry of recent work which reasonably suggests that saddle points outnumber local minima on the error surfaces of neural networks (an m-dimensional surface where m is the number of learned parameters, typically the weights on edges between nodes). However, this is not the same as proving that local minima do not exist or that they cannot be arbitrarily bad.
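To make "admits an exact solution" concrete, here is a minimal sketch of ordinary least squares for a single input variable, using only the Python standard library. The function name fit_line and the toy data are hypothetical, chosen for illustration. The point is that the answer comes from a formula, with no starting point and no iterations, which is exactly what SGD on a neural network cannot promise.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On this toy data the fit recovers the slope and intercept of the generating line, because the squared-error minimizer is unique and is computed directly.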

Problems with Guarantees

Provable mathematical properties are obviously desirable. They may have even saved machine learning, giving succor at a time when the field of AI was ill-defined, over-promising, and under-delivering. And yet many of today's best algorithms offer nothing in the way of guarantees. How is this possible?

I'll explain several reasons in the following paragraphs. They include:

  1. Guarantees are typically relative to a small class of hypotheses.
  2. Guarantees are often restricted to worst-case analysis, but the real world seldom presents the worst case.
  3. Guarantees are often predicated on incorrect assumptions about data.

Selecting a Winner from a Weak Pool

To begin, theoretical guarantees usually assure that a hypothesis is close to the best hypothesis in some given class. This in no way guarantees that there exists a hypothesis in the given class capable of performing satisfactorily.

Here's a heavy-handed example: I desire a human editor to assist me in composing a document. Spell-check may come with guarantees about how it will behave. It will identify certain misspellings with 100% accuracy. But existing automated proofreading tools cannot provide the insight offered by an intelligent human. Of course, a human offers nothing in the way of mathematical guarantees. The human may fall asleep, ignore my emails, or respond nonsensically. Nevertheless, he or she is capable of expressing a far greater range of useful ideas than Clippy.

A cynical take might be that there are two ways to improve a theoretical guarantee. One is to improve the algorithm. Another is to weaken the hypothesis class against which the algorithm is measured. While neural networks offer little in the way of guarantees, they offer a far richer set of potential hypotheses than most better-understood machine learning models. As heuristic learning techniques and more powerful computers have eroded the obstacles to effective learning, it seems clear that for many models, this increased expressiveness is essential for making predictions of practical utility.

The Worst Case May Not Matter

Guarantees are most often given in the worst case. By guaranteeing a result that is within a factor epsilon of optimal, we say that the worst case will be no worse than a factor epsilon. But in practice, the worst-case scenario may never occur. Real-world data is typically highly structured, and worst-case scenarios may have a structure such that there is no overlap between a typical and a pathological dataset. In these settings, the worst-case bound still holds, but it may be the case that all algorithms perform much better. There may be no reason to believe that the algorithm with the better worst-case guarantee will have better typical-case performance.

Predicated on Provably Incorrect Assumptions

Another reason why models with theoretical soundness may not translate into real-world performance is that the assumptions about data necessary to produce theoretical results are often known to be false. Consider Latent Dirichlet Allocation (LDA) for example, a well understood and remarkably useful algorithm for topic modeling. Many theoretical proofs about LDA are predicated upon the assumption that a document is associated with a distribution over topics. Each topic is in turn associated with a distribution over all words in the vocabulary. The generative process then proceeds as follows. For each word in a document, first a topic is chosen stochastically according to the relative probabilities of each topic. Then, conditioned on the chosen topic, a word is chosen according to that topic's word distribution. This process repeats until all words are chosen.

Clearly, this assumption does not hold on any real-world natural language dataset. In real documents, words are chosen contextually and depend highly on the sentences they are placed in. Additionally, document lengths aren't arbitrarily predetermined, although this may be the case in undergraduate coursework. However, given the assumption of such a generative process, many elegant proofs about theoretical properties of LDA hold.
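The generative story that LDA assumes can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not an LDA implementation: the vocabulary, the two topic-word distributions and the helper names sample_dirichlet and generate_document are all hypothetical, and the per-document topic distribution is drawn from a Dirichlet prior as the model's name suggests.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_dists, vocab, doc_length, alpha, rng):
    """Generate one document under the assumed LDA generative process."""
    # The document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    # The length is fixed in advance, independent of the content.
    for _ in range(doc_length):
        # First choose a topic according to the document's topic distribution.
        topic = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        # Then choose a word according to that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_dists[topic], k=1)[0]
        words.append(word)
    return theta, words


rng = random.Random(0)  # seeded only so that repeated runs match
vocab = ["gradient", "network", "proof", "bound"]  # toy vocabulary
topic_word_dists = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "deep learning" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "theory" topic
]
theta, doc = generate_document(
    topic_word_dists, vocab, doc_length=8, alpha=[1.0, 1.0], rng=rng
)
print(theta, doc)
```

Note that each word is drawn independently of its neighbors and the document length is an input, which are precisely the two ways in which real text violates the assumption.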

To be clear, LDA is indeed a broadly useful, state-of-the-art algorithm. Further, I am convinced that theoretical investigation of the properties of algorithms, even under unrealistic assumptions, is a worthwhile and necessary step to improve our understanding and lay the groundwork for more general and powerful theorems later. In this article, I seek only to contextualize the nature of much known theory and to give intuition to data science practitioners about why the algorithms with the most favorable theoretical properties are not always the best performing empirically.

The Triumph of Empiricism

One might ask: if not guided entirely by theory, what allows methods like deep learning to prevail? Further, why are empirical methods backed by intuition so broadly successful now, even though they fell out of favor decades ago?

In answer to these questions, I believe that the existence of comparatively humongous, well-labeled datasets like ImageNet is responsible for the resurgence in heuristic methods. Given sufficiently large datasets, the risk of overfitting is low. Further, validating against test data offers a means to address the typical case, instead of focusing on the worst case. Additionally, the advances in parallel computing and memory size have made it possible to follow up on many hypotheses simultaneously with empirical experiments. Empirical studies backed by strong intuition offer a path forward when we reach the limits of our formal understanding.

Caveats

For all the success of deep learning in machine perception and natural language, one could reasonably argue that, by far, the three most valuable machine learning algorithms are linear regression, logistic regression, and k-means clustering, all of which are well understood theoretically. A reasonable counter-argument to the idea of a triumph of empiricism might be that by far the best algorithms are theoretically motivated and grounded, and that empiricism is responsible only for the newest breakthroughs, not the most significant.

Few Things Are Guaranteed

When attainable, theoretical guarantees are beautiful. They reflect clear thinking and provide deep insight into the structure of a problem. Given a working algorithm, a theory which explains its performance deepens understanding and provides a basis for further intuition. Given the absence of a working algorithm, theory offers a path of attack.

However, there is also beauty in the idea that well-founded intuitions paired with rigorous empirical study can yield consistently functioning systems that outperform better-understood models, and sometimes even humans, at many important tasks. Empiricism offers a path forward for applications where formal analysis is stifled, and potentially opens new directions that might eventually admit deeper theoretical understanding.

[1] Yes, corny pun.

Zachary Chase Lipton is a PhD student in the Computer Science Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.
