Lessons learned developing a practical large scale machine learning system

Original: http://googleresearch.blogspot.jp/2010/04/lessons-learned-developing-practical.html

Tuesday, April 06, 2010

Posted by Simon Tong, Google Research

When faced with a hard prediction problem, one possible approach is to attempt to perform statistical miracles on a small training set. If data is abundant then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.

This general notion recurs in many other fields as well. For example, processing large quantities of data helps immensely for information retrieval and machine translation.

Several years ago we began developing a large scale machine learning system, and have been refining it over time. We gave it the codename “Seti” because it searches for signals in a large space. It scales to massive data sets and has become one of the most broadly used classification systems at Google.

After building a few initial prototypes, we quickly settled on a system with the following properties (a toy sketch of this contract follows the list):

    • Binary classification (produces a probability estimate of the class label)
    • Parallelized
    • Scales to process hundreds of billions of instances and beyond
    • Scales to billions of features and beyond
    • Automatically identifies useful combinations of features
    • Accuracy is competitive with state-of-the-art classifiers
    • Reacts to new data within minutes
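
To make those properties concrete, here is a minimal small-scale sketch, not Seti's actual implementation: a linear classifier over hashed sparse features that emits probability estimates and absorbs new data incrementally. The scikit-learn classes and all feature names are stand-ins chosen for illustration.

```python
# Toy sketch of the contract above (not Seti itself), using scikit-learn.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Feature hashing keeps memory bounded even with very many raw feature strings.
hasher = FeatureHasher(n_features=2**20, input_type="string")
# Logistic loss yields a probability estimate of the class label.
clf = SGDClassifier(loss="log_loss")

def update(batch_features, batch_labels):
    """Fold a fresh mini-batch into the model ("reacts to new data within minutes")."""
    X = hasher.transform(batch_features)
    clf.partial_fit(X, batch_labels, classes=[0, 1])

def score(features):
    """Return P(label == 1) for a single instance."""
    X = hasher.transform([features])
    return clf.predict_proba(X)[0, 1]

# Hypothetical usage with string-valued features:
update([["query=shoes", "country=us"], ["query=spam", "country=xx"]], [1, 0])
print(score(["query=shoes", "country=us"]))
```

At Seti's scale the training itself would be parallelized across many machines; the point of the sketch is only the interface: probabilities out, incremental data in.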

Seti’s accuracy appears to be pretty decent. For example, tests on standard smaller datasets indicate that it is comparable with modern classifiers.

Seti has the flexibility to be used on a broad range of training set sizes and feature sets. These sizes are substantially larger than those typically used in academia (e.g., the largest UCI dataset has 4 million instances). A sample of the data sets used with Seti gives the following statistics:

            Training set size    Unique features
  Mean      100 billion          1 billion
  Median    1 billion            10 million

A good machine learning system is all about accuracy, right?

In the process of designing Seti we made plenty of mistakes. However, we made some good key decisions as well. Here are a few of the practical lessons that we learned. Some are obvious in hindsight, but we did not necessarily realize their importance at the time.

Lesson: Keep it simple (even at the expense of a little accuracy).

Having good accuracy across a variety of domains is very important, and we were tempted to focus exclusively on this aspect of the algorithm. However, in a practical system there are several other aspects of an algorithm that are equally critical:

    • Ease of use: Teams are more willing to experiment with a machine learning system that is simple to set up and use. Those teams are not necessarily die-hard machine learning experts, and so they do not want to waste much time figuring out how to get a system up and running.
    • System reliability: Teams are much more willing to deploy a reliable machine learning system in a live environment. They want a system that is dependable and unlikely to crash or need constant attention. Early versions of Seti had marginally better accuracy on large data sets, but were complex, stressed the network and GFS architecture considerably, and needed constant babysitting. The number of teams willing to deploy these versions was low.

Seti is typically used in places where a machine learning system will provide a significant improvement in accuracy over the existing system. The gains are usually large enough that most teams do not care about the small differences in accuracy between different flavors of algorithms. And, in practice, the small differences are often washed out by other effects such as better data filtering, adding another useful feature, parameter tuning, etc. Teams much prefer having a stable, scalable and easy-to-use classification system. We found that these other aspects can be the difference between a deployable system and one that gets abandoned.

It is perhaps less academically interesting to design an algorithm that is slightly worse in accuracy, but that has greater ease of use and system reliability. However, in our experience, it is very valuable in practice.

Lesson: Start with a few specific applications in mind.

It was tempting to build a learning system without focusing on any particular application. After all, our goal was to create a large scale system that would be useful on a wide variety of present and future classification tasks. Nevertheless, we decided to focus primarily on a small handful of initial applications. We believe this decision was useful in several ways:

    • We could examine what the small number of domains had in common. By building something that would work for a few domains, it was likely the resulting system would be useful for others.
    • More importantly, it helped us quickly decide what aspects were unnecessary. We noticed that it was surprisingly easy to over-generalize or over-engineer a machine learning system. The domains grounded our project in reality and drove our decision making. Without them, even deciding how broad to make the input file format would have been harder (e.g., is it important to permit binary/categorical/real-valued features? Multiple classes? Fractional labels? Weighted instances?). A sketch of one such record follows this list.
    • Working with a few different teams as initial guinea pigs allowed us to learn about common teething problems, and helped us smooth the process of deployment for future teams.
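
As a concrete illustration of those file-format questions, here is one hypothetical record layout; the field names and types are invented for this sketch and are not Seti's actual format.

```python
# Hypothetical instance record illustrating the format decisions above:
# mixed feature types, fractional labels, and per-instance weights.
from dataclasses import dataclass
from typing import Dict, Union

# Binary, categorical, and real-valued features share one value type.
FeatureValue = Union[bool, str, float]

@dataclass
class Instance:
    features: Dict[str, FeatureValue]
    label: float = 0.0   # fractional label: P(class == 1), in [0, 1]
    weight: float = 1.0  # per-instance weight

# Example: a soft-labeled, down-weighted instance.
ex = Instance(
    features={"is_mobile": True, "country": "us", "ctr_30d": 0.031},
    label=0.8,
    weight=0.5,
)
```

Restricting the system to binary classification rather than multiple classes is exactly the kind of narrowing decision that a few concrete applications made easy to justify.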

Lesson: Know when to say “no”.

We have a hammer, but we don't want to end up with bent screws. Being machine learning practitioners, it was very tempting for us to always recommend using machine learning for a problem. We saw very early on that, despite its many significant benefits, machine learning typically adds complexity, opacity and unpredictability to a system. In reality, simpler techniques are sometimes good enough for the task at hand. And in the long run, the extra effort that would have been spent integrating, maintaining and diagnosing issues with a live machine learning system could be spent on other ways of improving the system instead.

Seti is often used in places where there is a good chance of significantly improving predictive accuracy over the incumbent system. And we usually advise teams against trying the system when we believe there is likely to be only a small improvement.

Large-scale machine learning is an important and exciting area of research. It can be applied to many real world problems. We hope that we have given a flavor of the challenges that we face, and some of the practical lessons that we have learned.
