Large Margin DAGs for Multiclass Classification

Abstract

We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an $N$-class problem, the DDAG contains $N(N-1)/2$ classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on $N$ and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.

1 Introduction

The problem of multiclass classification, especially for systems like SVMs, does not present an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually-exclusive classes than for $N$ mutually-exclusive classes. We believe constructing $N$-class SVMs is still an unsolved research problem.

The standard method for $N$-class SVMs is to construct $N$ SVMs. The $i$-th SVM will be trained with all of the examples in the $i$-th class with positive labels, and all other examples with negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the set of 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with $N$.
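As a concrete illustration of this one-versus-rest construction, here is a minimal sketch in Python using scikit-learn's SVC; the function names, kernel settings, and data layout are assumptions made for the example rather than anything specified in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svc_kwargs):
    """Train one SVM per class: examples of class k get positive labels,
    all other examples get negative labels."""
    machines = {}
    for k in classes:
        labels = np.where(y == k, 1, -1)
        machines[k] = SVC(**svc_kwargs).fit(X, labels)
    return machines

def predict_one_vs_rest(machines, X):
    """Output the class whose SVM produces the highest real-valued output."""
    keys = list(machines)
    scores = np.stack([machines[k].decision_function(X) for k in keys], axis=1)
    return np.array(keys)[np.argmax(scores, axis=1)]
```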

Another method for constructing $N$-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of $N$ classes, each classifier being trained on only two out of the $N$ classes. There would thus be $K = N(N-1)/2$ classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).

Knerr suggested combining these two-class classifiers with an "AND" gate. Friedman suggested a Max Wins algorithm: each 1-v-1 classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. Kreßel applies the Max Wins algorithm to Support Vector Machines with excellent results.
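Here is a minimal sketch of Max Wins voting over a set of pairwise classifiers; it assumes each pairwise machine exposes a predict method returning one of its two classes, and the names used are illustrative rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations

def max_wins_predict(pairwise, classes, x):
    """pairwise[(i, j)] is a two-class classifier returning either i or j.
    Each of the K = N(N-1)/2 classifiers casts one vote for its preferred
    class; the class with the most votes is the final answer."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[pairwise[(i, j)].predict([x])[0]] += 1
    return votes.most_common(1)[0][0]
```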

A significant disadvantage of the 1-v-1 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall $N$-class classifier system will tend to overfit. The "AND" combination method and the Max Wins combination method do not have bounds on the generalization error. Finally, the size of the 1-v-1 classifier may grow superlinearly with $N$, and hence may be slow to evaluate on large problems.

2 Decision DAGs

A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node, the root, which is the only node with no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.

Definition 1 (Decision DAGs (DDAGs)). Given a space $X$ and a set of boolean functions $F = \{f : X \to \{0, 1\}\}$, the class $DDAG(F)$ of Decision DAGs on $N$ classes over $F$ consists of functions which can be implemented using a rooted binary DAG with $N$ leaves labeled by the classes, where each of the $K = N(N-1)/2$ internal nodes is labeled with an element of $F$. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of $N$ leaves. The $i$-th node in layer $j < N$ is connected to the $i$-th and $(i+1)$-st node in the $(j+1)$-st layer.

To evaluate a particular DDAG $G$ on input $x \in X$, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge if the binary function is zero, or via the right edge if the binary function is one. The next node's binary function is then evaluated. The value of the decision function $D(x)$ is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input $x$ reaches a node of the graph if that node is on the evaluation path for $x$. We refer to the decision node distinguishing classes $i$ and $j$ as the $ij$-node. Assuming that the number of a leaf is its class, this node is the $i$-th node in the $(N - j + i)$-th layer, provided $i < j$. Similarly, the $j$-nodes are those nodes involving class $j$, that is, the internal nodes on the two diagonals containing the leaf labeled by $j$.

The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all of the classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with $N$ classes, $N - 1$ decision nodes will be evaluated in order to derive an answer.
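The list-based evaluation described above translates directly into a short loop. The following is a sketch assuming pairwise node classifiers indexed by class pairs; the names are illustrative and not taken from the paper.

```python
def ddag_predict(node, classes, x):
    """Evaluate a DDAG by list elimination.

    node[(i, j)](x) is assumed to return i or j, the class preferred by the
    ij-node. Starting from the full class list, each evaluation removes one
    class, so exactly len(classes) - 1 nodes are visited."""
    remaining = list(classes)
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]   # test first vs. last element
        if node[(i, j)](x) == i:
            remaining.pop()                  # eliminate class j
        else:
            remaining.pop(0)                 # eliminate class i
    return remaining[0]
```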

The current state of the list is the total state of the system. Therefore, since a list state can be reached by more than one path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.

Decision DAGs naturally generalize the class of Decision Trees: by allowing different decision paths to merge, they can represent more efficiently the redundancies and repetitions that occur in different branches of a tree. The class of functions implemented is the same as that of Generalized Decision Trees, but this particular representation presents both computational and learning-theoretical advantages.

3 Analysis of Generalization

In this paper we study DDAGs where the node classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let $w$ be the (unit) weight vector correctly splitting the $i$ and $j$ classes at the $ij$-node with threshold $\theta$. We define the margin of the $ij$-node to be $\gamma = \min_{c(x) = i \,\text{or}\, j} |\langle w, x \rangle - \theta|$, where $c(x)$ is the class associated with training example $x$. Note that, in this definition, we only take into account examples with class labels equal to $i$ or $j$.
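As a small worked illustration of this margin, the sketch below computes $\gamma$ for a given $ij$-node, assuming `w` is a unit-norm weight vector and `theta` a scalar threshold; the variable names are illustrative.

```python
import numpy as np

def ij_node_margin(w, theta, X, y, i, j):
    """Margin of the ij-node: the smallest |<w, x> - theta| over training
    examples whose class label is i or j (w is assumed to be unit-norm)."""
    mask = (y == i) | (y == j)
    return float(np.min(np.abs(X[mask] @ w - theta)))
```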

Theorem 1. Suppose we are able to classify a random $m$-sample of labeled examples using a Perceptron DDAG on $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$. Then we can bound the generalization error with probability greater than $1 - \delta$ to be less than

$$\frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2(2m)^K}{\delta} \right),$$

where $D' = \sum_{i=1}^{K} \frac{1}{\gamma_i^2}$ and $R$ is the radius of a ball containing the distribution's support.

Theorem 1 implies that we can control the capacity of DDAGs by enlarging their margin. Note that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph.
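For concreteness, here is a small sketch that evaluates the reconstructed bound of Theorem 1 from a list of node margins; the constant and the form of the bound follow the expression quoted above and should be checked against the original paper before being relied upon.

```python
import numpy as np

def ddag_capacity_term(margins):
    """D' = sum over decision nodes of 1 / gamma_i^2; smaller margins give larger capacity."""
    gammas = np.asarray(margins, dtype=float)
    return float(np.sum(1.0 / gammas**2))

def theorem1_bound(margins, m, R, delta):
    """Generalization bound of Theorem 1 (as reconstructed above) for a
    Perceptron DDAG with K = len(margins) node margins, m training examples,
    data radius R, and confidence parameter delta."""
    K = len(margins)
    d_prime = ddag_capacity_term(margins)
    # log(2 (2m)^K / delta) expanded as K * log(2m) + log(2 / delta) for stability
    log_term = K * np.log(2 * m) + np.log(2.0 / delta)
    return (130.0 * R**2 / m) * (d_prime * np.log(4 * np.e * m) * np.log(4 * m) + log_term)
```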

By the above observations, we would expect that a DDAG whose $j$-node margins are large would be accurate at identifying class $j$, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the $j$-node margins, but first we introduce the notation $\epsilon_j(G) = P\{x : (x \text{ is in class } j \text{ and is misclassified by } G) \text{ or } (x \text{ is misclassified as class } j \text{ by } G)\}$.

Theorem 2. Suppose we are able to correctly distinguish class $j$ from the other classes in a random $m$-sample with a DDAG $G$ over $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$. Then with probability $1 - \delta$,

$$\epsilon_j(G) \le \frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2(2m)^{N-1}}{\delta} \right),$$

where $D' = \sum_{i \in j\text{-nodes}} \frac{1}{\gamma_i^2}$ and $R$ is the radius of a ball containing the support of the distribution.
