Abstract
We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an $N$-class problem, the DDAG contains $N(N-1)/2$ classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on $N$ and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.
1 Introduction
Multiclass classification, especially for systems like SVMs, does not have an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually exclusive classes than for $N$ mutually exclusive classes. We believe constructing $N$-class SVMs is still an unsolved research problem.
The standard method for $N$-class SVMs is to construct $N$ SVMs. The $i$th SVM is trained with all of the examples in the $i$th class given positive labels and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the $N$ 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with $N$.
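The one-versus-rest decision rule described above can be sketched in a few lines. This is only an illustration: the function name `one_vs_rest_predict` and the numeric outputs are hypothetical, and the per-class outputs are assumed to have been computed already by the individual machines.

```python
def one_vs_rest_predict(outputs):
    """Return the index of the machine with the highest real-valued output.

    outputs[i] is assumed to be the output of the machine trained with
    class i as positive and every other class as negative.
    """
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Hypothetical outputs of three one-versus-rest machines on one test point.
example_outputs = [-1.2, 0.3, -0.5]
predicted_class = one_vs_rest_predict(example_outputs)
```

Note that every machine must be evaluated for every test point, so evaluation cost grows with the number of classes.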
Another method for constructing $N$-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of $N$ classes, each classifier being trained on only two out of the $N$ classes. There would thus be $K = N(N-1)/2$ classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).
Knerr suggested combining these two-class classifiers with an “AND” gate. Friedman suggested a Max Wins algorithm: each classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. Kreßel applies the Max Wins algorithm to Support Vector Machines with excellent results.
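As a rough sketch of Max Wins voting over all pairwise classifiers, the following uses a hypothetical callback `prefers(i, j)` that stands in for a trained two-class classifier and returns whichever of the two classes it favours. The tie-breaking rule (lowest class index) is an assumption made here for determinism, not something the text specifies.

```python
from collections import Counter
from itertools import combinations


def max_wins(n_classes, prefers):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter()
    for i, j in combinations(range(n_classes), 2):
        votes[prefers(i, j)] += 1
    # Assumed tie-break: among equally voted classes pick the lowest index.
    return min(range(n_classes), key=lambda c: (-votes[c], c))


def toy_prefers(i, j):
    """Hypothetical pairwise rule: class 2 beats everyone, else lower index wins."""
    if 2 in (i, j):
        return 2
    return min(i, j)


n_pairs = len(list(combinations(range(4), 2)))  # number of pairwise classifiers for 4 classes
winner = max_wins(4, toy_prefers)
```

With four classes there are six pairwise classifiers, and all six are evaluated for every test point.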
A significant disadvantage of the 1-v-1 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall $N$-class classifier system will tend to overfit. The “AND” combination method and the Max Wins combination method do not have bounds on the generalization error. Finally, the size of the 1-v-1 classifier may grow superlinearly with $N$, and hence it may be slow to evaluate on large problems.
2 Decision DAGs
A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and which contains no cycles. A Rooted DAG has a unique node such that it is the only node which has no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.
Definition 1 Decision DAGs (DDAGs). Given a space $X$ and a set of boolean functions $\mathcal{F} = \{f : X \to \{0, 1\}\}$, the class $\mathrm{DDAG}(\mathcal{F})$ of Decision DAGs on $N$ classes over $\mathcal{F}$ consists of the functions which can be implemented using a rooted binary DAG with $N$ leaves labeled by the classes, where each of the $K = N(N-1)/2$ internal nodes is labeled with an element of $\mathcal{F}$. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of $N$ leaves. The $i$-th node in layer $j < N$ is connected to the $i$-th and $(i+1)$-st node in the $(j+1)$-st layer.
To evaluate a particular DDAG $G$ on input $x \in X$, starting at the root node, the binary function at the current node is evaluated. The node is then exited via the left edge if the binary function is zero, or via the right edge if the binary function is one. The next node's binary function is then evaluated. The value of the decision function is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input $x$ reaches a node of the graph if that node is on the evaluation path for $x$. We refer to the decision node distinguishing classes $i$ and $j$ as the $ij$-node. Assuming that the number of a leaf is its class, this node is the $i$-th node in the $(N - j + i)$-th layer, provided $i < j$. Similarly, the $j$-nodes are those nodes involving class $j$, that is, the internal nodes on the two diagonals containing the leaf labeled by $j$.
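The triangular layout can be made concrete with a short sketch. It assumes classes are numbered from 1 to N, that the root compares class 1 with class N, and that the node comparing classes i and j (with i < j) sits at position i of layer N - j + i; the helper name `ddag_layers` is hypothetical. A pair (i, i) in the last layer denotes the leaf for class i.

```python
def ddag_layers(n_classes):
    """List the (i, j) class pairs in each layer of the triangle, top to bottom."""
    layers = []
    for layer in range(1, n_classes + 1):
        # The i-th node of this layer pairs class i with class j = N - layer + i.
        layers.append([(i, n_classes - layer + i) for i in range(1, layer + 1)])
    return layers


layers = ddag_layers(4)
# Count only decision nodes (i < j); the last layer holds the leaves.
n_internal = sum(1 for layer in layers for (i, j) in layer if i < j)
```

Under this layout, the two children of the node (i, j) are (i, j - 1) and (i + 1, j), so each decision removes exactly one of the two classes being compared.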
The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with $N$ classes, $N-1$ decision nodes will be evaluated in order to derive an answer.
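The list-based procedure above might be sketched as follows. The callback `prefers(a, b, x)` is a hypothetical stand-in for the trained two-class node classifier for classes a and b; here it is replaced by a toy nearest-center rule on one-dimensional inputs, purely for illustration.

```python
def ddag_predict(x, classes, prefers):
    """Eliminate one class per decision until a single class remains.

    Returns the surviving class and the number of node evaluations.
    """
    remaining = list(classes)
    n_evaluated = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if prefers(first, last, x) == first:
            remaining.pop()    # the node prefers `first`, so `last` is eliminated
        else:
            remaining.pop(0)   # the node prefers `last`, so `first` is eliminated
        n_evaluated += 1
    return remaining[0], n_evaluated


# Toy stand-in for trained node classifiers: prefer the class with the nearer center.
CENTERS = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}


def toy_prefers(a, b, x):
    return a if abs(x - CENTERS[a]) <= abs(x - CENTERS[b]) else b


label, n_nodes = ddag_predict(2.2, [1, 2, 3, 4], toy_prefers)
```

For the input 2.2 the list shrinks from [1, 2, 3, 4] to [2, 3, 4], then [3, 4], then [3], so three of the six node classifiers are evaluated.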
The current state of the list is the total state of the system. Therefore, since a given list state can be reached by more than one path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.
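To illustrate why the structure is a DAG rather than a tree, the sketch below enumerates every root-to-leaf path of the list procedure while recording the distinct list states encountered. It assumes the list always holds a contiguous run of classes, so a state can be summarized by its first and last element; the function name is hypothetical, and the enumeration is exponential, so it is meant for small N only.

```python
def count_states_and_paths(n_classes):
    """Return (number of distinct list states, number of root-to-leaf paths)."""
    states = set()
    n_paths = 0
    stack = [(1, n_classes)]
    while stack:
        first, last = stack.pop()
        states.add((first, last))
        if first == last:
            n_paths += 1                   # a single class remains: leaf reached
            continue
        stack.append((first, last - 1))    # last class eliminated
        stack.append((first + 1, last))    # first class eliminated
    return len(states), n_paths


n_states, n_paths = count_states_and_paths(4)
```

For four classes there are eight distinct paths but only ten distinct states (six decision nodes plus four leaves), which is the sharing that the DAG representation exploits.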
Decision DAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths. The class of functions implemented is the same as that of Generalized Decision Trees, but this particular representation presents both computational and learning-theoretical advantages.
3 Analysis of Generalization
In this paper we study DDAGs where the node classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let $w$ be the (unit) weight vector correctly splitting the $i$ and $j$ classes at the $ij$-node with threshold $\theta$. We define the margin of the $ij$-node to be $\gamma = \min_{c(x) = i, j} \{ |\langle w, x \rangle - \theta| \}$, where $c(x)$ is the class associated to training example $x$. Note that, in this definition, we only take into account examples with class labels equal to $i$ or $j$.
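A minimal sketch of the node margin computation follows, assuming the margin is the smallest absolute value of the inner product of the unit weight vector with an example minus the threshold, taken over the training examples of the two classes handled by the node. The function name, the data, and the choice to rescale a non-unit weight vector together with its threshold are illustrative assumptions.

```python
import math


def node_margin(w, theta, examples, i, j):
    """Smallest |<w, x> - theta| over examples labeled i or j.

    `examples` is a list of (x, label) pairs with x a sequence of floats.
    w and theta are rescaled together so that w has unit length.
    """
    norm = math.sqrt(sum(c * c for c in w))
    distances = [
        abs(sum(wc * xc for wc, xc in zip(w, x)) - theta) / norm
        for x, label in examples
        if label in (i, j)  # examples of other classes are ignored
    ]
    return min(distances)


# Hypothetical training data: the class-3 point lies close to the hyperplane
# but does not affect the margin of the node separating classes 1 and 2.
data = [((0.0, 1.0), 1), ((2.0, 0.0), 2), ((0.4, 5.0), 3), ((1.5, 1.0), 2)]
gamma_12 = node_margin((1.0, 0.0), 0.5, data, 1, 2)
```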
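Theorem 1 Suppose we are able to classify a random $m$-sample of labeled examples using a Perceptron DDAG on $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$. Then we can bound the generalization error with probability greater than $1 - \delta$ to be less than
$$\frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2 (2m)^K}{\delta} \right),$$
where $D' = \sum_{i=1}^{K} \frac{1}{\gamma_i^2}$ and $R$ is the radius of a ball containing the distribution's support.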
Theorem 1 implies that we can control the capacity of DDAGs by enlarging their margin. Note that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph.
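By the above observations, we would expect that a DDAG whose $j$-node margins are large would be accurate at identifying class $j$, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the $j$-node margins, but first we introduce the notation $\epsilon_j(G) = P\{x : (x \text{ is in class } j \text{ and } x \text{ is misclassified by } G) \text{ or } x \text{ is misclassified as class } j \text{ by } G\}$.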
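Theorem 2 Suppose we are able to correctly distinguish class $j$ from the other classes in a random $m$-sample with a DDAG $G$ over $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$. Then with probability $1 - \delta$,
$$\epsilon_j(G) \le \frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2 (2m)^{N-1}}{\delta} \right),$$
where $D' = \sum_{i \in j\text{-nodes}} \frac{1}{\gamma_i^2}$ and $R$ is the radius of a ball containing the support of the distribution.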
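The difference between the two theorems can be illustrated numerically. The sketch below assumes the margin-dependent quantity in each bound is a sum of inverse squared margins, taken over all decision nodes for Theorem 1 and over only the nodes involving class j for Theorem 2. The margin values and helper names are hypothetical.

```python
def capacity_term(margins, nodes=None):
    """Sum of 1 / gamma^2 over the selected nodes (all nodes if `nodes` is None)."""
    keys = margins.keys() if nodes is None else nodes
    return sum(1.0 / margins[k] ** 2 for k in keys)


def j_nodes(j, n_classes):
    """The N - 1 decision nodes that involve class j, as (smaller, larger) pairs."""
    return [(min(j, k), max(j, k)) for k in range(1, n_classes + 1) if k != j]


# Hypothetical margins for a 4-class DDAG: the nodes involving class 4 have
# large margins, while the remaining nodes have small ones.
margins = {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 0.5,
           (1, 4): 2.0, (2, 4): 2.0, (3, 4): 2.0}

d_all = capacity_term(margins)                      # every decision node
d_class4 = capacity_term(margins, j_nodes(4, 4))    # only nodes involving class 4
```

In this toy setting the term restricted to the class-4 nodes is far smaller than the term over all nodes, which matches the intuition that class 4 can be identified reliably even though the other nodes have small margins.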