SLIQ/SPRINT
Before SLIQ, most classification algorithms had a scalability problem: they require the training data to fit in memory. That is why SLIQ was proposed.
1 Generic Decision-Tree Classification
Most decision-tree classifiers perform classification in two phases: Tree Building and Tree Pruning.
1.1 Tree Building
MakeTree(Training Data T)
    Partition(T);

Partition(Data S)
    if (all points in S are in the same class) then
        return;
    Evaluate splits for each attribute A
    Use the best split found to partition S into S1 and S2
    Partition(S1);
    Partition(S2);
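As a concrete illustration of this recursion, here is a minimal Python sketch. It assumes numeric attributes and binary threshold splits, and uses the Gini index as the split criterion; all names are illustrative, and this is a toy in-memory version, not SLIQ itself.

from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(S):
    """Evaluate candidate splits for each attribute; return the best
    (attribute index, threshold) by weighted Gini of the two partitions."""
    n = len(S)
    best, best_score = None, float('inf')
    for a in range(len(S[0][0])):
        values = sorted(set(x[a] for x, _ in S))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0          # candidate threshold between values
            left = [y for x, y in S if x[a] <= t]
            right = [y for x, y in S if x[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best = score, (a, t)
    return best

def partition(S):
    """Recursive Partition() from the pseudocode: grow until pure."""
    labels = [y for _, y in S]
    if len(set(labels)) == 1:            # all points in S are in the same class
        return {'label': labels[0]}
    split = best_split(S)
    if split is None:                    # no usable split: majority-class leaf
        return {'label': Counter(labels).most_common(1)[0][0]}
    a, t = split
    S1 = [(x, y) for x, y in S if x[a] <= t]
    S2 = [(x, y) for x, y in S if x[a] > t]
    return {'attr': a, 'thresh': t,
            'left': partition(S1), 'right': partition(S2)}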
1.2 Tree Pruning
As we know, no matter how good the preprocessing is, there will always be "noise" or other bad data. So when we use the training data to build the decision-tree classifier, it also creates branches for those bad examples. These branches can lead to errors when classifying test data. Tree pruning aims to remove these branches from the decision tree by selecting the subtree with the least estimated error rate.
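As a rough sketch of the idea, the code below implements reduced-error pruning: a subtree is replaced by a leaf whenever the leaf does no worse on a held-out pruning set. This is one simple instance of error-based pruning, not SLIQ's own (MDL-based) method, and it assumes the node-dictionary format from the sketch above.

from collections import Counter

def classify(node, x):
    """Route an example down the tree to a leaf label."""
    while 'label' not in node:
        node = node['left'] if x[node['attr']] <= node['thresh'] else node['right']
    return node['label']

def errors(node, data):
    """Number of misclassified examples in data under this (sub)tree."""
    return sum(1 for x, y in data if classify(node, x) != y)

def prune(node, data):
    """Bottom-up reduced-error pruning against a pruning set `data`."""
    if 'label' in node or not data:
        return node
    a, t = node['attr'], node['thresh']
    node['left'] = prune(node['left'], [(x, y) for x, y in data if x[a] <= t])
    node['right'] = prune(node['right'], [(x, y) for x, y in data if x[a] > t])
    majority = Counter(y for _, y in data).most_common(1)[0][0]
    leaf = {'label': majority}
    # Keep the subtree only if it beats the single leaf on estimated error.
    return node if errors(node, data) < errors(leaf, data) else leaf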
2 Scalability Issues
2.1 Tree Building
As I mentioned, a splitting criterion such as information gain (ID3/C4.5) or the Gini index [1] is used to evaluate the "goodness" of the alternative splits for an attribute.
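For concreteness, here is how these two standard criteria score a binary partition; this is textbook material rather than anything SLIQ-specific, and lower weighted impurity means a better split.

import math
from collections import Counter

def entropy(labels):
    """Entropy used by ID3/C4.5: -sum p*log2(p) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index used by CART/SLIQ: 1 - sum p^2 over class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_goodness(left, right, impurity):
    """Weighted impurity of a binary split; `impurity` is entropy or gini."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n

# Example: class labels on each side of a candidate split.
print(split_goodness(['a', 'a', 'b'], ['b', 'b'], gini))     # ~0.267
print(split_goodness(['a', 'a', 'b'], ['b', 'b'], entropy))  # ~0.551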
2.1.1 Splits for Numeric Attribute
The cost of evaluating splits for a numeric attribute is dominated by the cost of sorting the values. Therefore, an important scalability issue is the reduction of sorting costs for numeric attributes.
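The sketch below shows why the sort dominates: once the values of one numeric attribute are sorted, every candidate threshold can be evaluated in a single linear sweep that moves class counts from the right side to the left. Function names are illustrative.

from collections import Counter

def gini_from_counts(counts, n):
    """Gini impurity from a class-count dict over n examples."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_numeric_split(values, labels):
    """Best threshold for one numeric attribute. The O(n log n) sort
    dominates; the left-to-right sweep afterwards is only O(n)."""
    pairs = sorted(zip(values, labels))            # the expensive step
    n = len(pairs)
    left, right = Counter(), Counter(l for _, l in pairs)
    best_t, best_score = None, float('inf')
    for i in range(n - 1):
        v, l = pairs[i]
        left[l] += 1
        right[l] -= 1
        if v == pairs[i + 1][0]:
            continue                    # no threshold between equal values
        score = ((i + 1) * gini_from_counts(left, i + 1)
                 + (n - i - 1) * gini_from_counts(right, n - i - 1)) / n
        if score < best_score:
            best_score = score
            best_t = (v + pairs[i + 1][0]) / 2.0
    return best_t, best_score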
2.1.2 Splits for Categorical Attribute
For a categorical attribute A, the candidate splits are of the form A ∈ S, where S is a subset of A's possible values. When the number of distinct values is small, all subsets can be evaluated; otherwise this exhaustive search becomes too expensive and a greedy subset search is used instead.
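The following sketch shows the exhaustive version of this subset evaluation (again scored with the Gini index); it makes the scalability issue visible, since the number of subsets grows exponentially with the attribute's cardinality. Names are illustrative.

from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_categorical_split(values, labels):
    """Best split of the form A in S over all non-trivial subsets S."""
    domain = sorted(set(values))
    n = len(values)
    best_S, best_score = None, float('inf')
    # Enumerate subsets up to half the domain; complements give the same split.
    for r in range(1, len(domain) // 2 + 1):
        for S in combinations(domain, r):
            inside = [l for v, l in zip(values, labels) if v in S]
            outside = [l for v, l in zip(values, labels) if v not in S]
            if not inside or not outside:
                continue
            score = (len(inside) * gini(inside)
                     + len(outside) * gini(outside)) / n
            if score < best_score:
                best_score, best_S = score, set(S)
    return best_S, best_score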
2.2 Tree Pruning
Pruning strategies that need a separate pruning dataset, or repeated passes over the training data (as in cross-validation), are also costly at scale; SLIQ addresses this with an MDL-based pruning algorithm that uses only the training data.
3 SLIQ Classifier
SLIQ avoids re-sorting the data at every node by sorting each numeric attribute only once, at the beginning of tree growth (pre-sorting). To achieve this pre-sorting, we use the following data structures. We create a separate list for each attribute of the training data. Additionally, a separate list, called the class list, is created for the class labels attached to the examples. An entry in an attribute list has two fields: one contains an attribute value, the other an index into the class list. An entry of the class list also has two fields: one contains a class label, the other a reference to a leaf node of the decision tree. The i-th entry of the class list corresponds to the i-th example in the training data.

Each leaf node of the decision tree represents a partition of the training data, the partition being defined by the conjunction of the predicates on the path from the node to the root. Thus, the class list can at any time identify the partition to which an example belongs. We assume that there is enough memory to keep the class list memory-resident. Attribute lists are written to disk if necessary.
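A minimal rendering of these structures in Python, kept entirely in memory for illustration (the actual system writes attribute lists to disk when necessary); the example data and field names are made up.

# Training data: (age, salary) attributes plus a class label.
examples = [((30, 65), 'G'), ((23, 15), 'B'), ((40, 75), 'G'), ((55, 40), 'B')]

# Class list: the i-th entry holds the class label of the i-th example
# and a reference to its current leaf node (all start at the root).
class_list = [{'label': y, 'leaf': 'root'} for _, y in examples]

# One attribute list per attribute, sorted once, up front. Each entry is
# (attribute value, index into the class list), so after sorting we can
# still recover each value's class label and current partition.
attribute_lists = {}
for a, name in enumerate(['age', 'salary']):
    entries = [(x[a], i) for i, (x, _) in enumerate(examples)]
    attribute_lists[name] = sorted(entries)

# Scanning an attribute list in sorted order: the class-list index gives
# both the class label and the leaf (partition) of each value.
for value, idx in attribute_lists['salary']:
    entry = class_list[idx]
    print(value, entry['label'], entry['leaf'])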
Footnotes:

[1] http://www.cnblogs.com/mlhy/p/4856062.html
Author: mlhy
Created: 2015-10-08 Thu 21:29
Emacs 24.5.1 (Org mode 8.2.10)