SLIQ/SPRINT


Before SLIQ, most classification algorithms did not scale, because they required the training data to fit in memory. SLIQ was proposed to remove that limit.

1 Generic Decision-Tree Classification

Most decision-tree classifiers perform classification in two phases: Tree Building and Tree Pruning.

1.1 Tree Building

MakeTree(Training Data T)
   Partition(T);

Partition(Data S)
   if(all points in S are in the same class) then return;
   Evaluate splits for each attribute A
   Use the best split found to partition S into S1 and S2
   Partition(S1);
   Partition(S2);
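The pseudocode above can be made concrete with a minimal Python sketch. Everything here is an assumption for illustration: the names `make_tree`, `partition`, `best_split` and `classify`, the dictionary used to represent the tree, the restriction to numeric attributes with tests of the form `A <= threshold`, and the use of a plain misclassification count as a stand-in for the splitting index.

```python
from collections import Counter


def misclassified(labels):
    """Number of points that are not in the majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_split(rows, labels):
    """Evaluate splits for each attribute; return (attribute, threshold) or None."""
    best, best_cost = None, None
    for a in range(len(rows[0])):
        values = sorted({r[a] for r in rows})
        # Candidate test "A <= lo" for every value except the largest,
        # so both sides of the split are always non-empty.
        for lo in values[:-1]:
            left = [y for r, y in zip(rows, labels) if r[a] <= lo]
            right = [y for r, y in zip(rows, labels) if r[a] > lo]
            cost = misclassified(left) + misclassified(right)
            if best_cost is None or cost < best_cost:
                best, best_cost = (a, lo), cost
    return best


def partition(rows, labels):
    # Stop when all points in S are in the same class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:
        # Identical points with different labels: no split is possible.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    a, t = split
    left = [i for i, r in enumerate(rows) if r[a] <= t]
    right = [i for i, r in enumerate(rows) if r[a] > t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([rows[i] for i in left], [labels[i] for i in left]),
        "right": partition([rows[i] for i in right], [labels[i] for i in right]),
    }


def make_tree(rows, labels):
    return partition(rows, labels)


def classify(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while "leaf" not in tree:
        branch = "left" if row[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]
```

For example, `make_tree([(1,), (2,), (3,), (4,)], ["a", "a", "b", "b"])` splits once on attribute 0 at threshold 2. Note that every call to `best_split` rescans all points for every attribute, which is exactly the cost that becomes a problem on large data.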

1.2 Tree Pruning

No matter how carefully the data is preprocessed, the training data always contains some noise or other bad data. When we build a decision tree from the training data, the tree therefore also grows branches that fit this bad data. These branches can lead to errors when classifying test data. Tree pruning aims to remove these branches from the decision tree by selecting the subtree with the least estimated error rate.

2 Scalability Issues

2.1 Tree Building

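A splitting index is used to evaluate the "goodness" of the alternative splits for an attribute. Examples are the information gain used by ID3, the gain ratio used by C4.5, and the gini index (see footnote 1).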
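As a reminder of how such an index is computed, here is the gini index written out, followed by a small worked example. The symbols ($p_j$ for the relative frequency of class $j$ in $S$, and $n$, $n_1$, $n_2$ for the sizes of $S$, $S_1$, $S_2$) and the class counts in the example are chosen here for illustration.

```latex
\[
  gini(S) = 1 - \sum_{j} p_j^{2},
  \qquad
  gini_{split}(S) = \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2)
\]

% Worked example: S holds 4 points of class A and 2 of class B.
% A split sends 3 A to S_1 and 1 A plus 2 B to S_2.
\[
  gini(S) = 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2 = \tfrac{4}{9},
  \qquad
  gini(S_1) = 0,
  \qquad
  gini(S_2) = 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 = \tfrac{4}{9}
\]
\[
  gini_{split}(S) = \tfrac{3}{6} \cdot 0 + \tfrac{3}{6} \cdot \tfrac{4}{9} = \tfrac{2}{9}
\]
```

A lower value means purer partitions, so this split (2/9) is an improvement over leaving $S$ unsplit (4/9).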

2.1.1 Splits for Numeric Attribute

The cost of evaluating splits for a numeric attribute is dominated by the cost of sorting the values. Therefore, an important scalability issue is the reduction of sorting costs for numeric attributes.
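To see why sorting dominates, consider a minimal sketch of split evaluation for one numeric attribute. The function names `gini_from_counts` and `best_numeric_split`, the use of the gini index, and the choice of midpoints between adjacent distinct values as candidate thresholds are assumptions made for this example. Once the values are sorted, one linear scan with two running class histograms is enough to score every candidate.

```python
from collections import Counter


def gini_from_counts(counts, n):
    """Gini index of a partition given its class histogram and size."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_split(values, labels):
    """Return (threshold, gini_split) for the best test `value <= threshold`.

    Returns None when all values are equal and no split exists.
    """
    # Sorting is O(n log n) and is the dominant cost of this function.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    below = Counter()        # class histogram of points at or below the cut
    above = Counter(labels)  # class histogram of points above the cut
    best = None
    # The scan itself is linear: each step moves one point across the cut.
    for i in range(n - 1):
        value, label = pairs[i]
        below[label] += 1
        above[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # cannot cut between two equal values
        n_below = i + 1
        n_above = n - n_below
        score = (n_below / n) * gini_from_counts(below, n_below) \
            + (n_above / n) * gini_from_counts(above, n_above)
        if best is None or score < best[1]:
            best = ((value + next_value) / 2, score)
    return best
```

If this function is called at every node of the tree, the same attribute is sorted again and again on ever smaller subsets, which is the repeated cost a scalable classifier needs to avoid.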

2.1.2 Splits for Categorical Attribute

2.2 Tree Pruning

3 SLIQ Classifier

SLIQ sorts the values of each numeric attribute only once, at the start of tree building, instead of sorting at every node. To achieve this pre-sorting, we use the following data structures. We create a separate list for each attribute of the training data. Additionally, a separate list, called the class list, is created for the class labels attached to the examples. An entry in an attribute list has two fields: one contains an attribute value, the other an index into the class list. An entry of the class list also has two fields: one contains a class label, the other a reference to a leaf node of the decision tree. The i-th entry of the class list corresponds to the i-th example in the training data. Each leaf node of the decision tree represents a partition of the training data, the partition being defined by the conjunction of the predicates on the path from the node to the root. Thus, the class list can at any time identify the partition to which an example belongs. We assume that there is enough memory to keep the class list memory-resident. Attribute lists are written to disk if necessary.
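The layout of these lists can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the names `AttributeEntry`, `ClassEntry` and `build_lists` are hypothetical, all attributes are taken to be numeric, leaf nodes are referred to by integer ids with 0 as the root, and the sample data is made up. A real implementation would keep the attribute lists on disk when they do not fit in memory.

```python
from dataclasses import dataclass


@dataclass
class AttributeEntry:
    value: float      # attribute value of one example
    class_index: int  # index of that example's entry in the class list


@dataclass
class ClassEntry:
    label: str  # class label of the example
    leaf: int   # id of the leaf node the example currently belongs to


def build_lists(rows, labels, root=0):
    """Build one pre-sorted attribute list per attribute, plus the class list."""
    # The i-th class list entry corresponds to the i-th training example.
    # Before any split, every example belongs to the root node.
    class_list = [ClassEntry(label=y, leaf=root) for y in labels]
    attribute_lists = []
    for a in range(len(rows[0])):
        entries = [AttributeEntry(value=r[a], class_index=i)
                   for i, r in enumerate(rows)]
        # Sorted once here; the order is reused for every later split.
        entries.sort(key=lambda e: e.value)
        attribute_lists.append(entries)
    return attribute_lists, class_list


# Made-up training data: (age, salary) with a class label per example.
rows = [(30, 65), (23, 15), (40, 75), (55, 40), (55, 100), (45, 60)]
labels = ["G", "B", "G", "B", "G", "G"]
attribute_lists, class_list = build_lists(rows, labels)

# Follow an attribute entry back to its label and current leaf.
youngest = attribute_lists[0][0]
print(youngest.value, class_list[youngest.class_index])
```

Because each attribute entry carries only an index, a split never has to move or re-sort attribute values. Updating the `leaf` field of the class list entries is enough to record which partition each example now belongs to.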

Footnotes:

1: http://www.cnblogs.com/mlhy/p/4856062.html

Author: mlhy

Created: 2015-10-08 Thu 21:29

Emacs 24.5.1 (Org mode 8.2.10)
