Data Mining Algorithm Study (6): CART

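Classification and regression tree algorithm: CART (Classification And Regression Tree) uses binary recursive partitioning. At each step it splits the current sample set into two subsets, so every non-leaf node of the resulting tree has exactly two branches. The decision tree that CART produces is therefore a structurally simple binary tree.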

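A classification tree rests on two basic ideas. The first is to build the tree by recursively partitioning the space of the independent variables using the training samples. The second is to prune the tree using validation data.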
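The pruning idea can be illustrated with a small sketch. The text above only says that validation data is used, so this example swaps in reduced-error pruning, a simpler technique than the cost-complexity pruning usually associated with CART. The tree layout (dicts with `label`, `feature`, `value`, `left`, `right`), the helper names, and the data are all assumptions made for the illustration.

```python
def predict(node, row):
    """Walk down the tree; a node without 'feature' is a leaf."""
    while "feature" in node:
        go_left = row[node["feature"]] == node["value"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

def errors(node, rows):
    """Number of rows whose label (last element) is predicted wrongly."""
    return sum(predict(node, r) != r[-1] for r in rows)

def prune(node, val_rows):
    """Bottom-up reduced-error pruning (hypothetical helper).

    A subtree is replaced by a leaf carrying the node's own label when
    that does not increase the error on the validation rows reaching it.
    """
    if "feature" not in node:
        return node
    f, v = node["feature"], node["value"]
    left_rows = [r for r in val_rows if r[f] == v]
    right_rows = [r for r in val_rows if r[f] != v]
    candidate = dict(node,
                     left=prune(node["left"], left_rows),
                     right=prune(node["right"], right_rows))
    leaf = {"label": node["label"]}
    if errors(leaf, val_rows) <= errors(candidate, val_rows):
        return leaf
    return candidate

# Hand-built, over-fitted tree over rows of (outlook, windy, play).
tree = {
    "label": "yes", "feature": 0, "value": "overcast",
    "left": {"label": "yes"},
    "right": {
        "label": "no", "feature": 1, "value": "false",
        "left": {"label": "yes"},
        "right": {"label": "no"},
    },
}

# Made-up validation rows.
validation = [
    ("overcast", "true", "yes"),
    ("sunny", "false", "no"),
    ("rainy", "false", "no"),
    ("rainy", "true", "no"),
]

pruned_tree = prune(tree, validation)
print(errors(tree, validation), errors(pruned_tree, validation))
```

In this toy case the split on `windy` only hurts on the validation rows, so the right subtree collapses into a single leaf while the root split is kept. A subtree that no validation row reaches is also collapsed, because replacing it cannot increase the validation error.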

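One difference between CART and C4.5 is the criterion used to split a node: CART uses the Gini index, whereas C4.5 uses the information gain ratio. The Gini index measures the impurity of a data partition or of the training set D. For a set D in which class k has proportion p_k, Gini(D) = 1 - sum_k p_k^2. The smaller the Gini value, the purer the samples (that is, the more likely it is that they all belong to the same class). The attribute with the smallest Gini index is chosen as the test attribute, and the subset of its values that yields the smallest Gini index is chosen as its splitting subset.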
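As a small illustration of the impurity measure described above, the following sketch computes the Gini value of a list of class labels and the weighted Gini value of a binary split. The function names and the sample labels are assumptions chosen for the example, not code from the original post.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Gini value of a binary split: size-weighted average of both sides."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n

pure = ["yes", "yes", "yes"]          # all samples in one class
mixed = ["yes", "no", "yes", "no"]    # two classes, evenly mixed

print(gini(pure))                # 0.0
print(gini(mixed))               # 0.5
print(gini_split(pure, mixed))   # 2/7, about 0.286
```

A pure set has Gini value 0, and an evenly mixed two-class set has the two-class maximum of 0.5, which is why the split with the smallest weighted value is preferred.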

Algorithm steps:

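CART_classification(DataSet, featureList, alpha):

Create the root node R

Base cases:

If all samples in DataSet belong to the same class, label R with that class

If the height of the decision tree exceeds alpha, stop splitting and label R with classify(DataSet)

Recursive case:

Label R with classify(DataSet)

Select an attribute F from featureList (the attribute with the smallest Gini(DataSet, F); continuous attributes are discretized as in C4.5, with minimum Gini as the splitting criterion)

Using F, split DataSet into two subsets DS_L and DS_R:

If DS_L or DS_R is empty, stop splitting

If neither DS_L nor DS_R is empty, create the child nodes:

C_L = CART_classification(DS_L, featureList, alpha);

C_R = CART_classification(DS_R, featureList, alpha)

Add C_L and C_R as the left and right children of R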
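The steps above can be sketched in Python as follows. This is a minimal sketch under several assumptions: rows are tuples whose last element is the class label, `classify` returns the majority class, each split tests "attribute equals one value" against all other values (a simplification of general value subsets), `alpha` is treated as a maximum depth, and continuous attributes are not handled. The sample data is made up.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def classify(rows):
    """Majority class of the rows (assumed meaning of classify)."""
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def best_split(rows, features):
    """Return (gini, feature, value) of the best 'feature == value' split."""
    best = None
    for f in features:
        for v in sorted({r[f] for r in rows}):
            left = [r[-1] for r in rows if r[f] == v]
            right = [r[-1] for r in rows if r[f] != v]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or g < best[0]:
                best = (g, f, v)
    return best

def cart_classification(rows, features, alpha, depth=0):
    node = {"label": classify(rows)}           # label R with classify(DataSet)
    same_class = len({r[-1] for r in rows}) == 1
    if same_class or depth >= alpha:           # base cases
        return node
    split = best_split(rows, features)
    if split is None:                          # no attribute to split on
        return node
    _, f, v = split
    ds_l = [r for r in rows if r[f] == v]
    ds_r = [r for r in rows if r[f] != v]
    if not ds_l or not ds_r:                   # an empty side: stop splitting
        return node
    node.update(feature=f, value=v,
                left=cart_classification(ds_l, features, alpha, depth + 1),
                right=cart_classification(ds_r, features, alpha, depth + 1))
    return node

def predict(node, row):
    while "feature" in node:
        go_left = row[node["feature"]] == node["value"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

# Made-up rows of (outlook, windy, play).
data = [
    ("sunny", "false", "no"),
    ("sunny", "true", "no"),
    ("overcast", "false", "yes"),
    ("rainy", "false", "yes"),
    ("rainy", "true", "no"),
    ("overcast", "true", "yes"),
]

tree = cart_classification(data, features=[0, 1], alpha=3)
print(tree["feature"], tree["value"])
print([predict(tree, r) for r in data])
```

Storing the majority label on every node, not only on the leaves, mirrors the "label R with classify(DataSet)" step and makes it possible to cut a subtree later without recomputing anything.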

Core code, implemented in SQL (MySQL stored-procedure syntax):

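-- Fragment of the procedure body: the procedure header and the declarations of
-- current_gini, current_class, current_num and current_table are not shown here.
rr: while (1=1) do
    -- Any row of weather that has not been assigned to a leaf yet (class = 0).
    set @weather = (select id from weather where class = 0 limit 0,1);
    -- Any attribute that is still a split candidate (statetemp = 1).
    set @feature = (select parent from finalgini where statetemp = 1 limit 0,1);
    if (@weather is null) then
        -- Every row has been labelled, so the loop ends.
        leave rr;
    elseif (@feature is null) then
        -- No candidate attribute is left: restore the candidates from state.
        update finalgini set statetemp = state;
    end if;
    if (@weather is not null) then
        b: begin
            -- Choose the candidate attribute with the smallest Gini value.
            -- The second subquery must return exactly one row, so this relies
            -- on that gini value being unique in finalgini.
            set current_gini = (select min(gini) from finalgini where statetemp = 1);
            set current_class = (select parent from finalgini where gini = current_gini);
            -- aa holds the two value groups (class, class2) of that attribute.
            drop table if exists aa;
            create temporary table aa (namee varchar(100));
            insert into aa select class from finalgini where parent = current_class;
            insert into aa select class2 from finalgini where parent = current_class;
            tt: while (1=1) do
                set @x = (select namee from aa limit 0,1);
                if (@x is not null) then
                    a0: begin
                        -- bb: ids of unlabelled rows of current_table whose value
                        -- for the chosen attribute matches the group @x.
                        drop table if exists bb;
                        set @b=concat('create temporary table bb as \(select id from ', current_table,' where ',current_class,' regexp \'',@x,'\' and class = 0 \)');
                        prepare stmt2 from @b;
                        execute stmt2;
                        -- Number of distinct play values in this branch.
                        set @count = (select count(distinct play) from bb left join weather on bb.id = weather.id);
                        if (@count = 1) then
                            a1: begin
                                -- Pure branch: label its rows with current_num.
                                update bb left join weather on bb.id = weather.id set class = current_num;
                                set current_num = current_num + 1;
                                if (current_table = 'cc') then
                                    delete from cc where id in (select id from bb);
                                end if;
                                -- If cc is now empty, go back to weather and
                                -- restore the candidate attributes.
                                set @f = (select play from cc limit 0,1);
                                if (@f is null) then
                                    set current_table = 'weather';
                                    update finalgini set statetemp = state;
                                end if;
                                delete from aa where namee = @x;
                            end a1;
                        end if;
                        if (@count > 1) then
                            set @id = (select count(id) from bb);
                            if (@id = 2) then
                                w: begin
                                    -- Two rows with different play values:
                                    -- each one gets its own label.
                                    update bb left join weather on bb.id = weather.id set class = current_num where play = 'yes';
                                    set current_num = current_num + 1;
                                    update bb left join weather on bb.id = weather.id set class = current_num where play = 'no';
                                    set current_num = current_num + 1;
                                    if (current_table = 'cc') then
                                        delete from cc where id in (select id from bb);
                                    end if;
                                    set @f = (select play from cc limit 0,1);
                                    if (@f is null) then
                                        set current_table = 'weather';
                                        update finalgini set statetemp = state;
                                    end if;
                                    delete from aa where namee = @x;
                                end w;
                            end if;
                            if (@id > 2) then
                                -- Impure branch with more than two rows: copy it
                                -- to cc and leave the inner loop so that the next
                                -- pass works on cc.
                                drop table if exists cc;
                                create temporary table cc select * from weather inner join bb using(id);
                                set current_table = 'cc';
                                leave tt;
                            end if;
                        end if;
                        if (@count = 0) then
                            -- Empty branch: nothing to label.
                            delete from aa where namee = @x;
                        end if;
                    end a0;
                else
                    -- Both value groups are processed: retire this attribute.
                    update finalgini set state = 0 where parent = current_class;
                    leave tt;
                end if;
            end while;
            -- Do not pick this attribute again in the current pass.
            update finalgini set statetemp = 0 where parent = current_class;
        end b;
    end if;
end while;
end |
delimiter ;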

Tables used in the program:

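Table 2: classgini holds the Gini value of each candidate value grouping for every attribute

Table 3: finalgini holds the best grouping of each attribute and its Gini value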
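To make the role of these two tables concrete, the sketch below computes the same kind of information in Python for one categorical attribute: every way of grouping its values into two sets with the Gini value of each grouping (the counterpart of classgini), and then the grouping with the smallest value (the counterpart of a finalgini row). The function names, the data, and the idea that the two groups correspond to the `class` and `class2` columns are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def binary_groupings(values):
    """Yield every split of the values into two non-empty groups, once each."""
    values = sorted(values)
    if len(values) < 2:
        return
    first, rest = values[0], values[1:]
    # Fixing the first value on the left avoids yielding mirrored duplicates.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = {first, *extra}
            yield sorted(left), sorted(set(values) - left)

def grouping_ginis(rows, attr, target):
    """One (gini, left_group, right_group) tuple per grouping of attr's values."""
    result = []
    for left, right in binary_groupings({r[attr] for r in rows}):
        l_labels = [r[target] for r in rows if r[attr] in left]
        r_labels = [r[target] for r in rows if r[attr] in right]
        g = (len(l_labels) * gini(l_labels)
             + len(r_labels) * gini(r_labels)) / len(rows)
        result.append((g, left, right))
    return result

# Made-up rows; only the column names outlook and play are borrowed.
rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
]

class_gini = grouping_ginis(rows, "outlook", "play")   # all groupings
final_gini = min(class_gini, key=lambda t: t[0])       # best grouping

for g, left, right in class_gini:
    print(round(g, 3), left, right)
print("best:", final_gini)
```

An attribute with m distinct values has 2^(m-1) - 1 such groupings, so exhaustive enumeration is only practical for attributes with few values.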

Time: 2024-12-07 11:08:26
