C4.5分类决策树算法,其核心算法是ID3算法。目前应用在临床决策、生产制造、文档分析、生物信息学、空间数据建模等领域。算法的输入是带类标的数据,输出是树形的决策规则。
C4.5比ID3的改进:
1)用信息增益率来选择属性,克服了用信息增益选择属性时偏向选择取值多的属性的不足;
2)在树构造过程中进行剪枝;
3)能够完成对连续属性的离散化处理;
4)能够对不完整数据进行处理。
C4.5算法优点:产生的分类规则易于理解,准确率较高。
C4.5算法缺点:在构造树的过程中,需要对数据集进行多次的顺序扫描和排序,因而导致算法的低效。
算法过程:
?C4.5(DataSet,featureList):
–创建根节点R
–如果当前DataSet中的数据都属于同一类,则标记R的类别为该类
–如果当前featureList集合为空,则标记R的类别为当前DataSet中样本最多的类别
–递归情况:
?从featureList中选择属性F(选择GainRatio(DataSet,
F)最大的属性,连续属性参见上面的离散化过程)
?根据F的每一个值v,将DataSet划分为不同的子集DS,对于每一个DS:
–创建节点C
–如果DS为空,节点C标记为DataSet中样本最多的类别
–如果DS不为空,节点C=C4.5(DS,featureList
- F)
–将节点C添加为R的子节点
利用weka将weather数据用c4.5分类构建的决策树如下图:
用SQL算法实现c4.5算法的核心代码:
drop procedure if exists buildtree; DELIMITER | create procedure buildtree() begin declare le int default 1; declare letemp int default 1; declare current_num int default 1; declare current_class varchar(20); declare current_gain double; declare current_table varchar(20) default 'weather'; update infoset set state=1,statetemp=1; rr:while (1=1) do set @weather = (select play from weather where class is null limit 0,1); set @feature =(select x from infoset where statetemp=1 limit 0,1); if (@weather is null ) then leave rr; else if(@feature is null) then update infoset set statetemp = state; end if; end if; if (@weather is not null) then b:begin set current_gain = (select max(info_Gain) from infoset where statetemp=1); set current_class = (select x from infoset where info_Gain = current_gain); drop table if exists aa; set @a=concat('create temporary table aa select distinct ',current_class,' as namee from weather where class is null'); prepare stmt1 from @a; execute stmt1; tt:while (1=1) do set @x = (select namee from aa limit 0,1); if (@x is not null) then a0:begin drop table if exists bb; set @b=concat('create temporary table bb select * from ', current_table,' where ',current_class,' = ? and class is null'); prepare stmt2 from @b; execute stmt2 using @x; set @count = (select count(distinct play) from bb); if (@count =1) then a1:begin update weather set class = current_num,levelnum = letemp where id in (select id from bb); set current_num = current_num+1; if (current_table ='cc') then delete from cc where id in (select id from bb); end if; set @f=(select play from cc limit 0,1); if (@f is null) then set current_table='weather'; update infoset set statetemp=state; set letemp =le; end if; delete from aa where namee = @x; end a1; end if; if (@count>1) then drop table if exists cc; create temporary table cc select * from bb; set current_table = 'cc'; set letemp = letemp+1; leave tt; end if; if(@count=0) then delete from aa where namee = @x; set le = le+1; end if; end a0; else update infoset set state=0 where x=current_class; leave tt; end if; end while; update infoset set statetemp=0 where x=current_class; end b; end if; end while; end | delimiter ;
运行后分类结果如下图所示:
其中,class属性记录分类结果,levelnum属性记录各个数据在决策树中的层数。
对程序中各个表的解释:
?表1 weather
增加levelnum属性列,用于存放每条记录在生成的分类树中的层数。
?表2 C45_temp
用于存放中间结果。其中属性列logv和parent,分别用于存储各属性分量的中间计算结果和所属类别。
?表3 infoset用于存放各个属性的信息增益率。属性列state和statetemp用于在计算过程中做标记。