In machine learning, a random forest is made up of many decision trees. Because these trees are built by a randomized procedure, they are also called random decision trees. The trees in a random forest are independent of one another. When a test sample enters the random forest, every decision tree classifies it, and the class predicted by the largest number of trees is taken as the final result. A random forest is thus a classifier consisting of multiple decision trees, whose output class is the mode of the classes output by the individual trees. Random forests can handle both discrete-valued attributes, as the ID3 algorithm does, and continuous-valued attributes, as the C4.5 algorithm does. In addition, random forests can be used for unsupervised clustering and for outlier detection.
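As a minimal sketch of that majority vote, assuming scikit-learn is available (the dataset and forest size below are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset; sizes are arbitrary, chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Reproduce the ensemble's decision by hand: collect each tree's vote
# and take the mode (most frequent class) per sample.
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
majority = np.apply_along_axis(
    lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
print(majority)               # hand-counted majority vote
print(forest.predict(X[:5]))  # the forest's own prediction
```

Note that scikit-learn's `predict` actually averages the trees' class probabilities rather than counting hard votes, so on a rare tie the two outputs can differ; for clear-cut samples they agree.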
The decision tree algorithm has many attractive properties: training time complexity is low, prediction is fast, and the model is easy to present (the learned tree can readily be drawn as a diagram). At the same time, a single decision tree has weaknesses, most notably a tendency to overfit; techniques such as pruning can reduce overfitting, but on their own they are not enough.
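As a rough illustration of that overfitting, the sketch below compares an unpruned tree with a cost-complexity-pruned one on held-out data (assuming scikit-learn; the dataset and the `ccp_alpha` value are arbitrary illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

# An unpruned tree typically fits the training data (almost) perfectly
# yet scores worse on held-out data; pruning trades a little training
# accuracy for better generalization.
for name, model in [("unpruned", full), ("pruned", pruned)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```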
A decision tree is in essence a way of partitioning the space with hyperplanes: every split divides the current region in two. Take, for example, the decision tree below:
It partitions the space as follows:
As a result, each leaf node corresponds to a disjoint region of the space. When making a decision, we move down the tree step by step according to the value of each feature dimension of the input sample, until the sample falls into one of the N regions (assuming the tree has N leaf nodes).
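This leaf-equals-region view can be checked directly; a small sketch assuming scikit-learn (the 2-D points below are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny 2-D dataset; the points are arbitrary, chosen only to illustrate.
X = np.array([[1, 1], [2, 1], [1, 3], [4, 4], [5, 3], [4, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Every axis-aligned split halves the current region, so each leaf is
# a disjoint rectangle. `apply` returns, per sample, the index of the
# leaf (i.e. the region) that the sample falls into.
print(tree.apply(X))        # region index per sample
print(tree.get_n_leaves())  # N, the number of disjoint regions
```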
Data Mining
Data Mining is all about automating the process of searching for patterns in the data.
Searching for High Info Gains
Given something (e.g. wealth) you are trying to predict, it is easy to ask the computer to find which attribute has the highest information gain for it.
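A minimal sketch of that search in plain Python (the `entropy` and `information_gain` helpers and the toy wealth records are illustrative assumptions, not code from the source):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(records, attr, target):
    """IG = H(Y) - sum over values v of P(attr=v) * H(Y | attr=v)."""
    labels = [r[target] for r in records]
    gain = entropy(labels)
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)
    return gain

# Hypothetical records for predicting `wealth` from two attributes.
data = [
    {"gender": "F", "hours": "short", "wealth": "poor"},
    {"gender": "M", "hours": "long",  "wealth": "rich"},
    {"gender": "F", "hours": "long",  "wealth": "rich"},
    {"gender": "M", "hours": "short", "wealth": "poor"},
]

# Ask which attribute has the highest information gain for `wealth`.
best = max(["gender", "hours"],
           key=lambda a: information_gain(data, a, "wealth"))
print(best)  # -> "hours" on this toy data
```

On this toy data, `hours` separates `wealth` perfectly, so it wins with the maximum possible gain of one bit, while `gender` has zero gain.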
Base Cases
• Base Case One: If all records in the current data subset have the same output, then don’t recurse.
• Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse (both cases appear in the recursion sketch below).
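An ID3-style skeleton showing where the two base cases stop the recursion (the function and argument names are illustrative assumptions, and it reuses the `information_gain` helper sketched above):

```python
def build_tree(records, attributes, target):
    """Recursive tree builder: `records` is a list of dicts,
    `attributes` the input attribute names still available,
    `target` the name of the output attribute."""
    outputs = [r[target] for r in records]

    # Base Case One: all records share the same output -> leaf node.
    if len(set(outputs)) == 1:
        return outputs[0]

    # Base Case Two: all records have exactly the same input
    # attributes (or none are left) -> leaf with the majority output.
    if not attributes or all(
            all(r[a] == records[0][a] for a in attributes)
            for r in records):
        return max(set(outputs), key=outputs.count)

    # Otherwise recurse: split on the highest-information-gain attribute.
    best = max(attributes,
               key=lambda a: information_gain(records, a, target))
    children = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        rest = [a for a in attributes if a != best]
        children[value] = build_tree(subset, rest, target)
    return (best, children)
```

Returning the majority output in Base Case Two is the conventional choice, since identical inputs with conflicting outputs cannot be split any further.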
Training Set Error
• For each record, follow the decision tree to see what it would predict.
• For how many records does the decision tree’s prediction disagree with the true value in the database?
• This quantity is called the training set error. The smaller the better.
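Counting those disagreements is a one-liner; a small sketch assuming scikit-learn (the dataset is an arbitrary stand-in for the database):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in "database" of records with a true output value y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Follow the tree for every record and count how often its prediction
# disagrees with the true value -- the training set error.
training_set_error = int((tree.predict(X) != y).sum())
print(f"training set error: {training_set_error} / {len(y)}")
```

An unpruned tree can usually drive this count to zero, which is precisely why training set error alone is a misleading measure of quality.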
Test Set Error
• Suppose we are forward-thinking.
• We hide some data away when we learn the decision tree.
• But once learned, we see how well the tree predicts that data.
• This is a good simulation of what happens when we try to predict future data.
• And it is called Test Set Error.
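A minimal sketch of that procedure, assuming scikit-learn (the dataset and the split fraction are arbitrary illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hide some data away before learning the tree...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ...then see how well the learned tree predicts the hidden data.
test_set_error = int((tree.predict(X_test) != y_test).sum())
print(f"test set error: {test_set_error} / {len(y_test)}")
```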
Decision Tree Pruning