Principle of the Decision Tree Algorithm

The decision tree is a classic family of machine learning algorithms. It can be used for both classification and regression, and it is also particularly well suited to ensemble learning methods such as random forests. This article summarizes the principles of decision tree algorithms: the first part covers the ideas behind ID3 and C4.5, and the second part focuses on the CART algorithm. A decision tree partitions the feature space by splitting on attributes step by step, thereby separating samples of different classes.

1. The information-theoretic basis of the decision tree ID3 algorithm
The idea behind this machine learning algorithm is very old. As a programmer, I often write if / else if / else, and that already embodies the idea of a decision tree. But have you ever thought about it: with so many conditions, which conditional feature should be tested first, and which one should come after it? How to choose this criterion precisely is the key to decision tree learning. In the 1970s, Quinlan proposed using entropy from information theory to measure the quality of a split in a decision tree. As soon as the method appeared, its simplicity and efficiency caused a sensation, and Quinlan called the algorithm ID3. Let's take a look at how the ID3 algorithm chooses features.

First, we need to be familiar with the concept of entropy from information theory. Entropy measures the uncertainty of a random variable: the more uncertain the variable, the greater its entropy. Specifically, the entropy of a random variable X is defined as:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Here n is the number of distinct discrete values that X can take, $p_i$ is the probability that X takes its i-th value, and the logarithm is taken to base 2 or base e. For example, if X has two possible values, each with probability 1/2, then the entropy of X is largest and X is most uncertain: $H(X) = -(\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{2}\log\tfrac{1}{2}) = \log 2$. If one value has probability greater than 1/2 and the other less than 1/2, the uncertainty decreases and so does the entropy. For example, with probabilities 1/3 and 2/3, the entropy is $H(X) = -(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}) = \log 3 - \tfrac{2}{3}\log 2 < \log 2$.
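To make the definition concrete, here is a minimal Python sketch (not from the original article) that computes the entropy of a discrete distribution and reproduces the two examples above; the function name `entropy` and the choice of log base are my own illustration.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum(p_i * log(p_i)); terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Fair coin: maximum uncertainty for two equally likely outcomes.
print(entropy([0.5, 0.5]))   # 1.0 bit
# Skewed coin: uncertainty (and entropy) drops.
print(entropy([1/3, 2/3]))   # ~0.918 bits, i.e. log2(3) - 2/3 in base 2
```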

Having seen the entropy of a single variable X, it is easy to generalize to the joint entropy of several variables. The joint entropy of two variables X and Y is:

$$H(X, Y) = -\sum_{i=1}^{n} p(x_i, y_i) \log p(x_i, y_i)$$

With joint entropy in hand, we can define the conditional entropy H(X|Y). Conditional entropy is analogous to conditional probability: it measures the uncertainty that remains in X once Y is known. Its expression is:

$$H(X|Y) = -\sum_{i=1}^{n} p(x_i, y_i) \log p(x_i|y_i) = \sum_{j=1}^{n} p(y_j) H(X|y_j)$$
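As an illustration of the second form of this formula, the following sketch computes H(X|Y) from a small, made-up joint probability table; the dictionary layout and variable names are assumptions for the example only.

```python
import math

def conditional_entropy(joint, base=2):
    """H(X|Y) from a joint table joint[y][x] = p(x, y), using H(X|Y) = sum_j p(y_j) H(X|y_j)."""
    h = 0.0
    for y_row in joint.values():
        p_y = sum(y_row.values())
        if p_y == 0:
            continue
        # H(X | Y = y_j) over the conditional distribution p(x | y_j) = p(x, y_j) / p(y_j)
        h_x_given_y = -sum((p_xy / p_y) * math.log(p_xy / p_y, base)
                           for p_xy in y_row.values() if p_xy > 0)
        h += p_y * h_x_given_y
    return h

# Toy joint distribution p(x, y); rows are y values, columns are x values.
joint = {
    "y1": {"x1": 0.25, "x2": 0.25},
    "y2": {"x1": 0.40, "x2": 0.10},
}
print(conditional_entropy(joint))   # H(X|Y) in bits
```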

OK, after this long detour we can finally return to the ID3 algorithm. We just said that H(X) measures the uncertainty of X, and the conditional entropy H(X|Y) measures the uncertainty of X that remains after Y is known. So what does H(X) - H(X|Y) measure? As the description above suggests, it measures how much the uncertainty of X is reduced by knowing Y. In information theory this quantity is called mutual information and is written I(X, Y); in the decision tree ID3 algorithm it is called information gain. ID3 uses information gain to decide which feature the current node should split on: the larger the information gain, the better suited that feature is for classification.
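The following sketch shows how ID3's feature-selection step could look in Python: compute the information gain of each candidate feature on a tiny, hypothetical categorical dataset and split on the feature with the largest gain. The toy data and helper names are invented for illustration, not taken from the article.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """IG = H(D) - sum_v |D_v|/|D| * H(D_v), splitting on one categorical feature."""
    total = entropy_of_labels(labels)
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(subset) / n * entropy_of_labels(subset) for subset in by_value.values())
    return total - remainder

# Hypothetical toy data: features are (outlook, windy), label is "play" yes/no.
rows   = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]

# ID3 picks the feature with the largest information gain at the current node.
gains = {i: information_gain(rows, labels, i) for i in range(2)}
best = max(gains, key=gains.get)
print(gains, "-> split on feature", best)
```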

With so many concepts in a row it is easy to get dizzy, so the following Venn-style diagram makes their relationships clearer. The left ellipse represents H(X), the right ellipse represents H(Y), the overlap in the middle is the mutual information (information gain) I(X, Y), the left ellipse minus the overlap is H(X|Y), the right ellipse minus the overlap is H(Y|X), and the union of the two ellipses is H(X, Y).

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Appendix: Interpretation of the Entropy and Information Gain Concepts

1. Information entropy:

    H(X) describes the amount of information carried by X. The greater the amount of information (the more values X can take), the more uncertain X is and the harder it is to predict.

For a coin flip there are 2 possible outcomes each time, and the information entropy is 1 bit.

For rolling a die there are 6 possible outcomes each time, and the information entropy is log2(6) ≈ 2.58 bits.

The formula is the same entropy formula as above, with the logarithm taken to base 2:

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)$$

Here $-\log_2 p$ can be understood as the number of bits needed to encode an outcome of probability p. For example, with p(x1) = 1/2, p(x2) = 1/4, p(x3) = 1/8, p(x4) = 1/8, the outcomes can be encoded as x1: 0, x2: 10, x3: 110, x4: 111; higher-probability outcomes get shorter codes so that the average code length is minimized, and $-\log_2 p$ is exactly the code length in bits. H(X) can then be understood as the expected number of bits, i.e. the average code length; for this example it is 1.75 bits.
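A quick numerical check of this coding interpretation, using the made-up code table above (a sketch, not part of the original note): the average code length equals the entropy, 1.75 bits.

```python
import math

# Hypothetical distribution and prefix code from the note above.
probs = {"x1": 1/2, "x2": 1/4, "x3": 1/8, "x4": 1/8}
code  = {"x1": "0", "x2": "10", "x3": "110", "x4": "111"}

entropy = -sum(p * math.log2(p) for p in probs.values())    # expected bits
avg_len = sum(p * len(code[x]) for x, p in probs.items())   # average code length
print(entropy, avg_len)   # both 1.75: each symbol uses -log2(p) bits
```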
Properties of information entropy (assuming the probabilities sum to 1):

  • a) The more uniform the probability distribution over the categories, the greater the information entropy;
  • b) The more categories there are, the greater the information entropy;
  • c) The greater the information entropy, the harder the variable is to predict (many possible values with similar probabilities make prediction difficult); for a deterministic outcome the entropy is 0, since p = 1 and -p·log p = 0. These three properties are illustrated in the sketch below.
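A small sketch illustrating the three properties above numerically (the specific distributions are arbitrary examples of my own):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a) Uniform vs skewed with the same number of categories.
print(entropy([0.5, 0.5]), ">", entropy([0.9, 0.1]))   # 1.0 > ~0.47
# b) More categories (uniform) means higher entropy.
print(entropy([1/4] * 4), ">", entropy([1/2] * 2))     # 2.0 > 1.0
# c) A deterministic outcome has zero entropy.
print(entropy([1.0]))                                  # 0.0
```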

2. Information gain IG(Y|X): measures how well an attribute X separates the samples by class Y. When the attribute X is introduced, the reduction in the information entropy H(Y) is the information gain. The larger IG(Y|X) is, the more important X is.

Conditional entropy: H(Y|X), the entropy of Y given X

Information gain: IG(Y|X)=H(Y)-H(Y|X)

  • Entropy: In information theory and probability statistics, entropy is a measure of the uncertainty of a random variable.
  • Conditional entropy: A measure of the uncertainty of a random variable Y under the condition of a random variable X.
  • Information gain: the degree to which knowing the feature X reduces the uncertainty about the class Y.
  • Information gain ratio: the ratio of the information gain g(D, A) to the entropy H_A(D) of the training data set D with respect to the values of feature A.
  • Gini index: Gini(D) represents the uncertainty (impurity) of the set D. The larger the Gini index, the greater the uncertainty of the sample set, which is similar to entropy. Both quantities are computed in the sketch below.
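As a rough illustration of the last two definitions, here is a sketch (with invented toy data, reusing the layout from the earlier example) that computes the Gini index of a label set and the gain ratio of a feature:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2; larger means a more impure (uncertain) set."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain_ratio(rows, labels, feature_index):
    """Information gain g(D, A) divided by the split entropy H_A(D) of feature A's values."""
    n = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature_index], []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in by_value.values())
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in by_value.values())
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data, same format as the earlier ID3 sketch.
rows   = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
labels = ["no", "no", "yes", "no", "yes"]
print(gini(labels), gain_ratio(rows, labels, 0))
```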

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Original article: https://www.cnblogs.com/aiden-liu/p/10773606.html
