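1. P220
Explanation of the following passage:
After obtaining the maximum number of bins, compute the maximum number of splits. For an unordered feature, split = number of bins / 2; for an ordered feature, split = number of bins - 1.
A reader asked where "split = number of bins / 2" comes from for unordered features. The explanation is as follows:
1) First, compute numBins: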
// If the feature's number of categories does not exceed the threshold, treat it as unordered
if (numCategories <= maxCategoriesForUnorderedFeature) { // unordered case
  unorderedFeatures.add(featureIndex)
  numBins(featureIndex) = numUnorderedBins(numCategories)
} else { // ordered case
  numBins(featureIndex) = numCategories
}
From the above, in the unordered case numBins = numUnorderedBins(numCategories).
The numUnorderedBins function is as follows:
/**
 * Given the arity of a categorical feature (arity = number of categories),
 * return the number of bins for the feature if it is to be treated as an unordered feature.
 * There is 1 split for every partitioning of categories into 2 disjoint, non-empty sets;
 * there are math.pow(2, arity - 1) - 1 such splits.
 * Each split has 2 corresponding bins.
 * Explanation: each split produces 2 bins, just as one cut through a watermelon yields 2 pieces.
 */
def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)
By this formula: numBins = 2 * (math.pow(2, arity - 1) - 1)
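The count math.pow(2, arity - 1) - 1 can be checked by brute force. The following is a minimal sketch, not part of the Spark source: the object name UnorderedSplitCount and the helper partitions are hypothetical, introduced only for illustration. It lists every way to divide the categories into two non-empty groups and compares the count with numUnorderedBins.

```scala
object UnorderedSplitCount {
  // Hypothetical helper (not in Spark): enumerate every division of the
  // categories 0 until arity into two non-empty, disjoint groups.
  // Category 0 is pinned to the left group so that each division is
  // listed once instead of twice (left/right swapped).
  def partitions(arity: Int): Seq[(Set[Int], Set[Int])] = {
    val all: Set[Int] = (0 until arity).toSet
    val rest: Set[Int] = (1 until arity).toSet
    rest.subsets().toSeq
      .map(s => s + 0)            // left group always contains category 0
      .filter(_.size < arity)     // drop the case where the right group is empty
      .map(left => (left, all -- left))
  }

  // Same expression as the numUnorderedBins function quoted above.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  def main(args: Array[String]): Unit = {
    for (arity <- 2 to 5) {
      val splits = partitions(arity).size
      println(s"arity=$arity splits=$splits bins=${numUnorderedBins(arity)}")
    }
  }
}
```

For arity = 3 the divisions are {0} | {1, 2}, {0, 1} | {2} and {0, 2} | {1}, so there are 3 splits and 6 bins.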
2) Compute numSplits from numBins:
def numSplits(featureIndex: Int): Int = if (isUnordered(featureIndex)) {
  numBins(featureIndex) >> 1
} else {
  numBins(featureIndex) - 1
}
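By this formula: numSplits = numBins / 2 = math.pow(2, arity - 1) - 1. Each split corresponds to 2 bins, so halving the bin count recovers the number of splits; this is where "split = number of bins / 2" comes from.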
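To see the two cases side by side, here is a small standalone sketch. It is an assumption-laden simplification: the object SplitsFromBins and the Boolean parameter unordered are hypothetical stand-ins for the per-feature lookups (numBins(featureIndex), isUnordered(featureIndex)) used in the excerpts above.

```scala
object SplitsFromBins {
  // Same expression as the numUnorderedBins function quoted earlier.
  def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

  // Hypothetical standalone form: the excerpts store this per feature index.
  def numBins(arity: Int, unordered: Boolean): Int =
    if (unordered) numUnorderedBins(arity) else arity

  // Mirrors the numSplits logic: halve the bins when unordered,
  // subtract one when ordered.
  def numSplits(arity: Int, unordered: Boolean): Int = {
    val bins = numBins(arity, unordered)
    if (unordered) bins >> 1 else bins - 1
  }

  def main(args: Array[String]): Unit = {
    for (arity <- 2 to 5; unordered <- Seq(true, false)) {
      val kind = if (unordered) "unordered" else "ordered"
      println(s"arity=$arity $kind bins=${numBins(arity, unordered)} " +
        s"splits=${numSplits(arity, unordered)}")
    }
  }
}
```

With arity = 4, the unordered case gives 14 bins and 7 splits, while the ordered case gives 4 bins and 3 splits.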