5. Classification (I) – Tree, Lazy, and Probabilistic
In this chapter, we will cover the following recipes:
1. Preparing the training and testing datasets
2. Building a classification model with recursive partitioning trees
3. Visualizing a recursive partitioning tree
4. Measuring the prediction performance of a recursive partitioning tree
5. Pruning a recursive partitioning tree
6. Building a classification model with a conditional inference tree
7. Visualizing a conditional inference tree
8. Measuring the prediction performance of a conditional inference tree
9. Classifying data with the k-nearest neighbor classifier
10. Classifying data with logistic regression
11. Classifying data with the Naïve Bayes classifier
Introduction
Classification is used to identify the category of new observations (the testing dataset) based on a classification model built from a training dataset whose categories are already known. Similar to regression, classification is categorized as a supervised learning method because it employs the known answers (labels) of a training dataset to predict the answers (labels) for the testing dataset. The main difference between regression and classification is that regression is used to predict continuous values.
In contrast to this, classification is used to identify the category of a given observation. For example, one may use regression to predict the future price of a given stock based on historical prices. However, one should use the classification method to predict whether the
stock price will rise or fall.
In this chapter, we will illustrate how to use R to perform classification. We first build a training dataset and a testing dataset from the churn dataset, and then apply different classification methods to classify the churn dataset. In the following recipes, we will introduce tree-based classification methods using a traditional classification tree and a conditional inference tree, a lazy learning algorithm, and probability-based methods, using the training dataset to
build up a classification model, and then use the model to predict the category (class label) of
the testing dataset. We will also use a confusion matrix to measure the performance.
Preparing the training and testing datasets
Building a classification model requires a training dataset to train the classification model, and testing data is then needed to validate the prediction performance. In the following recipe, we will demonstrate how to split the telecom churn dataset into a training dataset and a testing dataset.
Getting ready
In this recipe, we will use the telecom churn dataset as the input data source, and split the data into training and testing datasets.
How to do it...
Perform the following steps to split the churn dataset into training and testing datasets:
1. You can retrieve the churn dataset from the C50 package:
> install.packages("C50")
> library(C50)
> data(churn)
2. Use str to read the structure of the dataset:
> str(churnTrain)
3. We can remove the state, area_code, and account_length attributes, which
are not appropriate for classification features:
> churnTrain = churnTrain[,! names(churnTrain) %in% c("state",
"area_code", "account_length") ]
4. Then, split 70 percent of the data into the training dataset and 30 percent of the data
into the testing dataset:
> set.seed(2)
> ind = sample(2, nrow(churnTrain), replace = TRUE, prob=c(0.7, 0.3))
> trainset = churnTrain[ind == 1,]
> testset = churnTrain[ind == 2,]
5. Lastly, use dim to explore the dimensions of both the training and testing datasets:
> dim(trainset)
[1] 2315 17
> dim(testset)
[1] 1018 17
How it works...
In this recipe, we use the telecom churn dataset as our example data source. The dataset contains 20 variables with 3,333 observations. We would like to build a classification model to predict whether a customer will churn, which is very important to the telecom company, as the cost of acquiring a new customer is significantly higher than that of retaining one.
Before building the classification model, we need to preprocess the data first. Thus, we load the churn data from the C50 package into the R session. As we determined that attributes such as state, area_code, and account_length are not useful features for building the classification model, we remove these attributes.
After preprocessing the data, we split it into training and testing datasets. We then use the sample function to randomly generate a sequence of 1s and 2s with a size equal to the number of observations, in which roughly 70 percent of the entries mark the training portion and 30 percent mark the testing portion. Then, we use the generated sequence to split the churn dataset into the training dataset, trainset, and the testing dataset, testset. Lastly, by using the dim function, we found that 2,315 out of the 3,333 observations are categorized into the training dataset, trainset, while the other 1,018 are categorized into the testing dataset, testset.
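As a quick sanity check (not part of the original recipe), you can tabulate the generated sequence to confirm that the split is roughly 70/30; this assumes the ind vector from step 4 is still in the workspace:
> # Proportion of observations assigned to group 1 (training) and group 2 (testing)
> prop.table(table(ind))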
There's more...
You can combine the split process of the training and testing datasets into the split.data function. Therefore, you can easily split the data into the two datasets by calling this function and specifying the proportion and seed in the parameters:
> split.data = function(data, p = 0.7, s = 666){
+   set.seed(s)
+   index = sample(1:dim(data)[1])
+   train = data[index[1:floor(dim(data)[1] * p)], ]
+   test = data[index[(floor(dim(data)[1] * p) + 1):dim(data)[1]], ]
+   return(list(train = train, test = test))
+ }
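For example, a minimal usage sketch (not part of the original text), assuming the churnTrain data frame from the C50 package is still loaded; allset is a hypothetical variable name:
> allset = split.data(churnTrain, p = 0.7, s = 666)
> dim(allset$train)
> dim(allset$test)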
Building a classification model with recursive partitioning trees
A classification tree uses a split condition to predict class labels based on one or multiple input variables. The classification process starts from the root node of the tree; at each node, the process will check whether the input value should recursively continue to the right or left sub-branch according to the split condition, and stops when meeting any leaf (terminal) nodes of the decision tree. In this recipe, we will introduce how to apply a recursive partitioning tree on the customer churn dataset.
Getting ready
You need to have completed the previous recipe by splitting the churn dataset into the training dataset (trainset) and testing dataset (testset), and each dataset should contain exactly 17 variables.
How to do it...
Perform the following steps to build a classification model with a recursive partitioning tree:
1. Load the rpart package:
> library(rpart)
2. Use the rpart function to build a classification tree model:
> churn.rp = rpart(churn ~ ., data=trainset)
3. Type churn.rp to retrieve the node detail of the classification tree:
> churn.rp
4. Next, use the printcp function to examine the complexity parameter:
> printcp(churn.rp)
Classification tree:
rpart(formula = churn ~ ., data = trainset)
Variables actually used in tree construction:
[1] international_plan number_customer_service_calls
[3] total_day_minutes total_eve_minutes
[5] total_intl_calls total_intl_minutes
[7] voice_mail_plan
Root node error: 342/2315 = 0.14773
n= 2315
CP nsplit rel error xerror xstd
1 0.076023 0 1.00000 1.00000 0.049920
2 0.074561 2 0.84795 0.99708 0.049860
3 0.055556 4 0.69883 0.76023 0.044421
4 0.026316 7 0.49415 0.52632 0.037673
5 0.023392 8 0.46784 0.52047 0.037481
6 0.020468 10 0.42105 0.50877 0.037092
7 0.017544 11 0.40058 0.47076 0.035788
8 0.010000 12 0.38304 0.47661 0.035993
5. Next, use the plotcp function to plot the cost complexity parameters:
> plotcp(churn.rp)
Figure 1: The cost complexity parameter plot
6. Lastly, use the summary function to examine the built model:
> summary(churn.rp)
How it works...
In this recipe, we use a recursive partitioning tree from the rpart package to build a tree-based classification model. The recursive partitioning tree includes two processes: recursion and partitioning. During the process of decision induction, we have to consider a
statistical evaluation question (or simply a yes/no question) to partition the data into different partitions in accordance with the assessment result. Then, once we have determined the child node, we can repeatedly perform the splitting until the stopping criteria are satisfied.
For example, the data (shown in the following figure) in the root node can be partitioned into two groups with regard to the question of whether f1 is smaller than X. If so, the data is divided into the left-hand side. Otherwise, it is split into the right-hand side. Then, we can continue to partition the left-hand side data with the question of whether f2 is smaller than Y:
In the first step, we load the rpart package with the library function. Next, we build a classification model using the churn variable as a classification category (class label) and the remaining variables as input features.
After the model is built, you can type the variable name of the built model, churn.rp, to display the tree node details. In the printed node detail, n indicates the sample size, loss indicates the misclassification cost, yval stands for the classified membership (no or yes, in
this case), and yprob stands for the probabilities of two classes (the left value refers to the probability reaching label no, and the right value refers to the probability reaching label, yes).
Then, we use the printcp function to print the complexity parameters of the built tree model. From the output of printcp, one should find the value of CP, a complexity parameter, which serves as a penalty to control the size of the tree. In short, the greater the CP value, the fewer the number of splits (nsplit). The rel error value represents the average deviance of the current tree divided by the average deviance of the null tree. The xerror value represents the relative error estimated by 10-fold cross-validation, and xstd stands for the standard error of the relative error.
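If you prefer to work with this table programmatically rather than reading the printed output, a small sketch (not in the original recipe) is to access the cptable component of the fitted rpart object directly:
> # The same CP, nsplit, rel error, xerror, and xstd columns as printcp(),
> # but returned as a numeric matrix that you can index or sort
> churn.rp$cptable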
To make the CP (cost complexity parameter) table more readable, we use plotcp to generate an information graphic of the CP table. As per the screenshot (step 5), the x-axis at the bottom illustrates the cp value, the y-axis illustrates the relative error, and the upper x-axis
displays the size of the tree. The dotted line indicates the upper limit of a standard deviation. From the screenshot, we can determine that minimum cross-validation error occurs when the tree is at a size of 12.
We can also use the summary function to display the function call, complexity parameter table for the fitted tree model, variable importance, which helps identify the most important variable for the tree classification (summing up to 100), and detailed information of each node.
The advantage of using a decision tree is that it is very flexible and easy to interpret. It works on both classification and regression problems, and more; it is nonparametric, so one does not have to worry about whether the data is linearly separable. The disadvantage of using a decision tree is that it tends to be biased and prone to over-fitting. However, you can
overcome the bias problem through the use of a conditional inference tree, and solve the problem of over-fitting through a random forest method or tree pruning.
See also
·For more information about the rpart, printcp, and summary functions, please
use the help function:
> ?rpart
> ?printcp
> ?summary.rpart
·C50 is another package that provides a decision tree and a rule-based model. If you
are interested in the package, you may refer to the document at http://cran.r-project.org/web/packages/C50/C50.pdf
Visualizing a recursive partitioning tree
From the last recipe, we learned how to print the classification tree in a text format. To make the tree more readable, we can use the plot function to obtain the graphical display of a built classification tree.
Getting ready
One needs to have the previous recipe completed by generating a classification model, and to assign the model to the churn.rp variable.
How to do it...
Perform the following steps to visualize the classification tree:
1. Use the plot function and the text function to plot the classification tree:
> plot(churn.rp, margin= 0.1)
> text(churn.rp, all=TRUE, use.n = TRUE)
2. You can also specify the uniform, branch, and margin parameter to adjust
the layout:
> plot(churn.rp, uniform=TRUE, branch=0.6, margin=0.1)
> text(churn.rp, all=TRUE, use.n = TRUE)
How it works...
Here, we demonstrate how to use the plot function to graphically display a classification
tree. The plot function can simply visualize the classification tree, and you can then use the
text function to add text to the plot.
In Figure 3, we assign margin = 0.1 as a parameter to add extra white space around the border, to prevent the displayed text from being truncated by the margin. The plot shows that the length of the branches reflects the relative magnitude of the drop in deviance. We then use the text
function to add labels to the nodes and branches. By default, the text function adds a split condition on each split, and a category label in each terminal node. In order to add extra information to the tree plot, we set the parameter all to TRUE to add a label
to all the nodes. In addition to this, we add the parameter use.n = TRUE to show the actual number of observations that fall into the two different categories (no and yes).
In Figure 4, we set the option branch to 0.6 to add a shoulder to each plotted branch. In addition to this, in order to display branches of an equal length rather than relative magnitude of the drop in deviance, we set the option uniform to TRUE. As a result, Figure 4 shows a
classification tree with short shoulders and branches of equal length.
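As an optional aside not covered in this recipe, the rpart.plot package (assuming it is installed) provides a one-call alternative that draws the same tree with shaded, labelled nodes:
> # install.packages("rpart.plot")   # if not already installed
> library(rpart.plot)
> rpart.plot(churn.rp)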
See also
·You may use ?plot.rpart to read more about the plotting of the classification
tree. This document also includes information on how to specify the parameters,
uniform, branch, compress, nspace, margin, and minbranch, to adjust the
layout of the classification tree
Measuring the prediction performance of a recursive partitioning tree
Since we have built a classification tree in the previous recipes, we can use it to predict the category (class label) of new observations. Before making a prediction, we first validate the prediction power of the classification tree, which can be done by generating a classification table on the testing dataset. In this recipe, we will introduce how to generate a predicted label versus a real label table with the predict function and the table function, and explain how to generate a confusion matrix to measure the performance.
Getting ready
You need to have the previous recipe completed by generating the classification model, churn.rp. In addition to this, you have to prepare the training dataset, trainset,
and the testing dataset, testset, generated in the first recipe of this chapter.
How to do it...
Perform the following steps to validate the prediction performance of a classification tree:
1. You can use the predict function to generate a predicted label of testing the dataset:
> predictions = predict(churn.rp, testset, type="class")
2. Use the table function to generate a classification table for the testing dataset:
> table(testset$churn, predictions)
predictions
yes no
yes 100 41
no 18 859
3. One can further generate a confusion matrix using the confusionMatrix function
provided in the caret package:
> library(caret)
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions yes no
yes 100 18
no 41 859
Accuracy : 0.942
95% CI : (0.9259, 0.9556)
No Information Rate : 0.8615
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7393
Mcnemar's Test P-Value : 0.004181
Sensitivity : 0.70922
Specificity : 0.97948
Pos Pred Value : 0.84746
Neg Pred Value : 0.95444
Prevalence : 0.13851
How it works...
In this recipe, we use the predict function and the classification model we built, churn.rp, to predict the possible class labels of the testing dataset, testset. The predicted categories (class labels) are coded as either no or yes. Then, we use the table function to generate
a classification table on the testing dataset. From the table, we discover that 859 observations are correctly predicted as no, while 18 are misclassified as yes; 100 of the yes observations are correctly predicted, but 41 are misclassified as no. Further, we use the confusionMatrix function from the caret package to produce a measurement of the classification model.
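As a cross-check of the accuracy reported by confusionMatrix, a small sketch (not part of the original recipe) computes it directly from the classification table; tb is a hypothetical variable name:
> # Overall accuracy = correctly classified observations / all observations
> tb = table(testset$churn, predictions)
> sum(diag(tb)) / sum(tb)
[1] 0.9420432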
See also
· You may use ?confusionMatrix to read more about the performance measurement using the confusion matrix
· For those who are interested in the definitions output by the confusion matrix, please refer to the Wikipedia entry, Confusion_matrix (http://en.wikipedia.org/wiki/Confusion_matrix)
Pruning a recursive partitioning tree
In previous recipes, we have built a complex decision tree for the churn dataset. However, sometimes we have to remove sections that are not powerful in classifying instances to avoid over-fitting, and to improve the prediction accuracy. Therefore, in this recipe, we introduce the cost complexity pruning method to prune the classification tree.
Getting ready
You need to have the previous recipe completed by generating a classification model, and to assign the model to the churn.rp variable.
How to do it...
Perform the following steps to prune the classification tree:
1. Find the minimum cross-validation error of the classification tree model:
> min(churn.rp$cptable[,"xerror"])
[1] 0.4707602
2. Locate the record with the minimum cross-validation errors:
> which.min(churn.rp$cptable[,"xerror"])
7
3. Get the cost complexity parameter of the record with the minimum cross-validation
errors:
> churn.cp = churn.rp$cptable[7,"CP"]
> churn.cp
[1] 0.01754386
4. Prune the tree by setting the cp parameter to the CP value of the record with
minimum cross-validation errors:
> prune.tree = prune(churn.rp, cp= churn.cp)
5. Visualize the classification tree by using the plot and text function:
> plot(prune.tree, margin= 0.1)
> text(prune.tree, all=TRUE , use.n=TRUE)
Figure 5: The pruned classification tree
6. Next, you can generate a classification table based on the pruned classification
tree model:
> predictions = predict(prune.tree, testset, type="class")
> table(testset$churn, predictions)
predictions
yes no
yes 95 46
no 14 863
7. Lastly, you can generate a confusion matrix based on the classification table:
> confusionMatrix(table(predictions, testset$churn))
Confusion Matrix and Statistics
predictions yes no
yes 95 14
no 46 863
Accuracy : 0.9411
95% CI : (0.9248, 0.9547)
No Information Rate : 0.8615
P-Value [Acc > NIR] : 2.786e-16
Kappa : 0.727
Mcnemar's Test P-Value : 6.279e-05
Sensitivity : 0.67376
Specificity : 0.98404
Pos Pred Value : 0.87156
Neg Pred Value : 0.94939
Prevalence : 0.13851
Detection Rate : 0.09332
Detection Prevalence : 0.10707
Balanced Accuracy : 0.82890
'Positive' Class : yes
How it works...
In this recipe, we discussed pruning a classification tree to avoid over-fitting and to produce a more robust classification model. We first located the record with the minimum cross-validation errors within the cptable, and we then extracted the CP of the record and assigned the value to churn.cp. Next, we used the prune function to prune the classification tree with churn.cp as the parameter. Then, by using the plot function, we graphically displayed the pruned classification tree. From Figure 5, it is clear that the tree has fewer splits than the original classification tree (Figure 3). Lastly, we produced a classification table and used the confusion matrix to validate the performance of the pruned
tree. The result shows that the accuracy (0.9411) is slightly lower than that of the original model (0.942), and it suggests that the pruned tree may not perform better than the original classification tree, as we have pruned some split conditions (still, one should examine the change in sensitivity and specificity). However, the pruned tree model is more robust, as it removes some split conditions that may lead to over-fitting.
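To follow up on that caveat, here is a hedged sketch (not in the original recipe) of how to pull the sensitivity and specificity out of the objects returned by confusionMatrix, so the original and pruned trees can be compared on more than accuracy; cm.orig and cm.pruned are hypothetical variable names:
> # confusionMatrix() returns a list; its byClass element holds the per-class metrics
> cm.orig = confusionMatrix(table(predict(churn.rp, testset, type="class"), testset$churn))
> cm.pruned = confusionMatrix(table(predict(prune.tree, testset, type="class"), testset$churn))
> cm.orig$byClass[c("Sensitivity", "Specificity")]
> cm.pruned$byClass[c("Sensitivity", "Specificity")]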
See also
· For those who would like to know more about cost complexity pruning, please refer to the Wikipedia article on Pruning (decision trees): http://en.wikipedia.org/wiki/Pruning_(decision_trees)
Building a classification model with a conditional inference tree
In addition to traditional decision trees (rpart), conditional inference trees (ctree) are another popular tree-based classification method. Similar to traditional decision trees, conditional inference trees also recursively partition the data by performing a univariate split on the dependent variable. However, what makes conditional inference trees different from traditional decision trees is that conditional inference trees adapt significance test procedures to select variables, rather than selecting variables by maximizing information measures (rpart employs a Gini coefficient). In this recipe, we will introduce how to apply a conditional inference tree to build a classification model.
Getting ready
You need to have the first recipe completed by generating the training dataset, trainset,
and the testing dataset, testset.
How to do it...
Perform the following steps to build the conditional inference tree:
1. First, we use ctree from the party package to build the classification model:
> library(party)
> ctree.model = ctree(churn ~ . , data = trainset)
2. Then, we examine the built tree model:
> ctree.model
How it works...
In this recipe, we used a conditional inference tree to build a classification tree. The use of ctree is similar to rpart. Therefore, you can easily test the classification power using either a traditional decision tree or a conditional inference tree when confronting classification problems. Next, we obtain the node details of the classification tree by examining the built model. Within the model, we discover that ctree provides information such as the split condition, the criterion (1 – p-value), the statistic (test statistic), and the weight (the case weight corresponding to the node). However, it does not offer as much information as rpart does through the use of the summary function.
See also
You may use the help function to refer to the definition of Binary Tree Class and
read more about the properties of binary trees:
> help("BinaryTree-class")
Visualizing a conditional inference tree
Similar to rpart, the party package also provides a visualization method for users to plot conditional inference trees. In the following recipe, we will introduce how to use the plot function to visualize conditional inference trees.
Getting ready
You need to have the previous recipe completed by generating the conditional inference tree model, ctree.model. In addition to this, you need to have both trainset and testset loaded in an R session.
How to do it...
Perform the following steps to visualize the conditional inference tree:
1. Use the plot function to plot ctree.model built in the last recipe:
> plot(ctree.model)
Figure 6: A conditional inference tree of churn data
2. To obtain a simple conditional inference tree, one can reduce the built model with
less input features, and redraw the classification tree:
> daycharge.model = ctree(churn ~ total_day_charge, data = trainset)
> plot(daycharge.model)
Figure 7: A conditional inference tree using the total_day_charge variable as only split condition
How it works...
To visualize the node detail of the conditional inference tree, we can apply the plot function on a built classification model. The output figure reveals that every intermediate node shows the dependent variable name and the p-value. The split condition is displayed on the left and right branches. The terminal nodes show the number of categorized observations, n, and the probability of a class label of either 0 or 1.
Taking Figure 7 as an example, we first build a classification model using total_day_charge as the only feature and churn as the class label. The built classification tree shows that when total_day_charge is above 48.18, the lighter gray area is greater than the darker gray in node 9, which indicates that a customer with a day charge of over 48.18 has a greater likelihood of churning (label = yes).
See also
The visualization of the conditional inference tree comes from the plot.BinaryTree function. If you are interested in adjusting the layout of the classification tree, you may use the help function to read the following document: ?plot.BinaryTree
Measuring the prediction performance of a conditional inference tree
After building a conditional inference tree as a classification model, we can use the treeresponse and predict functions to predict categories of the testing dataset, testset, and further validate the prediction power with a classification table and a confusion matrix.
Getting ready
You need to have the previous recipe completed by generating the conditional inference tree model, ctree.model. In addition to this, you need to have both trainset and testset loaded in an R session.
How to do it...
Perform the following steps to measure the prediction performance of a conditional
inference tree:
1. You can use the predict function to predict the category of the testing dataset,
testset:
> ctree.predict = predict(ctree.model ,testset)
> table(ctree.predict, testset$churn)
ctree.predict yes no
yes 99 15
no 42 862
2. Furthermore, you can use confusionMatrix from the caret package to generate
the performance measurements of the prediction result:
> confusionMatrix(table(ctree.predict, testset$churn))
Confusion Matrix and Statistics
ctree.predict yes no
yes 99 15
no 42 862
Accuracy : 0.944
95% CI : (0.9281, 0.9573)
No Information Rate : 0.8615
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7449
Mcnemar's Test P-Value : 0.0005736
Sensitivity : 0.70213
Specificity : 0.98290
Pos Pred Value : 0.86842
Neg Pred Value : 0.95354
Prevalence : 0.13851
Detection Rate : 0.09725
Detection Prevalence : 0.11198
Balanced Accuracy : 0.84251
'Positive' Class : yes
3. You can also use the treeresponse function, which will tell you the list of
class probabilities:
> tr = treeresponse(ctree.model, newdata = testset[1:5,])
> tr
[[1]]
[1] 0.03497409 0.96502591
[[2]]
[1] 0.02586207 0.97413793
[[3]]
[1] 0.02586207 0.97413793
[[4]]
[1] 0.02586207 0.97413793
[[5]]
[1] 0.03497409 0.96502591
How it works...
In this recipe, we first demonstrate that one can use the predict function to predict the category (class label) of the testing dataset, testset, and then employ a table function to generate a classification table. Next, you can use the confusionMatrix function built into the caret package to determine the performance measurements.
In addition to the predict function, treeresponse is also capable of estimating the class probabilities, and the label with the higher probability is chosen as the classification. In this example, we demonstrated how to obtain the estimated class probabilities using the top five records of the testing dataset, testset. The treeresponse function returns a list of five elements, each containing the class probabilities, which you can use to determine the label of each instance.
See also
For the predict function, you can specify the type as response, prob, or node. If you specify the type as prob when using the predict function (for example, predict(… type="prob")), you will get exactly the same result as what treeresponse returns.
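A small sketch of that equivalence (not shown in the original text), assuming ctree.model and testset are still available from the previous steps:
> # Should return the same class probabilities for the first five rows as
> # the treeresponse() call in step 3 above
> predict(ctree.model, newdata = testset[1:5, ], type = "prob")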
Classifying data with the k-nearest neighbor classifier
K-nearest neighbor (knn) is a nonparametric lazy learning method. From a nonparametric view, it does not make any assumptions about data distribution. In terms of lazy learning, it does not require an explicit learning phase for generalization. The following recipe will introduce how to apply the k-nearest neighbor algorithm on the churn dataset.
Getting ready
You need to have the previous recipe completed by generating the training and testing datasets.
How to do it...
Perform the following steps to classify the churn data with the k-nearest neighbor algorithm:
1. First, one has to install the class package and have it loaded in an R session:
> install.packages("class")
> library(class)
2. Replace yes and no of the voice_mail_plan and international_plan
attributes in both the training dataset and testing dataset to 1 and 0:
> levels(trainset$international_plan) = list("0"="no", "1"="yes")
> levels(trainset$voice_mail_plan) = list("0"="no", "1"="yes")
> levels(testset$international_plan) = list("0"="no", "1"="yes")
> levels(testset$voice_mail_plan) = list("0"="no", "1"="yes")
3. Use the knn classification method on the training dataset and the testing dataset:
> churn.knn = knn(trainset[,! names(trainset) %in% c("churn")],
testset[,! names(testset) %in% c("churn")], trainset$churn, k=3)
4. Then, you can use the summary function to retrieve the number of predicted labels:
> summary(churn.knn)
yes no
77 941
5. Next, you can generate the classification matrix using the table function:
> table(testset$churn, churn.knn)
churn.knn
yes no
yes 44 97
no 33 844
6. Lastly, you can generate a confusion matrix by using the confusionMatrix
function:
> confusionMatrix(table(testset$churn, churn.knn))
Confusion Matrix and Statistics
churn.knn
yes no
yes 44 97
no 33 844
Accuracy : 0.8723
95% CI : (0.8502, 0.8922)
No Information Rate : 0.9244
P-Value [Acc > NIR] : 1
Kappa : 0.339
Mcnemar's Test P-Value : 3.286e-08
Sensitivity : 0.57143
Specificity : 0.89692
Pos Pred Value : 0.31206
Neg Pred Value : 0.96237
Prevalence : 0.07564
Detection Rate : 0.04322
Detection Prevalence : 0.13851
Balanced Accuracy : 0.73417
'Positive' Class : yes
How it works...
knn stores all the training samples and classifies new instances based on a similarity (distance) measure.
For example, the similarity measure can be formulated as the Euclidean distance between two observations x and y:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
In knn, a new instance is classified to the label (class) that is most common among its k-nearest neighbors. If k = 1, then the new instance is assigned to the class of its nearest neighbor. The only required input for the algorithm is k. If we give a small k input, it may lead to over-fitting. On the other hand, if we give a large k input, it may result in under-fitting. To choose a proper k-value, one can count on cross-validation.
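As a minimal illustration of the distance measure above (not part of the original recipe), euclid is a hypothetical helper name:
> # Euclidean distance between two numeric vectors of equal length
> euclid = function(x, y) sqrt(sum((x - y)^2))
> euclid(c(1, 2, 3), c(2, 4, 6))
[1] 3.741657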
The advantages of knn are:
· The cost of the learning process is zero
· It is nonparametric, which means that you do not have to make the assumption of data distribution
· You can classify any data whenever you can find similarity measures of given instances
The main disadvantages of knn are:
· It is hard to interpret the classified result.
· It is an expensive computation for a large dataset.
· The performance relies on the number of dimensions. Therefore, for a high dimension problem, you should reduce the dimension first to increase the process performance.
The use of knn does not vary significantly from applying the tree-based algorithms mentioned in the previous recipes. However, while a tree-based algorithm may show you the decision tree model, the output produced by knn only reveals the predicted category factors. Before building the classification model, one should convert string-type attributes to numeric values, since the k-nearest neighbor algorithm needs to calculate the distances between observations. Then, we build up a classification model by specifying k=3, which means choosing the three nearest neighbors. After the classification model is built, we can generate a classification table using the predicted factors and the testing dataset label as the input. Lastly, we can generate a confusion matrix from the classification table. The confusion matrix output reveals an accuracy of 0.8723, which suggests that both the tree-based methods mentioned in the previous recipes outperform the k-nearest neighbor
classification method in this case. Still, we cannot determine which model is better based merely on accuracy; one should also examine the specificity and sensitivity from the output.
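To illustrate the earlier point about choosing k, here is a hedged sketch (not from the original text) that simply re-runs the recipe's knn call for a few candidate values of k and reports accuracy on testset; a proper treatment would cross-validate within trainset instead of reusing the test data:
> for (k in c(1, 3, 5, 7)) {
+   pred = knn(trainset[,! names(trainset) %in% c("churn")],
+              testset[,! names(testset) %in% c("churn")],
+              trainset$churn, k = k)
+   # proportion of testset labels predicted correctly for this k
+   print(c(k = k, accuracy = round(mean(pred == testset$churn), 4)))
+ }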
See also
There is another package named kknn, which provides a weighted k-nearest neighbor classification, regression, and clustering. You can learn more about the package by reading this document: http://cran.r-project.org/web/packages/kknn/kknn.pdf.
Classifying data with logistic regression
Logistic regression is a form of probabilistic statistical classification model, which can be used to predict class labels based on one or more features. The classification is done by using the logit function to estimate the outcome probability. One can use logistic regression
by specifying the family as a binomial while using the glm function. In this recipe, we will introduce how to classify data using logistic regression.
Getting ready
You need to have completed the first recipe by generating training and testing datasets.
How to do it...
Perform the following steps to classify the churn data with logistic regression:
1. With the specification of family as a binomial, we apply the glm function on the
dataset, trainset, by using churn as a class label and the rest of the variables as
input features:
> fit = glm(churn ~ ., data = trainset, family=binomial)
2. Use the summary function to obtain summary information of the built logistic
regression model:
> summary(fit)
Call:
glm(formula = churn ~ ., family = binomial, data = trainset)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1519 0.1983 0.3460 0.5186 2.1284
Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                    8.3462866  0.8364914   9.978  < 2e-16 ***
international_planyes         -2.0534243  0.1726694 -11.892  < 2e-16 ***
voice_mail_planyes             1.3445887  0.6618905   2.031 0.042211 *
number_vmail_messages         -0.0155101  0.0209220  -0.741 0.458496
total_day_minutes              0.2398946  3.9168466   0.061 0.951163
total_day_calls               -0.0014003  0.0032769  -0.427 0.669141
total_day_charge              -1.4855284 23.0402950  -0.064 0.948592
total_eve_minutes              0.3600678  1.9349825   0.186 0.852379
total_eve_calls               -0.0028484  0.0033061  -0.862 0.388928
total_eve_charge              -4.3204432 22.7644698  -0.190 0.849475
total_night_minutes            0.4431210  1.0478105   0.423 0.672367
total_night_calls              0.0003978  0.0033188   0.120 0.904588
total_night_charge            -9.9162795 23.2836376  -0.426 0.670188
total_intl_minutes             0.4587114  6.3524560   0.072 0.942435
total_intl_calls               0.1065264  0.0304318   3.500 0.000464 ***
total_intl_charge             -2.0803428 23.5262100  -0.088 0.929538
number_customer_service_calls -0.5109077  0.0476289 -10.727  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1938.8 on 2314 degrees of freedom
Residual deviance: 1515.3 on 2298 degrees of freedom
AIC: 1549.3
Number of Fisher Scoring iterations: 6
3. Then, we find that the built model contains insignificant variables, which would
lead to misclassification. Therefore, we use significant variables only to train the
classification model:
> fit = glm(churn ~ international_plan + voice_mail_plan + total_intl_calls +
  number_customer_service_calls, data = trainset, family = binomial)
> summary(fit)
Call:
glm(formula = churn ~ international_plan + voice_mail_plan +
    total_intl_calls + number_customer_service_calls, family = binomial,
    data = trainset)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7308 0.3103 0.4196 0.5381 1.6716
Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                     2.32304    0.16770  13.852  < 2e-16 ***
international_planyes          -2.00346    0.16096 -12.447  < 2e-16 ***
voice_mail_planyes              0.79228    0.16380   4.837 1.32e-06 ***
total_intl_calls                0.08414    0.02862   2.939  0.00329 **
number_customer_service_calls  -0.44227    0.04451  -9.937  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1938.8 on 2314 degrees of freedom
Residual deviance: 1669.4 on 2310 degrees of freedom
AIC: 1679.4
Number of Fisher Scoring iterations: 5
4. Then, you can then use a fitted model, fit, to predict the outcome of testset. You
can also determine the class by judging whether the probability is above 0.5:
> pred = predict(fit,testset, type="response")
> Class = pred >.5
5. Next, the use of the summary function will show you the binary outcome count, and
reveal whether the probability is above 0.5:
> summary(Class)
Mode FALSE TRUE NA's
logical 29 989 0
6. You can generate the counting statistics based on the testing dataset label and
predicted result:
> tb = table(testset$churn,Class)
> tb
Class
FALSE TRUE
yes 18 123
no 11 866
7. You can turn the statistics of the previous step into a classification table, and then
generate the confusion matrix:
> churn.mod = ifelse(testset$churn == "yes", 1, 0)
> pred_class = churn.mod
> pred_class[pred<=.5] = 1- pred_class[pred<=.5]
> ctb = table(churn.mod, pred_class)
> ctb
pred_class
churn.mod 0 1
0 866 11
1 18 123
> confusionMatrix(ctb)
Confusion Matrix and Statistics
pred_class
churn.mod 0 1
0 866 11
1 18 123
Accuracy : 0.9715
95% CI : (0.9593, 0.9808)
No Information Rate : 0.8684
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8781
Mcnemar's Test P-Value : 0.2652
Sensitivity : 0.9796
Specificity : 0.9179
Pos Pred Value : 0.9875
Neg Pred Value : 0.8723
Prevalence : 0.8684
Detection Rate : 0.8507
Detection Prevalence : 0.8615
Balanced Accuracy : 0.9488
'Positive' Class : 0
How it works...
Logistic regression is very similar to linear regression; the main difference is that the dependent variable in linear regression is continuous, while the dependent variable in logistic regression is dichotomous (or nominal). The primary goal of logistic regression is to use the logit to model how the probability of the nominal outcome is related to the measurement variables. We can formulate the logit in the following equation: ln(P/(1-P)), where P is the probability that a certain event occurs.
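As a small illustration of that transform (not part of the original text), logit and inv_logit are hypothetical helper names:
> # The logit maps a probability to the real line; its inverse (the logistic
> # function) maps it back, which is what glm with family=binomial relies on
> logit = function(p) log(p / (1 - p))
> inv_logit = function(x) 1 / (1 + exp(-x))
> logit(0.8)
[1] 1.386294
> inv_logit(1.386294)
[1] 0.8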
The advantage of logistic regression is that it is easy to interpret, it directly models the class probability, and it provides a confidence interval for the result. Unlike the decision tree, which is hard to update, you can quickly update the classification model to incorporate new data in logistic regression. The main drawback of the algorithm is that it suffers from multicollinearity and, therefore, the explanatory variables must be linearly independent. glm provides a generalized linear model, which enables specifying the model family in the family option. To perform a binomial logistic regression, you set the family to binomial, which lets you classify the categorical dependent variable.
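Regarding the confidence intervals mentioned above, a brief sketch (not in the original recipe) of how to obtain them for the fitted model from step 3:
> # Confidence intervals for the coefficients; exponentiating turns them
> # into odds-ratio intervals
> exp(confint(fit))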
The classification process begins by generating a logistic regression model from the training dataset, specifying churn as the class label, the other variables as training features, and the family as binomial. We then use the summary function to generate the model's summary information. From the summary information, we may find some insignificant variables (p-values > 0.05), which may lead to misclassification. Therefore, we should consider only significant variables for the model.
Next, we use the fitted model, fit, with the predict function to predict the categorical dependent variable of the testing dataset, testset. The predict function outputs the probability of a class label: a result of 0.5 and below suggests that the predicted label does not match the label of the testing dataset, and a probability above 0.5 indicates that the predicted label matches the label of the testing dataset. Further, we can use the summary function to obtain the statistics of whether the predicted label matches the label of the testing dataset. Lastly, in order to generate a confusion matrix, we first generate a classification table, and then use confusionMatrix to generate the performance measurement.
See also
· For more information on how to use the glm function, please refer to Chapter 4, Understanding Regression Analysis, which covers how to interpret the output of the glm function
Classifying data with the Naïve Bayes classifier
The Naïve Bayes classifier is also a probability-based classifier, which is based on applying the Bayes theorem with a strong independence assumption. In this recipe, we will introduce how to classify data with the Naïve Bayes classifier.
Getting ready
You need to have the first recipe completed by generating training and testing datasets.
How to do it...
Perform the following steps to classify the churn data with the Naïve Bayes classifier:
1. Load the e1071 library and employ the naiveBayes function to build the classifier:
> library(e1071)
> classifier = naiveBayes(trainset[, !names(trainset) %in%
c("churn")], trainset$churn)
2. Type classifier to examine the function call, a-priori probability, and conditional
probability:
> classifier
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = trainset[, !names(trainset) %in%
c("churn")],
y = trainset$churn)
A-priori probabilities:
trainset$churn
yes no
0.1477322 0.8522678
Conditional probabilities:
international_plan
trainset$churn no yes
yes 0.70467836 0.29532164
no 0.93512418 0.06487582
3. Next, you can generate a classification table for the testing dataset:
> bayes.table = table(predict(classifier, testset[,
!names(testset) %in% c("churn")]), testset$churn)
> bayes.table
yes no
yes 68 45
no 73 832
4. Lastly, you can generate a confusion matrix from the classification table:
> confusionMatrix(bayes.table)
Confusion Matrix and Statistics
yes no
yes 68 45
no 73 832
Accuracy : 0.8841
95% CI : (0.8628, 0.9031)
No Information Rate : 0.8615
P-Value [Acc > NIR] : 0.01880
Kappa : 0.4701
Mcnemar's Test P-Value : 0.01294
Sensitivity : 0.4823
Specificity : 0.9487
Pos Pred Value : 0.6018
Neg Pred Value : 0.9193
Prevalence : 0.1385
Detection Rate : 0.0668
Detection Prevalence : 0.1110
Balanced Accuracy : 0.7155
'Positive' Class : yes
How it works...
Naïve Bayes assumes that features are conditionally independent, which means that the effect of a predictor (x) on a class (c) is independent of the effect of the other predictors on that class. It computes the posterior probability, P(c|x), with the following formula:
P(c|x) = P(x|c) P(c) / P(x)
where P(x|c) is called the likelihood, P(x) is called the marginal likelihood, and P(c) is called the prior probability. If there are many predictors, we can formulate the posterior probability as follows:
P(c|x) = P(x1|c) × P(x2|c) × ... × P(xn|c) × P(c)
The advantage of Naïve Bayes is that it is relatively simple and straightforward to use. It is suitable when the training set is relatively small, and it may contain some noisy and missing data. Moreover, you can easily obtain the probability for a prediction. The drawbacks of Naïve Bayes are that it assumes that all features are independent and equally important, which is very unlikely in real-world cases.
In this recipe, we use the Naïve Bayes classifier from the e1071 package to build a classification model. First, we specify all the variables (excluding the churn class label) as the first input parameter, and specify the churn class label as the second parameter in the naiveBayes function call. Next, we assign the classification model to the variable classifier. Then, we print the variable classifier to obtain information, such as the function call, the a-priori probabilities, and the conditional probabilities. We can also use the predict function to obtain the predicted outcome and the table function to retrieve the classification table of the testing dataset. Finally, we use a confusion matrix to calculate the performance measurement of the classification model.
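If you also want the estimated class probabilities rather than just the predicted labels, a short sketch (not part of the original recipe) using the type = "raw" option of predict for naiveBayes models:
> # Posterior probabilities of yes/no for the first five rows of testset
> predict(classifier, testset[1:5, !names(testset) %in% c("churn")], type = "raw")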
In-depth study of the project:
The consolidated project outline is as follows:
1. Preparing the training and testing datasets
2. Building a classification model with recursive partitioning trees
3. Visualizing a recursive partitioning tree
4. Measuring the prediction performance of a recursive partitioning tree
5. Pruning a recursive partitioning tree
6. Building a classification model with a conditional inference tree
7. Visualizing a conditional inference tree
8. Measuring the prediction performance of a conditional inference tree
9. Classifying data with the k-nearest neighbor classifier
10. Classifying data with logistic regression
11. Classifying data with the Naïve Bayes classifier
1: Preparing the training and testing datasets. When studying a project of some kind, first carry out a preliminary survey of it, collect some valuable data in practice, write it up as an R package, and use R to open the data on a computer.
2: Building a classification model with recursive partitioning trees. Some data exhibit certain patterns and meaning; after writing them out in R, build a data tree and annotate its branches.
3: Visualizing a recursive partitioning tree uses computer graphics and image processing technology, that is, the theory, methods, and techniques for converting data into graphics or images displayed on a screen and processing them interactively. Visualization was first applied in computer science and has become an important branch of it; the development of computer graphics made 3D representation techniques possible, and these techniques allow us to reproduce objects of the three-dimensional world and to represent complex information with 3D shapes.
4: Measuring the prediction performance of a recursive partitioning tree. In this recipe, we use the recursive partitioning tree from the rpart package to build a tree-based classification model. The recursive partitioning tree involves two processes: recursion and partitioning. During decision induction, we have to consider a statistical evaluation question (or simply a yes/no question) to partition the data into different partitions according to the assessment result; then, once the child node is determined, we can repeatedly perform the splitting until the stopping criteria are satisfied.
5: Pruning a recursive partitioning tree. For a partitioning tree that has already been built, prune away the spurious and useless data on it, so that the data tree becomes leaner.
6: Building a classification model with a conditional inference tree. In this recipe, we first show that one can use the predict function to predict the category (class label) of the testing dataset, testset, and then use a table function to generate a classification table. Next, you can use the confusionMatrix function from the caret package to determine the performance measurements. In addition to the predict function, treeresponse can also estimate the class probabilities, and the label with the higher probability is usually chosen. In this example, we demonstrate how to obtain the estimated class probabilities using the top five records of the testing dataset, testset; the treeresponse function returns a list of five probabilities, which you can use to determine the label of each instance.
7: Visualizing a conditional inference tree. As in item 3, computer graphics and image processing technology are used to convert the data into graphics or images displayed on a screen and to process them interactively, in order to reason about the data.
8: Measuring the prediction performance of a conditional inference tree. After building the conditional inference tree classification model, use the treeresponse and predict functions to predict the class labels of the testing dataset, testset, and further use a classification table and a confusion matrix to evaluate the predictive power of the algorithm.
9: Classifying data with the k-nearest neighbor classifier. The knn algorithm is a nonparametric lazy learning method: nonparametric algorithms make no assumptions about the distribution of the data, and lazy learning methods do not require the algorithm to have an explicit learning phase.
10: Classifying data with logistic regression. Logistic regression is a method based on a probabilistic statistical classification model; it predicts class labels according to one or more features.
11: Classifying data with the Naïve Bayes classifier. The Bayes classifier is a form of supervised learning, and two models are common: the multinomial model (word-frequency based) and the Bernoulli model (document based). The two differ in computational granularity: the multinomial model works at the word level, while the Bernoulli model works at the document level, so their prior probabilities and class-conditional probabilities are computed differently. When computing the posterior probability for a document d, only the words that appear in d take part in the computation under the multinomial model; under the Bernoulli model, words that do not appear in d but do appear in the global vocabulary also take part in the computation, but as "negative" evidence. Feature extraction, and taking logarithms to avoid zero values in the class-conditional probabilities when scoring test documents, are not considered here.
Researchers: 刘秉茜, 张达衢, 刘宇霆