Understand the data

A new data set (problem) is a wrapped gift. It’s full of promise and anticipation at the miracles you can wreak once you’ve solved it. But it remains a mystery until you’ve opened it. This chapter is about opening up your new data set so you can see what’s inside, get an appreciation for what you’ll be able to do with the data, and start thinking about how you’ll approach model building with it.

Attributes (the variables being used to make predictions) are also known as the following:
■ Predictors
■ Features
■ Independent variables
■ Inputs

Labels are also known as the following:
■ Outcomes
■ Targets
■ Dependent variables
■ Responses

Different Types of Attributes and Labels Drive Modeling Choices

The attributes come in two different types: numeric variables and categorical (or factor) variables. Attribute 1 (height) is a numeric variable, which is the most usual type of attribute. Attribute 2 is gender and is indicated by the entry Male or Female. This type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values. There’s no sense to Male < Female (despite centuries of squabbling). Categorical variables can be two‐valued, like Male/Female, or multivalued, like states (AL, AK, AR . . . WY). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms as the numeric versus categorical distinction. That distinction matters because many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K‐nearest neighbors.
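Since several algorithms accept numeric attributes only, a categorical attribute has to be converted before they can use it. The text above does not say how; one common approach is one-hot (indicator) encoding, sketched below in plain Python. The function name `one_hot_encode` and the sample records are illustrative assumptions, not part of the original notes.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with 0/1 indicator columns.

    rows is a list of dicts; column is the key holding the categorical value.
    """
    # Sort the distinct values so the output column order is reproducible.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every attribute except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category: 1 for the row's own value, else 0.
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records: height is numeric, gender is categorical.
people = [
    {"height": 180.0, "gender": "Male"},
    {"height": 165.0, "gender": "Female"},
]
encoded_people = one_hot_encode(people, "gender")
print(encoded_people[0])
```

Indicator columns avoid implying an order such as Male < Female, which a single integer code (0, 1, 2, ...) would suggest for multivalued attributes.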

When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.
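The naming rules above can be written as a small helper. This is a minimal sketch that assumes numeric labels are stored as Python numbers and categorical labels as strings; class labels coded as integers (0/1, for instance) would need to be converted to strings first. The name `problem_type` is hypothetical.

```python
from numbers import Number


def problem_type(labels):
    """Name the problem type implied by a list of labels."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(y, Number) and not isinstance(y, bool) for y in labels):
        return "regression"
    n_classes = len(set(labels))
    if n_classes == 2:
        return "binary classification"
    if n_classes > 2:
        return "multiclass classification"
    raise ValueError("need at least two distinct label values")


print(problem_type([1.5, 2.0, 3.7]))
print(problem_type(["M", "R", "M"]))
print(problem_type(["AL", "AK", "AR"]))
```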

The classification problem might also be simpler than the regression problem. Consider, for instance, the difference in complexity between a topographic map with a single contour line (say the 100‐foot contour line) and a topographic map with contour lines every 10 feet. The single contour divides the map into the areas that are higher than 100 feet and those that are lower, and it contains considerably less information than the more detailed contour map. A classifier is trying to compute a single dividing contour without regard for behavior distant from the decision boundary, whereas regression is trying to draw the whole map.
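The contour-map analogy can be made concrete with a few numbers. In the sketch below (the elevations are made-up values), the regression target is the elevation itself, while the classification target only records which side of the 100-foot contour a location falls on.

```python
# Hypothetical elevations in feet at five map locations.
elevations = [42.0, 97.5, 100.5, 180.0, 310.0]
CONTOUR_FEET = 100.0

# Regression target: the elevation itself (the whole map).
regression_labels = elevations

# Classification target: only the side of the single contour line.
classification_labels = [
    "above" if elevation > CONTOUR_FEET else "below" for elevation in elevations
]
print(classification_labels)
```

The locations at 100.5, 180.0, and 310.0 feet all receive the same class label, so the classifier never has to tell them apart; a regression model does.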

Items to Check:
■ Number of rows and columns
■ Number of categorical variables and number of unique values for each
■ Missing values
■ Summary statistics for attributes and labels
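The checklist above can be run with nothing but the Python standard library. The sketch below is one possible way to do it, not the book's code: the inline CSV, the column names, and the rule that an empty field means a missing value are all assumptions made for illustration.

```python
import csv
import io
import statistics

# Hypothetical data; an empty field stands for a missing value.
RAW = """height,gender,weight
180.0,Male,80.5
165.0,Female,
172.5,Female,68.0
"""


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_text):
    """Collect row/column counts, missing values, and per-column statistics."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
    for name in columns:
        present = [row[name] for row in rows if row[name] != ""]
        info = {"missing": len(rows) - len(present)}
        if present and all(is_number(value) for value in present):
            # Numeric column: report summary statistics.
            values = [float(value) for value in present]
            info["type"] = "numeric"
            info["mean"] = statistics.mean(values)
            info["stdev"] = statistics.stdev(values) if len(values) > 1 else 0.0
            info["min"], info["max"] = min(values), max(values)
        else:
            # Categorical column: report the distinct values.
            info["type"] = "categorical"
            info["unique"] = sorted(set(present))
        summary["columns"][name] = info
    return summary


report = summarize(RAW)
for column_name, column_info in report["columns"].items():
    print(column_name, column_info)
```

Treating a column as numeric only when every non-missing entry parses as a number is a simple heuristic; categories coded as integers would be misread as numeric and need to be flagged by hand.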

Classification Problems: Detecting Unexploded Mines Using Sonar

To be continued

Time: 2024-08-08 09:22:07
