Understand the data

A new data set (problem) is a wrapped gift. It’s full of promise and anticipation at the miracles you can wreak once you’ve solved it. But it remains a mystery until you’ve opened it. This chapter is about opening up your new data set so you can see what’s inside, get an appreciation for what you’ll be able to do with the data, and start thinking about how you’ll approach model building with it.

Attributes (the variables being used to make predictions) are also known as the following:
■Predictors
■Features
■Independent variables
■Inputs

Labels are also known as the following:
■Outcomes
■Targets
■Dependent variables
■Responses

Different Types of Attributes and Labels Drive Modeling Choices

The attributes come in two different types: numeric variables and categorical (or factor) variables. Attribute 1 (height) is a numeric variable and is the most usual type of attribute. Attribute 2 is gender and is indicated by the entry Male or Female. This type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values. There’s no sense to Male < Female (despite centuries of squabbling). Categorical variables can be two‐valued, like Male/Female, or multivalued, like states (AL, AK, AR . . . WY). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms. The numeric/categorical distinction matters because many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K‐nearest neighbors.
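One common way to feed a categorical attribute to a numeric-only algorithm is to replace it with 0/1 indicator columns, one per category (often called one-hot encoding). The notes above do not describe this technique, so the following is only a minimal sketch; the `one_hot_encode` helper and the sample `gender` column are made up for illustration.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into rows of 0/1 indicators.

    Returns the category names (sorted, so the column order is stable)
    and one indicator row per input value.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical two-valued attribute, as in the Male/Female example.
gender = ["Male", "Female", "Female", "Male"]
categories, encoded = one_hot_encode(gender)

print(categories)  # ['Female', 'Male']
print(encoded)     # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

The indicator columns carry no order relation between categories, which matches the property described above.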

When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.
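The naming rules above can be written down as a small decision function. This is a sketch under a simplifying assumption: labels stored as Python numbers are treated as numeric and anything else as categorical. Class codes stored as integers (0/1, for example) would be misread as a regression target, so a real check needs some knowledge of the data. The `problem_type` name is hypothetical.

```python
def problem_type(labels):
    """Name the problem type from the labels, following the rules above."""
    # Assumption: Python ints/floats mean numeric labels (bools excluded).
    is_numeric = all(
        isinstance(y, (int, float)) and not isinstance(y, bool) for y in labels
    )
    if is_numeric:
        return "regression"

    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("need at least two distinct label values")
    if n_classes == 2:
        return "binary classification"
    return "multiclass classification"


print(problem_type([1.5, 2.0, 3.7]))       # regression
print(problem_type(["M", "R", "M", "R"]))  # binary classification
print(problem_type(["AL", "AK", "WY"]))    # multiclass classification
```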

The classification problem might also be simpler than the regression problem. Consider, for instance, the difference in complexity between a topographic map with a single contour line (say the 100‐foot contour line) and a topographic map with contour lines every 10 feet. The single contour divides the map into the areas that are higher than 100 feet and those that are lower and contains considerably less information than the more detailed contour map. A classifier is trying to compute a single dividing contour without regard for behavior distant from the decision boundary, whereas regression is trying to draw the whole map.
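The contour analogy can be made concrete with a few numbers. In the sketch below the elevations are made up: a regression target keeps every elevation, while the classification target only records which side of the 100-foot contour each point lies on.

```python
# Made-up elevations in feet (the regression target keeps all of this detail).
elevations_ft = [42.0, 97.5, 100.5, 180.0, 310.0]

# The classification target only asks: above the 100-foot contour or not?
CONTOUR_FT = 100.0
above_contour = [h > CONTOUR_FT for h in elevations_ft]

print(above_contour)            # [False, False, True, True, True]
print(len(set(elevations_ft)))  # 5 distinct values to predict
print(len(set(above_contour)))  # 2 distinct values to predict
```

Points at 100.5 and 310 feet get the same class label, so a classifier does not need to tell them apart. That is the sense in which behavior far from the decision boundary does not matter.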

Items to Check:
■Number of rows and columns
■Number of categorical variables and number of unique values for each
■Missing values
■Summary statistics for attributes and labels
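The checklist above can be run with the Python standard library alone. The sketch below uses a tiny made-up CSV, and the `summarize` and `to_float` helpers are hypothetical. It assumes that an empty cell means a missing value and that a column is numeric when every non-missing entry parses as a float.

```python
import csv
import io
import statistics

# Made-up data: one missing weight in the second row.
RAW = """height,gender,state,weight
180,Male,AL,82.5
165,Female,AK,
172,Female,AL,70.1
190,Male,WY,95.0
"""


def to_float(text):
    """Return text as a float, or None if it does not parse."""
    try:
        return float(text)
    except ValueError:
        return None


def summarize(csv_text):
    """Report size, column types, unique counts, missing values, and stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = list(rows[0].keys())
    report = {"n_rows": len(rows), "n_cols": len(names), "columns": {}}
    for name in names:
        present = [r[name] for r in rows if r[name] != ""]
        numbers = [to_float(v) for v in present]
        if present and all(x is not None for x in numbers):
            info = {
                "type": "numeric",
                "mean": statistics.mean(numbers),
                "min": min(numbers),
                "max": max(numbers),
            }
        else:
            info = {"type": "categorical", "n_unique": len(set(present))}
        info["missing"] = len(rows) - len(present)
        report["columns"][name] = info
    return report


report = summarize(RAW)
print(report["n_rows"], report["n_cols"])
for name, info in report["columns"].items():
    print(name, info)
```

For larger data sets a data-frame library would do the same job with less code; the point here is only to show what each item on the list measures.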

Classification Problems: Detecting Unexploded Mines Using Sonar

To be continued

Time: 2024-08-08 09:22:07
