4. Data analysis with R

1. Sorting

sort(x, decreasing = ): returns the sorted data
order(x, decreasing = ): returns the indexes that would sort the data

Example:

v = c(2, 9, 1, 45, -3, 19, -5, 6)

sort(v) # returns v sorted in increasing order (the default)
Result:
# [1] -5 -3 1 2 6 9 19 45
sort(v, decreasing = FALSE) # explicitly increasing order; same result
Result:
# [1] -5 -3 1 2 6 9 19 45

order(v) # returns the indexes of v in sorted order
Result:
# [1] 7 5 3 1 8 2 6 4
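
As a quick check of how the two functions relate, the sketch below indexes v with the result of order(v) and compares it with sort(v). It also shows decreasing = TRUE, which the example above does not use. The names idx and sorted_via_order are only illustrative.

```r
v <- c(2, 9, 1, 45, -3, 19, -5, 6)

# Indexing v by order(v) reproduces sort(v).
idx <- order(v)
sorted_via_order <- v[idx]
identical(sorted_via_order, sort(v))  # TRUE

# decreasing = TRUE reverses the direction for both functions.
sort(v, decreasing = TRUE)   # 45 19 9 6 2 1 -3 -5
order(v, decreasing = TRUE)  # 4 6 2 8 1 3 5 7
```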

order() can be combined with more complex data structures to handle more complex cases,
for example:

#Imagine that you just want to access Sepal Length and Species.
# You can access those values in different ways:
ir[order(ir$Sepal.Length, decreasing = TRUE),c("Sepal.Length", "Species")][1:5,]

ir[order(ir$Sepal.Length, decreasing = TRUE),][1:5, c("Sepal.Length", "Species")]
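
The snippet above uses ir without defining it. The sketch below assumes ir is simply a copy of the built-in iris data set (an assumption; the original does not say) and checks that the two indexing orders return the same rows.

```r
# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

# Sort rows by Sepal.Length (descending), keep two columns and the first five rows.
a <- ir[order(ir$Sepal.Length, decreasing = TRUE), c("Sepal.Length", "Species")][1:5, ]
b <- ir[order(ir$Sepal.Length, decreasing = TRUE), ][1:5, c("Sepal.Length", "Species")]

# Selecting columns before or after taking the first rows gives the same result.
isTRUE(all.equal(a, b))  # TRUE
a$Sepal.Length           # 7.9 7.7 7.7 7.7 7.7
```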

2. aggregate()

2.1 Processing data by group

aggregate(x, by, FUN, ..., simplify = TRUE)

x is an R object (commonly a data frame).
by is a list of the elements by which the data are grouped.
FUN is the function that will be applied to each subset.
simplify is a logical value that indicates whether results should be simplified into a vector or a matrix where possible.

Examples:

One variable, one grouping factor:
aggregate(ir$Sepal.Length, by= list(ir$Species), FUN=mean)
aggregate(ir$Sepal.Width, by= list(ir$Species), summary)

Several variables, one grouping factor:
aggregate(ir[,c("Sepal.Length", "Sepal.Width")], by=list(ir$Species), mean)

One variable, several grouping factors:
mean_by_sp_ind = aggregate(ir$Sepal.Length, by=list(ir$Species, ir$indoor), mean)
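
A minimal runnable sketch of the grouped call follows. It assumes ir is a copy of iris, and the indoor column is invented here purely so that a second grouping variable exists; neither is defined in the original notes. Naming the elements of the by list is optional but gives readable group columns.

```r
# Assumptions: `ir` is a copy of iris; `indoor` is a made-up logical column.
ir <- iris
ir$indoor <- rep(c(TRUE, FALSE), length.out = nrow(ir))

# Named list elements replace the default Group.1 / Group.2 column names.
mean_by_sp_ind <- aggregate(ir$Sepal.Length,
                            by = list(Species = ir$Species, indoor = ir$indoor),
                            FUN = mean)

names(mean_by_sp_ind)  # "Species" "indoor" "x"
nrow(mean_by_sp_ind)   # 6: three species times two indoor values
```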

2.2 The other version: the formula interface

aggregate(formula, data, FUN, subset)

formula describes the grouping in the form y ~ x or cbind(y1, y2) ~ x1 + x2, where y and cbind(y1, y2) are the numeric data to be split and x or x1 + x2 are the grouping variables.
data is the data frame with the variables used in the formula.
FUN is the function that will be applied to each subset.
subset is an optional vector specifying a subset of observations to be used.

Examples:

aggregate(Sepal.Length ~ Species, data = ir, mean)
One variable, one grouping factor; equivalent to:
aggregate(ir$Sepal.Length, by= list(ir$Species), FUN=mean)

aggregate(Sepal.Length ~ Species + indoor, data = ir, mean)
One variable, several grouping factors; equivalent to:
mean_by_sp_ind = aggregate(ir$Sepal.Length, by=list(ir$Species, ir$indoor), mean)

aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, ir, mean)
Several variables, one grouping factor; equivalent to:
aggregate(ir[,c("Sepal.Length", "Sepal.Width")], by=list(ir$Species), mean)

All variables, several grouping factors:
aggregate(. ~ Species + indoor, data = ir, mean)

Using subset to keep only the observations that satisfy a condition:
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = ir, subset = Petal.Width>0.6, mean)
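
To make the claimed equivalence concrete, the sketch below (again assuming ir is a copy of iris) compares the by form with the formula form and shows what subset does to the groups. The variable names are illustrative.

```r
# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

by_version      <- aggregate(ir$Sepal.Length, by = list(ir$Species), FUN = mean)
formula_version <- aggregate(Sepal.Length ~ Species, data = ir, FUN = mean)

# Same group means; only the column names differ (x vs. Sepal.Length).
isTRUE(all.equal(by_version$x, formula_version$Sepal.Length))  # TRUE

# No setosa flower has Petal.Width above 0.6, so that group disappears.
sub <- aggregate(Sepal.Length ~ Species, data = ir, FUN = mean,
                 subset = Petal.Width > 0.6)
as.character(sub$Species)  # "versicolor" "virginica"
```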

2.3 Custom functions

We can define a function once and use it many times.

Example:

meanX = function(vec, n){mean(head(vec[order(-vec)], n))}
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN= function(x) meanX(x, 5))

We can also write our own function inline as an anonymous function.

Example:

aggregate(ir$Sepal.Length, by = list(ir$Species), FUN= function(x) mean(head(x[order(-x)], 5)))

Setting a default value for an argument, or passing the argument through aggregate():

Examples:

aggregate(ir$Sepal.Length, by = list(ir$Species),
FUN= function(x, n=5) mean(head(x[order(-x)], n)))

aggregate(ir$Sepal.Length, n=5, by = list(ir$Species),
FUN= function(x, n) mean(head(x[order(-x)], n)))
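
The helper below is a small sketch of the same idea on a vector whose answer is easy to check by hand. The name top_n_mean is hypothetical, and it uses sort() instead of order(-x), which gives the same values for numeric input without missing values.

```r
# Hypothetical helper: mean of the n largest values of a numeric vector.
top_n_mean <- function(vec, n = 5) {
  mean(head(sort(vec, decreasing = TRUE), n))
}

x <- c(10, 2, 8, 4, 6)
top_n_mean(x, 2)  # (10 + 8) / 2 = 9
top_n_mean(x)     # n defaults to 5, so the mean of all values = 6

# Extra arguments after FUN are forwarded to it, here n = 3.
res <- aggregate(iris$Sepal.Length, by = list(Species = iris$Species),
                 FUN = top_n_mean, n = 3)
res
```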

3. Basic data analysis

3.2 Looking at samples to understand the data

View(dataset) will show the whole dataset in a new window.
tail(dataset,x) will show the last x instances of dataset in the console.
head(dataset,x) will show the first x instances of dataset in the console.
names(dataset) will return the names of the attributes in the dataset.
str(dataset) will return a summary of the type of each attribute and its first few values.
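
A short sketch of these calls on the built-in iris data set, used here only as a convenient stand-in for dataset. View() is left out because it needs an interactive session.

```r
first_rows <- head(iris, 3)  # first 3 rows
last_rows  <- tail(iris, 3)  # last 3 rows

names(iris)  # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)    # prints each column's type and its first few values
dim(iris)    # 150 5
```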

3.3 Measures of central tendency

Mean: mean(), colMeans(), rowMeans()
Median: median()
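
A small worked example, on made-up numbers, of how these functions behave. It also shows why the median is often preferred when a single extreme value is present.

```r
# A 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6).
m <- matrix(1:6, nrow = 2)
colMeans(m)  # 1.5 3.5 5.5
rowMeans(m)  # 3 4

# One outlier pulls the mean up but leaves the median untouched.
x <- c(1, 2, 3, 4, 100)
mean(x)    # 22
median(x)  # 3
```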

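3.4 Measures of dispersion

Standard deviation: sd()
Range: range()
Interquartile range: IQR(), where IQR = Q3 − Q1
A quantile is the value found at each cut point after all the data are sorted by size and split into equal parts. Splitting the data into two equal parts gives the median; splitting them into four equal parts gives the quartiles.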
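
The relation IQR = Q3 − Q1 can be checked directly. The vector below is made up so that the quartiles fall on whole numbers under R's default quantile method.

```r
x <- 1:9

q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q         # 25%: 3, 50%: 5, 75%: 7
IQR(x)    # 4, i.e. Q3 - Q1 = 7 - 3
range(x)  # 1 9 (minimum and maximum, not their difference)
sd(x)     # sqrt(7.5), about 2.74
```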

3.5 Correlation

Correlation coefficient: cor()
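
A brief sketch of cor() on vectors with a known relationship, followed by the iris measurements. By default cor() computes the Pearson coefficient.

```r
x <- 1:10
cor(x, 2 * x + 1)  # 1: perfect positive linear relationship
cor(x, -x)         # -1: perfect negative linear relationship

# Two vectors give a single coefficient (about 0.87 here).
cor(iris$Sepal.Length, iris$Petal.Length)

# A numeric data frame gives the full correlation matrix.
cm <- cor(iris[, 1:4])
dim(cm)  # 4 4
```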

4. Advanced data analysis with dplyr

The dplyr functions covered here are: filter(), arrange(), select(), mutate(), summarize(), sample_n(), sample_frac(), and group_by().

4.1 filter

Returns the rows that satisfy a condition:
filter(data frame, condition)

filter(iris, Species=="setosa") # using dplyr
is equivalent to
iris[iris$Species=="setosa",] # using basic R
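
filter() also accepts several conditions separated by commas, which are combined with a logical AND. The sketch below assumes the dplyr package is installed; the threshold 5.5 is arbitrary.

```r
library(dplyr)

# Conditions separated by commas must all hold.
a <- filter(iris, Species == "setosa", Sepal.Length > 5.5)

# Base R needs an explicit & and the trailing comma for "all columns".
b <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]

nrow(a) == nrow(b)                     # TRUE
all(a$Sepal.Length == b$Sepal.Length)  # TRUE
```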

4.2 arrange

arrange(data frame, attributes)
Sorts the data frame by the given attributes.

Ascending:
arrange(iris, Sepal.Length) #dplyr
iris[order(iris$Sepal.Length, decreasing = FALSE),] # basic R
iris[order(iris$Sepal.Length),] # basic R

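Descending:
arrange(iris, desc(Sepal.Length)) # dplyr
iris[order(iris$Sepal.Length, decreasing = TRUE),] # basic R
iris[order(-iris$Sepal.Length),] # basic R

Ascending on two attributes (ties in Sepal.Length are ordered by Sepal.Width):
arrange(iris, Sepal.Length,Sepal.Width)[1:5,]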

Ascending and descending combined:
arrange(iris, Sepal.Length, desc(Sepal.Width))[1:5,]
iris[order(iris$Sepal.Length, -iris$Sepal.Width),][1:5,]
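
The following sketch, assuming dplyr is installed, checks that the dplyr and base R versions of the mixed ordering agree on the sorted columns.

```r
library(dplyr)

by_dplyr <- arrange(iris, Sepal.Length, desc(Sepal.Width))
by_base  <- iris[order(iris$Sepal.Length, -iris$Sepal.Width), ]

# Both approaches produce the same ordering of the two sort keys.
identical(by_dplyr$Sepal.Length, by_base$Sepal.Length)  # TRUE
identical(by_dplyr$Sepal.Width,  by_base$Sepal.Width)   # TRUE
```

Negating a column inside order() only works for numeric columns, whereas desc() also handles factors and character columns.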

4.3 select

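select(data frame, var1, ..., varX)
Selects the listed columns, or drops some columns.

select(ir, Petal.Width, Species) # dplyr
ir[,c("Petal.Width","Species")] # basic R

select(ir, -Species) # dplyr
ir[, -c(5)] # basic R; assumes Species is the 5th column

select also supports helper functions that work like wildcards: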

select(ir,starts_with("Petal")) #Petal.Length and Petal.Width
select(ir, ends_with("Length")) #Sepal.Length and Petal.Length
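
A runnable sketch of the selection helpers, assuming dplyr is installed and that ir is a copy of iris (the original does not define ir). contains() is a further helper that is not used in the notes above.

```r
library(dplyr)

# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

names(select(ir, starts_with("Petal")))  # "Petal.Length" "Petal.Width"
names(select(ir, ends_with("Length")))   # "Sepal.Length" "Petal.Length"
names(select(ir, contains("Width")))     # "Sepal.Width"  "Petal.Width"

# Dropping a column by name does not depend on its position.
names(select(ir, -Species))
```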

4.4 mutate

mutate(data frame, expression)
Adds new columns to the data frame.

ir = mutate(ir, DoubleSepalL = Sepal.Length*2,
PetalRatio = Petal.Length/Petal.Width) # dplyr
ir$DoubleSepalL = ir$Sepal.Length*2 # basic R
ir$PetalRatio = ir$Petal.Length/ir$Petal.Width # basic R
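
The sketch below, assuming dplyr is installed and ir is a copy of iris, builds the two columns both ways on separate copies so the results can be compared.

```r
library(dplyr)

# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

# dplyr: both columns in one call; the input data frame is not modified.
a <- mutate(ir,
            DoubleSepalL = Sepal.Length * 2,
            PetalRatio   = Petal.Length / Petal.Width)

# Base R: one assignment per new column, modifying the copy in place.
b <- ir
b$DoubleSepalL <- b$Sepal.Length * 2
b$PetalRatio   <- b$Petal.Length / b$Petal.Width

isTRUE(all.equal(a$DoubleSepalL, b$DoubleSepalL))  # TRUE
isTRUE(all.equal(a$PetalRatio,   b$PetalRatio))    # TRUE
```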

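4.5 summarize, summarise_all

summarize(data frame, function(var1, ..., varX))
(summarize() and summarise() are synonyms.)

Several built-in functions can be called inside it: sd(), min(), max(), median(), sum(), cor() (correlation), n() (number of rows), first() (first value), last() (last value) and n_distinct() (number of distinct values in a vector).

summarise(ir, avg = mean(Sepal.Length), std= sd(Sepal.Length), total=n())
Here n() returns the number of rows.

Base R's summary() gives a quick overview of every column:
summary(ir)

summarise_all applies a function to every column:

Mean:
summarise_all(ir[,1:4],mean)
Third quartile (the 0.75 quantile):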
summarise_all(ir[,1:4],quantile, probs=0.75)
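
In dplyr 1.0.0 and later, summarise_all() is marked as superseded in favour of across(). The sketch below assumes such a dplyr version is installed and that ir is a copy of iris; it shows the across() form next to a plain summarise() call.

```r
library(dplyr)

# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

# Several summaries of one column; n() counts the rows.
s <- summarise(ir, avg = mean(Sepal.Length), std = sd(Sepal.Length), total = n())
s$total  # 150

# across() applies one function to a range of columns.
means <- summarise(ir, across(Sepal.Length:Petal.Width, mean))

# Same numbers as base R's colMeans().
isTRUE(all.equal(unlist(means), colMeans(ir[, 1:4])))  # TRUE
```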

4.6 sample_n

Shuffles the rows and takes a sample of a given number of rows.

sample_n(iris,5)
iris[sample(1:nrow(iris)),][1:5,]

4.7 sample_frac

Shuffles the rows and takes a sample of a given proportion.

sample_frac(iris,0.01) # takes a 1% sample
is roughly equivalent to:
iris[sample(1:nrow(iris)),][1:ceiling(nrow(iris)*0.01),]
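
Random samples differ between runs unless the random number generator is seeded. The base R sketch below fixes a seed (42 is an arbitrary choice) and draws row indexes directly instead of shuffling the whole data frame first.

```r
set.seed(42)  # arbitrary seed, only to make the draw repeatable

# Fixed number of rows: draw 5 distinct row indexes.
idx <- sample(nrow(iris), 5)
s_n <- iris[idx, ]

# Fixed proportion: 10% of the rows, rounded up.
frac_n <- ceiling(nrow(iris) * 0.1)
s_frac <- iris[sample(nrow(iris), frac_n), ]

nrow(s_n)     # 5
nrow(s_frac)  # 15
```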

4.8 group_by

group_by(data frame, variable)
Usually combined with other functions, such as summarize:

summarize(group_by(ir, Species), sd(Petal.Width))
is equivalent to:
aggregate(ir$Petal.Width, by=list(ir$Species), FUN=sd) #Base R

Computing the correlation coefficient within each group:
summarize(group_by(ir, Species), r=cor(Sepal.Length, Sepal.Width))
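
A sketch that checks the stated equivalence numerically, assuming dplyr is installed and ir is a copy of iris. The column name sd_pw is illustrative.

```r
library(dplyr)

# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

by_dplyr <- summarize(group_by(ir, Species), sd_pw = sd(Petal.Width))
by_base  <- aggregate(ir$Petal.Width, by = list(Species = ir$Species), FUN = sd)

# One row per species in both results, with matching values.
nrow(by_dplyr)                                   # 3
isTRUE(all.equal(by_dplyr$sd_pw, by_base$x))     # TRUE
```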

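5. Piping with dplyr

The pipe operator %>% passes the output of one function as the input of the next.

group_by(ir, Species) %>% summarise(avg= mean(Petal.Length))
is equivalent to:
summarise(group_by(ir,Species), avg=mean(Petal.Length))

Piping lets several functions be combined directly, which is more convenient. The same count computed in three ways:

#A. Using basic R
non_virg = ir[ir$Species!="virginica", c("Petal.Length")]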
sum(non_virg>3.5)
## [1] 45

#B. Using dplyr with no piping
summarise(filter(ir,Species!="virginica",Petal.Length>3.5), n())
## n()
## 1 45

#C. Using dplyr with piping
ir %>% filter(Species!="virginica", Petal.Length>3.5) %>% nrow()
## [1] 45

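Note: a pipe only connects function calls, so basic R bracket syntax such as ir[rows, cols] cannot be chained in the same way. Base functions such as head() and nrow() work fine in a pipeline.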

ir %>%
mutate(petal_w_l = Petal.Width/Petal.Length) %>%
arrange(desc(petal_w_l)) %>%
head(3) %>% select(Species, petal_w_l)
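
To show what the pipeline saves, the sketch below writes the same computation once as nested calls and once with %>%, then checks that the results match. It assumes dplyr is installed and ir is a copy of iris.

```r
library(dplyr)

# Assumption: `ir` is a copy of the built-in iris data set.
ir <- iris

# Nested calls must be read from the inside out.
nested <- select(
  head(arrange(mutate(ir, petal_w_l = Petal.Width / Petal.Length),
               desc(petal_w_l)), 3),
  Species, petal_w_l)

# The piped version reads top to bottom in execution order.
piped <- ir %>%
  mutate(petal_w_l = Petal.Width / Petal.Length) %>%
  arrange(desc(petal_w_l)) %>%
  head(3) %>%
  select(Species, petal_w_l)

isTRUE(all.equal(nested, piped))  # TRUE
```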

6. Exercises

1. Combining functions, piping and which

Note: in which.max(by_spc$Sepal.Length_mean), the underscore '_' is simply part of the column name in by_spc; it has no special meaning.

summ = c(min = min, max = max, mean = mean, median = median, q1 = function(x) quantile(x, 0.25), q3 = function(x) quantile(x, 0.75))
by_spc=group_by(ir, Species) %>% summarise_all(summ)

by_spc
# A tibble: 3 x 25
  Species Sepal.Length_min Sepal.Width_min Petal.Length_min
  <fct>              <dbl>           <dbl>            <dbl>
1 setosa               4.3             2.3              1
2 versic~              4.9             2                3
3 virgin~              4.9             2.2             4.5
......

a. Which plants have the highest mean sepal length?
by_spc[which.max(by_spc$Sepal.Length_mean),]$Species

b. Which plants have the sample with the smallest petal width?
by_spc[which.min(by_spc$Petal.Width_min),]$Species

2. Getting the data type of each attribute of a data frame

Use sapply and class:

sapply(choco, class)

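3. Getting the mode of a data frame (the most frequent category)

Obtain the mode from all of the nominal attributes in the dataset.

Note the use of factors

A factor represents the categories in a set of data; it records the category names (levels) and how many categories there are.
Factors make it possible to group and classify the objects under study and compute per category.
Reference: https://www.zhihu.com/question/48472404

Example:

choco_nominal = choco[sapply(choco, is.factor)]
sapply(choco_nominal, function(x) names(which.max(table(x))))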
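
The choco data set is not available here, so the sketch below runs the same steps on a small made-up data frame df whose modes can be read off by eye. All column names and values are invented for illustration.

```r
# Made-up stand-in for choco: two factor columns and one numeric column.
df <- data.frame(
  Company = factor(c("A", "B", "A", "C", "A")),
  Origin  = factor(c("Peru", "Peru", "Ghana", "Peru", "Ghana")),
  Rating  = c(3.5, 3, 4, 2.5, 3.75)
)

# Keep only the nominal (factor) columns.
nominal <- df[sapply(df, is.factor)]

# table() counts each level; which.max() picks the most frequent one.
modes <- sapply(nominal, function(x) names(which.max(table(x))))
modes  # Company: "A", Origin: "Peru"
```

When two levels are tied, which.max() returns the first of them, so only one mode is reported.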

4. Number of categories

How many distinct companies have been considered?

length(unique(choco$Company))

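5. Converting a factor to a numeric type

Remove the characters that R cannot interpret as part of a number with gsub() (for example, turn 2,345 into 2345 and 30% into 30).
Then convert the result to a number with as.numeric().

For example: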

Circulation2004= as.numeric(gsub(pattern = ",", replacement="", x=books$Daily.Circulation..2004))
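
The cleaning step matters because calling as.numeric() directly on a factor returns the internal level codes, not the numbers that the labels spell out. The sketch below uses made-up values to show the difference.

```r
# Made-up circulation figures stored as a factor with thousands separators.
circ <- factor(c("2,345", "12,000", "987"))

# Pitfall: this returns the level codes, not the figures.
as.numeric(circ)  # 2 1 3

# gsub() works on the labels, so the conversion afterwards is correct.
as.numeric(gsub(pattern = ",", replacement = "", x = circ))  # 2345 12000 987

# The same idea for percentages; fixed = TRUE treats "%" as plain text.
pct <- c("30%", "45%")
as.numeric(sub("%", "", pct, fixed = TRUE))  # 30 45
```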

Original source: https://www.cnblogs.com/Stephanie-boke/p/12541868.html

Time: 2024-10-29 04:00:25
