R进行数据分析
1. 排序
- sort(x, decreasing = ): 返回排序好的数据
- order(x, decreasing = ): 返回排序好的数据的索引
例子:
v = c(2, 9, 1, 45, -3, 19, -5, 6)
sort(v) # returns ordered v in decreasing order
结果:
# [1] -5 -3 1 2 6 9 19 45
sort(v, decreasing = FALSE) # orders v in increasing order
结果:
# [1] -5 -3 1 2 6 9 19 45
order(v) # returns order of the indexes in v
结果:
# [1] 7 5 3 1 8 2 6 4
order()可以和更加复杂的数据结构配合使用,从而适应更加复杂的场景
如:
#Imagine that you just want to access Sepal Length and Species.
# You can access those values in different ways:
ir[order(ir$Sepal.Length, decreasing = TRUE),c("Sepal.Length", "Species")][1:5,]
ir[order(ir$Sepal.Length, decreasing = TRUE),][1:5, c("Sepal.Length", "Species")]
2. aggregate()
2.1 分组处理数据
aggregate(X, by, FUN, . . . ,simplify = TRUE)
- X is an R object (commonly a data frame)
- by is a list of the elements by which you will be grouping your data.
- FUN is the function that will be applied to each subset.
- simplify is a logical value that indicates if results should be simplified into a vector or a matrix.
举例:
针对单一对象,一种分组
aggregate(ir$Sepal.Length, by= list(ir$Species), FUN=mean)
aggregate(ir$Sepal.Width, by= list(ir$Species), summary)
针对多个对象,一种分组
aggregate(ir[,c("Sepal.Length", "Sepal.Width")], by=list(ir$Species), mean)
针对单一对象,多种分组
mean_by_sp_ind = aggregate(ir$Sepal.Length, by=list(ir$Species, ir$indoor), mean)
2.2 另一版本使用
aggregate(formula, data, FUN, subset)
- formula details the in the manner y ~ x or cbind(y1, y2) ~ x1+x2,where y and cbind(y1,y2) are the numeric data to be split and x or x1 + x2 are the grouping variables.
- data is the data frame with the variable used in the formula.
- FUN is the function that will be applied to each subset.
- subset is an optional vector specifying a subset of observations to be used
举例:
aggregate(Sepal.Length ~ Species, data = ir, mean)
针对单一对象,一种分组,相当于:
aggregate(ir$Sepal.Length, by= list(ir$Species), FUN=mean)
aggregate(Sepal.Length ~ Species + indoor, data = ir, mean)
针对单一对象,多种分组,相当于:
mean_by_sp_ind = aggregate(ir$Sepal.Length, by=list(ir$Species, ir$indoor), mean)
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, ir, mean)
针对多个对象,一种分组,相当于:
aggregate(ir[,c("Sepal.Length", "Sepal.Width")], by=list(ir$Species), mean)
针对所有对象,多个分组
aggregate(. ~ Species + indoor, data = ir, mean)
使用subset筛选满足条件的子集
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = ir, subset = Petal.Width>0.6, mean)
2.3 自定义函数
- 我们可以定义一次函数,多次使用
举例:
meanX = function(vec, n){mean(head(vec[order(-vec)], n))}
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN= function(x) meanX(x, 5))
- 也可以定义自己的函数
举例:
aggregate(ir$Sepal.Length, by = list(ir$Species), FUN= function(x) mean(head(x[order(-x)], 5)))
- 设置默认值
举例:
aggregate(ir$Sepal.Length, by = list(ir$Species),
FUN= function(x, n=5) mean(head(x[order(-x)], n)))
aggregate(ir$Sepal.Length, n=5, by = list(ir$Species),
FUN= function(x, n) mean(head(x[order(-x)], n)))
3. 基本的数据分析
3.2 选取样本了解数据
- View(dataset) will show the whole dataset in a new window.
- tail(dataset,x) will show the last x intances of dataset in the console.
- head(dataset,x) will show the first x instances of dataset in the console.
- names(dataset) will return the name of the attributes in the dataset.
- str(dataset) will return a summary of of the type of each attribute and the first few values.
3.3 中心性检测
- 均值:mean()、 colMeans()、 rowMeans()
- 中位数:median()
3.4 离散分析
- 标准差: sd()
- 范围: range()
- 四分位数:IQR = Q3 − Q1
分位数是将总体的全部数据按大小顺序排列后,处于各等分位置的变量值。如果将全部数据分成相等的两部分,它就是中位数;如果分成四等分,就是四分位数。
3.5 相关性
- 相关性系数: cor()
4. 高级数据分析 dplyr
These are: filter(), arrange(), select(), mutate(), summarize(), sample_n(), sample_frac(), and group_by().
4.1 filter
得到满足条件的行
filter(data frame, condition).
filter(iris, Species=="setosa") # using dplyr
等效于
iris[iris$Species=="setosa",] # using basic R
4.2 arrange
arrange(data frame, attributes)
将数据帧按照属性排序
升序:
arrange(iris, Sepal.Length) #dplyr
iris[order(iris$Sepal.Length, decreasing = FALSE),] # basic R
iris[order(iris$Sepal.Length),] # basic R
降序:
arrange(iris, desc(Sepal.Length))
iris[order(iris$Sepal.Length, decreasing = TRUE),]
iris[order(-iris$Sepal.Length),]
arrange(iris, Sepal.Length,Sepal.Width)[1:5,]
升序与降序结合:
arrange(iris, Sepal.Length, desc(Sepal.Width))[1:5,]
iris[order(iris$Sepal.Length, -iris$Sepal.Width),][1:5,]
4.3 select
select(data frame, var1,. . . ,varX).
选择满足条件的列,或者去掉某几列
select(ir,Petal.Width, Species)
ir[,c("Petal.Length","Species")]
select(ir, -Species)
ir[, -c(5)]
select可以使用类似通配符的相关功能
select(ir,starts_with("Petal")) #Petal.Length and Petal.Width
select(ir, ends_with("Length")) #Sepal.Length and Petal.Length
4.4 mutate
mutate(data frame, expression)
添加新的列到数据帧
ir = mutate(ir, DoubleSepalL = Sepal.Length*2,
PetalRatio = Petal.Length/Petal.Width)
ir$DoubleSepalL = ir$Sepal.Length*2
ir$PetalRatio = ir$Petal.Length/ir$Petal.Width
4.5 summarize、 summarise_all
summarize(data frame, function(var1,. . . ,varX))
可以调用多个函数内置函数: sd(), min(), max(), median(), sum(), cor() (correlation), n() (length of vector)_, first() (first value), last() (last value) and n_distinct() (number of distinct values in vector).
summarise(ir, avg = mean(Sepal.Length), std= sd(Sepal.Length), total=n())
其中n()求行数
summary(ir)
summarise_all可以处理多个对象
均值
summarise_all(ir[,1:4],mean)
四分位点
summarise_all(ir[,1:4],quantile, probs=0.75)
4.6 Sample_n
乱序、取某几行的样本
sample_n(iris,5)
iris[sample(1:nrow(iris)),][1:5,]
4.7 Sample_frac
乱序、按比例取样本
sample_frac(iris,0.01) # 取1%的样本
等效于:
iris[sample(1:nrow(iris)),][1:ceiling(nrow(iris)*0.01),]
4.8 group_by
group_by(data frame, variable)
一般与其余函数结合使用,如: summarize
summarize(group_by(ir, Species), sd(Petal.Width))
等效于:
aggregate(ir$Petal.Width, by=list(ir$Species), FUN=sd) #Base R
求相关系数
summarize(group_by(ir, Species), r=cor(Sepal.Length, Sepal.Width))
5. piping with dplyr
将一个函数的输出作为另一个函数的输入
group_by(ir, Species) %>% summarise(avg= mean(Petal.Length))
等效于:
summarise(group_by(ir,Species), r=mean(Petal.Length))
piping将多个函数可以直接结合,使用更加有效
non_virg = ir[ir$Species!="virginica", c("Petal.Length")]
sum(non_virg>3.5)
## [1] 45
#B. Using dplyr with no piping
summarise(filter(ir,Species!="virginica",Petal.Length>3.5), n())
## n()
## 1 45
#C. Using dplyr with piping
ir %>% filter(Species!="virginica", Petal.Length>3.5) %>% nrow()
## [1] 45
注意piping只能将不同的函数连接起来,不适用basing R
ir %>%
mutate(petal_w_l = Petal.Width/Petal.Length) %>%
arrange(desc(petal_w_l)) %>%
head(3) %>% select(Species, petal_w_l)
6. 练习
1. 函数、piping、which的结合使用
注意:which.max(by_spc$Sepal.Length_mean) 中的下划线‘_‘是by_spc中列的名字,并没有特殊的含义
summ = c(min = min,max = max,mean = mean,median = median, q2={function(x) quantile(x, 0.25)},q3={function(x) quantile(x, 0.75)})
by_spc=group_by(ir, Species) %>% summarise_all(summ)
by_spc
# A tibble: 3 x 25
Species Sepal.Length_min Sepal.Width_min Petal.Length_min
<fct> <dbl> <dbl> <dbl>
1 setosa 4.3 2.3 1
2 versic~ 4.9 2 3
3 virgin~ 4.9 2.2 4.5
......
a. Which plants have a higher mean sepal length?
by_spc[which.max(by_spc$Sepal.Length_mean),]$Species
b. Which plants have the sample with the smaller petal width?
by_spc[which.min(by_spc$Petal.Width_min),]$Species
2. 获取数据帧的每一个属性的数据类型
使用sapply、class
sapply(choco, class)
3. 获取帧的mode(数据类别中的最大值)
Obtain the mode from all of the nominal attributes in the dataset.
注意 factor 使用
- 因子就是用于表示一组数据中的类别,可以记录这组数据中的类别名称及类别数目。
- 实现研究对象的分组、分类计算
参考链接:https://www.zhihu.com/question/48472404
举例:
choco_nominal = choco[sapply(choco,{function (x) is.factor(x)})==TRUE]
sapply(choco_nominal, {function(x) names(which.max(table(x)))})
4. 多少类别
How many distinct companies have been considered? length(unique(choco$Company))
length(unique(choco$Company))
5. 将factor装换成数值类型
- 去除一些R不能理解的字符(如2,345 转换成 2345,30% 转换成 30),gsub()
- 将factor转换成number,as.numeric()
如:
Circulation2004= as.numeric(gsub(pattern = ",", replacement="", x=books$Daily.Circulation..2004))
原文地址:https://www.cnblogs.com/Stephanie-boke/p/12541868.html