The aggregate Function in R

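Introduction

This function is quite powerful: it first splits the data into groups (by row), then applies a summary function to each group, and finally combines the results into a tidy table. It has three forms, depending on the type of object it is given: data frame (data.frame), formula, and time series (ts):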

    aggregate(x, by, FUN, ..., simplify = TRUE)
    aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
    aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)

Syntax

    aggregate(x, ...)

    ## S3 method for class 'default':
    aggregate(x, ...)

    ## S3 method for class 'data.frame':
    aggregate(x, by, FUN, ..., simplify = TRUE)

    ## S3 method for class 'formula':
    aggregate(formula, data, FUN, ...,
              subset, na.action = na.omit)

    ## S3 method for class 'ts':
    aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
              ts.eps = getOption("ts.eps"), ...)

    ## See ?aggregate for details

Example1

We can get a basic feel for this function by working with the mtcars dataset. mtcars is a data frame of road-test results for different car models:

    > str(mtcars)
    'data.frame': 32 obs. of 11 variables:
    $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
    $ disp: num 160 160 108 258 360 ...
    $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
    $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
    $ qsec: num 16.5 17 18.6 19.4 17 ...
    $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
    $ am : num 1 1 1 0 0 0 0 0 0 0 ...
    $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
    $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

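First use attach to add the column names of mtcars to the search path, then use aggregate to compute the mean of every column within each group defined by cyl (number of cylinders):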

    > attach(mtcars)
    > aggregate(mtcars, by=list(cyl), FUN=mean)
      Group.1      mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
    1       4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
    2       6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
    3       8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
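
The same grouping can be written without attach(). The following is a minimal sketch, not part of the original example: the result name `res` and the choice of the three columns mpg, hp, and wt are assumptions made for illustration. Naming the element of the by list gives the grouping column a readable name instead of Group.1, and selecting columns first avoids averaging cyl itself.

```r
# Group by cylinder count without attach(); the named list element
# "cyl" becomes the name of the grouping column in the result.
res <- aggregate(mtcars[c("mpg", "hp", "wt")],
                 by = list(cyl = mtcars$cyl),
                 FUN = mean)
print(res)
```

Referring to mtcars$cyl explicitly also keeps the search path unchanged, so later code cannot pick up mtcars columns by accident.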

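The by argument can also contain several grouping factors, in which case the result holds the statistic for every combination of factor levels that occurs in the data: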

    > aggregate(mtcars, by=list(cyl, gear), FUN=mean)

      Group.1 Group.2    mpg cyl     disp       hp     drat       wt    qsec  vs   am gear     carb
    1       4       3 21.500   4 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00    3 1.000000
    2       6       3 19.750   6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00    3 1.000000
    3       8       3 15.050   8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00    3 3.083333
    4       4       4 26.925   4 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75    4 1.500000
    5       6       4 19.750   6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50    4 4.000000
    6       4       5 28.200   4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00    5 2.000000
    7       6       5 19.700   6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00    5 6.000000
    8       8       5 15.400   8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00    5 6.000000

A formula is a special kind of R object. Passing a formula to aggregate lets you summarise only selected variables of a data frame:

    > aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)
      cyl gear    mpg       hp
    1   4    3 21.500  97.0000
    2   6    3 19.750 107.5000
    3   8    3 15.050 194.1667
    4   4    4 26.925  76.0000
    5   6    4 19.750 116.5000
    6   4    5 28.200 102.0000
    7   6    5 19.700 175.0000
    8   8    5 15.400 299.5000

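The formula cbind(mpg,hp) ~ cyl+gear above means: summarise the columns bound together by cbind(mpg,hp) for each combination of the factors cyl and gear. For the use of aggregate on time series data, see the R help page for the function.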
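
For the time series form, here is a minimal sketch on a made-up series (the values 1 to 24 and the names `monthly`, `quarterly`, and `yearly` are assumptions for illustration, not data from the article). The nfrequency argument sets the number of observations per unit of time in the result, and it must divide the original frequency.

```r
# A hypothetical monthly series: 24 observations starting January 2020.
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

# Collapse every 3 months into one quarterly total.
quarterly <- aggregate(monthly, nfrequency = 4, FUN = sum)

# Collapse every 12 months into one yearly mean.
yearly <- aggregate(monthly, nfrequency = 1, FUN = mean)

print(quarterly)
print(yearly)
```

The result is again a ts object, so it keeps its time index and can be passed on to other time series functions.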

Example2

    ## Compute the averages for the variables in 'state.x77', grouped
    ## according to the region (Northeast, South, North Central, West) that
    ## each state belongs to.
    aggregate(state.x77, list(Region = state.region), mean)

    ## Compute the averages according to region and the occurrence of more
    ## than 130 days of frost.
    aggregate(state.x77,
              list(Region = state.region,
                   Cold = state.x77[,"Frost"] > 130),
              mean)
    ## (Note that no state in 'South' is THAT cold.)

    ## example with character variables and NAs
    testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
                         v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
    by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
    by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
    aggregate(x = testDF, by = list(by1, by2), FUN = "mean")

    # and if you want to treat NAs as a group
    fby1 <- factor(by1, exclude = "")
    fby2 <- factor(by2, exclude = "")
    aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

    ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
    aggregate(weight ~ feed, data = chickwts, mean)
    aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
    aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
    aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

    ## Dot notation:
    aggregate(. ~ Species, data = iris, mean)
    aggregate(len ~ ., data = ToothGrowth, mean)

    ## Often followed by xtabs():
    ag <- aggregate(len ~ ., data = ToothGrowth, mean)
    xtabs(len ~ ., data = ag)

    ## Compute the average annual approval ratings for American presidents.
    aggregate(presidents, nfrequency = 1, FUN = mean)
    ## Give the summer less weight.
    aggregate(presidents, nfrequency = 1,
              FUN = weighted.mean, w = c(1, 1, 0.5, 1))
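
The handling of NAs in the grouping variables shown in the example above is easier to see on a tiny data frame. This is a sketch on made-up values (the data frame `df` and the names `dropped` and `kept` are assumptions for illustration). By default, rows whose grouping value is NA are omitted; turning NA into a factor level keeps them as a group of their own. It uses factor(..., exclude = NULL), which has the same effect here as the exclude = "" used above.

```r
# Hypothetical data: one row has a missing group label.
df <- data.frame(value = c(1, 2, 3, 4),
                 group = c("a", "a", NA, "b"))

# Default behaviour: the row with the NA group is left out.
dropped <- aggregate(df["value"], by = list(group = df$group), FUN = sum)

# Making NA a factor level keeps that row as a separate group.
kept <- aggregate(df["value"],
                  by = list(group = factor(df$group, exclude = NULL)),
                  FUN = sum)

print(dropped)
print(kept)
```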

Example3

    ------------------------------------------------------
    #load data
    data <- ChickWeight
    head(data)
      weight Time Chick Diet
    1     42    0     1    1
    2     51    2     1    1
    3     59    4     1    1
    4     64    6     1    1
    5     76    8     1    1
    6     93   10     1    1

    #dimension of the data
    dim(data)
    [1] 578   4

    #how many chickens
    unique(data$Chick)
     [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
    50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

    #how many diets
    unique(data$Diet)
    [1] 1 2 3 4
    Levels: 1 2 3 4

    #how many time points
    unique(data$Time)
     [1]  0  2  4  6  8 10 12 14 16 18 20 21

    library(ggplot2)
    ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
           geom_line() +
           geom_point()

    ------------------------------------------------------

    ## S3 method for class 'data.frame'
    ## aggregate(x, by, FUN, ..., simplify = TRUE)

    #find the mean weight depending on diet
    aggregate(data$weight, list(diet = data$Diet), mean)
      diet        x
    1    1 102.6455
    2    2 122.6167
    3    3 142.9500
    4    4 135.2627

    #aggregate on time
    aggregate(data$weight, list(time=data$Time), mean)
       time         x
    1     0  41.06000
    2     2  49.22000
    3     4  59.95918
    4     6  74.30612
    5     8  91.24490
    6    10 107.83673
    7    12 129.24490
    8    14 143.81250
    9    16 168.08511
    10   18 190.19149
    11   20 209.71739
    12   21 218.68889

    #use a different function
    aggregate(data$weight, list(time=data$Time), sd)
       time         x
    1     0  1.132272
    2     2  3.688316
    3     4  4.495179
    4     6  9.012038
    5     8 16.239780
    6    10 23.987277
    7    12 34.119600
    8    14 38.300412
    9    16 46.904079
    10   18 57.394757
    11   20 66.511708
    12   21 71.510273

    #we could also aggregate on time and diet
    head(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                  )
        )
      time diet        x
    1    0    1 41.40000
    2    2    1 47.25000
    3    4    1 56.47368
    4    6    1 66.78947
    5    8    1 79.68421
    6   10    1 93.05263
    tail(aggregate(data$weight,
                   list(time = data$Time, diet = data$Diet),
                   mean
                  )
        )
       time diet        x
    43   12    4 151.4000
    44   14    4 161.8000
    45   16    4 182.0000
    46   18    4 202.9000
    47   20    4 233.8889
    48   21    4 238.5556

    #to see the weights over time across different diets
    ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
                 facet_wrap(~Diet) +
                 guides(col=guide_legend(ncol=3))

    ------------------------------------------------------
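
The mean and standard deviation were computed above in separate calls. As a sketch of an alternative (the names `stats` and `flat` are assumptions for illustration), FUN can return a named vector so that several statistics come back from one call. aggregate then stores them as a matrix inside a single column, which do.call(data.frame, ...) flattens into ordinary columns.

```r
# Several statistics per diet in one call; FUN returns a named vector.
stats <- aggregate(weight ~ Diet, data = ChickWeight,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))

# stats$weight is a matrix column; flatten it into separate columns
# named weight.mean, weight.sd and weight.n.
flat <- do.call(data.frame, stats)
print(flat)
```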

Example4

The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.

    # Get a count of number of subjects in each category (sex*condition)
    cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
    cdata
    #>   sex condition subject
    #> 1   F   aspirin       5
    #> 2   M   aspirin       9
    #> 3   F   placebo      12
    #> 4   M   placebo       4

    # Rename "subject" column to "N"
    names(cdata)[names(cdata)=="subject"] <- "N"
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 2   M   aspirin  9
    #> 3   F   placebo 12
    #> 4   M   placebo  4

    # Sort by sex first
    cdata <- cdata[order(cdata$sex),]
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 3   F   placebo 12
    #> 2   M   aspirin  9
    #> 4   M   placebo  4

    # We also keep the before and after columns:
    # Get the average effect size by sex and condition
    cdata.means <- aggregate(data[c("before","after","change")],
                             by = data[c("sex","condition")], FUN=mean)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000

    # Merge the data frames
    cdata <- merge(cdata, cdata.means)
    cdata
    #>   sex condition  N   before     after    change
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000

    # Get the sample (n-1) standard deviation for "change"
    cdata.sd <- aggregate(data["change"],
                          by = data[c("sex","condition")], FUN=sd)
    # Rename the column to change.sd
    names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
    cdata.sd
    #>   sex condition change.sd
    #> 1   F   aspirin 0.8642916
    #> 2   M   aspirin 1.1307569
    #> 3   F   placebo 0.5247655
    #> 4   M   placebo 0.7804913

    # Merge
    cdata <- merge(cdata, cdata.sd)
    cdata
    #>   sex condition  N   before     after    change change.sd
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913

    # Calculate standard error of the mean
    cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
    cdata
    #>   sex condition  N   before     after    change change.sd change.se
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456

If your data contain NAs and you wish to skip them, pass na.rm=TRUE, which aggregate forwards to FUN:

    cdata.means <- aggregate(data[c("before","after","change")],
                             by = data[c("sex","condition")],
                             FUN=mean, na.rm=TRUE)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000
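
The formula form treats NAs differently from the data frame form, which the following sketch on the built-in airquality data illustrates (the names `by_formula` and `by_frame` are assumptions for illustration). With its default na.action = na.omit, the formula form drops every row that has an NA in any variable of the formula before grouping. The data frame form keeps all rows and leaves NA handling to FUN.

```r
# Formula form: rows with a missing Ozone value are dropped entirely,
# so Temp is averaged over the remaining rows only.
by_formula <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)

# Data frame form: na.rm = TRUE is applied column by column inside mean(),
# so Temp is averaged over all rows of each month.
by_frame <- aggregate(airquality[c("Ozone", "Temp")],
                      by = list(Month = airquality$Month),
                      FUN = mean, na.rm = TRUE)

print(by_formula)
print(by_frame)
```

The Ozone means agree between the two results, while the Temp means differ, because Temp has no missing values but loses rows in the formula form.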

Original article: https://www.cnblogs.com/HISAK/p/11770568.html

Time: 2024-10-24 12:26:54
