Sampling Distributions and Central Limit Theorem in R (repost)

The Central Limit Theorem (CLT) and the concept of the sampling distribution are critical for understanding why statistical inference works. There are at least a handful of problems that require you to invoke the Central Limit Theorem on every ASQ Certified Six Sigma Black Belt (CSSBB) exam. The CLT says that if you take many repeated samples of size n from a population, and calculate the average or sum of each one, the collection of those averages (or sums) will be approximately normally distributed once n is large enough… and it doesn’t matter what the shape of the source distribution is, as long as it has a finite variance! (The Cauchy distribution, one of the options in the code below, does not have one, so it makes a useful counterexample.)
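The finite-variance condition is easy to see in a quick simulation. The sketch below is my own illustration, not part of the original code: the helper `iqr.of.means` and the seed are assumptions chosen for the example. It compares the spread (interquartile range) of 10,000 sample means drawn from an exponential population with the same quantity for a Cauchy population, at two sample sizes.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
r <- 10000     # number of repeated samples

# Interquartile range of r sample means, each computed from a sample of size n.
# `draw` is a function that returns k random values from the population.
iqr.of.means <- function(n, draw) {
  IQR(rowMeans(matrix(draw(n * r), nrow = r)))
}

spread <- rbind(
  exponential = sapply(c(10, 100), iqr.of.means,
                       draw = function(k) rexp(k, rate = 1)),
  cauchy      = sapply(c(10, 100), iqr.of.means,
                       draw = function(k) rcauchy(k))
)
colnames(spread) <- c("n=10", "n=100")
print(round(spread, 2))
```

The exponential row should shrink as n grows, while the Cauchy row should stay roughly the same, because the mean of n standard Cauchy draws is itself standard Cauchy.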

I wrote some R code to help illustrate this principle for my students. This code allows you to choose a sample size (n), a source distribution, and parameters for that source distribution, and generates a plot of one sample alongside the sampling distributions of the sum, mean, and variance. (Note: when the population is normal, the sampling distribution of the variance is a scaled Chi-square distribution: (n-1)s²/σ² follows a Chi-square distribution with n-1 degrees of freedom.)

sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
   r <- 10000  # Number of replications/samples - DO NOT ADJUST
   # This produces a matrix of observations with
   # n columns and r rows. Each row is one sample:
   my.samples <- switch(src.dist,
      "E" = matrix(rexp(n*r,param1),r),
      "N" = matrix(rnorm(n*r,param1,param2),r),
      "U" = matrix(runif(n*r,param1,param2),r),
      "P" = matrix(rpois(n*r,param1),r),
      "C" = matrix(rcauchy(n*r,param1,param2),r),
      "B" = matrix(rbinom(n*r,param1,param2),r),
      "G" = matrix(rgamma(n*r,param1,param2),r),
      "X" = matrix(rchisq(n*r,param1),r),
      "T" = matrix(rt(n*r,param1),r),
      # Fail early on an unknown code instead of returning NULL silently:
      stop("src.dist must be one of E, N, U, P, C, B, G, X, T"))
   all.sample.sums <- apply(my.samples,1,sum)
   all.sample.means <- apply(my.samples,1,mean)
   all.sample.vars <- apply(my.samples,1,var)
   par(mfrow=c(2,2))
   hist(my.samples[1,],col="gray",main="Distribution of One Sample")
   hist(all.sample.sums,col="gray",main="Sampling Distribution\nof the Sum")
   hist(all.sample.means,col="gray",main="Sampling Distribution\nof the Mean")
   hist(all.sample.vars,col="gray",main="Sampling Distribution\nof the Variance")
}
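The note about the variance can be checked numerically when the population is normal. This is a minimal sketch under my own assumptions (n = 10, a normal population with mean 20 and standard deviation 3, an arbitrary seed); it is separate from `sdm.sim` and only compares the first two moments of the scaled sample variances with those of a Chi-square distribution with n-1 degrees of freedom.

```r
set.seed(7)    # arbitrary seed, only for reproducibility
n <- 10        # sample size
r <- 10000     # number of repeated samples
sigma <- 3     # population standard deviation (assumed for this example)

# Each row is one sample of n normal observations
samples <- matrix(rnorm(n * r, mean = 20, sd = sigma), nrow = r)

# Scale each sample variance: (n-1) * s^2 / sigma^2
scaled.vars <- (n - 1) * apply(samples, 1, var) / sigma^2

# A Chi-square distribution with k degrees of freedom has mean k, variance 2k
chisq.check <- c(mean = mean(scaled.vars), expected.mean = n - 1,
                 var  = var(scaled.vars),  expected.var  = 2 * (n - 1))
print(round(chisq.check, 2))
```

For non-normal populations the scaled variances generally do not follow a Chi-square distribution, which you can see by swapping `rnorm` for `rexp` in the sketch.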

There are 9 population distributions to choose from: exponential (E), normal (N), uniform (U), Poisson (P), Cauchy (C), binomial (B), gamma (G), Chi-square (X), and Student’s t (T). Note also that you have to provide either one or two parameters, depending on which distribution you select. For example, a normal distribution requires that you specify the mean and standard deviation to describe where it’s centered and how fat or thin it is (that’s two parameters). A Chi-square distribution requires only the degrees of freedom (that’s one parameter). You can find out exactly which distributions require which parameters here: http://en.wikibooks.org/wiki/R_Programming/Probability_Distributions.
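You can also ask R itself which parameters each generator expects. The sketch below is an illustration I added: the `generators` list and `param.names` are hypothetical names, keyed by the same one-letter codes the function uses. It drops the first argument of each generator (the number of draws) and lists the rest.

```r
# Map each one-letter code to the base R generator it stands for
generators <- list(E = rexp, N = rnorm, U = runif, P = rpois, C = rcauchy,
                   B = rbinom, G = rgamma, X = rchisq, T = rt)

# Drop the first argument (n, the number of draws) and keep the other names
param.names <- lapply(generators, function(f) names(formals(f))[-1])
str(param.names)
```

Because param1 and param2 are passed by position, they fill the first and second names in each list; any later arguments (such as `ncp` for the Chi-square and t generators) are left untouched.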

Here is an example that draws from an exponential distribution with a mean of 1/1 = 1 (param1 is the rate, so the mean is 1/param1; you specify the number you want in the denominator of the mean):

sdm.sim(50,src.dist="E",param1=1)

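The code above produces a 2-by-2 grid of four histograms: one sample, followed by the sampling distributions of the sum, the mean, and the variance.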

You aren’t allowed to change the number of replications in this simulation because of the nature of the sampling distribution: it’s a theoretical model that describes the distribution of a statistic over an infinite number of samples. A simulation can only approximate it with a finite number of samples, so if you vary the number of replications, you’ll see the mean of the simulated sampling distribution bounce around, settling on the mean of the population as the replications increase. This is just an artifact of the simulation process: it’s not a characteristic of the sampling distribution itself. Watkins et al. have a great description of this effect that all statistics instructors should be aware of. I chose 10,000 for the number of replications because 1) it’s close enough to infinity that the mean of the simulated sampling distribution is very nearly the mean of the population, but 2) it’s far enough away from infinity to not crash your computer, even if you only have 4GB or 8GB of memory.
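To see this simulation artifact for yourself without editing the function, here is a small standalone sketch. The replication counts, sample size, and seed are my own choices for illustration. It draws samples of size 50 from an exponential population with mean 1 and reports how far the mean of the simulated sampling distribution lands from 1 for each number of replications.

```r
set.seed(1)                   # arbitrary seed, only for reproducibility
n <- 50                       # sample size
reps <- c(100, 1000, 10000)   # numbers of replications to compare

# Mean of the simulated sampling distribution of the mean, for each r
mean.of.means <- sapply(reps, function(r) {
  mean(rowMeans(matrix(rexp(n * r, rate = 1), nrow = r)))
})

print(data.frame(replications  = reps,
                 mean.of.means = mean.of.means,
                 abs.error     = abs(mean.of.means - 1)))
```

The errors will differ from run to run if you change the seed, but they should tend to get smaller as the number of replications grows.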

Here are some more examples to try. You can see that as you increase your sample size (n), the shapes of the sampling distributions become more and more normal, and the variance of the sample mean decreases (in proportion to 1/n), constraining your estimate of the population mean more and more.

sdm.sim(10,src.dist="E",1)
sdm.sim(50,src.dist="E",1)
sdm.sim(100,src.dist="E",1)
sdm.sim(10,src.dist="X",14)
sdm.sim(50,src.dist="X",14)
sdm.sim(100,src.dist="X",14)
sdm.sim(10,src.dist="N",param1=20,param2=3)
sdm.sim(50,src.dist="N",param1=20,param2=3)
sdm.sim(100,src.dist="N",param1=20,param2=3)
sdm.sim(10,src.dist="G",param1=5,param2=5)
sdm.sim(50,src.dist="G",param1=5,param2=5)
sdm.sim(100,src.dist="G",param1=5,param2=5)
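If you would rather see numbers than histograms, the following sketch summarizes the same trend for the exponential case. It is my own addition, with a hand-rolled `skewness` helper and an arbitrary seed as assumptions. For each sample size it reports the standard deviation of the sample means next to the theoretical value 1/sqrt(n), plus the skewness, which should move toward 0 as the shape becomes more normal.

```r
set.seed(2024)   # arbitrary seed, only for reproducibility
r <- 10000       # number of repeated samples

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

summary.by.n <- t(sapply(c(10, 50, 100), function(n) {
  m <- rowMeans(matrix(rexp(n * r, rate = 1), nrow = r))
  c(n = n,
    sd.of.means    = sd(m),
    theoretical.sd = 1 / sqrt(n),   # population sd is 1 when rate = 1
    skewness       = skewness(m))
}))
print(round(summary.by.n, 3))
```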

Reposted from: http://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29
Time: 2024-10-12 08:56:02
