[Johns Hopkins] R Programming Assignment Week 2 - Air Pollution

Introduction

For this first programming assignment you will write three functions that are meant to interact with the dataset that accompanies this assignment. The dataset is contained in a zip file specdata.zip that you can download from the Coursera web site.

Data

The zip file containing the data can be downloaded here: https://storage.googleapis.com/jhu_rprg/specdata.zip

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

  • Date: the date of the observation in YYYY-MM-DD format (year-month-day)
  • sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
  • nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
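To make the NA coding concrete, here is a minimal sketch on a small made-up data frame with the same three columns. The values are invented for illustration and are not taken from specdata. It shows how missing values propagate through mean() unless na.rm = TRUE is set, and how complete.cases() flags fully observed rows.

```r
# Hypothetical toy data laid out like a monitor file (values are invented)
toy <- data.frame(
  Date    = sprintf("2003-01-%02d", 1:4),
  sulfate = c(2.0, NA, 4.0, NA),
  nitrate = c(1.0, 0.5, NA, NA)
)

mean_with_na    <- mean(toy$sulfate)                 # NA: missing values propagate
mean_without_na <- mean(toy$sulfate, na.rm = TRUE)   # 3: mean of 2.0 and 4.0
fully_observed  <- complete.cases(toy)               # TRUE FALSE FALSE FALSE
n_complete      <- sum(fully_observed)               # 1 fully observed row
```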

Part 1

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows:

pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## ‘directory‘ is a character vector of length 1 indicating
        ## the location of the CSV files

        ## ‘pollutant‘ is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## ‘id‘ is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return the mean of the pollutant across all monitors listed
        ## in the ‘id‘ vector (ignoring NA values)
        ## NOTE: Do not round the result!
}

You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named pollutantmean.R.

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. A prototype of this function follows

complete <- function(directory, id = 1:332) {
        ## ‘directory‘ is a character vector of length 1 indicating
        ## the location of the CSV files

        ## ‘id‘ is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return a data frame of the form:
        ## id nobs
        ## 1  117
        ## 2  1041
        ## ...
        ## where ‘id‘ is the monitor ID number and ‘nobs‘ is the
        ## number of complete cases
}

You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named complete.R. To run the submit script for this part, make sure your working directory has the file complete.R in it.

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

corr <- function(directory, threshold = 0) {
        ## ‘directory‘ is a character vector of length 1 indicating
        ## the location of the CSV files

        ## ‘threshold‘ is a numeric vector of length 1 indicating the
        ## number of completely observed observations (on all
        ## variables) required to compute the correlation between
        ## nitrate and sulfate; the default is 0

        ## Return a numeric vector of correlations
        ## NOTE: Do not round the result!
}

For this function you will need to use the ‘cor’ function in R which calculates the correlation between two vectors. Please read the help page for this function via ‘?cor’ and make sure that you know how to use it.
You can see some example output from this function. The function that you write should be able to match this output. Please save your code to a file named corr.R. To run the submit script for this part, make sure your working directory has the file corr.R in it.
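As a quick illustration of how ‘cor’ treats missing values, the sketch below uses two short made-up vectors (the numbers are invented, not from specdata). By default a single NA makes the result NA; the `use = "complete.obs"` argument restricts the calculation to pairs where both values are present.

```r
# Hypothetical paired measurements; values are invented for illustration
sulfate <- c(1, 2, 3, 4, NA)
nitrate <- c(2, 4, 6, NA, 10)

r_default  <- cor(sulfate, nitrate)                        # NA: missing values are not dropped by default
r_complete <- cor(sulfate, nitrate, use = "complete.obs")  # uses only the three complete pairs (a perfect line here)
```

An alternative is to filter out incomplete rows yourself before calling ‘cor’, which keeps the filtering step visible in your own code.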

--------------------------------------------------------------Answer section------------------------------------------------------------------------

You can click the link to download the file and then unzip it,

or define a custom R function, get_specdata(), to carry out those steps.

# Define get_specdata()
get_specdata <- function(dest_file) {
  specdata_url <- "https://storage.googleapis.com/jhu_rprg/specdata.zip"   # URL of the file to download
  download.file(specdata_url, destfile = dest_file)                 # download to destfile (note: a leading ~ is expanded to the user's home directory as R defines it)
  unzip(dest_file)                                            # unzip into the current working directory
}
get_specdata("~/specdata.zip")

# get_specdata() with a configurable extraction directory
get_specdata <- function(dest_file, ex_dir) {
  specdata_url <- "https://storage.googleapis.com/jhu_rprg/specdata.zip"
  download.file(specdata_url, destfile = dest_file)
  unzip(dest_file, exdir = ex_dir)                            # exdir is the directory to extract into (note: a leading ~ in a path means the user's home directory)
}
get_specdata("~/specdata.zip", "D:/R/Project")

pollutantmean()

pollutantmean <- function(directory, pollutant, id = 1:332) {
  CSV_files_dir <- list.files(directory, full.names = TRUE)  # full paths of the files in the target folder, as a character vector
  dataf <- data.frame()
  for (i in id) {
    dataf <- rbind(dataf, read.csv(CSV_files_dir[i]))        # rbind appends each file's rows inside the for loop
  }
  mean(dataf[, pollutant], na.rm = TRUE)                     # mean of the chosen column over all rows, ignoring NA
}

An alternative approach

pollutantmean <- function(directory, pollutant, id = 1:332) {
  pollutants <- c()                                           # empty vector to collect the values
  filenames <- list.files(directory)                          # no full.names argument here, so only file names are returned

  for (i in id) {
    filepath <- paste(directory, "/", filenames[i], sep = "") # join the directory and file name into the full path, filepath
    data <- read.csv(filepath, header = TRUE)                 # read the target file with its header into data
    pollutants <- c(pollutants, data[, pollutant])            # append this file's values to pollutants
  }
  pollutants_mean <- mean(pollutants, na.rm = TRUE)           # compute the mean and store it in pollutants_mean

  pollutants_mean                                             # return the result
}

Practice
pollutantmean("specdata", "sulfate", 1:10)
[1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706
pollutantmean("specdata", "sulfate", 34)
[1] 1.477
pollutantmean("specdata", "nitrate")
[1] 1.703
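
Both versions above pick files by position in the list.files() result, which works only as long as the sorted listing lines up with the monitor IDs. A minimal sketch of a variant that builds each path from the ID itself follows. It assumes zero-padded three-digit file names such as "001.csv", which the text above does not state (it only mentions "200.csv"), and the function name pollutantmean_by_name and the demo files are hypothetical.

```r
pollutantmean_by_name <- function(directory, pollutant, id = 1:332) {
  # Build each path from the monitor ID (assumes names like "001.csv")
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Read each file, keep only the requested column, and concatenate the values
  values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  mean(values, na.rm = TRUE)
}

# Self-contained check on two made-up files in a fresh temporary directory
demo_dir <- tempfile("pm_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:2),
                     sulfate = c(1, NA), nitrate = c(2, 4)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:2),
                     sulfate = c(3, 5), nitrate = c(NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_sulfate <- pollutantmean_by_name(demo_dir, "sulfate", 1:2)   # mean of 1, 3, 5
demo_nitrate <- pollutantmean_by_name(demo_dir, "nitrate", 1:2)   # mean of 2, 4, 6
```

Collecting the values with lapply() also avoids growing a data frame inside the loop, though for 332 small files the difference is unlikely to matter much.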

complete()

complete <- function(directory, id = 1:332) {
  CSV_files <- list.files(directory, full.names = TRUE)
  datadf <- data.frame()
  for (i in id) {
    moni_i <- read.csv(CSV_files[i])
    nobs <- sum(complete.cases(moni_i))      # complete.cases() returns a logical vector marking complete rows; sum() counts the TRUE values
    tmpdf <- data.frame(i, nobs)             # store the monitor ID and its count as a one-row data frame
    datadf <- rbind(datadf, tmpdf)           # append it as a new row
  }
  colnames(datadf) <- c("id", "nobs")        # name the columns
  datadf                                     # return the result
}

The output is a data frame.

Practice
Number of complete records for the specified monitors:
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))   # cc has two columns, "id" and "nobs"
print(cc$nobs)                                                # the nobs vector

[1] 228 148 124 165 104 460 232
Number of complete records for the specified monitor:
cc <- complete("specdata", 54)                               # cc has two columns, "id" and "nobs"
print(cc$nobs)                                               # the nobs vector
[1] 219
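Number of complete records for 10 randomly sampled monitors:
set.seed(42)
cc <- complete("specdata", 332:1)                            # cc has two columns, "id" and "nobs"; rows are in reverse ID order, and use below indexes rows by position
use <- sample(332, 10)                                       # draw 10 random numbers from 1:332 into the use vector
print(cc[use, "nobs"])                                       # "nobs" for the rows indexed by use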

[1] 711 135  74 445 178  73  49   0 687 237
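
The loop above grows the data frame with rbind() on every iteration. As a hedged alternative, the sketch below computes all counts first with vapply() and builds the data frame once. The name complete_vapply, the zero-padded file names ("001.csv"), and the demo files are assumptions for illustration.

```r
complete_vapply <- function(directory, id = 1:332) {
  # Build each path from the monitor ID (assumes names like "001.csv")
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Count fully observed rows per file; vapply enforces one integer per file
  nobs <- vapply(paths, function(p) sum(complete.cases(read.csv(p))),
                 integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

# Self-contained check on two made-up files in a fresh temporary directory
demo_dir <- tempfile("complete_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:3),
                     sulfate = c(1, NA, 3), nitrate = c(2, 4, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:2),
                     sulfate = c(NA, NA), nitrate = c(1, NA)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_cc <- complete_vapply(demo_dir, 2:1)   # rows follow the order of id: monitor 2, then monitor 1
```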

corr()

corr <- function(directory, threshold = 0) {                           # threshold defaults to 0
  CSV_files <- list.files(directory, full.names = TRUE)
  dat <- vector(mode = "numeric", length = 0)                          # empty numeric vector
  for (i in seq_along(CSV_files)) {                                    # no id argument here, so loop over every file
    moni_i <- read.csv(CSV_files[i])
    csum <- sum((!is.na(moni_i$sulfate)) & (!is.na(moni_i$nitrate)))   # number of rows where both measurements are non-NA
    if (csum > threshold) {                                            # only monitors above the threshold
      tmp <- moni_i[which(!is.na(moni_i$sulfate)), ]                   # keep rows where sulfate is not NA
      submoni_i <- tmp[which(!is.na(tmp$nitrate)), ]                   # then keep rows where nitrate is not NA
      dat <- c(dat, cor(submoni_i$sulfate, submoni_i$nitrate))         # append the cor() value to dat
    }
  }
  dat
}

The output is a numeric vector.

Practice
Sort the correlations, randomly sample 5 of them, and round to four decimal places:
cr <- corr("specdata")
cr <- sort(cr)
set.seed(868)
out <- round(cr[sample(length(cr), 5)], 4)
print(out)

[1] 0.2688 0.1127 -0.0085 0.4586 0.0447
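Number of monitors with more than 129 complete records, followed by 5 correlations randomly sampled from the sorted values and rounded to four decimal places:
cr <- corr("specdata", 129)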
cr <- sort(cr)
n <- length(cr)
set.seed(197)
out <- c(n, round(cr[sample(n, 5)], 4))
print(out)

[1] 243.0000   0.2540   0.0504  -0.1462  -0.1680   0.5969
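Number of monitors with more than 2000 complete records, followed by the sorted correlations for monitors with more than 1000 complete records, rounded to four decimal places:
cr <- corr("specdata", 2000)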
n <- length(cr)
cr <- corr("specdata", 1000)
cr <- sort(cr)
print(c(n, round(cr, 4)))

[1]  0.0000 -0.0190  0.0419  0.1901
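
The corr() above filters sulfate and nitrate in two steps. A more compact sketch uses complete.cases() once per file, which also matches the assignment wording of cases observed "on all variables". The name corr_sketch and the demo files are hypothetical; results should agree with the version above whenever the Date column has no missing values.

```r
corr_sketch <- function(directory, threshold = 0) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  result <- numeric(0)                       # stays length 0 if no monitor qualifies
  for (p in paths) {
    monitor <- read.csv(p)
    ok <- complete.cases(monitor)            # rows observed on all variables
    if (sum(ok) > threshold) {
      result <- c(result, cor(monitor$sulfate[ok], monitor$nitrate[ok]))
    }
  }
  result
}

# Self-contained check on two made-up files in a fresh temporary directory
demo_dir <- tempfile("corr_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:4),
                     sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 8)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)   # 3 complete rows
write.csv(data.frame(Date = sprintf("2003-01-%02d", 1:3),
                     sulfate = c(1, 2, NA), nitrate = c(3, 1, 5)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)   # 2 complete rows

demo_above_2  <- corr_sketch(demo_dir, 2)    # only 001.csv has more than 2 complete rows
demo_above_10 <- corr_sketch(demo_dir, 10)   # no file qualifies, so numeric(0)
```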

Original article: https://www.cnblogs.com/pyleu1028/p/10357970.html

Time: 2024-10-05 05:36:52
