Quiz 1
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
How many properties are worth $1,000,000 or more?
Question 2
Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the "tidy data" principles does this variable violate?
- Tidy data has variable values that are internally consistent.
- Tidy data has one variable per column. (Answer)
- Tidy data has no missing values.
- Tidy data has one observation per row.
Question 3
Download the Excel spreadsheet on Natural Gas Aquisition Program here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:dat
What is the value of:sum(dat$Zip*dat$Ext,na.rm=T)
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)
Question 4
Read the XML data on Baltimore restaurants from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
How many restaurants have zipcode 21231?
Question 5
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
using the fread() command load the data into an R objectDT
Which of the following is the fastest way to calculate the average value of the variablepwgtp15
broken down by sex using the data.table package?
quiz1.R
# set working directory
setwd(‘/home/mhwong/Coursera/Getting And Cleaning Data/Quiz1‘)
# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")
# read csv file
q1 <- read.csv(file = "q1.csv", header = TRUE)
# count properties with value more than 1,000,000
nrow(q1[which(q1$VAL == 24),])
# Question 3
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx",
destfile = "q3.xlsx",
method = "curl")
# load xlsx library
library(xlsx)
# read xlsx file
colIndex <- 7:15
rowIndex <- 18:23
dat <- read.xlsx("q3.xlsx", sheetIndex = 1, header = TRUE, colIndex = colIndex, rowIndex = rowIndex)
# sum
sum(dat$Zip*dat$Ext,na.rm=T)
# Question 4
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml",
destfile = "q4.xml",
method = "curl")
# load library
library(XML)
# read xml file
q4 <- xmlTreeParse(file = "q4.xml", useInternalNodes = TRUE)
# get root node
rootNode <- xmlRoot(q4)
# sum the zipcode == 21231
sum(xpathSApply(doc = rootNode, path = "//zipcode", fun = xmlValue) == 21231)
# Question 5
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
destfile = "q5.csv",
method = "curl")
# load library
library(data.table)
# read csv file using fread
DT <- fread("q5.csv", header = TRUE)
# best way to calculate mean by sex
DT[,mean(pwgtp15),by=SEX]
Quiz 2
Question 1
Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructors repositories (hint: this is the url you want "https://api.github.com/users/jtleek/repos"). Use this data to find the time that the datasharing repo was created. What time was it created? This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R package and not R studio.
Question 2
The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL. Download the American Community Survey data and load it into an R object calledacs
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?
Question 3
Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs$AGEP)
Question 4
How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page:
http://biostat.jhsph.edu/~jleek/contact.html
(Hint: the nchar() function in R may be helpful)
Question 5
Read this data set into R and report the sum of the numbers in the fourth of the nine columns.
https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for
Original source of the data: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for
(Hint this is a fixed width file format)
quiz2.R
# set working directory
setwd(‘/home/mhwong/Coursera/Getting And Cleaning Data/Quiz2‘)
# Question 1
# load library
library(httr)
library(httpuv)
# github oauth settings
myapp <- oauth_app("github",
key = "9ac78a3b36b3862ed435",
secret = "e1c3be242976383cc5f28dae177d41eef5445e3f")
# get oauth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# use api
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)
stop_for_status(req)
# convert unstructured json to structured json
library(jsonlite)
jsondata <- fromJSON(toJSON(content(req)))
# find out the created time of datasharing repo
subset(jsondata, name == "datasharing", select = c(created_at))
# Question 2
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
destfile = "q2.csv",
method = "curl")
# load library
library(sqldf)
# read data from file
acs <- read.csv("q2.csv", header = TRUE)
# select the probability weights pwgtp1 with ages less than 50
sqldf("select pwgtp1 from acs where AGEP < 50")
# Question3
# same sql as unique(acs$AGEP)
sqldf("select DISTINCT AGEP from acs")
# Question 4
# read from html
q4 <- url("http://biostat.jhsph.edu/~jleek/contact.html")
htmlCode <- readLines(q4, n = 100)
# find numbers of characters of line 10th, 20th, 30th, 100th
nchar(htmlCode[10])
nchar(htmlCode[20])
nchar(htmlCode[30])
nchar(htmlCode[100])
# Question 5
# read fixed width format
q5 <- read.fwf(file = "https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for",
skip = 4,
widths = c(12, 7,4, 9,4, 9,4, 9,4))
# find the sum of forth column
sum(q5[, 4])
Quiz 3
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. which(agricultureLogical) What are the first 3 values that result?
Question 2
Using the jpeg package read in the following picture of your instructor into R
https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg
Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data? (some Linux systems may produce an answer 638 different for the 30th quantile)
Question 3
Load the Gross Domestic Product data for the 190 ranked countries in this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?
Original data sources:
http://data.worldbank.org/data-catalog/GDP-ranking-table
http://data.worldbank.org/data-catalog/ed-stats
Question 4
What is the average GDP ranking for the "High income: OECD" and "High income: nonOECD" group?
Question 5
Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?
quiz3.R
# set working directory
setwd(‘/home/mhwong/Coursera/Getting And Cleaning Data/Quiz3‘)
# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")
# read csv file
q1 <- read.csv("q1.csv", header = TRUE)
# load library
library(plyr)
library(dplyr)
# create a logical vector
q1 <- mutate(q1, agricultureLogical=factor((ACR == 3 & AGS == 6), levels = c(TRUE, FALSE)))
# show the first 3 row names which the logical value are TRUE
head(row.names(q1[which(q1$agricultureLogical == TRUE),]), 3)
# Question 2
# download jpeg file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg",
destfile = "q2.jpg",
method = "curl")
# load library
library(jpeg)
# read jpeg file
q2 <- readJPEG(source = "q2.jpg",
native = TRUE)
# show the 30th and 80th quantiles
quantile(q2, c(0.3, 0.8))
# Question 3
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
destfile = "q3_1.csv",
method = "curl")
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
destfile = "q3_2.csv",
method = "curl")
# load the datasets
q3_1 <- read.csv("q3_1.csv", header = TRUE, skip = 3, sep = ",")
q3_2 <- read.csv("q3_2.csv", header = TRUE)
# reshaping data
q3_1 <- q3_1[2:191, c(1,2,4,5)]
rownames(q3_1) <- NULL
q3_1 <- rename(q3_1, CountryCode = X)
# merge two datasets
q3_merge <- join(q3_1, q3_2)
# show the number of matches
sum(!is.na(unique(q3_merge$Ranking)))
# convert the data type of Ranking
q3_merge$Ranking <- as.numeric(as.character(q3_merge$Ranking))
# show the 13th country after sort decending
q3_merge <- arrange(q3_merge, desc(Ranking))
q3_merge[13,3]
# Question 4
# Group q3_merge by Income.Group
income_group <- group_by(q3_merge, Income.Group)
summarise(income_group, avg = mean(Ranking, na.rm = TRUE))
# Question 5
# cut Ranking into 5 quantile groups
q3_merge$RankingGroup <- cut(q3_merge$Ranking, breaks = 5)
# make a table vs Income.Group
table(q3_merge$RankingGroup, q3_merge$Income.Group)
Quiz 4
Question 1
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123 element of the resulting list?
Question 2
Load the Gross Domestic Product data for the 190 ranked countries in this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?
Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table
Question 3
In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with "United"? Assume that the variable with the country names in it is named countryNames. How many countries begin with United?
Question 4
Load the Gross Domestic Product data for the 190 ranked countries in this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
Load the educational data from this data set:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv
Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?
Original data sources:
http://data.worldbank.org/data-catalog/GDP-ranking-table
http://data.worldbank.org/data-catalog/ed-stats
Question 5
You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon‘s stock price and get the times the data was sampled.
library(quantmod) amzn = getSymbols("AMZN",auto.assign=FALSE) sampleTimes = index(amzn)
How many values were collected in 2012? How many values were collected on Mondays in 2012?
quiz4.R
# set working directory
setwd(‘/home/mhwong/Coursera/Getting And Cleaning Data/Quiz4‘)
# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")
# load csv file
q1 <- read.csv("q1.csv", header = TRUE)
# apply strsplit() and check the 123th element
strsplit(x = names(q1), split = "wgtp")[[123]]
# Question 2
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
destfile = "q2.csv",
method = "curl")
# load library
library(plyr)
library(dplyr)
# load csv file
q2 <- read.csv("q2.csv", header = TRUE, skip = 3)
# reshape data
q2 <- q2[2:191, c(1,2,4,5)]
rownames(q2) <- NULL
q2 <- rename(q2, CountryCode = X)
names(q2) <- gsub(pattern = "\\.",
replacement = "",
x = names(q2))
names(q2) <- tolower(names(q2))
# remove the comma in column usdollars
q2$usdollars <- gsub(pattern = ",",
replacement = "",
x = q2$usdollars)
# convert usdollars to numeric
q2$usdollars <- as.numeric(as.character(q2$usdollars))
# calculate the average of usdollars
mean(x = q2$usdollars, na.rm = TRUE)
# Quesetion 3
# change the column name to match the question
q2 <- rename(q2, countryNames = economy)
# find countries whose name begins with "United"
grep(pattern = "^United", x = q2$countryNames)
# count the countries whose name begins with "United
sum(grepl(pattern = "^United",
x = q2$countryNames) == TRUE)
# Question 4
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
destfile = "q4.csv",
method = "curl")
# load two csv file
q4_1 <- q2
q4_2 <- read.csv("q4.csv", header = TRUE)
# reshape the data
names(q4_2) <- tolower(gsub(pattern = "\\.",
replacement = "",
x = names(q4_2)))
# merge the data
q4_merge <- join(x = q4_1, y = q4_2)
# count the fiscal year end in June
sum(grepl(pattern = "^[Ff]iscal(.*)[Yy]ear(.*)[Ee]nd(.)*[Jj]une",
x = q4_merge$specialnotes) == TRUE)
# Question 5
# launch given code
library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)
# create a data fram from amzn
amzn <- data.frame(amzn)
# count the values collected in 2012
sum(format(as.Date(x = rownames(amzn), format = "%Y-%m-%d"), "%Y") == 2012)
# count the values collected in Mondays 2012
sum(format(as.Date(x = rownames(amzn), format = "%Y-%m-%d"), "%Y%a") == "2012Mon")
所有这次的代码来自:http://mhwong2007.logdown.com/
谢谢!