[Coursera] Getting and Cleaning Data Quiz

Quiz 1

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

How many properties are worth $1,000,000 or more?

Question 2

Use the data you loaded from Question 1. Consider the variable FES in the code book. Which of the "tidy data" principles does this variable violate?

  • Tidy data has variable values that are internally consistent.
  • Tidy data has one variable per column. (Answer)
  • Tidy data has no missing values.
  • Tidy data has one observation per row.

Question 3

Download the Excel spreadsheet on the Natural Gas Acquisition Program here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:
dat 
What is the value of:
sum(dat$Zip*dat$Ext,na.rm=T)
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)

Question 4

Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

How many restaurants have zipcode 21231?

Question 5

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Using the fread() command, load the data into an R object called
DT
Which of the following is the fastest way to calculate the average value of the variable
pwgtp15 
broken down by sex using the data.table package?

quiz1.R

# set working directory
setwd("/home/mhwong/Coursera/Getting And Cleaning Data/Quiz1")

# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")

# read csv file
q1 <- read.csv(file = "q1.csv", header = TRUE)

# count properties worth $1,000,000 or more (VAL code 24 in the code book)
nrow(q1[which(q1$VAL == 24),])
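# Note: an equivalent one-liner, assuming (per the code book) that VAL code 24
# means a property value of $1,000,000 or more:
sum(q1$VAL == 24, na.rm = TRUE)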

# Question 3
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx",
destfile = "q3.xlsx",
method = "curl")

# load xlsx library
library(xlsx)

# read xlsx file
colIndex <- 7:15
rowIndex <- 18:23
dat <- read.xlsx("q3.xlsx", sheetIndex = 1, header = TRUE, colIndex = colIndex, rowIndex = rowIndex)

# sum
sum(dat$Zip*dat$Ext,na.rm=T)

# Question 4
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml",
destfile = "q4.xml",
method = "curl")

# load library
library(XML)

# read xml file
q4 <- xmlTreeParse(file = "q4.xml", useInternalNodes = TRUE)

# get root node
rootNode <- xmlRoot(q4)

# count the restaurants with zipcode 21231
sum(xpathSApply(doc = rootNode, path = "//zipcode", fun = xmlValue) == 21231)
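# Alternative sketch: let XPath do the filtering and count the matching nodes
length(xpathSApply(doc = rootNode, path = "//zipcode[text()='21231']", fun = xmlValue))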

# Question 5
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
destfile = "q5.csv",
method = "curl")

# load library
library(data.table)

# read csv file using fread
DT <- fread("q5.csv", header = TRUE)

# best way to calculate mean by sex
DT[,mean(pwgtp15),by=SEX]
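# For comparison, base-R ways of getting the same means; these mirror the kind
# of slower alternatives listed in the quiz options and can be timed with
# system.time() to confirm which is fastest:
system.time(DT[, mean(pwgtp15), by = SEX])
system.time(tapply(DT$pwgtp15, DT$SEX, mean))
system.time(sapply(split(DT$pwgtp15, DT$SEX), mean))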

Quiz 2

Question 1

Register an application with the Github API here https://github.com/settings/applications. Access the API to get information on your instructor's repositories (hint: this is the url you want "https://api.github.com/users/jtleek/repos"). Use this data to find the time that the datasharing repo was created. What time was it created? This tutorial may be useful (https://github.com/hadley/httr/blob/master/demo/oauth2-github.r). You may also need to run the code in the base R package and not RStudio.

Question 2

The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL. Download the American Community Survey data and load it into an R object called
acs

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?

Question 3

Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs$AGEP)

Question 4

How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page:

http://biostat.jhsph.edu/~jleek/contact.html

(Hint: the nchar() function in R may be helpful)

Question 5

Read this data set into R and report the sum of the numbers in the fourth of the nine columns.

https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for

Original source of the data: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

(Hint this is a fixed width file format)

quiz2.R
# set working directory
setwd("/home/mhwong/Coursera/Getting And Cleaning Data/Quiz2")

# Question 1
# load library
library(httr)
library(httpuv)

# github oauth settings
myapp <- oauth_app("github",
key = "9ac78a3b36b3862ed435",
secret = "e1c3be242976383cc5f28dae177d41eef5445e3f")

# get oauth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)

# use api
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)
stop_for_status(req)

# convert unstructured json to structured json
library(jsonlite)
jsondata <- fromJSON(toJSON(content(req)))

# find out the created time of datasharing repo
subset(jsondata, name == "datasharing", select = c(created_at))
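# Note: for a public repository the OAuth step is optional; an unauthenticated
# request (rate-limited by GitHub) returns the same field. A minimal sketch
# (req2/json2 are placeholder names):
req2 <- GET("https://api.github.com/users/jtleek/repos")
json2 <- fromJSON(toJSON(content(req2)))
subset(json2, name == "datasharing", select = c(created_at))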

# Question 2
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv",
destfile = "q2.csv",
method = "curl")

# load library
library(sqldf)

# read data from file
acs <- read.csv("q2.csv", header = TRUE)

# select the probability weights pwgtp1 with ages less than 50
sqldf("select pwgtp1 from acs where AGEP < 50")

# Question 3
# same sql as unique(acs$AGEP)
sqldf("select DISTINCT AGEP from acs")

# Question 4
# read from html
q4 <- url("http://biostat.jhsph.edu/~jleek/contact.html")
htmlCode <- readLines(q4, n = 100)

# find numbers of characters of line 10th, 20th, 30th, 100th
nchar(htmlCode[10])
nchar(htmlCode[20])
nchar(htmlCode[30])
nchar(htmlCode[100])
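# nchar() is vectorised, so the four calls above can be collapsed into one:
nchar(htmlCode[c(10, 20, 30, 100)])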

# Question 5
# read fixed width format
q5 <- read.fwf(file = "https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for",
skip = 4,
widths = c(12, 7,4, 9,4, 9,4, 9,4))

# find the sum of the fourth column
sum(q5[, 4])
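# Sanity check (not part of the graded answer): the widths vector should have
# produced the nine columns the question refers to, with the fourth holding
# the values being summed:
dim(q5)
head(q5, 2)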

Quiz 3

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Create a logical vector that identifies the households on greater than 10 acres who sold more than $10,000 worth of agriculture products. Assign that logical vector to the variable agricultureLogical. Apply the which() function like this to identify the rows of the data frame where the logical vector is TRUE. which(agricultureLogical) What are the first 3 values that result?

Question 2

Using the jpeg package read in the following picture of your instructor into R

https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg

Use the parameter native=TRUE. What are the 30th and 80th quantiles of the resulting data? (some Linux systems may produce an answer 638 different for the 30th quantile)

Question 3

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. How many of the IDs match? Sort the data frame in descending order by GDP rank (so United States is last). What is the 13th country in the resulting data frame?

Original data sources: 
http://data.worldbank.org/data-catalog/GDP-ranking-table 
http://data.worldbank.org/data-catalog/ed-stats

Question 4

What is the average GDP ranking for the "High income: OECD" and "High income: nonOECD" group?

Question 5

Cut the GDP ranking into 5 separate quantile groups. Make a table versus Income.Group. How many countries are Lower middle income but among the 38 nations with highest GDP?

quiz3.R

# set working directory
setwd("/home/mhwong/Coursera/Getting And Cleaning Data/Quiz3")

# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")

# read csv file
q1 <- read.csv("q1.csv", header = TRUE)

# load library
library(plyr)
library(dplyr)

# create a logical vector
q1 <- mutate(q1, agricultureLogical=factor((ACR == 3 & AGS == 6), levels = c(TRUE, FALSE)))

# show the first 3 row names for which the logical value is TRUE
head(row.names(q1[which(q1$agricultureLogical == TRUE),]), 3)
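# A version closer to the question's wording, assuming (per the code book) that
# ACR == 3 means "more than 10 acres" and AGS == 6 means "more than $10,000":
agricultureLogical <- q1$ACR == 3 & q1$AGS == 6
head(which(agricultureLogical), 3)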

# Question 2
# download jpeg file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fjeff.jpg",
destfile = "q2.jpg",
method = "curl")

# load library
library(jpeg)

# read jpeg file
q2 <- readJPEG(source = "q2.jpg",
native = TRUE)

# show the 30th and 80th quantiles
quantile(q2, c(0.3, 0.8))

# Question 3
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
destfile = "q3_1.csv",
method = "curl")
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
destfile = "q3_2.csv",
method = "curl")

# load the datasets
q3_1 <- read.csv("q3_1.csv", header = TRUE, skip = 3, sep = ",")
q3_2 <- read.csv("q3_2.csv", header = TRUE)

# reshaping data
q3_1 <- q3_1[2:191, c(1,2,4,5)]
rownames(q3_1) <- NULL
q3_1 <- rename(q3_1, CountryCode = X)

# merge two datasets
q3_merge <- join(q3_1, q3_2)

# show the number of matches
sum(!is.na(unique(q3_merge$Ranking)))

# convert the data type of Ranking
q3_merge$Ranking <- as.numeric(as.character(q3_merge$Ranking))

# show the 13th country after sorting in descending order
q3_merge <- arrange(q3_merge, desc(Ranking))
q3_merge[13,3]

# Question 4
# Group q3_merge by Income.Group
income_group <- group_by(q3_merge, Income.Group)
summarise(income_group, avg = mean(Ranking, na.rm = TRUE))

# Question 5
# cut Ranking into 5 quantile groups
q3_merge$RankingGroup <- cut(q3_merge$Ranking, breaks = 5)

# make a table vs Income.Group
table(q3_merge$RankingGroup, q3_merge$Income.Group)
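# Note: cut() with breaks = 5 makes equal-width intervals, which happens to work
# here because the rankings are evenly spaced integers. An explicit quantile-based
# cut matches the question's wording more directly (a sketch; RankingGroup2 is a
# placeholder name):
q3_merge$RankingGroup2 <- cut(q3_merge$Ranking,
                              breaks = quantile(q3_merge$Ranking,
                                                probs = seq(0, 1, 0.2),
                                                na.rm = TRUE),
                              include.lowest = TRUE)
table(q3_merge$RankingGroup2, q3_merge$Income.Group)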

Quiz 4

Question 1

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123rd element of the resulting list?

Question 2

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

Original data sources: http://data.worldbank.org/data-catalog/GDP-ranking-table

Question 3

In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with "United"? Assume that the variable with the country names in it is named countryNames. How many countries begin with United?

Question 4

Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?

Original data sources: 
http://data.worldbank.org/data-catalog/GDP-ranking-table 
http://data.worldbank.org/data-catalog/ed-stats

Question 5

You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon's stock price and get the times the data was sampled.

library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

How many values were collected in 2012? How many values were collected on Mondays in 2012?

quiz4.R
# set working directory
setwd("/home/mhwong/Coursera/Getting And Cleaning Data/Quiz4")

# Question 1
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
destfile = "q1.csv",
method = "curl")

# load csv file
q1 <- read.csv("q1.csv", header = TRUE)

# apply strsplit() and check the 123rd element
strsplit(x = names(q1), split = "wgtp")[[123]]

# Question 2
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv",
destfile = "q2.csv",
method = "curl")

# load library
library(plyr)
library(dplyr)

# load csv file
q2 <- read.csv("q2.csv", header = TRUE, skip = 3)

# reshape data
q2 <- q2[2:191, c(1,2,4,5)]
rownames(q2) <- NULL
q2 <- rename(q2, CountryCode = X)
names(q2) <- gsub(pattern = "\\.",
replacement = "",
x = names(q2))
names(q2) <- tolower(names(q2))

# remove the comma in column usdollars
q2$usdollars <- gsub(pattern = ",",
replacement = "",
x = q2$usdollars)

# convert usdollars to numeric
q2$usdollars <- as.numeric(as.character(q2$usdollars))

# calculate the average of usdollars
mean(x = q2$usdollars, na.rm = TRUE)

# Question 3
# change the column name to match the question
q2 <- rename(q2, countryNames = economy)

# find countries whose name begins with "United"
grep(pattern = "^United", x = q2$countryNames)

# count the countries whose name begins with "United
sum(grepl(pattern = "^United",
x = q2$countryNames) == TRUE)

# Question 4
# download file from server
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv",
destfile = "q4.csv",
method = "curl")

# load two csv file
q4_1 <- q2
q4_2 <- read.csv("q4.csv", header = TRUE)

# reshape the data
names(q4_2) <- tolower(gsub(pattern = "\\.",
replacement = "",
x = names(q4_2)))

# merge the data
q4_merge <- join(x = q4_1, y = q4_2)

# count the fiscal year end in June
sum(grepl(pattern = "^[Ff]iscal(.*)[Yy]ear(.*)[Ee]nd(.)*[Jj]une",
x = q4_merge$specialnotes) == TRUE)
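# A simpler check, assuming the special notes follow the exact wording
# "Fiscal year end: <month>":
sum(grepl("Fiscal year end: June", q4_merge$specialnotes))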

# Question 5
# launch given code
library(quantmod)
amzn = getSymbols("AMZN",auto.assign=FALSE)
sampleTimes = index(amzn)

# create a data frame from amzn
amzn <- data.frame(amzn)

# count the values collected in 2012
sum(format(as.Date(x = rownames(amzn), format = "%Y-%m-%d"), "%Y") == 2012)

# count the values collected on Mondays in 2012
sum(format(as.Date(x = rownames(amzn), format = "%Y-%m-%d"), "%Y%a") == "2012Mon")
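# Equivalent checks using Date helpers (note weekdays() returns locale-dependent
# day names, so "Monday" assumes an English locale):
d <- as.Date(rownames(amzn))
sum(format(d, "%Y") == "2012")
sum(format(d, "%Y") == "2012" & weekdays(d) == "Monday")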

All of the code in this post comes from: http://mhwong2007.logdown.com/

Thanks!

Posted: 2024-10-21 06:36:39
