---
title: "CMPT 459.1-19. Programming Assignment 1"
subtitle: "FIFA 19 Players"
author: "Name - Student ID"
output: html_notebook
---
### Introduction
The data has detailed attributes for every player registered in
the latest edition of the FIFA 19 database, obtained by scraping
the website "sofifa.com". Each instance is a different player, and
the attributes give basic information about the players and
their football skills. Basic pre-processing was done, and
goalkeepers were removed for this assignment.
Please look here for the original data overview and attributes’
descriptions:
- https://www.kaggle.com/karangadiya/fifa19
And here to get a better view of the information:
- https://sofifa.com/
---
### First look
**[Task 1]**: Load the dataset, completing the code below (keep
the dataframe name as **fifa**)
```{r}
# Loading
fifa <- read.csv("fifa.csv")
```
**[Checkpoint 1]**: How many rows and columns exist?
```{r}
cat(ifelse(all(dim(fifa) == c(16122, 68)),
           "Correct results!", "Wrong results.."))
```
---
**[Task 2]**: Give a very brief overview of the types of each
attribute and their values. **HINT**: Functions *str*, *table*,
*summary*.
```{r}
# Overview
str(fifa)
```
**[Checkpoint 2]**: Were functions used to display data types
and give some idea of the information of the attributes?
---
### Data Cleaning
Functions suggested for this part: *ifelse*, *substr*,
*nchar*, *str_split*, *map_dbl*.
Five attributes need to be cleaned.
- **Value**: Remove euro character, deal with ending
"K" (thousands) and "M" (millions), define missing values and
make it numeric.
- **Wage**: Same as above.
- **Release.Clause**: Same as above.
- **Height**: Convert to "cm" and make it numeric.
- **Weight**: Remove "lbs" and make it numeric.
**[Task 3]**: The first three of the five attributes listed above
are cleaned in the same way. Create a single function to clean
them: it should take the vector of attribute values as a
parameter and return it cleaned. Use it three times, once for
each of the three columns. **Encode zeroes or blanks as NA.**
```{r}
# Function used to clean attributes
library(stringr)
attr_fix <- function(attribute){
  x <- str_replace(as.character(attribute), "€", "")       # drop the euro sign
  mult <- ifelse(str_detect(x, "M$"), 1e6,                 # "M" = millions
                 ifelse(str_detect(x, "K$"), 1e3, 1))      # "K" = thousands
  cleaned_attribute <- as.numeric(str_replace(x, "[KM]$", "")) * mult
  # Encode zeroes and blanks (non-numeric strings) as NA
  ifelse(is.na(cleaned_attribute) | cleaned_attribute == 0,
         NA, cleaned_attribute)
}
# Cleaning attributes
fifa$Value <- attr_fix(fifa$Value)
fifa$Wage <- attr_fix(fifa$Wage)
fifa$Release.Clause <- attr_fix(fifa$Release.Clause)
```
**[Checkpoint 3]**: How many NA values?
```{r}
cat(ifelse(sum(is.na(fifa)) == 1779,
           "Correct results!", "Wrong results.."))
```
---
**[Task 4]**: Clean the other two attributes. **Hint**: To
convert to "cm" use http://www.sengpielaudio.com/calculatorbodylength.htm.
```{r}
# Cleaning attribute Weight:
```
```{r}
# Cleaning attribute Height:
```
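One possible approach, sketched here on toy values rather than the real columns (the exact source formats, Height like "5'7" in feet and inches and Weight like "159lbs", are assumptions based on the sofifa data):

```{r}
# Hedged sketch on toy vectors; the same steps would be applied to
# fifa$Height and fifa$Weight.
library(stringr)
toy_height <- c("5'7", "6'2")
toy_weight <- c("159lbs", "181lbs")

# feet'inches -> cm: 1 ft = 30.48 cm, 1 in = 2.54 cm
parts <- str_split(toy_height, "'")
height_cm <- sapply(parts, function(p) as.numeric(p[1]) * 30.48 +
                                       as.numeric(p[2]) * 2.54)

# strip the "lbs" suffix and convert to numeric
weight_lbs <- as.numeric(str_replace(toy_weight, "lbs", ""))
```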
**[Checkpoint 4]**: What are the mean values of these two
columns?
```{r}
cat(ifelse(all(c(round(mean(fifa[,8]), 4) == 164.1339,
                 round(mean(fifa[,7]), 4) == 180.3887)),
           "Correct results!", "Wrong results.."))
```
---
### Missing Values
**[Task 5]**: Which columns have missing values? List them below
(replace <ANSWER HERE>). Impute (do not remove) the missing
values (that is, every NA found) and explain the reasons for the
method used. Suggestion: MICE imputation based on random
forests, R package "mice": https://www.ncbi.nlm.nih.gov/pmc/
articles/PMC3074241/. Use *set.seed(1)*. **HINT**: Remember not
to use "ID" or "International.Reputation" for the imputation,
if MAR (Missing At Random) is assumed. Also remember to put
them back into the "fifa" dataframe afterwards.
Columns with missing values:
- <ANSWER HERE>
- <ANSWER HERE>
- ...
```{r}
# Handling NA values
```
```{r}
# Putting columns not used in the imputation back into the "fifa" dataframe
```
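A minimal sketch of the hinted *mice* workflow on a toy data frame (the toy data, `m = 1` and `method = "pmm"` are illustrative only; for the real columns the suggestion above is `method = "rf"`, applied after setting aside "ID" and "International.Reputation"):

```{r}
# Hedged sketch: impute NAs with mice on a toy data frame
library(mice)
set.seed(1)
toy <- data.frame(a = c(1, 2, NA, 4, 5, 6),
                  b = c(2.1, NA, 6.3, 8.0, 10.2, 12.5))
imp <- mice(toy, method = "pmm", m = 1, maxit = 5, printFlag = FALSE)
toy_complete <- complete(imp)   # data frame with NAs filled in
sum(is.na(toy_complete))
```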
**[Checkpoint 5]**: How many instances have at least one NA? It
should be 0 now. How many columns are there? It should be 68
(remember to put back "ID" and "International.Reputation").
```{r}
cat(ifelse(all(sum(is.na(fifa)) == 0, ncol(fifa) == 68),
           "Correct results!", "Wrong results.."))
```
---
### Feature Engineering
**[Task 6]**: Create a new attribute called "Position.Rating"
that has the rating value of the position corresponding to the
player. For example, if the player has the value "CF" on the
attribute "Position", then "Position.Rating" should have the
number on the "CF" attribute. **After that, remove the
"Position" attribute from the data**.
```{r}
# Creating the attribute "Position.Rating"

```
```{r}
# Removing the attribute "Position"
```
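One way to pick, for each row, the value of the column named by "Position" is matrix-style indexing with *cbind*. A hedged sketch on a toy data frame (the two position columns here stand in for fifa's 26):

```{r}
# Toy sketch: look up, per row, the column whose name is in Position
toy <- data.frame(Position = c("CF", "ST"),
                  CF = c(70, 60), ST = c(65, 80),
                  stringsAsFactors = FALSE)
toy$Position.Rating <- as.numeric(
  toy[cbind(seq_len(nrow(toy)), match(toy$Position, names(toy)))])
toy$Position <- NULL   # remove the Position attribute afterwards
```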
**[Checkpoint 6]**: What's the mean of the "Position.Rating"
attribute created? How many columns are there in the dataframe?
It should be 68 (remember to remove "Position").
```{r}
cat(ifelse(all(c(round(mean(fifa$Position.Rating), 5) == 66.87067,
                 ncol(fifa) == 68)),
           "Correct results!", "Wrong results.."))
```
---
### Dimension Reduction
**[Task 7]**: Perform PCA (Principal Component Analysis) on the
columns representing ratings of positions (that is, attributes:
LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM,
RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the
summary of the components obtained. **Keep the minimum number of
components needed to explain at least 98.50% of the variance.**
Remove the columns used for PCA. **HINT**: Function *prcomp*;
remember to center and scale.
```{r}
# Perform PCA
# Show Summary
```
```{r}
# Put the components back into "fifa" dataframe
# Remove original columns used for PCA
```
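The recipe can be sketched on a built-in dataset; here the four numeric iris columns stand in for fifa's 26 position-rating columns (the 98.50% threshold is the one from the task):

```{r}
# Hedged sketch: PCA with centering/scaling, then keep the minimum
# number of components reaching the cumulative-variance threshold
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cumvar >= 0.985)[1]          # minimum components for >= 98.50%
kept <- pca$x[, 1:k, drop = FALSE]      # components to put back in the data
```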
**[Checkpoint 7]**: How many columns exist in the dataset? It
should be 45.
```{r}
cat(ifelse(ncol(fifa) == 45, "Correct results!", "Wrong results.."))
```
**[Bonus]**: Use the code below to see which columns influenced
the most each component graphically. Replace "fifa.pca" with the
object result from the use of *prcomp* function.
```{r}
library(factoextra)
fviz_pca_var(fifa.pca,
             col.var = "contrib",  # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE          # Avoid text overlapping
)
```
---
### Binarization
**[Task 8]**: Perform binarization on the following categorical
attributes: "Preferred.Foot" and "Work.Rate". **HINT**: R
package "dummies", function *dummy.data.frame*.
```{r}
# Binarize categorical attributes
```
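The hinted call can be sketched on a toy data frame (the toy column, its levels, and `sep = "."` are illustrative assumptions):

```{r}
# Hedged sketch of dummies::dummy.data.frame on toy data: each level of
# the named categorical column becomes a 0/1 column; other columns
# (here Overall) pass through untouched.
library(dummies)
toy <- data.frame(Preferred.Foot = c("Left", "Right", "Right"),
                  Overall = c(80, 75, 90))
toy_bin <- dummy.data.frame(toy, names = "Preferred.Foot", sep = ".")
names(toy_bin)
```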
**[Checkpoint 8]**: How many columns exist in the dataset? It
should be 54.
```{r}
cat(ifelse(ncol(fifa) == 54, "Correct results!", "Wrong results.."))
```
---
### Normalization
**[Task 9]**: Remove attribute "ID" from the "fifa" dataframe,
save attribute "International.Reputation" in a vector named
"IntRep", and then also remove "International.Reputation" from
the "fifa" dataframe. Perform z-score normalization on "fifa",
except for the columns that came from PCA. Finally, combine the
normalized attributes with those from PCA, saving the result in
the "fifa" dataframe. **HINT**: Function *scale*.
```{r}
# Normalize with Z-Score
```
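A toy sketch of the pattern (the toy columns are illustrative; "pc1" stands in for the columns that came from PCA and must stay untouched):

```{r}
# Hedged sketch: z-score selected columns with scale(), then recombine
# with the columns that should not be normalized
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40),
                  pc1 = c(0.1, -0.2, 0.3, -0.4))
norm_part <- as.data.frame(scale(toy[, c("x", "y")]))  # mean 0, sd 1
toy_z <- cbind(norm_part, toy["pc1"])
colMeans(toy_z[, c("x", "y")])   # numerically zero
```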
**[Checkpoint 9]**: How many columns exist in the dataset? It
should be 52. What's the mean of all the attribute means? It
should be around zero.
```{r}
cat(ifelse(ncol(fifa) == 52, "Correct results!", "Wrong results.."))
```
---
### K-Means
**[Task 10]**: Perform K-Means for values of K ranging from 2 to
15. Find the best number of clusters for K-Means clustering,
based on the silhouette score. Report the best number of
clusters and the silhouette score of the corresponding
clustering (replace <ANSWER HERE> below). How strong is the
discovered cluster structure? (Replace <ANSWER HERE> below.) Use
*set.seed(1)*. **HINT**: Function *kmeans* (make use of
parameters *nstart* and *iter.max*) and *silhouette* (from
package "cluster").
```{r}
# K-Means and Silhouette scores
```
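The K-selection loop can be sketched on scaled iris data as a stand-in for the normalized fifa dataframe (the `nstart` and `iter.max` values are illustrative):

```{r}
# Hedged sketch: average silhouette width for K = 2..15, best K reported
library(cluster)
set.seed(1)
X <- scale(iris[, 1:4])
dX <- dist(X)
sil <- sapply(2:15, function(k) {
  km <- kmeans(X, centers = k, nstart = 10, iter.max = 50)
  mean(silhouette(km$cluster, dX)[, 3])   # column 3 = sil_width
})
best_k <- (2:15)[which.max(sil)]
best_k
```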
Results found:
- Best number of clusters: <ANSWER HERE>
- Silhouette score: <ANSWER HERE>
- How strong is the cluster? <ANSWER HERE>
**[Checkpoint 10]**: Are there silhouette scores for K-Means with
K ranging from 2 to 15? Were the best K and corresponding
silhouette score reported?
---
**[Task 11]**: Perform K-Means with the chosen K and get the
resulting groups. Try out several pairs of attributes and
produce scatter plots of this clustering for those pairs of
attributes. By inspecting the plots, determine a pair of
attributes for which the clusters are relatively well-separated
and submit the corresponding scatter plot.
```{r}
# K-Means for best K and Plot
```
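A sketch of the plotting pattern on iris as a stand-in (the attribute pair and K = 2 are illustrative):

```{r}
# Hedged sketch: scatter plot of one attribute pair, points colored
# by their K-Means group
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 2, nstart = 10)
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster, pch = 19,
     xlab = "Petal.Length", ylab = "Petal.Width")
```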
**[Checkpoint 11]**: Is there at least one plot showing two
attributes with the groups (colored or circled) reasonably
separated?
---
### Hierarchical Clustering
**[Task 12]**: Randomly sample 1% of the data (*set.seed(1)*).
Perform hierarchical cluster analysis on this sample using the
algorithms complete linkage, average linkage and single linkage.
Plot the dendrograms resulting from the different methods (all
three methods applied to the same 1% sample). Discuss the
commonalities and differences between the three dendrograms and
try to explain the reasons leading to the differences (replace
the <ANSWER HERE> below).
```{r}
# Sample and calculate distances
```
```{r}
# Complete
```
```{r}
# Average
```
```{r}
# Single
```
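The three linkages share one distance matrix, so only the `method` argument changes. A sketch on a 30-point iris sample as a stand-in for the 1% fifa sample:

```{r}
# Hedged sketch: one sample, one distance matrix, three linkage methods
set.seed(1)
idx <- sample(nrow(iris), 30)
d <- dist(scale(iris[, 1:4])[idx, ])
for (m in c("complete", "average", "single")) {
  hc <- hclust(d, method = m)
  plot(hc, main = paste(m, "linkage"))
}
```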
Discussion:
- <ANSWER HERE>
**[Checkpoint 12]**: Does the discussion cover the commonalities
and differences between the three dendrograms and explain the
differences?
---
### Clustering comparison
**[Task 13]**: Now perform hierarchical cluster analysis on the
**ENTIRE dataset** using the algorithms complete linkage,
average linkage and single linkage. Cut each of the three
resulting dendrograms to obtain a flat clustering with the
number of clusters set to the best K found for K-Means.
To perform an external validation of the clustering results, use
the vector "IntRep" created earlier. What is the Rand index for
the best K-Means clustering? And what are the Rand index values
for the flat clusterings obtained in this task from complete
linkage, average linkage and single linkage? Discuss the
results (replace <ANSWER HERE> below). **HINT**: Function
*cluster_similarity* from package "clusteval".
```{r}
# Hierarchical Clusterings (Complete, Average and Single)
```
```{r}
# Flat Clusterings
```
```{r}
# Cluster Similarities
```
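The cut-and-compare step can be sketched on iris (a stand-in: the species labels play the role of "IntRep", and k = 3 stands in for the chosen K):

```{r}
# Hedged sketch: flat clustering via cutree, then the Rand index via
# the hinted clusteval::cluster_similarity
library(clusteval)
hc <- hclust(dist(scale(iris[, 1:4])), method = "complete")
flat <- cutree(hc, k = 3)                       # cut to K flat clusters
rand <- cluster_similarity(flat, as.integer(iris$Species),
                           similarity = "rand")
rand
```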
Discussion:
- <ANSWER HERE>
**[Checkpoint 13]**: Does the discussion include a relevant
comparison of the clusterings, and does it make sense?