Homework 1 INF 552, Instructor: Mohammad Reza Rajati
1. Vertebral Column Data Set
This biomedical data set was built by Dr. Henrique da Mota during a medical residence
period in Lyon, France. Each patient is represented
by six biomechanical attributes derived from the shape and orientation of the pelvis
and lumbar spine (in this order): pelvic incidence, pelvic tilt, lumbar lordosis angle,
sacral slope, pelvic radius, and grade of spondylolisthesis. The following convention is
used for the class labels: Disk Hernia (DH), Spondylolisthesis (SL), Normal (NO), and
Abnormal (AB). In this exercise, we focus only on a binary classification task, with NO = 0
and AB = 1.
(a) Download the Vertebral Column Data Set from:
https://archive.ics.uci.edu/ml/datasets/Vertebral+Column.
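Once the file is downloaded, it can be read into feature rows and 0/1 labels. The sketch below assumes a whitespace-separated text file with six numeric columns followed by a class label (NO or AB); that layout, the function name parse_vertebral, and the two sample rows are assumptions made for illustration, so check them against the file you actually download.

```python
import io

# Label convention from the assignment: NO = 0, AB = 1.
LABEL_TO_INT = {"NO": 0, "AB": 1}

def parse_vertebral(handle):
    """Read rows of six floats plus a class label from an open text file.

    Assumes whitespace-separated columns; adjust the split if your copy
    of the file uses commas instead.
    """
    features, labels = [], []
    for line in handle:
        parts = line.split()
        if len(parts) != 7:
            continue  # skip blank or malformed lines
        features.append([float(v) for v in parts[:6]])
        labels.append(LABEL_TO_INT[parts[6].upper()])
    return features, labels

# Two made-up rows, only to show the expected shape of the input.
sample = io.StringIO(
    "63.0 22.5 39.6 40.5 98.7 -0.3 AB\n"
    "38.5 16.9 35.1 21.5 127.6 7.9 NO\n"
)
X, y = parse_vertebral(sample)
```

With a real file, replace the StringIO object with open(path) for whatever path the download was saved to.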
(b) Pre-Processing and Exploratory data analysis:
i. Make scatterplots of the independent variables in the dataset. Use color to
show Classes 0 and 1.
ii. Make boxplots for each of the independent variables. Use color to show
Classes 0 and 1 (see ISLR p. 129).
iii. Select the first 70 rows of Class 0 and the first 140 rows of Class 1 as the
training set and the rest of the data as the test set.
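The split in (b)iii depends on row order within each class, not on random sampling. A minimal sketch of that rule follows; the function name split_first_rows and the toy data are hypothetical, and the code assumes the rows are still in the order of the original file.

```python
def split_first_rows(X, y, n_per_class):
    """Put the first n_per_class[c] rows of each class c in the training set.

    Rows keep their original order; every other row goes to the test set.
    """
    taken = {c: 0 for c in n_per_class}
    train_idx, test_idx = [], []
    for i, label in enumerate(y):
        if taken.get(label, 0) < n_per_class.get(label, 0):
            taken[label] += 1
            train_idx.append(i)
        else:
            test_idx.append(i)
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# Toy labels: three rows of class 0 and four rows of class 1, interleaved.
y_toy = [1, 0, 1, 0, 1, 0, 1]
X_toy = [[float(i)] for i in range(len(y_toy))]
Xtr, ytr, Xte, yte = split_first_rows(X_toy, y_toy, {0: 2, 1: 3})
```

For the assignment itself the call would use {0: 70, 1: 140} as the per-class counts.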
(c) Classification using KNN on Vertebral Column Data Set
i. Write code for k-nearest neighbors with Euclidean metric (or use a software
package).
ii. Test all the data in the test set with k-nearest neighbors. Make decisions
by majority polling. Plot train and test errors in terms of k for
k ∈ {208, 205, ..., 7, 4, 1} (in reverse order). You are welcome to use smaller
increments of k. Which k* is the most suitable k among those values? Calculate
the confusion matrix, true positive rate, true negative rate, precision,
and F-score when k = k*.¹
iii. Since the computation time depends on the size of the training set, one may
only use a subset of the training set. Plot the best test error rate,² which
is obtained by some value of k, against the size of the training set, when the
size of the training set is N ∈ {10, 20, 30, ..., 210}.³ Note: for each N, select
your training set by choosing the first ⌊N/3⌋ rows of Class 0 and the first
N − ⌊N/3⌋ rows of Class 1 in the training set you created in 1(b)iii. Also, for
each N, select the optimal k from a set starting from k = 1, increasing by 5.
For example, if N = 200, the optimal k is selected from {1, 6, 11, ..., 196}.
This plot is called a Learning Curve.
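The pieces of part (c) can be sketched in plain Python as follows. This is one possible reading, not a reference solution: the tie-breaking rule, the choice to report 0.0 for an undefined ratio, the helper names, and the upper bound range(1, N, 5) for the candidate k values (which reproduces the N = 200 example) are all assumptions. Class 1 (AB) is treated as the positive class.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_predict(X_train, y_train, query, k, dist=euclidean):
    """Majority vote among the k training rows closest to query."""
    order = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], query))
    votes = Counter(y_train[i] for i in order[:k])
    # Ties go to the smaller label; this tie rule is an arbitrary choice.
    return min(votes, key=lambda c: (-votes[c], c))

def error_rate(X_train, y_train, X_eval, y_eval, k, dist=euclidean):
    wrong = sum(knn_predict(X_train, y_train, x, k, dist) != t
                for x, t in zip(X_eval, y_eval))
    return wrong / len(y_eval)

def binary_metrics(y_true, y_pred):
    """Confusion counts and derived rates, with class 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    ratio = lambda n, d: n / d if d else 0.0  # 0.0 when undefined (assumption)
    tpr, tnr, precision = ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(tp, tp + fp)
    return {"confusion": [[tn, fp], [fn, tp]], "tpr": tpr, "tnr": tnr,
            "precision": precision,
            "f_score": ratio(2 * precision * tpr, precision + tpr)}

def learning_curve_setup(N):
    """Class sizes floor(N/3) and N - floor(N/3), plus candidate k values."""
    n0 = N // 3
    return n0, N - n0, list(range(1, N, 5))

# Candidate k values for (c)ii: 208, 205, ..., 4, 1.
ks = list(range(208, 0, -3))

# Toy one-dimensional data, only to exercise the functions.
X_train = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test, y_test = [[1.5], [10.5], [5.0]], [0, 1, 1]
preds = [knn_predict(X_train, y_train, x, 3) for x in X_test]
err = error_rate(X_train, y_train, X_test, y_test, 3)
m = binary_metrics(y_test, preds)
```

Looping error_rate over ks for both the training and test sets gives the two curves to plot; the k with the lowest test error is the natural candidate for k*.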
Let us further explore some variants of KNN.
¹ We will learn in the lectures what these mean; for now, research how they are computed and compute
them.
² Obviously, use the test data you created in 1(b)iii.
³ For extra practice, you are welcome to choose smaller increments of N.
(d) Replace the Euclidean metric with the following metrics⁴ and test them. Summarize
the test errors (i.e., when k = k*) in a table. Use all of your training data
and select the best k from {1, 6, 11, ..., 196}.
i. Minkowski Distance:
A. which becomes Manhattan Distance with p = 1.
B. with log10(p) ∈ {0.1, 0.2, 0.3, ..., 1}. In this case, use the k* you found
for the Manhattan distance in 1(d)iA. What is the best log10(p)?
C. which becomes Chebyshev Distance with p → ∞
ii. Mahalanobis Distance.⁵
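The three Minkowski cases in (d)i share one formula, D_p(x, y) = (sum_i |x_i − y_i|^p)^(1/p), with the maximum absolute difference as the limit for p → ∞. The sketch below writes that out directly; the function name minkowski and the sample points are hypothetical, and a library metric would normally be used instead.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p; p = math.inf gives Chebyshev."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Orders p implied by log10(p) in {0.1, 0.2, ..., 1}.
p_grid = [10 ** (j / 10) for j in range(1, 11)]

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, 1)          # p = 1
euclid = minkowski(a, b, 2)             # p = 2, for comparison
chebyshev = minkowski(a, b, math.inf)   # p -> infinity
```

For the points above the distance shrinks from 7 (Manhattan) to 5 (Euclidean) to 4 (Chebyshev), which shows how larger p puts more weight on the largest coordinate difference.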
(e) The majority polling decision can be replaced by a weighted decision, in which the
weight of each point in voting is inversely proportional to its distance from the query/test
data point. In this case, closer neighbors of a query point will have a greater
influence than neighbors which are further away. Use weighted voting with Euclidean,
Manhattan, and Chebyshev distances and report the best test errors when
k ∈ {1, 6, 11, 16, ..., 196}.
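A minimal sketch of the weighted decision follows, assuming the k nearest neighbors and their distances have already been found. The weight 1/distance, the small eps guard, and the function name weighted_vote are choices made for this illustration.

```python
from collections import defaultdict

def weighted_vote(neighbor_labels, neighbor_distances, eps=1e-12):
    """Pick the label with the largest total of 1/distance weights.

    eps avoids division by zero when a neighbor coincides with the query;
    its value is an arbitrary choice for this sketch.
    """
    totals = defaultdict(float)
    for label, d in zip(neighbor_labels, neighbor_distances):
        totals[label] += 1.0 / (d + eps)
    return max(totals, key=totals.get)

# Three neighbors: one close point of class 1, two farther points of class 0.
pred = weighted_vote([1, 0, 0], [0.5, 2.0, 4.0])
```

Here a plain majority vote would choose class 0, while the weighted vote chooses class 1 because the single close neighbor carries weight 2.0 against 0.75 for the other two combined. In scikit-learn, KNeighborsClassifier with weights="distance" applies the same kind of inverse-distance weighting.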
(f) What is the lowest training error rate you achieved in this exercise?
⁴ You can use sklearn.neighbors.DistanceMetric (sklearn.metrics.DistanceMetric in recent scikit-learn releases). Research what each distance means.
⁵ Mahalanobis Distance requires inverting the covariance matrix of the data. When the covariance matrix
is singular or ill-conditioned, the data live in a linear subspace of the feature space. In this case, the features
have to be transformed into a reduced feature set in the linear subspace, which is equivalent to using a
pseudoinverse instead of an inverse.
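The pseudoinverse remark can be illustrated with NumPy. The sketch below estimates the covariance from the training rows and uses numpy.linalg.pinv in place of an inverse; the function name mahalanobis_factory, the rcond cutoff, and the toy data (whose second feature is exactly twice the first, so the covariance is singular) are assumptions for illustration.

```python
import math
import numpy as np

def mahalanobis_factory(X_train, rcond=1e-10):
    """Return a distance function built on the pseudoinverse covariance.

    rcond is the relative cutoff below which singular values are treated
    as zero; 1e-10 is an arbitrary choice for this sketch.
    """
    cov = np.cov(np.asarray(X_train, dtype=float), rowvar=False)
    cov_pinv = np.linalg.pinv(cov, rcond=rcond)

    def dist(a, b):
        diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        # Clip tiny negative values caused by rounding before the root.
        return math.sqrt(max(float(diff @ cov_pinv @ diff), 0.0))

    return dist

# Toy data on a line: the covariance matrix has rank 1, so a plain
# inverse does not exist but the pseudoinverse does.
X_line = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
maha = mahalanobis_factory(X_line)
d01 = maha(X_line[0], X_line[1])
```

The returned function takes two points, so it can be passed wherever a custom distance callable is accepted.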

Original article: https://www.cnblogs.com/rrrrrhelper/p/10327710.html

Time: 2024-10-24 18:58:58
