coursera: machine learning--code-3

Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A large C parameter tells the SVM to try to classify all the examples correctly. C plays a role similar to 1/λ, where λ is the regularization parameter that we were using previously for logistic regression.

Most SVM software packages (including svmTrain.m) automatically add the extra feature x0 = 1 for you and automatically take care of learning the intercept term θ0. So when passing your training data to the SVM software, there is no need to add this extra feature x0 = 1 yourself. In particular, in Octave/MATLAB your code should be working with training examples x ∈ R^n (rather than x ∈ R^(n+1)); for example, in the first example dataset x ∈ R^2.

To find non-linear decision boundaries with the SVM, we need to first implement a Gaussian kernel. You can think of the Gaussian kernel as a similarity function that measures the “distance” between a pair of examples, (x(i), x(j)). The Gaussian kernel is also parameterized by a bandwidth parameter, σ, which determines how fast the similarity metric decreases (to 0) as the examples move further apart.

sim = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma^2));
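The same computation can be sketched in Python with NumPy (an illustrative translation, not part of the exercise code):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    """Similarity between two examples: 1 when identical, decaying toward 0
    as the squared distance ||x1 - x2||^2 grows relative to sigma."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-diff.dot(diff) / (2.0 * sigma ** 2))

# Example inputs: x1 = [1, 2, 1], x2 = [0, 4, -1], sigma = 2
# ||x1 - x2||^2 = 1 + 4 + 4 = 9, so sim = exp(-9/8) ≈ 0.324652
sim = gaussian_kernel([1, 2, 1], [0, 4, -1], 2.0)
```

Identical examples always give a similarity of exactly 1, which is a quick sanity check for any kernel implementation.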

By using the Gaussian kernel with the SVM, you will be able to learn a non-linear decision boundary that can perform reasonably well for the dataset.

Figure 5 shows the decision boundary found by the SVM with a Gaussian kernel. The decision boundary is able to separate most of the positive and negative examples correctly and follows the contours of the dataset well.

Use cross-validation to determine the best C and sigma:

params = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]';  % candidate values for both C and sigma
minErr = Inf;
indexC = -1;
indexSigma = -1;
for i = 1:size(params, 1)
    for j = 1:size(params, 1)
        C = params(i);
        sigma = params(j);
        model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
        predictions = svmPredict(model, Xval);
        err = mean(double(predictions ~= yval));  % misclassification rate on the CV set
        if err < minErr
            minErr = err;
            indexC = i;
            indexSigma = j;
        end
    end
end
C = params(indexC);
sigma = params(indexSigma);
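The selection pattern itself is just an exhaustive search over the parameter grid. Here is a sketch of the same idea in Python, where `cv_error` is a purely hypothetical stand-in for training the SVM and measuring its cross-validation error (in the exercise that role is played by svmTrain and svmPredict):

```python
import itertools
import math

# Hypothetical stand-in for svmTrain + svmPredict: a toy error surface whose
# minimum sits at C = 1, sigma = 0.1. Only the search pattern matters here.
def cv_error(C, sigma):
    return abs(math.log10(C)) + abs(math.log10(sigma) + 1.0)

params = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]

# Try every (C, sigma) combination and keep the pair with the lowest CV error
best_C, best_sigma = min(itertools.product(params, params),
                         key=lambda p: cv_error(p[0], p[1]))
```

With 8 candidate values each, the grid search trains 64 models; this is cheap for the exercise datasets but is worth remembering when the grid or the dataset grows.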

  

You will be training a classifier to classify whether a given email, x, is spam (y = 1) or non-spam (y = 0). In particular, you need to convert each email into a feature vector x ∈ R^n. The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.

Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string “httpaddr” to indicate that a URL was present.

This has the effect of letting the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam are very small.
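A minimal sketch of this kind of normalization in Python (the regex patterns and the extra email/number substitutions are illustrative assumptions; the exercise's processEmail.m performs similar replacements):

```python
import re

def normalize_email(text):
    """Replace variable tokens with fixed placeholder strings."""
    text = text.lower()
    # Any URL becomes the single token "httpaddr"
    text = re.sub(r'(http|https)://\S+', 'httpaddr', text)
    # Email addresses become "emailaddr" (assumed pattern)
    text = re.sub(r'\S+@\S+', 'emailaddr', text)
    # Digit sequences become "number" (assumed pattern)
    text = re.sub(r'\d+', 'number', text)
    return text

out = normalize_email("Win 1000 dollars at http://spam.example.com now")
# -> "win number dollars at httpaddr now"
```

Note that the URL substitution must run before the number substitution, otherwise digits inside a URL would be rewritten first and the URL pattern might no longer match.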

idx = -1;
for i = 1:length(vocabList)
    if strcmp(vocabList{i}, str) == 1
        idx = i;          % record the index of the matching vocabulary word
        break;
    end
end

if idx ~= -1
    word_indices = [word_indices; idx];
end
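The linear scan above works, but a hash map makes each lookup O(1) instead of O(n). An equivalent sketch in Python (the vocabulary words shown are a made-up subset):

```python
# Map each vocabulary word to its 1-based index, mirroring vocab.txt
vocab_list = ["aa", "ab", "abil", "click", "number"]  # illustrative subset
vocab = {word: i + 1 for i, word in enumerate(vocab_list)}

word_indices = []
for token in ["click", "unknownword", "number"]:
    idx = vocab.get(token)      # None when the token is not in the vocabulary
    if idx is not None:
        word_indices.append(idx)

# word_indices -> [4, 5]; "unknownword" is simply skipped
```

Tokens outside the vocabulary are dropped rather than recorded, matching the `idx ~= -1` guard in the Octave version.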

  For this exercise, we have chosen only the most frequently occurring words as our set of words considered (the vocabulary list). Since words that occur rarely in the training set appear in only a few emails, they might cause the model to overfit our training set. The complete vocabulary list is in the file vocab.txt and also shown in Figure 10. Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.

% x starts as an n-dimensional zero vector; set the entry for each word present
for i = 1:length(word_indices)
    x(word_indices(i)) = 1;
end
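In Python the same construction might look like this (using the vocabulary size of 1899 stated above; the indices are illustrative outputs of the lookup step):

```python
n = 1899                      # size of the vocabulary list
word_indices = [4, 5, 1899]   # illustrative 1-based indices from the lookup step

# Binary feature vector: x[k] = 1 iff vocabulary word k+1 occurs in the email
x = [0] * n
for idx in word_indices:
    x[idx - 1] = 1            # convert 1-based vocabulary index to 0-based
```

Repeated occurrences of a word leave the entry at 1, so the feature records presence, not count.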

  

Your task in this optional (ungraded) exercise is to download the original files from the public corpus and extract them. After extracting them, you should run the processEmail and emailFeatures functions on each email to extract a feature vector from each email. This will allow you to build a dataset X, y of examples. You should then randomly divide up the dataset into a training set, a cross validation set and a test set.
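One way to sketch the random split in Python (the 60/20/20 proportions are an assumption; the exercise does not fix the ratios):

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split into train / cross-validation / test (60/20/20)."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.6 * n)
    n_cv = int(0.2 * n)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cv, test

train, cv, test = split_dataset(list(range(100)))
```

Shuffling before splitting matters here: the raw corpus is grouped into spam and non-spam files, so a sequential split would give the three sets very different class proportions.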

While you are building your own dataset, we also encourage you to try building your own vocabulary list (by selecting the high frequency words that occur in the dataset) and adding any additional features that you think might be useful.

Date: 2024-10-05 09:46:59
