Reading Notes for Statistical Learning Theory

Let's continue the discussion of reading Vapnik's book Statistical Learning Theory. At the very beginning of the book, Vapnik describes two fundamental approaches in pattern recognition: the parametric estimation approach and the non-parametric estimation approach. Before introducing the non-parametric approach, to which the support vector machine belongs, Vapnik first addresses the following three beliefs that the philosophy of the parametric approach stands on (pages 4 to 5):

  1. A function defined by a limited number of parameters exists that is a good approximation to the desired function;
  2. The normal (Gaussian) law holds for most real-life problems;
  3. The maximum likelihood method is a good tool for estimating the parameters.

In my opinion, the first condition should be required for all machine learning approaches, whether parametric or non-parametric. The second point is based on the central limit theorem, which states that the sum (or mean) of a large number of independent random variables approximately follows a Gaussian distribution. If we pre-process the dataset so that the data are centred at the origin (and scaled to unit variance), this Gaussian becomes the standard normal distribution, which is why the term "normal law" is used. In my opinion, it may be better to simply highlight the assumption of independence, since independence is the more fundamental assumption; the Gaussian distribution is only a consequence of it in this special situation. Regarding the third point, the statement seems a little too absolute, as maximum likelihood is not the only reasonable way to estimate parameters.
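To make the central limit theorem behind the second belief concrete, here is a minimal sketch (my own illustration, not from the book) that averages independent uniform random variables and standardises the result; the histogram of the standardised means should look close to a standard normal density. The batch sizes and the uniform source distribution are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many batches of independent Uniform(0, 1) variables and average each batch.
n_batches, batch_size = 10_000, 50
means = rng.uniform(0.0, 1.0, size=(n_batches, batch_size)).mean(axis=1)

# Standardise: centre the values at the origin and scale to unit variance,
# so the limiting Gaussian becomes the standard normal ("normal law").
z = (means - means.mean()) / means.std()

# Under a standard normal, P(|Z| > 2) is roughly 4.6%; the empirical fraction should be close.
print("fraction of |z| > 2:", np.mean(np.abs(z) > 2))
```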

Although we see some limitations in these statements, as we keep reading the book it becomes clear that the author simply wanted to use these assumptions, which many methods follow, to highlight the limitations of parametric approaches. Experts in parametric learning approaches may well be able to argue these points, as such debates are common in the academic world.

Vapnik then introduces the Perceptron algorithm of 1958 and the empirical risk minimisation (ERM) criterion used for machine learning. It is of interest to note that ERM measures the error on the training samples, while the real problem of machine learning is to estimate the unobserved behaviour on the test dataset. This gap leads to the problem of overfitting: when the number of training samples is small, the model may fit the training samples well but lack generalisation, achieving poor performance on the test dataset.
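As a concrete illustration of ERM with the perceptron, the sketch below minimises the empirical misclassification risk on a toy linearly separable dataset using Rosenblatt-style updates; the data, learning rate, and epoch count are my own illustrative choices, not taken from the book.

```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Fraction of training samples misclassified by sign(w.x + b) -- the ERM objective."""
    preds = np.sign(X @ w + b)
    return np.mean(preds != y)

def perceptron(X, y, epochs=100, lr=1.0):
    """Rosenblatt perceptron: update on every misclassified training sample."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or exactly on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data; labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w, b = perceptron(X, y)
print("empirical risk on training set:", empirical_risk(w, b, X, y))
```

Note that the empirical risk printed here says nothing by itself about performance on unseen samples, which is exactly the gap discussed above.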

Exactly as I expected, the next problem the author addresses is the generalisation of the algorithm, and here the very important VC dimension theory is introduced. The basic motivation of the VC dimension relates to density estimation. We know that, by the law of large numbers, the relative frequency of an event converges to its true probability as the sample size approaches infinity. However, in reality our training dataset is always finite. This drives the author to construct a more general theory of what can be estimated from a finite training dataset, built around the so-called VC dimension, which measures the capacity of the set of functions the machine can implement. The motivation of the support vector machine is that, among machines that fit the training data equally well, the one with the lowest VC dimension is the best.
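For reference, here is a commonly cited form of Vapnik's generalisation bound (my own addition, stated from the standard formulation for the 0/1 loss rather than from the page under discussion): with probability at least 1 − η, the true risk R(α) of any function in a class of VC dimension h trained on l samples satisfies

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}
```

For a fixed empirical risk and sample size, a smaller VC dimension h gives a tighter guarantee on the true risk, which is exactly the intuition behind preferring the machine with the lowest VC dimension.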

The author then presents the main principle of designing a learning machine based on a dataset of limited size. The principle is that, when we need a specific density, we should try to estimate that density directly, rather than derive it by first estimating the more general densities on which it depends.
For example, if we can estimate the conditional probability directly, we may not need to estimate the probability of the condition and the probability of the event under each condition separately. More importantly, with limited information such as a small training dataset, we may only be able to estimate the more specific density. On the other hand, the problem we are going to solve is to predict the class of unobserved samples, which requires the machine to produce a solution more general than its training dataset or any specific test point: the machine should be capable of evaluating any sample in the feature space.
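To spell out the conditional-probability example in symbols (my own rendering, not a quotation from the book): by Bayes' rule, the conditional probability P(y | x) that we actually need for classification can in principle be obtained from the class prior P(y) and the class-conditional density P(x | y),

```latex
P(y \mid x) \;=\; \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}
```

Estimating P(x | y) for every class is the more general and harder problem; the principle says that when data are limited and P(y | x) can be estimated directly, that is the route to prefer.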

Date: 2024-08-25 07:58:58
