Hierarchical Bayesian Learning

Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics.

This is the inference framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based.

Apart from frequentist inference, the main alternative approach to statistical inference is Bayesian inference, while another is fiducial inference.

There are two major differences between the frequentist and Bayesian approaches to inference that are not covered by the above consideration of the interpretation of probability:

In a frequentist approach to inference, unknown parameters are often, but not always, treated as having fixed but unknown values that cannot be treated as random variates in any sense, and hence there is no way that probabilities can be associated with them. The frequentist approach is aimed at making operational decisions and estimating parameters, with or without confidence intervals, and frequentist inference is based solely on the (one set of) evidence at hand.

In contrast, a Bayesian approach to inference does allow probabilities to be associated with unknown parameters, where these probabilities can sometimes have a frequency probability interpretation as well as a Bayesian one. The Bayesian approach allows these probabilities to have an interpretation as representing the scientist's belief that given values of the parameter are true.

Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.

While "probabilities" are involved in both approaches to inference, the probabilities are associated with different types of things. The result of a Bayesian approach can be a probability distribution for what is known about the parameters given the results of the experiment or study. The result of a frequentist approach is either a "true or false" conclusion from a significance test or a conclusion in the form that a given sample-derived confidence interval covers the true value: either of these conclusions has a given probability of being correct, where this probability has either a frequency probability interpretation or a pre-experiment interpretation.

Example: A frequentist does not say that there is a 95% probability that the true value of a parameter lies within a confidence interval, saying instead that 95% of confidence intervals contain the true value.
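This frequency interpretation is easy to check empirically. Below is a minimal simulation sketch (the true mean, known variance, sample size, and replication count are arbitrary choices for illustration): it repeatedly constructs the standard 95% interval for a normal mean with known variance and records how often the interval covers the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, reps = 10.0, 2.0, 30, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mu, sigma, size=n)
    # 95% CI for the mean with known sigma: xbar +/- 1.96 * sigma / sqrt(n)
    half_width = 1.96 * sigma / np.sqrt(n)
    xbar = sample.mean()
    covered += (xbar - half_width <= true_mu <= xbar + half_width)

print(f"Empirical coverage: {covered / reps:.3f}")  # close to 0.95
```

No single interval is assigned a probability of containing the true value; the 95% refers to the long-run proportion of intervals, constructed this way, that cover it.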

Efron's comparative adjectives

                             Bayes                          Frequentist
  Basis                      Belief (prior)                 Behavior (method)
  Resulting Characteristic   Principled Philosophy          Opportunistic Methods
  _                          One distribution               Many distributions (bootstrap?)
  Ideal Application          Dynamic (repeated sampling)    Static (one sample)
  Target Audience            Individual (subjective)        Community (objective)
  Modeling Characteristic    Aggressive                     Defensive

Probability and Likelihood

A probability refers to variable data for a fixed hypothesis while a likelihood refers to variable hypotheses for a fixed set of data.

Each fixed set of observational conditions is associated with a probability distribution and each set of observations can be interpreted as a sample from that distribution – the frequentist view of probability.

Alternatively, a set of observations may result from sampling any of a number of distributions (each resulting from a set of observational conditions). The probabilistic relationship between a fixed sample and a variable distribution (resulting from a variable hypothesis) is termed likelihood – a Bayesian view of probability.

The likelihood principle says that all of the information in a sample is contained in the likelihood function, which is accepted as a valid probability distribution by Bayesians (but not by frequentists).
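To see the distinction in miniature, the sketch below (illustrative numbers only) first fixes a binomial hypothesis p = 0.6 and computes the probability of every possible outcome, then fixes an observed outcome of 7 successes in 10 trials and computes the likelihood of that same data under a range of hypotheses p. The first set of values sums to 1 over the data; the second does not sum to 1 over the hypotheses, which is why frequentists do not treat the likelihood function as a probability distribution over parameters.

```python
import numpy as np
from scipy.stats import binom

# Probability: fix the hypothesis (p = 0.6), let the data vary.
p_fixed = 0.6
probs = binom.pmf(np.arange(11), n=10, p=p_fixed)
print(probs.sum())  # 1.0 -- a genuine probability distribution over outcomes

# Likelihood: fix the data (7 successes out of 10), let the hypothesis vary.
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(7, n=10, p=p_grid)
print(likelihood.sum())  # not 1 -- the likelihood is not a distribution over p
```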

Many statisticians accept the cautionary words of statistician George Box: "All models are wrong, but some are useful."

Bayes’ theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options.

Suppose, in a study of the effectiveness of cardiac treatments, the patients in hospital j have survival probability $\theta_j$. The survival probability will be updated with the occurrence of $y$, the event in which a hypothetical controversial serum is created which, as believed by some, increases survival in cardiac patients.

In order to make updated probability statements about $\theta_j$, given the occurrence of event $y$, we must begin with a model providing a joint probability distribution for $\theta_j$ and $y$. This can be written as a product of the two distributions that are often referred to as the prior distribution $P(\theta_j)$ and the sampling distribution $P(y \mid \theta_j)$ respectively:

$$P(\theta_j, y) = P(\theta_j)\,P(y \mid \theta_j)$$

Using the basic property of conditional probability, the posterior distribution will yield:

$$P(\theta_j \mid y) = \frac{P(\theta_j, y)}{P(y)} = \frac{P(y \mid \theta_j)\,P(\theta_j)}{P(y)}$$

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference, which aims to incorporate the updated belief, $P(\theta_j \mid y)$, in appropriate and solvable ways.
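As a minimal numerical illustration of the theorem (the prior and conditional probabilities below are made-up values, not taken from the cardiac study), suppose $\theta$ can take two values and we observe an event $y$:

```python
# Hypothetical discrete example of Bayes' theorem: P(theta | y) = P(y | theta) P(theta) / P(y)
prior = {"theta_low": 0.7, "theta_high": 0.3}          # P(theta), assumed values
sampling = {"theta_low": 0.2, "theta_high": 0.9}       # P(y | theta), assumed values

evidence = sum(sampling[t] * prior[t] for t in prior)  # P(y), by total probability
posterior = {t: sampling[t] * prior[t] / evidence for t in prior}
print(posterior)  # updated beliefs about theta after observing y
```

Observing $y$ shifts belief toward the value of $\theta$ under which $y$ was more probable, exactly as the formula prescribes.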

Exchangeability

Finite exchangeability

If $y_1, y_2, \ldots, y_n$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. For example, suppose a box contains one blue ball and one red ball, and balls are drawn without replacement. Then the probability of drawing the red ball first and the probability of drawing the blue ball first are both 1/2.

But the probability of selecting a red ball on the second draw, given that the red ball has already been selected in the first draw, is 0, and is not equal to the unconditional probability that the red ball is selected on the second draw, which is equal to 1/2 (so $y_1$ and $y_2$ are still exchangeable).

Thus, $y_1$ and $y_2$ are not independent.
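The two-ball example can be checked by enumerating both draw orders (a tiny sketch; the box contents are exactly those in the example above):

```python
from itertools import permutations

# One blue ball and one red ball, drawn without replacement.
orders = list(permutations(["red", "blue"]))          # [('red','blue'), ('blue','red')]

p_red_first  = sum(o[0] == "red" for o in orders) / len(orders)   # 1/2
p_red_second = sum(o[1] == "red" for o in orders) / len(orders)   # 1/2 -- same marginal: exchangeable

# Conditioning on the first draw changes the second draw completely:
p_red_second_given_red_first = sum(
    o[1] == "red" for o in orders if o[0] == "red"
) / sum(o[0] == "red" for o in orders)                            # 0 -- so the draws are not independent

print(p_red_first, p_red_second, p_red_second_given_red_first)
```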

Infinite exchangeability

An infinite sequence $y_1, y_2, \ldots$ is said to be infinitely exchangeable if every finite subsequence of it is exchangeable.

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely:

1. Hyperparameter: parameter of the prior distribution

2. Hyperprior: distribution of a hyperparameter

Say a random variable $Y$ follows a normal distribution with parameter $\theta$ as the mean and 1 as the variance, that is $Y \mid \theta \sim N(\theta, 1)$. The parameter $\theta$ has a prior distribution given by a normal distribution with mean $\mu$ and variance 1, i.e. $\theta \mid \mu \sim N(\mu, 1)$. Furthermore, $\mu$ follows another distribution given, for example, by the standard normal distribution $N(0, 1)$. The parameter $\mu$ is called the hyperparameter, while its distribution, given by $N(0, 1)$, is an example of a hyperprior distribution.

The notation of the distribution of $Y$ changes as another parameter is added, i.e. $Y \mid \theta, \mu \sim N(\theta, 1)$. If there is another stage, say $\mu$ follows another normal distribution with mean $\beta$ and variance $\epsilon$, meaning $\mu \sim N(\beta, \epsilon)$, then $\beta$ and $\epsilon$ can also be called hyperparameters, and their distributions are hyperprior distributions as well.
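A short sketch of ancestral sampling from this three-level hierarchy (the $N(0,1)$ hyperprior and unit variances are exactly those assumed above; the number of draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 5

mu = rng.normal(0.0, 1.0, size=n_draws)        # hyperprior:  mu ~ N(0, 1)
theta = rng.normal(mu, 1.0)                    # prior:       theta | mu ~ N(mu, 1)
y = rng.normal(theta, 1.0)                     # likelihood:  Y | theta ~ N(theta, 1)

for m, t, obs in zip(mu, theta, y):
    print(f"mu={m:+.2f}  theta={t:+.2f}  y={obs:+.2f}")
```

Sampling top-down like this is how the generative story of a hierarchical model is usually read; inference runs in the opposite direction, from the observed $y$ back up to $\theta$ and $\mu$.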

Framework

Let $y_j$ be an observation and $\theta_j$ a parameter governing the data generating process for $y_j$.

Assume further that the parameters $\theta_1, \theta_2, \ldots, \theta_j$ are generated exchangeably from a common population, with distribution governed by a hyperparameter $\phi$.
The Bayesian hierarchical model contains the following stages:

Stage I: $y_j \mid \theta_j, \phi \sim P(y_j \mid \theta_j, \phi)$
Stage II: $\theta_j \mid \phi \sim P(\theta_j \mid \phi)$
Stage III: $\phi \sim P(\phi)$

The likelihood, as seen in stage I, is $P(y_j \mid \theta_j, \phi)$, with $P(\theta_j, \phi)$ as its prior distribution. Note that the likelihood depends on $\phi$ only through $\theta_j$.

The prior distribution from stage I can be broken down into:

$$P(\theta_j, \phi) = P(\theta_j \mid \phi)\,P(\phi) \qquad \text{[from the definition of conditional probability]}$$

with $\phi$ as its hyperparameter with hyperprior distribution $P(\phi)$.

Thus, the posterior distribution is proportional to:

$$P(\phi, \theta_j \mid y) \propto P(y_j \mid \theta_j, \phi)\,P(\theta_j, \phi) \qquad \text{[using Bayes' theorem]}$$

$$P(\phi, \theta_j \mid y) \propto P(y_j \mid \theta_j)\,P(\theta_j \mid \phi)\,P(\phi)$$
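For several exchangeable units $j = 1, \ldots, J$, and under the usual assumption (implicit in the stages above) that the $y_j$ are conditionally independent given their $\theta_j$, the whole hierarchy factorizes into one joint distribution; the posterior over all unknowns is proportional to this product with the $y_j$ held fixed:

$$P(y_{1:J}, \theta_{1:J}, \phi) = P(\phi)\,\prod_{j=1}^{J} P(\theta_j \mid \phi)\,P(y_j \mid \theta_j), \qquad P(\theta_{1:J}, \phi \mid y_{1:J}) \propto P(\phi)\,\prod_{j=1}^{J} P(\theta_j \mid \phi)\,P(y_j \mid \theta_j)$$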

Example

To further illustrate this, consider the example: A teacher wants to estimate how well a male student did on the SAT. He uses information on the student's high school grades and his current grade point average (GPA) to come up with an estimate. The current GPA, denoted by $Y$, has a likelihood given by some probability function with parameter $\theta$, i.e. $Y \mid \theta \sim P(Y \mid \theta)$. This parameter $\theta$ is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter $\phi$, which is the high school grade of the student. That is, $\theta \mid \phi \sim P(\theta \mid \phi)$. Moreover, the hyperparameter $\phi$ follows its own distribution given by $P(\phi)$, a hyperprior. To solve for the SAT score given information on the GPA,

$$P(\theta, \phi \mid Y) \propto P(Y \mid \theta, \phi)\,P(\theta, \phi) = P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi)$$

All information in the problem will be used to solve for the posterior distribution. Instead of using only the prior distribution and the likelihood function, the use of hyperpriors adds information, allowing more accurate beliefs about the behavior of the parameter.
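A minimal sketch of this computation on a discrete grid, using made-up distributions (the normal forms, means, variances, and the mapping of the GPA onto the SAT scale below are assumptions for illustration, not part of the original example): the joint posterior over $(\theta, \phi)$ is evaluated as $P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi)$ on the grid and then normalized.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical scales: theta = SAT score, phi = high-school grade summary, Y = observed GPA signal.
theta_grid = np.linspace(400, 1600, 121)      # grid over SAT scores
phi_grid = np.linspace(400, 1600, 121)        # grid over the high-school-grade parameter
Y_obs = 1100.0                                # observed GPA, mapped onto the SAT scale (assumption)

T, P = np.meshgrid(theta_grid, phi_grid, indexing="ij")

likelihood = norm.pdf(Y_obs, loc=T, scale=100)   # P(Y | theta), assumed normal
prior      = norm.pdf(T, loc=P, scale=80)        # P(theta | phi), assumed normal
hyperprior = norm.pdf(P, loc=1000, scale=200)    # P(phi), assumed normal hyperprior

joint = likelihood * prior * hyperprior          # proportional to P(theta, phi | Y)
joint /= joint.sum()                             # normalize over the grid

theta_posterior = joint.sum(axis=1)              # marginal posterior for the SAT score
print("Posterior mean SAT estimate:", (theta_grid * theta_posterior).sum())
```

The grid approximation is only practical for a handful of parameters; realistic hierarchical models are usually fit with MCMC, as in the sampling-algorithm notes referenced elsewhere.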

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

$$P(\theta, \phi \mid Y) = \frac{P(Y \mid \theta, \phi)\,P(\theta, \phi)}{P(Y)} = \frac{P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi)}{P(Y)} \propto P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi)$$

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by:

$$P(\theta, \phi, X \mid Y) = \frac{P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi \mid X)\,P(X)}{P(Y)} \propto P(Y \mid \theta)\,P(\theta \mid \phi)\,P(\phi \mid X)\,P(X)$$
