Maximum Likelihood Estimation by Example | Fitting a Model by Maximum Likelihood (MLE)

Reference: Fitting a Model by Maximum Likelihood

Maximum likelihood estimation is used to estimate model parameters. First we must choose a model and compare it against a given data set, then construct the joint probability function of the data. Because the data set is given, this function depends only on the model parameters; by taking derivatives we can find the parameter values that maximise its value (the likelihood).

Maximum-Likelihood Estimation (MLE) is a statistical technique for estimating model parameters. It basically sets out to answer the question: what model parameters are most likely to characterise a given set of data? First you need to select a model for the data. And the model must have one or more (unknown) parameters. As the name implies, MLE proceeds to maximise a likelihood function, which in turn maximises the agreement between the model and the data.

Most illustrative examples of MLE aim to derive the parameters for a probability density function (PDF) of a particular distribution. In this case the likelihood function is obtained by considering the PDF not as a function of the sample variable, but as a function of the distribution’s parameters. For each data point one then has a function of the distribution’s parameters. The joint likelihood of the full data set is the product of these functions. This product is generally very small indeed, so the likelihood function is normally replaced by a log-likelihood function. Maximising either the likelihood or the log-likelihood function yields the same results, but the latter is just a little more tractable!
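In symbols: for independent observations x_1, …, x_n drawn from a density f with parameter vector θ, the likelihood and log-likelihood are

L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta),
\qquad
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)

Because the logarithm is monotone increasing, maximising \ell(\theta) picks out the same parameters as maximising L(\theta). The code below works with the negative log-likelihood, -\ell(\theta), which is the form that mle() expects.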

Fitting a Normal Distribution

Let’s illustrate with a simple example: fitting a normal distribution. First we generate some data.

> set.seed(1001)

>

> N <- 100

>

> x <- rnorm(N, mean = 3, sd = 2)

>

> mean(x)

[1] 2.998305

> sd(x)

[1] 2.288979

Then we formulate the log-likelihood function.

> LL <- function(mu, sigma) {

+     R = dnorm(x, mu, sigma)

+     # mle() minimises, so return the negative log-likelihood

+     -sum(log(R))

+ }
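Before handing this function to an optimiser it can be sanity-checked directly: the negative log-likelihood should be smaller (better) near the true parameters than far from them. The exact values depend on the simulated sample, so no output is shown here.

LL(3, 2)     # near the true parameters: a relatively small negative log-likelihood
LL(0, 10)    # far from the truth: a much larger value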

And apply MLE to estimate the two parameters (mean and standard deviation) for which the normal distribution best describes the data.

> library(stats4)

>

> mle(LL, start = list(mu = 1, sigma=1))

Call:

mle(minuslogl = LL, start = list(mu = 1, sigma = 1))

Coefficients:

      mu    sigma

2.998305 2.277506

Warning messages:

1: In dnorm(x, mu, sigma) : NaNs produced

2: In dnorm(x, mu, sigma) : NaNs produced

3: In dnorm(x, mu, sigma) : NaNs produced

Those warnings are a little disconcerting! They are produced when negative values are attempted for the standard deviation.

> dnorm(x, 1, -1)

  [1] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

 [30] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

 [59] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

 [88] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

There are two ways to sort this out. The first is to apply constraints on the parameters. The mean does not require a constraint but we insist that the standard deviation is positive.

> mle(LL, start = list(mu = 1, sigma=1), method = "L-BFGS-B", lower = c(-Inf, 0),

      upper = c(Inf, Inf))

Call:

mle(minuslogl = LL, start = list(mu = 1, sigma = 1), method = "L-BFGS-B",

    lower = c(-Inf, 0), upper = c(Inf, Inf))

Coefficients:

      mu    sigma

2.998304 2.277506

This works because mle() calls optim(), which has a number of optimisation methods. The default method is BFGS. An alternative, the L-BFGS-B method, allows box constraints.

The other solution is to simply ignore the warnings. It’s neater and produces the same results.

> LL <- function(mu, sigma) {

+     R = suppressWarnings(dnorm(x, mu, sigma))

+     #

+     -sum(log(R))

+ }

>

> mle(LL, start = list(mu = 1, sigma=1))

Call:

mle(minuslogl = LL, start = list(mu = 1, sigma = 1))

Coefficients:

      mu    sigma

2.998305 2.277506

The maximum-likelihood values for the mean and standard deviation are damn close to the corresponding sample statistics for the data. Of course, they do not agree perfectly with the values used when we generated the data: the results can only be as good as the data. If there were more samples then the results would be closer to these ideal values.
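In fact, for the normal distribution the MLE has a closed form, which also explains why the fitted sigma (2.2775) is slightly smaller than sd(x) (2.2890): the MLE of the variance divides by N, whereas sd() divides by N - 1. A quick check (this should reproduce the fitted sigma up to optimiser tolerance):

mean(x)                        # MLE of mu: the sample mean
sqrt(mean((x - mean(x))^2))    # MLE of sigma: divides by N rather than N - 1
sd(x) * sqrt((N - 1) / N)      # equivalently, rescale the sample standard deviation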

A note of caution: if your initial guess for the parameters is too far off then things can go seriously wrong!

> mle(LL, start = list(mu = 0, sigma=1))

Call:

mle(minuslogl = LL, start = list(mu = 0, sigma = 1))

Coefficients:

      mu    sigma

 51.4840 226.8299

Fitting a Linear Model

Now let’s try something a little more sophisticated: fitting a linear model. As before, we generate some data.

> x <- runif(N)

> y <- 5 * x + 3 + rnorm(N)

We can immediately fit this model using least squares regression.

> fit <- lm(y ~ x)

>

> summary(fit)

Call:

lm(formula = y ~ x)

Residuals:

     Min       1Q   Median       3Q      Max

-1.96206 -0.59016 -0.00166  0.51813  2.43778

Coefficients:

            Estimate Std. Error t value Pr(>|t|)

(Intercept)   3.1080     0.1695   18.34   <2e-16 ***

x             4.9516     0.2962   16.72   <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8871 on 98 degrees of freedom

Multiple R-squared:  0.7404,    Adjusted R-squared:  0.7378

F-statistic: 279.5 on 1 and 98 DF,  p-value: < 2.2e-16

The values for the slope and intercept are very satisfactory. No arguments there. We can superimpose the fitted line onto a scatter plot.

> plot(x, y)

> abline(fit, col = "red")

Pushing on to the MLE for the linear model parameters. First we need a likelihood function. The model is not a PDF, so we can’t proceed in precisely the same way that we did with the normal distribution. However, if you fit a linear model then you want the residuals to be normally distributed. So the likelihood function fits a normal distribution to the residuals.

LL <- function(beta0, beta1, mu, sigma) {

    # Find residuals

    #

    R = y - x * beta1 - beta0

    #

    # Calculate the likelihood for the residuals (with mu and sigma as parameters)

    #

    R = suppressWarnings(dnorm(R, mu, sigma))

    #

    # Sum the log likelihoods for all of the data points and negate (mle() minimises)

    #

    -sum(log(R))

}

One small refinement that one might make is to move the logarithm into the call to dnorm().

LL <- function(beta0, beta1, mu, sigma) {

    R = y - x * beta1 - beta0

    #

    R = suppressWarnings(dnorm(R, mu, sigma, log = TRUE))

    #

    -sum(R)

}
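Besides being tidier, the log = TRUE form is more robust numerically: far out in the tails the density underflows to zero in double precision, so taking the log afterwards gives -Inf, whereas dnorm() can compute the log-density directly. For example:

dnorm(50, 0, 1)              # underflows to 0
log(dnorm(50, 0, 1))         # ... so the log is -Inf
dnorm(50, 0, 1, log = TRUE)  # computed directly: roughly -1250.9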

It turns out that the initial guess is again rather important and a poor choice can result in errors. We will return to this issue a little later.

> fit <- mle(LL, start = list(beta0 = 3, beta1 = 1, mu = 0, sigma=1))

Error in solve.default(oout$hessian) :

  system is computationally singular: reciprocal condition number = 3.01825e-22

> fit <- mle(LL, start = list(beta0 = 5, beta1 = 3, mu = 0, sigma=1))

Error in solve.default(oout$hessian) :

  Lapack routine dgesv: system is exactly singular: U[4,4] = 0

But if we choose values that are reasonably close then we get a decent outcome.

> fit <- mle(LL, start = list(beta0 = 4, beta1 = 2, mu = 0, sigma=1))

> fit

Call:

mle(minuslogl = LL, start = list(beta0 = 4, beta1 = 2, mu = 0,

    sigma = 1))

Coefficients:

     beta0      beta1         mu      sigma

 3.5540217  4.9516133 -0.4459783  0.8782272

The maximum-likelihood estimates for the slope (beta1) and intercept (beta0) are not too bad. But there is a troubling warning about NaNs being produced in the summary output below.

> summary(fit)

Maximum likelihood estimation

Call:

mle(minuslogl = LL, start = list(beta0 = 4, beta1 = 2, mu = 0,

    sigma = 1))

Coefficients:

        Estimate Std. Error

beta0  3.5540217        NaN

beta1  4.9516133  0.2931924

mu    -0.4459783        NaN

sigma  0.8782272  0.0620997

-2 log L: 257.8177

Warning message:

In sqrt(diag(object@vcov)) : NaNs produced

It stands to reason that we actually want the residuals to have zero mean. We can apply this constraint by specifying mu as a fixed parameter. Another option would be to simply replace mu with 0 in the call to dnorm(), but fixing the parameter is just a little more flexible.
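For reference, the less flexible option mentioned above would look something like this (a sketch only, with LL0 as an ad hoc name; the results below use the fixed-parameter approach):

LL0 <- function(beta0, beta1, sigma) {
    R = y - x * beta1 - beta0
    #
    # Residuals are assumed to have zero mean
    #
    -sum(dnorm(R, 0, sigma, log = TRUE))
}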

> fit <- mle(LL, start = list(beta0 = 2, beta1 = 1.5, sigma=1), fixed = list(mu = 0),

             nobs = length(y))

> summary(fit)

Maximum likelihood estimation

Call:

mle(minuslogl = LL, start = list(beta0 = 2, beta1 = 1.5, sigma = 1),

    fixed = list(mu = 0), nobs = length(y))

Coefficients:

       Estimate Std. Error

beta0 3.1080423 0.16779428

beta1 4.9516164 0.29319233

sigma 0.8782272 0.06209969

-2 log L: 257.8177

The resulting estimates for the slope and intercept are rather good. And we have standard errors for these parameters as well.

How about assessing the overall quality of the model? We can look at the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). These can be used to compare the performance of different models for a given set of data.

> AIC(fit)

[1] 263.8177

> BIC(fit)

[1] 271.6332

> logLik(fit)

'log Lik.' -128.9088 (df=3)
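These values are consistent with the definitions AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where k = 3 is the number of estimated parameters (beta0, beta1 and sigma; mu is fixed) and n = 100 is the number of observations:

ll <- as.numeric(logLik(fit))   # -128.9088
k  <- 3                         # beta0, beta1, sigma (mu is fixed)
n  <- length(y)                 # 100
-2 * ll + 2 * k                 # matches AIC(fit) above
-2 * ll + k * log(n)            # matches BIC(fit) above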

Returning now to the errors mentioned above. Both of the cases where the call to mle() failed resulted from problems with inverting the Hessian matrix. With the implementation of mle() in the stats4 package there is really no way to get around this problem apart from having a good initial guess. In some situations, though, this is just not feasible. There are, however, alternative implementations of MLE which circumvent this problem. The bbmle package has mle2(), which offers essentially the same functionality but includes the option of not inverting the Hessian matrix.

> library(bbmle)

>

> fit <- mle2(LL, start = list(beta0 = 3, beta1 = 1, mu = 0, sigma = 1))

>

> summary(fit)

Maximum likelihood estimation

Call:

mle2(minuslogl = LL, start = list(beta0 = 3, beta1 = 1, mu = 0,

    sigma = 1))

Coefficients:

      Estimate Std. Error z value  Pr(z)   

beta0 3.054021   0.083897 36.4019 <2e-16 ***

beta1 4.951617   0.293193 16.8886 <2e-16 ***

mu    0.054021   0.083897  0.6439 0.5196   

sigma 0.878228   0.062100 14.1421 <2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-2 log L: 257.8177

Here mle2() is called with the same initial guess that broke mle(), but it works fine. The summary information for the optimal set of parameters is also more extensive.
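bbmle also provides likelihood-profiling tools; for example, assuming the fit object above, something like the following should give profile-based confidence intervals for the parameters (output not shown here):

confint(fit)   # profile-likelihood confidence intervals for beta0, beta1, mu and sigma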

Fitting a linear model is just a toy example. However, Maximum-Likelihood Estimation can be applied to models of arbitrary complexity. If the model residuals are expected to be normally distributed then a log-likelihood function based on the one above can be used. If the residuals conform to a different distribution then the appropriate density function should be used instead of dnorm().
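As a hedged sketch of that last point (the model, data and names here are hypothetical, not part of the original post): fitting an exponential-decay model with normally distributed residuals only requires changing the line that computes the residuals.

# Hypothetical data from a decay model: y = a * exp(-b * x) + noise
x <- runif(N)
y <- 2 * exp(-3 * x) + rnorm(N, sd = 0.1)

LL_decay <- function(a, b, sigma) {
    # Residuals of the nonlinear model
    R = y - a * exp(-b * x)
    #
    # Log-likelihood of the residuals under a zero-mean normal distribution
    -sum(suppressWarnings(dnorm(R, 0, sigma, log = TRUE)))
}

# Fit exactly as before (a sensible starting guess still matters), e.g.:
# fit_decay <- mle2(LL_decay, start = list(a = 1, b = 1, sigma = 0.5))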

Original source: https://www.cnblogs.com/leezx/p/8635355.html
