Let $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ be a training set of i.i.d. examples drawn from an unknown distribution. The standard probabilistic interpretation of linear regression states that
$$ y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad i=1, \dots, m $$
where the $\varepsilon^{(i)}$ are i.i.d. “noise” variables, each distributed as $\mathcal N(0, \sigma^2)$. It follows that $y^{(i)} - \theta^T x^{(i)} \sim \mathcal N(0, \sigma^2) $, or equivalently,
$$ p(y^{(i)} \mid x^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) $$
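As a concrete illustration, the following NumPy sketch draws a synthetic training set under exactly this generative assumption. The dimension `n`, sample size `m`, noise level `sigma`, and the “true” parameter vector are arbitrary choices made only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 50, 3          # number of examples and input dimension (arbitrary for the example)
sigma = 0.5           # noise standard deviation

theta_true = rng.normal(size=n)          # "true" parameter vector used to generate the data
X = rng.normal(size=(m, n))              # design matrix; row i is x^{(i)T}
eps = rng.normal(scale=sigma, size=m)    # i.i.d. N(0, sigma^2) noise
y = X @ theta_true + eps                 # y^{(i)} = theta^T x^{(i)} + eps^{(i)}
```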
In Bayesian linear regression, we assume that a prior distribution over parameters is also given; a typical choice, for instance, is $\theta \sim \mathcal N(0, \tau^2 I)$. Using Bayes’s rule, we obtain the parameter posterior,
\begin{equation} p(\theta | S) = \frac{p(\theta)\, p(S | \theta)}{\int_{\theta'} p(\theta')\, p(S | \theta')\, d\theta'} = \frac{p(\theta) \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}, \theta)}{\int_{\theta'} p(\theta') \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}, \theta')\, d\theta'} \label{ppost}\end{equation}
Assuming the same noise model on test points as on training points, the “output” of Bayesian linear regression on a new test point $x_*$ is not just a single guess “$y_*$”, but rather an entire probability distribution over possible outputs, known as the posterior predictive distribution:
\begin{equation}p(y_* | x_* , S) = \int_{\theta} p(y_* | x_* , \theta ) p(\theta | S) d\theta \label{postd}\end{equation}
For many types of models, the integrals in (\ref{ppost}) and (\ref{postd}) are difficult to compute, and hence we often resort to approximations, such as maximum a posteriori (MAP) estimation (see also the notes on regularization and model selection):
$$\hat{\theta} = \operatorname{arg\,max}_{\theta}\, p(\theta | S) = \operatorname{arg\,max}_{\theta}\, p(\theta) \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}, \theta)$$
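With the $\mathcal N(0, \tau^2 I)$ prior above, maximizing the log posterior amounts to least squares with an $\ell_2$ penalty, so the MAP estimate has the closed form $\hat{\theta} = (X^TX + \frac{\sigma^2}{\tau^2}I)^{-1}X^Ty$. A minimal sketch, reusing `X`, `y`, `sigma`, and `n` from the snippet above; the prior scale `tau` is an assumed value for the example:

```python
# MAP estimate under the N(0, tau^2 I) prior (equivalent to ridge regression).
tau = 1.0                                            # assumed prior scale
lam = sigma**2 / tau**2                              # effective regularization strength
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```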
In the case of Bayesian linear regression, however, the integrals actually are tractable! In particular, for Bayesian linear regression, one can show (see Section 2.1.1, “The standard linear model”, of http://www.gaussianprocess.org/gpml/) that
$$ \theta | S \sim \mathcal N(\frac{1}{\sigma^2}A^{-1}X^Ty, A^{-1}) $$
$$ y_* | x_*, S \sim \mathcal N\!\left(\frac{1}{\sigma^2}x_*^TA^{-1}X^Ty,\; x_*^TA^{-1}x_* + \sigma^2\right) $$
where $A = \frac{1}{\sigma^2} X^TX + \frac{1}{\tau^2} I$, with $X$ the $m \times n$ design matrix whose rows are the $x^{(i)T}$ and $y$ the vector of training targets. The derivation of these formulas is somewhat involved. Nonetheless, from these equations we get at least a flavor of what Bayesian methods are all about: the posterior distribution over the test output $y_*$ for a test input $x_*$ is a Gaussian distribution; this distribution reflects the uncertainty in our predictions $y_* = \theta^Tx_* + \varepsilon_*$ arising from both the randomness in $\varepsilon_*$ and the uncertainty in our choice of parameters $\theta$. In contrast, classical probabilistic linear regression models estimate the parameters $\theta$ directly from the training data but provide no estimate of how reliable these learned parameters may be.
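These closed-form expressions translate directly into a few lines of linear algebra. The sketch below reuses `X`, `y`, `sigma`, `tau`, and `n` from the earlier snippets; the test input `x_star` is an arbitrary example point:

```python
# Closed-form posterior and posterior predictive, following the formulas above.
A = X.T @ X / sigma**2 + np.eye(n) / tau**2
A_inv = np.linalg.inv(A)

post_mean = A_inv @ X.T @ y / sigma**2               # mean of theta | S
post_cov = A_inv                                     # covariance of theta | S

x_star = rng.normal(size=n)                          # arbitrary test input
pred_mean = x_star @ post_mean                       # mean of y_* | x_*, S
pred_var = x_star @ A_inv @ x_star + sigma**2        # variance of y_* | x_*, S
```

Note that `post_mean` coincides with `theta_map` from the MAP sketch: for a Gaussian posterior, the mode and the mean are the same point, but the Bayesian treatment additionally carries the covariance `A_inv` and the predictive variance through to the test prediction.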
Original source: https://www.cnblogs.com/eliker/p/11309896.html