Information Measures
1. Independence and Markov Chains
Independence
For two random variables \(X\) and \(Y\), if for all \((x, y) \in \mathcal{X} \times \mathcal{Y}\) we have
\[
p(x, y) = p(x)p(y)
\]
then \(X\) and \(Y\) are said to be independent, written \(X \perp Y\).
Here \(p(x), p(y), p(x, y)\) are shorthand for \(\text{Pr}(X=x), \text{Pr}(Y=y), \text{Pr}(X=x, Y=y)\), respectively.
Mutual Independence
Given random variables \(X_{1}, \cdots, X_{n}\), if for all \((x_1, \cdots, x_{n}) \in \mathcal{X}_{1} \times \cdots \times \mathcal{X}_{n}\) we have
\[
p(x_{1}, \cdots, x_{n}) = p(x_{1})\cdots p(x_{n})
\]
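then \(X_{1}, \cdots, X_{n}\) are said to be mutually independent.
Pairwise Independence
The random variables \(X_{1}, \cdots, X_{n}\) are pairwise independent if \(X_{i}\) and \(X_{j}\) are independent for all \(1 \le i \lt j \le n\).
Mutual independence implies pairwise independence; the converse does not hold in general.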
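To see why the converse fails, here is a minimal sketch (not part of the original notes). It assumes a hypothetical triple in which \(X_1, X_2\) are independent fair bits and \(X_3 = X_1 \oplus X_2\); the helper names `marginal`, `pairwise_independent`, and `mutually_independent` are illustrative only.

```python
from itertools import product

# Hypothetical example: X1, X2 are independent fair bits and X3 = X1 XOR X2.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def marginal(joint, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def pairwise_independent(joint, n):
    """Check p(xi, xj) = p(xi) p(xj) for every pair i < j."""
    singles = [marginal(joint, (i,)) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            pij = marginal(joint, (i, j))
            for a, b in product((0, 1), repeat=2):
                target = singles[i][(a,)] * singles[j][(b,)]
                if abs(pij.get((a, b), 0.0) - target) > 1e-12:
                    return False
    return True

def mutually_independent(joint, n):
    """Check p(x1, ..., xn) = p(x1) ... p(xn) for every outcome."""
    singles = [marginal(joint, (i,)) for i in range(n)]
    for outcome in product((0, 1), repeat=n):
        target = 1.0
        for i, value in enumerate(outcome):
            target *= singles[i][(value,)]
        if abs(joint.get(outcome, 0.0) - target) > 1e-12:
            return False
    return True

print(pairwise_independent(joint, 3))   # True
print(mutually_independent(joint, 3))   # False
```

Every pair is uniform on \(\{0,1\}^2\), yet the outcome \((0,0,0)\) has probability \(1/4\) rather than \(1/8\), so the full product rule fails.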
Conditional Independence
For random variables \(X, Y, Z\), if for all \((x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}\) we have
\[
p(x,y,z)p(y) = p(x,y)p(y,z)
\]
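then \(X\) and \(Z\) are said to be conditionally independent given \(Y\), written \(X \perp Z \mid Y\) or \(X \rightarrow Y \rightarrow Z\).
Markov Chain
For random variables \(X_{1}, \cdots, X_{n}\) (\(n \ge 3\)), \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) forms a Markov chain if, for all \((x_{1}, \cdots, x_{n})\),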
\[
p(x_{1},\cdots,x_{n})p(x_{2})\cdots p(x_{n-1}) = p(x_{1},x_{2})\cdots p(x_{n-1},x_{n})
\]
Equivalent definitions of a Markov chain:
- \(p(x_{1},\cdots,x_{n})=\begin{cases}p(x_{1})p(x_{2}|x_{1})\cdots p(x_{n}|x_{n-1}) & \text{if}\ \ p(x_{1})\cdots p(x_{n-1}) > 0\\ 0 & \text{otherwise}\end{cases}\)
- \(p(x_{t}|x_{1},\cdots,x_{t-1})=p(x_{t}|x_{t-1})\) for \(2 \le t \le n\), whenever the conditioning event has positive probability
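The link between the factorized form and the defining identity can be checked numerically. The sketch below is illustrative only: the initial distribution `p1` and the transition matrix `trans` are hypothetical values, and the chain has length \(n = 3\).

```python
from itertools import product

# Hypothetical binary chain: initial distribution and one transition matrix.
p1 = [0.3, 0.7]
trans = [[0.9, 0.1], [0.4, 0.6]]   # trans[a][b] = p(next = b | current = a)

# Build the joint pmf from the factorization p(x1) p(x2|x1) p(x3|x2).
joint = {(a, b, c): p1[a] * trans[a][b] * trans[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def marginal(joint, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

p12 = marginal(joint, (0, 1))
p23 = marginal(joint, (1, 2))
p2 = marginal(joint, (1,))

# Defining identity for n = 3: p(x1,x2,x3) p(x2) = p(x1,x2) p(x2,x3).
max_gap = max(abs(joint[(a, b, c)] * p2[(b,)] - p12[(a, b)] * p23[(b, c)])
              for a, b, c in product((0, 1), repeat=3))
print(max_gap < 1e-12)
```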
Property:
If \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) is a Markov chain, then the reversed sequence \(X_{n} \rightarrow \cdots \rightarrow X_{1}\) is also a Markov chain.
Property: Markov Subchains
Let \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) be a Markov chain and \(\mathcal{N}_{n} = \left\{1, 2, \cdots, n\right\}\). For a subset \(\alpha\) of \(\mathcal{N}_{n}\), let \(X_{\alpha}\) denote \(\left\{X_{i}: i \in \alpha\right\}\). Given disjoint subsets \(\alpha_{1}, \cdots, \alpha_{m}\) of \(\mathcal{N}_{n}\) such that \(k_{1} \lt \cdots \lt k_{m}\) for all \(k_{j} \in \alpha_{j}, j = 1, \cdots, m\), the sequence \(X_{\alpha_{1}}\rightarrow\cdots\rightarrow X_{\alpha_{m}}\) forms a Markov chain.
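2. Shannon's Information Measures
Entropy:
The entropy of a random variable \(X\) is defined as: \(\displaystyle H(X) = -\sum_{x\in \mathcal{X}}p(x)\log p(x) = \sum_{x \in \mathcal{X}}p(x)\log \frac{1}{p(x)}\)
The quantity \(\displaystyle \log \frac{1}{p(X)}\) is called the self-information of \(X\), so entropy is the expected self-information: \(H(X) = E \log\frac{1}{p(X)}\)
Example: entropy of a binary random variable
Let \(X \sim \text{Bernoulli}(p)\). Then \(H(p) = p \times \log \frac{1}{p} + (1-p) \times \log \frac{1}{1-p}\). As a function of \(p\), \(H(p)\) attains its maximum at \(p = 0.5\), where it equals \(\log 2\) (1 bit when the logarithm is base 2).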
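A small sketch of the binary entropy function, assuming base-2 logarithms and the usual convention \(0 \log \frac{1}{0} = 0\) (neither is stated in the notes above); `binary_entropy` is an illustrative name.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, taking 0 * log(1/0) as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Scan a grid of p values and locate the maximizer.
grid = [i / 100 for i in range(101)]
best = max(grid, key=binary_entropy)
print(best, binary_entropy(best))   # 0.5 1.0
```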
Joint entropy:
The joint entropy of random variables \(X, Y\) is defined as: \(\displaystyle H(X, Y) = -\sum_{x,y}p(x,y)\log p(x,y) = \sum_{x,y}p(x,y)\log \frac{1}{p(x,y)}\)
Here \(\log \frac{1}{p(X,Y)}\) is the self-information of the pair \((X, Y)\).
Conditional entropy:
For random variables \(X, Y\), the conditional entropy of \(Y\) given \(X\) is defined as:
\[
\begin{align*}
H(Y|X)
&= \sum_{x}p(x)H(Y|X=x)\\
&= \sum_{x}p(x)\sum_{y}p(y|x)\log \frac{1}{p(y|x)}\\
&= \sum_{x,y}p(x,y)\log \frac{1}{p(y|x)}\\
&= E\log \frac{1}{p(Y|X)}
\end{align*}
\]
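Relation between joint entropy and conditional entropy: \(H(X,Y)=H(X)+H(Y|X) = H(Y) + H(X|Y)\)
Conditioning on random variables and on specific values can be mixed; the average is taken only over the variables that are not fixed, for example:
\(\displaystyle H(X,Y|Z,W=w,S=s,U) = \sum_{x,y,z,u}p(x,y,z,u|w,s)\log \frac{1}{p(x,y|z,w,s,u)}\)
Mutual information:
The mutual information between random variables \(X,Y\) is defined as: \(\displaystyle I(X;Y) = \sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)} = E \log \frac{p(X,Y)}{p(X)p(Y)}\)
Relation between mutual information and conditional entropy: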
\(H(X) = H(X|Y) + I(X;Y)\)
\(H(Y) = H(Y|X) + I(X;Y)\)
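The identities relating entropy, conditional entropy, and mutual information can be verified on a small table. This is a sketch under assumptions: the joint pmf `pxy` is a hypothetical example and logarithms are base 2. Each quantity is computed directly from its definition, so the final checks are not circular.

```python
import math

# Hypothetical joint pmf p(x, y) on {0, 1} x {0, 1}.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(dist):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# Marginals p(x) and p(y).
px, py = {}, {}
for (x, y), p in pxy.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pxy)

# Conditional entropies from E log 1/p(y|x) and E log 1/p(x|y).
H_Y_given_X = sum(p * math.log2(px[x] / p) for (x, y), p in pxy.items() if p > 0)
H_X_given_Y = sum(p * math.log2(py[y] / p) for (x, y), p in pxy.items() if p > 0)

# Mutual information from E log p(x,y) / (p(x) p(y)).
I_XY = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)
print(abs(H_X - (H_X_given_Y + I_XY)) < 1e-12)
```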
Conditional mutual information:
For random variables \(X, Y, Z\), the conditional mutual information between \(X\) and \(Y\) given \(Z\) is defined as:
\[
\begin{align*}
I(X;Y|Z)
&= \sum_{z}p(z)\sum_{x,y}p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\\
&= \sum_{x,y,z}p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\\
&= E\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)}
\end{align*}
\]
\(\displaystyle I(X;Y|Z=z,V)=\sum_{x,y,v}p(x,y,v|z)\log\frac{p(x,y|z,v)}{p(x|z,v)p(y|z,v)}\)
3. Chain Rules
\(\displaystyle H(X_{1}, \dots, X_{n})=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1})\)
\(\displaystyle H(X_{1}, \dots, X_{n} \mid Y)=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1},Y)\)
\(\displaystyle I(X_{1}, \dots, X_{n};Y) = \sum_{i=1}^{n}I(X_{i};Y|X_{1}, \dots, X_{i-1})\)
\(\displaystyle I(X_{1}, \dots, X_{n};Y\mid Z) = \sum_{i=1}^{n}I(X_{i};Y|X_{1}, \dots, X_{i-1}, Z)\)
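As a sanity check of the chain rule for mutual information with \(n = 2\), the sketch below compares \(I(X_1, X_2; Y)\) with \(I(X_1;Y) + I(X_2;Y|X_1)\). The joint pmf is a hypothetical one built from arbitrary positive weights, and `cond_mi` is an illustrative helper that evaluates the definition directly.

```python
import math
from itertools import product

# Hypothetical joint pmf over (X1, X2, Y): positive weights, then normalized.
weights = {o: 1 + o[0] + 2 * o[1] * o[2] + o[0] * o[2]
           for o in product((0, 1), repeat=3)}
norm = sum(weights.values())
joint = {o: w / norm for o, w in weights.items()}

def marginal(idx):
    """Marginal pmf over the coordinates in idx (a tuple of positions)."""
    out = {}
    for o, p in joint.items():
        key = tuple(o[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_mi(a, b, c=()):
    """I(A;B|C) in bits, from E log [p(a,b,c) p(c) / (p(a,c) p(b,c))]."""
    pabc, pac = marginal(a + b + c), marginal(a + c)
    pbc, pc = marginal(b + c), marginal(c)
    acc = 0.0
    for o, p in joint.items():
        ka, kb, kc = (tuple(o[i] for i in s) for s in (a, b, c))
        ratio = pabc[ka + kb + kc] * pc[kc] / (pac[ka + kc] * pbc[kb + kc])
        acc += p * math.log2(ratio)
    return acc

lhs = cond_mi((0, 1), (2,))                            # I(X1, X2; Y)
rhs = cond_mi((0,), (2,)) + cond_mi((1,), (2,), (0,))  # I(X1; Y) + I(X2; Y | X1)
print(abs(lhs - rhs) < 1e-12)
```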
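4. Informational Divergence
Informational divergence (also called Kullback-Leibler distance or relative entropy):
The informational divergence between two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\) is defined as: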
\[
D(p \parallel q) = \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)} = E_{p}\log \frac{p(X)}{q(X)}
\]
Mutual information is a special case: \(\displaystyle I(X;Y) = D(p(x,y)\parallel p(x)p(y))\)
Property (nonnegativity):
For two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\):
\[
\begin{align*}
D(p\parallel q)
&= \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)}\\
&= \log e \sum_{x \in \mathcal{X}}p(x) \ln \frac{p(x)}{q(x)}\\
&\ge \log e \sum_{x \in \mathcal{X}} p(x) \left(1 - \frac{q(x)}{p(x)}\right)\\
&= \log e\sum_{x \in \mathcal{X}}(p(x) - q(x))\\
&= 0
\end{align*}
\]
The inequality step uses \(\ln x \ge 1 - \frac{1}{x}\). Equality holds if and only if \(p = q\).
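A minimal numerical illustration, assuming base-2 logarithms and two hypothetical distributions `p` and `q` with full support. It also shows that swapping the arguments generally changes the value, which is one reason the divergence is not a metric.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions on a 3-symbol alphabet.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq >= 0, d_qp >= 0)    # both nonnegative
print(kl_divergence(p, p))     # 0.0, the equality case p = q
print(d_pq != d_qp)            # the divergence is not symmetric
```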
Metric
A function \(\rho(x, y)\) is a metric if, for all \(x, y, z\):
- \(\rho(x, y) \ge 0\)
- \(\rho(x, y) = \rho(y, x)\)
- \(\rho(x, y) = 0\) if and only if \(x = y\)
- \(\rho(x, y) + \rho(y, z) \ge \rho(x, z)\)
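Example:
\(\rho(X, Y) = H(X|Y) + H(Y|X)\) satisfies conditions 1, 2, and 4. If \(X = Y\) is taken to mean that there is a one-to-one mapping between \(X\) and \(Y\), then condition 3 holds as well.
Condition 4 (the triangle inequality):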
\[
\begin{align*}
\rho(X,Z)
&= H(X|Z) + H(Z|X)\\
&= I(X;Y|Z) + H(X|Y,Z) + I(Y;Z|X) + H(Z|X,Y)\\
&\le H(Y|Z) + H(X|Y) + H(Y|X)+H(Z|Y)\\
&= H(X|Y) + H(Y|X) + H(Y|Z) + H(Z|Y)\\
&= \rho(X,Y) + \rho(Y,Z)
\end{align*}
\]
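The triangle inequality for \(\rho\) can also be spot-checked numerically. The sketch below assumes a hypothetical joint pmf over \((X, Y, Z)\) and computes \(\rho\) through the identity \(H(A|B) = H(A,B) - H(B)\); it checks one example and is not a proof.

```python
import math
from itertools import product

# Hypothetical joint pmf over (X, Y, Z): positive weights, then normalized.
weights = {o: 1 + o[0] + 2 * o[1] * o[2] + o[0] * o[2]
           for o in product((0, 1), repeat=3)}
norm = sum(weights.values())
joint = {o: w / norm for o, w in weights.items()}

def H(idx):
    """Joint entropy in bits of the coordinates in idx."""
    marg = {}
    for o, p in joint.items():
        key = tuple(o[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return sum(p * math.log2(1 / p) for p in marg.values() if p > 0)

def rho(i, j):
    """rho(A, B) = H(A|B) + H(B|A), using H(A|B) = H(A,B) - H(B)."""
    return 2 * H((i, j)) - H((i,)) - H((j,))

d_xy, d_yz, d_xz = rho(0, 1), rho(1, 2), rho(0, 2)
print(d_xz <= d_xy + d_yz)
```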
Basic Inequalities
Logarithm inequality: \(\displaystyle \ln x \le x - 1\) for all \(x > 0\); equivalently, \(\displaystyle \ln x \ge 1 - \frac{1}{x}\)
Jensen's inequality: if \(f\) is a convex function, \(\lambda_i \ge 0\), and \(\sum \lambda_i = 1\), then \(\displaystyle f\left(\sum \lambda_ix_i\right) \le \sum \lambda_i f(x_i)\)
Relative entropy inequality: for probability distributions \(\{p_i\}\) and \(\{q_i\}\), \(\displaystyle \sum_i p_i \log \frac{p_i}{q_i} \ge 0\), with equality if and only if \(p_i = q_i\) for all \(i\)
Log-sum inequality: for positive \(u_i, v_i\), \(\displaystyle \sum u_{i} \log \frac{u_i}{v_i} \ge \left(\sum u_{i}\right) \log \frac{\sum u_{i}}{\sum v_{i}}\), with equality if and only if \(\displaystyle \frac{u_{i}}{v_{i}} = \text{constant}\)
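A quick numerical look at the log-sum inequality, including its equality case. The vectors below are hypothetical positive numbers (they need not sum to 1), and `log_sum_sides` is an illustrative helper.

```python
import math

def log_sum_sides(u, v):
    """Return (left side, right side) of the log-sum inequality, in bits."""
    lhs = sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v))
    rhs = sum(u) * math.log2(sum(u) / sum(v))
    return lhs, rhs

# Generic case: the ratios u_i / v_i differ, so the inequality is strict.
lhs, rhs = log_sum_sides([1.0, 2.0, 3.0], [2.0, 1.0, 4.0])
print(lhs >= rhs)

# Equality case: u_i / v_i is the same constant (1/2) for every i.
eq_lhs, eq_rhs = log_sum_sides([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(abs(eq_lhs - eq_rhs) < 1e-12)
```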
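Some Inequalities for Information Measures
- \(H(X) \ge 0\), with equality if and only if \(X\) is deterministic. Proof: \(H(X) = I(X;X) = D(p(x,x)\parallel p(x) p(x)) \ge 0\)
- \(H(Y|X) \ge 0\), with equality if and only if \(Y\) is a function of \(X\). Proof: \(\displaystyle H(Y|X) = I(Y;Y|X) = \sum_{x}p(x)D(p(y,y|x)\parallel p(y|x)p(y|x))\ge 0\)
- \(I(X;Y) \ge 0\), with equality if and only if \(X\) and \(Y\) are independent
- \(I(X;Y|Z) \ge 0\), with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\)
Theorem:
\(H(Y|X) \le H(Y)\), with equality if and only if \(X\) and \(Y\) are independent. Proof: \(H(Y) = H(Y|X) + I(X;Y) \ge H(Y|X)\)
Theorem:
\(\displaystyle H(X_1, X_2, \dots, X_n) \le \sum_{i=1}^{n} H(X_i)\), with equality if and only if the \(X_i\) are mutually independent. Proof: by the chain rule and the previous theorem, \(\displaystyle H(X_1, \dots, X_n) = \sum_{i=1}^{n}H(X_{i}|X_{1}, \dots, X_{i-1}) \le \sum_{i=1}^{n}H(X_i)\)
Theorem:
\(I(X;Y,Z) \ge I(X;Y)\), with equality if and only if \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. Proof: \(I(X;Y,Z) = I(X;Y) + I(X;Z|Y) \ge I(X;Y)\)
Theorem:
If \(U \rightarrow X \rightarrow Y \rightarrow V\) forms a Markov chain, then \(I(X;Y) \ge I(U;V)\). Proof: since \(U\rightarrow X\rightarrow Y\) is a Markov chain, \(I(U;Y|X) = 0\), so \(I(X;Y) = I(U,X;Y)=I(U;Y)+I(X;Y|U) \ge I(U;Y)\); similarly, since \(U \rightarrow Y \rightarrow V\) is a Markov chain, \(I(U;Y) \ge I(U;V)\).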
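The last theorem can be illustrated on a concrete chain. The sketch below assumes a hypothetical chain \(U \rightarrow X \rightarrow Y \rightarrow V\) in which each link is a binary symmetric channel; the crossover probabilities and the input distribution `pu` are arbitrary choices for illustration.

```python
import math
from itertools import product

def bsc(eps):
    """Transition matrix of a binary symmetric channel with crossover eps."""
    return [[1 - eps, eps], [eps, 1 - eps]]

# Hypothetical chain U -> X -> Y -> V built from three noisy channels.
pu = [0.3, 0.7]
c1, c2, c3 = bsc(0.1), bsc(0.2), bsc(0.15)
joint = {(u, x, y, v): pu[u] * c1[u][x] * c2[x][y] * c3[y][v]
         for u, x, y, v in product((0, 1), repeat=4)}

def mutual_information(i, j):
    """I(A;B) in bits for coordinates i and j of the joint pmf."""
    pa, pb, pab = {}, {}, {}
    for o, p in joint.items():
        pa[o[i]] = pa.get(o[i], 0.0) + p
        pb[o[j]] = pb.get(o[j], 0.0) + p
        pab[(o[i], o[j])] = pab.get((o[i], o[j]), 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

I_XY = mutual_information(1, 2)
I_UY = mutual_information(0, 2)
I_UV = mutual_information(0, 3)
print(I_XY >= I_UY >= I_UV)
```

The intermediate quantity \(I(U;Y)\) is included because the proof passes through it.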
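Theorem:
For a random variable \(X\) on a finite alphabet \(\mathcal{X}\), \(H(X) \le \log \left|\mathcal{X}\right|\), with equality if and only if \(X\) is uniformly distributed. Proof: let \(u(x)\) be the uniform distribution on \(\mathcal{X}\); then \(\displaystyle D(p(x)\parallel u(x)) = \sum_{x}p(x)\log\frac{p(x)}{1/\left|\mathcal{X}\right|} = \log\left|\mathcal{X}\right| - H(X) \ge 0\).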
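The identity \(D(p \parallel u) = \log\left|\mathcal{X}\right| - H(X)\) behind this proof can be checked on a few hypothetical distributions over a 4-symbol alphabet (base-2 logarithms assumed; helper names are illustrative).

```python
import math

def entropy(p):
    """Entropy in bits of a pmf given as a list."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def kl_to_uniform(p):
    """D(p || u) in bits, where u is uniform on an alphabet of size len(p)."""
    n = len(p)
    return sum(pi * math.log2(pi * n) for pi in p if pi > 0)

examples = [
    [0.25, 0.25, 0.25, 0.25],   # uniform: entropy equals log2(4) = 2
    [0.7, 0.1, 0.1, 0.1],       # skewed: entropy strictly below 2
    [1.0, 0.0, 0.0, 0.0],       # deterministic: entropy 0
]
for p in examples:
    # The two printed numbers should agree up to rounding.
    print(entropy(p), math.log2(len(p)) - kl_to_uniform(p))
```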
Fano's Inequality:
Let \(X\) be a random variable and \(\hat{X}\) an estimate of \(X\) (both taking values in \(\mathcal{X}\)), with error probability \(P_e = \text{Pr}(X \neq \hat{X})\). Then:
\[
H(X\mid \hat{X}) \le h_b(P_e) + P_e \log (\left|\mathcal{X}\right|-1)
\]
Proof: define the error indicator \(Y = 1\cdot\left\{X \neq \hat{X}\right\}\); then \(\text{Pr}(Y=1) = P_e, \text{Pr}(Y=0) = 1 - P_e, H(Y) = h_{b}(P_e)\), where \(h_b\) denotes the binary entropy function.
\[
\begin{align*}
H(X|\hat{X})
&= H(X|\hat{X}) + H(Y|X,\hat{X})\\
&= H(X,Y|\hat{X})\\
&= H(Y|\hat{X})+H(X|Y,\hat{X})\\
&= H(Y|\hat{X}) + \text{Pr}(Y=1)H(X|Y=1,\hat{X})\\
&\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}}\text{Pr}(\hat{X}=\hat{x} \mid Y=1)H(X|Y=1,\hat{X}=\hat{x})\\
&\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}}\text{Pr}(\hat{X}=\hat{x} \mid Y=1)\log (\left|\mathcal{X}\right|-1)\\
&= h_b(P_e) + P_e\log (\left|\mathcal{X}\right|-1)
\end{align*}
\]
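For a concrete feel of the bound, the sketch below evaluates both sides of Fano's inequality on a hypothetical joint pmf of \((X, \hat{X})\) over a 3-symbol alphabet. The table values are made up for illustration, and logarithms are base 2.

```python
import math

# Hypothetical joint pmf p(x, xhat) on a 3-symbol alphabet
# (rows: true symbol x, columns: estimate xhat).
joint = [[0.30, 0.03, 0.02],
         [0.04, 0.25, 0.01],
         [0.05, 0.05, 0.25]]
size = len(joint)

# Marginal of the estimate, p(xhat).
p_xhat = [sum(joint[x][k] for x in range(size)) for k in range(size)]

# H(X | Xhat) = sum over (x, xhat) of p(x, xhat) log 1/p(x | xhat).
h_cond = sum(joint[x][k] * math.log2(p_xhat[k] / joint[x][k])
             for x in range(size) for k in range(size) if joint[x][k] > 0)

# Error probability P_e = Pr(X != Xhat): the off-diagonal mass.
p_e = sum(joint[x][k] for x in range(size) for k in range(size) if x != k)

# Right side: h_b(P_e) + P_e log(|X| - 1).
h_b = p_e * math.log2(1 / p_e) + (1 - p_e) * math.log2(1 / (1 - p_e))
bound = h_b + p_e * math.log2(size - 1)
print(h_cond <= bound)
```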
Entropy Rate of a Stationary Source
Discrete-time information source: \(\left\{X_{k}: k \ge 1\right\}\)
Entropy rate: the entropy rate of \(\left\{X_{k}\right\}\) is defined as \(H_X=\displaystyle \lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, X_2, \cdots, X_{n})\), when the limit exists.
Example:
Let \(\left\{X_{k}\right\}\) be an i.i.d. source, and let \(X\) denote a generic random variable of the source. Then:
\[
\lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n})
= \lim_{n\rightarrow \infty}\frac{n\cdot H(X)}{n}
= H(X)
\]
The entropy rate exists and equals \(H(X)\).
Example:
Let \(\left\{X_{k}\right\}\) be a source whose \(X_k\) are mutually independent with \(H(X_{k}) = k\). Then:
\[
\lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n})
= \lim_{n\rightarrow \infty}\frac{n+1}{2}
\]
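This limit diverges, so the entropy rate does not exist.
Stationary information source: a source \(\left\{X_{k}\right\}\) is stationary if, for all \(m, l \ge 1\), \(X_1, X_2, \cdots, X_m\) and \(X_{1+l}, X_{2+l}, \cdots, X_{m+l}\) have the same joint distribution.
Definition: \(\displaystyle H_X^{'} = \lim_{n\rightarrow \infty}H(X_n|X_1, X_2, \dots, X_{n-1})\), when the limit exists.
Theorem: for a stationary source \(\left\{X_k\right\}\), the entropy rate \(H_X\) exists and \(H_X = H_{X}^{'}\)
Proof: \(H(X_n|X_1, X_2, \dots, X_{n-1}) \le H(X_n|X_2, \dots, X_{n-1})=H(X_{n-1}|X_1, X_2, \dots, X_{n-2})\), where the inequality holds because conditioning does not increase entropy and the equality follows from stationarity. Let \(a_n = H(X_n|X_1, X_2, \dots, X_{n-1})\); the sequence is non-increasing and bounded below by 0, so its limit exists.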
Since the running average of a convergent sequence converges to the same limit (the Cesàro mean), the chain rule gives: \(\displaystyle H_{X}^{'} = \lim_{n \rightarrow \infty}a_{n} = \lim_{n\rightarrow \infty}\frac{\sum_{i=1}^{n}a_i}{n}=\lim_{n\rightarrow \infty} \frac{1}{n}\sum_{i=1}^{n}H(X_i|X_1, X_2, \dots, X_{i-1}) = \lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \dots, X_n) = H_{X}\)
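To watch the convergence numerically, here is a sketch for one special case that the notes do not treat explicitly: a stationary two-state Markov source. The transition matrix `trans` and its stationary distribution `pi` are hypothetical. For such a source \(H(X_n|X_1, \dots, X_{n-1}) = H(X_2|X_1)\) for \(n \ge 2\), so \(\frac{1}{n}H(X_1, \dots, X_n)\) should approach \(H(X_2|X_1)\).

```python
import math
from itertools import product

# Hypothetical two-state chain started from its stationary distribution,
# which makes the source stationary.
trans = [[0.9, 0.1], [0.4, 0.6]]   # trans[a][b] = p(next = b | current = a)
pi = [0.8, 0.2]                    # satisfies pi = pi * trans

def block_entropy(n):
    """H(X_1, ..., X_n) in bits, by brute-force enumeration of all sequences."""
    acc = 0.0
    for seq in product((0, 1), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= trans[a][b]
        acc += p * math.log2(1 / p)
    return acc

# H(X_2 | X_1), the limit of the conditional entropies for this source.
cond = sum(pi[a] * trans[a][b] * math.log2(1 / trans[a][b])
           for a in (0, 1) for b in (0, 1))

rates = [block_entropy(n) / n for n in (1, 4, 12)]
print(rates, cond)
```

The per-symbol entropies decrease toward `cond` as the block length grows, in line with the theorem.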
Original source: https://www.cnblogs.com/hitgxz/p/12116121.html