N-gram
• Basic concepts
-------------------------------------------------
Basic definition of conditional probability:
Let A and B be two events, with A not impossible (P(A) > 0). Then
P(B|A) = P(AB) / P(A)
is called the conditional probability of B given that A has occurred.
-------------------------------------------------
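A quick worked instance of the definition above (a toy dice example of my own, not from the original notes): for a fair six-sided die, let A = "roll is even" and B = "roll > 3".

```python
from fractions import Fraction

# Fair six-sided die: A = "even roll" = {2,4,6}, B = "roll > 3" = {4,5,6}
P_A  = Fraction(3, 6)   # P(A)
P_AB = Fraction(2, 6)   # P(A and B) = P({4, 6})

# Conditional probability: P(B|A) = P(AB) / P(A)
P_B_given_A = P_AB / P_A
print(P_B_given_A)  # (1/3) / (1/2) = 2/3
```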
P(W) = P(w1, w2, w3, ..., wn)
An n-gram language model (LM) is one way of computing P(W).
By the chain rule:
P(W) = P(w1, w2, w3, ..., wn) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wn|w1, ..., wn-1)
With full histories, each factor can be estimated from counts:
P(wn|w1, ..., wn-1) = C(w1, w2, ..., wn) / C(w1, w2, ..., wn-1)
Under the Markov assumption introduced below, an n-gram model is just an (n-1)-order Markov chain.
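A minimal sketch of the chain-rule decomposition with full-history counts (the toy corpus and helper names are my own illustration):

```python
from collections import Counter

corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["i", "drink", "tea"]]

# Count every prefix (w1..wk) of every sentence, including the empty prefix.
counts = Counter()
for sent in corpus:
    for k in range(len(sent) + 1):
        counts[tuple(sent[:k])] += 1

def p_next(history, w):
    """MLE with full history: P(w | history) = C(history, w) / C(history)."""
    return counts[tuple(history) + (w,)] / counts[tuple(history)]

# Chain rule: P(W) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ...
sent = ["i", "like", "tea"]
p = 1.0
for k, w in enumerate(sent):
    p *= p_next(sent[:k], w)
print(p)  # 1 * 2/3 * 1/2 = 1/3
```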
Shortcomings:
1. The number of histories whose probabilities must be estimated explodes as n grows.
2. Data sparsity: no corpus contains enough sentences to observe every word sequence often enough for reliable counts.
Introduce the Markov assumption: the state being estimated depends only on the previous i-1 states.
Hence: P(wn|w1, ..., wn-1) ≈ P(wn|wn-i+1, ..., wn-1)
The probabilities are computed by maximum-likelihood estimation (MLE),
combined with a family of discounting/smoothing algorithms:
P(wn|wn-i+1, ..., wn-1) = C(wn-i+1, ..., wn) / C(wn-i+1, ..., wn-1)
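A minimal sketch of the MLE count-ratio estimate above for a bigram model (i = 2); the corpus and names are illustrative, and no smoothing is applied:

```python
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(w_prev, w):
    """Unsmoothed MLE: P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("the", "cat"))  # C(the, cat) = 1, C(the) = 2 -> 0.5
```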
• How an SRILM LM computes probabilities
P(word1, word2, word3)

# trigram term P(word3 | word1, word2):
if has(word3 | word1, word2) {
    return P(word3 | word1, word2);
} else if has(word2 | word1) {
    return backOff(word2 | word1) * P(word3 | word2);
} else {
    return P(word3 | word2);
}

# bigram term P(word2 | word1):
if has(word2 | word1) {
    return P(word2 | word1);
} else {
    return backOff(word1) * P(word2);  # make sure OOV words are mapped to <unk>
}
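The branching above can be sketched in Python. The dictionaries and names below are my own illustration of an ARPA-style table, not SRILM's actual API; a real ARPA file stores log10 probabilities and back-off weights, while plain probabilities are used here for readability.

```python
# Toy ARPA-style tables (illustrative values, plain probabilities):
prob = {
    ("the",): 0.2, ("cat",): 0.1, ("sat",): 0.05, ("<unk>",): 0.01,
    ("the", "cat"): 0.4, ("cat", "sat"): 0.3,
    ("the", "cat", "sat"): 0.5,
}
backoff = {("the",): 0.7, ("cat",): 0.6, ("the", "cat"): 0.8}

def p(history, w):
    """Back-off lookup: use the longest stored n-gram; otherwise shorten
    the history by one word, scaling by its back-off weight (1.0 if the
    shortened history itself was never stored)."""
    if not history:
        return prob.get((w,), prob[("<unk>",)])  # OOV words map to <unk>
    if history + (w,) in prob:
        return prob[history + (w,)]
    bow = backoff.get(history, 1.0)
    return bow * p(history[1:], w)

print(p(("the", "cat"), "sat"))  # trigram hit: 0.5
print(p(("the", "dog"), "sat"))  # backs off down to the unigram P(sat)
```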
• How SRILM computes perplexity (PPL)
ngram -ppl TESTSET.seg -order 3 -lm LM.ARPA
--------------------------------------------------------
file TESTSET.seg: 2000 sentences, 52388 words, 249 OOVs
0 zeroprobs, logprob= -105980 ppl= 90.6875 ppl1= 107.805
TESTSET.seg    the word-segmented test-set file
LM.ARPA        the language model being evaluated
--------------------------------------------------------
ppl  = 10^{-logP(T) / (Sen + Word)}
ppl1 = 10^{-logP(T) / Word}
where Sen is the sentence count and Word the scored word count; OOVs and zeroprobs are excluded from the counts before dividing.
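Plugging the sample `ngram -ppl` output above into these formulas (the reported logprob is rounded, so the last digits differ slightly):

```python
# Numbers from the sample SRILM output above.
logprob = -105980.0
sentences, words = 2000, 52388
oovs, zeroprobs = 249, 0

# SRILM excludes OOVs and zeroprobs from the token counts;
# ppl also counts one end-of-sentence token per sentence.
denom_ppl  = words + sentences - oovs - zeroprobs
denom_ppl1 = words - oovs - zeroprobs

ppl  = 10 ** (-logprob / denom_ppl)    # ~90.69  (reported: ppl= 90.6875)
ppl1 = 10 ** (-logprob / denom_ppl1)   # ~107.81 (reported: ppl1= 107.805)
print(round(ppl, 2), round(ppl1, 2))
```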