## 2017-08-01
## Notes on lec2-2.pdf: Language Models
### Evaluating a Language Model
### Intuition about Perplexity
### Evaluating N-grams with Perplexity
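For reference, perplexity is the inverse probability that the model assigns to held-out text, normalized per word; lower perplexity means the model fits the data better. A minimal sketch of computing it from per-token log-probabilities (function and variable names are illustrative, not from the lecture):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum_i log P(w_i | history_i))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Example: a model that assigns probability 0.1 to each of 4 test tokens
print(perplexity([math.log(0.1)] * 4))  # ≈ 10.0
```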
### Sparsity Is Always a Problem
### Dealing with Sparsity
General approach: modify observed counts to improve estimates
– Discounting: allocate probability mass for unobserved events by discounting the counts of observed events
– Interpolation: approximate the counts of an N-gram using a combination of estimates from related, denser histories
– Back-off: approximate the counts of an unobserved N-gram from the proportion of back-off events (e.g., the (N-1)-gram); see the sketch after this list
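A minimal sketch of the back-off idea: use the bigram estimate when the bigram was observed, otherwise fall back to a scaled unigram estimate. This follows the simplified "stupid back-off" style of unnormalized scores rather than any specific method from the lecture; the names and the 0.4 weight are illustrative assumptions.

```python
def backoff_score(bigram_counts, unigram_counts, total_tokens,
                  prev, word, alpha=0.4):
    """Score word given prev: bigram relative frequency if observed,
    otherwise back off to the unigram frequency scaled by alpha."""
    if bigram_counts.get((prev, word), 0) > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return alpha * unigram_counts.get(word, 0) / total_tokens
```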
### Add-One Smoothing
- We have V words in the vocabulary; N is the number of words in the training set
- Smooth the observed counts by adding one to every count and renormalizing
  – Unigram case: $P(w) = \frac{\text{count}(w) + 1}{N + V}$
  – Bigram case: $P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + V}$
- More general case: add-α, where α is added instead of one (see the sketch below)
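A sketch of the add-α bigram estimate under the definitions above (with α = 1 it reduces to add-one smoothing); the function and variable names are mine, not from the lecture:

```python
from collections import Counter

def add_alpha_bigram_prob(prev, word, bigram_counts, unigram_counts, V, alpha=1.0):
    """P(word | prev) = (count(prev, word) + alpha) / (count(prev) + alpha * V)."""
    return (bigram_counts.get((prev, word), 0) + alpha) / \
           (unigram_counts.get(prev, 0) + alpha * V)

# Tiny usage example on a toy corpus
tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(unigram_counts)
print(add_alpha_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # (1+1)/(2+5)
```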
### Linear Interpolation
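For reference, the standard two-way interpolated bigram estimate (a textbook form, assumed here rather than copied from the slide) mixes the maximum-likelihood bigram and unigram estimates with a weight λ that is tuned on held-out data:

$$P_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{ML}}(w_i \mid w_{i-1}) + (1 - \lambda)\, P_{\text{ML}}(w_i), \qquad 0 \le \lambda \le 1$$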
### Tuning Hyperparameters
- Both add-α and linear interpolation have hyperparameters (α and the interpolation weight λ, respectively).
- The selection of their values is crucial for smoothing performance.
- Their values are tuned to maximize the likelihood of held-out data (a simple grid-search version is sketched below).
  – For linear interpolation, we will use EM to find the optimal parameters (in a few lectures).
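A sketch of the tuning step for the interpolation weight λ: evaluate the held-out log-likelihood over a small grid and keep the best value. This uses grid search rather than the EM procedure mentioned above, and the probability functions p_bigram and p_unigram are assumed to be supplied by the caller:

```python
import math

def heldout_loglik(lmbda, heldout_bigrams, p_bigram, p_unigram):
    """Log-likelihood of held-out (prev, word) pairs under
    lambda * P_ML(word | prev) + (1 - lambda) * P_ML(word)."""
    total = 0.0
    for prev, word in heldout_bigrams:
        p = lmbda * p_bigram(prev, word) + (1 - lmbda) * p_unigram(word)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def tune_lambda(heldout_bigrams, p_bigram, p_unigram, grid=None):
    """Pick the lambda in the grid that maximizes held-out likelihood."""
    grid = grid or [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return max(grid, key=lambda lam: heldout_loglik(lam, heldout_bigrams,
                                                    p_bigram, p_unigram))
```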
### Kneser-Ney Smoothing
- Observed n-grams tend to occur more often in the training data than in new data.
- Absolute discounting: count*(x) = count(x) - d
  $$P_{\text{AD}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) - d}{\text{count}(w_{i-1})} + \alpha\, \tilde{P}(w_i)$$
- Distribute the remaining mass based on the skewness of the lower-order N-gram distribution, i.e., the number of distinct words a word can follow:
  $$\tilde{P}(w_i) \propto \bigl|\{\, w_{i-1} : \text{count}(w_{i-1}, w_i) > 0 \,\}\bigr|$$
- Kneser-Ney has repeatedly proven to be a very successful estimator.
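A minimal sketch of an interpolated absolute-discounting / Kneser-Ney bigram estimator along the lines of the formulas above. The identifiers, the default discount d = 0.75, and the handling of unseen histories are my assumptions, not details from the lecture:

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, d=0.75):
    """Return P(w | h) with absolute discounting plus a Kneser-Ney
    continuation distribution for the redistributed mass:
    P(w | h) = max(count(h, w) - d, 0) / count(h) + alpha(h) * P_cont(w),
    where P_cont(w) is proportional to the number of distinct words w can follow."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])          # count(h) as a bigram history
    continuation = defaultdict(set)           # w -> distinct left contexts of w
    for h, w in bigrams:
        continuation[w].add(h)
    total_cont = sum(len(s) for s in continuation.values())  # number of bigram types

    def prob(h, w):
        p_cont = len(continuation.get(w, ())) / total_cont if total_cont else 0.0
        if histories[h] == 0:                 # unseen history: use continuation prob only
            return p_cont
        discounted = max(bigrams[(h, w)] - d, 0) / histories[h]
        # alpha(h): mass freed by discounting, spread according to P_cont
        n_types_after_h = len({w2 for (h2, w2) in bigrams if h2 == h})
        alpha = d * n_types_after_h / histories[h]
        return discounted + alpha * p_cont

    return prob

# Tiny usage example
tokens = "the cat sat on the mat and the cat ran".split()
p = train_kn_bigram(tokens)
print(p("the", "cat"))   # frequent observed bigram
print(p("cat", "mat"))   # unseen bigram: only redistributed continuation mass
```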