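When your classification model has hundreds or thousands of features, as is the case for text categorization, it's a good bet that many (if not most) of the features are low information. These are features that are common across all classes and therefore contribute little to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.
Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which means less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait until you see the results.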
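High Information Feature Selection
Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words: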
evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
         avoids = True    pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
     astounding = True    pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
        idiotic = True    neg : pos =  9.8 : 1.0
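Compare this to the results from the first article on classification for sentiment analysis, where all the words were used as features: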
evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
      insulting = True    neg : pos = 13.0 : 1.0
     vulnerable = True    pos : neg = 12.3 : 1.0
      ludicrous = True    neg : pos = 11.8 : 1.0
         avoids = True    pos : neg = 11.7 : 1.0
    uninvolving = True    neg : pos = 11.7 : 1.0
     astounding = True    pos : neg = 10.3 : 1.0
    fascination = True    pos : neg = 10.3 : 1.0
        idiotic = True    neg : pos =  9.8 : 1.0
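The gap between the two runs can be tallied directly from the printed metrics. The snippet below is only an arithmetic check, written in Python 3 syntax; the dictionary names and the `point_gains` helper are made up for this illustration, and the numbers are copied from the two outputs above.

```python
# Metrics copied from the two runs printed above.
all_words = {
    "accuracy": 0.728,
    "pos precision": 0.651595744681,
    "pos recall": 0.98,
    "neg precision": 0.959677419355,
    "neg recall": 0.476,
}
best_words = {
    "accuracy": 0.93,
    "pos precision": 0.890909090909,
    "pos recall": 0.98,
    "neg precision": 0.977777777778,
    "neg recall": 0.88,
}


def point_gains(before, after):
    """Return the change in each metric, in percentage points."""
    return {name: round((after[name] - before[name]) * 100, 1) for name in before}


gains = point_gains(all_words, best_words)
for name, delta in gains.items():
    print(f"{name}: {delta:+.1f} points")
```

The differences are in percentage points (absolute differences between the scores), not relative improvements.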
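Using only the best 10000 words, accuracy is more than 20 percentage points higher, pos precision has increased by almost 24 points, and neg recall has improved by over 40 points. These are huge gains, with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, with an explanation below.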
import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

def word_feats(words):
    return dict([(word, True) for word in words])

print 'evaluating single word features'
evaluate_classifier(word_feats)

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd.inc(word.lower())
    label_word_fd['pos'].inc(word.lower())

for word in movie_reviews.words(categories=['neg']):
    word_fd.inc(word.lower())
    label_word_fd['neg'].inc(word.lower())

# n_ii = label_word_fd[label][word]
# n_ix = word_fd[word]
# n_xi = label_word_fd[label].N()
# n_xx = label_word_fd.N()

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

for word, freq in word_fd.iteritems():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    return dict([(word, True) for word in words if word in bestwords])

print 'evaluating best word features'
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d

print 'evaluating best words + bigram chi_sq word features'
evaluate_classifier(best_bigram_word_feats)
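Calculating Information Gain
To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive. That makes "magnificent" a high information word. Notice that the most informative features above did not change. That makes sense, because the point is to use only the most informative features and ignore the rest.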
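To make the idea concrete, here is a minimal sketch that compares how often a word occurs in each class. It is a crude relative-frequency comparison on invented token lists, not the chi-square score the post uses and not how NLTK computes its "Most Informative Features" ratios; `class_rates` and the toy sentences are assumptions made for illustration (Python 3 syntax).

```python
from collections import Counter

# Toy token lists standing in for the pos and neg review categories.
tokens_by_label = {
    "pos": "a magnificent film with a magnificent cast and a fine story".split(),
    "neg": "a ludicrous film with a dull cast and a dull story".split(),
}


def class_rates(word, tokens_by_label):
    """Relative frequency of `word` within each class."""
    return {
        label: Counter(tokens)[word] / len(tokens)
        for label, tokens in tokens_by_label.items()
    }


for word in ("magnificent", "a", "film"):
    print(word, class_rates(word, tokens_by_label))
```

In this toy data "magnificent" appears only in the pos tokens, so it separates the classes, while "a" and "film" occur at the same rate in both and tell the classifier nothing.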
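One of the best metrics for information gain is chi-square. NLTK includes it in the BigramAssocMeasures class in the metrics package. To use it, we first need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for the overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only the words that appear in the set. Now each file is classified based on the presence of these high information words.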
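The scoring and selection steps can be sketched without NLTK. The version below computes the standard Pearson chi-square for a 2x2 word/label contingency table, using the same n_ii, n_ix, n_xi, n_xx naming as the comments in the listing; it is an illustration on invented counts in Python 3 syntax, not NLTK's implementation, and `top_words` is a hypothetical helper.

```python
from collections import Counter


def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a 2x2 word/label contingency table.

    n_ii: count of the word in the label, n_ix: count of the word overall,
    n_xi: total words in the label, n_xx: total words overall.
    """
    n_io = n_ix - n_ii          # the word, in the other label
    n_oi = n_xi - n_ii          # other words, in this label
    n_oo = n_xx - n_ix - n_oi   # other words, in the other label
    denom = n_ix * n_xi * (n_xx - n_ix) * (n_xx - n_xi)
    if denom == 0:
        return 0.0
    return n_xx * (n_ii * n_oo - n_io * n_oi) ** 2 / denom


# Toy data standing in for the movie review corpus.
tokens_by_label = {
    "pos": "a magnificent film with a magnificent cast".split(),
    "neg": "a ludicrous film with a dull cast".split(),
}
label_counts = {label: Counter(toks) for label, toks in tokens_by_label.items()}
word_counts = sum(label_counts.values(), Counter())
total = sum(word_counts.values())

# Score each word as the sum of its chi-square against every label.
word_scores = {}
for word, freq in word_counts.items():
    word_scores[word] = sum(
        chi_sq(counts[word], freq, sum(counts.values()), total)
        for counts in label_counts.values()
    )


def top_words(scores, n):
    """Return the n highest-scoring words as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n])


bestwords = top_words(word_scores, 3)


def best_word_feats(words):
    """Keep only the words that are in the high information set."""
    return {word: True for word in words if word in bestwords}


print(sorted(bestwords))
print(best_word_feats(["a", "magnificent", "film"]))
```

Words that occur equally often in both classes ("a", "film", "with", "cast") score zero here, so they are the first to be dropped when only the top-scoring words are kept.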
Significant Bigrams
The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:
evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
           magnificent = True    pos : neg = 15.0 : 1.0
           outstanding = True    pos : neg = 13.6 : 1.0
             insulting = True    neg : pos = 13.0 : 1.0
            vulnerable = True    pos : neg = 12.3 : 1.0
     ('matt', 'damon') = True    pos : neg = 12.3 : 1.0
        ('give', 'us') = True    neg : pos = 12.3 : 1.0
             ludicrous = True    neg : pos = 11.8 : 1.0
           uninvolving = True    neg : pos = 11.7 : 1.0
                avoids = True    pos : neg = 11.7 : 1.0
   ('absolutely', 'no') = True    neg : pos = 10.6 : 1.0
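This shows that bigrams don't matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With bigrams, you get more uniform performance in each class. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations are always true.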
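Since the comparison rests on per-class precision and recall, it may help to see how those numbers fall out of the reference and test sets that evaluate_classifier builds. The sketch below uses plain Python 3 sets with invented labels; the `precision` and `recall` functions here are stand-ins written for this example, not NLTK's own.

```python
from collections import defaultdict


def precision(reference, test):
    """Fraction of items predicted as a class that truly belong to it."""
    return len(reference & test) / len(test) if test else None


def recall(reference, test):
    """Fraction of items truly in a class that were predicted as it."""
    return len(reference & test) / len(reference) if reference else None


# Hypothetical gold labels and predictions for six documents.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "pos", "pos", "neg"]

refsets = defaultdict(set)   # label -> indexes that truly have it
testsets = defaultdict(set)  # label -> indexes predicted to have it
for i, (truth, guess) in enumerate(zip(gold, predicted)):
    refsets[truth].add(i)
    testsets[guess].add(i)

for label in ("pos", "neg"):
    print(label, "precision:", precision(refsets[label], testsets[label]))
    print(label, "recall:", recall(refsets[label], testsets[label]))
```

A classifier that leans toward one label, as in this made-up example, shows the same lopsided pattern as the single word features run: high recall and lower precision for the favored class, and the reverse for the other.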
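Improving Feature Selection
The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the single best things you can do to improve classifier performance. It's fine to throw away data if that data is not adding value, and it's especially recommended when that data is actually making your model worse.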