Document Classification

Natural Language Processing with Python

Chapter 6.1

Because of how nltk.FreqDist orders its entries, the code that collects the feature words for the movie reviews is changed slightly.
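The changed step, taken in isolation, is the selection of the most frequent words in a deterministic order. The following is a minimal sketch of that step on a toy word list. It uses collections.Counter instead of nltk.FreqDist (in NLTK 3, FreqDist is a subclass of Counter, so the same sort applies); the helper name top_words and the sample sentence are made up for illustration.

```python
from collections import Counter


def top_words(counts, n):
    """Return the n most frequent words, breaking ties alphabetically."""
    # Sort by descending count first, then by the word itself, so that
    # words with equal counts always come out in the same order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy data, not taken from the movie_reviews corpus.
toy_counts = Counter("the movie was good the plot was bad the end".split())
print(top_words(toy_counts, 3))  # ['the', 'was', 'bad']
```

The full listing follows.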

import nltk
from nltk.corpus import movie_reviews as mr


def document_features(document, words_features):
    """Map a document to boolean 'has(word)' features."""
    document_words = set(document)  # set lookup is faster than list lookup
    features = {}
    for word in words_features:
        features['has(%s)' % word] = (word in document_words)
    return features


def test_doc_classification():
    # One (word list, category) pair per review.
    documents = [(list(mr.words(fileid)), category)
                 for category in mr.categories()
                 for fileid in mr.fileids(categories=category)]
    # The 2000 most frequent words, sorted by descending count and then
    # alphabetically so that the order is deterministic.
    all_words_dist = nltk.FreqDist(w.lower() for w in mr.words())
    words_freq = sorted(all_words_dist.items(),
                        key=lambda x: (-x[1], x[0]))[:2000]
    words_features = [word for word, _ in words_freq]

    featuresets = [(document_features(doc, words_features), c)
                   for (doc, c) in documents]

    # Hold out the first 100 documents for testing. The documents are not
    # shuffled here, so these all come from the first category.
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(nltk.classify.accuracy(classifier, test_set))

    classifier.show_most_informative_features(5)


if __name__ == '__main__':
    test_doc_classification()

The result is as follows, with an accuracy of 0.86:

0.86
Most Informative Features
has(outstanding) = True pos : neg = 10.4 : 1.0
has(seagal) = True neg : pos = 8.7 : 1.0
has(mulan) = True pos : neg = 8.1 : 1.0
has(wonderfully) = True pos : neg = 6.3 : 1.0
has(damon) = True pos : neg = 5.7 : 1.0
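Each line of the listing above compares how likely a feature value is under one label versus the other. The following is a rough sketch of how such a ratio arises. The counts are hypothetical (not taken from the movie_reviews corpus), the function name likelihood_ratio is made up, and plain relative frequencies are used here, whereas NLTK's classifier works with smoothed probability estimates, so its exact numbers would differ.

```python
def likelihood_ratio(count_in_a, total_a, count_in_b, total_b):
    """Ratio of P(feature | class a) to P(feature | class b), unsmoothed."""
    p_a = count_in_a / total_a
    p_b = count_in_b / total_b
    return p_a / p_b


# Hypothetical counts: the word appears in 52 of 950 positive training
# reviews and in 5 of 950 negative ones.
ratio = likelihood_ratio(52, 950, 5, 950)
print("pos : neg = %.1f : 1.0" % ratio)
```

A large ratio means the word is strong evidence for one label, which is why rare but one-sided words such as actor or film names can rank highly.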

Time: 2024-12-17 07:54:19
