自然语言23_Text Classification with NLTK

https://www.pythonprogramming.net/text-classification-nltk-tutorial/?completed=/wordnet-nltk-tutorial/

Text Classification with NLTK

Now that we‘re comfortable with NLTK, let‘s try to tackle text
classification. The goal with text classification can be pretty broad.
Maybe we‘re trying to classify text as about politics or the military.
Maybe we‘re trying to classify it by the gender of the author who wrote
it. A fairly popular text classification task is to identify a body of
text as either spam or not spam, for things like email filters. In our
case, we‘re going to try to create a sentiment analysis algorithm.

To do this, we‘re going to start by trying to use the movie
reviews database that is part of the NLTK corpus. From there we‘ll try
to use words as "features" which are a part of either a positive or
negative movie review. The NLTK corpus movie_reviews data set has the
reviews, and they are labeled already as positive or negative. This
means we can train and test with this data. First, let‘s wrangle our
data.

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words["stupid"])

It may take a moment to run this script, as the movie reviews dataset is somewhat large. Let‘s cover what is happening here.

After importing the data set we want, you see:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

Basically, in plain English, the above code is translated to: In each category (we have pos or neg), take all of the file IDs (each review has its own ID), then store the word_tokenized version (a list of words) for the file ID, followed by the positive or negative label in one big list.

Next, we use random to shuffle our documents. This is because we‘re going to be training and testing. If we left them in order, chances are we‘d train on all of the negatives, some positives, and then test only against positives. We don‘t want that, so we shuffle the data.

Then, just so you can see the data you are working with, we print out documents[1], which is a big list, where the first element is a list the words, and the 2nd element is the "pos" or "neg" label.

Next, we want to collect all words that we find, so we can have a massive list of typical words. From here, we can perform a frequency distribution, to then find out the most common words. As you will see, the most popular "words" are actually things like punctuation, "the," "a" and so on, but quickly we get to legitimate words. We intend to store a few thousand of the most popular words, so this shouldn‘t be a problem.

print(all_words.most_common(15))

The above gives you the 15 most common words. You can also find out how many occurences a word has by doing:

print(all_words["stupid"])

Next up, we‘ll begin storing our words as features of either positive or negative movie reviews.

时间: 2024-11-08 16:59:18

自然语言23_Text Classification with NLTK的相关文章

自然语言处理(1)之NLTK与PYTHON

自然语言处理(1)之NLTK与PYTHON 题记: 由于现在的项目是搜索引擎,所以不由的对自然语言处理产生了好奇,再加上一直以来都想学Python,只是没有机会与时间.碰巧这几天在亚马逊上找书时发现了这本<Python自然语言处理>,瞬间觉得这对我同时入门自然语言处理与Python有很大的帮助.所以最近都会学习这本书,也写下这些笔记. 1. NLTK简述 NLTK模块及功能介绍 语言处理任务 NLTK模块 功能描述 获取语料库 nltk.corpus 语料库和词典的标准化接口 字符串处理 nl

自然语言13_Stop words with NLTK

https://www.pythonprogramming.net/stop-words-nltk-tutorial/?completed=/tokenizing-words-sentences-nltk-tutorial/ Stop words with NLTK The idea of Natural Language Processing is to do some form of analysis, or processing, where the machine can underst

自然语言19.1_Lemmatizing with NLTK

https://www.pythonprogramming.net/lemmatizing-nltk-tutorial/?completed=/named-entity-recognition-nltk-tutorial/ Lemmatizing with NLTK A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlie

自然语言14_Stemming words with NLTK

https://www.pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/ Stemming words with NLTK The idea of stemming is a sort of normalizing method. Many variations of words carry the same meaning, other than when tense is in

自然语言20_The corpora with NLTK

https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed=/lemmatizing-nltk-tutorial/ The corpora with NLTK In this part of the tutorial, I want us to take a moment to peak into the corpora we all downloaded! The NLTK corpus is a mass

Python自然语言处理实践: 在NLTK中使用斯坦福中文分词器

http://www.52nlp.cn/python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E5%AE%9E%E8%B7%B5-%E5%9C%A8nltk%E4%B8%AD%E4%BD%BF%E7%94%A8%E6%96%AF%E5%9D%A6%E7%A6%8F%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8 原文地址:https://www.cnblogs.com/lhuser/p/

转:资源 | 我爱自然语言处理

转自:http://www.52nlp.cn/resources 这里提供一些52nlp博客的一些系列文章以及收集的自然语言处理相关书籍及其他资源的下载,陆续整理中!如有不妥,我会做删除处理! 特别推荐系列: 1.HMM学习最佳范例全文文档,百度网盘链接: http://pan.baidu.com/s/1pJoMA2B 密码: f7az 2.无约束最优化全文文档 -by @朱鉴 ,百度网盘链接:链接: http://pan.baidu.com/s/1hqEJtT6 密码: qng0 3.PYTH

自然语言处理资源

http://www.52nlp.cn/resources 资源 这里提供一些52nlp博客的一些系列文章以及收集的自然语言处理相关书籍及其他资源的下载,陆续整理中!如有不妥,我会做删除处理! 特别推荐系列:1.HMM学习最佳范例全文文档,百度网盘链接: http://pan.baidu.com/s/1pJoMA2B 密码: f7az 2.无约束最优化全文文档 -by @朱鉴 ,百度网盘链接:链接:http://pan.baidu.com/s/1hqEJtT6 密码: qng0 3.PYTHON

自然语言0_nltk中文使用和学习资料汇总

http://blog.csdn.net/huyoo/article/details/12188573 nltk是一个Python工具包, 用来处理和自然语言处理相关的东西. 包括分词(tokenize), 词性标注(POS), 文本分类, 等等现成的工具. 1. nltk的安装 资料1.1: 黄聪:Python+NLTK自然语言处理学习(一):环境搭建  http://www.cnblogs.com/huangcong/archive/2011/08/29/2157437.html   这个图