Natural Language 27_Converting words to Features with NLTK

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/

Converting words to Features with NLTK

In this tutorial, we're going to build on the previous video and
compile feature lists of words from the positive reviews and words
from the negative reviews, hoping to see trends in the specific
types of words used in positive or negative reviews.

To start, our code:

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

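# Collect every word in the corpus, lowercased
all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

# Count how often each word occurs
all_words = nltk.FreqDist(all_words)

# Keep the 3,000 most frequent words as features; keys() alone is not
# ordered by frequency in NLTK 3, so most_common() is used instead
word_features = [w for (w, count) in all_words.most_common(3000)]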

This is mostly the same as before, except for a new variable, word_features, which contains the 3,000 most common words. Next, we're going to build a quick function that looks for these 3,000 words in a given positive or negative document, marking each word's presence as either True or False:

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
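
To see what this function returns without loading the whole corpus, here is a minimal sketch on made-up data. The toy corpus, the toy review, and the name find_features_toy are assumptions for illustration only; collections.Counter stands in for nltk.FreqDist, which builds on Counter in NLTK 3 and shares its most_common() method.

```python
from collections import Counter

# Hypothetical toy corpus standing in for movie_reviews.words()
toy_corpus = ["the", "movie", "the", "plot", "the", "movie", "ending"]

# most_common(n) returns (word, count) pairs sorted by count, highest first
toy_counts = Counter(toy_corpus)
toy_word_features = [w for (w, count) in toy_counts.most_common(2)]

def find_features_toy(document, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in the document."""
    words = set(document)  # set membership checks are fast
    return {w: (w in words) for w in vocabulary}

# Hypothetical review, already split into words
toy_review = ["the", "plot", "dragged"]
toy_features = find_features_toy(toy_review, toy_word_features)

print(toy_word_features)
print(toy_features)
```

Each key in the resulting dictionary is a vocabulary word, and each value only records whether that word appears in the review, not how often.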

Next, we can print one feature set like:

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:

featuresets = [(find_features(rev), category) for (rev, category) in documents]
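
The shape of the result may be easier to see on a tiny example. The sketch below uses two made-up documents and a three-word vocabulary (all hypothetical, as is the name find_features_toy) in place of the 2,000 reviews and 3,000 word features.

```python
# Hypothetical stand-ins for the (word list, category) pairs built from movie_reviews
toy_documents = [
    (["a", "great", "fun", "movie"], "pos"),
    (["a", "boring", "bad", "movie"], "neg"),
]

# Hypothetical vocabulary standing in for word_features
toy_word_features = ["great", "bad", "movie"]

def find_features_toy(document):
    """Return {word: True/False} for every word in the toy vocabulary."""
    words = set(document)
    return {w: (w in words) for w in toy_word_features}

# Same comprehension as in the tutorial: one (feature dict, category) tuple per document
toy_featuresets = [(find_features_toy(rev), category) for (rev, category) in toy_documents]

for features, category in toy_featuresets:
    print(category, features)
```

Every entry pairs a feature dictionary with its label, which is the (features, label) form that a classifier can be trained on in the next step.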

Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!

# -*- coding: utf-8 -*-
"""
Created on Sun Dec  4 09:27:48 2016

@author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

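# dict_allWords is a dictionary-like FreqDist that stores the frequency distribution of all words
dict_allWords = nltk.FreqDist(all_words)
# keys() lists every word, and [:3000] keeps the first 3,000 of them.
# The keys are not sorted by frequency; dict_allWords.most_common(3000)
# would give the 3,000 most frequent words instead.
word_features = list(dict_allWords.keys())[:3000]
'''
 'combating',
 'mouthing',
 'markings',
 'directon',
 'ppk',
 'vanishing',
 'victories',
 'huddleston',
 ...]
'''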

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

words = movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView

'''

# Remove duplicates; words1 is a set
words1 = set(words)
'''
words1

{'!',
 '"',
 '&',
 "'",
 '(',
 ')',.......
 'witch',
 'with',
 'world',
 'would',
 'wrapped',
 'write',
 'years',
 'you',
 'your'}
'''
features = {}

# The word 'victories' is not in words1, so this evaluates to False
('victories' in words1)
'''
Out[73]: False
'''

features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
'''

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
'''
'schwarz': False,
 'supervisors': False,
 'geyser': False,
 'site': False,
 'fevered': False,
 'acknowledged': False,
 'ronald': False,
 'wroth': False,
 'degredation': False,
 ...}
'''

featuresets = [(find_features(rev), category) for (rev, category) in documents]

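featuresets contains 2,000 entries, one per review file. Each entry is a tuple made up of a feature dictionary (for example "glory": False) and the file's neg/pos category.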

Time: 2024-11-04 19:00:48
