Natural Language 14_Stemming words with NLTK

https://www.pythonprogramming.net/stemming-nltk-tutorial/?completed=/stop-words-nltk-tutorial/

Stemming words with NLTK

Stemming is a kind of normalization. Many variations of a word carry
the same meaning, apart from the tense involved.

We stem to shorten the lookup and to normalize sentences.

Consider:

I was taking a ride in the car.

I was riding in the car.

These two sentences mean the same thing. "in the car" is the same. "I was"
is the same. In both cases, "was" plus an ing form describes an activity
in the past, so is it truly necessary to differentiate between ride and
riding when we are just trying to figure out what this past-tense
activity was?

No, not really.

This is just one minor example, but imagine every word in the English
language, every possible tense and affix you can put on a word. Having
individual dictionary entries per version would be highly redundant and
inefficient, especially since, once we convert to numbers, the "value"
is going to be identical.
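A minimal sketch of the redundancy argument above, in plain Python. The toy_stem function and its suffix list are hypothetical stand-ins made up for this illustration; they are far cruder than the Porter stemmer introduced below and only show how word variants collapse into one numeric id.

```python
# Hypothetical toy stemmer: strips a few common suffixes.
# This is NOT the Porter algorithm, only an illustration of the idea.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["walk", "walks", "walked", "walking",
         "jump", "jumps", "jumped", "jumping"]

# Without stemming, every surface form gets its own numeric id.
raw_ids = {w: i for i, w in enumerate(sorted(set(words)))}

# With stemming, all variants of a word share one id.
stem_ids = {s: i for i, s in enumerate(sorted({toy_stem(w) for w in words}))}
encoded = [stem_ids[toy_stem(w)] for w in words]

print(len(raw_ids), "entries without stemming")
print(len(stem_ids), "entries with stemming")
print(encoded)
```

Eight dictionary entries shrink to two here, and every variant of walk or jump is encoded as the same number.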

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data

# Create one stemmer instance and reuse it for every word
ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

Next, we can easily stem by doing something like:

for w in example_words:
    print(ps.stem(w))

Our output:

python
python
python
python
pythonli
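To see which of the original words end up sharing a stem, one option is to group them by their stemmed form. This is only a sketch built on the same ps.stem call used above; the groups dictionary is a name chosen here for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

# Map each stem to the list of original words that reduce to it.
groups = defaultdict(list)
for w in example_words:
    groups[ps.stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Four of the five words land under python, while pythonly sits alone under pythonli. A stem does not have to be a real word; it only has to be the same string for related forms.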

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

Now our result is (recent NLTK releases lowercase each stem by default, so It and All may print as it and all):

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
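The output above still contains the punctuation tokens. As a rough sketch of a common follow-up step, the snippet below counts how often each stem occurs while skipping punctuation. The regular-expression tokenizer and the stem_counts name are assumptions made for this illustration (word_tokenize from the tutorial would work too, with an extra filter for punctuation).

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer

ps = PorterStemmer()
new_text = ("It is important to be very pythonly while you are pythoning "
            "with python. All pythoners have pythoned poorly at least once.")

# Assumption: a simple regex tokenizer that keeps runs of letters only,
# so "." never reaches the stemmer. Lowercasing first makes the result
# independent of how the stemmer treats capital letters.
tokens = re.findall(r"[a-z]+", new_text.lower())

# Count occurrences per stem instead of per surface form.
stem_counts = Counter(ps.stem(t) for t in tokens)

print(stem_counts["python"])
print(len(tokens), "tokens,", len(stem_counts), "distinct stems")
```

Here pythoning, python, pythoners and pythoned are all counted under the single stem python, which is the lookup shortening described at the start.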

Next up, we're going to discuss something a bit more advanced from the NLTK module, Part of Speech tagging, where we can use the NLTK module to identify the parts of speech for each word in a sentence.

Time: 2024-10-05 15:32:17
