Natural Language 20: The corpora with NLTK

https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed=/lemmatizing-nltk-tutorial/

The corpora with NLTK

In this part of the tutorial, I want us to take a moment to peek
into the corpora we all downloaded! The NLTK corpus is a massive dump of
all kinds of natural language data sets that are definitely worth
taking a look at.

Almost all of the files in the NLTK corpus follow the same rules
for access via the NLTK module, but nothing is magical
about them. These files are mostly plain text; some are
XML and some are other formats, but you can open them all yourself,
manually, or via the module from Python. Let's talk about viewing them
manually.

Depending on your installation, your nltk_data directory might be
hiding in a multitude of locations. To figure out where it is, head to
your Python directory, where the NLTK module is. If you do not know
where that is, use the following code:

import nltk
print(nltk.__file__)

Run that, and the output will be the location of the NLTK module's __init__.py. Head into the NLTK directory, and then look for the data.py file.

The important blurb of code is:

if sys.platform.startswith('win'):
    # Common locations on Windows:
    path += [
        str(r'C:\nltk_data'), str(r'D:\nltk_data'), str(r'E:\nltk_data'),
        os.path.join(sys.prefix, str('nltk_data')),
        os.path.join(sys.prefix, str('lib'), str('nltk_data')),
        os.path.join(os.environ.get(str('APPDATA'), str('C:\\')), str('nltk_data'))
    ]
else:
    # Common locations on UNIX & OS X:
    path += [
        str('/usr/share/nltk_data'),
        str('/usr/local/share/nltk_data'),
        str('/usr/lib/nltk_data'),
        str('/usr/local/lib/nltk_data')
    ]

There, you can see the various possible directories for nltk_data. If you're on Windows, chances are it is under your AppData directory. To get there, open your file browser, go to the address bar, and type in %appdata%

Next, click on Roaming, and then find the nltk_data directory. In there, you will have your corpora directory. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora
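Putting that together, the default search list from data.py can be reproduced with a short stdlib sketch (no NLTK import needed; the function name here is my own):

```python
import os
import sys

def candidate_nltk_paths():
    """Rebuild the default nltk_data search locations, mirroring data.py."""
    if sys.platform.startswith('win'):
        return [
            r'C:\nltk_data', r'D:\nltk_data', r'E:\nltk_data',
            os.path.join(sys.prefix, 'nltk_data'),
            os.path.join(sys.prefix, 'lib', 'nltk_data'),
            os.path.join(os.environ.get('APPDATA', 'C:\\'), 'nltk_data'),
        ]
    return [
        '/usr/share/nltk_data',
        '/usr/local/share/nltk_data',
        '/usr/lib/nltk_data',
        '/usr/local/lib/nltk_data',
    ]

# Show each candidate and whether it actually exists on this machine
for path in candidate_nltk_paths():
    print(path, os.path.isdir(path))
```

Running this tells you at a glance which of the standard locations is present on your own system.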

Within here, you have all of the available corpora, including things like books, chat logs, movie reviews, and a whole lot more.
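If you would rather not hunt for the directory by hand, a small stdlib helper can probe the usual spots and list what is installed (the candidate paths below are assumptions about a typical install; adjust them to your setup):

```python
import os

def find_corpora_dir(candidates):
    """Return the first existing corpora directory from candidates, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Typical locations; the home-directory one is where nltk.download()
# usually puts things on a per-user install.
candidates = [
    os.path.join(os.path.expanduser('~'), 'nltk_data', 'corpora'),
    os.path.join(os.environ.get('APPDATA', ''), 'nltk_data', 'corpora'),
    '/usr/share/nltk_data/corpora',
    '/usr/local/share/nltk_data/corpora',
]

corpora_dir = find_corpora_dir(candidates)
if corpora_dir:
    # Show the first ten corpora alphabetically
    print(sorted(os.listdir(corpora_dir))[:10])
else:
    print('no nltk_data/corpora directory found')
```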

Now, we're going to talk about accessing these documents via NLTK.
As you can see, these are mostly text documents, so you could just use
normal Python code to open and read them. That said, the NLTK
module has a few nice methods for handling the corpus, so you may find
it useful to use its corpus readers. Here's an example of opening the
King James Bible from the Gutenberg corpus and printing the first few sentences:

from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize

# Load the raw text of the King James Bible from the Gutenberg corpus
sample = gutenberg.raw("bible-kjv.txt")

# Split the raw text into sentences
tok = sent_tokenize(sample)

# Print the first five sentences
for x in range(5):
    print(tok[x])
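As noted above, these corpora are mostly plain text, so ordinary Python file I/O works just as well. Here is a minimal sketch that reads the first few lines of a corpus file directly; the bible-kjv.txt path is an assumption about where your nltk_data lives:

```python
import os
from itertools import islice

def preview_corpus_file(path, n_lines=5, encoding='utf-8'):
    """Read the first few lines of a plain-text corpus file directly."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip('\n') for line in islice(f, n_lines)]

# Hypothetical location; substitute wherever your nltk_data lives.
bible = os.path.expanduser('~/nltk_data/corpora/gutenberg/bible-kjv.txt')
if os.path.exists(bible):
    for line in preview_corpus_file(bible):
        print(line)
```

The difference from the NLTK example above is that this gives you raw lines, not sentences; for sentence boundaries you still want sent_tokenize.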

One of the more advanced data sets in here is "wordnet." WordNet is a collection of words, definitions, examples of their use, synonyms, antonyms, and more. We'll dive into using WordNet next.

Date: 2024-10-11 12:47:10
