Natural Language 18.1_Named Entity Recognition with NLTK

https://www.pythonprogramming.net/named-entity-recognition-nltk-tutorial/?completed=/chinking-nltk-tutorial/

Named Entity Recognition with NLTK

One of the major forms of chunking in natural language
processing is called "Named Entity Recognition." The idea is to have the
machine immediately pull out "entities" like people, places,
organizations, monetary figures, and more.

This can be a bit of a challenge, but NLTK has this built in for
us. There are two major options with NLTK's named entity recognition:
either recognize all named entities without distinguishing them, or
recognize named entities as their respective type, like people, places,
organizations, etc.

Here's an example:

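import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Requires the NLTK data for the state_union corpus, the Punkt tokenizer,
# the POS tagger, and the NE chunker (install with nltk.download();
# exact package names vary by NLTK version).

# Train the Punkt sentence tokenizer on one speech, then apply it to another.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        # Skip the first five sentences of the speech.
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # binary=True marks each chunk as a named entity or not, with no type.
            named_ent = nltk.ne_chunk(tagged, binary=True)
            # Opens a window showing the chunk tree for this sentence.
            named_ent.draw()
    except Exception as e:
        print(str(e))

process_content()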

Here, binary=True means that each chunk is either a named entity or not, with no further detail: in the tree that draw() displays, every entity is simply labeled NE.

If you set binary=False, the entities are labeled by type instead.

Comparing the two outputs, you can see a few things right away. With binary=False, the chunker picked up the same things but wound up splitting terms like White House into "White" and "House" as if they were different entities, whereas with binary=True it correctly treated White House as a single named entity.
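Since draw() opens a separate window for every sentence, it can be handy to collect the entities as plain text instead. Here's a minimal sketch, assuming the tree returned by nltk.ne_chunk behaves like nltk.Tree: entity chunks are subtrees with label() and leaves(), and everything else is a plain (word, tag) tuple. The helper name extract_entities is hypothetical and not part of NLTK.

```python
def extract_entities(chunk_tree):
    """Collect (entity_text, label) pairs from a chunked sentence.

    Assumes chunk_tree is iterable like nltk.Tree: named-entity chunks
    are subtrees exposing label() and leaves(), while ordinary tokens
    are plain (word, tag) tuples.
    """
    entities = []
    for node in chunk_tree:
        # Tuples have no label() method, so only subtrees pass this check.
        if hasattr(node, "label"):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hypothetical usage inside process_content(), in place of draw():
#     named_ent = nltk.ne_chunk(tagged, binary=True)
#     print(extract_entities(named_ent))
```

With binary=True every label will be NE; with binary=False the labels will be the entity types instead.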

Depending on your goals, you can set the binary option either way. Here are the types of named entities that you can get with binary=False:

NE Type and Examples

ORGANIZATION - Georgia-Pacific Corp., WHO

PERSON - Eddy Bonte, President Obama

LOCATION - Murray River, Mount Everest

DATE - June, 2008-06-29

TIME - two fifty a m, 1:30 p.m.

MONEY - 175 million Canadian Dollars, GBP 10.40

PERCENT - twenty pct, 18.75 %

FACILITY - Washington Monument, Stonehenge

GPE - South East Asia, Midlothian
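Once entities carry type labels, a natural next step is to group them by type. Here's a rough sketch, assuming you have already collected (entity text, label) pairs from the chunk tree. The function name group_by_type is hypothetical, and the sample pairs are hand-written from the examples in the list above, not actual chunker output.

```python
from collections import defaultdict

def group_by_type(entities):
    """Group (entity_text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Hand-written sample pairs (illustrative only, not real ne_chunk output).
sample = [
    ("Murray River", "LOCATION"),
    ("WHO", "ORGANIZATION"),
    ("Mount Everest", "LOCATION"),
]

grouped = group_by_type(sample)
print(grouped)
# {'LOCATION': ['Murray River', 'Mount Everest'], 'ORGANIZATION': ['WHO']}
```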

Either way, you will probably find that you need to do a bit more
work to get it just right, but this is pretty powerful right out of the
box.

In the next tutorial, we're going to talk about something similar to stemming, called lemmatizing.

Time: 2024-10-13 01:29:27
