Natural Language 18: Named-entity recognition

https://en.wikipedia.org/wiki/Named-entity_recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, time expressions, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
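
For illustration, a rough approximation of this annotation can be produced with an off-the-shelf toolkit. The sketch below uses NLTK's pre-trained chunker (assuming the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have been downloaded); its label set and coverage depend on the pre-trained model, so the output only approximates the hand annotation above.

# Minimal sketch: off-the-shelf NER with NLTK's pre-trained chunker.
import nltk

sentence = "Jim bought 300 shares of Acme Corp. in 2006."

tokens = nltk.word_tokenize(sentence)   # ['Jim', 'bought', '300', ...]
tagged = nltk.pos_tag(tokens)           # [('Jim', 'NNP'), ('bought', 'VBD'), ...]
tree = nltk.ne_chunk(tagged)            # nltk.Tree with labelled entity subtrees

# Collect (entity text, entity type) pairs from the labelled subtrees.
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(entity, subtree.label())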

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.[1][2]

Problem definition

In the expression named entity, the word named restricts the task to those entities for which one or more rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as terms for certain biological species and substances.[3]

Full named-entity recognition is often broken down, conceptually and possibly also in implementations,[4] into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location and other[5]). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking.
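
The CoNLL shared tasks encode this segmentation view with BIO (also called IOB) tags: each token is marked as the Beginning of an entity, Inside an entity, or Outside any entity. Below is a small illustrative sketch (the sentence and labels are invented for the example) showing how "Bank of America" is represented as one flat, non-nested span and decoded back into entity spans.

# Illustrative sketch of BIO (IOB2) encoding for NER as a segmentation task.
# "Bank of America" is one contiguous, non-nested span; the embedded
# name "America" gets no separate tag of its own.
tokens = ["Yesterday", "Bank",  "of",    "America", "raised", "rates", "."]
tags   = ["O",         "B-ORG", "I-ORG", "I-ORG",   "O",      "O",     "O"]

def bio_to_spans(tokens, tags):
    """Decode BIO tags back into (entity text, entity type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # trailing sentinel flushes the last span
        if (tag.startswith("B-") or tag == "O") and start is not None:
            spans.append((" ".join(tokens[start:i]), etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

print(bio_to_spans(tokens, tags))   # [('Bank of America', 'ORG')]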

Temporal expressions and some numerical expressions (i.e., money, percentages, etc.) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.[6]

Certain hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002 for question answering, consist of 29 types and 64 subtypes.[7] Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.[8] More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.[9]

Formal evaluation

To evaluate the quality of a NER system's output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when their last name follows is scored as ½ accuracy).

In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:[5]

  • Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. That is, when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, the precision contribution of each predicted name is zero. Precision is then averaged over all predicted entity names.
  • Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions.
  • F1 score is the harmonic mean of these two.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute to either precision or recall.
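
A minimal sketch of this exact-match scoring follows; representing entities as (start, end, type) spans is a choice made here for illustration, not part of the CoNLL specification.

# Sketch of CoNLL-style exact-match evaluation: an entity counts as correct
# only if both its span boundaries and its type match the gold standard.
def exact_match_scores(gold, predicted):
    """gold, predicted: sets of (start_token, end_token, entity_type) tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# "[Person Hans] [Person Blick]" predicted where "[Person Hans Blick]" was required:
gold = {(0, 2, "PER")}
pred = {(0, 1, "PER"), (1, 2, "PER")}
print(exact_match_scores(gold, pred))   # (0.0, 0.0, 0.0) -- a hard error on both sides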

Evaluation models based on token-by-token matching have been proposed.[10] Such models can also handle partially overlapping matches, while still fully rewarding only exact matches. They allow a finer-grained evaluation and comparison of extraction systems, taking into account the degree of mismatch in non-exact predictions.
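
As a rough illustration of this idea (not the specific scheme of [10]), token-level precision and recall can be computed over labelled tokens rather than whole spans, so a partially overlapping prediction earns partial credit instead of counting as a hard error.

# Rough sketch of token-level scoring: credit is given per labelled token,
# so a prediction that partially overlaps the gold span is partially rewarded.
def token_level_scores(gold, predicted):
    """gold, predicted: sets of (token_index, entity_type) pairs."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Gold: tokens 0-1 form one PER entity; the prediction labels only token 0.
gold = {(0, "PER"), (1, "PER")}
pred = {(0, "PER")}
print(token_level_scores(gold, pred))   # (1.0, 0.5, 0.666...)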

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches have been suggested to avoid part of the annotation effort.[11][12]

Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.[13]
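
A common formulation trains a linear-chain CRF over per-token feature dictionaries (word identity, capitalization, neighbouring words, etc.) with BIO labels as outputs. The sketch below assumes the third-party sklearn-crfsuite package and a toy one-sentence training set; a real system would be trained on a corpus such as CoNLL-2003 with a much richer feature set.

# Sketch of CRF-based NER with the third-party sklearn-crfsuite package
# (pip install sklearn-crfsuite); the tiny training set is illustrative only.
import sklearn_crfsuite

def token_features(tokens, i):
    """Typical hand-crafted features for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

train_sentences = [
    (["Jim", "bought", "300", "shares", "of", "Acme", "Corp."],
     ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG"]),
]
X_train = [[token_features(toks, i) for i in range(len(toks))]
           for toks, _ in train_sentences]
y_train = [labels for _, labels in train_sentences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

test = ["Mary", "works", "at", "Acme", "Corp."]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))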

Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.[14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been names of genes and gene products. There has also been considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.[15]

Current challenges and research

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing the annotation labor through semi-supervised learning,[11][16] achieving robust performance across domains,[17][18] and scaling up to fine-grained entity types.[8][19] In recent years, many projects have turned to crowdsourcing, which is a promising way to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.[21]

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia[22][23][24] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> is a professor at  <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>
