


Author | George Seif

Translator | Xiaowen

An easy introduction to Natural Language Processing

Using computers to understand human language






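Natural language processing (NLP) is a subfield of artificial intelligence that aims to enable computers to understand and process human language, bringing them closer to a human-level understanding of language. Computers do not yet have the intuitive grasp of natural language that humans do; they cannot really understand what the language is trying to say. In short, a computer cannot read between the lines.

Even so, recent advances in machine learning (ML) have enabled computers to do many useful things with natural language! Deep learning lets us write programs that perform tasks such as language translation, semantic understanding, and text summarization. All of these add real-world value, making it easy to understand and run computations on large blocks of text without doing the work by hand.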




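“Steph Curry was on fire last night. He totally destroyed the other team”

To a human, the meaning of this sentence is obvious. We know Steph Curry is a basketball player, and even if you don't, you can tell he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed", we know it means Steph Curry played really well last night and beat the other team.

Computers tend to take things too literally. Reading literally, a computer would see "Steph Curry" and, based on the capitalization, assume it is a person, a place, or something else important. But then it sees that Steph Curry "was on fire"... A computer might tell you that someone literally set Steph Curry on fire yesterday! Yikes. After that, the computer might say that Curry has destroyed the other team... they no longer exist... great...

Steph Curry really is on fire!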
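To make the literal reading concrete, here is a minimal sketch of the capitalization heuristic described above. It is only an illustration, not how a real NLP library works; the function name `capitalized_spans` and the regular expression are assumptions made for this example.

```python
import re

# Illustrative heuristic only: treat any run of capitalized words as "something important".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def capitalized_spans(sentence):
    """Return every run of consecutive capitalized words in the sentence."""
    return CAPITALIZED_RUN.findall(sentence)


sentence = "Steph Curry was on fire last night. He totally destroyed the other team"
print(capitalized_spans(sentence))
```

The heuristic picks up "Steph Curry", but it also flags "He" simply because it starts a sentence, and it says nothing at all about what "on fire" means.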




Amazon.com, Inc., doing business as Amazon, is an American electronic commerce and cloud computing company based in Seattle, Washington, that was founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet retailer in the world as measured by revenue and market capitalization, and second largest after Alibaba Group in terms of total sales. The amazon.com website started as an online bookstore and later diversified to sell video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry. The company also produces consumer electronics - Kindle e-readers, Fire tablets, Fire TV, and Echo - and is the world's largest provider of cloud infrastructure services (IaaS and PaaS). Amazon also sells certain low-end products under its in-house brand AmazonBasics.


First, we will install a few useful Python NLP libraries that will help us analyze this text.

### Installing spaCy, general Python NLP lib 

pip3 install spacy 

### Downloading the English dictionary model for spaCy 

python3 -m spacy download en_core_web_lg 

### Installing textacy, basically a useful add-on to spaCy 

pip3 install textacy
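
Before moving on, it can help to confirm that the three installs above succeeded. The following is a small sketch using only the standard library; it assumes the downloaded model is installed as an importable package named `en_core_web_lg`, and the helper name `check_installed` is made up for this example.

```python
import importlib.util

# Package names taken from the install commands above.
REQUIRED = ("spacy", "textacy", "en_core_web_lg")


def check_installed(names=REQUIRED):
    """Map each package name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_installed().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

Anything reported as missing can be installed again with the matching command above.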




# coding: utf-8

import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = (
    "Amazon.com, Inc., doing business as Amazon, is an American electronic "
    "commerce and cloud computing company based in Seattle, Washington, that was "
    "founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet "
    "retailer in the world as measured by revenue and market capitalization, and "
    "second largest after Alibaba Group in terms of total sales. The amazon.com "
    "website started as an online bookstore and later diversified to sell video "
    "downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, "
    "software, video games, electronics, apparel, furniture, food, toys, and "
    "jewelry. The company also produces consumer electronics - Kindle e-readers, "
    "Fire tablets, Fire TV, and Echo - and is the world's largest provider of "
    "cloud infrastructure services (IaaS and PaaS). Amazon also sells certain "
    "low-end products under its in-house brand AmazonBasics."
)

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Print out all the named entities that were detected
for entity in document.ents:
    print(entity.text, entity.label_)

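We first load spaCy's trained ML model and initialize the text we want to process. We then run the ML model on the text to extract entities. When you run that code, you get the following output: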

Amazon.com, Inc. ORG
Amazon ORG
American NORP
Seattle GPE
Washington GPE
Jeff Bezos PERSON
July 5, 1994 DATE
second ORDINAL
Alibaba Group ORG
amazon.com ORG
Echo -  LOC
Amazon ORG
AmazonBasics ORG

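The short codes next to each piece of text[1] are labels indicating the type of entity we are looking at. It looks like our model did a good job! Jeff Bezos is indeed a person, the date is correct, Amazon is an organization, and Seattle and Washington are both geopolitical entities (i.e. countries, cities, states, etc.). The only tricky part is that things like Fire TV and Echo are actually products, not organizations. The model also missed the other items Amazon sells, "video downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, software, video games, electronics, apparel, furniture, food, toys, and jewelry", probably because they sit in one huge list and therefore look relatively unimportant.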
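
To see at a glance which entities fall under each label, the pairs can be grouped by label. This is a minimal sketch that uses only the standard library and hard-codes the pairs from the output listed above; the function name `group_by_label` is an assumption for illustration. With a live spaCy document, the pairs would presumably come from `(ent.text, ent.label_)` for each `ent` in `document.ents`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output listed above.
entities = [
    ("Amazon.com, Inc.", "ORG"),
    ("Amazon", "ORG"),
    ("American", "NORP"),
    ("Seattle", "GPE"),
    ("Washington", "GPE"),
    ("Jeff Bezos", "PERSON"),
    ("July 5, 1994", "DATE"),
    ("second", "ORDINAL"),
    ("Alibaba Group", "ORG"),
    ("amazon.com", "ORG"),
    ("Echo -", "LOC"),
    ("Amazon", "ORG"),
    ("AmazonBasics", "ORG"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


for label, texts in group_by_label(entities).items():
    print(label, texts)
```

Grouping this way makes odd results easier to spot, such as "Echo -" being tagged as a location.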




# coding: utf-8

import spacy

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = (
    "Amazon.com, Inc., doing business as Amazon, is an American electronic "
    "commerce and cloud computing company based in Seattle, Washington, that was "
    "founded by Jeff Bezos on July 5, 1994. The tech giant is the largest Internet "
    "retailer in the world as measured by revenue and market capitalization, and "
    "second largest after Alibaba Group in terms of total sales. The amazon.com "
    "website started as an online bookstore and later diversified to sell video "
    "downloads/streaming, MP3 downloads/streaming, audiobook downloads/streaming, "
    "software, video games, electronics, apparel, furniture, food, toys, and "
    "jewelry. The company also produces consumer electronics - Kindle e-readers, "
    "Fire tablets, Fire TV, and Echo - and is the world's largest provider of "
    "cloud infrastructure services (IaaS and PaaS). Amazon also sells certain "
    "low-end products under its in-house brand AmazonBasics."
)

### Replace a specific entity with the word "PRIVATE"
def replace_entity_with_placeholder(token):
    if token.ent_iob != 0 and (token.ent_type_ == "PERSON" or token.ent_type_ == "ORG"):
        return "[PRIVATE] "
    ### text_with_ws (token text plus trailing whitespace) replaces the older token.string
    return token.text_with_ws

### Loop through all the entities in a piece of text and apply entity replacement
def scrub(text):
    doc = nlp(text)
    ### Merge each multi-word entity into a single token so it is replaced only once
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_entity_with_placeholder, doc)
    return "".join(tokens)

print(scrub(text))
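
The same scrubbing idea can also be expressed over character spans instead of tokens. The sketch below is standard-library only and works on already-detected spans; the sample sentence, the `span_of` helper, and the function name `scrub_spans` are all hypothetical. With spaCy, the spans could presumably be built from each entity's `start_char`, `end_char`, and `label_`.

```python
def scrub_spans(text, spans, hidden_labels=("PERSON", "ORG")):
    """Replace each (start, end, label) span whose label is hidden with "[PRIVATE]".

    Spans are character offsets, must not overlap, and may arrive in any order.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append("[PRIVATE]" if label in hidden_labels else text[start:end])
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces)


def span_of(text, phrase, label):
    """Hypothetical helper: locate a phrase and return its (start, end, label) span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Hypothetical sample sentence and hand-labeled spans, for illustration only.
sample = "Jeff Bezos founded Amazon in Seattle."
spans = [
    span_of(sample, "Jeff Bezos", "PERSON"),
    span_of(sample, "Amazon", "ORG"),
    span_of(sample, "Seattle", "GPE"),
]
print(scrub_spans(sample, spans))
```

Working on character offsets leaves the original spacing and punctuation untouched, and labels outside `hidden_labels` (here the `GPE` span) pass through unchanged.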






# coding: utf-8 

import spacy
import textacy.extract 

### Load spaCy's English NLP model
nlp = spacy.load('en_core_web_lg')

### The text we want to examine
text = """Washington, D.C., formally the District of Columbia and commonly referred to as Washington or D.C., is the capital of the United States of America.[4] Founded after the American Revolution as the seat of government of the newly independent country, Washington was named after George Washington, first President of the United States and Founding Father.[5] Washington is the principal city of the Washington metropolitan area, which has a population of 6,131,977.[6] As the seat of the United States federal government and several international organizations, the city is an important world political capital.[7] Washington is one of the most visited cities in the world, with more than 20 million annual tourists.[8][9]
The signing of the Residence Act on July 16, 1790, approved the creation of a capital district located along the Potomac River on the country's East Coast. The U.S. Constitution provided for a federal district under the exclusive jurisdiction of the Congress and the District is therefore not a part of any state. The states of Maryland and Virginia each donated land to form the federal district, which included the pre-existing settlements of Georgetown and Alexandria. Named in honor of President George Washington, the City of Washington was founded in 1791 to serve as the new national capital. In 1846, Congress returned the land originally ceded by Virginia; in 1871, it created a single municipal government for the remaining portion of the District.
Washington had an estimated population of 693,972 as of July 2017, making it the 20th largest American city by population. Commuters from the surrounding Maryland and Virginia suburbs raise the city's daytime population to more than one million during the workweek. The Washington metropolitan area, of which the District is the principal city, has a population of over 6 million, the sixth-largest metropolitan statistical area in the country.
All three branches of the U.S. federal government are centered in the District: U.S. Congress (legislative), President (executive), and the U.S. Supreme Court (judicial). Washington is home to many national monuments and museums, which are primarily situated on or around the National Mall. The city hosts 177 foreign embassies as well as the headquarters of many international organizations, trade unions, non-profit, lobbying groups, and professional associations, including the Organization of American States, AARP, the National Geographic Society, the Human Rights Campaign, the International Finance Corporation, and the American Red Cross.
A locally elected mayor and a 13-member council have governed the District since 1973. However, Congress maintains supreme authority over the city and may overturn local laws. D.C. residents elect a non-voting, at-large congressional delegate to the House of Representatives, but the District has no representation in the Senate. The District receives three electoral votes in presidential elections as permitted by the Twenty-third Amendment to the United States Constitution, ratified in 1961."""

### Parse the text with spaCy
### Our 'document' variable now contains a parsed version of text.
document = nlp(text)

### Extracting semi-structured statements
### entity and cue are passed by keyword, since newer textacy releases require that;
### cue="be" was the default in older releases
statements = textacy.extract.semistructured_statements(document, entity="Washington", cue="be")

print("**** Information from Washington's Wikipedia page ****")
count = 1
for statement in statements:
    subject, verb, fact = statement
    print(str(count) + " - Statement: ", statement)
    print(str(count) + " - Fact: ", fact)
    count += 1
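
For intuition about what a "semi-structured statement" looks like, here is a deliberately naive stand-in built on a regular expression. It is not textacy's algorithm, which works on the parsed document; the function name `naive_statements`, the cue list, and the sample sentences are assumptions for illustration only.

```python
import re


def naive_statements(text, entity, cues=("is", "was")):
    """Return (entity, cue, fact) triples for clauses shaped like '<entity> <cue> <fact>.'"""
    pattern = re.compile(
        rf"\b{re.escape(entity)}\s+({'|'.join(map(re.escape, cues))})\s+([^.]+)\."
    )
    return [(entity, cue, fact.strip()) for cue, fact in pattern.findall(text)]


# Short hypothetical sample, loosely modeled on the text above.
sample = (
    "Washington is the capital of the United States. "
    "Washington was named after George Washington. "
    "The city hosts 177 foreign embassies."
)
for subject, verb, fact in naive_statements(sample, "Washington"):
    print(subject, "|", verb, "|", fact)
```

Because the pattern stops at the first period, it breaks on abbreviations such as "D.C." and cannot see grammar at all, which is exactly the kind of work the parsed spaCy document does for textacy.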









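If you want to play around with NLP more on your own, the spaCy documentation[2] and the textacy documentation[3] are great places to start! You will find many examples of ways to work with parsed text and extract very useful information from it. Everything is fast and simple, and you can get a lot of value out of it. It is time to do bigger and better things with deep learning!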


[1] https://spacy.io/usage/linguistic-features#entity-types







Time: 2025-01-07 10:19:57


