Notes on my recent research into keyword extraction

Source: http://blog.csdn.net/caohao2008/article/details/3144639

Summary of earlier work

Goals: First, find the papers with a proposal character and summarize the classic methods. Second, if we want to apply one of them, which is more practical or easier to implement, and which is more interesting from a research perspective?

First, a paper that gives a good, fairly complete overview of the classic features for keyword extraction: 《Finding Advertising Keywords on Web Pages》.

Concept-based keyword extraction, which uses concepts and categories to assist the extraction. Classic papers: 《Discovering Key Concepts in Verbose Queries》, 《A study on automatically extracted keywords in text categorization》.

Query-log-based keyword extraction: 《Using the wisdom of the crowds for keyword generation》, 《Keyword Extraction for Contextual Advertisement》.

Keyword expansion and keyword generation: 《Keyword Generation for Search Engine Advertising using Semantic Similarity》, 《Using the wisdom of the crowds for keyword generation》, 《n-Keyword based Automatic Query Generation》.

Second, commonly used features that previous researchers have proposed:

Features used in 《Finding Advertising Keywords on Web Pages》:

1. Linguistic features (part-of-speech tags)

2. Capitalization of the first letter

3. Whether the candidate appears in hypertext (anchor text)

4. Whether the candidate appears in the meta data

5. Whether the candidate appears in the title

6. Whether the candidate appears in the URL

7. TF and DF

8. Position of the candidate within the document

9. Length of the sentence containing the candidate, and document length

10. Length of the candidate phrase

11. Query log
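Several of the binary features in this list amount to simple membership tests against different regions of the page. A minimal sketch, where the `page` dict is a hypothetical representation of an already-parsed HTML page (my own assumption, not the paper's actual pipeline):

```python
# Hypothetical sketch of binary page-structure features for a candidate
# phrase. `page` is an assumed dict produced by some HTML parser.

def page_structure_features(phrase, page):
    p = phrase.lower()
    return {
        "in_title": p in page["title"].lower(),
        "in_url": p.replace(" ", "") in page["url"].lower(),
        "in_meta_keywords": any(p in m.lower() for m in page["meta_keywords"]),
        "in_anchor_text": any(p in a.lower() for a in page["anchor_texts"]),
    }

page = {
    "title": "Finding Advertising Keywords on Web Pages",
    "url": "http://example.com/advertising-keywords",
    "meta_keywords": ["keyword extraction", "advertising"],
    "anchor_texts": ["related work on keyword extraction"],
}
print(page_structure_features("keyword extraction", page))
```

The paper's sections below refine each of these into several MoS/DeS variants; this sketch only shows the whole-phrase membership versions.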

Features I came up with:

1. Surrounding information content: the average information content of the few nearby words, or even of the whole surrounding sentence.

2. Semantic distance, measured via co-occurrence.

3. Named entities (NE); I have used these before in information-extraction work.

4. Relations between keywords, i.e. their semantic distance. Is a larger divergence better, a smaller one, or does it make no difference?
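Feature 1 above could be prototyped, for instance, as the average self-information of the words in a small window around the candidate, with unigram probabilities estimated from the document itself. This is my own rough interpretation of the idea, not an established method:

```python
import math
from collections import Counter

def avg_neighbor_information(tokens, idx, window=2):
    """Average self-information, -log2 p(w), of the words within
    `window` positions of tokens[idx]. p(w) is a unigram estimate
    from the document itself -- a crude proxy for how informative
    the candidate's surroundings are."""
    counts = Counter(tokens)
    total = len(tokens)
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    neighbors = [w for i, w in enumerate(tokens[lo:hi], start=lo) if i != idx]
    if not neighbors:
        return 0.0
    return sum(-math.log2(counts[w] / total) for w in neighbors) / len(neighbors)

tokens = "the quick brown fox jumps over the lazy dog".split()
print(round(avg_neighbor_information(tokens, 3), 3))
```

A window over the whole sentence, or probabilities from a background corpus, would be natural variations.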

2.3.2.1 Lin: linguistic features.

The linguistic information used in feature extraction includes two types of POS tags – noun (NN & NNS) and proper noun (NNP & NNPS) – and one type of chunk – noun phrase (NP). The variations used in MoS are: whether the phrase contains these POS tags; whether all the words in that phrase share the same POS tag (either proper noun or noun); and whether the whole candidate phrase is a noun phrase. For DeS, they are: whether the word has the POS tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.
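As a minimal sketch of the MoS variant (assuming candidate phrases arrive already POS-tagged with Penn Treebank tags; the NP chunk features are omitted here):

```python
# Sketch of the MoS linguistic features, assuming the candidate phrase
# is already POS-tagged with Penn Treebank tags (NN/NNS = noun,
# NNP/NNPS = proper noun). Chunk (NP) features are not covered.

NOUN_TAGS = {"NN", "NNS"}
PROPER_TAGS = {"NNP", "NNPS"}

def lin_features_mos(tagged_phrase):
    """tagged_phrase: list of (word, tag) pairs for one candidate."""
    tags = [t for _, t in tagged_phrase]
    return {
        "contains_noun": any(t in NOUN_TAGS for t in tags),
        "contains_proper_noun": any(t in PROPER_TAGS for t in tags),
        "all_nouns": all(t in NOUN_TAGS for t in tags),
        "all_proper_nouns": all(t in PROPER_TAGS for t in tags),
    }

print(lin_features_mos([("keyword", "NN"), ("extraction", "NN")]))
```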

2.3.2.2 C: capitalization.

Whether a word is capitalized is an indication of being part of a proper noun, or an important word. This set of features for MoS is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For DeS, it is simply whether the word is capitalized.
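The three MoS capitalization features translate almost directly into code; a sketch:

```python
def cap_features_mos(phrase_words):
    """The three MoS capitalization features for a candidate phrase,
    given as a list of word strings."""
    caps = [w[:1].isupper() for w in phrase_words]
    return {
        "all_capitalized": all(caps),
        "first_capitalized": caps[0] if caps else False,
        "any_capitalized": any(caps),
    }

print(cap_features_mos(["New", "York", "hotels"]))
```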

2.3.2.3 H: hypertext.

Whether a candidate phrase or word is part of the anchor text for a hypertext link is extracted as the following features. For MoS, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For DeS, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.

2.3.2.4 Ms: meta section features.

The header of an HTML document may provide additional information embedded in meta tags. Although the text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For MoS, the feature is whether the whole candidate phrase is in the meta section. For DeS, they are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.

2.3.2.5 T: title.

The only human readable text in the HTML header is the TITLE, which is usually put in the window caption by the browser. For MoS, the feature is whether the whole candidate phrase is in the title. For DeS, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.

2.3.2.6 M: meta features.

In addition to TITLE, several meta tags are potentially related to keywords, and are used to derive features. In the MoS framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For DeS, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.

2.3.2.7 U: URL.

A web document has one additional highly useful property – the name of the document, which is its URL. For MoS, the features are: whether the whole candidate phrase is in part of the URL string; and whether any word of the candidate phrase is in the URL string. In the DeS framework, the feature is whether the word is in the URL string.

2.3.2.8 IR: information retrieval oriented features.

We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF + 1) and log(DF + 1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, where for DeS, the "term" is the candidate word.
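A toy version of these TF/DF features, with a small in-memory list of token lists standing in for the paper's web-page collection:

```python
import math
from collections import Counter

def ir_features(term, doc_tokens, collection):
    """TF, DF and their log-smoothed variants for one candidate term.
    `collection` is a list of token lists standing in for the
    web-page collection used in the paper."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in collection if term in doc)
    return {"tf": tf, "df": df,
            "log_tf": math.log(tf + 1), "log_df": math.log(df + 1)}

docs = [["keyword", "extraction", "keyword"],
        ["advertising", "keyword"],
        ["web", "pages"]]
print(ir_features("keyword", docs[0], docs))
```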

2.3.2.9 Loc: relative location of the candidate.

The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably, we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position, while the whole document contains 200 words, the ratio is then 0.05. These features used for the monolithic and decomposed frameworks are the same. When the candidate is a phrase, its first word is used as its location. There are three different relative locations used as features: wordRatio – the relative location of the candidate in the sentence; sentRatio – the location of the sentence containing the candidate divided by the total number of sentences in the document; wordDocRatio – the relative location of the candidate in the document. In addition to these 3 real-valued features, we also use their logarithms as features. Specifically, we use log(1 + wordRatio), log(1 + sentRatio), and log(1 + wordDocRatio).
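A sketch of these location ratios, assuming 1-based positions (which reproduces the paper's 10th-word-of-200 example):

```python
import math

def loc_features(word_pos, sent_len, sent_pos, num_sents, doc_pos, doc_len):
    """Relative-location features; all positions are 1-based counts.
    wordRatio    = position of the word within its sentence
    sentRatio    = position of the sentence within the document
    wordDocRatio = position of the word within the whole document
    """
    feats = {
        "wordRatio": word_pos / sent_len,
        "sentRatio": sent_pos / num_sents,
        "wordDocRatio": doc_pos / doc_len,
    }
    for name, value in list(feats.items()):
        feats["log_" + name] = math.log(1 + value)
    return feats

# The paper's example: the 10th word of a 200-word document -> 0.05
print(loc_features(3, 12, 1, 20, 10, 200)["wordDocRatio"])  # 0.05
```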

2.3.2.10 Len: sentence and document length.

The length (in words) of the sentence (sentLen) where the candidate occurs, and the length of the whole document (docLen) (words in the header are not included) are used as features. Similarly, log(1 + sentLen) and log(1 + docLen) are also included.

2.3.2.11 phLen: length of the candidate phrase.

For the monolithic framework, the length of the candidate phrase (phLen) in words and log(1+phLen) are included as features. These features are not used in the decomposed framework.

2.3.2.12 Q: query log.

The query log of a search engine reflects the distribution of the keywords people are most interested in. We use this information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries. For the monolithic framework, we consider one binary feature – whether the phrase appears in the query log, and two real-valued features – the frequency with which it appears and the log value, log(1 + frequency). For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears in the query log file as an interior word of a query; and whether the word appears in the query log file as the last word of a query. The frequency values of the above features and their log values (log(1 + f), where f is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entries is also a feature.
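The monolithic query-log features might look like the following sketch, where a small dict stands in for the 7.5-million-entry query log (the dict contents are invented for illustration):

```python
import math

# Toy stand-in for the 7.5M-entry query log: query -> frequency.
QUERY_LOG = {"keyword extraction": 1200, "online advertising": 800}

def query_log_features_mos(phrase, log=QUERY_LOG):
    """Monolithic query-log features: one binary, two real-valued."""
    freq = log.get(phrase.lower(), 0)
    return {"in_query_log": freq > 0,
            "frequency": freq,
            "log_frequency": math.log(1 + freq)}

print(query_log_features_mos("Keyword Extraction"))
```

The decomposed variants would repeat the same lookup for first/interior/last positions of a word within each logged query.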

Notes from the discussion with my senior labmate

Background and applications:

Content-based advertising-keyword recommendation, e.g. the online advertising systems of Google, Yahoo, eBay, etc.

Question-answering systems

Keyword substitution and expansion

Simplifying, adjusting, and reorganizing verbose queries

Assisting text classification

Assisting topic tracking

Feature selection:

1. Linguistic features: POS (part-of-speech) tagging, marking each word as a noun, verb, adverb, adjective, etc.

2. Title: whether the keyword appears in the document's title.

3. Position: where the keyword occurs in the document, e.g. in the first or last sentence of the article or of a paragraph. 《Automatic Keyword Extraction Using Linguistic Features》 describes this approach in detail.

4. TF and IDF: the most basic information-weighting features.

5. NE: whether the keyword is a named entity such as a person or place name, or date/time information such as year, month, day, or time of day.

6. Relations between keywords: for the semantic distance between keywords, is larger better, smaller better, or irrelevant?

7. Information content of surrounding words: do the few words near the candidate carry much information? Put differently, how informative is the candidate's sentence relative to the whole article?

8. Whether the keyword has appeared among other documents' keywords: its probability of occurring as a keyword.

9. The document's category: see classification-based and concept-based keyword extraction.

10. Whether the word appears in a summarizing sentence.

Questions about named entities (NE):

1. Used in the paper 《News-Oriented Automatic Chinese Keyword Indexing》.

2. NEs carry very high information content.

3. NEs are highly discriminative.

Issues worth noting and discussing:

1. How should a keyword be defined: maximum discriminative power, or maximum information content?

2. The impact of word segmentation: the granularity of TF, and the inherent errors of segmentation itself. 《Chinese keyword extraction based on max-duplicated Strings of the Documents》 addresses this by finding the maximal repeated substrings.

《News-Oriented Automatic Chinese Keyword Indexing》 describes Chinese keyword extraction and is a true classic. It proposes counting character frequencies before segmentation, which addresses the problems caused by inaccurate segmentation and segmentation granularity. It also discusses methods for filtering keywords: POS-tag the word strings, then filter out words whose POS carries little information, such as conjunctions and adverbs.
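The max-duplicated-strings idea mentioned above can be sketched at the character level, sidestepping segmentation entirely. This is my own simplified reading (substrings of length 2–8 only), not the paper's exact algorithm:

```python
from collections import Counter

def max_duplicated_strings(text, min_len=2, max_len=8, min_count=2):
    """Character-level repeated substrings: count every substring of
    length min_len..max_len, keep those occurring at least min_count
    times, then drop any substring contained in an equally frequent
    longer one. A simplified sketch, not the paper's exact algorithm."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + min_len, min(n, i + max_len) + 1):
            counts[text[i:j]] += 1
    dup = {s: c for s, c in counts.items() if c >= min_count}
    maximal = [s for s in dup
               if not any(s != t and s in t and dup[t] == dup[s] for t in dup)]
    return sorted(maximal, key=lambda s: (-dup[s], -len(s)))

print(max_duplicated_strings("关键词提取是关键词提取研究"))  # → ['关键词提取']
```

A suffix array or suffix tree would replace the quadratic substring enumeration for real documents.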

On how to pick the most effective subset of the features chosen above, see 《Multi-Subset Selection for Keyword Extraction and Other Prototype Search Tasks Using Feature Selection Algorithms》.

Date: 2024-10-28 14:54:53
