Learning LexRank: Graph-based Centrality as Salience in Text Summarization (Part 1)

(1) What Are Sentence Centrality and Centroid-based Summarization?

  Extractive summarization works by choosing a subset of the sentences in the original documents. This process can be viewed as identifying the most central sentences in a (multi-document) cluster that give the necessary and sufficient amount of information related to the main theme of the cluster.

  The centroid of a cluster is a pseudo-document consisting of the words whose tf×idf scores are above a predefined threshold, where tf is the frequency of a word in the cluster, and idf values are typically computed over a much larger data set of a similar genre.

  In centroid-based summarization (Radev, Jing, & Budzikowska, 2000), the sentences that contain more words from the centroid of the cluster are considered as central. This is a measure of how close the sentence is to the centroid of the cluster.
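The centroid-based scoring described above can be sketched in a few lines. This is a minimal illustration, not Radev et al.'s implementation: the idf values, the threshold, and the toy sentences below are all made up for demonstration, and unseen words default to an idf of 1.0.

```python
from collections import Counter

def centroid_words(sentences, idf, threshold):
    """Build the centroid pseudo-document: words whose tf*idf in the
    cluster exceeds the given threshold (tf is counted over the cluster)."""
    tf = Counter(w for s in sentences for w in s)
    return {w for w, f in tf.items() if f * idf.get(w, 1.0) > threshold}

def centroid_score(sentence, centroid):
    """A sentence containing more centroid words is considered more central."""
    return sum(1 for w in sentence if w in centroid)

# Toy cluster with made-up idf values (hypothetical, for illustration only).
idf = {"storm": 2.0, "damage": 1.5, "the": 0.1, "hit": 1.2, "coast": 1.8}
sents = [["the", "storm", "hit", "the", "coast"],
         ["storm", "damage", "was", "severe"],
         ["officials", "spoke", "yesterday"]]
c = centroid_words(sents, idf, threshold=2.0)
scores = [centroid_score(s, c) for s in sents]
```

With these toy values only "storm" (tf 2 × idf 2.0 = 4.0) clears the threshold, so the first two sentences score higher than the third, which shares no words with the centroid.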

(2) Centrality-based Sentence Salience

  All of our approaches are based on the concept of prestige in social networks. A social network is a mapping of relationships between interacting entities (e.g. people, organizations, computers). Social networks are represented as graphs, where the nodes represent the entities and the links represent the relations between the nodes.

  A cluster of documents can be viewed as a network of sentences that are related to each other. We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or salient) to the topic.

  There are two points to clarify in this definition of centrality:

  1. How to define similarity between two sentences.

  2. How to compute the overall centrality of a sentence given its similarity to other sentences.

  To define similarity, we use the bag-of-words model to represent each sentence as an N-dimensional vector, where N is the number of all possible words in the target language. For each word that occurs in a sentence, the value of the corresponding dimension in the vector representation of the sentence is the number of occurrences of the word in the sentence times the idf of the word. The similarity between two sentences is then defined as the cosine between the two corresponding vectors.
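The formula itself did not survive extraction; written out from the vector definition above, this is the idf-modified cosine used in the LexRank paper:

```latex
\mathrm{idf\text{-}modified\text{-}cosine}(x, y) =
\frac{\sum_{w \in x, y} \mathrm{tf}_{w,x}\,\mathrm{tf}_{w,y}\,(\mathrm{idf}_w)^2}
     {\sqrt{\sum_{x_i \in x} \left(\mathrm{tf}_{x_i,x}\,\mathrm{idf}_{x_i}\right)^2}
      \;\sqrt{\sum_{y_i \in y} \left(\mathrm{tf}_{y_i,y}\,\mathrm{idf}_{y_i}\right)^2}}
```

Here tf_{w,s} is the number of occurrences of word w in sentence s, and the sum in the numerator runs over the words that appear in both sentences.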

  A cluster of documents may be represented by a cosine similarity matrix where each entry in the matrix is the similarity between the corresponding sentence pair.
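Building that matrix can be sketched directly from the formula. This is a simplified illustration, assuming tokenized sentences and a precomputed idf dictionary; words missing from the dictionary fall back to an idf of 1.0, which is an arbitrary choice for the sketch.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """idf-modified cosine between two tokenized sentences x and y."""
    tx, ty = Counter(x), Counter(y)
    # Numerator: sum over words appearing in both sentences.
    num = sum(tx[w] * ty[w] * idf.get(w, 1.0) ** 2 for w in tx.keys() & ty.keys())
    # Denominator: product of the tf*idf vector norms of each sentence.
    nx = math.sqrt(sum((tx[w] * idf.get(w, 1.0)) ** 2 for w in tx))
    ny = math.sqrt(sum((ty[w] * idf.get(w, 1.0)) ** 2 for w in ty))
    return num / (nx * ny) if nx and ny else 0.0

def similarity_matrix(sentences, idf):
    """Pairwise cosine similarity matrix for a cluster of tokenized sentences."""
    n = len(sentences)
    return [[idf_modified_cosine(sentences[i], sentences[j], idf)
             for j in range(n)] for i in range(n)]
```

The resulting matrix is symmetric with ones on the diagonal, since every sentence has cosine 1 with itself.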

  Figure 1 shows a subset of a cluster used in DUC 2004, and the corresponding cosine similarity matrix. Sentence ID dXsY denotes the Yth sentence in the Xth document.

Figure 1: Intra-sentence cosine similarities in a subset of cluster d1003t from DUC 2004.

  This matrix can also be represented as a weighted graph where each edge weight is the cosine similarity between a pair of sentences (Figure 2).

Figure 2: Weighted cosine similarity graph for the cluster in Figure 1.

(3) Degree Centrality

  Since we are interested in significant similarities, we can eliminate some low values in this matrix by defining a threshold so that the cluster can be viewed as an (undirected) graph.

  Figure 3 shows the graphs that correspond to the adjacency matrices derived by assuming that pairs of sentences with a similarity above 0.1, 0.2, and 0.3, respectively, in Figure 1 are similar to each other. Note that there should also be self links for all of the nodes in the graphs, since every sentence is trivially similar to itself. Although we omit the self links for readability, the arguments in the following sections assume that they exist.
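The thresholding step can be sketched as a simple binarization of the similarity matrix. The 0.1 threshold and the 3×3 matrix below are illustrative, not taken from Figure 1; self links are kept, as the text assumes.

```python
def adjacency(sim_matrix, threshold):
    """Binarize a cosine similarity matrix into an undirected adjacency
    matrix: keep an edge only when the similarity exceeds the threshold.
    Self links (i == j) are always kept."""
    n = len(sim_matrix)
    return [[1 if (i == j or sim_matrix[i][j] > threshold) else 0
             for j in range(n)] for i in range(n)]

# Hypothetical 3-sentence cluster: only the weakest pair (0, 2) is cut at 0.1.
sim = [[1.00, 0.30, 0.05],
       [0.30, 1.00, 0.20],
       [0.05, 0.20, 1.00]]
adj_01 = adjacency(sim, threshold=0.1)
```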


Figure 3: Similarity graphs that correspond to thresholds 0.1, 0.2, and 0.3, respectively, for the cluster in Figure 1.

  A simple way of assessing sentence centrality by looking at the graphs in Figure 3 is to count the number of similar sentences for each sentence. We define the degree centrality of a sentence as the degree of the corresponding node in the similarity graph. As seen in Table 1, the choice of cosine threshold dramatically influences the interpretation of centrality. Thresholds that are too low may mistakenly take weak similarities into consideration, while thresholds that are too high may lose many of the similarity relations in a cluster.
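Degree centrality then reduces to a row sum over the adjacency matrix. In this sketch the self link is subtracted so that an isolated sentence scores 0; whether to count the self link is a presentation choice, and the example matrix below is hypothetical.

```python
def degree_centrality(adj):
    """Degree of each node in the undirected similarity graph, with the
    self link subtracted so an isolated sentence gets degree 0."""
    return [sum(row) - 1 for row in adj]

# Hypothetical adjacency matrix: sentences 0 and 1 are linked, 2 is isolated.
adj = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
deg = degree_centrality(adj)
```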

Table 1: Degree centrality scores for the graphs in Figure 3. Sentence d4s1 is the most central sentence for thresholds 0.1 and 0.2.
