knowledge_based topic model - AMC

ABSTRACT

Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results.

In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics.

This paper proposes a radically different approach: learn as humans do, i.e., retain the results learned in the past and use them to help future learning. When faced with a new task, the algorithm first mines some reliable (prior) knowledge from past learning/modeling results and then uses it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web.

The proposed algorithm mines two forms of knowledge:
must-link (meaning that two words should be in the same topic) and
cannot-link (meaning that two words should not be in the same topic).

It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.

(Summary: Topic models have been widely used to mine topics from documents. However, a key weakness of topic models is that they need a large amount of data to provide reliable statistics and produce coherent topics. In practice, many document collections lack that many documents, so the classic topic model LDA produces very poor topics. Knowledge-based topic models have been proposed, which ask the user to provide some prior domain knowledge to guide the model toward better topics.)

Introduction

Approaches to addressing the key weakness of topic modeling:

1. Inventing better topic models: This approach may be effective if a large number of documents are available. However, since topic models perform unsupervised learning, if the data is small, there is simply not enough information to provide reliable statistics to generate coherent topics. Some form of supervision or external information beyond the given documents is necessary.

2. Asking users to provide prior domain knowledge: An obvious form of external information is the prior knowledge of the domain from the user. For example, the user can input the knowledge in the form of must-links and cannot-links. A must-link states that two terms (or words) should belong to the same topic, e.g., price and cost. A cannot-link indicates that two terms should not be in the same topic, e.g., price and picture. Some existing knowledge-based topic models (e.g., [1, 2, 9, 10, 14, 15, 26, 28]) can exploit such prior domain knowledge to produce better topics. However, asking the user to provide prior domain knowledge can be problematic in practice because the user may not know what knowledge to provide and instead wants the system to discover it for him/her. It also makes the approach non-automatic.

3. Learning like humans (lifelong learning): We still use the knowledge-based approach but mine the prior knowledge automatically from the results of past learning. This approach works like human learning: we humans always retain the results learned in the past and use them to help future learning. However, our approach is very different from existing lifelong learning methods (see below).

Lifelong learning is possible in our context due to two key observations (i.e., how to find must-links and cannot-links):

1. Although every domain is different, there is a fair amount of topic overlap across domains. For example, every product review domain (cell phones, alarm clocks, laptops, etc.) has the topic of price, most electronic products share the topic of battery, and some also have the topic of screen. From the topics learned from these domains, we can mine frequently shared terms among the topics.
For example, we may find price and cost frequently appear together in some topics, which indicates that they are likely to belong to the same topic and thus form a
must-link.
Note that we have the frequency requirement because we want reliable knowledge.

2. From the previously generated topics from many domains, it is also possible to find that picture and price should not be in the same topic (a cannot-link). This can be done by finding a set of topics that have picture as a top topical term, in which the term price almost never appears at the top, i.e., they are negatively correlated.

The proposed lifelong learning approach:

Phase 1 (Initialization): Given n prior document collections D = {D1, ..., Dn}, a topic model (e.g., LDA) is run on each collection Di ∈ D to produce a set of topics Si. Let S = ∪i Si, which we call the prior topics (or p-topics for short). The algorithm then mines must-links M from S using a multiple minimum supports frequent itemset mining algorithm.

Phase 2 (Lifelong learning): Given a new document collection Dt, a knowledge-based topic model (KBTM) with the must-links M is run to generate a set of topics At. Based on At, the algorithm finds a set of cannot-links C. The KBTM then continues, now guided by both must-links M and cannot-links C, to produce the final topic set At. We will explain why we mine cannot-links based on At in Section 4.2. To enable lifelong learning, At is incorporated into S, which is used to generate a new set of must-links M.
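The two phases can be pictured as a simple loop. The sketch below is a minimal illustration only, assuming hypothetical helper functions run_lda, run_kbtm, mine_must_links, and mine_cannot_links that stand in for the components described above; none of these names come from the paper.

```python
def lifelong_topic_modeling(prior_collections, new_collection,
                            run_lda, run_kbtm,
                            mine_must_links, mine_cannot_links):
    """Two-phase lifelong learning loop (illustrative skeleton only)."""
    # Phase 1 (initialization): run a base topic model on every prior
    # collection and pool the resulting topics into the p-topic set S.
    S = []
    for D_i in prior_collections:
        S.extend(run_lda(D_i))            # topics S_i of prior domain i
    M = mine_must_links(S)                # must-links mined from S

    # Phase 2 (lifelong learning): model the new collection D_t with the
    # mined must-links, then refine with cannot-links targeted at its topics.
    A_t = run_kbtm(new_collection, must_links=M, cannot_links=[])
    C = mine_cannot_links(S, A_t)         # cannot-links based on A_t
    A_t = run_kbtm(new_collection, must_links=M, cannot_links=C)

    # Retain the new topics so they can serve as p-topics for future domains.
    S.extend(A_t)
    M = mine_must_links(S)
    return A_t, S, M
```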

(The paper's framework figure for the lifelong learning approach is not reproduced in these notes.)

OVERALL ALGORITHM

MINING KNOWLEDGE

1. Mining Must-Link Knowledge

E.g., for the must-link {price, cost}, we should expect to see price and cost as topical terms in the same topic across many domains. Note that they may not appear together in every topic about price due to the special context of the domain or past topic modeling errors.

In practice, the top terms under a topic are expected to represent some similar semantic meaning. The lower ranked terms usually have very low probabilities due to the smoothing effect of the Dirichlet hyper-parameters rather than true correlations within the topic, which makes them unreliable. Thus, in this work, only the top 15 terms are employed to represent a topic.

Given a set of prior topics (p-topics) S, we find sets of terms that appear together in multiple topics using the data mining technique of frequent itemset mining (FIM). However, a single minimum support is not appropriate, because different topics may have very different frequencies in the data; this is known as the rare item problem.

Thus, the multiple minimum supports frequent itemset mining (MS-FIM) algorithm is used, as sketched below.
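As a rough illustration of this step, the sketch below reduces each p-topic to its top-15 terms and mines frequent term pairs. A per-pair minimum support scaled to the rarer term's own frequency stands in for MS-FIM's per-item minimum supports (full MS-FIM also mines larger itemsets); the parameter values are illustrative, not from the paper.

```python
from collections import Counter
from itertools import combinations

def mine_must_links(p_topics, top_k=15, sup_ratio=0.5, min_sup=3):
    """Mine must-links (frequent term pairs) from prior topics (p-topics).

    p_topics: list of topics, each a list of terms ranked by probability.
    This is a simplified stand-in for MS-FIM restricted to 2-itemsets: the
    minimum support of a pair is tied to the frequency of its rarer term,
    floored by a global minimum, which mitigates the rare item problem.
    """
    # Each transaction is the set of top-k terms of one p-topic.
    transactions = [set(topic[:top_k]) for topic in p_topics]

    term_count = Counter(t for trans in transactions for t in trans)
    pair_count = Counter()
    for trans in transactions:
        for w1, w2 in combinations(sorted(trans), 2):
            pair_count[(w1, w2)] += 1

    must_links = []
    for (w1, w2), cnt in pair_count.items():
        # Per-pair minimum support, scaled to the rarer term's own frequency.
        required = max(min_sup, sup_ratio * min(term_count[w1], term_count[w2]))
        if cnt >= required:
            must_links.append((w1, w2))
    return must_links
```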

2. Mining Cannot-Link Knowledge

For a term w, there are usually only a few terms wm that share must-links with w while there are a huge number of terms wc that can form cannot-links with w.

However, for a new or test domain Dt, most of these cannot-links are not useful because the vocabulary size of Dt is much smaller than V. Thus, we focus only on those terms that are relevant to Dt.

We extract cannot-links from each pair of top terms w1 and w2 in each current topic (c-topic) of At; i.e., cannot-link mining is targeted at each c-topic.

determine whether two terms form a cannot-link:

Let Ndiff be the number of prior domains in which w1 and w2 appear in different p-topics, and Nshare be the number of prior domains in which w1 and w2 share the same p-topic. For a valid cannot-link, Ndiff should be much larger than Nshare.

We need to use two conditions (thresholds) to control the formation of a cannot-link (i.e., Nshare should be small and Ndiff large):

1. The ratio Ndiff / (Nshare + Ndiff) (called the support ratio) is equal to or larger than a threshold πc. This condition is intuitive because p-topics may contain noise due to errors of topic models.

2. Ndiff is greater than a support threshold πdiff . This condition is needed because the above ratio can be 1, but Ndiff can be very small, which may not give reliable cannot-links.

(For example, screen and pad may each appear rarely in the p-topics, so Ndiff is small; but since they never co-occur, Nshare = 0 and the support ratio is still 1, even though the evidence is too weak to be reliable.)
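A sketch of these two conditions is given below, under the reading that Ndiff counts prior domains where both terms occur among top topical terms but never in the same p-topic, and Nshare counts domains where they share at least one p-topic. The threshold values are placeholders, not the paper's settings.

```python
def forms_cannot_link(w1, w2, prior_domains, top_k=15, pi_c=0.9, pi_diff=10):
    """Check the two cannot-link conditions for a pair of top terms.

    prior_domains: list of prior domains, each a list of p-topics
    (ranked term lists). n_diff counts domains where w1 and w2 both occur
    among top terms but never in the same p-topic; n_share counts domains
    where they share at least one p-topic.
    """
    n_diff, n_share = 0, 0
    for p_topics in prior_domains:
        tops = [set(t[:top_k]) for t in p_topics]
        shared = any(w1 in t and w2 in t for t in tops)
        both_present = any(w1 in t for t in tops) and any(w2 in t for t in tops)
        if shared:
            n_share += 1
        elif both_present:
            n_diff += 1

    if n_diff + n_share == 0:
        return False
    support_ratio = n_diff / (n_diff + n_share)
    # Condition 1: high support ratio; condition 2: N_diff itself is large.
    return support_ratio >= pi_c and n_diff > pi_diff
```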

AMC MODEL

handling incorrect knowledge:

The idea is that the semantic relationships reflected by correct must-links and cannot-links should also be reasonably induced by the statistical information underlying the domain collection.
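One simple way to realize this check for must-links is to test whether the two terms are positively associated in the new collection Dt, e.g., via pointwise mutual information over document-level co-occurrence. The sketch below uses a hard PMI filter purely for illustration; the actual AMC model integrates such co-occurrence statistics into its sampler rather than filtering up front, and the threshold here is an assumption.

```python
import math

def pmi(w1, w2, documents, eps=1e-12):
    """Pointwise mutual information of two terms over the new collection D_t.

    documents: list of token lists; probabilities are estimated from
    document-level occurrence counts.
    """
    doc_sets = [set(d) for d in documents]
    n = len(doc_sets)
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
    return math.log((p12 + eps) / (p1 * p2 + eps))

def verify_must_links(must_links, documents, threshold=0.0):
    # Keep only must-links whose two terms are positively associated in D_t.
    return [(w1, w2) for (w1, w2) in must_links
            if pmi(w1, w2, documents) > threshold]
```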

Dealing with Issues of Must-Links

1. A term can have multiple meanings or senses, which leads to the transitivity problem: if {A, B} and {B, C} are must-links but B carries a different sense in each, transitively merging them produces wrong knowledge.

2. Not every must-link is suitable for a domain.

Recognizing Multiple Senses

Given two must-links m1 and m2, if they share the same word sense, the p-topics that cover m1 should have some overlap with the p-topics that cover m2. For example, the must-links {light, bright} and {light, luminance} should mostly come from the same set of p-topics related to the semantic meaning "something that makes things visible" of light.

m1 and m2 are taken to share the same sense if the sets of p-topics covering m1 and m2 overlap sufficiently (i.e., the overlap exceeds a threshold).
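A sketch of this overlap test, assuming each must-link carries the set of p-topic ids it was mined from; the overlap measure (relative to the smaller set) and the threshold value are illustrative choices, not necessarily the paper's exact definition.

```python
def share_same_sense(topics_m1, topics_m2, pi_overlap=0.3):
    """Decide whether two must-links m1 and m2 share a word sense.

    topics_m1 / topics_m2: sets of p-topic ids that cover m1 / m2. The two
    must-links are judged to share a sense if their covering p-topic sets
    overlap enough relative to the smaller of the two sets.
    """
    if not topics_m1 or not topics_m2:
        return False
    overlap = len(topics_m1 & topics_m2)
    return overlap / min(len(topics_m1), len(topics_m2)) >= pi_overlap
```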

Reference: Zhiyuan (Brett) Chen, "Mining Topics in Documents," KDD 2014.

Posted: 2024-08-15 17:52:58
