ABSTRACT
Topic modeling has been widely used to mine topics from documents. However,
a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. In practice, many document collections
do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results.
In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics.
We take a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge
from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web.
The proposed algorithm mines two forms of knowledge:
must-link (meaning that two words should be in the same topic) and
cannot-link (meaning that two words should not be in the same topic).
It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show
that the proposed approach makes dramatic improvements over state-of-the-art baselines.
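Summary: Topic models are widely used to mine topics from documents, but they need a large amount of data to provide reliable statistics for coherent topics. In practice many collections are small, so the classic topic model LDA produces very poor topics. Knowledge-based topic models ask users to supply prior domain knowledge to guide the model toward better topics.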
Introduction
Approaches to addressing the key weakness of topic modeling:
1. Inventing better topic models: This approach may be effective if a large number of documents are available. However, since topic models perform unsupervised learning, if the data is small, there is
simply not enough information to provide reliable statistics to generate coherent topics. Some form of supervision or external information beyond the given documents is necessary.
2. Asking users to provide prior domain knowledge: An obvious form of external information is the prior knowledge of the domain from the user.
For example, the user can input the knowledge in the form of must-links and cannot-links. A must-link states that two terms (or words) should belong to the same topic, e.g., price and cost. A cannot-link indicates that two terms should not
be in the same topic, e.g., price and picture. Some existing knowledge-based topic models (e.g., [1, 2, 9, 10, 14, 15, 26, 28]) can exploit such prior domain knowledge to produce better topics. However, asking the user to provide prior domain knowledge
can be problematic in practice because the user may not know what knowledge to provide and may want the system to discover it for them. It also makes the approach non-automatic.
3. Learning like humans (lifelong learning): We still use the knowledge-based approach but mine the prior knowledge automatically from the results of past learning. This approach works like human learning.
We humans always retain the results learned in the past and use them to help future learning. However, our approach is very different from existing lifelong learning methods (see below).
Lifelong learning is possible in our context due to two key observations (which also show how to find must-links and cannot-links):
1. Although every domain is different, there is a fair amount of topic overlapping across domains. For example, every product review domain (e.g., cell phones, alarm clocks, laptops) has the topic of price, most electronic products share the topic of battery, and some also have the topic
of screen. From the topics learned from these domains, we can mine frequently shared terms among the topics.
For example, we may find price and cost frequently appear together in some topics, which indicates that they are likely to belong to the same topic and thus form a
must-link. Note that we have the frequency requirement because we want reliable knowledge.
2. From the previously generated topics from many domains, it is also possible to find that picture and price should not be in the same topic (a cannot-link).
This can be done by finding a set of topics that have picture as a top topical term, while the term price almost never appears at the top of this set of topics, i.e., they are negatively correlated.
Proposed lifelong learning approach:
Phase 1 (Initialization): Given n prior document collections D = {D1, ..., Dn}, a topic model (e.g., LDA) is run on each collection Di ∈ D to produce a set of topics Si. Let S = ∪i Si, which we call the prior topics (or p-topics for
short). The algorithm then mines must-links M from S using a multiple minimum supports frequent itemset mining algorithm.
Phase 2 (Lifelong learning): Given a new document collection Dt, a knowledge-based topic model (KBTM) with the must-links M is run to generate a set of topics At. Based on At, the algorithm finds a set of cannot-links C. The KBTM then
continues, now guided by both must-links M and cannot-links C, to produce the final topic set At. We will explain why we mine cannot-links based on At in Section 4.2. To enable lifelong learning, At is incorporated into S, which is used to generate
a new set of must-links M.
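The control flow of Phase 2 can be sketched as follows. This is only a minimal outline, not the paper's implementation: the function name `lifelong_step` and the three callables passed into it are hypothetical placeholders for the must-link miner, the cannot-link miner, and the KBTM. One simplification is labeled in the comments: the sketch calls the model a second time, whereas the text says the KBTM continues its run.

```python
from typing import Callable, Iterable, List, Set, Tuple

Topic = List[str]        # a topic represented by its top terms
Link = Tuple[str, str]   # a word pair, stored in sorted order


def lifelong_step(
    new_docs: Iterable[str],
    prior_topics: List[Topic],
    mine_must_links: Callable[[List[Topic]], Set[Link]],
    mine_cannot_links: Callable[[List[Topic], List[Topic]], Set[Link]],
    run_kbtm: Callable[[Iterable[str], Set[Link], Set[Link]], List[Topic]],
) -> List[Topic]:
    """One pass of Phase 2. Extends prior_topics in place and returns At."""
    must = mine_must_links(prior_topics)               # M, mined from S
    current = run_kbtm(new_docs, must, set())          # first run: must-links only
    cannot = mine_cannot_links(current, prior_topics)  # C, targeted at At
    # Simplification: a second call stands in for "the KBTM then continues".
    final = run_kbtm(new_docs, must, cannot)
    prior_topics.extend(final)                         # S now includes At
    return final
```

Because the miners and the model are passed in as arguments, the same loop can be exercised with toy stand-ins before any real topic model is plugged in.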
Lifelong learning approach (block diagram):
OVERALL ALGORITHM
MINING KNOWLEDGE
1. Mining Must-Link Knowledge
For a must-link such as {price, cost}, we should expect to see price and cost as topical terms in the same topic across many domains. Note that they may not appear together in every topic about price due to the special context of the domain or past topic modeling errors.
In practice, top terms under a topic are expected to represent some similar semantic meaning. The lower ranked terms usually have very low probabilities due to the smoothing effect of the Dirichlet hyper-parameters rather than true correlations within
the topic, leading to their unreliability. Thus, in this work, only the top 15 terms are employed to represent a topic.
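As a small illustration of representing a topic by its top terms, the sketch below truncates a word-probability mapping to its k most probable words. The function name `top_terms` and the dictionary input format are assumptions made for this example. Only the default k = 15 comes from the text.

```python
from typing import Dict, List


def top_terms(word_probs: Dict[str, float], k: int = 15) -> List[str]:
    """Return the k most probable terms, highest probability first."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy topic: "the" only has probability mass because of smoothing.
topic = {"price": 0.30, "cost": 0.25, "cheap": 0.20, "the": 0.001}
print(top_terms(topic, k=3))  # ['price', 'cost', 'cheap']
```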
Given a set of prior topics (p-topics) S, we find sets of terms that appear together in multiple topics using the data mining technique frequent itemset mining (FIM). However, a single minimum support is not appropriate,
because different topics may have very different frequencies in the data: a threshold high enough for frequent topics misses must-links from rare topics, while one low enough for rare topics admits spurious links from frequent ones. This is called the rare item problem.
We thus use multiple minimum supports frequent itemset mining (MS-FIM).
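To make the idea of multiple minimum supports concrete, here is a brute-force sketch restricted to term pairs. It is not the MS-FIM algorithm itself: each term gets its own minimum support, assumed here to be `max(beta * support(term), floor)`, and a pair is kept when its support reaches the smaller of its two terms' minimums. The name `mine_must_links` and the parameters `beta` and `floor` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set, Tuple


def mine_must_links(
    topics: List[Set[str]],
    beta: float = 0.5,  # hypothetical: fraction of a term's own support
    floor: int = 2,     # hypothetical: lowest minimum support allowed
) -> Dict[Tuple[str, str], int]:
    """Return term pairs whose support meets a per-term minimum support."""
    # Support of a term (or pair) = number of p-topics that contain it.
    item_support = Counter(term for topic in topics for term in topic)
    pair_support = Counter(
        pair for topic in topics for pair in combinations(sorted(topic), 2)
    )
    # Per-term minimum support: frequent terms get a higher bar than rare ones.
    mis = {term: max(beta * n, floor) for term, n in item_support.items()}
    return {
        pair: n
        for pair, n in pair_support.items()
        if n >= min(mis[pair[0]], mis[pair[1]])
    }


p_topics = [
    {"price", "cost"}, {"price", "cost"}, {"price", "cost"},
    {"price", "cost", "value"}, {"price", "value"}, {"price", "money"},
    {"screen", "display"}, {"screen", "display"}, {"battery", "screen"},
]
print(sorted(mine_must_links(p_topics)))
```

In this toy data the rarer pair (display, screen) has support 2. A single global threshold of 3 would drop it, while the per-term minimums keep it.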
2. Mining Cannot-Link Knowledge
For a term w, there are usually only a few terms wm that share must-links with w while there are a huge number of terms wc that can form cannot-links with w.
However, for a new or test domain Dt, most of these cannot-links are not useful because the vocabulary size of Dt is much smaller than V. Thus, we focus only on those terms that are relevant to Dt.
We extract cannot-links from each pair of top terms w1 and w2 in each current topic (c-topic) At_j ∈ At; that is, cannot-link mining is targeted to each c-topic.
Determining whether two terms form a cannot-link:
Let the number of prior domains that w1 and w2 appear in different p-topics be Ndiff and the number of prior domains that w1 and w2 share the same topic be Nshare. Ndiff should be much larger than Nshare.
We need to use two conditions or thresholds to control the formation of a cannot-link (i.e., Nshare is small and Ndiff is large):
1. The ratio Ndiff / (Nshare + Ndiff) (called the support ratio) is equal to or larger than a threshold πc. This condition is intuitive because p-topics may contain noise due to errors of topic models.
2. Ndiff is greater than a support threshold πdiff. This condition is needed because the above ratio can be 1, but Ndiff can be very small, which may not give reliable cannot-links.
(For example, if screen and pad both rarely appear in the p-topics, Ndiff is small; since they also never share a topic, Nshare = 0 and the support ratio is 1.)
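The two conditions can be checked with a few lines of code. The sketch below is illustrative only: the function name `is_cannot_link`, the data layout (one list of term sets per prior domain), and the default threshold values are assumptions, not values taken from the paper. It also assumes that a domain in which the two words share any topic counts toward Nshare rather than Ndiff.

```python
from typing import List, Set

Domain = List[Set[str]]  # one prior domain = its p-topics, each a set of top terms


def is_cannot_link(
    w1: str,
    w2: str,
    prior_domains: List[Domain],
    pi_c: float = 0.8,   # hypothetical support-ratio threshold
    pi_diff: int = 10,   # hypothetical support threshold for Ndiff
) -> bool:
    """Apply the support-ratio and Ndiff conditions to the pair (w1, w2)."""
    n_share = n_diff = 0
    for topics in prior_domains:
        if any(w1 in t and w2 in t for t in topics):
            n_share += 1  # the two words share a p-topic in this domain
        elif any(w1 in t for t in topics) and any(w2 in t for t in topics):
            n_diff += 1   # both words occur, but only in different p-topics
    total = n_share + n_diff
    if total == 0:
        return False      # the pair never co-occurs in any domain: no evidence
    return n_diff / total >= pi_c and n_diff > pi_diff


domain = [{"price", "cost"}, {"picture", "photo"}]
print(is_cannot_link("price", "picture", [domain] * 12))  # True
print(is_cannot_link("price", "picture", [domain] * 2))   # False
```

The second call has a support ratio of 1 but only two supporting domains, so the second condition rejects it, which is the situation the screen and pad example describes.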
AMC MODEL
Handling incorrect knowledge:
The idea is that the semantic relationships reflected by correct must-links and cannot-links should also be reasonably induced by the statistical information underlying the domain collection.
Dealing with Issues of Must-Links
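1. A term can have multiple meanings or senses, which causes the transitivity problem: must-links built on different senses of a shared term can wrongly chain unrelated words into the same topic.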
2. Not every must-link is suitable for a domain.
Recognizing Multiple Senses
Given two must-links m1 and m2, if they share the same word sense, the p-topics that cover m1 should have some overlapping with the p-topics that cover m2. For example, must-links {light, bright} and {light, luminance} should be mostly coming from the same
set of p-topics related to the semantic meaning "something that makes things visible" of light.
m1 and m2 share the same sense if
from:
ref: Zhiyuan (Brett) Chen, "Mining Topics in Documents", KDD 2014