topic model - LDA 1

http://blog.csdn.net/pipisorry/article/details/42129099

Step 2: Corpora and Vector Spaces

Convert documents represented as strings into document vectors represented by ids:

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"]
"""
# Alternative: use StemmedCountVectorizer (not defined in this snippet) to get a stemmed corpus without stop words
Vectorizer = StemmedCountVectorizer
# Vectorizer = CountVectorizer
vectorizer = Vectorizer(stop_words='english')
vectorizer.fit_transform(documents)
texts = vectorizer.get_feature_names()
# print(texts)
"""
texts = [doc.lower().split() for doc in documents]
# print(texts)
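# Named `dictionary` rather than `dict` to avoid shadowing the Python builtin
dictionary = corpora.Dictionary(texts)    # build our own dictionary from the corpus
# print(dictionary, dictionary.token2id)
# Use the dictionary to convert each string document into an id-based document vector
corpus = [dictionary.doc2bow(text) for text in texts]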
print(corpus)
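
To make the string-to-id conversion above concrete, here is a minimal pure-Python sketch of the same idea. The helper names `build_token2id` and `doc2bow` are hypothetical stand-ins written for illustration, not gensim's implementation, and the ids here are assigned in order of first appearance, which is an assumption; gensim's `Dictionary` may assign different ids to the same tokens.

```python
from collections import Counter


def build_token2id(texts):
    """Assign each distinct token an integer id, in order of first appearance.

    This ordering is an assumption of the sketch; gensim may number tokens differently.
    """
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(text, token2id):
    """Convert a tokenized document into a sorted list of (token_id, count) pairs.

    Tokens missing from token2id are silently skipped.
    """
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


docs = ["Graph minors A survey",
        "The intersection graph of paths in trees"]
texts = [doc.lower().split() for doc in docs]
token2id = build_token2id(texts)
bows = [doc2bow(text, token2id) for text in texts]
print(token2id)
print(bows)
```

Each document becomes a sparse vector: only the ids of tokens that occur in it are listed, each paired with its count. This is the same (id, count) shape that the `corpus` printed above has.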

【http://www.52nlp.cn/%E】

from: http://blog.csdn.net/pipisorry/article/details/42129099

ref: http://radimrehurek.com/gensim/tutorial.html

Time: 2024-12-12 11:49:03
