How to Compute the Similarity Between Two Documents (Part 3)

The complete working code for this article is listed below, with comments:

# -*- coding: cp936 -*-
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
from gensim import corpora, models, similarities
import logging

courses = [line.strip() for line in file('/home/liuxianga/coursera/coursera_corpus')]     # a list: each element is one whole course document (one line of the corpus file)
# print "courses"
# print courses
courses_name = [course.split('\t')[0] for course in courses]        # course names, also a list
# print "courses_name[0:10]"
# print courses_name[0:10]
texts_lower = [[word for word in document.lower().split()] for document in courses]       # list of lists: each inner list is one document, lowercased and split on whitespace
# print "texts_lower[0]"
# print texts_lower[0]
texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]      # list of lists: each inner list is one tokenized, lowercased document (decoded to unicode)
# print "texts_tokenized[0]"
# print texts_tokenized[0]
english_stopwords = stopwords.words('english')
# print "english_stopwords"
# print english_stopwords
len(english_stopwords)
texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]   # remove stopwords from texts_tokenized
# print "texts_filtered_stopwords[0]"
# print texts_filtered_stopwords[0]
english_punctuations = [',','.',':',';','?','(',')','[',']','&','!','*','@','#','$','%']
texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]    # remove punctuation tokens from texts_filtered_stopwords
# print "texts_filtered[0]"
# print texts_filtered[0]
st = LancasterStemmer()
# print st.stem('stemmed')
# print st.stem('stemming')
texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered]     # stemming
# print "texts_stemmed[0]"
# print texts_stemmed[0]
all_stems = sum(texts_stemmed, [])
stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
# print "texts[0:10]"
# print texts[0:10]
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
index = similarities.MatrixSimilarity(lsi[corpus])
print "####################################################################################################################"
# print "courses_name[210]"
# print courses_name[210]          # course name
ml_course = texts[210]        # the full preprocessed document of this course
ml_bow = dictionary.doc2bow(ml_course)
ml_lsi = lsi[ml_bow]      # projection of course 210 onto the 10 LSI topics
print "ml_lsi"
print ml_lsi
sims = index[ml_lsi]     # similarity of every course in the corpus to the query course
print "sims[0:10]"
print sims[0:10]
sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])      # sort all courses by similarity, descending
print "sort_sims[0:10]"
print sort_sims[0:10]

# print "courses_name[210]"
# print courses_name[210]
#
# print "courses_name[174]"
# print courses_name[174]
#
# print "courses_name[238]"
# print courses_name[238]
#
# print "courses_name[203]"
# print courses_name[203]

III. Experiments with the Course Graph Data

1. Data Preparation

To make it easy for everyone to reproduce the experiment, I have prepared a set of Coursera course data that you can download here: coursera_corpus (Baidu netdisk link:
http://t.cn/RhjgPkv, password: oppc). There are 379 courses in total, and each line consists of three tab-separated fields: course name\tcourse summary\tcourse details. The HTML tags have already been removed. The examples shown below are just the course names:

Writing II: Rhetorical Composing

Genetics and Society: A Course for Educators

General Game Playing

Genes and the Human Condition (From Behavior to Biotechnology)

A Brief History of Humankind

New Models of Business in Society

Analyse Numérique pour Ingénieurs

Evolution: A Course for Educators

Coding the Matrix: Linear Algebra through Computer Science Applications

The Dynamic Earth: A Course for Educators

OK, first let's open Python and load this data:

>>> courses = [line.strip() for line in file('coursera_corpus')]

>>> courses_name = [course.split('\t')[0] for course in courses]

>>> print courses_name[0:10]

['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
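
As a quick sanity check that the file loaded correctly (the corpus described above has 379 courses, one per line), you can simply count the lines; a minimal check:

>>> print len(courses)    # should be 379 according to the description above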

2. Introducing NLTK

NLTK is the well-known Python natural language processing toolkit. It mainly targets English, but since the course data Course Graph currently handles is mostly English, it is good enough for our purposes. NLTK comes with documentation, corpora, and even a book, which some generous folks in China have translated:
Natural Language Processing with Python. Sometimes you can't help but sigh: people doing English NLP really have it good.

The first step, as always, is to install NLTK. The NLTK homepage explains in detail how to install it on Mac, Linux, and Windows: http://nltk.org/install.html. The main thing is to install the dependencies NumPy and PyYAML first; the rest is straightforward. Once NLTK is installed, you can test it with import nltk. If that works, there is one more very important step: downloading the corpora that NLTK provides:

>>> import nltk

>>> nltk.download()

At this point a graphical interface pops up showing two collections available for download, all-corpora and book. It is best to select and download both, which takes a while. Only after the corpora have been downloaded is NLTK truly ready to use on your machine. You can then give it a quick test with the Brown corpus:

>>> from nltk.corpus import brown

>>> brown.readme()

'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'

>>> brown.words()[0:10]

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']

>>> brown.tagged_words()[0:10]

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]

>>> len(brown.words())

1161192
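
Incidentally, if the graphical downloader is inconvenient (for example on a remote server), you can also fetch the data package by package; a minimal sketch, assuming the standard NLTK package identifiers (newer NLTK releases also need the punkt models for word_tokenize):

>>> import nltk
>>> nltk.download('brown')        # Brown corpus, used in the test above
>>> nltk.download('stopwords')    # English stopword list, used below
>>> nltk.download('punkt')        # tokenizer models used by word_tokenize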

Now let's process the course data from earlier. If, as before, we merely lowercase the words of each document, we get the following result:

>>> texts_lower = [[word for word in document.lower().split()] for document in courses]

>>> print texts_lower[0]

['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers…', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', …

Note that many punctuation marks are still stuck to the words, so we bring in NLTK's word_tokenize function and re-process the data:

>>> from nltk.tokenize import word_tokenize

>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]

>>> print texts_tokenized[0]

['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading', ',', 'research', ',', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic', ',', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers', '…', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', …

After tokenizing the English course data, we need to remove stopwords. Fortunately, NLTK provides a list of English stopwords:

>>> from nltk.corpus import stopwords

>>> english_stopwords = stopwords.words('english')

>>> print english_stopwords

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

>>> len(english_stopwords)

127

That makes 127 stopwords in total. First, let's filter the stopwords out of the course corpus:

>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]

>>> print texts_filtered_stopwords[0]

['writing', 'ii', ':', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', ',', 'research', ',', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', ',', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '…', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', ',', 'ideas', ',', 'talents', ',', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', ',', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', ',', 'visual', ',', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', ',', 'demonstrations', ',', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', ',', 'rhetoric', 'course', 'design', ',', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', ',', 'students', ',', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', ',', 'writers', 'exchange', ',', 'place', 'exchange', 'work', 'feedback']

The stopwords have been filtered out, but the punctuation tokens are still there. That's easy to fix: first define a list of punctuation marks:

>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
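
A small aside: instead of typing the list by hand, Python's string module already contains the ASCII punctuation characters, so an equivalent (slightly broader) definition could be:

>>> import string
>>> english_punctuations = list(string.punctuation)   # covers all of the symbols above plus a few more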

Then filter out these punctuation tokens:

>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]

>>> print texts_filtered[0]

['writing', 'ii', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'series', 'interactive', 'reading', 'research', 'composing', 'activities', 'along', 'assignments', 'designed', 'help', 'become', 'effective', 'consumers', 'producers', 'alphabetic', 'visual', 'multimodal', 'texts.', 'join', 'us', 'become', 'effective', 'writers', '…', 'better', 'citizens.', 'rhetorical', 'composing', 'course', 'writers', 'exchange', 'words', 'ideas', 'talents', 'support.', 'introduced', 'variety', 'rhetorical', 'concepts\xe2\x80\x94that', 'ideas', 'techniques', 'inform', 'persuade', 'audiences\xe2\x80\x94that', 'help', 'become', 'effective', 'consumer', 'producer', 'written', 'visual', 'multimodal', 'texts.', 'class', 'includes', 'short', 'videos', 'demonstrations', 'activities.', 'envision', 'rhetorical', 'composing', 'learning', 'community', 'includes', 'enrolled', 'course', 'instructors.', 'bring', 'expertise', 'writing', 'rhetoric', 'course', 'design', 'designed', 'assignments', 'course', 'infrastructure', 'help', 'share', 'experiences', 'writers', 'students', 'professionals', 'us.', 'collaborations', 'facilitated', 'wex', 'writers', 'exchange', 'place', 'exchange', 'work', 'feedback']

Going one step further, we stem these English words. NLTK offers several stemming interfaces to choose from; see this page for details:
http://nltk.org/api/nltk.stem.html. The options include well-known English stemmers such as the Lancaster Stemmer and the Porter Stemmer. Here we use the LancasterStemmer:

>>> from nltk.stem.lancaster import LancasterStemmer

>>> st = LancasterStemmer()

>>> st.stem('stemmed')

'stem'

>>> st.stem('stemming')

'stem'

>>> st.stem('stemmer')

'stem'

>>> st.stem('running')

'run'

>>> st.stem('maximum')

'maxim'

>>> st.stem('presumably')

'presum'
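
The Lancaster stemmer is known to be fairly aggressive. Since the NLTK page above also lists the Porter stemmer, a quick side-by-side comparison is easy if you want to judge which behaviour suits your data; a minimal sketch reusing the st object defined above (exact outputs depend on your NLTK version):

>>> from nltk.stem.porter import PorterStemmer
>>> pt = PorterStemmer()
>>> for w in ['stemmed', 'stemming', 'running', 'maximum', 'presumably']:
...     print w, '->', st.stem(w), '(Lancaster)', '|', pt.stem(w), '(Porter)'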

Let's apply the Lancaster stemmer to the course data above:

>>> texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered]

>>> print texts_stemmed[0]

['writ', 'ii', 'rhet', 'compos', 'rhet', 'compos', 'eng', 'sery', 'interact', 'read', 'research', 'compos', 'act', 'along', 'assign', 'design', 'help', 'becom', 'effect', 'consum', 'produc', 'alphabet', 'vis', 'multimod', 'texts.', 'join', 'us', 'becom', 'effect', 'writ', '…', 'bet', 'citizens.', 'rhet', 'compos', 'cours', 'writ', 'exchang', 'word', 'idea', 'tal', 'support.', 'introduc', 'vary', 'rhet', 'concepts\xe2\x80\x94that', 'idea', 'techn', 'inform', 'persuad', 'audiences\xe2\x80\x94that', 'help', 'becom', 'effect', 'consum', 'produc', 'writ', 'vis', 'multimod', 'texts.', 'class', 'includ', 'short', 'video', 'demonst', 'activities.', 'envid', 'rhet', 'compos', 'learn', 'commun', 'includ', 'enrol', 'cours', 'instructors.', 'bring', 'expert', 'writ', 'rhet', 'cours', 'design', 'design', 'assign', 'cours', 'infrastruct', 'help', 'shar', 'expery', 'writ', 'stud', 'profess', 'us.', 'collab', 'facilit', 'wex', 'writ', 'exchang', 'plac', 'exchang', 'work', 'feedback']

Before we bring in gensim there is one more thing to do: drop the low-frequency words that occur only once in the whole corpus. I tested this, and keeping them hurts the results a bit:

>>> all_stems = sum(texts_stemmed, [])

>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)

>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
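
One practical note: all_stems.count(stem) rescans the whole corpus for every unique stem, which is fine for 379 documents but gets slow on larger corpora. A one-pass equivalent using collections.Counter looks like this:

>>> from collections import Counter
>>> stem_counts = Counter(all_stems)
>>> stems_once = set(stem for stem, count in stem_counts.items() if count == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]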

3. Introducing gensim

With the preprocessing above in place, we can bring in gensim and quickly run the course similarity experiment. What follows is a quick pass through the pipeline; see the previous post for a detailed explanation of each step.

>>> from gensim import corpora, models, similarities

>>> import logging

>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> dictionary = corpora.Dictionary(texts)

2013-06-07 21:37:07,120 : INFO : adding document #0 to Dictionary(0 unique tokens)

2013-06-07 21:37:07,263 : INFO : built Dictionary(3341 unique tokens) from 379 documents (total 46417 corpus positions)

>>> corpus = [dictionary.doc2bow(text) for text in texts]

>>> tfidf = models.TfidfModel(corpus)

2013-06-07 21:58:30,490 : INFO : collecting document frequencies

2013-06-07 21:58:30,490 : INFO : PROGRESS: processing document #0

2013-06-07 21:58:30,504 : INFO : calculating IDF weights for 379 documents and 3341 features (29166 matrix non-zeros)

>>> corpus_tfidf = tfidf[corpus]

Here we somewhat arbitrarily decide to train an LSI model with 10 topics:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

>>> index = similarities.MatrixSimilarity(lsi[corpus])

2013-06-07 22:04:55,443 : INFO : scanning corpus to determine the number of features

2013-06-07 22:04:55,510 : INFO : creating matrix for 379 documents and 10 features
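
Since the number of topics was picked by gut feeling, it can be worth a quick look at what those 10 topics actually contain before trusting the similarity scores. gensim can show the top weighted stems per topic; a minimal sketch (depending on the gensim version the topics are returned or written to the log, and they will vary from run to run):

>>> lsi.print_topics(10)    # top words for each of the 10 LSI topics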

The LSI-based course index is now built. Let's use Professor Andrew Ng's Machine Learning course as the example; it sits on line 211 of our coursera_corpus file, that is:

>>> print courses_name[210]

Machine Learning

Now we can map this course into the 10-topic LSI space via the lsi model and compute its similarity to the other courses:

>>> ml_course = texts[210]

>>> ml_bow = dictionary.doc2bow(ml_course)

>>> ml_lsi = lsi[ml_bow]

>>> print ml_lsi

[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]

>>> sims = index[ml_lsi]

>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

Take the top 10 courses ranked by similarity:

>>> print sort_sims[0:10]

[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

The first course is the query course itself:

>>> print courses_name[210]

Machine Learning

The second is another machine learning course on Coursera, taught by another heavyweight, Pedro Domingos:

>>> print courses_name[174]

Machine Learning

The third is the Probabilistic Graphical Models course taught by Professor Daphne Koller, another Coursera co-founder and likewise a leading figure in the field:

>>> print courses_name[238]

Probabilistic Graphical Models

The fourth is the neural networks course by yet another superstar, Geoffrey Hinton, which some students describe as required viewing for Deep Learning:

>>> print courses_name[203]

Neural Networks for Machine Learning
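
If you want to repeat this query for other courses, it is convenient to wrap the steps above into a small helper; a rough sketch reusing the dictionary, lsi, index, texts and courses_name objects built earlier (the function name and the top_n parameter are just illustrative):

>>> def similar_courses(course_idx, top_n=10):
...     # project the preprocessed course onto the LSI topic space
...     query_bow = dictionary.doc2bow(texts[course_idx])
...     query_lsi = lsi[query_bow]
...     # rank every course in the corpus by similarity to the query
...     sims = index[query_lsi]
...     ranked = sorted(enumerate(sims), key=lambda item: -item[1])
...     # return the names and scores of the most similar courses
...     return [(courses_name[i], score) for i, score in ranked[:top_n]]
...
>>> print similar_courses(210)[0:3]    # the query course itself should come back first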

The results look pretty good. If this seems interesting, give it a try yourself.

And that wraps up this series. I had originally planned to write about an experiment on the full English Wikipedia dump, but since Course Graph does not need it for now, I will stop here; interested readers can go straight to the relevant gensim documentation, which is very detailed. Going forward I will probably focus more on applying NLTK to Chinese text processing. Stay tuned.
