Converting between gensim and numpy arrays

Purpose

  Convert gensim's output format into a numpy array so that it can be used as input to scikit-learn and TensorFlow.

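Implementation

The stop words from the nltk library were merged with lists collected online into a new stop-word list, which is used to filter stop words out of the documents. Digits and special punctuation are also removed, and finally all letters are converted to lowercase.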

The original text:

Subject: Re: Candida(yeast) Bloom, Fact or Fiction

From: [email protected] (Pat Churchill)

Organization: Actrix Networks

Lines: 17

I am currently in the throes of a hay fever attack. SO who certainly

never reads Usenet, let alone Sci.med, said quite spontaneously "

There are a lot of mushrooms and toadstools out on the lawn at the

moment. Sure that's not your problem?"

Well, who knows? Or maybe it's the sourdough bread I bake?

After reading learned, semi-learned, possibly ignorant and downright

ludicrous stuff in this thread, I am about ready to believe anything :-)

If the hayfever gets any worse, maybe I will cook those toadstools...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The floggings will continue until morale improves

[email protected] Pat Churchill, Wellington New Zealand

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The result after tokenization:

['subject', 'bloom', 'fiction', 'pat', 'organization', 'networks', 'lines', 'hay', 'fever', 'attack', 'reads', 'usenet', 'lot', 'lawn', 'moment', 'bread', 'reading', 'learned', 'ignorant', 'stuff', 'thread', 'ready', 'worse', 'cook', 'continue', 'pat', 'wellington', 'zealand']
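The preprocessing described earlier (lowercasing, dropping digits and punctuation, filtering stop words) can be sketched roughly as follows. This is a minimal illustration, not the author's code: `STOP_WORDS` is a tiny hypothetical list and `tokenize` is a made-up helper name, so the output will contain more words than the result shown above, which used a much larger merged stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the original post merged
# NLTK's English stop words with lists collected from the web.
STOP_WORDS = {"i", "am", "in", "the", "of", "a", "who", "or", "it", "is", "not", "your", "that"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    # The pattern [a-z]+ discards digits and punctuation in one pass.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_words]


sample = "I am currently in the throes of a hay fever attack. Lines: 17"
tokens = tokenize(sample)
print(tokens)
```

Swapping in a fuller stop-word set only requires passing it as the `stop_words` argument.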

Converting this to a bag-of-words model with the gensim toolkit gives:

[(17, 1.0), (23, 1.0), (32, 1.0), (381, 1.0), (536, 1.0), (768, 1.0), (776, 1.0), (877, 1.0), (950, 1.0), (1152, 1.0), (1195, 1.0), (1389, 1.0), (1548, 1.0), (1577, 1.0), (1682, 2.0), (1861, 1.0), (2041, 1.0), (3098, 1.0), (3551, 1.0), (3886, 1.0), (5041, 1.0), (5148, 1.0), (5149, 1.0), (8494, 1.0), (8534, 1.0), (9972, 1.0), (11608, 1.0)]

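The gensim format above cannot be used directly as input to scikit-learn or TensorFlow. It has to be converted with a utility function from the gensim package; after conversion, a document becomes: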

[ 0. 0. 0. ..., 0. 0. 0.]

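```python
import gensim

# corpus2dense is the key function. It returns a (num_terms, num_docs) array,
# so the result is transposed to get one row per document.
custom_train_matrix = gensim.matutils.corpus2dense(custom_train_corpus, num_terms=len_custom_dict).T
custom_test_matrix = gensim.matutils.corpus2dense(custom_test_corpus, num_terms=len_custom_dict).T
```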

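The output above is already a numpy array and can be used as input to scikit-learn and TensorFlow. We then apply tf-idf weighting, which raises the weight of important words and lowers the weight of common ones. Transforming the output above with the scikit-learn package gives: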

(0, 11608) 0.179650698812

(0, 9972) 0.235827148023

(0, 8534) 0.208306524508

...................

(0, 381) 0.119518580927

(0, 32) 0.034904390394

(0, 23) 0.0374215245774

(0, 17) 0.0349485731726
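The post does not show the tf-idf code itself. One plausible way to get output in this form is scikit-learn's `TfidfTransformer`, sketched below on a hypothetical count matrix; whether the author used this class or a different one is an assumption.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical dense count matrix: 3 documents x 4 terms.
# Term 3 occurs in every document, so it should be down-weighted.
counts = np.array([
    [1, 0, 2, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
], dtype=np.float64)

# Default settings: smoothed idf and L2-normalized rows.
tfidf = TfidfTransformer().fit_transform(counts)

print(sparse.issparse(tfidf))  # the result is a SciPy sparse matrix
print(tfidf)                   # printed as (row, column) value entries
```

In document 1, terms 1 and 3 both occur once, yet term 3 receives the smaller weight because it appears in all three documents.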

The tf-idf result is a sparse matrix that scikit-learn accepts directly as input data. To use it as input to TensorFlow, it still has to be converted to a numpy array:

```python
custom_train_matrix = custom_train_matrix.toarray()  # convert the sparse matrix to a numpy array
custom_test_matrix = custom_test_matrix.toarray()  # convert the sparse matrix to a numpy array
```
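As a small standalone check of what `toarray()` does, the sketch below builds a hypothetical SciPy sparse matrix by hand (standing in for the tf-idf output) and densifies it.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse matrix standing in for the tf-idf output: 2 documents x 5 terms.
# The constructor takes (values, (row indices, column indices)).
sparse_matrix = sparse.csr_matrix(
    ([0.6, 0.8, 1.0], ([0, 0, 1], [1, 4, 2])),
    shape=(2, 5),
)

# toarray() returns a plain numpy.ndarray with the zeros filled in.
dense_matrix = sparse_matrix.toarray()

print(type(dense_matrix), dense_matrix.shape)
print(dense_matrix.nbytes)  # dense storage grows with n_docs * n_terms
```

The dense array stores every zero explicitly, so with a vocabulary of more than ten thousand terms, as the token ids in this post suggest, memory use may become a concern for large corpora.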

Original post: https://www.cnblogs.com/linyihai/p/8608926.html

Time: 2024-10-07 18:25:11
