tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)
Parameters:
max_document_length: the maximum document length. Text longer than this is truncated; shorter text is padded with 0.
min_frequency: the minimum word frequency; words whose count is at or below this threshold are excluded from the vocabulary.
vocabulary: a CategoricalVocabulary object.
tokenizer_fn: the tokenizer function (a short sketch of these last options follows below).
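The last two parameters are easiest to understand from a small example. Here is a minimal sketch (the toy corpus and the whitespace tokenizer are made up for illustration): tokenizer_fn must be a generator that takes an iterator of strings and yields one token list per document, and min_frequency trims rare words out of the vocabulary.

from tensorflow.contrib import learn

docs = ['a a b', 'a c']  # hypothetical toy corpus

# A custom tokenizer_fn: a generator that takes an iterator of strings
# and yields one token list per document (here, whitespace splitting).
def whitespace_tokenizer(iterator):
    for doc in iterator:
        yield doc.split()

# With min_frequency=1, only words seen more than once survive the trim,
# so 'b' and 'c' (each seen once) fall back to 0, the <UNK> id.
proc = learn.preprocessing.VocabularyProcessor(
    max_document_length=3, min_frequency=1,
    tokenizer_fn=whitespace_tokenizer)
print(list(proc.fit_transform(docs)))  # e.g. [array([1, 1, 0]), array([1, 0, 0])]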
Example:
from tensorflow.contrib import learn
import numpy as np

max_document_length = 4
x_text = [
    'i love you',
    'me too'
]
# Build the vocabulary from the corpus.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
vocab_processor.fit(x_text)
# transform() returns a generator of padded id arrays.
print(next(vocab_processor.transform(['i me too'])).tolist())
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x)
The output is:
[1, 4, 5, 0]
[[1 2 3 0]
 [4 5 0 0]]
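Note the id assignment: fit numbers words in order of first appearance ('i'=1, 'love'=2, 'you'=3, 'me'=4, 'too'=5), and id 0 is reserved for '<UNK>' and padding, which is why the unseen sentence 'i me too' becomes [1, 4, 5, 0]. To map ids back to words, the processor also provides a reverse() method; a small sketch, continuing from the variables above:

# reverse() turns id sequences back into space-joined tokens;
# the padding id 0 comes back as the unknown token.
print(next(vocab_processor.reverse([[1, 4, 5, 0]])))  # e.g. 'i me too <UNK>'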
Let's look at the mapping between words and indices:
# vocabulary_._mapping is a private attribute, but it exposes
# the word -> id dict directly.
embedding_size = len(vocab_processor.vocabulary_)
print(embedding_size)
vocab_dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
The result is:
6
['<UNK>', 'i', 'love', 'you', 'me', 'too']
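Since _mapping is a private attribute, a more robust way to reuse a fitted vocabulary (for example at prediction time) is to serialize the whole processor with its save() and restore() methods. A minimal sketch, assuming the vocab_processor from above and a hypothetical path 'vocab.pkl':

vocab_processor.save('vocab.pkl')  # hypothetical output path
restored = learn.preprocessing.VocabularyProcessor.restore('vocab.pkl')
print(next(restored.transform(['me too'])).tolist())  # e.g. [4, 5, 0, 0]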
Original post: https://www.cnblogs.com/helloworld0604/p/9002337.html