Kaggle in practice: Bag of Words Meets Bags of Popcorn

Since my editor kept crashing, I'm just pasting the code directly.

# first step: read the data
import pandas as pd
import numpy as np

# Read data from files
# These three lines read in the data files; pd.read_csv() takes quite a few parameters -- see the official docs
# Manually labeled training data
train = pd.read_csv( "data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
# Test set
test = pd.read_csv( "data/testData.tsv", header=0, delimiter="\t", quoting=3 )
# Unlabeled training data -- essentially no different from the test set; it can serve as extra corpus when training word2vec
unlabeled_train = pd.read_csv( "data/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
# Print how many reviews were read in
print "Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" % (train["review"].size, test["review"].size, unlabeled_train["review"].size)

# second step
# Import various modules for string cleaning
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

# Data preprocessing: strip HTML tags, remove non-letters, and (optionally) remove stop words
def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    # BeautifulSoup is a library frequently used when scraping the web; here it strips markup from the text.
    # The raw reviews contain tags such as <br /><br /> because they were scraped from web pages,
    # and these must be removed before further processing -- get_text() does exactly that.
    review_text = BeautifulSoup(review,"html.parser").get_text()
    #
    # 2. Remove non-letters
    # A bit of regular-expression work: replace every character that is not a letter with a space
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    # Lowercase everything so that, e.g., "The" and "the" are treated as the same token, then split on whitespace.
    # This is one difference between English and Chinese NLP: Chinese has no case to worry about,
    # but Chinese word segmentation is far harder than simply splitting English on spaces.
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    # Remove stop words, a very common NLP step. Why is it only *optional* here?
    # The answer comes later (stop words are kept when training word2vec).
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    #
    # 5. Return a list of words
    # print words
    return(words)
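
# A quick, made-up example review (not from the dataset) just to illustrate what the cleaning produces;
# the second result depends on NLTK's English stop-word list.
sample_review = "This movie was great!<br /><br />I watched it 3 times."
print review_to_wordlist(sample_review)
# -> ['this', 'movie', 'was', 'great', 'i', 'watched', 'it', 'times']
print review_to_wordlist(sample_review, remove_stopwords=True)
# -> roughly ['movie', 'great', 'watched', 'times']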

# Download the punkt tokenizer for sentence splitting
# nltk is a widely used Python NLP toolkit, but this step can go wrong:
# the nltk_data download location apparently changed, so I ended up finding nltk_data online myself
# and placing it in the expected directory, after which everything worked.
import nltk.data

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
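
# If you'd rather not place nltk_data by hand, the punkt model can usually be fetched through
# NLTK's built-in downloader instead (this only works if the download server is reachable):
# import nltk
# nltk.download('punkt')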

# Define a function to split a review into parsed sentences
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
    # Function to split a review into parsed sentences. Returns a
    # list of sentences, where each sentence is a list of words
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    # Use the NLTK tokenizer to split each review into sentences (e.g. on English periods '.').
    # review.strip() merely trims surrounding whitespace; the splitting itself is done by
    # tokenizer.tokenize() -- one has to envy how simple and effective English tokenization is.
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    # Each review has now been split into several sentences; skip any that are empty
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            # Call review_to_wordlist() to clean each sentence
            sentences.append( review_to_wordlist( raw_sentence, remove_stopwords ))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    # In other words: the output is a list of sentences, and each sentence is itself a list of words
    return sentences
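
# Illustrative call on a made-up two-sentence review (not from the dataset):
# print review_to_sentences("Great film. The <br /> acting was superb!".decode("utf8"), tokenizer)
#   -> roughly [['great', 'film'], ['the', 'acting', 'was', 'superb']]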

sentences = []  # Initialize an empty list of sentences
# Process the labeled training data: every review is split, cleaned and tokenized, and each resulting
# sentence is appended to the sentences list.
# Note that review_to_sentences(review.decode("utf8"), tokenizer) is called with remove_stopwords=False,
# i.e. stop words are kept. Why? It has to do with word2vec: keeping stop words preserves the full
# context of the corpus.
# The traditional way to represent text is BOW (bag of words), which has two drawbacks:
# 1. It cannot express relations between words: "basketball", "football" and "chicken leg" are unrelated
#    dimensions, even though "basketball" and "football" are obviously close in meaning.
# 2. The dimensionality is very high and costly to compute with, so it is usually reduced with techniques
#    such as mutual information or chi-square feature selection.
# word2vec also represents each word as a vector, but with two advantages:
# 1. Words with similar meanings end up close together (you can compute distances between the vectors).
# 2. The dimensionality is low and can be chosen by hand. Understanding word2vec properly requires a fair
#    amount of math, which I won't go into here.
print "Parsing sentences from training set"
for review in train["review"]:
    sentences += review_to_sentences(review.decode("utf8"), tokenizer)
# Why can the unlabeled reviews be used as well? Because word2vec is unsupervised; they simply serve
# as extra corpus for training the word vectors.
# This highlights one advantage of word2vec: unlabeled text is far easier to obtain than labeled text.
print "Parsing sentences from unlabeled set"
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review.decode("utf8"), tokenizer)
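
# Quick sanity check on the parsed corpus (the exact numbers depend on your data files)
print "%d sentences collected" % len(sentences)
print sentences[0]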

# Import the built-in logging module and configure it so that Word2Vec
# creates nice output messages
# Configure logging output; of the five standard logging levels we use level=logging.INFO here
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
# word2vec hyperparameters; for the exact meaning of each one, see the word2vec math and the gensim API docs
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count for a word to be kept in the vocabulary
num_workers = 4       # Number of worker threads to run in parallel
context = 10          # Context (sliding) window size
downsampling = 1e-3   # Downsampling (subsampling) of very frequent words -- not the same as negative sampling

# Initialize and train the model (this will take some time)
from gensim.models import word2vec
print "Training model..."
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count, window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
# Note that init_sims only normalizes the vectors in place to save memory -- it does not save the model.
# Saving (below) is worthwhile because training word2vec takes a while, and the saved model can be reused.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)
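
# In a later session, the saved model can be reloaded like this (assumes the file
# "300features_40minwords_10context" written above is in the working directory):
# from gensim.models import word2vec
# model = word2vec.Word2Vec.load("300features_40minwords_10context")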

#"man woman child kitchen"四个单词里面哪个和其他三个差距最大
# print model.doesnt_match("man woman child kitchen".split())
# print model.doesnt_match("france england germany berlin".split())
# print model.doesnt_match("paris berlin london austria".split())
#和"man"最像的单词
# print model.most_similar("man")
# print model.most_similar("queen")
# print model.most_similar("awful")

# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed

    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            # Look up the word's vector in the model and add it to the running sum
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec
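
# One caveat not handled above: if none of a review's words are in the word2vec vocabulary,
# nwords stays 0 and the division produces NaN. A minimal guard, sketched here as an extra
# helper rather than as part of the original function:
def makeFeatureVecSafe(words, model, num_features):
    vec = makeFeatureVec(words, model, num_features)
    if np.isnan(vec).any():
        # Fall back to a zero vector for reviews with no in-vocabulary words
        return np.zeros((num_features,), dtype="float32")
    return vec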

# The key question when modeling on top of word2vec is how to represent a single sample -- here, a review.
# The BOW bag-of-words model represents it trivially (a high-dimensional sparse vector), but with word2vec
# each word becomes a 300-dimensional vector, so a 100-word review is 100 such vectors.
# The approach here is to sum those 100 vectors and divide by 100, i.e. every review is represented by one
# 300-dimensional average vector. This looks crude, but two points in its favor:
# 1. The BOW model represents a sentence by accumulating per-word counts in much the same spirit.
# 2. It at least guarantees that every review is represented with the same, fixed dimensionality.
# (A toy numeric illustration of the averaging follows right after getAvgFeatureVecs() below.)
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    #
    # Loop through the reviews
    for review in reviews:
       #
       # Print a status message every 1000th review
       if counter%1000 == 0:
           print "Review %d of %d" % (counter, len(reviews))
       #
       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
       #
       # Increment the counter
       counter = counter + 1
    return reviewFeatureVecs
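
# The toy numeric illustration promised above, with made-up 3-dimensional "word vectors":
# a review containing just the words w1 and w2 is represented by the mean of their vectors.
w1 = np.array([1.0, 0.0, 2.0], dtype="float32")
w2 = np.array([3.0, 2.0, 0.0], dtype="float32")
print np.mean([w1, w2], axis=0)   # -> [ 2.  1.  1.]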

# Why remove stop words this time? Earlier, word2vec was learning word representations, and the more
# complete the corpus the better. Here the word vectors are used to represent whole documents, and stop
# words contribute almost nothing to that representation, so they are dropped.
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print "Creating average feature vecs for test reviews"
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )

print type(testDataVecs)
print len(testDataVecs)
print testDataVecs[0]
print len(testDataVecs[0])

# Fit a random forest to the training data, using 100 trees
# Model the data with a random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
forest = RandomForestClassifier( n_estimators = 100 )

print "Fitting a random forest to labeled training data..."
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results
result = forest.predict( testDataVecs )

# Write the test results
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

# Also fit an SVC on the same features and write out its predictions
model_svc = SVC().fit( trainDataVecs, train["sentiment"] )
result = model_svc.predict( testDataVecs )
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "okcing.csv", index=False, quoting=3 )
# In my runs the SVC did indeed score a bit better than the random forest
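
# The comparison above relies on Kaggle submissions; a rough local check is k-fold cross-validation.
# This sketch assumes a scikit-learn version that ships sklearn.model_selection (older releases
# expose cross_val_score under sklearn.cross_validation instead).
from sklearn.model_selection import cross_val_score
print cross_val_score(RandomForestClassifier(n_estimators=100), trainDataVecs, train["sentiment"], cv=5).mean()
print cross_val_score(SVC(), trainDataVecs, train["sentiment"], cv=5).mean()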