Kaggle Competition: Sentiment Analysis on Movie Reviews

Classify the sentiment of sentences from the Rotten Tomatoes dataset

Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

I am growing more and more fond of the IPython notebook. All of the work below can be done on a single page; Firefox supports it better than Chrome.

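The dataset is split into train.tsv and test.tsv. Fields are separated by \t. Each row of train.tsv has four fields: PhraseId, SentenceId, Phrase, Sentiment. test.tsv has the same fields except Sentiment.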

Sentiment labels:

0 - negative

1 - somewhat negative

2 - neutral

3 - somewhat positive

4 - positive
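
When reading predictions later on, it can help to turn the numeric codes back into names. A minimal sketch, using the mapping from the list above; the names SENTIMENT_LABELS and describe are hypothetical helpers, not part of the competition or of any library:

```python
# Mapping copied from the label list above.
SENTIMENT_LABELS = {
    0: 'negative',
    1: 'somewhat negative',
    2: 'neutral',
    3: 'somewhat positive',
    4: 'positive',
}

def describe(sentiment):
    """Return the human-readable name for a numeric sentiment code (hypothetical helper)."""
    return SENTIMENT_LABELS[int(sentiment)]

print(describe(3))
```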

import pandas as pd
df = pd.read_csv('train.tsv',header=0,delimiter='\t')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
df.head()
   PhraseId  SentenceId                                             Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage ...          1
1         2           1  A series of escapades demonstrating the adage ...          2
2         3           1                                           A series          2
3         4           1                                                  A          2
4         5           1                                             series          2

df.Sentiment.value_counts()/df.Sentiment.count()
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64
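
The distribution above is heavily skewed toward label 2 (neutral), so any accuracy figure should be read against the trivial strategy of always predicting the most frequent label. A small sketch of that baseline, with the class shares copied from the output above; the variable names are illustrative only:

```python
# Class shares copied from the value_counts() output above.
class_share = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}

# Always predicting the most frequent label is correct exactly as often
# as that label occurs, so its accuracy equals the label's share.
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

print('always predict %d => accuracy %.3f' % (majority_label, baseline_accuracy))
```

A classifier on this data is only doing useful work to the extent that it beats roughly 51% accuracy.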

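As a quick sanity check, test classification accuracy directly on the first 5 rows of the training set (the model has already seen these rows during training, so this is not a held-out estimate):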

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
X_train = df['Phrase']
y_train = df['Sentiment']
# Bag-of-words counts -> TF-IDF weighting -> logistic regression
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                      ])
text_clf = text_clf.fit(X_train,y_train)
# Sanity check on the first 5 training rows
X_test = df.head()['Phrase']
predicted = text_clf.predict(X_test)
print(np.mean(predicted == df.head()['Sentiment']))
for phrase, sentiment in zip(X_test, predicted):
    print('%r => %s' % (phrase, sentiment))

Classification accuracy and results:

0.8
'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .' => 3
'A series of escapades demonstrating the adage that what is good for the goose' => 2
'A series' => 2
'A' => 2
'series' => 2
df.head()['Sentiment']
0    1
1    2
2    2
3    2
4    2

The first phrase is misclassified: the model predicts 3, but the true label is 1.
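
One detail behind the 'A' => 2 line in the results: with its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, so the phrase 'A' produces no features at all and the classifier can rely only on its intercept terms. The sketch below reproduces that default tokenization with the standard library. tokenize is a hypothetical helper name, and the sketch assumes the vectorizer is used with its defaults, as in the pipeline here:

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(phrase):
    """Approximate CountVectorizer's default analysis: lowercase, then regex tokens."""
    return TOKEN_PATTERN.findall(phrase.lower())

for phrase in ['A series', 'A', 'series']:
    print('%r => %r' % (phrase, tokenize(phrase)))
```

'A series' and 'series' therefore map to the same feature vector, and 'A' maps to an empty one.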

The test dataset:

test_df = pd.read_csv('test.tsv',header=0,delimiter='\t')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)
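
Since test.tsv carries no Sentiment column, the model cannot be scored locally on it. One option, not used in this post, is to hold out part of train.tsv. Because many phrases are fragments of the same sentence, a split by SentenceId keeps fragments of one sentence from landing on both sides. A rough sketch with the standard library; split_by_sentence and the toy rows are hypothetical and only illustrate the idea:

```python
import random

def split_by_sentence(rows, validation_fraction=0.2, seed=0):
    """Split rows into (train, validation) so that no SentenceId appears in both."""
    sentence_ids = sorted({row['SentenceId'] for row in rows})
    random.Random(seed).shuffle(sentence_ids)
    n_validation = max(1, int(len(sentence_ids) * validation_fraction))
    validation_ids = set(sentence_ids[:n_validation])
    train = [row for row in rows if row['SentenceId'] not in validation_ids]
    validation = [row for row in rows if row['SentenceId'] in validation_ids]
    return train, validation

# Toy rows for illustration only (not taken from train.tsv):
# 30 phrases, 3 per sentence, all labelled neutral.
rows = [{'PhraseId': i + 1, 'SentenceId': i // 3 + 1,
         'Phrase': 'phrase %d' % (i + 1), 'Sentiment': 2}
        for i in range(30)]

train, validation = split_by_sentence(rows)
shared = {r['SentenceId'] for r in train} & {r['SentenceId'] for r in validation}
print(len(train), len(validation), len(shared))
```

With the DataFrame from earlier, the same idea could be applied by filtering df on its SentenceId column.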

Classify the test dataset with the trained model:

from numpy import savetxt
X_test = test_df['Phrase']
phraseIds = test_df['PhraseId']
predicted = text_clf.predict(X_test)
# Pair each prediction with its own PhraseId instead of assuming the IDs start at 156061
pred = [[phrase_id, x] for phrase_id, x in zip(phraseIds, predicted)]
savetxt('../Submissions/lr_benchmark.csv',pred,delimiter=',',fmt='%d,%d',header='PhraseId,Sentiment',comments='')
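
The submission file written by the code above has a PhraseId,Sentiment header followed by one row per test phrase. As an alternative sketch of the same format using only the standard library csv module; write_submission is a hypothetical helper, and the predictions passed to it here are made up for illustration:

```python
import csv
import io

def write_submission(out, phrase_ids, sentiments):
    """Write 'PhraseId,Sentiment' rows to an open text file object."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['PhraseId', 'Sentiment'])
    for phrase_id, sentiment in zip(phrase_ids, sentiments):
        writer.writerow([int(phrase_id), int(sentiment)])

# Write to an in-memory buffer so the example leaves no files behind.
buffer = io.StringIO()
write_submission(buffer, [156061, 156062, 156063], [2, 3, 2])  # made-up predictions
print(buffer.getvalue())
```

To write a real file, the buffer could be replaced by a file opened with open(path, 'w', newline='').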

Submission result: [screenshot not preserved]

Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Time: 2024-10-16 16:38:51
