Classify the sentiment of sentences from the Rotten Tomatoes dataset
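Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
I am growing more and more fond of the IPython notebook. All of the work below can be done on a single page, and Firefox handles it better than Chrome.
The dataset is split into train.tsv and test.tsv. Fields are separated by \t, and each row of train.tsv has four fields: PhraseId, SentenceId, Phrase, Sentiment. test.tsv has the same layout without the Sentiment column.
Sentiment labels: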
0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
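To make the file layout concrete, here is a minimal sketch that parses a few tab-separated rows with Python's standard `csv` module and attaches the label name to each phrase. The names `SENTIMENT_NAMES`, `SAMPLE_TSV` and `read_phrases` are made up for this illustration; the three sample rows are the short phrases that appear in the `df.head()` output of this post.

```python
import csv
import io

# Label names as listed above.
SENTIMENT_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

# A few rows in the train.tsv layout (header plus three phrases).
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "3\t1\tA series\t2\n"
    "4\t1\tA\t2\n"
    "5\t1\tseries\t2\n"
)


def read_phrases(handle):
    """Yield (phrase_id, sentence_id, phrase, label_name) from a TSV file object."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield (
            int(row["PhraseId"]),
            int(row["SentenceId"]),
            row["Phrase"],
            SENTIMENT_NAMES[int(row["Sentiment"])],
        )


rows = list(read_phrases(io.StringIO(SAMPLE_TSV)))
for row in rows:
    print(row)
```

For the real file, the same function could be called on `open('train.tsv', newline='')` instead of the in-memory sample.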
    import pandas as pd
    df = pd.read_csv('train.tsv', header=0, delimiter='\t')
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 156060 entries, 0 to 156059
    Data columns (total 4 columns):
    PhraseId      156060 non-null int64
    SentenceId    156060 non-null int64
    Phrase        156060 non-null object
    Sentiment     156060 non-null int64
    dtypes: int64(3), object(1)
df.head()
Out[6]:
| | PhraseId | SentenceId | Phrase | Sentiment |
---|---|---|---|---|
0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
2 | 3 | 1 | A series | 2 |
3 | 4 | 1 | A | 2 |
4 | 5 | 1 | series | 2 |
    In [13]: df.Sentiment.value_counts()/df.Sentiment.count()
    Out[13]:
    2    0.509945
    3    0.210989
    1    0.174760
    4    0.058990
    0    0.045316
    dtype: float64
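The proportions above mean that roughly half of all phrases are neutral, so a classifier that always answers 2 would already score about 0.51 on this distribution. The sketch below shows that majority-class baseline; `majority_baseline` and `toy_labels` are hypothetical names, and the toy labels are illustrative rather than real data. The `reported` dictionary simply copies the proportions printed above.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy labels for illustration only (not the real training data).
toy_labels = [2, 2, 2, 3, 1, 2, 4, 0, 3, 2]
label, acc = majority_baseline(toy_labels)
print(label, acc)

# Proportions copied from the value_counts() output above.
reported = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}
best_label = max(reported, key=reported.get)
print(best_label, reported[best_label])
```

Any model worth submitting should beat this baseline, which makes it a useful reference point when reading accuracy figures.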
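As a quick check, measure classification accuracy on the first five rows of the training set (these rows are also part of the data the model is fitted on):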
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    X_train = df['Phrase']
    y_train = df['Sentiment']

    # Word counts -> TF-IDF weighting -> logistic regression
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression()),
    ])
    text_clf = text_clf.fit(X_train, y_train)

    # Score the first five training rows
    X_test = df.head()['Phrase']
    predicted = text_clf.predict(X_test)
    print(np.mean(predicted == df.head()['Sentiment']))
    for phrase, sentiment in zip(X_test, predicted):
        print('%r => %s' % (phrase, sentiment))
Classification accuracy and results:

    0.8
    'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .' => 3
    'A series of escapades demonstrating the adage that what is good for the goose' => 2
    'A series' => 2
    'A' => 2
    'series' => 2

    df.head()['Sentiment']
    0    1
    1    2
    2    2
    3    2
    4    2

The first phrase is misclassified: the model predicts 3, but its label is 1.
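The check above scores the model on rows it was trained on, so 0.8 says little about how it behaves on unseen phrases. Because many phrases are fragments of the same sentence and share a SentenceId, one reasonable approach is to hold out whole sentences. The sketch below does this in plain Python; `split_by_sentence` and `accuracy` are hypothetical helpers, the sentence ids are toy values, and scikit-learn's `GroupShuffleSplit` offers comparable functionality.

```python
import random


def split_by_sentence(sentence_ids, holdout_fraction=0.2, seed=0):
    """Return (train_idx, holdout_idx) so that no SentenceId lands on both sides."""
    unique_ids = sorted(set(sentence_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_holdout = max(1, int(round(len(unique_ids) * holdout_fraction)))
    holdout_ids = set(unique_ids[:n_holdout])
    train_idx = [i for i, s in enumerate(sentence_ids) if s not in holdout_ids]
    holdout_idx = [i for i, s in enumerate(sentence_ids) if s in holdout_ids]
    return train_idx, holdout_idx


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy SentenceId column for illustration only.
sentence_ids = [1, 1, 1, 2, 2, 3, 3, 3, 4, 5]
train_idx, holdout_idx = split_by_sentence(sentence_ids)
print(train_idx, holdout_idx)

# The five predictions and labels shown above give the reported 0.8.
print(accuracy([3, 2, 2, 2, 2], [1, 2, 2, 2, 2]))
```

With the real data, the index lists could be passed to `df.iloc[...]` to build the training and held-out frames, fitting the pipeline on the former and scoring it on the latter.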
Test dataset:

    test_df = pd.read_csv('test.tsv', header=0, delimiter='\t')
    test_df.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 66292 entries, 0 to 66291
    Data columns (total 3 columns):
    PhraseId      66292 non-null int64
    SentenceId    66292 non-null int64
    Phrase        66292 non-null object
    dtypes: int64(2), object(1)

Classify the test dataset with the trained model:
    from numpy import savetxt

    X_test = test_df['Phrase']
    phraseIds = test_df['PhraseId']
    predicted = text_clf.predict(X_test)
    # Pair each prediction with its PhraseId from test.tsv
    # instead of hard-coding the 156061 offset
    pred = list(zip(phraseIds, predicted))
    savetxt('../Submissions/lr_benchmark.csv', pred, delimiter=',', fmt='%d,%d',
            header='PhraseId,Sentiment', comments='')
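The submission file is a two-column CSV with the header `PhraseId,Sentiment`. As an alternative illustration of that format, the sketch below writes it with the standard `csv` module to an in-memory buffer. `write_submission` is a hypothetical helper, and the ids and labels passed to it are made up for the example.

```python
import csv
import io


def write_submission(handle, phrase_ids, predictions):
    """Write PhraseId,Sentiment rows to an open text file object."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PhraseId", "Sentiment"])
    for phrase_id, label in zip(phrase_ids, predictions):
        writer.writerow([int(phrase_id), int(label)])


# Illustrative ids and labels only.
buffer = io.StringIO()
write_submission(buffer, [156061, 156062, 156063], [2, 3, 2])
print(buffer.getvalue())
```

To write a real file, the buffer could be replaced with `open(path, 'w', newline='')`.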
Submission result:
Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Time: 2024-10-16 16:38:51