Python for Data Analysis | MovieLens

Background

The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from roughly 6,000 users.

ratings.dat

UserID::MovieID::Rating::Timestamp

users.dat

UserID::Gender::Age::Occupation::Zip-code

movies.dat

MovieID::Title::Genres

Use pandas.read_table to load each table into its own pandas DataFrame object.

* Pass header=None (the keyword and its value are case-sensitive).

In [1]: import pandas as pd

In [2]: unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
In [3]: users = pd.read_table('C:/Users/I******/Desktop/.../movielens/users.dat', sep='::', header=None, names=unames)

In [4]: rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
In [5]: ratings = pd.read_table('C:/Users/I******/Desktop/.../movielens/ratings.dat', sep='::', header=None, names=rnames)

In [6]: mnames = ['movie_id', 'title', 'genres']
In [7]: movies = pd.read_table('C:/Users/I******/Desktop/.../movielens/movies.dat', sep='::', header=None, names=mnames)
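The loading step can be tried without the data files. The sketch below is only an illustration: it parses two rows in the users.dat layout from an in-memory string (the values are copied from the users output in this post). The names sample and users_sample are made up for the example. Passing engine='python' and dtype={'zip': str} is an assumption about what is convenient here, not something the original session did: multi-character separators such as '::' are handled by the Python parsing engine, and reading zip as a string keeps leading zeros.

```python
import io

import pandas as pd

# Two sample rows in the users.dat layout (UserID::Gender::Age::Occupation::Zip-code).
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users_sample = pd.read_csv(
    io.StringIO(sample),
    sep='::',            # multi-character separator
    engine='python',     # the Python engine handles separators longer than one character
    header=None,         # the file has no header row
    names=unames,
    dtype={'zip': str},  # keep zip codes as text so "02460" keeps its leading zero
)
print(users_sample)
```

With a real file, the same keyword arguments can be passed together with the file path instead of the StringIO object.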

Using Python's slice syntax, look at the first few rows of each DataFrame to verify that the data loaded correctly.

In [8]: users[:5]
Out[8]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

In [9]: ratings[:5]
Out[9]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [10]: movies[:5]
Out[10]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

In [11]: ratings
Out[11]:
         user_id  movie_id  rating  timestamp
0              1      1193       5  978300760
1              1       661       3  978302109
2              1       914       3  978301968
3              1      3408       4  978300275
4              1      2355       5  978824291
5              1      1197       3  978302268
6              1      1287       5  978302039
7              1      2804       5  978300719
8              1       594       4  978302268
9              1       919       4  978301368
10             1       595       5  978824268
11             1       938       4  978301752
12             1      2398       4  978302281
13             1      2918       4  978302124
14             1      1035       5  978301753
15             1      2791       4  978302188
16             1      2687       3  978824268
17             1      2018       4  978301777
18             1      3105       5  978301713
19             1      2797       4  978302039
20             1      2321       3  978302205
21             1       720       3  978300760
22             1      1270       5  978300055
23             1       527       5  978824195
24             1      2340       3  978300103
25             1        48       5  978824351
26             1      1097       4  978301953
27             1      1721       4  978300055
28             1      1545       4  978824139
29             1       745       3  978824268
...          ...       ...     ...        ...
1000179     6040      2762       4  956704584
1000180     6040      1036       3  956715455
1000181     6040       508       4  956704972
1000182     6040      1041       4  957717678
1000183     6040      3735       4  960971654
1000184     6040      2791       4  956715569
1000185     6040      2794       1  956716438
1000186     6040       527       5  956704219
1000187     6040      2003       1  956716294
1000188     6040       535       4  964828734
1000189     6040      2010       5  957716795
1000190     6040      2011       4  956716113
1000191     6040      3751       4  964828782
1000192     6040      2019       5  956703977
1000193     6040       541       4  956715288
1000194     6040      1077       5  964828799
1000195     6040      1079       2  956715648
1000196     6040       549       4  956704746
1000197     6040      2020       3  956715288
1000198     6040      2021       3  956716374
1000199     6040      2022       5  956716207
1000200     6040      2028       5  956704519
1000201     6040      1080       4  957717322
1000202     6040      1089       4  956704996
1000203     6040      1090       3  956715518
1000204     6040      1091       1  956716541
1000205     6040      1094       5  956704887
1000206     6040       562       5  956704746
1000207     6040      1096       4  956715648
1000208     6040      1097       4  956715569

[1000209 rows x 4 columns]

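First use pandas' merge function to merge ratings with users, then merge movies into the result. pandas infers which columns to use as merge (or join) keys from the overlapping column names.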

In [12]: data = pd.merge(pd.merge(ratings, users), movies)

In [13]: data
Out[13]:
         user_id  movie_id  rating   timestamp gender  age  occupation    zip  \
0              1      1193       5   978300760      F    1          10  48067
1              2      1193       5   978298413      M   56          16  70072
2             12      1193       4   978220179      M   25          12  32793
3             15      1193       4   978199279      M   25           7  22903
4             17      1193       5   978158471      M   50           1  95350
5             18      1193       4   978156168      F   18           3  95825
6             19      1193       5   982730936      M    1          10  48073
7             24      1193       5   978136709      F   25           7  10023
8             28      1193       3   978125194      F   25           1  14607
9             33      1193       5   978557765      M   45           3  55421
10            39      1193       5   978043535      M   18           4  61820
11            42      1193       3   978038981      M   25           8  24502
12            44      1193       4   978018995      M   45          17  98052
13            47      1193       4   977978345      M   18           4  94305
14            48      1193       4   977975061      M   25           4  92107
15            49      1193       4   978813972      M   18          12  77084
16            53      1193       5   977946400      M   25           0  96931
17            54      1193       5   977944039      M   50           1  56723
18            58      1193       5   977933866      M   25           2  30303
19            59      1193       4   977934292      F   50           1  55413
20            62      1193       4   977968584      F   35           3  98105
21            80      1193       4   977786172      M   56           1  49327
22            81      1193       5   977785864      F   25           0  60640
23            88      1193       5   977694161      F   45           1  02476
24            89      1193       5   977683596      F   56           9  85749
25            95      1193       5   977626632      M   45           0  98201
26            96      1193       3   977621789      F   25          16  78028
27            99      1193       2   982791053      F    1          10  19390
28           102      1193       5  1040737607      M   35          19  20871
29           104      1193       2   977546620      M   25          12  00926
...          ...       ...     ...         ...    ...  ...         ...    ...
1000179     4933      3084       3   962757020      M   25          15  94040
1000180     4802      2218       2  1014866656      M   56           1  40601
1000181     4812      2308       2   962932391      M   18          14  25301
1000182     4874       624       4   962781918      F   25           4  70808
1000183     5059      1434       4   962484364      M   45          16  22652
1000184     5947      1434       4   957190428      F   45          16  97215
1000185     5077      1868       3   962417299      M   25           2  20037
1000186     5944      1868       1   957197520      F   18          10  27606
1000187     5105       404       3   962337582      M   50           7  18977
1000188     5185       404       4   963402617      F   35           4  44485
1000189     5532       404       5   959619841      M   25          17  27408
1000190     5543       404       3   960127592      M   25          17  97401
1000191     5220      2543       3   961546137      M   25           7  91436
1000192     5754      2543       4   958272316      F   18           1  60640
1000193     5227       591       3   961475931      M   18          10  64050
1000194     5795       591       1   958145253      M   25           1  92688
1000195     5313      3656       5   960920392      M   56           0  55406
1000196     5328      2438       4   960838075      F   25           4  91740
1000197     5334      3323       3   960796159      F   56          13  46140
1000198     5334       127       1   960795494      F   56          13  46140
1000199     5334      3382       5   960796159      F   56          13  46140
1000200     5420      1843       3   960156505      F    1          19  14850
1000201     5433       286       3   960240881      F   35          17  45014
1000202     5494      3530       4   959816296      F   35          17  94306
1000203     5556      2198       3   959445515      M   45           6  92103
1000204     5949      2198       5   958846401      M   18          17  47901
1000205     5675      2703       3   976029116      M   35          14  30030
1000206     5780      2845       1   958153068      M   18          17  92886
1000207     5851      3607       5   957756608      F   18          20  55410
1000208     5938      2909       4   957273353      M   25           1  35401

                                                     title  \
0                   One Flew Over the Cuckoo's Nest (1975)
1                   One Flew Over the Cuckoo's Nest (1975)
2                   One Flew Over the Cuckoo's Nest (1975)
3                   One Flew Over the Cuckoo's Nest (1975)
4                   One Flew Over the Cuckoo's Nest (1975)
5                   One Flew Over the Cuckoo's Nest (1975)
6                   One Flew Over the Cuckoo's Nest (1975)
7                   One Flew Over the Cuckoo's Nest (1975)
8                   One Flew Over the Cuckoo's Nest (1975)
9                   One Flew Over the Cuckoo's Nest (1975)
10                  One Flew Over the Cuckoo's Nest (1975)
11                  One Flew Over the Cuckoo's Nest (1975)
12                  One Flew Over the Cuckoo's Nest (1975)
13                  One Flew Over the Cuckoo's Nest (1975)
14                  One Flew Over the Cuckoo's Nest (1975)
15                  One Flew Over the Cuckoo's Nest (1975)
16                  One Flew Over the Cuckoo's Nest (1975)
17                  One Flew Over the Cuckoo's Nest (1975)
18                  One Flew Over the Cuckoo's Nest (1975)
19                  One Flew Over the Cuckoo's Nest (1975)
20                  One Flew Over the Cuckoo's Nest (1975)
21                  One Flew Over the Cuckoo's Nest (1975)
22                  One Flew Over the Cuckoo's Nest (1975)
23                  One Flew Over the Cuckoo's Nest (1975)
24                  One Flew Over the Cuckoo's Nest (1975)
25                  One Flew Over the Cuckoo's Nest (1975)
26                  One Flew Over the Cuckoo's Nest (1975)
27                  One Flew Over the Cuckoo's Nest (1975)
28                  One Flew Over the Cuckoo's Nest (1975)
29                  One Flew Over the Cuckoo's Nest (1975)
...                                                    ...
1000179                                   Home Page (1999)
1000180                            Juno and Paycock (1930)
1000181                                Detroit 9000 (1973)
1000182                               Condition Red (1995)
1000183                               Stranger, The (1994)
1000184                               Stranger, The (1994)
1000185                                  Truce, The (1996)
1000186                                  Truce, The (1996)
1000187  Brother Minister: The Assassination of Malcolm...
1000188  Brother Minister: The Assassination of Malcolm...
1000189  Brother Minister: The Assassination of Malcolm...
1000190  Brother Minister: The Assassination of Malcolm...
1000191                          Six Ways to Sunday (1997)
1000192                          Six Ways to Sunday (1997)
1000193                            Tough and Deadly (1995)
1000194                            Tough and Deadly (1995)
1000195                                       Lured (1947)
1000196                               Outside Ozona (1998)
1000197                              Chain of Fools (2000)
1000198  Silence of the Palace, The (Saimt el Qusur) (1...
1000199                             Song of Freedom (1936)
1000200                     Slappy and the Stinkers (1998)
1000201                           Nemesis 2: Nebula (1995)
1000202                          Smoking/No Smoking (1993)
1000203                                 Modulations (1998)
1000204                                 Modulations (1998)
1000205                              Broken Vessels (1998)
1000206                                  White Boys (1999)
1000207                           One Little Indian (1973)
1000208        Five Wives, Three Secretaries and Me (1998)

                         genres
0                         Drama
1                         Drama
2                         Drama
3                         Drama
4                         Drama
5                         Drama
6                         Drama
7                         Drama
8                         Drama
9                         Drama
10                        Drama
11                        Drama
12                        Drama
13                        Drama
14                        Drama
15                        Drama
16                        Drama
17                        Drama
18                        Drama
19                        Drama
20                        Drama
21                        Drama
22                        Drama
23                        Drama
24                        Drama
25                        Drama
26                        Drama
27                        Drama
28                        Drama
29                        Drama
...                         ...
1000179             Documentary
1000180                   Drama
1000181            Action|Crime
1000182   Action|Drama|Thriller
1000183                  Action
1000184                  Action
1000185               Drama|War
1000186               Drama|War
1000187             Documentary
1000188             Documentary
1000189             Documentary
1000190             Documentary
1000191                  Comedy
1000192                  Comedy
1000193   Action|Drama|Thriller
1000194   Action|Drama|Thriller
1000195                   Crime
1000196          Drama|Thriller
1000197            Comedy|Crime
1000198                   Drama
1000199                   Drama
1000200       Children's|Comedy
1000201  Action|Sci-Fi|Thriller
1000202                  Comedy
1000203             Documentary
1000204             Documentary
1000205                   Drama
1000206                   Drama
1000207    Comedy|Drama|Western
1000208             Documentary

[1000209 rows x 10 columns]
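The key inference used by the merge can also be spelled out by hand. The following is a minimal sketch on made-up toy frames (ratings_s, users_s, movies_s and their values are hypothetical, not taken from the dataset). It compares the nested pd.merge call with a chained version that names the join column through the on parameter, which may be easier to read when tables share more than one column name.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# Keys inferred from overlapping column names (user_id, then movie_id).
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# The same joins with the key columns written out.
explicit = ratings_s.merge(users_s, on='user_id').merge(movies_s, on='movie_id')

print(explicit)
```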

Inspecting a specific record

Error

In [14]: data.ix[0]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

Solution

In [15]: data.iloc[0]
Out[15]:
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
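The two replacements named in the deprecation warning behave differently once the index is not 0, 1, 2, and so on. A small sketch on a made-up frame (df and its index labels are hypothetical) shows the distinction: iloc selects by position, loc selects by label.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame({'rating': [5, 3, 4]}, index=[10, 20, 30])

first_by_position = df.iloc[0]   # first row, whatever its label is
row_with_label_20 = df.loc[20]   # the row whose index label is 20

print(first_by_position['rating'], row_with_label_20['rating'])
```

In the merged data above the index is the default integer range, so data.iloc[0] and data.loc[0] would return the same row.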

Use the pivot_table method to compute the mean rating of each movie by gender.

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All') [01]

In [16]: mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')

In [17]: mean_ratings[:10]
Out[17]:
gender                                    F         M
title
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.388889  3.352941
'Til There Was You (1997)          2.675676  2.733333
'burbs, The (1989)                 2.793478  2.962085
...And Justice for All (1979)      3.828571  3.689024
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
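One way to read the pivot_table call is as a group-by on (title, gender) followed by moving gender into the columns. The sketch below checks that reading on made-up data (the toy frame and its values are hypothetical). It is meant as an aid to understanding, not as a statement of how pandas implements pivot_table internally.

```python
import pandas as pd

# Hypothetical ratings: movie A rated by two women and one man, B by one of each.
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# Mean rating per title (rows) and gender (columns).
via_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# The same table built from groupby + unstack.
via_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(via_pivot)
```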

Filter out movies with fewer than 250 ratings.

First group by title, then use size() to get a Series containing the size of each movie group:

In [18]: ratings_by_title = data.groupby('title').size()

In [19]: ratings_by_title[:10]
Out[19]:
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64

Keep the titles of movies that have at least 250 ratings.

In [20]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [21]: active_titles
Out[21]:
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'X-Men (2000)', u'Year of Living Dangerously (1982)',
       u'Yellow Submarine (1968)', u'You've Got Mail (1998)',
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1216)
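The count-then-filter step is ordinary boolean indexing on the index of the size() result. The sketch below runs it on made-up data. The toy frame is hypothetical, and min_ratings is a hypothetical threshold of 2 chosen so that the small example has something to filter; the post uses 250.

```python
import pandas as pd

# Hypothetical ratings: A has three ratings, B has one, C has two.
toy = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'C', 'C'],
                    'rating': [4, 5, 3, 2, 4, 1]})

counts = toy.groupby('title').size()   # Series: title -> number of ratings

min_ratings = 2                        # toy threshold (250 in the post)
active = counts.index[counts >= min_ratings]

print(list(active))
```

The comparison counts >= min_ratings yields a boolean Series aligned with counts.index, so indexing the index with it keeps only the titles that pass the threshold.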

Use these titles to select the required rows from mean_ratings.

Error

In [22]: mean_ratings = mean_ratings.ix[active_titles]
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

Solution

In [23]: mean_ratings = mean_ratings.loc[active_titles]

In [24]: mean_ratings
Out[24]:
gender                                                     F         M
title
'burbs, The (1989)                                  2.793478  2.962085
10 Things I Hate About You (1999)                   3.646552  3.311966
101 Dalmatians (1961)                               3.791444  3.500000
101 Dalmatians (1996)                               3.240000  2.911215
12 Angry Men (1957)                                 4.184397  4.328421
13th Warrior, The (1999)                            3.112000  3.168000
2 Days in the Valley (1996)                         3.488889  3.244813
20,000 Leagues Under the Sea (1954)                 3.670103  3.709205
2001: A Space Odyssey (1968)                        3.825581  4.129738
2010 (1984)                                         3.446809  3.413712
28 Days (2000)                                      3.209424  2.977707
39 Steps, The (1935)                                3.965517  4.107692
54 (1998)                                           2.701754  2.782178
7th Voyage of Sinbad, The (1958)                    3.409091  3.658879
8MM (1999)                                          2.906250  2.850962
About Last Night... (1986)                          3.188679  3.140909
Absent Minded Professor, The (1961)                 3.469388  3.446809
Absolute Power (1997)                               3.469136  3.327759
Abyss, The (1989)                                   3.659236  3.689507
Ace Ventura: Pet Detective (1994)                   3.000000  3.197917
Ace Ventura: When Nature Calls (1995)               2.269663  2.543333
Addams Family Values (1993)                         3.000000  2.878531
Addams Family, The (1991)                           3.186170  3.163498
Adventures in Babysitting (1987)                    3.455782  3.208122
Adventures of Buckaroo Bonzai Across the 8th Di...  3.308511  3.402321
Adventures of Priscilla, Queen of the Desert, T...  3.989071  3.688811
Adventures of Robin Hood, The (1938)                4.166667  3.918367
African Queen, The (1951)                           4.324232  4.223822
Age of Innocence, The (1993)                        3.827068  3.339506
Agnes of God (1985)                                 3.534884  3.244898
...                                                      ...       ...
White Men Can't Jump (1992)                         3.028777  3.231061
Who Framed Roger Rabbit? (1988)                     3.569378  3.713251
Who's Afraid of Virginia Woolf? (1966)              4.029703  4.096939
Whole Nine Yards, The (2000)                        3.296552  3.404814
Wild Bunch, The (1969)                              3.636364  4.128099
Wild Things (1998)                                  3.392000  3.459082
Wild Wild West (1999)                               2.275449  2.131973
William Shakespeare's Romeo and Juliet (1996)       3.532609  3.318644
Willow (1988)                                       3.658683  3.453543
Willy Wonka and the Chocolate Factory (1971)        4.063953  3.789474
Witness (1985)                                      4.115854  3.941504
Wizard of Oz, The (1939)                            4.355030  4.203138
Wolf (1994)                                         3.074074  2.899083
Women on the Verge of a Nervous Breakdown (1988)    3.934307  3.865741
Wonder Boys (2000)                                  4.043796  3.913649
Working Girl (1988)                                 3.606742  3.312500
World Is Not Enough, The (1999)                     3.337500  3.388889
Wrong Trousers, The (1993)                          4.588235  4.478261
Wyatt Earp (1994)                                   3.147059  3.283898
X-Files: Fight the Future, The (1998)               3.489474  3.493797
X-Men (2000)                                        3.682310  3.851702
Year of Living Dangerously (1982)                   3.951220  3.869403
Yellow Submarine (1968)                             3.714286  3.689286
You've Got Mail (1998)                              3.542424  3.275591
Young Frankenstein (1974)                           4.289963  4.239177
Young Guns (1988)                                   3.371795  3.425620
Young Guns II (1990)                                2.934783  2.904025
Young Sherlock Holmes (1985)                        3.514706  3.363344
Zero Effect (1998)                                  3.864407  3.723140
eXistenZ (1999)                                     3.098592  3.289086

[1216 rows x 2 columns]

To see which movies female viewers liked most, sort by the F column in descending order.

Error

In [25]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
C:\Users\I******\AppData\Local\Enthought\Canopy\App\appdata\canopy-2.1.3.3542.win-x86_64\lib\site-packages\IPython\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)

Solution

In [26]: top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [27]: top_female_ratings[:10]
Out[27]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248

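Measuring rating disagreement between genders:

Add a column to mean_ratings that holds the difference between the mean ratings (M minus F), then sort by it. The first rows are the movies that female viewers preferred: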

In [28]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [29]: sorted_by_diff = mean_ratings.sort_values(by='diff')

In [30]: sorted_by_diff[:15]
Out[30]:
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536

Reverse the sorted result and take the first 15 rows to get the movies that male viewers preferred:

In [31]: sorted_by_diff[::-1][:15]
Out[31]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
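The sign convention is easy to lose track of: because diff is M minus F, the most negative values mark movies women rated higher and the most positive values mark movies men rated higher. A minimal sketch on made-up means (the frame means and the titles X, Y, Z are hypothetical) illustrates both ends of the sorted table, and shows that sorting in descending order gives the same ordering as reversing the ascending result when there are no tied values.

```python
import pandas as pd

# Hypothetical mean ratings for three movies.
means = pd.DataFrame({'F': [3.8, 3.5, 4.5], 'M': [3.0, 4.2, 4.5]},
                     index=['X', 'Y', 'Z'])
means.index.name = 'title'

means['diff'] = means['M'] - means['F']   # negative: rated higher by women

by_diff = means.sort_values(by='diff')    # ascending by default
preferred_by_women = by_diff.index[0]     # most negative diff
preferred_by_men = by_diff.index[-1]      # most positive diff

# Alternative to by_diff[::-1]: sort in descending order directly.
descending = means.sort_values(by='diff', ascending=False)

print(preferred_by_women, preferred_by_men)
```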

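Setting gender aside, use the variance or standard deviation of the ratings to find the movies on which viewers disagreed most.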

In [32]: rating_std_by_title = data.groupby('title')['rating'].std()

In [33]: rating_std_by_title = rating_std_by_title.loc[active_titles]

In [34]: rating_std_by_title.sort_values(ascending=False)[:10]
Out[34]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
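To make the meaning of these numbers concrete, the sketch below computes the per-title standard deviation on made-up ratings (the toy frame is hypothetical). Movie A splits its audience between 1 and 5, while movie B gets nearly uniform scores. Note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical ratings: A is polarizing, B is rated almost uniformly.
toy = pd.DataFrame({
    'title': ['A'] * 4 + ['B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

std_by_title = toy.groupby('title')['rating'].std()   # sample std, ddof=1
most_divisive = std_by_title.sort_values(ascending=False).index[0]

print(std_by_title)
print(most_divisive)
```

For B the mean is 3.25 and the squared deviations sum to 0.75, so the sample variance is 0.75 / 3 = 0.25 and the standard deviation is 0.5; A's larger spread puts it at the top of the descending sort.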

Reference

01 http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

Time: 2024-10-07 11:11:06
