Clustering NBA Point Guards: K-Means Explained

Dataset

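In NBA media coverage, sports reporters usually focus on a handful of players. With our data science hats on, we can't help wondering what sets these players apart from everyone else, so let's use data science to explore that question. The dataset for this article, nba_2013.csv, contains the performance of NBA players in the 2013-2014 season.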

player – name of the player
pos – the position of the player
g – number of games the player was in
pts – total points the player scored
fg. – field goal percentage
ft. – free throw percentage

import pandas as pd
import numpy as np

nba = pd.read_csv("nba_2013.csv")
nba.head(3)
'''
         player pos  age bref_team_id   g  gs    mp   fg  fga    fg.  \
0    Quincy Acy  SF   23          TOT  63   0   847   66  141  0.468
1  Steven Adams   C   20          OKC  81  20  1197   93  185  0.503
2   Jeff Adrien  PF   27          TOT  53  12   961  143  275  0.520   

      ...      drb  trb  ast  stl  blk  tov   pf  pts     season  season_end
0     ...      144  216   28   23   26   30  122  171  2013-2014        2013
1     ...      190  332   43   40   57   71  203  265  2013-2014        2013
2     ...      204  306   38   24   36   39  108  362  2013-2014        2013  

[3 rows x 31 columns]
'''

Point Guards

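The point guard is usually the organizer of the team's offense. By controlling the ball, the point guard decides when to pass and to whom, and handles the ball more than anyone else on the court. The point guard has to bring the ball safely from the backcourt to the frontcourt and then get it to teammates so that they have a chance to score. A competent point guard must be able to bring the ball past half court without trouble when guarded by a single defender, and must also pass well enough to put the ball where it needs to go most of the time: sometimes an open look for a shot, sometimes a better position from which to move the ball.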

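First, extract the rows for all point guards:

# .copy() gives an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
point_guards = nba[nba['pos'] == 'PG'].copy()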

Points Per Game

The dataset gives each player's total points (pts) and number of games played (g), but not points per game. That value can be computed from the other two:

point_guards['ppg'] = point_guards['pts'] / point_guards['g']

# Sanity check, make sure ppg = pts/g
point_guards[['pts', 'g', 'ppg']].head(5)
'''
    pts   g        ppg
24  930  71  13.098592
29  150  20   7.500000
30  660  79   8.354430
38  666  72   9.250000
50  378  55   6.872727
'''

Assist Turnover Ratio

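The NBA tracks a statistic called the assist/turnover ratio: a player's assists divided by his turnovers. It is commonly used as an indicator of how well a point guard is doing the job.

The assist/turnover ratio is computed as follows, where Assists is total assists (ast) and Turnovers is total turnovers (tov):

ATR = Assists / Turnovers

Players with zero turnovers have to be dropped before computing it. First, such players have probably appeared in very few games, so analyzing them is not meaningful; second, the divisor cannot be 0.

# Drop players with zero turnovers, then compute the assist/turnover ratio
point_guards = point_guards[point_guards['tov'] != 0].copy()
point_guards['atr'] = point_guards['ast'] / point_guards['tov']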
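
To see why the filter matters in practice, here is a minimal sketch on a made-up three-player table (the names and numbers are hypothetical, not taken from nba_2013.csv). It illustrates that pandas does not raise an error when a column is divided by zero; it quietly produces an infinite ratio, which would later distort every distance computed from it.

```python
import pandas as pd

# Hypothetical mini-table; names and numbers are invented for illustration.
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "ast": [120, 4, 300],
    "tov": [60, 0, 100],
})

# Without the filter, player B's ratio is 4 / 0, which pandas evaluates to inf.
unfiltered_atr = toy["ast"] / toy["tov"]

# Filtering first keeps every ratio finite.
filtered = toy[toy["tov"] != 0].copy()
filtered["atr"] = filtered["ast"] / filtered["tov"]

print(unfiltered_atr.tolist())
print(filtered[["player", "atr"]])
```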

Visualizing The Point Guards

Plot the point guards, with points per game on the X axis and assist/turnover ratio on the Y axis.

import matplotlib.pyplot as plt

plt.scatter(point_guards['ppg'], point_guards['atr'], c='y')
plt.title("Point Guards")
plt.xlabel('Points Per Game', fontsize=13)
plt.ylabel('Assist Turnover Ratio', fontsize=13)
plt.show()

Clustering Players

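A rough look at the plot above suggests about five regions where the points are concentrated. Clustering lets us group similar point guards together. K-Means is a commonly used centroid-based clustering algorithm: each cluster is represented by its centroid, the mean vector of the points in it, and clusters tend to form roughly circular regions around their centroids. We set K to 5. The earlier article "US Senators' Party Affiliation: K-Means Clustering" also used K-Means; there we called the sklearn implementation directly (from sklearn.cluster import KMeans) and computed the result in a single call, as in the snippet below. In this article we want to study the individual steps of K-Means, so we iterate by hand.

# Snippet from the earlier article, shown for reference only:
# `votes` is that article's DataFrame and is not defined here.
kmeans_model = KMeans(n_clusters=2, random_state=1)
senator_distances = kmeans_model.fit_transform(votes.iloc[:, 3:])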

Step 1

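First, randomly pick 5 players to serve as the initial centroids:

num_clusters = 5
# Use numpy's random function to pick num_clusters distinct row labels
# (replace=False avoids choosing the same player twice)
random_initial_points = np.random.choice(point_guards.index, size=num_clusters, replace=False)
# Use the random labels to create the centroids
# (.loc replaces the .ix indexer, which has been removed from pandas)
centroids = point_guards.loc[random_initial_points]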

Plot the initial centroids, with the centroids in red and all other points in yellow:

plt.scatter(point_guards['ppg'], point_guards['atr'], c='yellow')
plt.scatter(centroids['ppg'], centroids['atr'], c='red')
plt.title("Centroids")
plt.xlabel('Points Per Game', fontsize=13)
plt.ylabel('Assist Turnover Ratio', fontsize=13)
plt.show()

Next, convert the centroids into a dictionary whose keys are the cluster ids and whose values are the centroid coordinates ("ppg", "atr").

def centroids_to_dict(centroids):
    dictionary = dict()
    # iterating counter we use to generate a cluster_id
    counter = 0

    # iterate a pandas data frame row-wise using .iterrows()
    for index, row in centroids.iterrows():
        coordinates = [row['ppg'], row['atr']]  # a plain list: [ppg, atr]
        dictionary[counter] = coordinates
        counter += 1

    return dictionary

centroids_dict = centroids_to_dict(centroids)

After that, compute the distance from every point to each centroid and assign each point to the cluster whose centroid is nearest.

import math

# Euclidean distance between two points
def calculate_distance(centroid, player_values):  # both arguments are lists
    root_distance = 0
    for x in range(0, len(centroid)):
        difference = centroid[x] - player_values[x]
        squared_difference = difference**2
        root_distance += squared_difference

    euclid_distance = math.sqrt(root_distance)
    return euclid_distance

# Return the key of the cluster whose centroid is closest to this row
def assign_to_cluster(row):
    lowest_distance = -1
    closest_cluster = -1
    for cluster_id, centroid in centroids_dict.items():
        df_row = [row['ppg'], row['atr']]
        euclidean_distance = calculate_distance(centroid, df_row)

        if lowest_distance == -1:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
        elif euclidean_distance < lowest_distance:
            lowest_distance = euclidean_distance
            closest_cluster = cluster_id
    return closest_cluster

# Create a new column that stores the cluster id of every player
point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
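
Row-by-row `apply` is easy to follow but becomes slow on larger tables. As an alternative, the same assignment step can be sketched with NumPy broadcasting. This is only an illustration on made-up (ppg, atr) values; the arrays `players` and `centroids` are hypothetical stand-ins for the article's DataFrame and `centroids_dict`.

```python
import numpy as np

# Hypothetical data: four players as (ppg, atr) rows, and two centroids.
players = np.array([[5.0, 2.0], [6.0, 2.5], [20.0, 3.0], [22.0, 2.0]])
centroids = np.array([[5.0, 2.0], [21.0, 2.5]])

# Broadcasting yields an array of shape (n_players, n_centroids, 2)
# holding the coordinate differences for every player/centroid pair.
diffs = players[:, np.newaxis, :] - centroids[np.newaxis, :, :]

# Euclidean distance for every pair: shape (n_players, n_centroids).
distances = np.sqrt((diffs ** 2).sum(axis=2))

# The nearest centroid's index for each player.
cluster_ids = distances.argmin(axis=1)
print(cluster_ids)
```

The result could be written back with something like `point_guards['cluster'] = cluster_ids`, provided the rows of `players` are in the same order as the DataFrame.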

Plot the clusters after the first iteration, using a different color for each cluster:

def visualize_clusters(df, num_clusters):
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']

    for n in range(num_clusters):
        clustered_df = df[df['cluster'] == n]
        # Cluster n always gets colors[n]; wrap around if there are more clusters than colors
        plt.scatter(clustered_df['ppg'], clustered_df['atr'], c=colors[n % len(colors)])
    plt.xlabel('Points Per Game', fontsize=13)
    plt.ylabel('Assist Turnover Ratio', fontsize=13)
    plt.show()

visualize_clusters(point_guards, 5)

Step 2

Once every point has been assigned to a cluster, recompute the centroid of each cluster:

def recalculate_centroids(df):
    new_centroids_dict = dict()

    for cluster_id in range(0, num_clusters):
        values_in_cluster = df[df['cluster'] == cluster_id]
        # Calculate new centroid using mean of values in the cluster
        new_centroid = [np.average(values_in_cluster['ppg']), np.average(values_in_cluster['atr'])]
        new_centroids_dict[cluster_id] = new_centroid
    return new_centroids_dict

centroids_dict = recalculate_centroids(point_guards)
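
The same update step can also be expressed with a pandas `groupby`, which computes all per-cluster means in one pass. The sketch below uses a small made-up table (the values are hypothetical). One difference to keep in mind: a cluster with no members simply does not appear in the `groupby` result, whereas the loop above would try to average an empty selection.

```python
import pandas as pd

# Hypothetical players that have already been assigned to clusters 0 and 1.
toy = pd.DataFrame({
    "ppg": [4.0, 6.0, 20.0, 22.0],
    "atr": [2.0, 3.0, 2.5, 3.5],
    "cluster": [0, 0, 1, 1],
})

# Mean ppg and atr per cluster: each row of `means` is one centroid.
means = toy.groupby("cluster")[["ppg", "atr"]].mean()

# Same shape as the article's centroids_dict: {cluster_id: [ppg, atr]}
centroids_by_cluster = {
    cluster_id: [row["ppg"], row["atr"]] for cluster_id, row in means.iterrows()
}
print(centroids_by_cluster)
```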

Repeat Step 1

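Then repeat Step 1 and reassign every point to the cluster whose centroid is now nearest (the result differs little from the first pass):

point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
visualize_clusters(point_guards, num_clusters)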

Repeat Step 2 And Step 1

centroids_dict = recalculate_centroids(point_guards)
point_guards['cluster'] = point_guards.apply(lambda row: assign_to_cluster(row), axis=1)
visualize_clusters(point_guards, num_clusters)
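
Rather than repeating the two steps by hand, they can be wrapped in a loop that stops once no point changes cluster. The following is a minimal, dependency-free sketch of that idea, not the article's own code: the helper names `assign`, `update` and `kmeans`, the `max_iter` cap, and the toy points are all assumptions made for illustration.

```python
import math

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))
        for point in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new_centroids.append(tuple(old))  # an empty cluster keeps its old centroid
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Repeat both steps until no point changes cluster or max_iter is reached."""
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move either
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids, iteration

# Made-up (ppg, atr) pairs and deliberately poor starting centroids.
toy_points = [(5.0, 2.0), (7.0, 3.0), (19.0, 2.0), (21.0, 3.0)]
final_labels, final_centroids, passes = kmeans(toy_points, [(5.0, 2.0), (7.0, 3.0)])
print(final_labels, final_centroids, passes)
```

On this toy input the first pass puts three points in one cluster, the second pass corrects that, and the third pass confirms that nothing changes any more.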

Challenges Of K-Means

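Across these first few iterations the assignments change only a little from one pass to the next, mainly for these reasons:

K-Means does not change any cluster drastically within a single iteration, which is why the algorithm always converges and behaves stably, although it may converge only to a local optimum.
Because K-Means iterates so conservatively, the final result depends heavily on the choice of initial centroids.

To mitigate these problems, the K-Means implementation in sklearn adds some smart features, such as running the clustering several times with different randomly chosen initial centroids (the n_init parameter) and keeping the best run. This is much less biased than relying on a single choice of initial centroids.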

from sklearn.cluster import KMeans

# n_init=10 runs K-Means from 10 different initializations and keeps the best result
kmeans = KMeans(n_clusters=num_clusters, n_init=10)
kmeans.fit(point_guards[['ppg', 'atr']])
point_guards[‘cluster‘] = kmeans.labels_

visualize_clusters(point_guards, num_clusters)
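
The dependence on the initial centroids can be made concrete with a small experiment. The sketch below is an illustration on made-up data, not part of the NBA analysis: it runs sklearn's `KMeans` twice on the four corners of a wide rectangle, each time with a single hand-picked initialization (`n_init=1`), and compares the resulting `inertia_` (the sum of squared distances from each point to its centroid). The arrays `X`, `good_init` and `bad_init` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: the four corners of a rectangle that is 4 wide and 1 tall.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Two hand-picked initializations (illustrative only).
good_init = np.array([[0.0, 0.0], [4.0, 0.0]])  # one centroid at each end
bad_init = np.array([[0.0, 0.0], [0.0, 1.0]])   # both centroids at the same end

# n_init=1 means a single run from exactly the given starting centroids.
good = KMeans(n_clusters=2, init=good_init, n_init=1).fit(X)
bad = KMeans(n_clusters=2, init=bad_init, n_init=1).fit(X)

# Lower inertia means tighter clusters.
print("good start:", good.labels_, good.inertia_)
print("bad start: ", bad.labels_, bad.inertia_)
```

From the good start the algorithm splits the points into left and right pairs; from the bad start it settles on top and bottom pairs, which is a stable but much looser solution. Running several initializations and keeping the lowest inertia is what `n_init` automates.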

Time: 2024-10-16 08:27:37
