cs231n - assignment1 - neural net gradient derivation

Implementing a Neural Network

In this exercise we will develop a neural network with fully-connected layers to perform classification, and test it out on the CIFAR-10 dataset.

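From here on the gradients are derived in matrix form, one layer at a time. Working layer by layer has a big advantage: each step reuses the gradient already computed for the layer above it.

First, a recap of the network structure: input layer (D), fully-connected layer with ReLU (H), softmax (C). The network input is X[N×D] and the ground truth is y[N×1].

Network parameters: W1[D×H], b1[1×H], W2[H×C], b2[1×C]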

Propagation:

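$$\mathrm{FC1\_out} = X \cdot W_1 + b_1 \tag{1}$$

$$\mathrm{H\_out} = \mathrm{maximum}(0, \mathrm{FC1\_out}) \tag{2}$$

$$\mathrm{FC2\_out} = \mathrm{H\_out} \cdot W_2 + b_2 \tag{3}$$

$$\mathrm{final\_output} = \mathrm{softmax}(\mathrm{FC2\_out}) \tag{4}$$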
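As a rough sketch of equations (1)-(4) in NumPy, the forward pass can be written as below. The helper name `forward` and the toy sizes are assumptions made only for illustration, and the row-wise max subtraction is a common stability trick rather than part of equation (4) itself.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass following equations (1)-(4); returns every intermediate."""
    fc1_out = X.dot(W1) + b1                 # (1): shape [N x H]
    h_out = np.maximum(0, fc1_out)           # (2): ReLU
    fc2_out = h_out.dot(W2) + b2             # (3): shape [N x C]
    # Shift each row by its max before exponentiating (does not change softmax).
    shifted = fc2_out - fc2_out.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    final_output = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # (4)
    return fc1_out, h_out, fc2_out, final_output

# Toy sizes, chosen only for illustration.
rng = np.random.RandomState(0)
N, D, H, C = 5, 4, 10, 3
X = rng.randn(N, D)
W1, b1 = 0.1 * rng.randn(D, H), np.zeros(H)
W2, b2 = 0.1 * rng.randn(H, C), np.zeros(C)

fc1_out, h_out, fc2_out, final_output = forward(X, W1, b1, W2, b2)
print(final_output.shape)  # (5, 3)
```

Each row of `final_output` is a probability distribution over the C classes, so it should sum to 1.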

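Backpropagation

$$\frac{\partial L}{\partial \mathrm{FC2\_out}} = \mathrm{final\_output}_{[N \times C]} - \mathrm{MaskMat}_{[N \times C]} \tag{5}$$

MaskMat is the one-hot label matrix: MaskMat[i, y[i]] = 1 and every other entry is 0. The code below averages the loss over the batch, so it additionally divides this gradient by N.

$$\frac{\partial L}{\partial W_2} = \frac{\partial \mathrm{FC2\_out}}{\partial W_2} \cdot \frac{\partial L}{\partial \mathrm{FC2\_out}} = \mathrm{H\_out}^{T} \cdot \frac{\partial L}{\partial \mathrm{FC2\_out}} \tag{6}$$

$$\frac{\partial L}{\partial b_2} = \frac{\partial \mathrm{FC2\_out}}{\partial b_2} \cdot \frac{\partial L}{\partial \mathrm{FC2\_out}} = [1 \ldots 1]_{[1 \times N]} \cdot \frac{\partial L}{\partial \mathrm{FC2\_out}} \tag{7}$$

$$\frac{\partial L}{\partial \mathrm{H\_out}} = \frac{\partial L}{\partial \mathrm{FC2\_out}} \cdot \frac{\partial \mathrm{FC2\_out}}{\partial \mathrm{H\_out}} = \frac{\partial L}{\partial \mathrm{FC2\_out}} \cdot W_2^{T}, \qquad \frac{\partial L}{\partial \mathrm{FC1\_out}} = \frac{\partial L}{\partial \mathrm{H\_out}} \odot \mathbb{1}[\mathrm{FC1\_out} > 0] \tag{8}$$

The ReLU passes the gradient through only where its input was positive, so the entries of the gradient at positions with FC1_out ≤ 0 are set to zero.

$$\frac{\partial L}{\partial W_1} = \frac{\partial \mathrm{FC1\_out}}{\partial W_1} \cdot \frac{\partial L}{\partial \mathrm{FC1\_out}} = X^{T} \cdot \frac{\partial L}{\partial \mathrm{FC1\_out}} \tag{9}$$

$$\frac{\partial L}{\partial b_1} = \frac{\partial \mathrm{FC1\_out}}{\partial b_1} \cdot \frac{\partial L}{\partial \mathrm{FC1\_out}} = [1 \ldots 1]_{[1 \times N]} \cdot \frac{\partial L}{\partial \mathrm{FC1\_out}} \tag{10}$$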
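One way to sanity-check equations (5)-(10) is to compare them with centered finite differences on a tiny random problem. The sketch below assumes the loss is the batch-averaged softmax loss without regularization (hence the extra 1/N in step (5)) and that the ReLU gradient is the mask FC1_out > 0. The helper names `loss_and_grads` and `numeric_grad` are hypothetical and are not part of the assignment code.

```python
import numpy as np

def loss_and_grads(X, y, W1, b1, W2, b2):
    """Mean softmax loss and the analytic gradients from equations (5)-(10)."""
    N = X.shape[0]
    fc1_out = X.dot(W1) + b1
    h_out = np.maximum(0, fc1_out)
    fc2_out = h_out.dot(W2) + b2
    exp_scores = np.exp(fc2_out - fc2_out.max(axis=1, keepdims=True))
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    mask_mat = np.zeros_like(probs)           # one-hot labels (MaskMat)
    mask_mat[np.arange(N), y] = 1.0
    d_fc2 = (probs - mask_mat) / N            # (5), divided by N for the mean
    dW2 = h_out.T.dot(d_fc2)                  # (6)
    db2 = np.ones(N).dot(d_fc2)               # (7): row of ones times gradient
    d_h = d_fc2.dot(W2.T)                     # (8), first half
    d_fc1 = d_h * (fc1_out > 0)               # (8), ReLU mask
    dW1 = X.T.dot(d_fc1)                      # (9)
    db1 = np.ones(N).dot(d_fc1)               # (10)
    return loss, {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}

def numeric_grad(f, x, h=1e-5):
    """Centered finite differences of the scalar f() with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old                          # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.RandomState(0)
N, D, H, C = 5, 4, 10, 3
X = rng.randn(N, D)
y = rng.randint(C, size=N)
params = {'W1': 0.5 * rng.randn(D, H), 'b1': 0.1 * rng.randn(H),
          'W2': 0.5 * rng.randn(H, C), 'b2': 0.1 * rng.randn(C)}

def current_loss():
    return loss_and_grads(X, y, **params)[0]

_, grads = loss_and_grads(X, y, **params)
max_err = {}
for name in sorted(params):
    num = numeric_grad(current_loss, params[name])
    max_err[name] = np.abs(num - grads[name]).max()
    print(name, max_err[name])
```

If the derivation is right, every printed difference should be tiny compared with the gradient entries themselves. A large mismatch on W1 or b1 alone usually points at the ReLU step.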


# neural_net.py

import numpy as np
import matplotlib.pyplot as plt

class TwoLayerNet(object):
  """
  A two-layer fully-connected neural network. The net has an input dimension of
  D, a hidden layer dimension of H, and performs classification over C classes.
  We train the network with a softmax loss function and L2 regularization on the
  weight matrices. The network uses a ReLU nonlinearity after the first fully
  connected layer.

  In other words, the network has the following architecture:

  input - fully connected layer - ReLU - fully connected layer - softmax

  The outputs of the second fully-connected layer are the scores for each class.
  """

  def __init__(self, input_size, hidden_size, output_size, std=1e-4):
    """
    Initialize the model. Weights are initialized to small random values and
    biases are initialized to zero. Weights and biases are stored in the
    variable self.params, which is a dictionary with the following keys:

    W1: First layer weights; has shape (D, H)
    b1: First layer biases; has shape (H,)
    W2: Second layer weights; has shape (H, C)
    b2: Second layer biases; has shape (C,)

    Inputs:
    - input_size: The dimension D of the input data.
    - hidden_size: The number of neurons H in the hidden layer.
    - output_size: The number of classes C.
    """
    self.params = {}
    self.params['W1'] = std * np.random.randn(input_size, hidden_size)
    self.params['b1'] = np.zeros(hidden_size)
    self.params['W2'] = std * np.random.randn(hidden_size, output_size)
    self.params['b2'] = np.zeros(output_size)

  def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None
    #############################################################################
    # TODO: Perform the forward pass, computing the class scores for the input. #
    # Store the result in the scores variable, which should be an array of      #
    # shape (N, C).                                                             #
    #############################################################################

    # evaluate class scores, [N x C]
    hidden_layer = np.maximum(0, np.dot(X,W1)+b1) # ReLU activation
    scores = np.dot(hidden_layer, W2)+b2

    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    # If the targets are not given then jump out, we're done
    if y is None:
      return scores

    # Compute the loss
    loss = None
    #############################################################################
    # TODO: Finish the forward pass, and compute the loss. This should include  #
    # both the data loss and L2 regularization for W1 and W2. Store the result  #
    # in the variable loss, which should be a scalar. Use the Softmax           #
    # classifier loss. So that your results match ours, multiply the            #
    # regularization loss by 0.5                                                #
    #############################################################################

    # compute the class probabilities
    #scores -= np.max(scores, axis = 1)[:, np.newaxis]
    #exp_scores = np.exp(scores)
    exp_scores = np.exp(scores-np.max(scores, axis=1, keepdims=True))
    probs = exp_scores/np.sum(exp_scores, axis=1, keepdims=True) #[N X C]

    correct_logprobs = -np.log(probs[range(N),y])
    data_loss = np.sum(correct_logprobs)/N
    reg_loss = 0.5 * reg * ( np.sum(W1*W1) + np.sum(W2*W2) )
    loss = data_loss + reg_loss

    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    # Backward pass: compute gradients
    grads = {}
    #############################################################################
    # TODO: Compute the backward pass, computing the derivatives of the weights #
    # and biases. Store the results in the grads dictionary. For example,       #
    # grads['W1'] should store the gradient on W1, and be a matrix of same size #
    #############################################################################

    # compute the gradient on scores
    dscores = probs
    dscores[range(N),y] -= 1
    dscores /= N

    # backpropagate the gradient to the parameters
    # first backprop into parameters W2 and b2
    dW2 = np.dot(hidden_layer.T, dscores)
    db2 = np.sum(dscores, axis=0, keepdims=False)
    # next backprop into hidden layer
    dhidden = np.dot(dscores, W2.T)
    # backprop the ReLU non-linearity
    dhidden[hidden_layer <= 0] = 0
    # finally into W,b
    dW1 = np.dot(X.T, dhidden)
    db1 = np.sum(dhidden, axis=0, keepdims=False)

    # add regularization gradient contribution
    dW2 += reg * W2
    dW1 += reg * W1

    grads['W1'] = dW1
    grads['W2'] = dW2
    grads['b1'] = db1
    grads['b2'] = db2
    #print dW1.shape, dW2.shape, db1.shape, db2.shape
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################

    return loss, grads

  def train(self, X, y, X_val, y_val,
            learning_rate=1e-3, learning_rate_decay=0.95,
            reg=1e-5, num_iters=100,
            batch_size=200, verbose=False):
    """
    Train this neural network using stochastic gradient descent.

    Inputs:
    - X: A numpy array of shape (N, D) giving training data.
    - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
      X[i] has label c, where 0 <= c < C.
    - X_val: A numpy array of shape (N_val, D) giving validation data.
    - y_val: A numpy array of shape (N_val,) giving validation labels.
    - learning_rate: Scalar giving learning rate for optimization.
    - learning_rate_decay: Scalar giving factor used to decay the learning rate
      after each epoch.
    - reg: Scalar giving regularization strength.
    - num_iters: Number of steps to take when optimizing.
    - batch_size: Number of training examples to use per step.
    - verbose: boolean; if true print progress during optimization.
    """
    num_train = X.shape[0]
    iterations_per_epoch = max(num_train // batch_size, 1)

    # Use SGD to optimize the parameters in self.model
    loss_history = []
    train_acc_history = []
    val_acc_history = []

    for it in range(num_iters):
      X_batch = None
      y_batch = None

      #########################################################################
      # TODO: Create a random minibatch of training data and labels, storing  #
      # them in X_batch and y_batch respectively.                             #
      #########################################################################
      sample_index = np.random.choice(num_train, batch_size, replace=True)
      X_batch = X[sample_index, :]
      y_batch = y[sample_index]

      #########################################################################
      #                             END OF YOUR CODE                          #
      #########################################################################

      # Compute loss and gradients using the current minibatch
      loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
      loss_history.append(loss)

      #########################################################################
      # TODO: Use the gradients in the grads dictionary to update the         #
      # parameters of the network (stored in the dictionary self.params)      #
      # using stochastic gradient descent. You'll need to use the gradients   #
      # stored in the grads dictionary defined above.                         #
      #########################################################################
      dW1 = grads['W1']
      dW2 = grads['W2']
      db1 = grads['b1']
      db2 = grads['b2']
      self.params['W1'] -= learning_rate*dW1
      self.params['W2'] -= learning_rate*dW2
      self.params['b1'] -= learning_rate*db1
      self.params['b2'] -= learning_rate*db2

      #########################################################################
      #                             END OF YOUR CODE                          #
      #########################################################################

      if verbose and it % 100 == 0:
        print('iteration %d / %d: loss %f' % (it, num_iters, loss))

      # Every epoch, check train and val accuracy and decay learning rate.
      if it % iterations_per_epoch == 0:
        # Check accuracy
        train_acc = (self.predict(X_batch) == y_batch).mean()
        val_acc = (self.predict(X_val) == y_val).mean()
        train_acc_history.append(train_acc)
        val_acc_history.append(val_acc)

        # Decay learning rate
        learning_rate *= learning_rate_decay

    return {
      'loss_history': loss_history,
      'train_acc_history': train_acc_history,
      'val_acc_history': val_acc_history,
    }

  def predict(self, X):
    """
    Use the trained weights of this two-layer network to predict labels for
    data points. For each data point we predict scores for each of the C
    classes, and assign each data point to the class with the highest score.

    Inputs:
    - X: A numpy array of shape (N, D) giving N D-dimensional data points to
      classify.

    Returns:
    - y_pred: A numpy array of shape (N,) giving predicted labels for each of
      the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
      to have class c, where 0 <= c < C.
    """
    y_pred = None

    ###########################################################################
    # TODO: Implement this function; it should be VERY simple!                #
    ###########################################################################
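    hidden_lay = np.maximum(0, np.dot(X, self.params['W1']) + self.params['b1'])
    # include b2 so the prediction uses the same scores as the loss function
    scores = np.dot(hidden_lay, self.params['W2']) + self.params['b2']
    y_pred = np.argmax(scores, axis=1)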

    ###########################################################################
    #                              END OF YOUR CODE                           #
    ###########################################################################

    return y_pred
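
The loss function above subtracts the row-wise maximum from the scores before exponentiating. The small sketch below illustrates why, using deliberately large toy scores. The function names `softmax_naive` and `softmax_stable` are made up for this illustration.

```python
import numpy as np

def softmax_naive(scores):
    # Exponentiates the raw scores; overflows to inf for large inputs.
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def softmax_stable(scores):
    # Same function mathematically, but the largest exponent is exp(0) = 1.
    exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow / invalid-value warnings that the naive version raises.
with np.errstate(over='ignore', invalid='ignore'):
    naive = softmax_naive(scores)
stable = softmax_stable(scores)

print(naive)   # inf / inf gives nan in every entry
print(stable)  # a valid probability distribution
```

Shifting every score in a row by the same constant leaves the softmax unchanged, which is why the stable version gives the same answer whenever the naive one does not overflow.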

Tune your hyperparameters

What's wrong? Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity, and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.

Tuning. Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using Neural Networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including hidden layer size, learning rate, number of training epochs, and regularization strength. You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.
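A common alternative to a hand-written grid is to sample hyperparameters at random, drawing the learning rate and regularization strength on a log scale. The sketch below only generates candidate configurations. The helper `sample_hyperparameters` and every range in it are assumptions for illustration, not values taken from the assignment.

```python
import numpy as np

def sample_hyperparameters(rng, num_samples):
    """Draw random configurations; all ranges here are illustrative guesses."""
    configs = []
    for _ in range(num_samples):
        configs.append({
            # log-uniform: every order of magnitude is equally likely
            'learning_rate': 10 ** rng.uniform(-4, -2),
            'reg': 10 ** rng.uniform(-5, -1),
            'hidden_size': int(rng.choice([50, 100, 200, 300])),
        })
    return configs

rng = np.random.RandomState(0)
configs = sample_hyperparameters(rng, 5)
for cfg in configs:
    print(cfg)
```

Each configuration could then be passed to a training run and scored on the validation set, in the same way as a grid sweep.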

Approximate results. You should aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.

Experiment: Your goal in this exercise is to get as good a result on CIFAR-10 as you can with a fully-connected Neural Network. For every 1% above 52% on the test set we will award you one extra bonus point. Feel free to implement your own techniques (e.g. PCA to reduce dimensionality, or adding dropout, or adding features to the solver, etc.).

# two_layer_net.ipynb

best_net = None # store the best model into this
best_stats = None
#################################################################################
# TODO: Tune hyperparameters using the validation set. Store your best trained  #
# model in best_net.                                                            #
#                                                                               #
# To help debug your network, it may help to use visualizations similar to the  #
# ones we used above; these visualizations will have significant qualitative    #
# differences from the ones we saw above for the poorly tuned network.          #
#                                                                               #
# Tweaking hyperparameters by hand can be fun, but you might find it useful to  #
# write code to sweep through possible combinations of hyperparameters          #
# automatically like we did on the previous exercises.                          #
#################################################################################
input_size = 32 * 32 * 3
hidden_size = 300
num_classes = 10

results = {}
best_val = -1
learning_rates = [1e-3, 1.2e-3, 1.4e-3, 1.6e-3, 1.8e-3]
regularization_strengths = [1e-4, 1e-3, 1e-2]

params = [(x,y) for x in learning_rates for y in regularization_strengths ]
for lrate, regular in params:
    net = TwoLayerNet(input_size, hidden_size, num_classes)
    # Train the network
    stats = net.train(X_train, y_train, X_val, y_val,
                      num_iters=1600, batch_size=400,
                      learning_rate=lrate, learning_rate_decay=0.90,
                      reg=regular, verbose=False)

    # Predict on the validation set
    accuracy_train = (net.predict(X_train) == y_train).mean()
    accuracy_val = (net.predict(X_val) == y_val).mean()
    results[(lrate, regular)] = (accuracy_train, accuracy_val)
    if( best_val < accuracy_val ):
        best_val = accuracy_val
        best_net = net
        best_stats = stats

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)

# Plot the loss function and train / validation accuracies
plt.subplot(2, 1, 1)
plt.plot(best_stats['loss_history'])
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(best_stats['train_acc_history'], label='train', color='r')
plt.plot(best_stats['val_acc_history'], label='val', color='g')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.legend()  # show the train / val labels set above
plt.show()

#################################################################################
#                               END OF YOUR CODE                                #
#################################################################################

lr 1.000000e-03 reg 1.000000e-04 train accuracy: 0.541551 val accuracy: 0.499000

lr 1.000000e-03 reg 1.000000e-03 train accuracy: 0.541694 val accuracy: 0.511000

lr 1.000000e-03 reg 1.000000e-02 train accuracy: 0.540898 val accuracy: 0.490000

lr 1.200000e-03 reg 1.000000e-04 train accuracy: 0.562041 val accuracy: 0.528000

lr 1.200000e-03 reg 1.000000e-03 train accuracy: 0.563653 val accuracy: 0.507000

lr 1.200000e-03 reg 1.000000e-02 train accuracy: 0.564184 val accuracy: 0.512000

lr 1.400000e-03 reg 1.000000e-04 train accuracy: 0.580857 val accuracy: 0.532000

lr 1.400000e-03 reg 1.000000e-03 train accuracy: 0.580857 val accuracy: 0.513000

lr 1.400000e-03 reg 1.000000e-02 train accuracy: 0.575245 val accuracy: 0.534000

lr 1.600000e-03 reg 1.000000e-04 train accuracy: 0.593347 val accuracy: 0.529000

lr 1.600000e-03 reg 1.000000e-03 train accuracy: 0.594857 val accuracy: 0.548000

lr 1.600000e-03 reg 1.000000e-02 train accuracy: 0.593878 val accuracy: 0.551000

lr 1.800000e-03 reg 1.000000e-04 train accuracy: 0.605306 val accuracy: 0.537000

lr 1.800000e-03 reg 1.000000e-03 train accuracy: 0.610000 val accuracy: 0.533000

lr 1.800000e-03 reg 1.000000e-02 train accuracy: 0.603204 val accuracy: 0.546000

best validation accuracy achieved during cross-validation: 0.551000

Test accuracy: 0.542

Time: 2024-11-08 21:46:03
