Reinforcement Learning: Q-learning Algorithm Study, Part 3

Q-learning source code analysis:

```java
import java.util.Random;

public class QLearning1
{
    private static final int Q_SIZE = 6;        // number of states (= number of actions)
    private static final double GAMMA = 0.8;    // discount factor
    private static final int ITERATIONS = 10;   // training passes over all initial states
    private static final int INITIAL_STATES[] = new int[] {1, 3, 5, 2, 4, 0};

    // Reward matrix: R[s][a] = -1 means action a is not available in state s,
    // 0 means an ordinary move, and 100 means the move reaches the goal state 5.
    private static final int R[][] = new int[][] {{-1, -1, -1, -1, 0, -1},
                                                  {-1, -1, -1, 0, -1, 100},
                                                  {-1, -1, -1, 0, -1, -1},
                                                  {-1, 0, 0, -1, 0, -1},
                                                  {0, -1, -1, 0, -1, 100},
                                                  {-1, 0, -1, -1, 0, 100}};

    // One shared generator instead of creating a new Random on every call.
    private static final Random RANDOM = new Random();

    private static int q[][] = new int[Q_SIZE][Q_SIZE];
    private static int currentState = 0;

    private static void train()
    {
        initialize();

        // Perform training, starting at all initial states.
        for(int j = 0; j < ITERATIONS; j++)
        {
            for(int i = 0; i < Q_SIZE; i++)
            {
                episode(INITIAL_STATES[i]);
            } // i
        } // j

        System.out.println("Q Matrix values:");
        for(int i = 0; i < Q_SIZE; i++)
        {
            for(int j = 0; j < Q_SIZE; j++)
            {
                System.out.print(q[i][j] + ",\t");
            } // j
            System.out.println();
        } // i
        System.out.println();
    }

    private static void test()
    {
        // Perform tests, starting at all initial states.
        // From each state, greedily follow the action with the largest Q value.
        System.out.println("Shortest routes from initial states:");
        for(int i = 0; i < Q_SIZE; i++)
        {
            currentState = INITIAL_STATES[i];
            int newState = 0;
            do
            {
                newState = maximum(currentState, true);
                System.out.print(currentState + ", ");
                currentState = newState;
            }while(currentState < 5);
            System.out.println("5");
        }
    }

    private static void episode(final int initialState)
    {
        currentState = initialState;

        // Travel from state to state until the goal state is reached.
        do
        {
            chooseAnAction();
        }while(currentState != 5);

        // When currentState = 5, run through the set once more for convergence.
        for(int i = 0; i < Q_SIZE; i++)
        {
            chooseAnAction();
        }
    }

    private static void chooseAnAction()
    {
        int possibleAction = 0;

        // Randomly choose a possible action connected to the current state.
        possibleAction = getRandomAction(Q_SIZE);

        if(R[currentState][possibleAction] >= 0){
            // Q(s, a) = R(s, a) + GAMMA * max Q(next state, all actions)
            q[currentState][possibleAction] = reward(possibleAction);
            // Taking action a moves the agent to state a.
            currentState = possibleAction;
        }
    }

    private static int getRandomAction(final int upperBound)
    {
        int action = 0;
        boolean choiceIsValid = false;

        // Randomly choose a possible action connected to the current state.
        while(choiceIsValid == false)
        {
            // Get a random value between 0 (inclusive) and upperBound (exclusive).
            action = RANDOM.nextInt(upperBound);
            if(R[currentState][action] > -1){
                choiceIsValid = true;
            }
        }

        return action;
    }

    private static void initialize()
    {
        for(int i = 0; i < Q_SIZE; i++)
        {
            for(int j = 0; j < Q_SIZE; j++)
            {
                q[i][j] = 0;
            } // j
        } // i
    }

    private static int maximum(final int State, final boolean ReturnIndexOnly)
    {
        // If ReturnIndexOnly = true, the Q matrix index is returned.
        // If ReturnIndexOnly = false, the Q matrix value is returned.
        int winner = 0;
        boolean foundNewWinner = false;
        boolean done = false;

        while(!done)
        {
            foundNewWinner = false;
            for(int i = 0; i < Q_SIZE; i++)
            {
                if(i != winner){             // Avoid self-comparison.
                    if(q[State][i] > q[State][winner]){
                        winner = i;
                        foundNewWinner = true;
                    }
                }
            }

            if(foundNewWinner == false){
                done = true;
            }
        }

        if(ReturnIndexOnly == true){
            return winner;
        }else{
            return q[State][winner];
        }
    }

    private static int reward(final int Action)
    {
        // Immediate reward plus the discounted best Q value of the next state.
        // The (int) cast truncates, because q is an int matrix.
        return (int)(R[currentState][Action] + (GAMMA * maximum(Action, false)));
    }

    public static void main(String[] args)
    {
        train();
        test();
    }
}
```
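To check what the Q matrix above should converge to, here is a hypothetical companion class `QValueIteration` (it is not part of the original listing). It is a minimal sketch that assumes the same R matrix, GAMMA = 0.8, and the listing's convention that taking action a moves the agent to state a. Instead of random episodes, it applies the update Q(s, a) = R(s, a) + GAMMA * max Q(a, a') to every valid pair at once, in double precision, until no value changes by more than a tolerance. In other words, it swaps the random-exploration loop for synchronous value iteration on Q so that the result is deterministic. `greedyRoute` then mirrors the listing's `test()` method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original post: deterministic Q-value iteration
// over the same reward matrix, used to check the values QLearning1 approaches.
public class QValueIteration {
    static final double GAMMA = 0.8;
    static final int GOAL = 5;
    static final int[][] R = {
        {-1, -1, -1, -1, 0, -1},
        {-1, -1, -1, 0, -1, 100},
        {-1, -1, -1, 0, -1, -1},
        {-1, 0, 0, -1, 0, -1},
        {0, -1, -1, 0, -1, 100},
        {-1, 0, -1, -1, 0, 100}};

    // Largest Q value in row `state`; all Q values here are >= 0, so 0 is a safe start.
    private static double rowMax(double[][] q, int state) {
        double best = 0.0;
        for (double v : q[state]) best = Math.max(best, v);
        return best;
    }

    // Repeat Q(s, a) = R(s, a) + GAMMA * max_a' Q(a, a') for all valid (s, a)
    // until the largest change in one sweep is at most `tolerance`.
    public static double[][] solve(double tolerance) {
        int n = R.length;
        double[][] q = new double[n][n];
        double delta;
        do {
            double[][] next = new double[n][n];
            delta = 0.0;
            for (int s = 0; s < n; s++) {
                for (int a = 0; a < n; a++) {
                    if (R[s][a] < 0) continue;                   // -1: action unavailable, Q stays 0
                    next[s][a] = R[s][a] + GAMMA * rowMax(q, a); // action a leads to state a
                    delta = Math.max(delta, Math.abs(next[s][a] - q[s][a]));
                }
            }
            q = next;
        } while (delta > tolerance);
        return q;
    }

    // Follow the arg-max action from `start` until GOAL (ties go to the lower index).
    public static List<Integer> greedyRoute(double[][] q, int start) {
        List<Integer> route = new ArrayList<>();
        int state = start;
        route.add(state);
        while (state != GOAL && route.size() <= q.length) { // size guard stops cycles
            int best = 0;
            for (int a = 1; a < q.length; a++) {
                if (q[state][a] > q[state][best]) best = a;
            }
            state = best;
            route.add(state);
        }
        return route;
    }

    public static void main(String[] args) {
        double[][] q = solve(1e-9);
        for (double[] row : q) {
            StringBuilder line = new StringBuilder();
            for (double v : row) line.append(Math.round(v)).append(",\t");
            System.out.println(line);
        }
        for (int start = 0; start < q.length; start++) {
            System.out.println(greedyRoute(q, start));
        }
    }
}
```

Working the fixed point by hand gives the same numbers. Q(5,5) = 100 + 0.8 * Q(5,5), so Q(5,5) = 500. Then Q(1,5) = Q(4,5) = 100 + 0.8 * 500 = 500, Q(3,1) = Q(3,4) = Q(5,1) = Q(5,4) = Q(0,4) = 0.8 * 500 = 400, Q(1,3) = Q(2,3) = Q(4,3) = Q(4,0) = 0.8 * 400 = 320, and Q(3,2) = 0.8 * 320 = 256. Because QLearning1 stores q as int and truncates with (int), its printed values can settle slightly below these. For example, repeating Q(5,5) = (int)(100 + 0.8 * Q(5,5)) from 0 climbs to 496 and stays there, since 100 + 0.8 * 496 = 496.8 truncates back to 496.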

Time: 2024-10-23 03:40:56
