Adaptive Heuristic Critic (Reinforcement Learning)

https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node24.html

【Old knowledge vs. new knowledge: reinforcement learning as a synthesis of the new with the old】

The adaptive heuristic critic algorithm is an adaptive version of policy iteration [9] in which the value-function computation is no longer implemented by solving a set of linear equations, but is instead computed by an algorithm called TD(0). A block diagram for this approach is given in Figure 4. It consists of two components: a critic (labeled AHC), and a reinforcement-learning component (labeled RL). The reinforcement-learning component can be an instance of any of the k-armed bandit algorithms, modified to deal with multiple states and non-stationary rewards. But instead of acting to maximize instantaneous reward, it will be acting to maximize the heuristic value, v, that is computed by the critic. The critic uses the real external reinforcement signal to learn to map states to their expected discounted values given that the policy being executed is the one currently instantiated in the RL component.

Figure 4: Architecture for the adaptive heuristic critic.

We can see the analogy with modified policy iteration if we imagine these components working in alternation. The policy π implemented by RL is fixed and the critic learns the value function Vπ for that policy. Now we fix the critic and let the RL component learn a new policy π' that maximizes the new value function, and so on. In most implementations, however, both components operate simultaneously. Only the alternating implementation can be guaranteed to converge to the optimal policy, under appropriate conditions. Williams and Baird explored the convergence properties of a class of AHC-related algorithms they call "incremental variants of policy iteration" [133].
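To make the two-component picture and the alternating scheme concrete, here is a minimal Python sketch (an illustration under assumptions, not code from the survey): a tabular critic that learns V by TD(0) for whatever policy the actor currently executes, and a bandit-style actor whose action preferences are reinforced by the critic's TD error rather than by the raw external reward. The class names, the preference-based actor, the exploration rate, and the constants are all assumed for the example.

```python
import random
from collections import defaultdict

GAMMA = 0.95   # discount factor (assumed for the example)
ALPHA = 0.10   # critic learning rate (assumed)
BETA  = 0.10   # actor learning rate (assumed)

class Critic:
    """AHC component: learns V(s) for the policy the actor currently executes."""
    def __init__(self):
        self.V = defaultdict(float)

    def evaluate(self, s, r, s_next):
        # TD(0) error: how much better or worse the transition was than expected.
        delta = r + GAMMA * self.V[s_next] - self.V[s]
        self.V[s] += ALPHA * delta
        return delta                     # handed to the actor as the heuristic signal

class Actor:
    """RL component: a bandit-style learner per state, acting to maximize the
    critic's heuristic value rather than the instantaneous external reward."""
    def __init__(self, actions):
        self.actions = actions
        self.pref = defaultdict(float)   # action preferences p(s, a)

    def act(self, s):
        if random.random() < 0.1:        # occasional exploration
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.pref[(s, a)])

    def reinforce(self, s, a, delta):
        # Strengthen actions that led to a higher heuristic value than expected.
        self.pref[(s, a)] += BETA * delta

# Typical simultaneous operation; the strictly alternating scheme would instead
# freeze one component while the other learns:
#   delta = critic.evaluate(s, r, s_next)
#   actor.reinforce(s, a, delta)
```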

【Representing experience: initial state, learning rate and action filtering, instantaneous reward, resulting state】

It remains to explain how the critic can learn the value of a policy. We define ⟨s, a, r, s'⟩ to be an experience tuple summarizing a single transition in the environment. Here s is the agent's state before the transition, a is its choice of action, r the instantaneous reward it receives, and s' its resulting state. The value of a policy is learned using Sutton's TD(0) algorithm [115], which uses the update rule

    V(s) := V(s) + α ( r + γ V(s') - V(s) ) .
Whenever a state s is visited, its estimated value is updated to be closer to r + γV(s'), since r is the instantaneous reward received and V(s') is the estimated value of the actually occurring next state. This is analogous to the sample-backup rule from value iteration; the only difference is that the sample is drawn from the real world rather than by simulating a known model. The key idea is that r + γV(s') is a sample of the value of V(s), and it is more likely to be correct because it incorporates the real r. If the learning rate α is adjusted properly (it must be slowly decreased) and the policy is held fixed, TD(0) is guaranteed to converge to the value function of that policy.
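As a small illustration of this rule (a sketch, not code from the survey; the dictionary representation, the default value of 0 for unseen states, and the constants are assumptions), a tabular TD(0) backup can be written as:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) backup: pull V(s) toward the sampled target r + gamma * V(s').

    V maps states to estimated values (unseen states default to 0); alpha is the
    learning rate, which should be slowly decreased for convergence.
    """
    target = r + gamma * V.get(s_next, 0.0)        # sample of the value of s
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

# A single experience tuple <s, a, r, s'> = ('A', a, 1.0, 'B'):
V = {}
td0_update(V, 'A', 1.0, 'B')    # V['A'] moves from 0.0 toward 1.0 (here to 0.1)
```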

The TD(0) rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0. TD(0) looks only one step ahead when adjusting value estimates; although it will eventually arrive at the correct answer, it can take quite a while to do so. The general TD(λ) rule is similar to the TD(0) rule given above,

    V(u) := V(u) + α ( r + γ V(s') - V(s) ) e(u) ,
but it is applied to every state according to its eligibility e(u), rather than just to the immediately previous state, s. One version of the eligibility trace is defined to be

    e(s) = Σ_{k=1}^{t} (λγ)^{t-k} δ_{s,s_k} ,   where δ_{s,s_k} = 1 if s = s_k and 0 otherwise.
The eligibility of a state s is the degree to which it has been visited in the recent past; when a reinforcement is received, it is used to update all the states that have been recently visited, according to their eligibility. When λ = 0 this is equivalent to TD(0). When λ = 1, it is roughly equivalent to updating all the states according to the number of times they were visited by the end of a run. Note that we can update the eligibility online as follows:

    e(s) := γλ e(s) + 1   if s is the current state,
    e(s) := γλ e(s)       otherwise.
【Eligibility quantifies how much a state has been visited in the recent past】
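Putting the two rules together, one online TD(λ) step can be sketched as below (an illustration under assumptions: tabular dictionaries, the accumulating trace defined above, and arbitrary constants): compute the same TD error as TD(0), refresh every eligibility, then apply the error to every state in proportion to its eligibility.

```python
from collections import defaultdict

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.95, lam=0.8):
    """One online TD(lambda) backup over a tabular V with eligibility traces e.

    With lam = 0 every trace is wiped out each step and this reduces to TD(0).
    """
    delta = r + gamma * V[s_next] - V[s]   # same TD error as in TD(0)
    # Online eligibility update: decay every trace by gamma*lambda and add 1
    # to the state just visited.
    for u in list(e):
        e[u] *= gamma * lam
    e[s] += 1.0
    # Apply the TD error to every eligible state, not just the previous one.
    for u in list(e):
        V[u] += alpha * delta * e[u]
    return V, e

V, e = defaultdict(float), defaultdict(float)
td_lambda_step(V, e, 'A', 0.0, 'B')
td_lambda_step(V, e, 'B', 1.0, 'C')   # 'A' also receives a gamma*lambda share of the credit
```

Larger values of λ spread each reinforcement further back along the recently visited states, which is what buys the faster convergence mentioned below, at the cost of touching every eligible state on each step.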

It is computationally more expensive to execute the general TD(λ), though it often converges considerably faster for large λ [30, 32]. There has been some recent work on making the updates more efficient [24] and on changing the definition to make TD(λ) more consistent with the certainty-equivalent method [108], which is discussed in Section 5.1.



   

Leslie Pack Kaelbling 
Wed May 1 13:19:13 EDT 1996
