CS294-112 Deep Reinforcement Learning, Fall Semester (Berkeley): NO.6 Value functions introduction, NO.7 Advanced Q learning

---------------------------------------------------------------------------------------------------------------------------

Understand why correlated samples cause problems, and how parallelism solves them: consecutive transitions from a single trajectory are strongly correlated, which violates the i.i.d. assumption behind stochastic gradient descent; running several actors in parallel interleaves transitions from different trajectories (see the sketch below).
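As a rough illustration, here is a minimal sketch of parallel data collection. The environment step and the policy are made-up stand-ins, not anything from the lecture:

```python
# Minimal sketch: several actor threads push transitions into one shared
# queue, so a training batch mixes different trajectories and is far less
# correlated than consecutive steps from a single actor.
import queue
import random
import threading

transitions = queue.Queue()

def actor(worker_id, n_steps=1000):
    s = worker_id  # hypothetical initial state
    for _ in range(n_steps):
        a = random.choice([0, 1])      # stand-in for an epsilon-greedy policy
        s2, r = s + a, float(a)        # stand-in for env.step(a)
        transitions.put((s, a, r, s2))
        s = s2

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

batch = [transitions.get() for _ in range(32)]  # interleaved across workers
```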

Another solution is a replay buffer, which fully exploits the off-policy nature of Q-learning: transitions collected under older policies are still valid training data, so they can be stored and re-sampled uniformly at random.
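A minimal replay-buffer sketch (the capacity and batch size below are illustrative choices, not from the lecture):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions get evicted

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions from the same trajectory.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s2, done = zip(*batch)
        return s, a, r, s2, done
```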

There is still a problem: Q-learning is not gradient descent, because no gradient flows through the target value.
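In standard notation (this equation is a reconstruction, not from the original notes), the update is:

```latex
% The target y depends on \phi, but the update differentiates only
% Q_\phi(s,a) and treats y as a constant: a "semi-gradient", not the true
% gradient of any fixed objective.
\[
y = r + \gamma \max_{a'} Q_\phi(s', a'), \qquad
\phi \leftarrow \phi - \alpha \bigl(Q_\phi(s,a) - y\bigr)\, \nabla_\phi Q_\phi(s,a)
\]
```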

Divide the Q-function into two parts: a target network, held fixed for stretches of training, and the evolving online network.

This sacrifices speed to get convergence: while the target network is frozen, the regression targets stay fixed, so each phase of training behaves like well-posed supervised learning.
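A minimal PyTorch sketch of the two-network update (the network size, dimensions, and hyperparameters are illustrative assumptions, not the lecture's code):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 4, 2  # hypothetical environment dimensions
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)  # frozen copy used only for targets
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s2, done):
    # The target comes from the frozen network and is computed without
    # gradients: this is exactly the "semi-gradient" point above.
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def sync_target():
    # Called every N gradient steps: copy online weights into the target net.
    target_net.load_state_dict(q_net.state_dict())
```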

Overestimation in natural (vanilla) DQN: the max over noisy Q-estimates is biased upward, because the same network both selects and evaluates the maximizing action; double Q-learning reduces this bias by decoupling selection from evaluation.
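A tiny numpy demonstration of the bias, with made-up numbers: ten actions whose true values are all zero, estimated with unbiased Gaussian noise.

```python
# If Q-estimates are noisy but unbiased, the max over them is biased upward:
# E[max_a Q_hat(s,a)] >= max_a E[Q_hat(s,a)].
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                              # all actions truly worth 0
noisy_q = true_q + rng.normal(0, 1, size=(100_000, 10))
print(noisy_q.max(axis=1).mean())                  # ~1.54 > 0: overestimate
```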

It can also get into trouble in the left-or-right dilemma of avoiding bumping into a tree: both ways around are equally good, yet the greedy policy must commit to one.

Original article: https://www.cnblogs.com/ecoflex/p/9094123.html
