(Repost) Some RL literature (and notes)

Copied from: https://zhuanlan.zhihu.com/p/25770890

Introductions

Introduction to reinforcement learning
Index of /rowan/files/rl

ICML Tutorials:
http://icml.cc/2016/tutorials/deep_rl_tutorial.pdf

NIPS Tutorials:
CS 294 Deep Reinforcement Learning, Spring 2017
https://drive.google.com/file/d/0B_wzP_JlVFcKS2dDWUZqTTZGalU/view

Deep Q-Learning


DQN:
[1312.5602] Playing Atari with Deep Reinforcement Learning (and its Nature version)

Double DQN
[1509.06461] Deep Reinforcement Learning with Double Q-learning

Bootstrapped DQN
[1602.04621] Deep Exploration via Bootstrapped DQN

Prioritized Experience Replay
http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/prioritized-replay.pdf

Dueling DQN
[1511.06581] Dueling Network Architectures for Deep Reinforcement Learning

Classic Literature

Sutton & Barto's book (Reinforcement Learning: An Introduction)
http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf

David Silver's thesis
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/thesis.pdf

Policy Gradient Methods for Reinforcement Learning with Function Approximation
https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf
(Policy gradient theorem)

1. A policy-based approach can be preferable to a value-based one: the policy function changes smoothly with its parameters, whereas the policy obtained by greedily picking actions from a value function can change discontinuously.

2. Policy Gradient method.
The objective function is averaged over the stationary state distribution (starting from s0).
For the average-reward formulation, the distribution needs to be truly stationary.
For the start-state (discounted) formulation, if all experience starts from s0, the objective is averaged over a discounted state-visitation distribution (not necessarily fully stationary); if we start from arbitrary states, it is averaged over the (discounted) stationary distribution.
Policy gradient theorem: the gradient operator can be passed through the state distribution, even though that distribution depends on the policy parameters (and at first glance would seem to need differentiating as well); see the sketch after this list.

3. Q^\pi(s, a) can be replaced with an approximator f_w(s, a) without biasing the gradient, provided the approximator satisfies the compatibility condition \partial f_w / \partial w = (\partial \pi / \partial \theta) / \pi = \nabla_\theta \log \pi(a|s).
If \pi(a|s) is log-linear in some features, then f has to be linear in those features with \sum_a \pi(a|s) f(s, a) = 0, so f is effectively an advantage function.

4. This was the first time an RL algorithm was shown to converge to a local optimum with a relatively free-form function approximator.
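
For reference, a LaTeX sketch of the two results summarized above (notation as in the paper; d^\pi is the discounted/stationary state distribution under \pi):

```latex
% Policy gradient theorem (Sutton et al., 1999)
\nabla_\theta J(\theta)
  = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s, a)

% Compatible function approximation: replacing Q^\pi with f_w keeps the
% gradient exact when
\nabla_w f_w(s, a)
  = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \nabla_\theta \log \pi_\theta(a \mid s)
```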

DAgger
https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats10-paper.pdf

Actor-Critic Models

Asynchronous Advantage Actor-Critic Model
[1602.01783] Asynchronous Methods for Deep Reinforcement Learning

Tensorpack's BatchA3C (ppwwyyxx/tensorpack) and GA3C ([1611.06256] Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU)
Instead of using a separate copy of the model for each actor (in separate CPU threads), they batch all the states generated by the actors and feed them through a single model, which is updated regularly via optimization (see the sketch below).
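A minimal sketch of the batching idea (the class and names are illustrative assumptions, not the actual Tensorpack or GA3C code): actor threads only submit states and block for their actions, while one server thread runs the shared model on batches.

```python
import queue
import threading

import numpy as np


class BatchedPredictor:
    """Serve many actor threads with one shared model by batching their states."""

    def __init__(self, model, batch_size=32):
        self.model = model              # shared policy network: batch of states -> actions
        self.batch_size = batch_size
        self.requests = queue.Queue()   # (state, reply_queue) pairs from actor threads
        threading.Thread(target=self._serve, daemon=True).start()

    def predict(self, state):
        """Called by an actor thread; blocks until the batched forward pass returns."""
        reply = queue.Queue(maxsize=1)
        self.requests.put((state, reply))
        return reply.get()

    def _serve(self):
        while True:
            batch = [self.requests.get()]                 # wait for at least one request
            while len(batch) < self.batch_size and not self.requests.empty():
                batch.append(self.requests.get())         # greedily drain the queue
            states = np.stack([s for s, _ in batch])
            actions = self.model(states)                  # one forward pass for all actors
            for (_, reply), action in zip(batch, actions):
                reply.put(action)
```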

On actor-critic algorithms.
http://www.mit.edu/~jnt/Papers/J094-03-kon-actors.pdf
Only read the first part of the paper. It proves that actor-critic converges to a local optimum when the feature space used to linearly represent Q(s, a) covers the space spanned by \nabla_\theta \log \pi(a|s) (the compatibility condition) and the actor learns on a slower timescale than the critic.

Natural Actor-Critic
https://dev.spline.de/trac/dbsprojekt_51_ss09/export/74/ki_seminar/referenzen/peters-ECML2005.pdf
Natural gradient applied to the actor-critic method. When the compatibility condition proposed in the policy gradient paper is satisfied (i.e., Q(s, a) is approximated as a linear function of \nabla_\theta \log \pi(a|s), so that the gradient estimate using this approximate Q equals the true gradient computed with the unknown exact Q function of the current policy), the natural gradient of the policy's parameters is simply the linear coefficient vector of that Q approximation.
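In symbols (a sketch; G(\theta) is the Fisher information matrix of the policy and w the compatible critic's weights):

```latex
% Compatible critic: Q_w(s, a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s)
% Natural policy gradient:
\tilde{\nabla}_\theta J(\theta)
  = G(\theta)^{-1} \nabla_\theta J(\theta)
  = w
```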

A Survey of Actor-Critic Reinforcement Learning Standard and Natural Policy Gradients
https://hal.archives-ouvertes.fr/hal-00756747/document
Covers the above two papers.

Continuous State/Action

Reinforcement Learning with Deep Energy-Based Policies
Uses the soft-Q formulation proposed by https://arxiv.org/pdf/1702.08892.pdf (in the math section) and naturally incorporates an entropy term into the Q-learning paradigm. For continuous action spaces, both training (the soft Bellman update) and sampling from the resulting policy (expressed in terms of Q) are intractable. For the former, they propose a surrogate action distribution and compute the gradient with importance sampling. For the latter, they use the Stein variational method to match a deterministic sampler a = f(e, s) (e is noise) to the learned Q-induced distribution. Performance is comparable to DDPG, but since the learned Q can be diverse (multimodal) under the maximum-entropy principle, it can serve as a common initialization for many specific tasks (example: pretrain = learn to run in arbitrary directions, task = run in a maze).
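The maximum-entropy (soft) quantities involved, sketched in LaTeX (\alpha is the entropy temperature):

```latex
% Soft value function and energy-based policy
V_{\mathrm{soft}}(s) = \alpha \log \int_{\mathcal{A}} \exp\!\big(Q_{\mathrm{soft}}(s, a)/\alpha\big)\, da,
\qquad
\pi(a \mid s) \propto \exp\!\big(Q_{\mathrm{soft}}(s, a)/\alpha\big)

% Soft Bellman backup
Q_{\mathrm{soft}}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\big[ V_{\mathrm{soft}}(s') \big]
```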

Deterministic Policy Gradient Algorithms
http://jmlr.org/proceedings/papers/v32/silver14.pdf
Silver's paper. Learns an actor that predicts a deterministic action (rather than a conditional probability distribution \pi(a|s)) for Q-learning; when trained with Q-learning, the gradient is propagated through Q into \pi. Analogous to the Policy Gradient Theorem (the gradient operator can be passed through the state distribution, which depends on the parameters), there is a deterministic version of the theorem (sketched below). There is also an interesting comparison with a stochastic off-policy actor-critic model (stochastic = \pi(a|s)).
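The deterministic policy gradient theorem, sketched (\mu_\theta is the deterministic actor, \rho^\mu the discounted state distribution under \mu):

```latex
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\Big[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}
    \Big]
```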

Continuous control with deep reinforcement learning (DDPG)
Deep version of DPG (with the DQN tricks). A neural network trained on minibatches alone is not stable, so they also add a target network and a replay buffer (see the sketch below).
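A minimal sketch of the target-network trick, assuming parameters are held as a list of numpy arrays or tensors (names are illustrative, not the paper's code):

```python
def soft_update(target_params, online_params, tau=0.001):
    """Slowly track the online network: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]

# After each optimization step on the online networks (tau = 0.001 in the paper):
#   critic_target_params = soft_update(critic_target_params, critic_params)
#   actor_target_params  = soft_update(actor_target_params, actor_params)
```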

Reward Shaping

Policy invariance under reward transformations: theory and application to reward shaping.
http://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf
Andrew Ng's reward shaping paper. It proves that under reward shaping the optimal policy is invariant if and only if the shaping reward added is a difference of a potential function (sketched below).
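The potential-based shaping form, in LaTeX (\Phi is any real-valued potential over states):

```latex
F(s, a, s') = \gamma\, \Phi(s') - \Phi(s),
\qquad
R'(s, a, s') = R(s, a, s') + F(s, a, s')
```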

Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping lets a single agent reach the optimal solution without changing the optimal policy (or, in games, the Nash equilibrium). This paper extends the result to the multi-agent case.

Reinforcement Learning with Unsupervised Auxiliary Tasks
[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks
ICLR17 oral. Adds auxiliary tasks to improve performance on Atari games and navigation. The auxiliary tasks include maximizing pixel changes (pixel control) and maximizing the activation of individual neurons (feature control).

Navigation

Learning to Navigate in Complex Environments
https://openreview.net/forum?id=SJMGPrcle?eId=SJMGPrcle
Raia's group at DeepMind. ICLR17 poster; adds depth prediction as an auxiliary task and improves navigation performance (also uses SLAM results as network input).

[1611.05397] Reinforcement Learning with Unsupervised Auxiliary Tasks (in reward shaping)

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments
Goal: navigation without SLAM.
Learns successor features (like Q/V taken before the last layer; these features satisfy a similar Bellman equation, sketched below) for transfer learning: learn k sets of top-layer weights simultaneously while sharing the successor features, using DQN acting on those features. In addition to the successor features, they also try to reconstruct the input frame.
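A sketch of the successor-feature decomposition assumed here (\phi(s, a) is a learned feature, w a task-specific reward weight vector):

```latex
% Reward is assumed (approximately) linear in the features:
r(s, a) \approx \phi(s, a)^\top w

% Successor features obey a Bellman equation of their own:
\psi^{\pi}(s, a) = \phi(s, a) + \gamma\, \mathbb{E}\big[ \psi^{\pi}(s_{t+1}, \pi(s_{t+1})) \big]

% so the Q-function factorizes as
Q^{\pi}(s, a) = \psi^{\pi}(s, a)^\top w
```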

Experiments are in simulation.
state: 96x96 x (four most recent frames).
action: four discrete actions (stay still, turn left, turn right, go straight 1 m).
baseline: a CNN trained to directly predict the action of A*.

Deep Recurrent Q-Learning for Partially Observable MDPs
There is not much performance difference between stacked-frame DQN and DRQN. DRQN may be more robust when the game screen flickers (some frames are zeroed out).

Counterfactual Regret Minimization

Dynamic Thresholding
http://www.cs.cmu.edu/~sandholm/dynamicThresholding.aaai17.pdf
With proofs:
http://www.cs.cmu.edu/~ckroer/papers/pruning_agt_at_ijcai16.pdf

Studies game-state abstraction and its effect on Leduc poker.
https://webdocs.cs.ualberta.ca/~bowling/papers/09aamas-abstraction.pdf

https://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf
https://arxiv.org/pdf/1603.01121v2.pdf
http://anytime.cs.umass.edu/aimath06/proceedings/P47.pdf

Decomposition:
Solving Imperfect Information Games Using Decomposition
http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8407/8476

Safe and Nested Endgame Solving for Imperfect-Information Games
https://www.cs.cmu.edu/~noamb/papers/17-AAAI-Refinement.pdf

Game-specific RL

Atari Game
http://www.readcube.com/articles/10.1038/nature14236

Go
AlphaGo https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf

DarkForest [1511.06410] Better Computer Go Player with Neural Network and Long-term Prediction

Super Smash Bros
https://arxiv.org/pdf/1702.06230.pdf

Doom
Arnold: [1609.05521] Playing FPS Games with Deep Reinforcement Learning
Intel: [1611.01779] Learning to Act by Predicting the Future
F1: https://openreview.net/forum?id=Hk3mPK5gg?eId=Hk3mPK5gg

Poker
Limit Texas hold'em
http://ai.cs.unibas.ch/_files/teaching/fs15/ki/material/ki02-poker.pdf

No-limit Texas hold'em
DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker
