Reinforcement Learning & Deep Reinforcement Learning Notes

Some Basic Concepts

Common Reinforcement Learning Algorithms and Their Classification

Classification Criteria

Policy Gradient

A policy-optimization-based algorithm. In a reinforcement learning task we want to maximize the expected cumulative reward over $T$ steps, i.e.

$$\overline{R}_{\theta} = \sum_{\tau}R(\tau)p_{\theta}(\tau) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)]$$

To maximize this expectation, the algorithm performs gradient-based updates (gradient ascent on $\overline{R}_{\theta}$), so:

$$\begin{align*} \nabla \overline{R}_{\theta} &= \sum_{\tau}R(\tau)\nabla p_{\theta}(\tau)= \sum_{\tau}R(\tau)p_{\theta}(\tau)\frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)}\\ &= \sum_{\tau}R(\tau)p_{\theta}(\tau)\nabla\log p_{\theta}(\tau)\\ &= E_{\tau \sim p_{\theta}(\tau)}[R(\tau)\nabla \log p_{\theta}(\tau)] \approx \frac{1}{N}\sum_{i = 1}^N R(\tau^i)\nabla \log p_{\theta}(\tau^i)\\ &= \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T R(\tau^i)\nabla\log p_{\theta}(a_{t}^i\mid s_{t}^i)\\ \theta \leftarrow\ &\theta + \eta \nabla \overline{R}_{\theta} \end{align*}$$

(The second line uses the identity $\nabla f(x) = f(x)\nabla \log f(x)$.)
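As a concrete illustration, below is a minimal sketch of this Monte Carlo estimator in PyTorch. The policy network, its sizes, and the Gym-style `env` interface are assumptions made for the example; the point is only that each log-probability is weighted by the whole-trajectory return $R(\tau^i)$ and that one gradient step follows.

```python
import torch
from torch.distributions import Categorical

# Hypothetical policy network: state vector (dim 4) -> action logits (2 actions).
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)  # lr plays the role of eta

def reinforce_step(env, n_trajectories=10, horizon=200):
    """One update: theta <- theta + eta * grad R_bar (implemented by minimizing -R_bar)."""
    losses = []
    for _ in range(n_trajectories):                      # i = 1..N
        state, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):                         # t = 1..T
            logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
            dist = Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))      # log p_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(float(reward))
            if terminated or truncated:
                break
        R_tau = sum(rewards)                             # R(tau^i): whole-trajectory return
        losses.append(-R_tau * torch.stack(log_probs).sum())
    loss = torch.stack(losses).mean()                    # = -1/N sum_i R(tau^i) sum_t log p_theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```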
The algorithm above has two problems. First, in some environments every state-action pair may yield a positive reward; with too few samples, the weights of the sampled actions all go up while the unsampled actions are pushed down by normalization, which is clearly unreasonable. Second, giving every state-action pair in a trajectory the same weight $R(\tau)$ is also unreasonable: an action should be judged only by what happens after it is taken. The following modification addresses both issues (a sketch follows below):

$$\nabla \overline{R}_{\theta} \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t = 1}^T \left(\sum_{t^\prime =t}^T \gamma^{t^\prime-t} r^i_{t^\prime} - b\right) \nabla \log p_{\theta}(a_t^i\mid s_{t}^i),\qquad b \approx E[R(\tau)]$$

Define the advantage function $A^\theta(s_t,a_t) = \sum_{t^\prime =t}^T \gamma^{t^\prime-t} r_{t^\prime} - b$.
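A short sketch of the two fixes, under the assumption that the log-probabilities and rewards of one trajectory have already been collected (e.g. as in the previous sketch): the discounted reward-to-go replaces $R(\tau)$, and a baseline $b$ (here simply the mean reward-to-go of the trajectory, a deliberately crude choice for illustration; in practice a learned critic) is subtracted to form the advantage.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed backwards in O(T)."""
    out = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def pg_loss_with_advantage(log_probs, rewards, gamma=0.99):
    """-(1/T) * sum_t A(s_t, a_t) * log p_theta(a_t | s_t) for a single trajectory."""
    rtg = reward_to_go(rewards, gamma)
    b = rtg.mean()                        # crude baseline approximating E[R]
    advantage = rtg - b                   # A^theta(s_t, a_t)
    return -(advantage.detach() * torch.stack(log_probs)).mean()
```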

PPO Algorithm

An off-policy method. Its advantage is that sampled data can be reused to train $\theta$: with the original policy gradient, once $\tau^1$ has been used to update $\theta$, a new trajectory $\tau^2$ must be sampled, which is inefficient. To understand PPO, we first need the concept of importance sampling.
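A minimal numeric sketch of importance sampling (the particular distributions and the function $f$ below are arbitrary choices for illustration): we estimate $E_{x\sim p}[f(x)]$ using only samples drawn from another distribution $q$, by reweighting each sample with the ratio $p(x)/q(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: target p = N(0, 1), proposal q = N(1, 1), f(x) = x^2.
def p_pdf(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)
f = lambda x: x**2

x = rng.normal(loc=1.0, scale=1.0, size=100_000)   # samples from q only
estimate = np.mean(p_pdf(x) / q_pdf(x) * f(x))     # E_{x~p}[f] = E_{x~q}[(p/q) * f]
print(estimate)                                    # close to the true value E_{x~p}[x^2] = 1
```

If $p$ and $q$ differ too much, the ratio $p/q$ has high variance and the estimate degrades; this is exactly why PPO keeps $\theta$ and $\theta^\prime$ close.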

Having introduced importance sampling, we apply it to the policy gradient to turn it into an off-policy method:

$$\begin{align*} \nabla \overline{R}_{\theta} &= E_{\tau \sim p_{\theta^\prime}(\tau)}\left[\frac{p_\theta (\tau)}{p_{\theta^\prime}(\tau)} R(\tau) \nabla \log p_{\theta}(\tau)\right]\\ \Rightarrow\ & E_{(s_t,a_t) \sim \pi_{\theta^\prime}}\left[\frac{p_{\theta}(s_t,a_t)}{p_{\theta^\prime}(s_t,a_t)}A^{\theta^\prime}(s_t,a_t)\nabla \log p_{\theta}(a_t\mid s_t)\right]\\ \Rightarrow\ & E_{(s_t,a_t) \sim \pi_{\theta^\prime}}\left[\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^\prime}(a_t\mid s_t)}\frac{p_{\theta}(s_t)}{p_{\theta^\prime}(s_t)}A^{\theta^\prime}(s_t,a_t)\nabla \log p_{\theta}(a_t\mid s_t)\right],\qquad \frac{p_{\theta}(s_t)}{p_{\theta^\prime}(s_t)} \approx 1 \end{align*}$$

The expression above is a gradient. Using the identity $\nabla f(x) = f(x)\nabla \log f(x)$ in reverse, we can recover the objective that PPO optimizes (i.e. drop the gradient):

$$J^{\theta^\prime}(\theta) = E_{(s_t,a_t)\sim \pi_{\theta^\prime}}\left[\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^\prime}(a_t\mid s_t)}A^{\theta^\prime}(s_t,a_t)\right]$$

$$\Rightarrow\ J^{\theta^\prime}_{PPO}(\theta) = J^{\theta^\prime}(\theta) - \beta\, KL(\theta,\theta^\prime)$$

$$\text{if } KL(\theta,\theta^\prime) > KL_{\max},\ \text{increase } \beta;\qquad \text{if } KL(\theta,\theta^\prime) < KL_{\min},\ \text{decrease } \beta$$

This is the objective PPO optimizes (a sketch follows below). In fact, PPO's predecessor TRPO is very similar:

$$\begin{align*} &\max_{\theta}\ J^{\theta^\prime}_{TRPO}(\theta) = J^{\theta^\prime}(\theta)\\ &\text{s.t. } KL(\theta,\theta^\prime) < \delta \end{align*}$$

Note that the KL divergence here is measured between the action distributions induced by $\theta$ and $\theta^\prime$, not between the parameter vectors themselves.
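Below is a sketch of the adaptive-KL objective as a PyTorch loss. The sample-based KL estimate, the batch tensors, and the scaling factors for $\beta$ are all assumptions made for the example.

```python
import torch

def ppo_kl_penalty_loss(new_log_probs, old_log_probs, advantages, beta):
    """Return (loss, kl): loss is the negative of J^theta'(theta) - beta*KL
    for a batch of (s_t, a_t) sampled from pi_theta'."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # p_theta(a_t|s_t) / p_theta'(a_t|s_t)
    surrogate = (ratio * advantages).mean()            # J^theta'(theta)
    kl = (old_log_probs - new_log_probs).mean()        # crude sample-based KL estimate
    return -(surrogate - beta * kl), kl

def adapt_beta(beta, kl, kl_max, kl_min):
    """Adaptive penalty weight: strengthen it when KL is too large, relax it when too small."""
    if kl > kl_max:
        beta *= 2.0     # illustrative factor
    elif kl < kl_min:
        beta *= 0.5     # illustrative factor
    return beta
```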
The PPO-clip algorithm's core idea is to use the clip function to keep the two distributions from drifting too far apart:

$$J^{\theta^k}_{PPO\text{-}clip}(\theta)\approx \sum_{(s_t,a_t)} \min \left(\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)}A^{\theta^k}(s_t,a_t),\ \mathrm{clip}\left(\frac{p_{\theta}(a_t\mid s_t)}{p_{\theta^k}(a_t\mid s_t)},\,1-\epsilon,\,1+\epsilon \right)A^{\theta^k}(s_t,a_t)\right)$$

[Figure: clip-algo — illustration of the clipped objective]
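A sketch of the clipped surrogate as a PyTorch loss (tensor shapes and the value of $\epsilon$ are assumptions): the ratio is clipped to $[1-\epsilon, 1+\epsilon]$, and the elementwise min with the unclipped term makes the surrogate pessimistic, so the update gains nothing by pushing the ratio outside the clip range.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Negative of J^{theta^k}_{PPO-clip}(theta), averaged over a batch of (s_t, a_t)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # p_theta / p_theta^k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize J => minimize -J
```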

Actor-Critic