Mind Map
Exploration and Exploitation
Exploitation: act directly according to the current model
Exploration: try new actions in order to update the current model
$\epsilon$-greedy: explore with probability $\epsilon$ and exploit with probability $1-\epsilon$
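A minimal sketch of $\epsilon$-greedy action selection, assuming the action-value estimates are kept in a plain Python list (the function name and interface are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```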
Key Concepts
Value function: the expected reward accumulated over h steps when starting from state s and following policy $\pi$
$$V^{\pi}(s)=E_{\pi}\left[\sum_{i=0}^{h} r_i \,\middle|\, s_0=s\right]$$
With a discount factor $\gamma$ and an infinite horizon, the value function becomes
$$V^{\pi}(s)=E_{\pi}\left[\sum_{i=0}^{\infty}\gamma^i r_i \,\middle|\, s_0=s\right]$$
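As a sketch of what this expectation means in practice, the discounted return of one sampled trajectory can be accumulated backwards; averaging it over many rollouts that start in s and follow $\pi$ gives a Monte Carlo estimate of $V^{\pi}(s)$ (the helper name and inputs below are assumptions):

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_i gamma**i * rewards[i] for one sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last reward backwards
        g = r + gamma * g
    return g
```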
Bellman equation
$$
\begin{aligned}
V^{\pi}(s) &= E_{\pi}\left[\sum_{i=0}^{h} \gamma^i r_i \,\middle|\, s_0=s\right]\\
&= E_{\pi}\left[ r_0 + \gamma \sum_{i=0}^{h-1} \gamma^{i} r_{i+1} \,\middle|\, s_0 = s\right] \\
&= \pi(s) \sum_{s' \in S} p(s,s')\left( r_0 + \gamma\, E_{\pi}\left[\sum_{i=0}^{h-1} \gamma^{i} r_{i+1} \,\middle|\, s_1 = s' \right]\right) \\
&= \pi(s) \sum_{s' \in S} p(s, s')\left[ r_0 + \gamma V^{\pi}(s')\right]
\end{aligned}
$$
That is, $V^{\pi}(s)= \pi(s) \sum_{s' \in S} p(s, s')\left[ r_0 + \gamma V^{\pi}(s')\right]$
Under a given policy (the action choice folded into $p(s, s')$), this can be written as
$$V(s)= r+ \gamma \sum_{s' \in S} p(s, s')V(s')$$
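Because this is a fixed-point equation, it can be solved by repeatedly substituting the current estimate of V into the right-hand side (iterative policy evaluation). A minimal sketch, assuming the policy-induced transition matrix P and reward vector R are given as nested lists:

```python
def policy_evaluation(P, R, gamma=0.99, tol=1e-6):
    """Iterate V(s) <- R[s] + gamma * sum_{s'} P[s][s'] * V(s')
    until no value changes by more than tol.

    P[s][s2] : transition probability from s to s2 under the fixed policy
    R[s]     : expected immediate reward in state s
    """
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < tol:
            return V_new
        V = V_new
```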
Action-value function: the value obtained when taking action a in state s (and following $\pi$ thereafter)
$$
\begin{aligned}
Q(s,a) &= \sum_{s' \in S} p(s, s')\left[ r + \gamma V^{\pi}(s')\right] \\
&= r_s^a + \gamma \sum_{s' \in S} p(s, s') \sum_{a' \in A} \pi(a' \mid s')\, Q(s',a')
\end{aligned}
$$
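Given the state values, the action values follow directly from the first line above. A sketch assuming access to the per-action transition model and rewards $r_s^a$ (the array layout is an assumption):

```python
def q_from_v(P, R, V, gamma=0.99):
    """Q(s, a) = R[s][a] + gamma * sum_{s'} P[s][a][s'] * V[s'].

    P[s][a][s2] : probability of reaching s2 from s with action a
    R[s][a]     : expected immediate reward r_s^a
    V[s]        : state value of s under the policy
    """
    n_states, n_actions = len(P), len(P[0])
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)]
```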