Reinforcement Learning - TD

RL_4TD

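Model-Free Incremental RL Algorithms (TD is a broad family of algorithms)

In the Model-Free setting, TD uses Experience to solve the Bellman Expectation Equation (BEE)

The idea behind TD is therefore very close to Stochastic Gradient Descent (SGD); it is a kind of Stochastic Approximation Algorithm

Newton's Method is another algorithm with a similar flavor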

1. TD(0)

TD Learning of State Value: the most basic TD algorithm for Policy Evaluation

It can only estimate the State Value of a Given Policy
It cannot estimate Action Values
It cannot search for an Optimal Policy

1.1 Problem Formulation

Given
1. A policy $\pi$
2. Experience samples generated by following $\pi$, in the form
  $\{(s_{t}, r_{t + 1}, s_{t + 1})\}_{t} = \{s_{0}, r_{1}, s_{1}, \dots, s_{t}, r_{t + 1}, s_{t + 1}, \dots\}$
Task

Estimate the state values $\{v_{\pi}(s)\}_{s\in\mathcal{S}}$ of the given policy $\pi$, i.e. solve
$v_{\pi} (s_{t}) = E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}], \quad \forall s_{t}$

1.2 Algorithm

The core mechanism is similar to Stochastic Gradient Descent

$\begin{aligned} v_{t + 1} (s_{t}) & = v_{t} (s_{t}) - \alpha_{t} (s_{t}) \{ v_{t} (s_{t}) - [r_{t + 1} + \gamma v_{t} (s_{t + 1})] \} \\ v_{t + 1} (s) & = v_{t} (s), \quad \forall s \neq s_{t} \end{aligned}$

where

$s_t\longrightarrow$ the state visited at time $t$
$v_t(s_t) \longrightarrow$ the estimate of $v_{\pi}(s_t)$ at time $t$
$\alpha_t(s_t)\longrightarrow$ the Learning Rate for $s_t$ at time $t$ (a small positive number)
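The two update equations above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `td0_update`, the dictionary-based value table, and the example numbers are assumptions made here, not part of the original notes.

```python
def td0_update(v, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to the table v in place and return the TD error."""
    td_target = r + gamma * v[s_next]   # r_{t+1} + gamma * v_t(s_{t+1})
    td_error = v[s] - td_target         # v_t(s_t) - TD target
    v[s] = v[s] - alpha * td_error      # first equation: only v(s_t) changes
    return td_error                     # second equation: other entries are untouched


# Hypothetical two-state example
v = {"A": 0.0, "B": 2.0}
delta = td0_update(v, s="A", r=1.0, s_next="B", alpha=0.5, gamma=0.5)
print(v, delta)
```

Only the entry of the visited state is written to, which is exactly what the second equation says.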

1.2.1 About the First Formula

$\underbrace{v_{t + 1} (s_{t})}_{\text{New Estimation}} = \underbrace{v_{t} (s_{t})}_{\text{Current Estimation}} - \alpha_{t} (s_{t}) \overbrace{\{ v_{t} (s_{t}) - \underbrace{[r_{t + 1} + \gamma v_{t} (s_{t + 1})]}_{\text{TD Target } \bar{v}_{t}} \}}^{\text{TD Error } \delta_{t}}$

TD Target

$\bar{v}_{t} = r_{t + 1} + \gamma v_{t} (s_{t + 1})$

where
- $r_{t+1}\longrightarrow$ the Reward obtained from the Transition $s_t\to s_{t+1}$
- $\gamma\longrightarrow$ Discount Rate
- $v_t(s_{t+1})\longrightarrow$ the estimated State Value of $s_{t+1}$ at time $t$
  $v_t(s_{t+1})$ and $v_{t+1}(s_t)$ mean different things!!!
  - The former is the State Value Estimation made at time $t$ for the state $s_{t+1}$!
  - The latter is the State Value Estimation made at time $t+1$ for the State $s_t$ accessed at time $t$!
$\bar{v}_t$ can be seen as a one-step sample estimate of $v_{\pi}(s_t)$

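Comparing it with the Bellman Expectation Equation helps to understand what it means
- Definition of State Value
  $v_{\pi} (s_{t}) = E [G_{t} | S_{t} = s_{t}]$
  $v_{\pi}(s_t)$ is the Expectation of the Discounted Return $G_t$
- Recursive form of State Value
  $v_{\pi} (s_{t}) = E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}]$
  This is also a form of the BEE! (Just not the Matrix-Vector Form.) (The name is my own, which does not matter.)
- TD Target
  $\bar{v}_{t} = r_{t + 1} + \gamma v_{t} (s_{t + 1})$
  The Discounted Return expands to
  $G_{t} = R_{t + 1} + \gamma G_{t + 1}, \quad E [G_{t + 1} | S_{t + 1}] = v_{\pi} (S_{t + 1})$
  The structural similarity shows that the TD Target is like one sample of the Discounted Return, with the unknown $v_{\pi}$ replaced by the current estimate $v_t$
  
  So it is entirely reasonable for the TD Target to be the thing that $v_t(s_t)$ is compared against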
TD Error

$\delta_{t} = v_{t} (s_{t}) - [r_{t + 1} + \gamma v_{t} (s_{t + 1})]$
- Intuition
  
  The way TD(0) updates the State Value Approximation closely resembles Newton's Method
  
  Read together with the meaning of the TD Target discussed earlier, the following rewritten form of the TD Error makes a lot of sense
  $\delta_{t} = v_{t} (s_{t}) - \bar{v}_{t}$
  It measures the gap between $v_t(s_t)$ and $\bar{v}_t$. Subtracting $\bar{v}_t$ from both sides of the update gives
  $\begin{aligned} v_{t + 1} (s_{t}) & = v_{t} (s_{t}) - \alpha_{t} (s_{t}) [v_{t} (s_{t}) - \bar{v}_{t}] \\ v_{t + 1} (s_{t}) - \bar{v}_{t} & = [v_{t} (s_{t}) - \bar{v}_{t}] - \alpha_{t} (s_{t}) [v_{t} (s_{t}) - \bar{v}_{t}] \\ v_{t + 1} (s_{t}) - \bar{v}_{t} & = [1 - \alpha_{t} (s_{t})] [v_{t} (s_{t}) - \bar{v}_{t}] \end{aligned}$
  $\alpha_t(s_t)$ is a small positive number (less than 1), so
  $0 < 1 - \alpha_{t} (s_{t}) < 1$
  Therefore
  $| v_{t + 1} (s_{t}) - \bar{v}_{t} | = [1 - \alpha_{t} (s_{t})] | v_{t} (s_{t}) - \bar{v}_{t} |$
  and
  $| v_{t + 1} (s_{t}) - \bar{v}_{t} | \leq | v_{t} (s_{t}) - \bar{v}_{t} |$
  Every Time Step update moves $v_t(s_t)$ closer to $\bar{v}_t$!
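A quick numeric check of this contraction, as a sketch only: the values chosen for the current estimate, the TD Target and the Learning Rate are arbitrary assumptions.

```python
import math

v_current = 0.5   # v_t(s_t), assumed value
td_target = 1.0   # TD target, assumed value
alpha = 0.1       # learning rate in (0, 1), assumed value

# One TD(0) update of the visited state
v_new = v_current - alpha * (v_current - td_target)

gap_before = abs(v_current - td_target)
gap_after = abs(v_new - td_target)

# The gap to the TD target shrinks by exactly the factor (1 - alpha)
print(gap_before, gap_after, math.isclose(gap_after, (1 - alpha) * gap_before))
```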
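- The TD Error reflects the gap between $v_t$ and $v_{\pi}$
  
  If $v_t = v_{\pi}$, then
  $\begin{aligned} \delta_{\pi, t} & = v_{\pi} (s_{t}) - [r_{t + 1} + \gamma v_{\pi} (s_{t + 1})] \\ E [\delta_{\pi, t} | S_{t} = s_{t}] & = v_{\pi} (s_{t}) - E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}] \end{aligned}$
  Because of the recursive form mentioned under TD Target,
  $v_{\pi} (s_{t}) = E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}]$
  we have
  $E [\delta_{\pi, t} | S_{t} = s_{t}] = v_{\pi} (s_{t}) - E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}] = 0$
  So the expected TD Error is $0$ when $v_t = v_{\pi}$; a nonzero expected TD Error means $v_t \neq v_{\pi}$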

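1.2.2 About the Second Formula

$v_{t + 1} (s) = v_{t} (s), \quad \forall s \neq s_{t}$

For every State that is not accessed at the current Time Step, the corresponding State Value estimate stays unchanged

THANK GOD! This one is so much easier to explain than that annoying formula above...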

1.3 Algorithm Derivation

In essence, this algorithm uses the Robbins-Monro Algorithm from Special Topics to solve an odd-looking expression of the BEE

The recursive State Value form of the BEE

i.e. the one mentioned in 1.2.1
$v_{\pi} (s_{t}) = E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}]$
RM Problem Formulation
$\begin{aligned} g (v_{\pi} (s_{t})) & = v_{\pi} (s_{t}) - E [R_{t + 1} + \gamma v_{\pi} (S_{t + 1}) | S_{t} = s_{t}] \\ & = 0 \end{aligned}$
RM Algorithm

$\begin{aligned} v_{k + 1} (s_{k}) & = v_{k} (s_{k}) - \alpha_{k} \tilde{g} (v_{k} (s_{k})) \\ & = v_{k} (s_{k}) - \alpha_{k} \{ v_{k} (s_{k}) - [r_{k} + \gamma v_{\pi} (s_{k}^{'})] \} \end{aligned}$

where $s_k$ is the state sampled at iteration $k$ and $s_k'$ is the next state transitioned to after an action

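Make the following two modifications to fit the RL setting
- $\{(s_k, r_k, s_k')\} \longrightarrow \{(s_t, r_{t+1}, s_{t+1})\}$
  - The former is the set of samples used by the RM Algorithm. It has to be sampled repeatedly during the computation, and the samples are drawn independently, under the independent & identically distributed assumption
  - But in RL you are clearly in no mood to sample one set at a time. TD relies on Experience Samples (see 1.1) (a "Trajectory"), and only the latter form can represent a Trajectory
- $v_{\pi}(s_k') \longrightarrow v_{\pi}(s_{t+1}) \longrightarrow v_k(s_{t+1})$
  
  $v_{\pi}$ is exactly the unknown being solved for, so it is replaced by $v_k$, i.e. its Estimation
  
  Even with this replacement, $v_k$ still converges (see 1.4)
The result is

$v_{k + 1} (s_{t}) = v_{k} (s_{t}) - \alpha_{k} \{ v_{k} (s_{t}) - [r_{t + 1} + \gamma v_{k} (s_{t + 1})] \}$

Then, since
- $k$ and $t$ are actually the same thing
- $\alpha_t$ can depend on $s_t$
we get... Voila~

$v_{t + 1} (s_{t}) = v_{t} (s_{t}) - \alpha_{t} (s_{t}) \{ v_{t} (s_{t}) - [r_{t + 1} + \gamma v_{t} (s_{t + 1})] \}$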

1.4 Convergence

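Theorem

By the algorithm introduced in the above sections:

Given a policy $\pi$, if $\sum_t\alpha_t(s) = \infty$ and $\sum_t\alpha_t^2(s)<\infty$ for all $s\in\mathcal{S}$,

then $v_t(s)\to v_{\pi}(s)$ with probability 1 as $t\to\infty$ for all $s\in\mathcal{S}$
About the Learning Rate

In practice $\alpha_t(s_t)$ is often set to a small constant (even though $\sum_t\alpha_t^2(s)<\infty$ then no longer holds)

Otherwise $\alpha_t(s_t)$ keeps getting smaller, and after a very long time the Experience Samples lose their effect because the Learning Rate is too small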
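To see the theorem at work, here is a small simulation sketch. Everything about the setup is an assumption made for illustration: a two-state Markov reward process (the policy is already folded into the transitions), reward equal to the index of the state being left, $\gamma = 0.5$, and the step size $\alpha_t(s) = N_t(s)^{-0.7}$, where $N_t(s)$ counts visits to $s$. That step size satisfies both conditions, since $\sum n^{-0.7} = \infty$ and $\sum n^{-1.4} < \infty$.

```python
import random


def td0_evaluate(num_steps=100_000, gamma=0.5, seed=0):
    """Run TD(0) along one long trajectory of a hypothetical two-state chain."""
    rng = random.Random(seed)
    v = [0.0, 0.0]        # current estimates v_t(s)
    visits = [0, 0]       # N_t(s), used for the decaying learning rate
    s = 0
    for _ in range(num_steps):
        s_next = rng.randrange(2)       # next state is uniform over {0, 1}
        r = float(s)                    # reward 0 when leaving state 0, 1 when leaving state 1
        visits[s] += 1
        alpha = visits[s] ** -0.7       # sum(alpha) = inf, sum(alpha^2) < inf
        v[s] -= alpha * (v[s] - (r + gamma * v[s_next]))
        s = s_next
    return v


v = td0_evaluate()
print(v)
```

For this chain the BEE can be solved by hand: $v_{\pi}(0) = 0.5$ and $v_{\pi}(1) = 1.5$, so the printed estimates should land close to those values.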

2. Sarsa Algorithms

TD Learning of Action Values, used for Policy Evaluation

It can estimate Action Values
It can search for an Optimal Policy

It uses an $\epsilon$-Greedy scheme to pick suitable Actions and update the Policy

2.1 Problem Formulation

Given
1. A policy $\pi$
2. Experience samples generated by following $\pi$, in the form
  $\{(s_{t}, a_{t}, r_{t + 1}, s_{t + 1}, a_{t + 1})\}_{t} = \{s_{0}, a_{0}, r_{1}, s_{1}, a_{1}, \dots, s_{t}, a_{t}, r_{t + 1}, s_{t + 1}, a_{t + 1}, \dots\}$
  State, Action, Reward, State, Action = SARSA
Task

Estimate the action values $\{q_{\pi}(s, a)\}_{s\in\mathcal{S}, a\in\mathcal{A}}$ of the given policy $\pi$

2.2 Sarsa

Essentially the Action Value version of TD(0)

Mathematical Objective

a stochastic approximation algorithm solving
$q_{\pi} (s_{t}, a_{t}) = E [R_{t + 1} + \gamma q_{\pi} (S_{t + 1}, A_{t + 1}) | S_{t} = s_{t}, A_{t} = a_{t}], \quad \forall s_{t}, a_{t}$
This is one way of writing the BEE in terms of Action Values
Algorithm

Given an experience sample $\{ s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$, update

$\begin{aligned} q_{t + 1} (s_{t}, a_{t}) & = q_{t} (s_{t}, a_{t}) - \alpha_{t} (s_{t}, a_{t}) \{ q_{t} (s_{t}, a_{t}) - [r_{t + 1} + \gamma q_{t} (s_{t + 1}, a_{t + 1})] \} \\ q_{t + 1} (s, a) & = q_{t} (s, a), \quad \forall (s, a) \neq (s_{t}, a_{t}) \end{aligned}$

where
- $s_t\longrightarrow$ the state visited at time $t$
- $a_t \longrightarrow$ the action taken at time $t$
- $q_t(s_t, a_t) \longrightarrow$ the estimate of $q_{\pi}(s_t, a_t)$ at time $t$
- $\alpha_t(s_t, a_t)\longrightarrow$ the Learning Rate for $(s_t, a_t)$ at time $t$ (a small positive number)
Pseudo-Code
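A minimal Python sketch of Sarsa combined with $\epsilon$-Greedy policy improvement. The environment is a hypothetical 5-state corridor invented for this example (start at state 0, goal at state 4, reward $-1$ per move), and the hyperparameters are arbitrary assumptions, so treat it as an illustration rather than a reference implementation.

```python
import random

N_STATES, GOAL = 5, 4     # hypothetical corridor: states 0..4, episode ends at state 4
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic corridor dynamics: reward -1 per move, wall on the left of state 0."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    return -1.0, s_next, s_next == GOAL


def epsilon_greedy(q, s, eps, rng):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(2)
    return max((LEFT, RIGHT), key=lambda a: q[s][a])


def sarsa(episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[GOAL] is never updated and stays 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q, s, eps, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            a_next = epsilon_greedy(q, s_next, eps, rng)   # the second "A" of SARSA
            td_target = r + gamma * q[s_next][a_next]
            q[s][a] -= alpha * (q[s][a] - td_target)
            s, a = s_next, a_next
    return q


q = sarsa()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Because the policy is $\epsilon$-Greedy with respect to the table being learned, evaluation and improvement are interleaved at every step.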

2.3 Expected Sarsa

The difference from plain Sarsa is that the TD Target changes slightly

$q_{t} (s_{t + 1}, a_{t + 1}) \longrightarrow E [q_{t} (s_{t + 1}, A)]$

This reduces the number of Random Variables involved in each Experience sample

$\{s_{t}, a_{t}, r_{t + 1}, s_{t + 1}, a_{t + 1}\} \longrightarrow \{s_{t}, a_{t}, r_{t + 1}, s_{t + 1}\}$

Mathematical Objective

a stochastic approximation algorithm solving
$q_{\pi} (s_{t}, a_{t}) = E [R_{t + 1} + \gamma E [q_{\pi} (S_{t + 1}, A_{t + 1}) | S_{t + 1}] | S_{t} = s_{t}, A_{t} = a_{t}], \quad \forall s_{t}, a_{t}$
This is also a way of writing the BEE in terms of Action Values
Algorithm

Given an experience sample $\{ s_t, a_t, r_{t+1}, s_{t+1}\}$, update
$\begin{aligned} q_{t + 1} (s_{t}, a_{t}) & = q_{t} (s_{t}, a_{t}) - \alpha_{t} (s_{t}, a_{t}) \{ q_{t} (s_{t}, a_{t}) - (r_{t + 1} + \gamma E [q_{t} (s_{t + 1}, A)]) \} \\ q_{t + 1} (s, a) & = q_{t} (s, a), \quad \forall (s, a) \neq (s_{t}, a_{t}) \end{aligned}$
where $E [q_{t} (s_{t + 1}, A)] = \sum_{a} \pi_{t} (a | s_{t + 1}) q_{t} (s_{t + 1}, a)$
Notes
1. Needs more computation
2. Because fewer Random Variables are involved, it can reduce estimation variances
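As a sketch of where the extra computation goes, the function below evaluates the Expected Sarsa TD Target when the policy is $\epsilon$-Greedy with respect to the current action values. The function name, the $\epsilon$-Greedy assumption and the numbers are illustrative choices, not something fixed by the notes.

```python
import math


def expected_sarsa_target(r, q_next, eps, gamma):
    """r + gamma * E[q(s', A)] with A drawn from an epsilon-greedy policy over q_next."""
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    # epsilon-greedy probabilities: eps/n for every action, plus 1 - eps on the greedy one
    probs = [eps / n + (1.0 - eps if a == greedy else 0.0) for a in range(n)]
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return r + gamma * expected_q


target = expected_sarsa_target(r=1.0, q_next=[1.0, 3.0], eps=0.2, gamma=0.5)
print(target)
```

The sum over all actions replaces the single sampled $a_{t+1}$, which is why no next action has to be drawn before updating.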

2.4 n-Step Sarsa

This algorithm sits between the plain Sarsa of 2.2 and Monte-Carlo; the key is the formula for the Action Value

$q_{\pi} (s, a) = E [G_{t} | S_{t} = s, A_{t} = a]$

Intuition

$G_t$ can be written in different forms as
$\begin{aligned} \text{Sarsa} \longleftarrow G_{t}^{(1)} & = R_{t + 1} + \gamma q_{\pi} (S_{t + 1}, A_{t + 1}) \\ G_{t}^{(2)} & = R_{t + 1} + \gamma R_{t + 2} + \gamma^{2} q_{\pi} (S_{t + 2}, A_{t + 2}) \\ & \;\; \vdots \\ \text{n-Step Sarsa} \longleftarrow G_{t}^{(n)} & = R_{t + 1} + \gamma R_{t + 2} + \dots + \gamma^{n - 1} R_{t + n} + \gamma^{n} q_{\pi} (S_{t + n}, A_{t + n}) \\ & \;\; \vdots \\ \text{Monte-Carlo} \longleftarrow G_{t}^{(\infty)} & = R_{t + 1} + \gamma R_{t + 2} + \gamma^{2} R_{t + 3} + \dots \end{aligned}$

[ WARNING ]

Please keep in mind that the above are only written in different forms!
$G_{t} = G_{t}^{(1)} = G_{t}^{(2)} = G_{t}^{(n)} = G_{t}^{(\infty)}$
The only difference is how far $G_t$ is decomposed
Mathematical Objective
- Sarsa
  
  It aims to solve
  $\begin{aligned} q_{\pi} (s, a) & = E [G_{t}^{(1)} | S_{t} = s, A_{t} = a] \\ & = E [R_{t + 1} + \gamma q_{\pi} (S_{t + 1}, A_{t + 1}) | S_{t} = s, A_{t} = a] \end{aligned}$
- Monte-Carlo
  
  It aims to solve
  $\begin{aligned} q_{\pi} (s, a) & = E [G_{t}^{(\infty)} | S_{t} = s, A_{t} = a] \\ & = E [R_{t + 1} + \gamma R_{t + 2} + \gamma^{2} R_{t + 3} + \dots | S_{t} = s, A_{t} = a] \end{aligned}$
- n-Step Sarsa
  
  It aims to solve
  
  $\begin{aligned} q_{\pi} (s, a) & = E [G_{t}^{(n)} | S_{t} = s, A_{t} = a] \\ & = E [R_{t + 1} + \gamma R_{t + 2} + \dots + \gamma^{n - 1} R_{t + n} + \gamma^{n} q_{\pi} (S_{t + n}, A_{t + n}) | S_{t} = s, A_{t} = a] \end{aligned}$
  - n-Step Sarsa becomes Sarsa when $n = 1$
  - n-Step Sarsa becomes Monte-Carlo when $n = \infty$
Algorithm

Given an experience sample $\{ s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \cdots, r_{t+n}, s_{t+n}, a_{t+n}\}$

All $n+1$ state-action pairs must be collected before $(s_t, a_t)$ can be updated, so the update has to wait until time $t+n$
$\begin{aligned} q_{t + n} (s_{t}, a_{t}) & = q_{t + n - 1} (s_{t}, a_{t}) - \alpha_{t + n - 1} (s_{t}, a_{t}) \{ q_{t + n - 1} (s_{t}, a_{t}) - [r_{t + 1} + \gamma r_{t + 2} + \dots + \gamma^{n - 1} r_{t + n} + \gamma^{n} q_{t + n - 1} (s_{t + n}, a_{t + n})] \} \\ q_{t + n} (s, a) & = q_{t + n - 1} (s, a), \quad \forall (s, a) \neq (s_{t}, a_{t}) \end{aligned}$
Notes

Performance is a blend of Sarsa and Monte-Carlo
- if n is large, then it has a large variance but a small bias
- if n is small, then it has a large bias but a small variance
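The n-step TD Target itself is easy to compute once the $n$ rewards and the bootstrap value are available. A small sketch, with a hypothetical helper name and made-up numbers:

```python
def n_step_target(rewards, q_bootstrap, gamma):
    """r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n q(s_{t+n}, a_{t+n}).

    rewards     : [r_{t+1}, ..., r_{t+n}], so n = len(rewards)
    q_bootstrap : current estimate q(s_{t+n}, a_{t+n})
    """
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    return target + (gamma ** len(rewards)) * q_bootstrap


# n = 3 and n = 1 (plain Sarsa) with assumed numbers
print(n_step_target([1.0, 1.0, 1.0], q_bootstrap=8.0, gamma=0.5))
print(n_step_target([1.0], q_bootstrap=8.0, gamma=0.5))
```

With a longer reward list the bootstrap term is discounted more heavily, which is where the bias/variance trade-off in the Notes comes from.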

3. Q-Learning

TD Learning of Optimal Action Values

Very similar to Sarsa, but the TD Target and the method used for Policy Improvement differ substantially

3.1 Algorithm

Mathematical Objective

It aims to solve
$q (s, a) = E [R_{t + 1} + \gamma \max_{a} q (S_{t + 1}, a) | S_{t} = s, A_{t} = a]$
This is a BOE (Bellman Optimality Equation) instead of a BEE expressed in terms of Action Values!

So Q-Learning can directly estimate the optimal action values (those of the optimal policy)
Algorithm

Given an experience sample $\{ s_t, a_t, r_{t+1}, s_{t+1}\}$, update
$\begin{aligned} q_{t + 1} (s_{t}, a_{t}) & = q_{t} (s_{t}, a_{t}) - \alpha_{t} (s_{t}, a_{t}) \{ q_{t} (s_{t}, a_{t}) - [r_{t + 1} + \gamma \max_{a \in \mathcal{A}} q_{t} (s_{t + 1}, a)] \} \\ q_{t + 1} (s, a) & = q_{t} (s, a), \quad \forall (s, a) \neq (s_{t}, a_{t}) \end{aligned}$
No $a_{t+1}$ is needed in the sample, because the next action value is obtained through $\max_{a\in\mathcal{A}}$
Notes

Q-Learning has both an On-Policy and an Off-Policy version

3.2 On-Policy Ver. Pseudo-Code

pseudo On-Policy QLearning
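A hedged Python sketch of the On-Policy version: the policy that generates the experience is the $\epsilon$-Greedy version of the policy being learned. The 5-state corridor environment and all hyperparameters are assumptions made for this example.

```python
import random

N_STATES, GOAL = 5, 4     # hypothetical corridor: states 0..4, episode ends at state 4
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic corridor dynamics: reward -1 per move, wall on the left of state 0."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    return -1.0, s_next, s_next == GOAL


def q_learning_on_policy(episodes=1000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[GOAL] stays 0 (terminal)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behave with the epsilon-greedy version of the policy being learned
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((LEFT, RIGHT), key=lambda act: q[s][act])
            r, s_next, done = step(s, a)
            td_target = r + gamma * max(q[s_next])   # max over next actions, no a_{t+1}
            q[s][a] -= alpha * (q[s][a] - td_target)
            s = s_next
    return q


q = q_learning_on_policy()
greedy_policy = [max((LEFT, RIGHT), key=lambda act: q[s][act]) for s in range(GOAL)]
print(greedy_policy)
```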

3.3 Off-Policy Ver. Pseudo-Code

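Note that an Off-Policy Learning algorithm does not need to worry about the exploratory nature of the Target Policy when it updates the Policy, since the Behavior Policy takes care of exploration anyway

Because $\pi_b$ and $\pi_T$ are separated, the Policy update does not need $\epsilon$-Greedy and can directly use the Greedy policy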

pseudo Off-Policy QLearning
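A hedged Python sketch of the Off-Policy version. Here the Behavior Policy $\pi_b$ is assumed to be uniformly random and the Target Policy $\pi_T$ is read off greedily at the end; the corridor environment and hyperparameters are again invented for illustration.

```python
import random

N_STATES, GOAL = 5, 4     # hypothetical corridor: states 0..4, episode ends at state 4
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic corridor dynamics: reward -1 per move, wall on the left of state 0."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    return -1.0, s_next, s_next == GOAL


def q_learning_off_policy(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[GOAL] stays 0 (terminal)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(2)                # behavior policy pi_b: uniform random
            r, s_next, done = step(s, a)
            td_target = r + gamma * max(q[s_next])
            q[s][a] -= alpha * (q[s][a] - td_target)
            s = s_next
    # Target policy pi_T: purely greedy, exploration is left entirely to pi_b
    target_policy = [max((LEFT, RIGHT), key=lambda act: q[s][act]) for s in range(GOAL)]
    return q, target_policy


q, target_policy = q_learning_off_policy()
print(target_policy, q[0][RIGHT])
```

In this deterministic corridor the optimal value of moving right from state 0 is $-1 - 0.9 - 0.81 - 0.729 = -3.439$, which the learned table should approach even though the data came from a random policy.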

[ Notes ] Universal Expression

Algorithm
Mathematical Objective
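As a sketch assembled from the update rules in sections 2 and 3 (the symbol $\bar{q}_t$ for the TD Target is introduced here by analogy with $\bar{v}_t$ in 1.2.1), the algorithms share one update and differ only in the TD Target:

```latex
\begin{aligned}
q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t)\,\bigl[\,q_t(s_t, a_t) - \bar{q}_t\,\bigr] \\
\text{Sarsa:}\quad \bar{q}_t &= r_{t+1} + \gamma\, q_t(s_{t+1}, a_{t+1}) \\
\text{Expected Sarsa:}\quad \bar{q}_t &= r_{t+1} + \gamma \sum_{a} \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \\
\text{n-Step Sarsa:}\quad \bar{q}_t &= r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} q_t(s_{t+n}, a_{t+n}) \\
\text{Q-Learning:}\quad \bar{q}_t &= r_{t+1} + \gamma \max_{a} q_t(s_{t+1}, a)
\end{aligned}
```

Each TD Target is a sample of the right-hand side of the equation that the corresponding algorithm aims to solve.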

[ Notes ] On/Off-Policy Learning

Temporal-Difference Learning algorithms involve two things: Generate Experience and Optimize Policy

Behavior Policy is used to Generate Experience Samples
Target Policy is constantly updated towards Optimal Policy

The difference between On-Policy and Off-Policy lies in whether these two Policies are the same

On-Policy Learning $\longrightarrow$ Behavior Policy $=$ Target Policy

The Policy is first used to Generate Experience and then gets Optimized; these two steps alternate in a loop
Off-Policy Learning $\longrightarrow$ Behavior Policy $\neq$ Target Policy

Policy A is dedicated to Generating Experience, which is then used to Optimize Policy B

For example, Sarsa and Monte-Carlo are both On-Policy; Q-Learning is usually Off-Policy, but it can also be implemented in an On-Policy form

Special Topics - Robbins-Monro

The RM Algorithm is the foundation of the Temporal Difference Method

1. Algorithm

An algorithm for Stochastic Approximation

Problem Formulation

Find the root $w$ of a function $g$ that satisfies

$g (w) = 0$

There are two cases when solving it:
1. Model-Based
  
  If $g$ is known, there are all kinds of Numerical Methods to choose from
2. Model-Free
  
  If $g$ is unknown, use the Robbins-Monro (RM) Algorithm to solve it
Algorithm

$w_{k + 1} = w_{k} - a_{k} \tilde{g} (w_{k}, \eta_{k})$

where
- $w_k \longrightarrow k^{th}$ estimation of root
- $a_k\longrightarrow$ some Positive Coefficient
- $\eta_k\longrightarrow$ the Noise in the observation; see the example in 3.1 if this is unclear
- $\tilde{g}(w_k,\eta_k) = g(w_k) + \eta_k \longrightarrow$ Noisy Observation
Same old saying: without a Model you need Data, and without Data you need a Model
Example

$g (w) = \tanh (w - 1)$

Assume parameters
- $w_1 = 3$
- $a_k = 1/k$
- $\eta_k = 0 \longrightarrow$ $\tilde{g}(w_k,\eta_k)=g(w_k)$
Then the RM Algorithm is

$w_{k + 1} = w_{k} - a_{k} g (w_{k})$

[Figure: RM iterates $w_k$]

Regardless of the sign of the initial value, $w_k$ converges to the root $w^* = 1$
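The example is easy to reproduce numerically. The sketch below runs the noise-free iteration with $a_k = 1/k$; the function name and the number of iterations are arbitrary choices made here.

```python
import math


def robbins_monro_tanh(w1=3.0, num_iters=10_000):
    """Iterate w_{k+1} = w_k - a_k * g(w_k) with g(w) = tanh(w - 1) and a_k = 1/k."""
    w = w1
    for k in range(1, num_iters + 1):
        a_k = 1.0 / k
        w = w - a_k * math.tanh(w - 1.0)
    return w


print(robbins_monro_tanh(3.0))    # positive initial value
print(robbins_monro_tanh(-3.0))   # negative initial value
```

Both runs should end near $w^* = 1$; the approach is slow because $a_k = 1/k$ shrinks the steps as $k$ grows.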

2. Convergence Theorem

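In RM Algorithm, if

1. $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$

  i.e. the graph of $g$ is always increasing and its gradient is bounded, which guarantees that the root $w^*$ with $g(w^*)=0$ exists and is unique
2. $\sum_{k=1}^{\infty}a_k = \infty$ and $\sum_{k=1}^{\infty}a_k^2 < \infty$

  i.e. $a_k\to 0$ as $k\to \infty$, but "$a_k$ does not converge too fast"
3. $\mathbb{E}[\eta_k|\mathcal{H}_k]=0$ and $\mathbb{E}[\eta_k^2|\mathcal{H}_k]<\infty$ where $\mathcal{H}_k = \{w_k,w_{k-1},...\}$

  i.e. "$\eta_k$ has zero mean and the variance of $\eta_k$ is bounded"; the noise does not have to be Gaussian

then $w_k$ converges with probability 1 (w.p.1) to the root $w^*$ satisfying $g(w^*)=0$

"with probability 1" denotes convergence in the probabilistic sense, since sampling of random variables is involved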

3. More Examples

Step right up, don't miss it! Mean Estimation!

3.1 Hmm...

Estimate $w$, the mean of a random variable $X$, from its samples $\{x\}$

$w = E [X]$

Reformulate the problem to

$g (w) = w - E [X] = 0$

A sample $x$ of $X$ then acts as a Noisy Observation

$\begin{aligned} \tilde{g} (w, \eta) & = w - x \\ & = (w - E [X]) + (E [X] - x) \\ & = g (w) + \eta \end{aligned}$

where the Noise is

$\eta = E [X] - x$

Then the RM Algorithm Formulation is

$w_{k + 1} = w_{k} - a_{k} (w_{k} - x_{k})$
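With the particular choice $a_k = 1/k$ (an assumption, the formulation allows other step sizes), this iteration reproduces the running sample mean exactly: $w_{k+1}$ equals the average of $x_1, \dots, x_k$. A short sketch with synthetic Gaussian samples:

```python
import random
import statistics


def rm_mean_estimate(samples, w1=0.0):
    """Iterate w_{k+1} = w_k - a_k * (w_k - x_k) with a_k = 1/k."""
    w = w1
    for k, x_k in enumerate(samples, start=1):
        a_k = 1.0 / k
        w = w - a_k * (w - x_k)
    return w


rng = random.Random(0)
samples = [rng.gauss(5.0, 2.0) for _ in range(1000)]   # synthetic data, true mean 5
estimate = rm_mean_estimate(samples)
print(estimate, statistics.fmean(samples))
```

The initial guess `w1` does not matter here, because $a_1 = 1$ overwrites it with $x_1$ on the first step.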

3.2 Oh ho ho~

Let's make it a bit harder~

Estimate $w$, the mean of a function $v(X)$ of a random variable $X$, from samples $\{x\}$ of $X$

$w = E [v (X)]$

Reformulate the problem to

$g (w) = w - E [v (X)] = 0$

A sample $x$ of $X$ then acts as a Noisy Observation

$\begin{aligned} \tilde{g} (w, \eta) & = w - v (x) \\ & = (w - E [v (X)]) + [E [v (X)] - v (x)] \\ & = g (w) + \eta \end{aligned}$

where the Noise is

$\eta = E [v (X)] - v (x)$

Then the RM Algorithm Formulation is

$w_{k + 1} = w_{k} - a_{k} [w_{k} - v (x_{k})]$

3.3 "Hello"

Add a little more seasoning~

Estimate $w$ from samples $\{x\}$ of $X$ and samples $\{r\}$ of $R$

$w = E [R + \gamma v (X)]$

Reformulate the problem to

$g (w) = w - E [R + \gamma v (X)] = 0$

Samples $x$ of $X$ and $r$ of $R$ then act as a Noisy Observation

$\begin{aligned} \tilde{g} (w, \eta) & = w - [r + \gamma v (x)] \\ & = (w - E [R + \gamma v (X)]) + \{ E [R + \gamma v (X)] - [r + \gamma v (x)] \} \\ & = g (w) + \eta \end{aligned}$

where the Noise is

$\eta = E [R + \gamma v (X)] - [r + \gamma v (x)]$

Then the RM Algorithm Formulation is

$w_{k + 1} = w_{k} - a_{k} \{ w_{k} - [r_{k} + \gamma v (x_{k})] \}$
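This last iteration already has the shape of the TD(0) update. A small sketch with made-up ingredients: $R$ is a fair coin taking values 0 or 1, $X$ is uniform over two labels, $v$ is a fixed lookup table, $\gamma = 0.5$ and $a_k = 1/k$. None of these choices come from the notes.

```python
import random


def rm_expected_target(samples, v, gamma, w1=0.0):
    """Iterate w_{k+1} = w_k - a_k * (w_k - [r_k + gamma * v(x_k)]) with a_k = 1/k."""
    w = w1
    for k, (r_k, x_k) in enumerate(samples, start=1):
        a_k = 1.0 / k
        w = w - a_k * (w - (r_k + gamma * v[x_k]))
    return w


rng = random.Random(0)
v = {"a": 1.0, "b": 3.0}                                   # hypothetical known function v
samples = [(rng.choice([0.0, 1.0]), rng.choice(["a", "b"])) for _ in range(20_000)]
w = rm_expected_target(samples, v, gamma=0.5)
print(w)
```

For these distributions $E[R + \gamma v(X)] = 0.5 + 0.5 \cdot 2 = 1.5$, so the estimate should settle near 1.5. Replacing the fixed $v$ with the current estimate and the i.i.d. samples with a Trajectory gives the TD(0) update.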
