Reinforcement Learning - Basics

1. Basic Concepts

This is the most basic example of reinforcement learning:

[ Figure: env ]

Task

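A robot in a Grid of Cells must find the optimal path from the start to the target
Rules
1. It can move freely (no diagonal moves), but it should not enter Forbidden Cells; doing so is penalized
2. The robot cannot perceive its surroundings; the Grid of Cells is a black box to it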

1.1 State

Status of the agent with respect to the environment

[ Figure: state ]

$s_1, s_2,..., s_9$

State Space $S = \{s_i\}^9_{i=1}$

1.2 Action

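[ Figure: actions ]

Each State has five possible Actions: $a_1$ (up), $a_3$ (down), $a_4$ (left), $a_2$ (right), $a_5$ (stay)

In theory, $s_1,s_2,s_3$ should not be able to take $a_1$, otherwise the Agent hits the wall

Here they are still allowed to take $a_1$, and for rigor a "bounce back off the wall" behavior is added

Action Space of State $A(s_i) = \{a_j\}^5_{j=1}$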

1.3 State-Transition

It describes the following process: When taking an Action, the agent may move from one State to another

Deterministic

100% certain: taking $a_2$ in $s_5$ always leads to $s_6$
$s_{5} \overset{a_{2}}{\to} s_{6}$
Expressed as a State-Transition Probability:
$\begin{aligned} p (s_{6} | s_{5}, a_{2}) & = 1 \\ p (s_{i} | s_{5}, a_{2}) & = 0 \quad \forall i \neq 6 \end{aligned}$
A Tabular Representation can also be used
Stochastic

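In many cases the State-Transition is Stochastic, i.e. taking the same Action in the same State may lead to different new States, for example:
1. $s_1,s_2,s_3$ may take $a_1$, but the effect afterwards is equivalent to staying still
2. After taking $a_1$ in $s_1$ and hitting the wall, the Agent can end up in one of two States:
  - $s_1$, staying in place
  - $s_4$
Assuming the two outcomes in item 2 are 50-50, the Conditional Probability is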

$\begin{aligned} p (s_{1} | s_{1}, a_{1}) & = 0.5 \\ p (s_{4} | s_{1}, a_{1}) & = 0.5 \\ p (s_{i} | s_{1}, a_{1}) & = 0 \forall i \neq 1, 4 \end{aligned}$

However, a Stochastic State-Transition can no longer be written as a simple Table with one next State per (State, Action) entry
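Even when a one-entry-per-cell table no longer suffices, the conditional probabilities can still be stored as one distribution per (State, Action) pair. The following is a minimal sketch under that assumption; the dictionary layout and the helper name `step` are illustrative choices, not part of the original notes, and only the two pairs discussed above are filled in.

```python
import random

# p[(s, a)] maps each possible next state to its probability.
# Only the two (State, Action) pairs discussed above are filled in.
p = {
    ("s5", "a2"): {"s6": 1.0},             # deterministic transition
    ("s1", "a1"): {"s1": 0.5, "s4": 0.5},  # stochastic transition (50-50)
}


def step(state, action, rng=random):
    """Sample the next state from p(. | state, action)."""
    dist = p[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


# Every conditional distribution must sum to 1.
for dist in p.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(step("s5", "a2"), step("s1", "a1"))
```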
General Form

1.4 Policy

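This is not the same thing as State-Transition! Their Conditional Probabilities are defined differently!!!

[ Figure: policy ]

$\pi$ is a (probabilistic) mapping from State Space to Action Space

It is somewhat like a "field", usually drawn with arrows in this kind of example; wherever the Agent is (whichever State), the Policy tells it where to go next

So no matter where the Agent starts the task, the policy should eventually bring it to the target, as in the figure above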

Deterministic

As in the figure above, the Action taken in a State is fixed: there is only one

Taking $s_1$ as an example, its Policy expressed as a Conditional Probability is
$\begin{aligned} \pi (a_{2} | s_{1}) & = 1 \\ \pi (a_{i} | s_{1}) & = 0 \quad \forall i \neq 2 \end{aligned}$
Stochastic

A State takes different Actions with some probability

Taking $s_1$ as an example, its Policy expressed as a Conditional Probability is
$\begin{aligned} \pi (a_{2} | s_{1}) & = 0.5 \\ \pi (a_{3} | s_{1}) & = 0.5 \\ \pi (a_{i} | s_{1}) & = 0 \quad \forall i \neq 2, 3 \end{aligned}$
It can also be written as a Table. Unlike the State-Transition, a Policy can use a Tabular Representation whether it is Deterministic or Stochastic
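As a rough illustration of the Tabular Representation, a Policy can be stored as one row of action probabilities per State. The sketch below assumes the stochastic case for $s_1$ in which $a_2$ and $a_3$ are each taken with probability 0.5; the names `pi` and `sample_action` are hypothetical.

```python
import random

# pi[s] is one table row: the probability of each action in state s.
# Actions that are missing from a row have probability 0.
pi = {
    "s1": {"a2": 0.5, "a3": 0.5},
}


def sample_action(state, rng=random):
    """Draw an action from pi(. | state)."""
    row = pi[state]
    actions = list(row)
    weights = [row[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Each row of the table is a probability distribution over actions.
for row in pi.values():
    assert abs(sum(row.values()) - 1.0) < 1e-12

print(sample_action("s1"))
```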

1.5 Reward

The number obtained after taking an Action

Positive Reward = Encouragement to take such Action
Negative Reward = Punishment to take such Action

Well, you could also use a Positive Number to represent punishment, but then Reward should be renamed Cost

The Absolute Values of the Rewards do not matter; the Relative Values do

For example, replacing $r = \{+1, -1\}$ with $r = \{+2, 0\}$ will not make any difference

Deterministic

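An Action always corresponds to exactly one Reward value

It can be written as a Table, with the following settings
- $r_{bound}=-1 \longrightarrow$ the Agent hits the wall
- $r_{forbid}=-1\longrightarrow$ the Agent enters a forbidden cell
- $r_{target}=+1 \longrightarrow$ the Agent reaches the target
- $r = 0\longrightarrow$ For all other cases
Taking the (State, Action) pair $(s_1, a_1)$ as an example, the Reward expressed as a Conditional Probability is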

$\begin{aligned} p (r = - 1 | s_{1}, a_{1}) & = 1 \\ p (r \neq - 1 | s_{1}, a_{1}) & = 0 \end{aligned}$
Stochastic

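In reality, the Reward value for an Action is not completely fixed

For example, studying hard (the Action) will certainly raise your exam score (the Reward), but by how much (the Reward's value) depends on other factors

This kind cannot be written as a Table
[ Reward Hacking ]

If the reward scheme is badly designed, the trained Agent may maximize its Reward by taking Penalized Actions

For example, in this case the Agent might deliberately cross a Forbidden Cell as a shortcut to the target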

1.6 Episode & Trajectory

Trajectory

A Trajectory is a State-Action-Reward Chain, as in the following two examples

The two corresponding Trajectories are
$\begin{aligned} \text{Policy 1:} \quad & s_{1} \xrightarrow[r = 0]{a_{2}} s_{2} \xrightarrow[r = 0]{a_{3}} s_{5} \xrightarrow[r = 0]{a_{3}} s_{8} \xrightarrow[r = + 1]{a_{2}} s_{9} \\ \text{Policy 2:} \quad & s_{1} \xrightarrow[r = 0]{a_{3}} s_{4} \xrightarrow[r = - 1]{a_{3}} s_{7} \xrightarrow[r = 0]{a_{2}} s_{8} \xrightarrow[r = + 1]{a_{2}} s_{9} \end{aligned}$
Episode

A Finite Trajectory with a Terminal State is an Episode

The two Trajectories above are two Episodes
- Task with a Terminal State $\longrightarrow$ Episodic Task
- Task without a Terminal State $\longrightarrow$ Continuing Task
For Infinite Trajectories, see 1.7.1 Discounted Return

1.7 Return

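The sum of all Rewards along the Trajectory, used to evaluate how good a Policy is

Based on the two examples in 1.6, the Returns of the two Policies are

\begin{aligned} \text{Policy 1:} \quad & \text{Return} = 0 + 0 + 0 + 1 = 1 \\ \text{Policy 2:} \quad & \text{Return} = 0 - 1 + 0 + 1 = 0 \end{aligned}

Policy 1 has the larger Return, so it is clearly the better one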
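The comparison above can be reproduced in a few lines. This is only a sketch: the reward lists are read off the two Trajectories in 1.6, and the function name `undiscounted_return` is a hypothetical helper.

```python
def undiscounted_return(rewards):
    """Return = plain sum of the rewards collected along a finite trajectory."""
    return sum(rewards)


# Rewards read off the two trajectories in 1.6.
returns = {
    "Policy 1": undiscounted_return([0, 0, 0, +1]),
    "Policy 2": undiscounted_return([0, -1, 0, +1]),
}

# The policy with the larger return is considered better.
best = max(returns, key=returns.get)
print(returns, best)
```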

1.7.1 Discounted Return

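Sometimes the Policy keeps running after the Agent has reached the target, a kind of "dynamic standstill"

[ Figure: infinite trajectory ]

As in the figure above, this Trajectory is by definition Infinite

s_{1} \xrightarrow[r = 0]{a_{2}} s_{2} \xrightarrow[r = 0]{a_{3}} s_{5} \xrightarrow[r = 0]{a_{3}} s_{8} \xrightarrow[r = + 1]{a_{2}} s_{9} \xrightarrow[r = + 1]{a_{5}} s_{9} \xrightarrow[r = + 1]{a_{5}} s_{9} \xrightarrow[r = + 1]{a_{5}} \dots

Then the Return of the Trajectory diverges

\text{Return} = 0 + 0 + 0 + 1 + 1 + 1 + 1 + \dots = \infty

Not just this one: the Return of a Trajectory from any starting point eventually diverges once the target is reached. If every Return is infinite, Policies can no longer be compared, and this metric loses its meaning

To fix this, introduce a Discount Rate $\gamma \in (0,1)$

\begin{aligned} \text{Discounted Return} & = 0 + 0 \gamma + 0 \gamma^{2} + 1 \gamma^{3} + 1 \gamma^{4} + 1 \gamma^{5} + 1 \gamma^{6} + \dots \\ & = \gamma^{3} (1 + \gamma + \gamma^{2} + \dots) \\ & = \gamma^{3} \frac{1}{1 - \gamma} \end{aligned}

Both Intuitively and Mathematically, introducing this value makes the Return converge (as long as the Rewards are bounded)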
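A quick numerical check of the closed form $\gamma^3 / (1-\gamma)$ may help here. The sketch below truncates the infinite Trajectory after a large number of steps; the value $\gamma = 0.9$, the truncation length and the helper name `discounted_return` are arbitrary assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


gamma = 0.9  # assumed discount rate, any value in (0, 1) works

# Rewards 0, 0, 0, then +1 at every later step (truncated after 2000 steps).
rewards = [0, 0, 0] + [1] * 2000

approx = discounted_return(rewards, gamma)
closed_form = gamma ** 3 / (1 - gamma)

# The truncated sum and the closed form agree up to floating-point error.
print(approx, closed_form)
```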

$\gamma\rightarrow0$ Emphasis on Near Future
$\gamma\rightarrow1$ Emphasis on Far Future

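The description above takes the viewpoint of "the computation has not started yet"

From the viewpoint of "the computation has already finished":

Near Future = the oldest records in the history

Far Future = the most recent records in the history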

1.7.2 Bootstrapping

Consider the following example

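[ Figure: bootstrapping ]

Let $v_i$ denote the Return obtained when starting from $s_i$, then

\begin{aligned} v_{1} & = r_{1} + r_{2} \gamma + r_{3} \gamma^{2} + r_{4} \gamma^{3} + r_{1} \gamma^{4} + \dots = r_{1} + \gamma v_{2} \\ v_{2} & = r_{2} + r_{3} \gamma + r_{4} \gamma^{2} + r_{1} \gamma^{3} + r_{2} \gamma^{4} + \dots = r_{2} + \gamma v_{3} \\ v_{3} & = r_{3} + r_{4} \gamma + r_{1} \gamma^{2} + r_{2} \gamma^{3} + r_{3} \gamma^{4} + \dots = r_{3} + \gamma v_{4} \\ v_{4} & = r_{4} + r_{1} \gamma + r_{2} \gamma^{2} + r_{3} \gamma^{3} + r_{4} \gamma^{4} + \dots = r_{4} + \gamma v_{1} \end{aligned}

Bootstrapping = Returns rely on each other

It is like "stepping on one foot with the other to climb infinitely high": the idea sounds absurd, but the equations themselves are not, and they are solvable

\begin{aligned} v_{1} & = r_{1} + \gamma v_{2} \\ v_{2} & = r_{2} + \gamma v_{3} \\ v_{3} & = r_{3} + \gamma v_{4} \\ v_{4} & = r_{4} + \gamma v_{1} \end{aligned}

Clearly this is a non-homogeneous linear system: four unknowns, four equations, so the 4 Returns are easy to solve for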
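One way to solve the four equations by hand is substitution: plugging each equation into the previous one gives $v_1 = (r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 r_4) / (1 - \gamma^4)$, and the rest follow by back-substitution. The sketch below does exactly that; the reward values and $\gamma = 0.9$ are made-up numbers for illustration only.

```python
gamma = 0.9
r = [0.0, 1.0, 1.0, 1.0]  # hypothetical r_1 .. r_4

# Substituting the four equations into each other:
# v_1 = r_1 + g*r_2 + g^2*r_3 + g^3*r_4 + g^4*v_1
v1 = sum(gamma ** k * r[k] for k in range(4)) / (1 - gamma ** 4)

# Back-substitution through the cycle s4 -> s3 -> s2.
v4 = r[3] + gamma * v1
v3 = r[2] + gamma * v4
v2 = r[1] + gamma * v3

v = [v1, v2, v3, v4]
print(v)
```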

2. Markov Decision Process

2.1 Definition

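A Markov Decision Process (MDP) is a Tuple

\begin{aligned} & \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle \\ \mathcal{S} & \longrightarrow \text{Finite State Space} \\ \mathcal{A} & \longrightarrow \text{Finite Action Space} \\ P & \longrightarrow \text{State-Transition Probability Matrix} \\ R & \longrightarrow \text{Reward Function} \\ \gamma & \longrightarrow \text{Discount Rate} \end{aligned}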

where

State-Transition Probability Matrix

As the name suggests, it simply collects all the probabilities in one place
$P = p (S_{t + 1} | S_{t}, A_{t}) \quad \text{where} \quad S_{t + 1} \in \mathcal{S}, \ S_{t} \in \mathcal{S}, \ A_{t} \in \mathcal{A}$
$\sum_{s_{t+1}\in \mathcal{S}} p(s_{t+1}|s_t,a_t) = 1$ for any $(s_t,a_t)$
Reward Function

It can be written as a Probability Distribution
$R \sim p (R_{t + 1} | S_{t}, A_{t}) \quad \text{where} \quad R_{t + 1} \in \mathcal{R}, \ S_{t} \in \mathcal{S}, \ A_{t} \in \mathcal{A}$
It can also be written as an Expectation
$R = E [R_{t + 1} | S_{t}, A_{t}] \quad \text{where} \quad R_{t + 1} \in \mathcal{R}, \ S_{t} \in \mathcal{S}, \ A_{t} \in \mathcal{A}$
$\sum_{r_{t+1}\in\mathcal{R}} p(r_{t+1}|s_t,a_t) = 1$ for any $(s_t,a_t)$

The subscript is $R_{t+1} / r_{t+1}$ because the Agent has to leave $S_t / s_t$ before it obtains the Reward

If you prefer a different convention, e.g. "the Reward is obtained when entering a State", just define it up front; the effect is the same

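2.2 Other Properties

Policy

$\pi(\mathcal{A}|\mathcal{S}) =$ the probability of taking each Action in $\mathcal{A}$ given each State in $\mathcal{S}$

Here it is written in Matrix Form

$\sum_{a_t\in\mathcal{A}} \pi(a_t|s_t) = 1$ for any $s_t\in\mathcal{S}$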
Markov Property

In jargon: The Memoryless property of a stochastic process

In plain words: the probability of the next outcome does not depend on the history of decisions; an MDP has no memory
$\begin{aligned} p (s_{t + 1} | s_{t}, a_{t}) & = p (s_{t + 1} | s_{t}, a_{t}, s_{t - 1}, a_{t - 1}, \dots, s_{0}, a_{0}) \\ p (r_{t + 1} | s_{t}, a_{t}) & = p (r_{t + 1} | s_{t}, a_{t}, s_{t - 1}, a_{t - 1}, \dots, s_{0}, a_{0}) \end{aligned}$
As the expressions show, you can supply as many past decisions as you like, but all of them are useless (muda)

The next State / Reward only depends on the current State & Action!

2.3 Markov Process

Markov Decision Process becomes Markov Process once the Policy is given

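For example, take the Grid of Cells from section 1: the left figure shows the Policy being followed, and the more abstract figure on the right is the corresponding Markov Process

[ Figure: markov process ]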

3. Bellman Expectation Equation

It is usually just called the Bellman Equation

3.1 State Value

"Tells us which State is the best to start from"

$v_{\pi}(s)$ is the Expected Return $G_t$ when starting from State $s$ and following Policy $\pi$

\begin{aligned} \text{Trajectory:} \quad & S_{t} \overset{A_{t}}{\to} R_{t + 1}, S_{t + 1} \overset{A_{t + 1}}{\to} R_{t + 2}, S_{t + 2} \overset{A_{t + 2}}{\to} R_{t + 3}, \dots \\ \text{Discounted Return:} \quad & G_{t} = R_{t + 1} + R_{t + 2} \gamma + R_{t + 3} \gamma^{2} + \dots \\ \text{State Value:} \quad & v_{\pi} (s) = E [G_{t} | S_{t} = s] \end{aligned}

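Notes on $v_{\pi}(s)$:

$G_t$ having an Expectation makes sense mathematically

$G_t,R_{t+1},R_{t+2},R_{t+3}$ are all Random Variables

Since $R_{t+1},R_{t+2},R_{t+3}$ are Random Variables, their discounted sum $G_t$ is also a Random Variable
$G_t$ having an Expectation makes sense intuitively

Starting from a given State, there may be many Trajectories that reach the target, and therefore possibly many different Discounted Returns

If the Policy, the State-Transition and the Reward are all completely Deterministic, then there is exactly one Trajectory starting from a given State, hence
$v_{\pi} (s) = G_{t}$
State Value depends on the Policy

If the Policy changes, the Trajectory changes too, which makes sense

Writing it as $v(\pi,s)$ might be more appropriate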

[ Example ]

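Consider two Policies: $\pi_1$, which is Deterministic, and $\pi_2$, which is Stochastic

[ Figure: d v.s. s (Deterministic v.s. Stochastic Policy) ]

The State Value starting from $s_1$ is

\begin{aligned} \text{Policy } \pi_{1} : \quad & v_{\pi_{1}} (s_{1}) = 1 \cdot (- 1 + 1 \gamma + 1 \gamma^{2} + \dots) + 0 \\ \text{Policy } \pi_{2} : \quad & v_{\pi_{2}} (s_{1}) = 0.5 \cdot (- 1 + 1 \gamma + 1 \gamma^{2} + \dots) + 0.5 \cdot (0 + 1 \gamma + 1 \gamma^{2} + \dots) \end{aligned}

Final result

\begin{aligned} \text{Policy } \pi_{1} : \quad & v_{\pi_{1}} (s_{1}) = - 1 + \frac{\gamma}{1 - \gamma} \\ \text{Policy } \pi_{2} : \quad & v_{\pi_{2}} (s_{1}) = - 0.5 + \frac{\gamma}{1 - \gamma} \end{aligned}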

3.2 Elementwise Form

The Bellman Expectation Equation is used to compute State Values

Consider the following Random Trajectory

S_{t} \overset{A_{t}}{\to} R_{t + 1}, S_{t + 1} \overset{A_{t + 1}}{\to} R_{t + 2}, S_{t + 2} \overset{A_{t + 2}}{\to} R_{t + 3}, \dots

Its Discounted Return is

\begin{aligned} G_{t} & = R_{t + 1} + R_{t + 2} \gamma + R_{t + 3} \gamma^{2} + \dots \\ & = R_{t + 1} + \gamma (R_{t + 2} + R_{t + 3} \gamma + \dots) \\ & = R_{t + 1} + \gamma G_{t + 1} \end{aligned}

The State Value formula can then be processed as follows

\begin{aligned} v_{\pi} (s_{t}) & = E [G_{t} | S_{t} = s_{t}] \\ & = E [R_{t + 1} + \gamma G_{t + 1} | S_{t} = s_{t}] \\ & = \underbrace{E [R_{t + 1} | S_{t} = s_{t}]}_{\text{Expected Immediate Reward}} + \gamma \underbrace{E [G_{t + 1} | S_{t} = s_{t}]}_{\text{Expected Future Reward}} \end{aligned}

Compute the two terms separately

"Immediate Reward" Term
$\begin{aligned} E [R_{t + 1} | S_{t} = s_{t}] & = \sum_{a_{t}} \pi (a_{t} | s_{t}) E [R_{t + 1} | S_{t} = s_{t}, A_{t} = a_{t}] \\ & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] \end{aligned}$
"Future Reward" Term

In the second step, the Expectation is simplified via the Markov Property
$\begin{aligned} E [G_{t + 1} | S_{t} = s_{t}] & = \sum_{s_{t + 1}} E [G_{t + 1} | S_{t + 1} = s_{t + 1}, S_{t} = s_{t}] \, p (s_{t + 1} | s_{t}) \\ & = \sum_{s_{t + 1}} E [G_{t + 1} | S_{t + 1} = s_{t + 1}] \sum_{a_{t}} [\pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t})] \\ & = \sum_{s_{t + 1}} v_{\pi} (s_{t + 1}) \sum_{a_{t}} [\pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t})] \end{aligned}$

Combining both parts gives

\begin{aligned} v_{\pi} (s_{t}) & = E [R_{t + 1} | S_{t} = s_{t}] + \gamma E [G_{t + 1} | S_{t} = s_{t}] \\ & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \cdot \sum_{s_{t + 1}} v_{\pi} (s_{t + 1}) \sum_{a_{t}} [\pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t})] \\ & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \cdot \sum_{a_{t}} \sum_{s_{t + 1}} \pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1}) \\ & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \\ \text{Bellman Equation:} \quad v_{\pi} (s_{t}) & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \right) \end{aligned}

where

$v_{\pi}(s_t), v_{\pi}(s_{t+1}) \longrightarrow$ State Values to be computed

with Bootstrapping in 1.7.2
$\pi(a_t|s_t) \longrightarrow$ given Policy
$p(r_{t+1}|s_t,a_t), p(s_{t+1}|s_t,a_t) \longrightarrow$ Environment Models
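1. In the current context, these Models are known
2. If the Models are unknown, that is the territory of Model-Free RL

Once the values of all of the above (including $\gamma$) are given, the State Values are easy to solve for

A slightly lazier way to write it (the version above shows the nesting of the terms in more detail)

\text{Bellman Equation:} \quad v_{\pi} (s_{t}) = \sum_{a_{t}} \pi (a_{t} | s_{t}) \left[ \sum_{r_{t + 1}} p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1} + \gamma \sum_{s_{t + 1}} p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1}) \right] \quad \forall s_{t} \in \mathcal{S}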
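To make the nesting of the sums concrete, here is a minimal sketch that evaluates the right-hand side of the elementwise Bellman Equation for one State. The dictionary-based model (`pi`, `p_r`, `p_s`), the function name `bellman_backup` and the one-state toy problem are all assumptions made for illustration; they are not taken from the grid example.

```python
def bellman_backup(s, v, pi, p_r, p_s, gamma):
    """Evaluate the right-hand side of the elementwise Bellman equation at state s."""
    total = 0.0
    for a, pi_a in pi[s].items():
        # sum_r p(r | s, a) * r
        expected_reward = sum(prob * r for r, prob in p_r[(s, a)].items())
        # sum_s' p(s' | s, a) * v(s')
        expected_next_value = sum(prob * v[s2] for s2, prob in p_s[(s, a)].items())
        total += pi_a * (expected_reward + gamma * expected_next_value)
    return total


# Hypothetical one-state example: the agent stays in "s" forever with reward +1.
gamma = 0.9
pi = {"s": {"stay": 1.0}}
p_r = {("s", "stay"): {1.0: 1.0}}   # reward value -> probability
p_s = {("s", "stay"): {"s": 1.0}}   # next state -> probability
v = {"s": 1.0 / (1.0 - gamma)}      # candidate value: 1 + gamma + gamma^2 + ...

# If v is the true state value, the backup reproduces it.
rhs = bellman_backup("s", v, pi, p_r, p_s, gamma)
print(v["s"], rhs)
```

A value function that satisfies the Bellman Equation is left unchanged by this backup, which is what the last two lines check for the toy problem.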

3.3 Matrix-Vector Form

3.3.1 Derivation

The Elementwise Form of the BEE is tied to the number of States: there is one such equation per State, and stacking them all together is very convenient

First rework the Elementwise Form, going back to the third line of the "combining both parts" derivation in 3.2

\begin{aligned} v_{\pi} (s_{t}) & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \right) \\ & = \sum_{a_{t}} \pi (a_{t} | s_{t}) \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{a_{t}} \sum_{s_{t + 1}} \pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1}) \\ & = \underbrace{\sum_{a_{t}} \pi (a_{t} | s_{t}) \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}]}_{r_{\pi} (s_{t})} + \gamma \sum_{s_{t + 1}} \underbrace{\sum_{a_{t}} [\pi (a_{t} | s_{t}) p (s_{t + 1} | s_{t}, a_{t})]}_{p_{\pi} (s_{t + 1} | s_{t})} v_{\pi} (s_{t + 1}) \\ & = r_{\pi} (s_{t}) + \gamma \sum_{s_{t + 1}} p_{\pi} (s_{t + 1} | s_{t}) v_{\pi} (s_{t + 1}) \end{aligned}

where

$r_{\pi}(s_t) \longrightarrow$ Expectation of Immediate Reward
$p_{\pi}(s_{t+1}|s_t)\longrightarrow$ the probability of moving from $s_t$ to $s_{t+1}$ under Policy $\pi$

Let $s_i = s_t, s_j = s_{t+1}$; the Bellman Equation for $s_i$ is

v_{\pi} (s_{i}) = r_{\pi} (s_{i}) + \gamma \sum_{s_{j}} p_{\pi} (s_{j} | s_{i}) v_{\pi} (s_{j})

Then write out this equation for every State and stack them to obtain the Matrix-Vector Form

\text{Bellman Equation:} \quad v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}

where

\begin{aligned} \text{State Values:} \quad & v_{\pi} = \begin{bmatrix} v_{\pi} (s_{1}) \\ v_{\pi} (s_{2}) \\ \vdots \\ v_{\pi} (s_{n}) \end{bmatrix} \in \mathbb{R}^{n} \\ \text{Expected Immediate Rewards:} \quad & r_{\pi} = \begin{bmatrix} r_{\pi} (s_{1}) \\ r_{\pi} (s_{2}) \\ \vdots \\ r_{\pi} (s_{n}) \end{bmatrix} \in \mathbb{R}^{n} \\ \text{State-Transition Matrix:} \quad & P_{\pi} \in \mathbb{R}^{n \times n} \quad \text{with} \quad [P_{\pi}]_{i j} = p_{\pi} (s_{j} | s_{i}) \end{aligned}

3.3.2 Example

Let $n=4$, then

\underbrace{\begin{bmatrix} v_{\pi} (s_{1}) \\ v_{\pi} (s_{2}) \\ v_{\pi} (s_{3}) \\ v_{\pi} (s_{4}) \end{bmatrix}}_{v_{\pi}} = \underbrace{\begin{bmatrix} r_{\pi} (s_{1}) \\ r_{\pi} (s_{2}) \\ r_{\pi} (s_{3}) \\ r_{\pi} (s_{4}) \end{bmatrix}}_{r_{\pi}} + \gamma \cdot \underbrace{\begin{bmatrix} p_{\pi} (s_{1} | s_{1}) & p_{\pi} (s_{2} | s_{1}) & p_{\pi} (s_{3} | s_{1}) & p_{\pi} (s_{4} | s_{1}) \\ p_{\pi} (s_{1} | s_{2}) & p_{\pi} (s_{2} | s_{2}) & p_{\pi} (s_{3} | s_{2}) & p_{\pi} (s_{4} | s_{2}) \\ p_{\pi} (s_{1} | s_{3}) & p_{\pi} (s_{2} | s_{3}) & p_{\pi} (s_{3} | s_{3}) & p_{\pi} (s_{4} | s_{3}) \\ p_{\pi} (s_{1} | s_{4}) & p_{\pi} (s_{2} | s_{4}) & p_{\pi} (s_{3} | s_{4}) & p_{\pi} (s_{4} | s_{4}) \end{bmatrix}}_{P_{\pi}} \underbrace{\begin{bmatrix} v_{\pi} (s_{1}) \\ v_{\pi} (s_{2}) \\ v_{\pi} (s_{3}) \\ v_{\pi} (s_{4}) \end{bmatrix}}_{v_{\pi}}

If the Policy and the Rewards are as follows

Then the Matrix-Vector Form becomes

\begin{bmatrix} v_{\pi} (s_{1}) \\ v_{\pi} (s_{2}) \\ v_{\pi} (s_{3}) \\ v_{\pi} (s_{4}) \end{bmatrix} = \begin{bmatrix} 0.5 (0) + 0.5 (- 1) \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \cdot \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_{\pi} (s_{1}) \\ v_{\pi} (s_{2}) \\ v_{\pi} (s_{3}) \\ v_{\pi} (s_{4}) \end{bmatrix}

3.3.3 How to Solve

Closed-Form Solution

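The good news: it is concise and convenient. The bad news: it involves a Matrix Inversion, so it is rarely practical when the State Space is large
$v_{\pi} = (I - \gamma P_{\pi})^{- 1} r_{\pi}$
Iterative Solution

The bad news: the proof is tedious. The good news: it is easy to understand and works well

$v_{k + 1} = r_{\pi} + \gamma P_{\pi} v_{k}$

Start at $k=0$ with an initial guess $v_0$ and compute $v_1$

Repeating this generates the sequence $\{v_0, v_1, v_2,...\}$, and eventually

$v_{k} \to v_{\pi} = (I - \gamma P_{\pi})^{- 1} r_{\pi} \quad \text{as} \quad k \to \infty$

[ Tips ]
1. $k$ is the iteration index, not the time step $t$; it has nothing to do with the State
2. It is like approximating with a series: there is no need to actually reach $k\rightarrow \infty$
  
  Set a convergence threshold $\epsilon$; reaching $\lVert v_k - v_{\pi} \rVert < \epsilon$ is enough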
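The Iterative Solution can be sketched on the 4-state example from 3.3.2. This is an illustrative implementation, not the only way to do it: $\gamma = 0.9$ is an assumed value, the name `policy_evaluation` is hypothetical, and since $v_{\pi}$ is unknown while iterating, the stopping test uses $\lVert v_{k+1} - v_k \rVert$ as a practical stand-in for $\lVert v_k - v_{\pi} \rVert$.

```python
def policy_evaluation(r_pi, P_pi, gamma, eps=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi v_k until successive iterates agree."""
    n = len(r_pi)
    v = [0.0] * n  # initial guess v_0
    for _ in range(max_iter):
        v_next = [
            r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        # Max-norm of v_{k+1} - v_k, used as a proxy for the distance to v_pi.
        if max(abs(a - b) for a, b in zip(v_next, v)) < eps:
            return v_next
        v = v_next
    return v


gamma = 0.9  # assumed discount rate
r_pi = [0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0]
P_pi = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

v_pi = policy_evaluation(r_pi, P_pi, gamma)
print(v_pi)
```

For this particular example the system can also be solved by hand: $v_{\pi}(s_4) = v_{\pi}(s_2) = v_{\pi}(s_3) = 1/(1-\gamma)$ and $v_{\pi}(s_1) = -0.5 + \gamma/(1-\gamma)$, which the iteration should approach.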

3.4 Action Value

"Tells us which Action is the best to take in a State"

3.4.1 Definition

$q_{\pi}(s,a)$ is the Expected Return $G_t$ when starting from State $s$, taking Action $a$, and following Policy $\pi$ afterwards

\begin{aligned} \text{Trajectory:} \quad & S_{t} \overset{A_{t}}{\to} R_{t + 1}, S_{t + 1} \overset{A_{t + 1}}{\to} R_{t + 2}, S_{t + 2} \overset{A_{t + 2}}{\to} R_{t + 3}, \dots \\ \text{Discounted Return:} \quad & G_{t} = R_{t + 1} + R_{t + 2} \gamma + R_{t + 3} \gamma^{2} + \dots \\ \text{Action Value:} \quad & q_{\pi} (s, a) = E [G_{t} | S_{t} = s, A_{t} = a] \end{aligned}

Action Value is also called Q-Value

3.4.2 Relation to State Value

"Action Value $\to$ State Value"

Action Value differs from the State Value defined in 3.1 only by the step of "choosing a particular Action"

Here are the definition and formula of State Value again, for comparison:

$v_{\pi}(s)$ is the Expected Return $G_t$ when starting from State $s$
$v_{\pi} (s) = E [G_{t} | S_{t} = s]$

So the relation between Action Value and State Value is
$\begin{aligned} v_{\pi} (s) & = E [G_{t} | S_{t} = s] \\ & = \sum_{a} \pi (a | s) \cdot E [G_{t} | S_{t} = s, A_{t} = a] \\ v_{\pi} (s) & = \sum_{a} [\pi (a | s) \cdot q_{\pi} (s, a)] \end{aligned}$
"State Value $\to$ Action Value"

Comparing with the Elementwise BEE in 3.2 shows
$v_{\pi} (s_{t}) = \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \underbrace{\left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \right)}_{q_{\pi} (s_{t}, a_{t})}$
so $q_{\pi}(s_t,a_t)$ and $v_{\pi}(s_{t+1})$ are related by
$q_{\pi} (s_{t}, a_{t}) = \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})]$
Summary
$\begin{aligned} \text{AV to SV:} \quad & v_{\pi} (s_{t}) = \sum_{a_{t}} [\pi (a_{t} | s_{t}) \cdot q_{\pi} (s_{t}, a_{t})] \\ \text{SV to AV:} \quad & q_{\pi} (s_{t}, a_{t}) = \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \end{aligned}$
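The "AV to SV" relation can be checked against the Stochastic Policy $\pi_2$ from the example in 3.1. In that example the two branches at $s_1$ have Returns $-1 + \gamma/(1-\gamma)$ and $0 + \gamma/(1-\gamma)$, which are exactly the two Action Values. The sketch below assumes $\gamma = 0.9$; the action labels `a_x` and `a_y` are placeholders, since the figure that names the actual actions is not available here.

```python
gamma = 0.9  # assumed discount rate

# gamma + gamma^2 + gamma^3 + ... : the discounted tail shared by both branches
tail = gamma / (1 - gamma)

# Action values at s1 (placeholder action names):
q_s1 = {
    "a_x": -1 + tail,  # branch with immediate reward -1
    "a_y": 0 + tail,   # branch with immediate reward 0
}
pi_s1 = {"a_x": 0.5, "a_y": 0.5}

# AV to SV: v(s) = sum_a pi(a|s) * q(s, a)
v_s1 = sum(pi_s1[a] * q_s1[a] for a in pi_s1)
print(v_s1, -0.5 + tail)
```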

4. Bellman Optimality Equation

A Policy $\pi^*$ is Optimal if $v_{\pi^*}(s) \geq v_{\pi}(s)$ for all $s$ and for any other Policy $\pi$

4.1 Definition

Bellman Optimality Equation (BOE):

The Policy $\pi(a_t|s_t)$ is not known
Compared with the BEE, there is an extra maximization in front

The two forms of the BOE are given below

Elementwise Form

$\begin{aligned} v_{\pi} (s_{t}) & = \max_{\pi} \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v_{\pi} (s_{t + 1})] \right) \\ & = \max_{\pi} \sum_{a_{t}} [\pi (a_{t} | s_{t}) \cdot q_{\pi} (s_{t}, a_{t})] \end{aligned}$

where
- $p(r_{t+1}|s_t,a_t), p(s_{t+1}|s_t,a_t)\longrightarrow$ Known
- $v_{\pi}(s_t), v_{\pi}(s_{t+1})\longrightarrow$ Unknown, to be Computed
- $\pi(a_t|s_t)\longrightarrow$ Unknown, to be Computed
Matrix-Vector Form

$\max_{\pi}$ acts on every entry of $(r_{\pi} + \gamma P_{\pi}v_{\pi})$
$v_{\pi} = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v_{\pi}]$

4.2 Preliminaries for the Solution

4.2.1 Solving the Elementwise Form

Start from the version written with Action Values

v_{\pi} (s_{t}) = \max_{\pi} \sum_{a_{t}} [\pi (a_{t} | s_{t}) \cdot q_{\pi} (s_{t}, a_{t})]

The problem can be simplified to

Find $c_1^*,c_2^*,c_3^*$, with $q_1,q_2,q_3$ given, that solve

\max_{c_{1}, c_{2}, c_{3}} c_{1} q_{1} + c_{2} q_{2} + c_{3} q_{3}

where $c_i$ plays the role of $\pi$, so $c_1+c_2+c_3=1$ and $c_1,c_2,c_3 \geq 0$

Suppose $q_3 > q_1,q_2$; then the Optimal Coefficients are

\begin{array}{r} c_{1}^{*} = 0 \\ c_{2}^{*} = 0 \\ c_{3}^{*} = 1 \end{array}

Giving all the Weight to the largest Term maximizes the weighted sum

\begin{aligned} c_{1}^{*} q_{1} + c_{2}^{*} q_{2} + c_{3}^{*} q_{3} & = q_{3} \cdot 1 \\ & = q_{3} \cdot (c_{1} + c_{2} + c_{3}) \\ & = c_{1} q_{3} + c_{2} q_{3} + c_{3} q_{3} \\ & \geq c_{1} q_{1} + c_{2} q_{2} + c_{3} q_{3} \end{aligned}

Going back to the BOE, we can conclude

\begin{aligned} v_{\pi} (s_{t}) & = \max_{a_{t}} q_{\pi} (s_{t}, a_{t}) \\ a_{t}^{*} & = \arg \max_{a_{t}} q_{\pi} (s_{t}, a_{t}) \\ \pi (a_{t} | s_{t}) & = \begin{cases} 1 & a_{t} = a_{t}^{*} \\ 0 & a_{t} \neq a_{t}^{*} \end{cases} \end{aligned}
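The "put all the weight on the largest term" argument can be checked numerically. This is a rough sketch only: the $q$ values are made up, and the helper names `greedy_weights` and `weighted_sum` are hypothetical. It compares the greedy weights against many random weights that satisfy the same constraints.

```python
import random


def greedy_weights(q):
    """Put weight 1 on the largest q_i and 0 on all others."""
    best = max(range(len(q)), key=lambda i: q[i])
    return [1.0 if i == best else 0.0 for i in range(len(q))]


def weighted_sum(c, q):
    return sum(ci * qi for ci, qi in zip(c, q))


q = [1.0, 2.0, 5.0]  # hypothetical q_1, q_2, q_3 with q_3 the largest
c_star = greedy_weights(q)
best_value = weighted_sum(c_star, q)

# No other non-negative weights summing to 1 should do better.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() for _ in q]
    total = sum(raw)
    c = [x / total for x in raw]
    assert weighted_sum(c, q) <= best_value + 1e-12

print(c_star, best_value)
```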

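4.2.2 Solving the Matrix-Vector Form

One equation contains two unknowns, $\pi$ and $v_{\pi}$

v_{\pi} = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v_{\pi}]

This is similar to solving the following kind of problem

Find $x,a\in\mathbb{R}$ that satisfy

x = \max_{a} (2 x - 1 - a^{2})

Since $a^2 \geq 0$, the maximum is attained at $a = 0$, so the problem becomes

\begin{aligned} x & = 2 x - 1 \\ x & = 1 \end{aligned}

Honestly, while writing this part I doubted whether it was really worth writing down, but in case my brain goes GaGa later, I will keep a record anyway (sigh)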

4.2.3 Contraction Mapping Theorem

Fixed Point

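$x\in X$ is a Fixed Point of $f:X\to X$ if
$x = f (x)$
Contraction Mapping

$f$ is a contraction mapping if
$\lVert f (x_{1}) - f (x_{2}) \rVert \leq \gamma \lVert x_{1} - x_{2} \rVert$
where $\gamma\in(0,1)$

An intuitive explanation: applying $f$ always pulls any two points closer together
Contraction Mapping Theorem
1. Definition
  
  If $f$ is a Contraction Mapping, there exists a unique Fixed Point $x^*$ satisfying $x^* = f(x^*)$
2. Algorithm
  
  Consider the sequence $\{x_k\}$ with $x_{k+1} = f(x_k)$: starting from any initial guess $x_0$ it generates $\{x_0,x_1,...,x_{\infty}\}$, and then...
  $x_{k} \to x^{*} \quad \text{as} \quad k \to \infty$
  Convergence Rate is Exponentially Fast given any Initial Guess, which is very handy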
[ Example ]

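Consider $f(x)=0.5x$ with $x\in\mathbb{R}$

Then the iteration is

$x_{k + 1} = 0.5 x_{k}$

With $x_0 = 10$, the sequence $\{x_k\}$ is

$x_{0} = 10, x_{1} = 5, x_{2} = 2.5, \dots, x_{\infty} = x^{*} = 0$

$x^* = 0$ indeed satisfies the Fixed Point condition (although the answer is obvious at a glance...)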
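The iteration in this example is short enough to run directly. Below is a minimal sketch of a generic fixed-point iteration applied to $f(x) = 0.5x$; the function name `fixed_point`, the tolerance and the iteration cap are arbitrary assumptions.

```python
def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{k+1} = f(x_k) until two successive iterates are within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x


def f(x):
    # Contraction mapping with factor 0.5
    return 0.5 * x


# The first few iterates reproduce 10, 5, 2.5 from the text.
first_iterates = [10.0, f(10.0), f(f(10.0))]

x_star = fixed_point(f, x0=10.0)
print(first_iterates, x_star)
```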

4.3 Solution

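The right-hand side of the BOE is a function of $v_{\pi}$, so we can define

f (v_{\pi}) = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v_{\pi}]

Then the BOE can be written as this super concise formula

v_{\pi} = f (v_{\pi})

[ Key Result ]: $f(v_{\pi})$ is a Contraction Mapping

The proof is skipped here; the contraction factor is the Discount Rate $\gamma < 1$. In short, the conclusions of the Contraction Mapping Theorem in 4.2.3 apply directly

Existence

There exists a unique solution $v_{\pi}^*$ satisfying
$v_{\pi}^{*} = f (v_{\pi}^{*}) = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v_{\pi}^{*}]$
Algorithm

Same as in 4.2.3, it can be solved iteratively by
$\begin{matrix} v_{k + 1} = f (v_{k}) = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v_{k}] \\ v_{k} \to v^{*} \quad \text{as} \quad k \to \infty \end{matrix}$
The convergence rate is governed by $\gamma$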

4.4 Policy Optimality & Summary

State Value

$v^*$ is the Unique Solution to BOE
$v^{*} = \max_{\pi} [r_{\pi} + \gamma P_{\pi} v^{*}]$
Let $v_{\pi}$ be the State Value of any given Policy $\pi$, satisfying the BEE
$v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$
Then
$v^{*} \geq v_{\pi} \quad \forall \pi$
Policy
$\begin{aligned} \pi^{*} (a_{t} | s_{t}) & = \arg \max_{\pi} [r_{\pi} + \gamma P_{\pi} v^{*}] \\ & = \arg \max_{\pi} \sum_{a_{t}} \pi (a_{t} | s_{t}) \cdot \left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v^{*} (s_{t + 1})] \right) \\ & = \arg \max_{\pi} \sum_{a_{t}} [\pi (a_{t} | s_{t}) \cdot q^{*} (s_{t}, a_{t})] \end{aligned}$
For any $s \in \mathcal{S}$, the Deterministic Greedy Policy
$\pi^{*} (a_{t} | s_{t}) = \begin{cases} 1 & a_{t} = a^{*} \\ 0 & a_{t} \neq a^{*} \end{cases}$
is an Optimal Policy solving the BOE

So it can actually be stated as
$v^{*} = v_{\pi^{*}}$
Action Value
$\begin{aligned} a^{*} & = \arg \max_{a_{t}} q^{*} (s_{t}, a_{t}) \\ & = \arg \max_{a_{t}} \left( \sum_{r_{t + 1}} [p (r_{t + 1} | s_{t}, a_{t}) r_{t + 1}] + \gamma \sum_{s_{t + 1}} [p (s_{t + 1} | s_{t}, a_{t}) v^{*} (s_{t + 1})] \right) \end{aligned}$
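Putting 4.3 and 4.4 together, the iteration $v_{k+1} = f(v_k)$ followed by the greedy choice $a^* = \arg\max_{a} q^*(s,a)$ can be sketched as below. Everything specific here is an assumption for illustration: the two-state MDP is hypothetical (it is not the grid example), the Reward is taken to be a deterministic number `r[(s, a)]` standing in for $\sum_{r} p(r|s,a) r$, and the function name `value_iteration` is just a label for this sketch.

```python
def value_iteration(states, actions, p_s, r, gamma, eps=1e-10, max_iter=10_000):
    """Iterate v_{k+1}(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v_k(s') ]."""

    def q_value(s, a, v):
        return r[(s, a)] + gamma * sum(
            prob * v[s2] for s2, prob in p_s[(s, a)].items()
        )

    v = {s: 0.0 for s in states}  # initial guess v_0
    for _ in range(max_iter):
        v_next = {s: max(q_value(s, a, v) for a in actions) for s in states}
        done = max(abs(v_next[s] - v[s]) for s in states) < eps
        v = v_next
        if done:
            break

    # Deterministic greedy policy: a* = argmax_a q*(s, a)
    policy = {s: max(actions, key=lambda a: q_value(s, a, v)) for s in states}
    return v, policy


# Hypothetical two-state MDP (not from the notes): state "B" is the target.
states = ["A", "B"]
actions = ["stay", "move"]
p_s = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 1.0},
}
r = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,  # reaching the target
    ("B", "stay"): 1.0,  # staying at the target
    ("B", "move"): 0.0,
}

v_star, pi_star = value_iteration(states, actions, p_s, r, gamma=0.9)
print(v_star, pi_star)
```

In this toy problem the greedy Policy moves from "A" to "B" and then stays, and both optimal State Values approach $1/(1-\gamma)$.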