Optimization - Basics
1. Matrix Derivatives
Objective Function: $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$
- 1st-Order Gradient
  $\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times n}$, generally defined as a Row Vector.
  It is a linear operator mapping $\mathbb{R}^n$ into $\mathbb{R}$. When writing code, this convention keeps the dimensions consistent and saves a lot of trouble.
  When writing formulas, however, the other notation $\nabla f(x) = \left( \frac{\partial f}{\partial x} \right)^T \in \mathbb{R}^n$ (a column vector) is more convenient.
- 2nd-Order Gradient
  $\frac{\partial^2 f}{\partial x^2} = \nabla^2 f(x) \in \mathbb{R}^{n \times n}$, i.e. the Hessian.
  The Hessian is a Symmetric Matrix, so the choice of notation does not matter; the dimensions always work out.
  It can be used to form the 2nd-Order Taylor Expansion of the objective function:
  $f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x + \frac{1}{2} \Delta x^T \frac{\partial^2 f}{\partial x^2} \Delta x$
Objective Function: $f(x): \mathbb{R}^n \rightarrow \mathbb{R}^m$
- 1st-Order Gradient
  Generally defined as an $n \times m$ matrix $\nabla f(x)$; its transpose $\frac{\partial f}{\partial x} = \nabla f(x)^T \in \mathbb{R}^{m \times n}$ is also called the Jacobian.
Notes:
The conventions above for the 1st-order derivatives of the two kinds of objective functions, i.e. the row vector $\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times n}$ and the $m \times n$ Jacobian, are chosen so that the Chain Rule works: $\frac{\partial}{\partial x} f(g(x)) = \frac{\partial f}{\partial y} \Big|_{y = g(x)} \frac{\partial g}{\partial x}$, and the dimensions line up automatically.
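As a quick sanity check of this dimension bookkeeping, here is a minimal NumPy sketch; the functions `g`, `f` and their derivatives are made-up illustrations, not something from these notes:

```python
import numpy as np

# g: R^3 -> R^2 and f: R^2 -> R, so f(g(x)): R^3 -> R
def g(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def dg_dx(x):
    # Jacobian of g, shape (2, 3)
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])]])

def df_dy(y):
    # Derivative of f(y) = y0^2 + y1 as a row vector, shape (1, 2)
    return np.array([[2 * y[0], 1.0]])

x = np.array([1.0, 2.0, 0.5])
# Chain rule: (1 x 2) @ (2 x 3) = (1 x 3) row vector; the dimensions line up automatically
dfg_dx = df_dy(g(x)) @ dg_dx(x)
print(dfg_dx.shape)  # (1, 3)
```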
2. Root Finding
Given the dynamics of a system, find its equilibrium.
- In Continuous Time
  Continuous-time dynamics have the form $\dot{x} = f(x)$; just set the derivative to 0, i.e. solve $f(x^*) = 0$.
- In Discrete Time
  Discrete-Time Dynamics have the general form $x_{k+1} = f(x_k)$; just find a fixed point, i.e. solve $f(x^*) = x^*$.
2.1 Newton's Method
The most basic algorithm for root finding.
2.1.1 Algorithm
Assume the objective function is $f(x)$ and we want to solve $f(x) = 0$.
- Step 1: Fit a linear approximation to $f(x)$ around the current iterate:
  $f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$
- Step 2: Solve for $\Delta x$ by setting the approximation to zero:
  $\Delta x = -\left( \frac{\partial f}{\partial x} \right)^{-1} f(x)$
  When $f$ is the gradient of some objective, the same step can be a "Descent" step (minimization) or an "Ascent" step (maximization), depending on the curvature at $x$.
- Step 3: Update $x \leftarrow x + \Delta x$
Repeat Steps 1-3 until convergence.
Notes:
Given only these ingredients, Newton's Method only converges to the closest Fixed Point to the Initial Guess.
Who knows whether the root it finds is a {global / local} {maximum / minimum} or a saddle point; that's for you to sort out yourself.
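A minimal NumPy sketch of the iteration above; the test function, its Jacobian, and the tolerance are illustrative assumptions:

```python
import numpy as np

def newton_root(f, jac, x0, tol=1e-10, max_iters=50):
    """Basic Newton's Method: repeatedly solve the linearization f(x) + J dx = 0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        fx = f(x)
        if np.linalg.norm(fx) < tol:          # converged
            break
        dx = np.linalg.solve(jac(x), -fx)     # Step 2: solve J dx = -f(x)
        x = x + dx                            # Step 3: update
    return x

# Example: root of f(x) = [x0^2 - 2, x1 - x0]  ->  x* = (sqrt(2), sqrt(2))
f = lambda x: np.array([x[0]**2 - 2.0, x[1] - x[0]])
jac = lambda x: np.array([[2.0 * x[0], 0.0], [-1.0, 1.0]])
print(newton_root(f, jac, np.array([1.0, 1.0])))
```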
2.1.2 Regularization
Step 2 in 2.1.1 touched on the condition that guarantees descent:
If $\frac{\partial^2 f}{\partial x^2} \succ 0$ everywhere (which means $f$ is strongly convex), then we can always find a minimum with Newton's Method.
Unfortunately, in practice it is usually not the case for nonlinear problems.
The fix is regularization, done as follows:
- Step 1: Pick a scalar hyperparameter $\beta > 0$
- Step 2: Check if the Hessian $H = \frac{\partial^2 f}{\partial x^2}$ is Positive Definite
- Step 3: If not, then $H \leftarrow H + \beta I$
Repeat Steps 2-3 until $H$ is Positive Definite.
It guarantees descent
Also called "Damped Newton"
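A sketch of the regularization loop, assuming positive definiteness is detected with a Cholesky factorization; the value of $\beta$ and the test Hessian are made up:

```python
import numpy as np

def regularize(H, beta=1.0):
    """Damped Newton: add beta * I until the Hessian is positive definite."""
    H = H.copy()
    while True:
        try:
            np.linalg.cholesky(H)   # succeeds only if H is positive definite
            return H
        except np.linalg.LinAlgError:
            H = H + beta * np.eye(H.shape[0])

H = np.array([[1.0, 0.0], [0.0, -3.0]])   # indefinite Hessian
print(regularize(H))                      # -> [[5., 0.], [0., 1.]]
```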
2.2 Line Search (Back Tracking)
Newton's Method's step size is not adaptive as such; sooner or later there will be a step that overshoots the minimum.
This method is a patch on top of Newton's Method, to make sure the step agrees with the linearization within some tolerance.
Specifically, we check the value of $f(x + \alpha \Delta x)$ for a step length $\alpha \in (0, 1]$.
Let $\Delta x$ be the descent direction produced by (regularized) Newton's Method, so to begin with we always have $\nabla f(x)^T \Delta x < 0$; keep this in mind.
There are several effective strategies
2.2.1 Armijo Condition
or Sufficient Decrease Condition
- Step 1: Set $\alpha = 1$, where
  - $\alpha$: the step length, i.e. the percentage of reduction along $\Delta x$
  - $b$: tolerance
  - $c$: $\alpha$'s reduction rate
- Step 2: Check if $f(x + \alpha \Delta x) \leq f(x) + b \, \alpha \, \nabla f(x)^T \Delta x$
  Here, $\alpha \nabla f(x)^T \Delta x$ is the decrease predicted by the linearization, and $b$ sets what fraction of it must actually be achieved.
- Step 3: Update $\alpha \leftarrow c \alpha$ and repeat Steps 2-3 until Step 2 is fulfilled.
Typical parameter values are $b \approx 10^{-4}$ and $c = 0.5$; a sketch follows below.
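Here is that sketch: backtracking line search under the Armijo condition, with a hypothetical `backtracking_line_search` helper using the typical parameter values above and a made-up test function:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, dx, b=1e-4, c=0.5):
    """Shrink alpha until the Armijo (sufficient decrease) condition holds."""
    alpha = 1.0
    expected_decrease = grad_f(x) @ dx           # negative for a descent direction
    while f(x + alpha * dx) > f(x) + b * alpha * expected_decrease:
        alpha = c * alpha                        # Step 3: reduce the step length
    return alpha

# Illustrative use on f(x) = x0^4 + x1^2 with a deliberately overshooting direction
f = lambda x: x[0]**4 + x[1]**2
grad_f = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
x = np.array([1.0, 1.0])
dx = -10.0 * grad_f(x)                           # too long a step; line search fixes it
print(backtracking_line_search(f, grad_f, x, dx))
```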
In the figure below, the "acceptable" region marks step lengths that do not overshoot the minimum.
The unacceptable region has clearly overshot, i.e. there the sloped dashed line lies below the function itself.

2.2.2 Curvature Condition
From the figure in 2.2.1 it is easy to see that as long as $\alpha$ is small enough, the Armijo condition can always be satisfied.
But that can make the amount of descent too small, which the curvature condition prevents:
$\nabla f(x + \alpha \Delta x)^T \Delta x \geq c_2 \, \nabla f(x)^T \Delta x$, where $c_2 \in (b, 1)$
The requirement of this condition can be read as "The slope at the next iterate is 'less negative' than the current slope".
So if this condition is satisfied, and given that the current slope $\nabla f(x)^T \Delta x < 0$:
- if the new slope is still negative but closer to 0 than the current slope, the function value is closer to the minimum;
- if the new slope is already positive, the function value has passed the minimum.

There is also a stricter version of the curvature condition:
$\left| \nabla f(x + \alpha \Delta x)^T \Delta x \right| \leq c_2 \left| \nabla f(x)^T \Delta x \right|$
Once the absolute-value bound is in place, the only option left is to push the slope toward 0.
2.2.3 Wolfe Conditions
The Wolfe Conditions are simply the Armijo Condition and the Curvature Condition combined.

- Weak Wolfe Conditions
  $f(x + \alpha \Delta x) \leq f(x) + b \, \alpha \, \nabla f(x)^T \Delta x$
  $\nabla f(x + \alpha \Delta x)^T \Delta x \geq c_2 \, \nabla f(x)^T \Delta x$
- Strong Wolfe Conditions
  $f(x + \alpha \Delta x) \leq f(x) + b \, \alpha \, \nabla f(x)^T \Delta x$
  $\left| \nabla f(x + \alpha \Delta x)^T \Delta x \right| \leq c_2 \left| \nabla f(x)^T \Delta x \right|$
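A small helper that checks whether a given step length satisfies the weak or strong Wolfe conditions; the helper name and the test function are illustrative, not part of the notes:

```python
import numpy as np

def wolfe_satisfied(f, grad_f, x, dx, alpha, b=1e-4, c2=0.9, strong=False):
    """Return True if step length alpha satisfies the (weak or strong) Wolfe conditions."""
    armijo = f(x + alpha * dx) <= f(x) + b * alpha * (grad_f(x) @ dx)
    new_slope, old_slope = grad_f(x + alpha * dx) @ dx, grad_f(x) @ dx
    if strong:
        curvature = abs(new_slope) <= c2 * abs(old_slope)
    else:
        curvature = new_slope >= c2 * old_slope
    return armijo and curvature

# Illustrative check on f(x) = x^T x with the full Newton step (alpha = 1)
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
print(wolfe_satisfied(f, grad_f, x, -x, 1.0, strong=True))   # True: lands exactly at the minimum
```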
3. Unconstrained Minimization
Find the minimum: $\min_{x} f(x)$, with $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$
3.1 Why the Hessian Must Be Positive Definite
If we use Newton's Method from 2.1 as the base algorithm for minimization (applied to $\nabla f(x) = 0$), we need to pay close attention to Step 2. Multiplying both sides of $\frac{\partial^2 f}{\partial x^2} \Delta x = -\nabla f(x)$ by $\Delta x^T$ gives
$\Delta x^T \frac{\partial^2 f}{\partial x^2} \Delta x = -\nabla f(x)^T \Delta x$
If the Hessian is positive definite, the left-hand side is positive for any $\Delta x \neq 0$, so $\nabla f(x)^T \Delta x < 0$ and the Newton step is a descent direction.
If the Hessian cannot be made positive definite, descent is not guaranteed, which is exactly what the regularization in 2.1.2 fixes.
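A tiny numeric illustration of this argument; the Hessian and gradient values are arbitrary:

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, 1.0]])   # positive definite Hessian
g = np.array([3.0, -1.0])                # gradient at the current point
dx = np.linalg.solve(H, -g)              # Newton step
print(g @ dx)                            # -5.5 < 0, so dx is a descent direction
```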
3.2 Criteria for a Local Minimum
Assume $f$ is twice continuously differentiable and $x^*$ is a candidate point.
- 1st-Order Necessary Condition
  $\nabla f(x^*) = 0$
- 2nd-Order Necessary Condition
  $\nabla^2 f(x^*) \succeq 0$ (positive semi-definite)
- 2nd-Order Sufficient Condition
  $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite)
  If this condition is fulfilled, then $x^*$ is a Strict Local Minimizer of $f$.
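A quick numeric check of these conditions at a candidate point; the example function and the `is_strict_local_min` helper are illustrative assumptions:

```python
import numpy as np

def is_strict_local_min(grad, hess, x_star, tol=1e-8):
    """Check the 1st-order condition and positive definiteness of the Hessian."""
    stationary = np.linalg.norm(grad(x_star)) < tol
    pos_def = np.all(np.linalg.eigvalsh(hess(x_star)) > tol)
    return stationary and pos_def

# f(x) = x0^2 + 3 x1^2 has a strict local (and global) minimum at the origin
grad = lambda x: np.array([2 * x[0], 6 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(is_strict_local_min(grad, hess, np.array([0.0, 0.0])))   # True
```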
4. Constrained Minimization
Find the minimum of $f(x)$ subject to constraints on $x$.
We ultimately have to convert a Constrained Problem into an Unconstrained Problem in order to solve it.
4.1 Equality Constraints
$\min_{x} f(x) \quad \text{s.t.} \quad c(x) = 0$
where $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$ and $c(x): \mathbb{R}^n \rightarrow \mathbb{R}^m$.
The 1st-Order Necessary Condition to find the minimum is
- $\nabla f(x) = 0$, but it only has to hold along the FREE directions
- The Equality Constraint determines which directions are the Free Directions of $x$
4.1.1 Lagrangian
The Lagrangian turns the Constrained Problem above into an Unconstrained Problem; the derivation is as follows.
- About the Equality Constraint
  "What is a Free Direction?" In short, if after a very small displacement $\Delta x$ the point still satisfies $c(x + \Delta x) = 0$, then that displacement lies along a "Free Direction". A 1st-Order Taylor Expansion gives
  $c(x + \Delta x) \approx c(x) + \frac{\partial c}{\partial x} \Delta x = 0$
  Notes:
  $\frac{\partial c}{\partial x} \in \mathbb{R}^{m \times n}$ is a Jacobian. Since $c(x) = 0$, what actually has to be satisfied is $\frac{\partial c}{\partial x} \Delta x = 0$.
- About the Objective Function
  Similarly, suppose $x$ makes a very small displacement $\Delta x$ in the objective function; a 1st-Order Taylor Expansion gives
  $f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x} \Delta x$
  If $x$ is not yet a minimum, then as long as $\frac{\partial f}{\partial x} \Delta x < 0$ the objective can keep decreasing. Likewise, if $x$ moves in the opposite direction, i.e. $-\Delta x$, then as long as $\frac{\partial f}{\partial x} \Delta x > 0$ it can also keep decreasing (the Tangent Space is symmetric; I won't explain that here, look it up if it's unclear...).
  In any case: if you do not want $f$ to be able to decrease, this $\Delta x$ must satisfy $\frac{\partial f}{\partial x} \Delta x = 0$, i.e. $\nabla f(x)$ must be normal to $\Delta x$.
- Summary
  The conclusion from the two parts above is
  $\frac{\partial f}{\partial x} \Delta x = 0 \quad \text{and} \quad \frac{\partial c}{\partial x} \Delta x = 0$
  i.e. both $\nabla f(x)$ and the rows of $\frac{\partial c}{\partial x}$ must be Normal to the direction of motion $\Delta x$. In other words, $\nabla f(x)$ must be parallel to (a linear combination of) the rows of $\frac{\partial c}{\partial x}$, which can be written as
  $\nabla f(x) + \left( \frac{\partial c}{\partial x} \right)^T \lambda = 0, \quad \lambda \in \mathbb{R}^m$
[ Example ]
At the minimum you will see that the gradients of the objective function and of the equality constraint are parallel.

- Lagrangian
  Based on this gradient condition, we define the Lagrangian as
  $L(x, \lambda) = f(x) + \lambda^T c(x)$
  such that the 1st-Order Necessary Conditions become
  $\nabla_x L(x, \lambda) = \nabla f(x) + \left( \frac{\partial c}{\partial x} \right)^T \lambda = 0$
  $\nabla_\lambda L(x, \lambda) = c(x) = 0$
  or equivalently $\nabla L(x, \lambda) = 0$
4.1.2 KKT System
KKT = Karush-Kuhn-Tucker; the more commonly seen related concept is the KKT Conditions, see 4.2.2.
The goal is to use Newton's Method to solve for the $(x, \lambda)$ that makes $\nabla L(x, \lambda) = 0$, since the linearization of the two gradient equations is
$\nabla_x L(x + \Delta x, \lambda + \Delta \lambda) \approx \nabla_x L(x, \lambda) + \frac{\partial^2 L}{\partial x^2} \Delta x + \left( \frac{\partial c}{\partial x} \right)^T \Delta \lambda = 0$
$\nabla_\lambda L(x + \Delta x, \lambda + \Delta \lambda) \approx c(x) + \frac{\partial c}{\partial x} \Delta x = 0$
Rearranging and writing it in Matrix Form gives the KKT System:
$\begin{bmatrix} \frac{\partial^2 L}{\partial x^2} & \left( \frac{\partial c}{\partial x} \right)^T \\ \frac{\partial c}{\partial x} & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} -\nabla_x L(x, \lambda) \\ -c(x) \end{bmatrix}$
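A sketch of a single Newton step on this KKT System, using a small made-up equality-constrained problem (minimize $x_0^2 + x_1^2$ subject to $x_0 + x_1 - 1 = 0$); the `kkt_newton_step` helper is illustrative:

```python
import numpy as np

def kkt_newton_step(grad_f, hess_L, c, jac_c, x, lam):
    """Assemble and solve the KKT system for one (dx, dlam) Newton step."""
    A = jac_c(x)                                  # (m, n) constraint Jacobian
    H = hess_L(x, lam)                            # (n, n) Hessian of the Lagrangian
    n, m = x.size, lam.size
    K = np.block([[H, A.T],
                  [A, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f(x) + A.T @ lam, c(x)])
    step = np.linalg.solve(K, rhs)
    return step[:n], step[n:]

# min x0^2 + x1^2  s.t.  x0 + x1 - 1 = 0   (analytic solution: x = [0.5, 0.5], lambda = -1)
grad_f = lambda x: 2 * x
hess_L = lambda x, lam: 2 * np.eye(2)             # constraint is linear, so d2L/dx2 = d2f/dx2
c = lambda x: np.array([x[0] + x[1] - 1.0])
jac_c = lambda x: np.array([[1.0, 1.0]])
dx, dlam = kkt_newton_step(grad_f, hess_L, c, jac_c, np.zeros(2), np.zeros(1))
print(dx, dlam)                                   # one step solves this QP exactly
```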
4.1.3 Gauss-Newton Method
This method simplifies the $\frac{\partial^2 L}{\partial x^2}$ block mentioned in 4.1.2:
$\frac{\partial^2 L}{\partial x^2} = \frac{\partial^2 f}{\partial x^2} + \frac{\partial}{\partial x}\left[ \left( \frac{\partial c}{\partial x} \right)^T \lambda \right]$
The 2nd term is expensive to compute because it is a bloody Tensor!
So we just drop it! No second term for us! Ha! Ha! Ha!
Gauss-Newton gives slightly slower convergence than Full Newton:
it takes more iterations, but each individual iteration becomes cheaper.
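In the sketch from 4.1.2, the Gauss-Newton variant amounts to handing in only the Hessian of $f$ and dropping the constraint-curvature term; the constraint and helper names below are made up for illustration:

```python
import numpy as np

# Illustrative nonlinear constraint c(x) = x0^2 + x1 - 1, so its curvature term is nonzero
hess_f = lambda x: 2 * np.eye(2)                          # Hessian of f(x) = x0^2 + x1^2
constraint_curvature = lambda x, lam: lam[0] * np.array([[2.0, 0.0],   # d/dx [ (dc/dx)^T lam ]
                                                         [0.0, 0.0]])

# Full Newton keeps the constraint-curvature term; Gauss-Newton simply drops it
hess_L_full = lambda x, lam: hess_f(x) + constraint_curvature(x, lam)
hess_L_gn = lambda x, lam: hess_f(x)

x, lam = np.array([1.0, 0.0]), np.array([0.5])
print(hess_L_full(x, lam))   # [[3., 0.], [0., 2.]]
print(hess_L_gn(x, lam))     # [[2., 0.], [0., 2.]]
```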
4.2 Inequality Constraints
4.2.1 LICQ
The condition required for the KKT Conditions to hold.
"The Active Set $\mathcal{A}(x)$ of any feasible point $x$ consists of the equality constraints and the inequality constraints that hold with equality, i.e. $d_i(x) = 0$."
"Given a point $x$ and the Active Set of constraints $\mathcal{A}(x)$, we say that the Linear Independence Constraint Qualification (LICQ) holds if the gradients of all the active constraints are linearly independent."
4.2.2 KKT Conditions
The 1st-Order Necessary Conditions of an optimization problem with inequality constraints = the Karush-Kuhn-Tucker (KKT) Conditions.
- The full Lagrangian
  For the problem $\min_{x} f(x)$ s.t. $c(x) = 0$ and $d(x) \leq 0$, the Lagrangian is
  $L(x, \lambda, \mu) = f(x) + \lambda^T c(x) + \mu^T d(x)$
  It has two Lagrange Multipliers, $\lambda$ and $\mu$.
  "Suppose that $x^*$ is a local solution, that the LICQ holds at $x^*$, and that $f$, $c$, and $d$ are continuously differentiable; then there are Lagrange multipliers $\lambda^*$ and $\mu^*$ such that the KKT Conditions are satisfied."
- KKT Conditions
  - Stationarity: $\nabla f(x) + \left( \frac{\partial c}{\partial x} \right)^T \lambda + \left( \frac{\partial d}{\partial x} \right)^T \mu = 0$
  - Primal Feasibility: $c(x) = 0$, $d(x) \leq 0$
  - Dual Feasibility: $\mu \geq 0$
  - Complementary Slackness: $\mu_i \, d_i(x) = 0$ for every $i$
  The first two conditions need little comment; the last two are trickier.
  Given that Dual Feasibility holds, Complementary Slackness means "Either $\mu_i = 0$ or $d_i(x) = 0$".
When evaluating an inequality constraint $d_i(x)$, there are three possible cases:
- $d_i(x) < 0$: feasible point (not on the boundary), so the constraint is inactive ($\mu_i = 0$)
- $d_i(x) = 0$: feasible point (on the boundary), so the constraint is active ($\mu_i \geq 0$)
- $d_i(x) > 0$: infeasible point
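A small numeric checker for these conditions under this notation; the test problem (minimize $x_0^2 + x_1^2$ subject to $x_0 + x_1 \geq 1$, written as $d(x) = 1 - x_0 - x_1 \leq 0$) is an illustrative assumption:

```python
import numpy as np

def kkt_satisfied(grad_f, d, jac_d, x, mu, tol=1e-8):
    """Check stationarity, primal/dual feasibility, and complementary slackness."""
    stationarity = np.linalg.norm(grad_f(x) + jac_d(x).T @ mu) < tol
    primal = np.all(d(x) <= tol)
    dual = np.all(mu >= -tol)
    comp_slack = np.linalg.norm(mu * d(x)) < tol
    return stationarity and primal and dual and comp_slack

# min x0^2 + x1^2  s.t.  x0 + x1 >= 1, written as d(x) = 1 - x0 - x1 <= 0
grad_f = lambda x: 2 * x
d = lambda x: np.array([1.0 - x[0] - x[1]])
jac_d = lambda x: np.array([[-1.0, -1.0]])

# Candidate on the boundary (active constraint): x = (0.5, 0.5), mu = 1
print(kkt_satisfied(grad_f, d, jac_d, np.array([0.5, 0.5]), np.array([1.0])))   # True
# The unconstrained minimizer x = (0, 0) is infeasible here, so the KKT check fails
print(kkt_satisfied(grad_f, d, jac_d, np.array([0.0, 0.0]), np.array([0.0])))   # False
```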
