Optimization - Basics

1. Matrix Derivatives

Objective Function $f(x): \mathbb{R}^n \to \mathbb{R}$

  • 1st-Order Gradient

    By convention this is a Row Vector:

    $$\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times n}$$

    It is the linear operator mapping $\Delta x$ into $\Delta f$:

    $$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x}\Delta x$$

    When writing code, this convention keeps the dimensions consistent and saves a lot of trouble (see the sketch after this list).

    When writing formulas, however, the other notation is often more convenient:

    $$\nabla f(x) = \left(\frac{\partial f}{\partial x}\right)^T = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix} \in \mathbb{R}^n$$
  • 2nd-Order Gradient

    Hessian

    $$\nabla^2 f(x) = \frac{\partial}{\partial x}\Big(\nabla f(x)\Big) = \frac{\partial^2 f}{\partial x^2} = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix} \in \mathbb{R}^{n \times n}$$

    The Hessian is a Symmetric Matrix, so the notation doesn't matter here; the dimensions always work out.

    It can be used to write the 2nd-Order Taylor Expansion of the objective function:

    $$f(x + \Delta x) \approx f(x) + \frac{\partial f}{\partial x}\Delta x + \frac{1}{2}(\Delta x)^T \frac{\partial^2 f}{\partial x^2}\Delta x$$
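Below is a quick numerical sanity check of these conventions, as a minimal sketch assuming numpy; the test function and the finite-difference helper are made up for illustration.

```python
import numpy as np

# Dimension check of the conventions above: f: R^n -> R, the row vector
# df/dx is 1 x n, so (df/dx) @ dx is a scalar first-order change in f.

def f(x):
    # arbitrary smooth test function (illustrative)
    return np.sin(x[0]) + x[0] * x[1] ** 2

def grad_row(f, x, eps=1e-6):
    # forward-difference approximation of df/dx as a (1, n) row vector
    n = x.size
    g = np.zeros((1, n))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        g[0, i] = (f(x + dx) - f(x)) / eps
    return g

x = np.array([0.3, -1.2])
dx = 1e-4 * np.array([1.0, 2.0])

# First-order expansion: f(x + dx) ≈ f(x) + (df/dx) @ dx
lhs = f(x + dx)
rhs = f(x) + (grad_row(f, x) @ dx).item()
print(lhs - rhs)   # O(||dx||^2), i.e. tiny
```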

Objective Function $g(y): \mathbb{R}^n \to \mathbb{R}^m$

  • 1st-Order Gradient

    By convention this is an $n \times m$ matrix:

    $$\nabla g(y) = \frac{\partial g}{\partial y} = \begin{pmatrix} \frac{\partial g_1}{\partial y_1} & \frac{\partial g_2}{\partial y_1} & \cdots & \frac{\partial g_m}{\partial y_1} \\ \frac{\partial g_1}{\partial y_2} & \frac{\partial g_2}{\partial y_2} & \cdots & \frac{\partial g_m}{\partial y_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial g_1}{\partial y_n} & \frac{\partial g_2}{\partial y_n} & \cdots & \frac{\partial g_m}{\partial y_n} \end{pmatrix} \in \mathbb{R}^{n \times m}$$

    Its transpose is the (standard) Jacobian:

    $$J(y) = \left(\frac{\partial g}{\partial y}\right)^T \in \mathbb{R}^{m \times n}$$

Notes:

The conventions above for the first derivatives of the two kinds of objective functions, i.e. the row vector $\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times n}$ and $\frac{\partial g}{\partial y} \in \mathbb{R}^{n \times m}$ with Jacobian $J(y) = \left(\frac{\partial g}{\partial y}\right)^T$, make the Chain Rule come out as plain matrix multiplication. For a composition $f(g(y))$, where $f$ takes the $m$-dimensional output of $g$ as its input:

$$f(g(y + \Delta y)) \approx f(g(y)) + \frac{\partial f}{\partial x}\bigg|_{x = g(y)} J(y)\, \Delta y$$
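A small dimension check of this chain rule, as an illustrative sketch; the functions $f$ and $g$ below are made up, with $y \in \mathbb{R}^3$ and $g(y) \in \mathbb{R}^2$.

```python
import numpy as np

# Chain-rule dimension check: f: R^2 -> R (so df/dx is 1 x 2),
# g: R^3 -> R^2 (so the Jacobian J = (dg/dy)^T is 2 x 3).

def g(y):
    return np.array([y[0] * y[1], np.sin(y[2])])

def f(x):
    return x[0] ** 2 + 3 * x[1]

def df_dx(x):                       # 1 x 2 row vector
    return np.array([[2 * x[0], 3.0]])

def jacobian_g(y):                  # 2 x 3
    return np.array([[y[1], y[0], 0.0],
                     [0.0, 0.0, np.cos(y[2])]])

y = np.array([1.0, 2.0, 0.5])
dy = 1e-5 * np.array([1.0, -1.0, 2.0])

lhs = f(g(y + dy)) - f(g(y))
rhs = (df_dx(g(y)) @ jacobian_g(y) @ dy).item()   # (1x2)(2x3)(3x1) -> scalar
print(lhs, rhs)                                   # agree to first order
```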

 

 

 

2. Root Finding

Given $f(x)$, find $x^*$ such that $f(x^*) = 0$. In a dynamics setting this is just finding an Equilibrium of the Dynamics.

  • In Continuous Time

    For continuous-time dynamics $\dot{x} = f(x)$, just require the derivative to be zero:

    $$f(x)\big|_{x = x^*} = 0$$
  • In Discrete Time

    Discrete-Time Dynamics generally take the form $x_{k+1} = f(x_k)$; here we just look for a fixed point:

    $$x^* = f(x^*)$$

 

 

2.1 Newton's Method

The most basic algorithm for root finding.

 

2.1.1 Algorithm

Assume the objective function $f(x)$ is smooth and the root $x^*$ lies where $\nabla f(x^*) = 0$. Then:

  1. Fit a linear approximation to $\nabla f(x)$ and set it to zero:

    $$\nabla f(x + \Delta x) \approx \nabla f(x) + \frac{\partial}{\partial x}\Big(\nabla f(x)\Big)\Delta x = \nabla f(x) + \nabla^2 f(x)\,\Delta x = 0$$

  2. Solve for $\Delta x$:

    $$\underbrace{\Delta x}_{\text{"descent"}} = -\underbrace{(\nabla^2 f(x))^{-1}}_{\text{"learning rate"}}\underbrace{\nabla f(x)}_{\text{"gradient"}}$$

    • $\nabla^2 f(x) > 0$ ⇒ "Descent" (minimization)

    • $\nabla^2 f(x) < 0$ ⇒ "Ascent" (maximization)

  3. Update $x$:

    $$x \leftarrow x + \Delta x$$

  4. Repeat Steps 1-3 until convergence

Notes:

Under these conditions alone, Newton's Method only converges to the Fixed Point closest to the Initial Guess.

Who knows whether the root it finds is a {global / local} {maximum / minimum} or a saddle point; that part is on you.
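A minimal sketch of the loop above in code; the convex test function, tolerance, and iteration cap are illustrative assumptions.

```python
import numpy as np

# Plain Newton's method on ∇f(x) = 0 (Steps 1-3 above).

def grad(x):
    # gradient of f(x, y) = x^4/4 + y^2/2 (simple convex test function)
    return np.array([x[0] ** 3, x[1]])

def hess(x):
    return np.array([[3 * x[0] ** 2, 0.0],
                     [0.0, 1.0]])

def newton(x0, tol=1e-8, max_iters=50):
    x = x0.astype(float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # converged: ∇f(x) ≈ 0
            break
        dx = -np.linalg.solve(hess(x), g)    # Δx = -(∇²f)^(-1) ∇f
        x = x + dx                           # x ← x + Δx
    return x

print(newton(np.array([2.0, -1.0])))  # → approximately [0, 0]
```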

 

2.1.2 Regularization

Step 2 of 2.1.1 mentioned the condition that guarantees descent.

If $\nabla^2 f(x) > 0$ for every $x$ (which means $f(x)$ is strongly convex), then we can always find a minimum with Newton's Method.

Unfortunately, in practice this is usually not the case for nonlinear problems.

The fix is regularization, which works as follows (a sketch in code follows the list):

  1. Pick a scalar hyperparameter β

  2. Check whether the Hessian $H = \nabla^2 f(x)$ is Positive Definite:

    $$H > 0$$

  3. If not, then

    $$H \leftarrow H + \beta I$$

  4. Repeat Steps 2-3 until $H$ is Positive Definite

  • It guarantees descent

  • Also called "Damped Newton"
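A minimal sketch of this check-and-bump loop, assuming numpy; using a Cholesky factorization as the positive-definiteness test, and the sample values of $\beta$, $H$, and the gradient, are my own illustrative choices.

```python
import numpy as np

# Regularized ("damped") Newton step: bump H by beta*I until it is
# positive definite, then solve for the step.

def regularized_newton_step(H, grad, beta=1.0):
    H = H.copy()
    while True:
        try:
            np.linalg.cholesky(H)   # succeeds only if H > 0
            break
        except np.linalg.LinAlgError:
            H += beta * np.eye(H.shape[0])
    return -np.linalg.solve(H, grad)

# Indefinite Hessian example: plain Newton would step toward a saddle.
H = np.array([[1.0, 0.0], [0.0, -2.5]])
g = np.array([0.5, 0.5])
print(regularized_newton_step(H, g))  # now a descent direction
```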

 

 

2.2 Line Search (Back Tracking)

Newton's Method does not have an adaptive step size, so sooner or later some step will overshoot the minimum.

Line search is a patch on top of Newton's Method to make sure the step agrees with the linearization within some tolerance.

Specifically, we check $f(x + \Delta x)$ and "back track" until we get a sufficient reduction.

Let $p$ be the Direction of Descent.

So by construction $(\nabla f(x))^T p < 0$; keep this in mind.

There are several effective strategies

 

2.2.1 Armijo Condition

or Sufficient Decrease Condition

  1. Set

    • step length $\alpha = 1$

    • tolerance $c_1$

    • $\alpha$'s reduction rate $b \in (0, 1)$

  2. Check whether

    $$f(x + \alpha p) \leq f(x) + c_1 \alpha\, (\nabla f(x))^T p$$

    Here, $\Delta x = \alpha p$

  3. Update

    $$\alpha \leftarrow b\,\alpha$$

    and repeat Steps 2-3 until the condition in Step 2 is fulfilled

Typical parameter values are

$$b = 0.5, \qquad c_1 \in [10^{-4},\, 0.1]$$

In the figure below, the acceptable region corresponds to step lengths that do not overshoot the minimum.

The unacceptable region has clearly overshot, i.e. the slanted dashed sufficient-decrease line lies below the function itself there.

[Figure: Armijo condition]
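A minimal sketch of backtracking with the Armijo condition; the quadratic test function and starting point are illustrative assumptions.

```python
import numpy as np

# Backtracking line search with the Armijo (sufficient decrease) condition.

def backtracking(f, grad, x, p, alpha=1.0, b=0.5, c1=1e-4):
    # shrink alpha until f(x + alpha*p) <= f(x) + c1*alpha*grad(x)^T p
    fx, slope = f(x), grad(x) @ p
    assert slope < 0, "p must be a descent direction"
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= b
    return alpha

f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([1.0, 1.0])
p = -grad(x)                      # steepest-descent direction
print(backtracking(f, grad, x, p))
```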

 

2.2.2 Curvature Condition

From the figure in 2.2.1 it is easy to see that a small enough $\alpha$ always satisfies the Armijo Condition.

But that can make the amount of descent too small, and the curvature condition is what prevents $\alpha$ from becoming too small:

$$(\nabla f(x + \alpha p))^T p \geq c_2 (\nabla f(x))^T p$$

where

$$c_2 \in (c_1, 1)$$

This condition can be read as "The slope at the next iterate is 'less negative' than the current slope".

So if this condition is satisfied, and knowing $(\nabla f(x))^T p < 0$, the new slope falls into one of two cases:

  1. Closer to 0 than the current slope, so the function value is closer to the minimum

  2. Already a positive slope, so the function value has passed the minimum

[Figure: curvature condition]

There is also a stricter version of the curvature condition:

$$\left|(\nabla f(x + \alpha p))^T p\right| \leq \left|c_2 (\nabla f(x))^T p\right|$$

With the absolute-value bound, the only option left is to push the slope toward 0.

 

2.2.3 Wolfe Conditions

The Wolfe Conditions simply combine the Armijo Condition and the Curvature Condition.

[Figure: Wolfe conditions]

  • Weak Wolfe Conditions

    $$\begin{aligned} f(x + \alpha p) &\leq f(x) + c_1 \alpha (\nabla f(x))^T p \\ (\nabla f(x + \alpha p))^T p &\geq c_2 (\nabla f(x))^T p \end{aligned}$$

  • Strong Wolfe Conditions

    $$\begin{aligned} f(x + \alpha p) &\leq f(x) + c_1 \alpha (\nabla f(x))^T p \\ \left|(\nabla f(x + \alpha p))^T p\right| &\leq \left|c_2 (\nabla f(x))^T p\right| \end{aligned}$$
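A quick check of the weak / strong Wolfe conditions for a given step length, as an illustrative sketch; the test function, point, and constants are my own assumptions.

```python
import numpy as np

# Returns True if the (weak or strong) Wolfe conditions hold at step alpha.

def wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9, strong=False):
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * grad(x) @ p
    new_slope, old_slope = grad(x + alpha * p) @ p, grad(x) @ p
    if strong:
        curvature = abs(new_slope) <= abs(c2 * old_slope)
    else:
        curvature = new_slope >= c2 * old_slope
    return armijo and curvature

f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, -2.0])
p = -grad(x)
print(wolfe(f, grad, x, p, alpha=0.5),    # True: decent step
      wolfe(f, grad, x, p, alpha=1e-6))   # False: alpha too small (curvature fails)
```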

 

 

 

3. Unconstrained Minimization

Find the minimum $x^*$ of the Objective Function $f(x)$:

$$\min_x f(x)$$

 

 

3.1 Why the Hessian Must Be Positive Definite

When using Newton's Method from 2.1 as the base algorithm for minimization, Step 2 deserves special attention. Left-multiplying both sides by $(\nabla f(x))^T$ gives

$$(\nabla f(x))^T \Delta x = -(\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x)$$

Now take the 2nd-Order Taylor Expansion of $f(x + \Delta x)$:

$$\begin{aligned} f(x + \Delta x) &\approx f(x) + \frac{\partial f}{\partial x}\Delta x + \frac{1}{2}(\Delta x)^T \frac{\partial^2 f}{\partial x^2}\Delta x \\ &= f(x) + (\nabla f(x))^T \Delta x + \frac{1}{2}(\Delta x)^T \nabla^2 f(x)\,\Delta x \\ &= f(x) - (\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x) + \frac{1}{2}\left[(\nabla^2 f(x))^{-1}\nabla f(x)\right]^T \nabla^2 f(x) \left[(\nabla^2 f(x))^{-1}\nabla f(x)\right] \\ &= f(x) - (\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x) + \frac{1}{2}(\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x) \\ &= f(x) - \frac{1}{2}(\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x) \end{aligned}$$

If $\nabla^2 f(x) > 0$ (Positive Definite) cannot be guaranteed, then clearly there is no guarantee that the value of $f$ decreases when stepping to $f(x + \Delta x)$, i.e. no guarantee that we are heading toward a Minimum rather than a Maximum, a Saddle Point, or some other weird place.
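A quick numerical check of the predicted decrease $f(x + \Delta x) \approx f(x) - \frac{1}{2}(\nabla f(x))^T (\nabla^2 f(x))^{-1} \nabla f(x)$; the test function is an illustrative assumption.

```python
import numpy as np

# Sanity check of the predicted decrease for one full Newton step.

f    = lambda x: np.exp(x[0]) + x[0] * x[1] + x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]) + x[1], x[0] + 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 1.0], [1.0, 2.0]])

x = np.array([0.1, -0.2])
g, H = grad(x), hess(x)          # H is positive definite at this point
dx = -np.linalg.solve(H, g)      # Newton step

predicted = f(x) - 0.5 * g @ np.linalg.solve(H, g)
print(f(x + dx), predicted)      # close, up to higher-order terms
```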

 

 

3.2 Conditions for a Local Minimum

$$\min_x f(x)$$

Assume $x^*$ is a Local Minimizer and $f(x)$ is continuously differentiable. Then:

  • 1st-Order Necessary Condition

    $$\nabla f(x^*) = 0$$

  • 2nd-Order Necessary Condition

    $$\nabla^2 f(x^*) \geq 0$$

  • 2nd-Order Sufficient Condition

    If this condition is fulfilled, then $x^*$ is a Strict Local Minimizer of $f(x)$

    $$\nabla^2 f(x^*) > 0$$
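A minimal sketch of checking these conditions numerically via the gradient norm and the eigenvalues of the Hessian; the thresholds and test function are illustrative assumptions.

```python
import numpy as np

# Classify a candidate point using the conditions above.

def classify(grad, hess, x, tol=1e-8):
    if np.linalg.norm(grad(x)) > tol:
        return "not a stationary point"
    eigs = np.linalg.eigvalsh(hess(x))
    if np.all(eigs > tol):
        return "strict local minimum"      # 2nd-order sufficient condition
    if np.all(eigs > -tol):
        return "possible minimum (semidefinite, inconclusive)"
    if np.all(eigs < -tol):
        return "local maximum"
    return "saddle point"

# f(x, y) = x^2 - y^2 has a saddle at the origin
grad = lambda x: np.array([2 * x[0], -2 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify(grad, hess, np.zeros(2)))   # -> saddle point
```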

 

 

 

4. Constrained Minimization

Find the minimum $x^*$ of the Objective Function $f(x)$ subject to the following constraints:

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0, \quad h(x) \leq 0$$

In the end we have to turn a Constrained Problem into an Unconstrained Problem in order to solve it.

 

 

4.1 Equality Constraints

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0$$

where

  • $f(x): \mathbb{R}^n \to \mathbb{R}$

  • $g(x): \mathbb{R}^n \to \mathbb{R}^m$

The 1st-Order Necessary Condition to find the minimum is

  1. $\nabla f(x) = 0$

    This only needs to hold along the FREE directions

  2. $g(x) = 0$

    The Equality Constraint determines which directions count as Free Directions for $f(x)$

 

4.1.1 Lagrangian

The Lagrangian turns the Constrained Problem above into an Unconstrained Problem. Here is the derivation.

  • About the Equality Constraint

    "What is a Free Direction?" In short, if $x$ makes a small displacement $\Delta x$ and still satisfies $g(x) = 0$, then that displacement lies in a "Free Direction", i.e.

    $$g(x + \Delta x) = 0$$

    A 1st-Order Taylor Expansion gives

    $$g(x + \Delta x) \approx g(x) + (\nabla g(x))^T \Delta x$$

    Notes: $(\nabla g(x))^T$ is the Jacobian of $g$

    Since $g(x) = 0$, the condition that actually needs to hold is

    $$(\nabla g(x))^T \Delta x = 0$$
  • About the Objective Function

    Likewise, suppose $x$ makes a small displacement $\Delta x$; a 1st-Order Taylor Expansion of the Objective Function gives

    $$f(x + \Delta x) \approx f(x) + (\nabla f(x))^T \Delta x$$

    If $x$ is not yet the minimizer $x^*$, then as long as $(\nabla f(x))^T \Delta x < 0$ it can keep descending. Likewise, if $x$ moves in the opposite direction, i.e. $-\Delta x$, then

    $$f(x - \Delta x) \approx f(x) + (\nabla f(x))^T (-\Delta x)$$

    and as long as $(\nabla f(x))^T (-\Delta x) < 0$ it can also keep descending.

    The Tangent Space is symmetric; I'm too lazy to explain that, go look it up yourself if it's unclear...

    Anyway, if no descent is supposed to be possible at $x$, then $x$ must satisfy

    $$(\nabla f(x))^T \Delta x \geq 0 \quad \text{and} \quad (\nabla f(x))^T (-\Delta x) \geq 0$$

    In other words, $x = x^*$ must satisfy

    $$(\nabla f(x))^T \Delta x = 0$$
  • Summary

    The conclusion from the two parts above is

    $$(\nabla f(x))^T \Delta x = 0 \quad \text{and} \quad (\nabla g(x))^T \Delta x = 0$$

    That is, both $\nabla f(x)$ and $\nabla g(x)$ must be normal to the displacement direction $\Delta x$; in other words, $\nabla f(x)$ must be parallel to $\nabla g(x)$ (more precisely, a linear combination of the constraint gradients):

    $$\nabla f(x) \parallel \nabla g(x)$$

    or, written out,

    $$\nabla f(x) + \nabla g(x)\,\lambda = 0$$

    [ Example ]

    At the minimum you can see that the gradients of the objective function and the equality constraint are parallel:

    $$\min_{x_1, x_2} \; x_1^2 + x_2^2 \quad \text{subject to} \quad (x_1 - 2)^2 + x_2^2 - 1 = 0$$

    [Figure: Lagrangian example]

  • Lagrangian

    Based on this gradient condition, we define the Lagrangian as

    $$L(x, \lambda) = f(x) + \lambda^T g(x)$$

    such that the 1st-Order Necessary Condition becomes

    $$\nabla L(x, \lambda) = 0$$

    or equivalently

    $$\begin{aligned} \nabla_x L(x, \lambda) &= \nabla_x f(x) + \nabla_x g(x)\,\lambda = 0 \\ \nabla_\lambda L(x, \lambda) &= g(x) = 0 \end{aligned}$$
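As a quick sketch (assuming numpy), the condition $\nabla L = 0$ can be verified numerically on the example above; the minimizer works out to $x^* = (1, 0)$ with $\lambda^* = 1$.

```python
import numpy as np

# Verify ∇L = 0 on: min x1^2 + x2^2  s.t.  (x1 - 2)^2 + x2^2 - 1 = 0.

grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])
g      = lambda x: np.array([(x[0] - 2) ** 2 + x[1] ** 2 - 1])
grad_g = lambda x: np.array([[2 * (x[0] - 2)], [2 * x[1]]])   # n x m = 2 x 1

x_star, lam_star = np.array([1.0, 0.0]), np.array([1.0])

stationarity = grad_f(x_star) + grad_g(x_star) @ lam_star     # ∇x L
primal       = g(x_star)                                      # ∇λ L
print(stationarity, primal)   # both ≈ 0
```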

 

4.1.2 KKT System

KKT = Karush-Kuhn-Tucker. The more commonly seen related concept is the KKT Conditions, see 4.2.2.

The goal is to use Newton's Method to solve for the $(x, \lambda)$ that makes $\nabla L(x, \lambda) = 0$. First take a 1st-Order Taylor Expansion:

$$\begin{cases} \nabla_x L(x, \lambda) = \nabla_x f(x) + \nabla_x g(x)\,\lambda = 0 \\ \nabla_\lambda L(x, \lambda) = g(x) = 0 \end{cases} \;\Longrightarrow\; \begin{cases} \nabla_x L(x + \Delta x, \lambda + \Delta\lambda) \approx \nabla_x L(x, \lambda) + \dfrac{\partial^2 L}{\partial x^2}\Delta x + \dfrac{\partial^2 L}{\partial x \partial \lambda}\Delta\lambda = 0 \\ \nabla_\lambda L(x + \Delta x, \lambda + \Delta\lambda) \approx g(x) + \left(\dfrac{\partial g}{\partial x}\right)^T \Delta x + \dfrac{\partial g}{\partial \lambda}\Delta\lambda = 0 \end{cases}$$

Since $\frac{\partial^2 L}{\partial x \partial \lambda} = \frac{\partial g}{\partial x}$ and $\frac{\partial g}{\partial \lambda} = 0$, this becomes

$$\begin{cases} \nabla_x L(x + \Delta x, \lambda + \Delta\lambda) \approx \nabla_x L(x, \lambda) + \dfrac{\partial^2 L}{\partial x^2}\Delta x + \dfrac{\partial g}{\partial x}\Delta\lambda = 0 \\ \nabla_\lambda L(x + \Delta x, \lambda + \Delta\lambda) \approx g(x) + \left(\dfrac{\partial g}{\partial x}\right)^T \Delta x = 0 \end{cases}$$

Rearranging:

$$\begin{aligned} \frac{\partial^2 L}{\partial x^2}\Delta x + \frac{\partial g}{\partial x}\Delta\lambda &= -\nabla_x L(x, \lambda) \\ \left(\frac{\partial g}{\partial x}\right)^T \Delta x &= -g(x) \end{aligned}$$

Written in Matrix Form, this is the KKT System:

$$\begin{bmatrix} \dfrac{\partial^2 L}{\partial x^2} & \dfrac{\partial g}{\partial x} \\ \left(\dfrac{\partial g}{\partial x}\right)^T & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta\lambda \end{bmatrix} = \begin{bmatrix} -\nabla_x L(x, \lambda) \\ -g(x) \end{bmatrix}$$
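A minimal sketch of iterating this full-Newton KKT step on the equality-constrained example from 4.1.1; the starting guess and iteration count are illustrative assumptions.

```python
import numpy as np

# Full-Newton KKT step for: min x1^2 + x2^2  s.t.  (x1 - 2)^2 + x2^2 - 1 = 0.

def kkt_step(x, lam):
    grad_f = np.array([2 * x[0], 2 * x[1]])
    g      = np.array([(x[0] - 2) ** 2 + x[1] ** 2 - 1])
    grad_g = np.array([[2 * (x[0] - 2)], [2 * x[1]]])          # ∂g/∂x, n x m
    d2L    = 2 * np.eye(2) + lam[0] * 2 * np.eye(2)            # ∇²f + λ ∇²g

    KKT = np.block([[d2L,      grad_g],
                    [grad_g.T, np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f + grad_g @ lam, g])
    dz  = np.linalg.solve(KKT, rhs)
    return x + dz[:2], lam + dz[2:]

x, lam = np.array([1.0, 0.5]), np.array([0.0])
for _ in range(10):
    x, lam = kkt_step(x, lam)
print(x, lam)   # → approximately [1, 0] and [1]
```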

 

4.1.3 Gauss-Newton Method

This method simplifies the computation of the $\frac{\partial^2 L}{\partial x^2}$ term that shows up in 4.1.2, which expands to

$$\frac{\partial^2 L}{\partial x^2} = \nabla_x^2 f(x) + \underbrace{\frac{\partial}{\partial x}\Big[\nabla_x g(x)\,\lambda\Big]}_{\text{This is a Tensor}}$$

The 2nd term is expensive to compute because it is a bloody Tensor!

So we just drop it! We simply don't keep the second term! Ha! Ha! Ha!

Gauss-Newton has slightly slower convergence than Full Newton:

it needs more Iterations, but each Iteration becomes cheaper (see the sketch below).
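A minimal sketch of the Gauss-Newton variant on the same example as the 4.1.2 sketch, approximating $\frac{\partial^2 L}{\partial x^2} \approx \nabla^2 f(x)$ only; everything else is the same illustrative setup.

```python
import numpy as np

# Gauss-Newton KKT step: drop the λ-weighted constraint-curvature term.

def gn_kkt_step(x, lam):
    grad_f = np.array([2 * x[0], 2 * x[1]])
    g      = np.array([(x[0] - 2) ** 2 + x[1] ** 2 - 1])
    grad_g = np.array([[2 * (x[0] - 2)], [2 * x[1]]])
    H      = 2 * np.eye(2)                      # ∇²f only (Gauss-Newton)

    KKT = np.block([[H, grad_g], [grad_g.T, np.zeros((1, 1))]])
    dz  = np.linalg.solve(KKT, -np.concatenate([grad_f + grad_g @ lam, g]))
    return x + dz[:2], lam + dz[2:]

x, lam = np.array([1.0, 0.5]), np.array([0.0])
for _ in range(30):                 # typically needs more iterations than full Newton
    x, lam = gn_kkt_step(x, lam)
print(x, lam)                       # ≈ [1, 0] and [1], reached more slowly
```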

 

 

4.2 Inequality Constraints

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0, \quad h(x) \leq 0$$

 

4.2.1 LICQ

The conditions under which the KKT Conditions hold:

  • The Active Set $\mathcal{A}(x)$ of any feasible point $x$ consists of the equality constraints and the inequality constraints for which $h_i(x) = 0$

  • Given a point $x$ and the Active Set of constraints $\mathcal{A}(x)$, we say that the Linear Independence Constraint Qualification (LICQ) holds if the gradients of all the active constraints are linearly independent

 

4.2.2 KKT Conditions

The 1st-Order Necessary Conditions for an optimization problem with inequality constraints are the Karush-Kuhn-Tucker (KKT) Conditions.

  • The full Lagrangian

    with two Lagrange Multipliers $\lambda$ and $\mu$:

    $$L(x, \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T h(x)$$

    "Suppose that $x^*$ is a local solution, that the LICQ holds at $x^*$, and that $f(x)$, $g_i(x)$ and $h_i(x)$ are continuously differentiable; then there are Lagrange multipliers $\lambda^*$ and $\mu^*$ such that the KKT Conditions are satisfied"

  • KKT Conditions

    $$\begin{aligned} \text{Stationarity:} \quad & \nabla_x L(x, \lambda, \mu) = 0 \\ \text{Primal Feasibility:} \quad & g(x) = 0, \quad h(x) \leq 0 \\ \text{Dual Feasibility:} \quad & \mu_i \geq 0 \quad \forall i \\ \text{Complementary Slackness:} \quad & \mu_i h_i(x) = 0 \quad \forall i \end{aligned}$$

The first two conditions are straightforward; the last two are trickier.

Given Dual Feasibility, Complementary Slackness means "Either $\mu_i = 0$ or $h_i(x) = 0$".

When evaluating an inequality constraint at a point, there are three possible cases (see the check below):

  1. $h(x) < 0$ - feasible point (not on the boundary), so the constraint is inactive ($\mu_i = 0$)

  2. $h(x) = 0$ - feasible point (on the boundary), so the constraint is active ($\mu_i \geq 0$)

  3. $h(x) > 0$ - infeasible point

[Figure: complementary slackness]
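A tiny illustrative check of the KKT conditions; the one-dimensional problem, solution values, and tolerances below are my own assumptions: $\min_x x^2$ subject to $h(x) = 1 - x \leq 0$, whose solution is $x^* = 1$ with $\mu^* = 2$.

```python
import numpy as np

# KKT check for: min x^2  s.t.  h(x) = 1 - x <= 0  (i.e. x >= 1).

def kkt_check(x, mu, tol=1e-8):
    grad_f = 2 * x
    h, grad_h = 1 - x, -1.0
    stationarity = abs(grad_f + mu * grad_h) < tol   # ∇x L = 0
    primal       = h <= tol                          # h(x) <= 0
    dual         = mu >= -tol                        # μ >= 0
    slackness    = abs(mu * h) < tol                 # μ h(x) = 0
    return stationarity and primal and dual and slackness

print(kkt_check(1.0, 2.0))   # True: constraint active on the boundary, μ > 0
print(kkt_check(0.0, 0.0))   # False: x = 0 is infeasible here
```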