Machine Learning - NN
1. Step Function as Features
A linear classifier in 2D-space separates a group of training data with two features, $x_1$ and $x_2$

A step function can be used as a feature; its difference from the previous case is that this time the feature itself also has parameters. Consider the two step functions below:

$\phi_1(x) = \text{step}(\theta_1^\top x + \theta_{1,0}), \qquad \phi_2(x) = \text{step}(\theta_2^\top x + \theta_{2,0})$

where $\text{step}(z) = \{z > 0\}$, and $\{\cdot\}$ is evaluated as 1 if what is inside the curly braces is true (0 otherwise); each $\phi_j$ is a step function of $\theta_j^\top x + \theta_{j,0}$, and $\theta_j^\top x + \theta_{j,0} = 0$ is the hyperplane of a linear classifier
Construct a feature vector $\phi(x) = [\phi_1(x), \phi_2(x)]^\top$ for a data point $x$ out of the above two step functions
Then we implement linear classification in 3D-space with this feature vector
We could assign a set of weight values and visualize this classification hypothesis

If we add a third feature like this

We can see that there are some overlapping regions in the visualization of this hypothesis, but since we have set the threshold value to 0, the overlaps do not affect the classification results
Theoretically, though, we could be more "confident" when classifying training data that lie in those regions

[TIP] To better understand the meaning of this process, think about it in some context:
We want to know which planets are suitable for human immigration, and the most decisive elements are precipitation and temperature
Then we can interpret the three features as:
- we don't want to immigrate if there's too much rain
- we don't want to immigrate if it's too hot
- we don't want to immigrate if it's too cold
So finally, in the top-view graph on the right-hand side, only planets lying in the green zone are habitable
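To make this concrete, here is a minimal NumPy sketch of the planet example; the precipitation/temperature thresholds and the second-stage weights are made-up values, purely for illustration:

```python
import numpy as np

def step(z):
    # {z > 0}: 1 if the condition inside the braces is true, else 0
    return (z > 0).astype(float)

def features(x):
    # x = [precipitation, temperature]; thresholds are made up for illustration
    phi1 = step(50.0 - x[0])      # 1 if there is not too much rain
    phi2 = step(35.0 - x[1])      # 1 if it is not too hot
    phi3 = step(x[1] - (-10.0))   # 1 if it is not too cold
    return np.array([phi1, phi2, phi3])

def predict(x, w=np.array([1.0, 1.0, 1.0]), w0=-2.5):
    # second-stage linear classifier on the step features, with threshold 0
    return 1 if w @ features(x) + w0 > 0 else 0

print(predict(np.array([30.0, 20.0])))   # mild planet  -> 1 (habitable)
print(predict(np.array([120.0, 20.0])))  # too rainy    -> 0
```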
2. Hypothesis
The process in 1. describes a Neural Network that has the following two layers:
- Construct the Features
- Assign the Labels
A superscript on a letter (in the form $W^{(1)}$, $A^{(1)}$, etc.) means we are at layer 1
- The 1st Layer: Construct the Features
  - Input
    One single data point of features, $X = A^{(0)} \in \mathbb{R}^{d}$, where $d$ is the number of features of a data point (in our case, $d = 2$)
  - Output
    Vector of (step function) features, $A^{(1)} \in \mathbb{R}^{m}$, where $m$ is the number of (step function) features (in our case, $m = 3$)
  - Computation
    - The $j^{\text{th}}$ feature: $A^{(1)}_j = \text{step}\big(W^{(1)\top}_j X + W^{(1)}_{j,0}\big)$, where the dimension of $W^{(1)}_j$ should be the same as that of $X$ (these weights play the role of the $\theta$'s in 1.)
    - All features at once: $A^{(1)} = \text{step}\big(W^{(1)\top} X + W^{(1)}_0\big)$, where the $W^{(1)}_j$ are horizontally stacked into $W^{(1)} \in \mathbb{R}^{d \times m}$ and the $W^{(1)}_{j,0}$ are vertically stacked into $W^{(1)}_0 \in \mathbb{R}^{m}$
- The 2nd Layer: Assign the Labels
  - Input
    The vector of (step function) features, i.e. the output of the 1st layer, $A^{(1)} \in \mathbb{R}^{m}$, where $m$ is the number of (step function) features (in our case, $m = 3$)
  - Output
    Vector of labels $A^{(2)} \in \mathbb{R}^{n}$, where $n$ is the number of labels (in our case, $n = 1$, as we are just trying to decide "yes" or "no")
    And yeah, all of this process forms a Neural Net as noted in the graph above
  - Computation
    - The $j^{\text{th}}$ label: $A^{(2)}_j = \text{step}\big(W^{(2)\top}_j A^{(1)} + W^{(2)}_{j,0}\big)$, where the dimension of $W^{(2)}_j$ is the same as that of $A^{(1)}$
    - All labels at once: $A^{(2)} = \text{step}\big(W^{(2)\top} A^{(1)} + W^{(2)}_0\big)$, where the $W^{(2)}_j$ are horizontally stacked into $W^{(2)} \in \mathbb{R}^{m \times n}$ and the $W^{(2)}_{j,0}$ are vertically stacked into $W^{(2)}_0 \in \mathbb{R}^{n}$
- Summary (writing the activation of each layer as $f^{(\ell)}$, a step function in our example)
  - The 1st layer: $A^{(1)} = f^{(1)}\big(W^{(1)\top} X + W^{(1)}_0\big)$
  - The 2nd layer: $A^{(2)} = f^{(2)}\big(W^{(2)\top} A^{(1)} + W^{(2)}_0\big)$
  - The Whole Process: $A^{(2)} = f^{(2)}\Big(W^{(2)\top} f^{(1)}\big(W^{(1)\top} X + W^{(1)}_0\big) + W^{(2)}_0\Big)$
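Using the notation above, here is a minimal NumPy sketch of the two-layer hypothesis; the weight values are made-up placeholders that mirror the planet example in 1.:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

# layer 1: d = 2 input features -> m = 3 step-function features
W1 = np.array([[-1.0,  0.0, 0.0],
               [ 0.0, -1.0, 1.0]])      # shape (d, m)
W1_0 = np.array([50.0, 35.0, 10.0])     # shape (m,)

# layer 2: m = 3 features -> n = 1 label
W2 = np.array([[1.0], [1.0], [1.0]])    # shape (m, n)
W2_0 = np.array([-2.5])                 # shape (n,)

def hypothesis(x):
    A1 = step(W1.T @ x + W1_0)          # A^(1) = f^(1)(W^(1)T X + W^(1)_0)
    A2 = step(W2.T @ A1 + W2_0)         # A^(2) = f^(2)(W^(2)T A^(1) + W^(2)_0)
    return A2

print(hypothesis(np.array([30.0, 20.0])))   # -> [1.]
```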
3. Function Graph
The following is a function graph created based on the hypothesis we developed (see 2.)
We have only one label to predict, so the output is just a single value $A^{(2)}$

Explanation
- This is a Fully Connected, Feed-Forward neural network
  - Fully Connected:
    A fully connected layer means that every neuron of the layer receives the same (full) set of inputs
    A fully connected network is composed of fully connected layers
  - Feed-Forward:
    Well, literally, all arrows point forward
    (There are some NNs whose arrows do not all point forward, like RNNs)
- Circles in the graph mean function evaluation. The summation signs $\Sigma$ inside refer to the linear parts of the layers:
  - 1st layer: $Z^{(1)} = W^{(1)\top} X + W^{(1)}_0$
  - 2nd layer: $Z^{(2)} = W^{(2)\top} A^{(1)} + W^{(2)}_0$
- Important Notation: $Z^{(\ell)}$ is the "Linear Part of layer $\ell$"; using the notations in 2., $Z^{(\ell)} = W^{(\ell)\top} A^{(\ell-1)} + W^{(\ell)}_0$ and $A^{(\ell)} = f^{(\ell)}(Z^{(\ell)})$
3.1 "Neuron"
Neurons/Units/Nodes are basic elements of a Neural Network
The following graph is a neuron taken from the network above: the summation sign represents the linear part $Z$, and $f$ represents the activation function applied to it
3.2 "Layer"
Multiple Neurons compose a layer
- Annotations
- Simplified Version
  A layer with 3 features $A_1$, $A_2$, $A_3$, each dependent on 2 parameters, $W_j$ and $W_{j,0}$
- Multiple (Hidden) Layers
  For example, suppose you have $L$ layers in a Neural Network. Then each layer could be represented with
  $A^{(\ell)} = f^{(\ell)}\big(W^{(\ell)\top} A^{(\ell-1)} + W^{(\ell)}_0\big)$
  where $\ell = 1, \dots, L$ and $A^{(0)} = X$
4. Activation Function
4.1 Example
- Identity Function: $f(z) = z$
  "Do Nothing"
- Sigmoid Function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  For classification to the range $(0, 1)$
- Hyperbolic Tangent: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  For classification to the range $(-1, 1)$
- Rectified Linear Unit: $\text{ReLU}(z) = \max(0, z)$
  Similar to the Identity Function, but zeroes out all values less than 0
  There is one single point ($z = 0$) that is non-differentiable, but we won't hit that point most of the time
- [Special] Softmax Function
  A popular choice for multi-class classification; see Special Topics - 1. for more information
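For quick reference, a minimal NumPy sketch of the activation functions listed above:

```python
import numpy as np

def identity(z):
    return z                            # "do nothing"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                   # maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)           # zeroes out negative values

def softmax(z):
    e = np.exp(z - np.max(z))           # subtract max(z) for numerical stability
    return e / e.sum()                  # entries sum to 1 (a probability distribution)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z))
```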
4.2 Function Selection
Technically, we cannot use Step Function as Activation Function due to the following concerns
- Derivatives are either 0 or undefined, so we cannot apply (Stochastic) Gradient Descent
- The output cannot take a range of values, so we cannot apply Regression
- The value is completely binary with no values in between, so we cannot apply NLL Loss
The hypothesis of the Neural Network in our example is (see 2.)
$A^{(2)} = f^{(2)}\big(W^{(2)\top} f^{(1)}(W^{(1)\top} X + W^{(1)}_0) + W^{(2)}_0\big)$
You can choose different Activation Functions based on what you want to do:
- Regression
  If $f^{(2)}$ can take on a range of values, then you can just use the Identity Function as $f^{(2)}$
- NLL Loss
  The Sigmoid Function does return values in $(0, 1)$, so... use the Sigmoid as $f^{(2)}$
- (Stochastic) Gradient Descent
  Both of the above choices for $f^{(2)}$ have non-zero derivatives. We also need to make sure $f^{(1)}$ is also differentiable, like... the Sigmoid or ReLU
[DEMO] Choosing different activation functions can generate different hyperplanes, as shown in the two example visualizations.
5. How to Train
A general form of the Learning (or Optimization) Objective is
$J(W) = \dfrac{1}{n}\sum_{i=1}^{n} L\big(h(x^{(i)}; W),\, y^{(i)}\big) + \lambda R(W)$
[Theorem] If the optimization objective is nice and convex, then (Stochastic) Gradient Descent performs well
BUT! Unfortunately, the objective of a Neural Network is a SUPER NON-CONVEX function of the parameters $W$!!!
Except for 1-layer NN...
5.1 Back-Propagation
For Supervised Learning, what we want to do is minimize the training loss over the weights: $\min_W J(W)$
Since (Stochastic) Gradient Descent is usually preferred, what we need is the gradient of the loss with respect to the weights of every layer, $\nabla_{W^{(\ell)}} L$
And, yeah, Back-Propagation is a very elegant way to find Gradients!
5.1.1 Some Reminders
- Chain Rule
  Back-Prop is based on the Chain Rule: $\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial A} \cdot \dfrac{\partial A}{\partial Z} \cdot \dfrac{\partial Z}{\partial W}$
- Weight Hiding
  The bias parameter $W^{(\ell)}_0$ can be cast into $W^{(\ell)}$ to form a general weight matrix, and we add a "1" at the end of the original $A^{(\ell-1)}$. In this way the formula for the linear part of the $\ell^{\text{th}}$ layer could be simplified to $Z^{(\ell)} = W^{(\ell)\top} A^{(\ell-1)}$. This is the notation we are using in this section.
  (See Machine Learning - Basics 1.6.5)
- Gradient with respect to What???
  Always find the gradient of the Loss function with respect to the weights of each layer, $\dfrac{\partial L}{\partial W^{(\ell)}}$. That tells us how much we want to change the weights in order to reduce the loss.
5.1.2 Computation Analysis
Assume the Neural Network has $L$ layers
- [How Training Loss depends on Weights of the Last Layer]
  Loss Function written with the linear part of the last Layer:
  $L = \text{Loss}\big(A^{(L)}, y\big) = \text{Loss}\big(f^{(L)}(Z^{(L)}), y\big) = \text{Loss}\big(f^{(L)}(W^{(L)\top} A^{(L-1)}), y\big)$
  Gradient Computation with respect to the weights of the last Layer:
  $\dfrac{\partial L}{\partial W^{(L)}} = \dfrac{\partial L}{\partial A^{(L)}} \cdot \dfrac{\partial A^{(L)}}{\partial Z^{(L)}} \cdot \dfrac{\partial Z^{(L)}}{\partial W^{(L)}}$
  Analysis:
  - First Term: $\dfrac{\partial L}{\partial A^{(L)}}$
    According to 2., it's just the derivative of the loss with respect to the output $A^{(L)}$, depending on the choice of Loss Function
  - Second Term: $\dfrac{\partial A^{(L)}}{\partial Z^{(L)}}$
    According to 3., $A^{(L)} = f^{(L)}(Z^{(L)})$. We know each $A^{(L)}_j$ only depends on its own $Z^{(L)}_j$, so this partial derivative is a diagonal matrix of size $n^{L} \times n^{L}$
  - Third Term: $\dfrac{\partial Z^{(L)}}{\partial W^{(L)}}$
    According to 2., $Z^{(L)} = W^{(L)\top} A^{(L-1)}$. If you just look at this equation, you can find that this value is just $A^{(L-1)}$
  - Proper Computation Sequence
    Obviously the dimensions of the terms of the above equation do not match
    The actual computation sequence is the arrangement below:
    $\dfrac{\partial L}{\partial W^{(L)}} = A^{(L-1)} \left( \dfrac{\partial A^{(L)}}{\partial Z^{(L)}} \cdot \dfrac{\partial L}{\partial A^{(L)}} \right)^{\top}$
    Then we could find this nice and neat general notation: $\dfrac{\partial L}{\partial W^{(L)}} = A^{(L-1)} \left( \dfrac{\partial L}{\partial Z^{(L)}} \right)^{\top}$
    By the above equation, we can see that since we have already found $A^{(L-1)}$ in the Forward Pass, we just need to find $\dfrac{\partial L}{\partial Z^{(L)}}$ to compute the gradient!!! Yeah!
- [How Training Loss depends on Weights of the First Layer]
  Too stupid if we still stick to the partial-derivative chain above...
  We can focus on finding $\dfrac{\partial L}{\partial Z^{(1)}}$ and reuse the same pattern:
  $\dfrac{\partial L}{\partial W^{(1)}} = A^{(0)} \left( \dfrac{\partial L}{\partial Z^{(1)}} \right)^{\top}$
  where $A^{(0)}$ is the initial data point; an equivalent notation is kind of "$X$" (see 2.)
5.1.3 General Equation
- [Back-Propagation to the $\ell^{\text{th}}$ Layer]
  $\dfrac{\partial L}{\partial W^{(\ell)}} = A^{(\ell-1)} \left( \dfrac{\partial L}{\partial Z^{(\ell)}} \right)^{\top}, \qquad \dfrac{\partial L}{\partial Z^{(\ell)}} = \dfrac{\partial A^{(\ell)}}{\partial Z^{(\ell)}} \cdot W^{(\ell+1)} \cdot \dfrac{\partial L}{\partial Z^{(\ell+1)}}$
  Dimensions of the terms match properly in this version
- [Summary]
  - Forward Pass
    For layers $\ell = 1, \dots, L$:
    compute all $Z^{(\ell)}$, $A^{(\ell)}$, and finally the loss
  - Back-Propagation
    For layers $\ell = L, \dots, 1$:
    compute all gradients $\dfrac{\partial L}{\partial Z^{(\ell)}}$ and $\dfrac{\partial L}{\partial W^{(\ell)}}$, and update the weights

5.2 Stochastic GD
Here is the pseudo-code implementation of a Neural Network using Stochastic Gradient Descent
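A minimal NumPy sketch of such an implementation, assuming (just for illustration) a 2-layer network with sigmoid activations, squared loss, and a toy labeling rule $y = 1$ if $x_1 + x_2 > 1$; it follows the Forward Pass / Back-Propagation summary in 5.1.3:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: d = 2 features, label y = 1 if x1 + x2 > 1 else 0
d, m = 2, 3                                  # m hidden units
X = rng.standard_normal((200, d))
Y = (X.sum(axis=1) > 1.0).astype(float)

# small random initialization (see 5.2.1)
W1 = 0.1 * rng.standard_normal((d, m)); b1 = np.zeros(m)
W2 = 0.1 * rng.standard_normal((m, 1)); b2 = np.zeros(1)

eta = 0.5
for t in range(20000):
    i = rng.integers(len(X))                 # randomly select one data point (SGD)
    x, y = X[i], Y[i]

    # forward pass: compute Z^(1), A^(1), Z^(2), A^(2)
    Z1 = W1.T @ x + b1;  A1 = sigmoid(Z1)
    Z2 = W2.T @ A1 + b2; A2 = sigmoid(Z2)
    # squared loss L = (A2 - y)^2  (loss/activation choice assumed for this sketch)

    # back-propagation: dL/dZ^(2), dL/dZ^(1), then dL/dW^(l) = A^(l-1) (dL/dZ^(l))^T
    dZ2 = 2 * (A2 - y) * A2 * (1 - A2)       # shape (1,)
    dZ1 = (W2 @ dZ2) * A1 * (1 - A1)         # shape (m,)
    W2 -= eta * np.outer(A1, dZ2); b2 -= eta * dZ2
    W1 -= eta * np.outer(x, dZ1);  b1 -= eta * dZ1

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel() > 0.5
print("train accuracy:", (pred == (Y > 0.5)).mean())
```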

5.2.1 Weight Initialization
- Random
  Different layers of the Neural Network address different things, so do not start everything at the same weights!
  Otherwise this will lead you to some local optimum rather than the global optimum!
- Not too Extreme
  Consider a Sigmoid as the activation function: if the weights are too big, the input to the Sigmoid will also be very big, which means the gradient will be almost 0!
  Consider a ReLU as the activation function: if the weights are too small (very negative), the input to the ReLU will lie in the negative half-plane, where the gradient is 0!
- Worst Case:
  Extremely big weights: the Gradient Explodes
  Weights extremely near 0: the Gradient Vanishes
  (see 5. for more information)
Generally, keep the initial weights at small random values to ensure the gradients are non-zero
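For example, a minimal initialization sketch (the scale 0.01 is just a typical small value, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, scale=0.01):
    # small random weights: break symmetry, keep activations/gradients in a sane range
    W = scale * rng.standard_normal((n_in, n_out))
    W0 = np.zeros(n_out)            # biases can safely start at 0
    return W, W0

W1, b1 = init_layer(2, 3)
W2, b2 = init_layer(3, 1)
```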
5.2.2 Loss Function and Last-Layer Activation Function
Some common combinations of Loss Function with their respective Last Layer Activation Function
- Squared Loss with Identity (need real numbers)
- NLL Loss with Sigmoid (need a two-class probability)
- NLLM Loss with Softmax (need a multi-class probability)
5.3 Mini-Batch GD
"Average" the benefits of Batch GD and Stochastic GD
by selecting a "Mini-Batch" of $K$ data points at each step
(Here "Mini-Batch" = "Epoch", but people use the term "Epoch" differently in other cases, so ...)
Comparison of the Weight Update Equations under the following conditions:
- By 5.1.1, weight hiding is used
- Assume a dataset of size $n$
- Assume a mini-batch of size $K$

- Batch GD: $W \leftarrow W - \eta \cdot \dfrac{1}{n} \sum_{i=1}^{n} \nabla_W L_i(W)$
- Stochastic GD: $W \leftarrow W - \eta \cdot \nabla_W L_i(W)$, with one data point $i$ selected at random
- Mini-Batch GD: $W \leftarrow W - \eta \cdot \dfrac{1}{K} \sum_{i \in B} \nabla_W L_i(W)$, with $B$ a mini-batch of $K$ randomly selected data points
[ Note ]
Randomly select the $K$ data points of each mini-batch from the training set
5.4 Regularization
When you keep training and do not stop, the training loss will keep going down, but the validation loss will go up at some point due to overfitting; this is what Regularization deals with
5.4.1 Weight Decay
Add a (Ridge Regression) Regularizer / Penalty $\dfrac{\lambda}{2}\lVert W \rVert^2$
(Examples / Theories in Machine Learning - Basics 3.2.4 / 3.3.4)
Then the general Objective Function becomes $J(W) = \dfrac{1}{n}\sum_{i=1}^{n} L_i(W) + \dfrac{\lambda}{2}\lVert W \rVert^2$
Assume using SGD (one randomly selected data point per update). Therefore the weight update equation becomes
$W \leftarrow W - \eta\big(\nabla_W L_i(W) + \lambda W\big) = (1 - \eta\lambda)\,W - \eta\,\nabla_W L_i(W)$
Recall that the weight update equation without the regularizer is $W \leftarrow W - \eta\,\nabla_W L_i(W)$
You can see that the difference is the extra factor $(1 - \eta\lambda)$, which shrinks ("decays") the weights at every step
Here we assume smaller weights are preferable, giving better performance when generalizing the model
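As a sketch, the SGD update with weight decay can be written as a one-liner (names here are illustrative, not from these notes):

```python
import numpy as np

def sgd_step_weight_decay(W, grad_Li, eta=0.1, lam=1e-3):
    # W <- (1 - eta*lam) * W - eta * grad_Li : the regularizer shrinks ("decays") W
    return (1.0 - eta * lam) * W - eta * grad_Li
```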
5.4.2 Batch Normalization
This deals with Covariate Shift, which assumes that
- Distribution of Input
  The distribution of the input is changed, or "shifted"
  Ex. For a Cat Classifier:
  Input Dataset 1 = cats during the day
  Input Dataset 2 = cats at night
- Labeling Function
  Stays the same. The cats in the above 2 datasets are from the same group, hence the classification results should correspond
What you need to do is just to normalize the input data
5.4.3 Other Practices
- Dropout
  On each forward pass, each node has a probability $p$ of being dropped out (by setting the corresponding weight/output to 0). The purpose is to make each node replaceable (see the sketch after this list).
  Nodes are just like members of a team with different capacities: you need to make sure that they can partially cover each other's specializations, so that one day when someone is absent due to sickness, the rest of the team can still make some progress.
- Perturbing Data
  Messing with the Training Data a bit, e.g. adding noise from a Normal Distribution or other perturbations...
  Sort of a form of Data Augmentation
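A minimal sketch of dropout applied to one layer's activations; the "inverted dropout" scaling by 1/(1-p) is a common convention assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(A, p=0.5, training=True):
    # each node is dropped with probability p during training;
    # surviving activations are scaled by 1/(1-p) ("inverted dropout")
    if not training:
        return A
    mask = (rng.random(A.shape) >= p).astype(float)
    return A * mask / (1.0 - p)

A1 = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(A1, p=0.5))
```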
Special Topics
1. Softmax Regression
Softmax Regression uses a special activation function for Multi-Class Classification; the layer where this function is implemented is called the "Softmax Layer", whose output looks like the following.
Assume at the last layer there are 4 classes available to be classified. Each entry of the output is the probability of being classified to that class given the input, so the output of the Softmax layer is a $4 \times 1$ vector of probabilities that sum to 1.
The name "Softmax" is relative to "Hardmax":
"Hardmax" also outputs a $4 \times 1$ matrix in the example above, but only assigns "1" to the predicted class and "0"s for the rest, "just like One-Hot encoding"
1.1 Softmax Function
Assume the number of classes to be classified is $K$
- Linear Part of the Softmax Layer (input to the Softmax function)
  $Z = W^{\top} A + W_0 \in \mathbb{R}^{K}$
- Softmax Activation Function
  It's kind of doing normalization, mapping the input elements to probabilities
  The probability of the $j^{\text{th}}$ Class is $\text{softmax}(Z)_j = \dfrac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}}$
  Dimensions: both the input $Z$ and the output are $K \times 1$
[Example]
Given the number of classes to be classified and the linear part $Z$ of the Softmax Layer, first calculate the essential data $e^{Z_j}$ for each class and their sum $\sum_k e^{Z_k}$; then we can find the output by dividing each $e^{Z_j}$ by that sum (see the sketch below)
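A minimal sketch of the softmax computation; the values of Z are made up, since the original example numbers are not shown here:

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - np.max(Z))   # shift by max(Z) for numerical stability
    return e / e.sum()          # each entry is e^{Z_j} / sum_k e^{Z_k}

Z = np.array([2.0, 1.0, 0.1, -1.0])   # made-up linear part of the Softmax layer
p = softmax(Z)
print(p, p.sum())                     # probabilities over the 4 classes, summing to 1
```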
1.2 How to Train
- Loss Function for the $i^{\text{th}}$ Data Point
  Assume having 4 classes to be classified, with an example data point whose one-hot label is $y$ and whose Softmax output is $g$. The Loss Function is defined by
  $L = -\sum_{j=1}^{4} y_j \log g_j$
  This function seems to have terms of all four classes, but if you look at the label matrix $y$, you'll notice that only one class is assigned a "1", and the rest are assigned "0"s, which means only the term related to the labeled class will be left.
  Based on the above example, we will have $L = -\log g_{j^{*}}$, where $j^{*}$ is the labeled class.
  Therefore, you want to make $g_{j^{*}}$ as big as possible to minimize this loss function, a.k.a. make a correct prediction!
- Objective Function for the Training Set
  Assume the Training Set has $n$ data points: $J = \dfrac{1}{n}\sum_{i=1}^{n} L^{(i)}$
  Then you can just use gradient descent or other methods to learn the parameters...
2. Multi-Class NLL Loss
2.1 Two-Class Version
This is the Loss Function for Logistic Regression (see Machine Learning - Basics 3.2.3), hence it is actually a 2-Class NLL Loss
Assume $g \in (0, 1)$ is the predicted probability of the positive class and $y \in \{0, 1\}$ is the true label; the loss is $L = -\big(y \log g + (1 - y) \log(1 - g)\big)$
If $y = 1$, only the first term $-\log g$ remains; if $y = 0$, only the second term $-\log(1 - g)$ remains
2.2 Multi-Class Version
The definition in 2.1 could be extended to multi-class classification directly
Assume the number of classes to be classified is $K$, with
- $y$: a one-hot encoded matrix ($K \times 1$) with a "1" representing the labeled class
- $g$: the output of the Softmax Layer ($K \times 1$), representing the probability distribution over the $K$ classes
The probability that the Neural Network makes the proper prediction is $\prod_{j=1}^{K} g_j^{\,y_j}$
The definition of the Multi-Class Negative-Log-Likelihood Loss Function is $L = -\log \prod_{j=1}^{K} g_j^{\,y_j} = -\sum_{j=1}^{K} y_j \log g_j$
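A minimal sketch of this loss; the values of y and g are made up for illustration:

```python
import numpy as np

def nll_loss(y, g):
    # y: one-hot label (K,), g: softmax output (K,); only the labeled class term survives
    return -np.sum(y * np.log(g))

y = np.array([0.0, 0.0, 1.0, 0.0])        # labeled class is class 3
g = np.array([0.1, 0.2, 0.6, 0.1])        # made-up softmax output
print(nll_loss(y, g))                      # = -log(0.6)
```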
3. Adaptive Step-Size
The step size $\eta$ should not stay at one fixed value throughout training. For example, it
- should be large when the gradient change is small (flat regions)
- should be small when the gradient change is large (steep regions)
or you might miss something important... like the little red optimum...

3.1 Moving Average
This is the theoretical basis for further discussion of step-size
Define a sequence of data $a_1, a_2, \dots, a_t, \dots$
Define the weighted average of the first $t$ data points recursively as
$M_t = \gamma\, M_{t-1} + (1 - \gamma)\, a_t$
where $\gamma \in (0, 1)$ and $M_0 = 0$
Then the weighted average of all $t$ data points expands to
$M_t = (1 - \gamma) \sum_{k=1}^{t} \gamma^{\,t-k}\, a_k$
Data input closer to the end (larger $k$) receives a larger weight $\gamma^{\,t-k}$
Note that $0 < \gamma < 1$, therefore
$\gamma^{\,t-k} \to 0$ as $t - k \to \infty$
$\gamma^{\,t-k} \to 1$ as $t - k \to 0$
[ Note ]
Setting $\gamma$ closer to 1 makes the average remember more of the past; setting $\gamma$ closer to 0 makes it follow only the most recent data
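A tiny sketch of this exponentially weighted moving average:

```python
import numpy as np

def moving_average(a, gamma=0.9):
    # M_t = gamma * M_{t-1} + (1 - gamma) * a_t, with M_0 = 0
    M, out = 0.0, []
    for a_t in a:
        M = gamma * M + (1.0 - gamma) * a_t
        out.append(M)
    return np.array(out)

print(moving_average([1.0, 1.0, 1.0, 10.0, 1.0]))  # recent values are weighted more
```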
3.2 Momentum
SGD and MBGD randomly select one or more data points to compute the gradient, hence the direction of the next step is also somewhat random
"Momentum" allows more effective weight update by taking past gradients into account
Momentum = Weighted Average of Current & Past Gradients

To be more specific, in the graph below
Red Arrows are the final choice of update of each step
Blue Dots are the "current gradients" computed from the Mini-Batch at each step

Definition of Weight Update with Momentum
$V_t = \gamma\, V_{t-1} + \eta\, \nabla_W L(W_{t-1}), \qquad W_t = W_{t-1} - V_t$
To arrange the above equations into the standard form of a Moving Average, define $M_t = \dfrac{1-\gamma}{\eta}\, V_t$
- Momentum Update
  With that definition of $M_t$, the first equation becomes $M_t = \gamma\, M_{t-1} + (1-\gamma)\, \nabla_W L(W_{t-1})$
- Initial Momentum
  $V_0 = 0$ (equivalently $M_0 = 0$)
- Weight Update
  $W_t = W_{t-1} - \dfrac{\eta}{1-\gamma}\, M_t$
- Summary
  Now this looks exactly like updating the weights with a step size of $\dfrac{\eta}{1-\gamma}$
  on a weighted moving average of gradients
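A minimal sketch of the momentum update; grad_fn is a hypothetical helper returning the gradient at W, and the quadratic example is made up for illustration:

```python
import numpy as np

def momentum_updates(W, grad_fn, eta=0.01, gamma=0.9, steps=100):
    V = np.zeros_like(W)                   # initial momentum V_0 = 0
    for _ in range(steps):
        V = gamma * V + eta * grad_fn(W)   # V_t = gamma V_{t-1} + eta * gradient
        W = W - V                          # W_t = W_{t-1} - V_t
    return W

# example: minimize f(W) = ||W||^2 / 2, whose gradient is W
print(momentum_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```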
3.3 AdaDelta / AdaGrad
Here is an intuitive way to understand this:
"Big Steps when Flat, Small Steps when Steep"
Just like the example provided in the beginning of this section!
Definition of Weight Update with AdaDelta (elementwise, for each weight):
$G_t = \gamma\, G_{t-1} + (1 - \gamma)\, g_t^2, \qquad W_t = W_{t-1} - \dfrac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$
where
- $g_t$: the gradient of the weight, or the component of the gradient for this weight
- $G_t$: the weighted moving average of the squared gradient of the weight; this is the measure of curvature (flat / steep)
  When $G_t$ is large, the curvature is high: scale down the step size in the weight update
  When $G_t$ is small, the curvature is low: scale up the step size in the weight update
- $\epsilon$: a very small "safeguard" in case the step size goes crazy when $G_t \approx 0$
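A minimal elementwise sketch of this adaptive update, with the same hypothetical grad_fn convention as above:

```python
import numpy as np

def adadelta_updates(W, grad_fn, eta=0.1, gamma=0.9, eps=1e-8, steps=100):
    G = np.zeros_like(W)                       # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(W)
        G = gamma * G + (1.0 - gamma) * g**2   # curvature estimate per weight
        W = W - eta * g / np.sqrt(G + eps)     # big steps where flat, small where steep
    return W

print(adadelta_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```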
3.4 Adam
Combines the idea of Momentum and AdaDelta, by computing the Moving Averages of both Gradient and Squared Gradient
but it may actually violate the convergence conditions of SGD
Definition of Weight Update with Adam (elementwise, for each weight):
$M_t = \beta_1 M_{t-1} + (1 - \beta_1)\, g_t, \qquad V_t = \beta_2 V_{t-1} + (1 - \beta_2)\, g_t^2$
$\hat{M}_t = \dfrac{M_t}{1 - \beta_1^{\,t}}, \qquad \hat{V}_t = \dfrac{V_t}{1 - \beta_2^{\,t}}, \qquad W_t = W_{t-1} - \dfrac{\eta}{\sqrt{\hat{V}_t} + \epsilon}\, \hat{M}_t$
where
- $g_t$: the gradient of the weight, or the component of the gradient for this weight
- $M_t$: the weighted moving average of the gradient of the weight, the "nominal" mean
- $V_t$: the weighted moving average of the squared gradient of the weight, the "nominal" variance
- $\hat{M}_t$, $\hat{V}_t$: the corrected "mean" & "variance"; if the "nominal" mean & variance used zero initialization ($M_0 = V_0 = 0$) without being corrected, then their values would always be too small in the early steps!
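A minimal elementwise sketch of the Adam update, again with a hypothetical grad_fn:

```python
import numpy as np

def adam_updates(W, grad_fn, eta=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    M = np.zeros_like(W); V = np.zeros_like(W)
    for t in range(1, steps + 1):
        g = grad_fn(W)
        M = b1 * M + (1 - b1) * g            # moving average of the gradient ("mean")
        V = b2 * V + (1 - b2) * g**2         # moving average of the squared gradient ("variance")
        M_hat = M / (1 - b1**t)              # bias correction for zero initialization
        V_hat = V / (1 - b2**t)
        W = W - eta * M_hat / (np.sqrt(V_hat) + eps)
    return W

print(adam_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```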
4. CNN
Convolutional Neural Network (CNN) is a classic architecture for Image Processing