Machine Learning - NN
1. Step Function as Features
A linear classifier in 2D-space separates a group of training data with two features, $x_1$ and $x_2$

A step function can be used as a feature; its difference from the previous case is that this time the feature itself also has parameters. Consider the two step functions below:

$\phi_1(x) = \text{step}(\theta_1^\top x + \theta_{1,0}), \qquad \phi_2(x) = \text{step}(\theta_2^\top x + \theta_{2,0})$

where $\text{step}(z) = \{z > 0\}$, and $\{\cdot\}$ is evaluated as 1 if what is inside the curly braces is true (0 otherwise); each $\phi_j$ is a step function of $\theta_j^\top x + \theta_{j,0}$, and $\theta_j^\top x + \theta_{j,0} = 0$ is the hyperplane of a linear classifier
Construct a feature vector $\phi(x) = [\phi_1(x), \phi_2(x)]^\top$ for a data point $x$ out of the above two step functions
Then we implement linear classification in 3D-space with this feature vector
We could assign a set of weight values and visualize this classification hypothesis

If we add a third feature like this

We can see that there are some overlapping regions in the visualization of this hypothesis, but since we have set the threshold value to 0, the overlaps do not affect the classification results
Theoretically, though, we could be more "confident" when classifying training data that lie in those regions

[TIP] To better understand the meaning of this process, think about it in some context:
We want to know which planets are suitable for human immigration, and the most decisive elements are precipitation and temperature
Then we can interpret the three features as:
- we don't want to immigrate if there's too much rain
- we don't want to immigrate if it's too hot
- we don't want to immigrate if it's too cold
So finally, in the top-view graph on the right-hand side, only planets lying in the green zone are habitable
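To make this concrete, here is a minimal NumPy sketch of the planet example; the precipitation/temperature thresholds and the second-stage weights are made-up values, purely for illustration:

```python
import numpy as np

def step(z):
    # {z > 0}: 1 if the condition inside the braces is true, else 0
    return (z > 0).astype(float)

def features(x):
    # x = [precipitation, temperature]; thresholds are made up for illustration
    phi1 = step(50.0 - x[0])      # 1 if there is not too much rain
    phi2 = step(35.0 - x[1])      # 1 if it is not too hot
    phi3 = step(x[1] - (-10.0))   # 1 if it is not too cold
    return np.array([phi1, phi2, phi3])

def predict(x, w=np.array([1.0, 1.0, 1.0]), w0=-2.5):
    # second-stage linear classifier on the step features, with threshold 0
    return 1 if w @ features(x) + w0 > 0 else 0

print(predict(np.array([30.0, 20.0])))   # mild planet  -> 1 (habitable)
print(predict(np.array([120.0, 20.0])))  # too rainy    -> 0
```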
2. Hypothesis
The process in 1. describes a Neural Network that has the following two layers:
- Construct the Features
- Assign the Labels
A superscript on a letter (in the form $W^{(1)}$, $A^{(1)}$, etc.) means we are at layer 1
- The 1st Layer: Construct the Features
  - Input
    One single data point of features, $X = A^{(0)} \in \mathbb{R}^{d}$, where $d$ is the number of features of a data point (in our case, $d = 2$)
  - Output
    Vector of (step function) features, $A^{(1)} \in \mathbb{R}^{m}$, where $m$ is the number of (step function) features (in our case, $m = 3$)
  - Computation
    - The $j^{\text{th}}$ feature: $A^{(1)}_j = \text{step}\big(W^{(1)\top}_j X + W^{(1)}_{j,0}\big)$, where the dimension of $W^{(1)}_j$ should be the same as that of $X$ (these weights play the role of the $\theta$'s in 1.)
    - All features at once: $A^{(1)} = \text{step}\big(W^{(1)\top} X + W^{(1)}_0\big)$, where the $W^{(1)}_j$ are horizontally stacked into $W^{(1)} \in \mathbb{R}^{d \times m}$ and the $W^{(1)}_{j,0}$ are vertically stacked into $W^{(1)}_0 \in \mathbb{R}^{m}$
- The 2nd Layer: Assign the Labels
  - Input
    The vector of (step function) features, i.e. the output of the 1st layer, $A^{(1)} \in \mathbb{R}^{m}$, where $m$ is the number of (step function) features (in our case, $m = 3$)
  - Output
    Vector of labels $A^{(2)} \in \mathbb{R}^{n}$, where $n$ is the number of labels (in our case, $n = 1$, as we are just trying to decide "yes" or "no")
    And yeah, all of this process forms a Neural Net as noted in the graph above
  - Computation
    - The $j^{\text{th}}$ label: $A^{(2)}_j = \text{step}\big(W^{(2)\top}_j A^{(1)} + W^{(2)}_{j,0}\big)$, where the dimension of $W^{(2)}_j$ is the same as that of $A^{(1)}$
    - All labels at once: $A^{(2)} = \text{step}\big(W^{(2)\top} A^{(1)} + W^{(2)}_0\big)$, where the $W^{(2)}_j$ are horizontally stacked into $W^{(2)} \in \mathbb{R}^{m \times n}$ and the $W^{(2)}_{j,0}$ are vertically stacked into $W^{(2)}_0 \in \mathbb{R}^{n}$
- Summary (writing the activation of each layer as $f^{(\ell)}$, a step function in our example)
  - The 1st layer: $A^{(1)} = f^{(1)}\big(W^{(1)\top} X + W^{(1)}_0\big)$
  - The 2nd layer: $A^{(2)} = f^{(2)}\big(W^{(2)\top} A^{(1)} + W^{(2)}_0\big)$
  - The Whole Process: $A^{(2)} = f^{(2)}\Big(W^{(2)\top} f^{(1)}\big(W^{(1)\top} X + W^{(1)}_0\big) + W^{(2)}_0\Big)$
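Using the notation above, here is a minimal NumPy sketch of the two-layer hypothesis; the weight values are made-up placeholders that mirror the planet example in 1.:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

# layer 1: d = 2 input features -> m = 3 step-function features
W1 = np.array([[-1.0,  0.0, 0.0],
               [ 0.0, -1.0, 1.0]])      # shape (d, m)
W1_0 = np.array([50.0, 35.0, 10.0])     # shape (m,)

# layer 2: m = 3 features -> n = 1 label
W2 = np.array([[1.0], [1.0], [1.0]])    # shape (m, n)
W2_0 = np.array([-2.5])                 # shape (n,)

def hypothesis(x):
    A1 = step(W1.T @ x + W1_0)          # A^(1) = f^(1)(W^(1)T X + W^(1)_0)
    A2 = step(W2.T @ A1 + W2_0)         # A^(2) = f^(2)(W^(2)T A^(1) + W^(2)_0)
    return A2

print(hypothesis(np.array([30.0, 20.0])))   # -> [1.]
```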
3. Function Graph
The following is a function graph created based on the hypothesis we developed (see 2.)
We have only one label to predict, so the output is just a single value $A^{(2)}$

Explanation
- This is a Fully Connected, Feed-Forward neural network
  - Fully Connected:
    A fully connected layer means that every neuron of the layer receives the same (full) set of inputs
    A fully connected network is composed of fully connected layers
  - Feed-Forward:
    Well, literally, all arrows point forward
    (There are some NNs whose arrows do not all point forward, like RNNs)
- Circles in the graph mean function evaluation. The summation signs $\Sigma$ inside refer to the linear parts of the layers:
  - 1st layer: $Z^{(1)} = W^{(1)\top} X + W^{(1)}_0$
  - 2nd layer: $Z^{(2)} = W^{(2)\top} A^{(1)} + W^{(2)}_0$
- Important Notation: $Z^{(\ell)}$ is the "Linear Part of layer $\ell$"; using the notations in 2., $Z^{(\ell)} = W^{(\ell)\top} A^{(\ell-1)} + W^{(\ell)}_0$ and $A^{(\ell)} = f^{(\ell)}(Z^{(\ell)})$
3.1 "Neuron"
Neurons/Units/Nodes are basic elements of a Neural Network
The following graph is a neuron taken from the network above: the summation sign represents the linear part $Z$, and $f$ represents the activation function applied to it
3.2 "Layer"
Multiple Neurons compose a layer
- Annotations
- Simplified Version
  A layer with 3 features $A_1$, $A_2$, $A_3$, each dependent on 2 parameters, $W_j$ and $W_{j,0}$
- Multiple (Hidden) Layers
  For example, suppose you have $L$ layers in a Neural Network. Then each layer could be represented with
  $A^{(\ell)} = f^{(\ell)}\big(W^{(\ell)\top} A^{(\ell-1)} + W^{(\ell)}_0\big)$
  where $\ell = 1, \dots, L$ and $A^{(0)} = X$
4. Activation Function
4.1 Example
- Identity Function: $f(z) = z$
  "Do Nothing"
- Sigmoid Function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  For classification to the range $(0, 1)$
- Hyperbolic Tangent: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  For classification to the range $(-1, 1)$
- Rectified Linear Unit: $\text{ReLU}(z) = \max(0, z)$
  Similar to the Identity Function, but zeroes out all values less than 0
  There is one single point ($z = 0$) that is non-differentiable, but we won't hit that point most of the time
- [Special] Softmax Function
  A popular choice for multi-class classification; see Special Topics - 1. for more information
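For quick reference, a minimal NumPy sketch of the activation functions listed above:

```python
import numpy as np

def identity(z):
    return z                            # "do nothing"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                   # maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)           # zeroes out negative values

def softmax(z):
    e = np.exp(z - np.max(z))           # subtract max(z) for numerical stability
    return e / e.sum()                  # entries sum to 1 (a probability distribution)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z))
```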
4.2 Function Selection
Technically, we cannot use Step Function as Activation Function due to the following concerns
- Derivatives are either 0 or undefined, so we cannot apply (Stochastic) Gradient Descent
- The output cannot take a range of values, so we cannot apply Regression
- The value is completely binary with no values in between, so we cannot apply NLL Loss
The hypothesis of the Neural Network in our example is (see 2.)
$A^{(2)} = f^{(2)}\big(W^{(2)\top} f^{(1)}(W^{(1)\top} X + W^{(1)}_0) + W^{(2)}_0\big)$
You can choose different Activation Functions based on what you want to do:
- Regression
  If $f^{(2)}$ can take on a range of values, then you can just use the Identity Function as $f^{(2)}$
- NLL Loss
  The Sigmoid Function does return values in $(0, 1)$, so... use the Sigmoid as $f^{(2)}$
- (Stochastic) Gradient Descent
  Both of the above choices for $f^{(2)}$ have non-zero derivatives. We also need to make sure $f^{(1)}$ is also differentiable, like... the Sigmoid or ReLU
[DEMO] Choosing different activation functions can generate different hyperplanes, as shown in the two example visualizations.
5. How to Train
A general form of the Learning (or Optimization) Objective is
$J(W) = \dfrac{1}{n}\sum_{i=1}^{n} L\big(h(x^{(i)}; W),\, y^{(i)}\big) + \lambda R(W)$
[Theorem] If the optimization objective is nice and convex, then (Stochastic) Gradient Descent performs well
BUT! Unfortunately, the objective of a Neural Network is a SUPER NON-CONVEX function of the parameters $W$!!!
Except for 1-layer NN...
5.1 Back-Propagation
For Supervised Learning, what we want to do is minimize the training loss over the weights: $\min_W J(W)$
Since (Stochastic) Gradient Descent is usually preferred, what we need is the gradient of the loss with respect to the weights of every layer, $\nabla_{W^{(\ell)}} L$
And, yeah, Back-Propagation is a very elegant way to find Gradients!
5.1.1 Some Reminders
- Chain Rule
  Back-Prop is based on the Chain Rule: $\dfrac{\partial L}{\partial W} = \dfrac{\partial L}{\partial A} \cdot \dfrac{\partial A}{\partial Z} \cdot \dfrac{\partial Z}{\partial W}$
- Weight Hiding
  The bias parameter $W^{(\ell)}_0$ can be cast into $W^{(\ell)}$ to form a general weight matrix, and we add a "1" at the end of the original $A^{(\ell-1)}$. In this way the formula for the linear part of the $\ell^{\text{th}}$ layer could be simplified to $Z^{(\ell)} = W^{(\ell)\top} A^{(\ell-1)}$. This is the notation we are using in this section.
  (See Machine Learning - Basics 1.6.5)
- Gradient with respect to What???
  Always find the gradient of the Loss function with respect to the weights of each layer, $\dfrac{\partial L}{\partial W^{(\ell)}}$. That tells us how much we want to change the weights in order to reduce the loss.
5.1.2 Computation Analysis
Assume the Neural Network has $L$ layers
- [How Training Loss depends on Weights of the Last Layer]
  Loss Function written with the linear part of the last Layer:
  $L = \text{Loss}\big(A^{(L)}, y\big) = \text{Loss}\big(f^{(L)}(Z^{(L)}), y\big) = \text{Loss}\big(f^{(L)}(W^{(L)\top} A^{(L-1)}), y\big)$
  Gradient Computation with respect to the weights of the last Layer:
  $\dfrac{\partial L}{\partial W^{(L)}} = \dfrac{\partial L}{\partial A^{(L)}} \cdot \dfrac{\partial A^{(L)}}{\partial Z^{(L)}} \cdot \dfrac{\partial Z^{(L)}}{\partial W^{(L)}}$
  Analysis:
  - First Term: $\dfrac{\partial L}{\partial A^{(L)}}$
    According to 2., it's just the derivative of the loss with respect to the output $A^{(L)}$, depending on the choice of Loss Function
  - Second Term: $\dfrac{\partial A^{(L)}}{\partial Z^{(L)}}$
    According to 3., $A^{(L)} = f^{(L)}(Z^{(L)})$. We know each $A^{(L)}_j$ only depends on its own $Z^{(L)}_j$, so this partial derivative is a diagonal matrix of size $n^{L} \times n^{L}$
  - Third Term: $\dfrac{\partial Z^{(L)}}{\partial W^{(L)}}$
    According to 2., $Z^{(L)} = W^{(L)\top} A^{(L-1)}$. If you just look at this equation, you can find that this value is just $A^{(L-1)}$
  - Proper Computation Sequence
    Obviously the dimensions of the terms of the above equation do not match
    The actual computation sequence is the arrangement below:
    $\dfrac{\partial L}{\partial W^{(L)}} = A^{(L-1)} \left( \dfrac{\partial A^{(L)}}{\partial Z^{(L)}} \cdot \dfrac{\partial L}{\partial A^{(L)}} \right)^{\top}$
    Then we could find this nice and neat general notation: $\dfrac{\partial L}{\partial W^{(L)}} = A^{(L-1)} \left( \dfrac{\partial L}{\partial Z^{(L)}} \right)^{\top}$
    By the above equation, we can see that since we have already found $A^{(L-1)}$ in the Forward Pass, we just need to find $\dfrac{\partial L}{\partial Z^{(L)}}$ to compute the gradient!!! Yeah!
- [How Training Loss depends on Weights of the First Layer]
  Too stupid if we still stick to the partial-derivative chain above...
  We can focus on finding $\dfrac{\partial L}{\partial Z^{(1)}}$ and reuse the same pattern:
  $\dfrac{\partial L}{\partial W^{(1)}} = A^{(0)} \left( \dfrac{\partial L}{\partial Z^{(1)}} \right)^{\top}$
  where $A^{(0)}$ is the initial data point; an equivalent notation is kind of "$X$" (see 2.)
5.1.3 General Equation
- [Back-Propagation to the $\ell^{\text{th}}$ Layer]
  $\dfrac{\partial L}{\partial W^{(\ell)}} = A^{(\ell-1)} \left( \dfrac{\partial L}{\partial Z^{(\ell)}} \right)^{\top}, \qquad \dfrac{\partial L}{\partial Z^{(\ell)}} = \dfrac{\partial A^{(\ell)}}{\partial Z^{(\ell)}} \cdot W^{(\ell+1)} \cdot \dfrac{\partial L}{\partial Z^{(\ell+1)}}$
  Dimensions of the terms match properly in this version
- [Summary]
  - Forward Pass
    For layers $\ell = 1, \dots, L$:
    compute all $Z^{(\ell)}$, $A^{(\ell)}$, and finally the loss
  - Back-Propagation
    For layers $\ell = L, \dots, 1$:
    compute all gradients $\dfrac{\partial L}{\partial Z^{(\ell)}}$ and $\dfrac{\partial L}{\partial W^{(\ell)}}$, and update the weights

5.2 Stochastic GD
Here is the pseudo-code implementation of a Neural Network using Stochastic Gradient Descent
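A minimal NumPy sketch of such an implementation, assuming (just for illustration) a 2-layer network with sigmoid activations, squared loss, and a toy labeling rule $y = 1$ if $x_1 + x_2 > 1$; it follows the Forward Pass / Back-Propagation summary in 5.1.3:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: d = 2 features, label y = 1 if x1 + x2 > 1 else 0
d, m = 2, 3                                  # m hidden units
X = rng.standard_normal((200, d))
Y = (X.sum(axis=1) > 1.0).astype(float)

# small random initialization (see 5.2.1)
W1 = 0.1 * rng.standard_normal((d, m)); b1 = np.zeros(m)
W2 = 0.1 * rng.standard_normal((m, 1)); b2 = np.zeros(1)

eta = 0.5
for t in range(20000):
    i = rng.integers(len(X))                 # randomly select one data point (SGD)
    x, y = X[i], Y[i]

    # forward pass: compute Z^(1), A^(1), Z^(2), A^(2)
    Z1 = W1.T @ x + b1;  A1 = sigmoid(Z1)
    Z2 = W2.T @ A1 + b2; A2 = sigmoid(Z2)
    # squared loss L = (A2 - y)^2  (loss/activation choice assumed for this sketch)

    # back-propagation: dL/dZ^(2), dL/dZ^(1), then dL/dW^(l) = A^(l-1) (dL/dZ^(l))^T
    dZ2 = 2 * (A2 - y) * A2 * (1 - A2)       # shape (1,)
    dZ1 = (W2 @ dZ2) * A1 * (1 - A1)         # shape (m,)
    W2 -= eta * np.outer(A1, dZ2); b2 -= eta * dZ2
    W1 -= eta * np.outer(x, dZ1);  b1 -= eta * dZ1

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel() > 0.5
print("train accuracy:", (pred == (Y > 0.5)).mean())
```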

5.2.1 Weight Initialization
- Random
  Different layers of the Neural Network address different things, so do not start everything at the same weights!
  Otherwise this will lead you to some local optimum rather than the global optimum!
- Not too Extreme
  Consider a Sigmoid as the activation function: if the weights are too big, the input to the Sigmoid will also be very big, which means the gradient will be almost 0!
  Consider a ReLU as the activation function: if the weights are too small (very negative), the input to the ReLU will lie in the negative half-plane, where the gradient is 0!
- Worst Case:
  Extremely big weights: the Gradient Explodes
  Weights extremely near 0: the Gradient Vanishes
  (see 5. for more information)
Generally, keep the initial weights at small random values to ensure the gradients are non-zero
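For example, a minimal initialization sketch (the scale 0.01 is just a typical small value, assumed here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, scale=0.01):
    # small random weights: break symmetry, keep activations/gradients in a sane range
    W = scale * rng.standard_normal((n_in, n_out))
    W0 = np.zeros(n_out)            # biases can safely start at 0
    return W, W0

W1, b1 = init_layer(2, 3)
W2, b2 = init_layer(3, 1)
```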
5.2.2 Loss Function and Last-Layer Activation Function
Some common combinations of Loss Function with their respective Last Layer Activation Function
- Squared Loss with Identity (need real numbers)
- NLL Loss with Sigmoid (need a two-class probability)
- NLLM Loss with Softmax (need a multi-class probability)
5.3 Mini-Batch GD
"Average" the benefits of Batch GD and Stochastic GD
by selecting a "Mini-Batch" of $K$ data points at each step
(Here "Mini-Batch" = "Epoch", but people use the term "Epoch" differently in other cases, so ...)
Comparison of the Weight Update Equations under the following conditions:
- By 5.1.1, weight hiding is used
- Assume a dataset of size $n$
- Assume a mini-batch of size $K$

- Batch GD: $W \leftarrow W - \eta \cdot \dfrac{1}{n} \sum_{i=1}^{n} \nabla_W L_i(W)$
- Stochastic GD: $W \leftarrow W - \eta \cdot \nabla_W L_i(W)$, with one data point $i$ selected at random
- Mini-Batch GD: $W \leftarrow W - \eta \cdot \dfrac{1}{K} \sum_{i \in B} \nabla_W L_i(W)$, with $B$ a mini-batch of $K$ randomly selected data points
[ Note ]
Randomly select the $K$ data points of each mini-batch from the training set
5.4 Regularization
When you keep training and do not stop, the training loss will keep going down, but the validation loss will go up at some point due to overfitting; this is what Regularization deals with
5.4.1 Weight Decay
Add a (Ridge Regression) Regularizer / Penalty $\dfrac{\lambda}{2}\lVert W \rVert^2$
(Examples / Theories in Machine Learning - Basics 3.2.4 / 3.3.4)
Then the general Objective Function becomes $J(W) = \dfrac{1}{n}\sum_{i=1}^{n} L_i(W) + \dfrac{\lambda}{2}\lVert W \rVert^2$
Assume using SGD (one randomly selected data point per update). Therefore the weight update equation becomes
$W \leftarrow W - \eta\big(\nabla_W L_i(W) + \lambda W\big) = (1 - \eta\lambda)\,W - \eta\,\nabla_W L_i(W)$
Recall that the weight update equation without the regularizer is $W \leftarrow W - \eta\,\nabla_W L_i(W)$
You can see that the difference is the extra factor $(1 - \eta\lambda)$, which shrinks ("decays") the weights at every step
Here we assume smaller weights are preferable, giving better performance when generalizing the model
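As a sketch, the SGD update with weight decay can be written as a one-liner (names here are illustrative, not from these notes):

```python
import numpy as np

def sgd_step_weight_decay(W, grad_Li, eta=0.1, lam=1e-3):
    # W <- (1 - eta*lam) * W - eta * grad_Li : the regularizer shrinks ("decays") W
    return (1.0 - eta * lam) * W - eta * grad_Li
```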
5.4.2 Batch Normalization
This deals with Covariate Shift, which assumes that
- Distribution of Input
  The distribution of the input is changed, or "shifted"
  Ex. For a Cat Classifier:
  Input Dataset 1 = cats during the day
  Input Dataset 2 = cats at night
- Labeling Function
  Stays the same. The cats in the above 2 datasets are from the same group, hence the classification results should correspond
What you need to do is just to normalize the input data
5.4.3 Other Practices
- Dropout
  On each forward pass, each node has a probability $p$ of being dropped out (by setting the corresponding weight/output to 0). The purpose is to make each node replaceable (see the sketch after this list).
  Nodes are just like members of a team with different capacities: you need to make sure that they can partially cover each other's specializations, so that one day when someone is absent due to sickness, the rest of the team can still make some progress.
- Perturbing Data
  Messing with the Training Data a bit, e.g. adding noise from a Normal Distribution or other perturbations...
  Sort of a form of Data Augmentation
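A minimal sketch of dropout applied to one layer's activations; the "inverted dropout" scaling by 1/(1-p) is a common convention assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(A, p=0.5, training=True):
    # each node is dropped with probability p during training;
    # surviving activations are scaled by 1/(1-p) ("inverted dropout")
    if not training:
        return A
    mask = (rng.random(A.shape) >= p).astype(float)
    return A * mask / (1.0 - p)

A1 = np.array([0.2, 1.5, 0.7, 0.9])
print(dropout(A1, p=0.5))
```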
Special Topics
1. Softmax Regression
Softmax Regression uses a special activation function for Multi-Class Classification; the layer where this function is implemented is called the "Softmax Layer", whose output looks like the following.
Assume at the last layer there are 4 classes available to be classified. Each entry of the output is the probability of being classified to that class given the input, so the output of the Softmax layer is a $4 \times 1$ vector of probabilities that sum to 1.
The name "Softmax" is relative to "Hardmax":
"Hardmax" also outputs a $4 \times 1$ matrix in the example above, but only assigns "1" to the predicted class and "0"s for the rest, "just like One-Hot encoding"
1.1 Softmax Function
Assume the number of classes to be classified is $K$
- Linear Part of the Softmax Layer (input to the Softmax function)
  $Z = W^{\top} A + W_0 \in \mathbb{R}^{K}$
- Softmax Activation Function
  It's kind of doing normalization, mapping the input elements to probabilities
  The probability of the $j^{\text{th}}$ Class is $\text{softmax}(Z)_j = \dfrac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}}$
  Dimensions: both the input $Z$ and the output are $K \times 1$
[Example]
Given the number of classes to be classified and the linear part $Z$ of the Softmax Layer, first calculate the essential data $e^{Z_j}$ for each class and their sum $\sum_k e^{Z_k}$; then we can find the output by dividing each $e^{Z_j}$ by that sum (see the sketch below)
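A minimal sketch of the softmax computation; the values of Z are made up, since the original example numbers are not shown here:

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - np.max(Z))   # shift by max(Z) for numerical stability
    return e / e.sum()          # each entry is e^{Z_j} / sum_k e^{Z_k}

Z = np.array([2.0, 1.0, 0.1, -1.0])   # made-up linear part of the Softmax layer
p = softmax(Z)
print(p, p.sum())                     # probabilities over the 4 classes, summing to 1
```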
1.2 How to Train
- Loss Function for the $i^{\text{th}}$ Data Point
  Assume having 4 classes to be classified, with an example data point whose one-hot label is $y$ and whose Softmax output is $g$. The Loss Function is defined by
  $L = -\sum_{j=1}^{4} y_j \log g_j$
  This function seems to have terms of all four classes, but if you look at the label matrix $y$, you'll notice that only one class is assigned a "1", and the rest are assigned "0"s, which means only the term related to the labeled class will be left.
  Based on the above example, we will have $L = -\log g_{j^{*}}$, where $j^{*}$ is the labeled class.
  Therefore, you want to make $g_{j^{*}}$ as big as possible to minimize this loss function, a.k.a. make a correct prediction!
- Objective Function for the Training Set
  Assume the Training Set has $n$ data points: $J = \dfrac{1}{n}\sum_{i=1}^{n} L^{(i)}$
  Then you can just use gradient descent or other methods to learn the parameters...
2. Multi-Class NLL Loss
2.1 Two-Class Version
This is the Loss Function for Logistic Regression (see Machine Learning - Basics 3.2.3), hence it is actually a 2-Class NLL Loss
Assume $g \in (0, 1)$ is the predicted probability of the positive class and $y \in \{0, 1\}$ is the true label; the loss is $L = -\big(y \log g + (1 - y) \log(1 - g)\big)$
If $y = 1$, only the first term $-\log g$ remains; if $y = 0$, only the second term $-\log(1 - g)$ remains
2.2 Multi-Class Version
The definition in 2.1 could be extended to multi-class classification directly
Assume the number of classes to be classified is $K$, with
- $y$: a one-hot encoded matrix ($K \times 1$) with a "1" representing the labeled class
- $g$: the output of the Softmax Layer ($K \times 1$), representing the probability distribution over the $K$ classes
The probability that the Neural Network makes the proper prediction is $\prod_{j=1}^{K} g_j^{\,y_j}$
The definition of the Multi-Class Negative-Log-Likelihood Loss Function is $L = -\log \prod_{j=1}^{K} g_j^{\,y_j} = -\sum_{j=1}^{K} y_j \log g_j$
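A minimal sketch of this loss; the values of y and g are made up for illustration:

```python
import numpy as np

def nll_loss(y, g):
    # y: one-hot label (K,), g: softmax output (K,); only the labeled class term survives
    return -np.sum(y * np.log(g))

y = np.array([0.0, 0.0, 1.0, 0.0])        # labeled class is class 3
g = np.array([0.1, 0.2, 0.6, 0.1])        # made-up softmax output
print(nll_loss(y, g))                      # = -log(0.6)
```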
3. Adaptive Step-Size
The step size $\eta$ should not stay at one fixed value throughout training. For example, it
- should be large when the gradient change is small (flat regions)
- should be small when the gradient change is large (steep regions)
or you might miss something important... like the little red optimum...

3.1 Moving Average
This is the theoretical basis for further discussion of step-size
Define a sequence of data $a_1, a_2, \dots, a_t, \dots$
Define the weighted average of the first $t$ data points recursively as
$M_t = \gamma\, M_{t-1} + (1 - \gamma)\, a_t$
where $\gamma \in (0, 1)$ and $M_0 = 0$
Then the weighted average of all $t$ data points expands to
$M_t = (1 - \gamma) \sum_{k=1}^{t} \gamma^{\,t-k}\, a_k$
Data input closer to the end (larger $k$) receives a larger weight $\gamma^{\,t-k}$
Note that $0 < \gamma < 1$, therefore
$\gamma^{\,t-k} \to 0$ as $t - k \to \infty$
$\gamma^{\,t-k} \to 1$ as $t - k \to 0$
[ Note ]
Setting $\gamma$ closer to 1 makes the average remember more of the past; setting $\gamma$ closer to 0 makes it follow only the most recent data
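A tiny sketch of this exponentially weighted moving average:

```python
import numpy as np

def moving_average(a, gamma=0.9):
    # M_t = gamma * M_{t-1} + (1 - gamma) * a_t, with M_0 = 0
    M, out = 0.0, []
    for a_t in a:
        M = gamma * M + (1.0 - gamma) * a_t
        out.append(M)
    return np.array(out)

print(moving_average([1.0, 1.0, 1.0, 10.0, 1.0]))  # recent values are weighted more
```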
3.2 Momentum
SGD and MBGD randomly select one or more data points to compute the gradient, hence the direction of the next step is also somewhat random
"Momentum" allows more effective weight update by taking past gradients into account
Momentum = Weighted Average of Current & Past Gradients

To be more specific, in the graph below
Red Arrows are the final choice of update of each step
Blue Dots are the "current gradients" computed from the Mini-Batch at each step

Definition of Weight Update with Momentum
$V_t = \gamma\, V_{t-1} + \eta\, \nabla_W L(W_{t-1}), \qquad W_t = W_{t-1} - V_t$
To arrange the above equations into the standard form of a Moving Average, define $M_t = \dfrac{1-\gamma}{\eta}\, V_t$
- Momentum Update
  With that definition of $M_t$, the first equation becomes $M_t = \gamma\, M_{t-1} + (1-\gamma)\, \nabla_W L(W_{t-1})$
- Initial Momentum
  $V_0 = 0$ (equivalently $M_0 = 0$)
- Weight Update
  $W_t = W_{t-1} - \dfrac{\eta}{1-\gamma}\, M_t$
- Summary
  Now this looks exactly like updating the weights with a step size of $\dfrac{\eta}{1-\gamma}$
  on a weighted moving average of gradients
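A minimal sketch of the momentum update; grad_fn is a hypothetical helper returning the gradient at W, and the quadratic example is made up for illustration:

```python
import numpy as np

def momentum_updates(W, grad_fn, eta=0.01, gamma=0.9, steps=100):
    V = np.zeros_like(W)                   # initial momentum V_0 = 0
    for _ in range(steps):
        V = gamma * V + eta * grad_fn(W)   # V_t = gamma V_{t-1} + eta * gradient
        W = W - V                          # W_t = W_{t-1} - V_t
    return W

# example: minimize f(W) = ||W||^2 / 2, whose gradient is W
print(momentum_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```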
3.3 AdaDelta / AdaGrad
Here is an intuitive way to understand this:
"Big Steps when Flat, Small Steps when Steep"
Just like the example provided in the beginning of this section!
Definition of Weight Update with AdaDelta (elementwise, for each weight):
$G_t = \gamma\, G_{t-1} + (1 - \gamma)\, g_t^2, \qquad W_t = W_{t-1} - \dfrac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$
where
- $g_t$: the gradient of the weight, or the component of the gradient for this weight
- $G_t$: the weighted moving average of the squared gradient of the weight; this is the measure of curvature (flat / steep)
  When $G_t$ is large, the curvature is high: scale down the step size in the weight update
  When $G_t$ is small, the curvature is low: scale up the step size in the weight update
- $\epsilon$: a very small "safeguard" in case the step size goes crazy when $G_t \approx 0$
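A minimal elementwise sketch of this adaptive update, with the same hypothetical grad_fn convention as above:

```python
import numpy as np

def adadelta_updates(W, grad_fn, eta=0.1, gamma=0.9, eps=1e-8, steps=100):
    G = np.zeros_like(W)                       # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(W)
        G = gamma * G + (1.0 - gamma) * g**2   # curvature estimate per weight
        W = W - eta * g / np.sqrt(G + eps)     # big steps where flat, small where steep
    return W

print(adadelta_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```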
3.4 Adam
Combines the idea of Momentum and AdaDelta, by computing the Moving Averages of both Gradient and Squared Gradient
but it may actually violate the convergence conditions of SGD
Definition of Weight Update with Adam (elementwise, for each weight):
$M_t = \beta_1 M_{t-1} + (1 - \beta_1)\, g_t, \qquad V_t = \beta_2 V_{t-1} + (1 - \beta_2)\, g_t^2$
$\hat{M}_t = \dfrac{M_t}{1 - \beta_1^{\,t}}, \qquad \hat{V}_t = \dfrac{V_t}{1 - \beta_2^{\,t}}, \qquad W_t = W_{t-1} - \dfrac{\eta}{\sqrt{\hat{V}_t} + \epsilon}\, \hat{M}_t$
where
- $g_t$: the gradient of the weight, or the component of the gradient for this weight
- $M_t$: the weighted moving average of the gradient of the weight, the "nominal" mean
- $V_t$: the weighted moving average of the squared gradient of the weight, the "nominal" variance
- $\hat{M}_t$, $\hat{V}_t$: the corrected "mean" & "variance"; if the "nominal" mean & variance used zero initialization ($M_0 = V_0 = 0$) without being corrected, then their values would always be too small in the early steps!
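A minimal elementwise sketch of the Adam update, again with a hypothetical grad_fn:

```python
import numpy as np

def adam_updates(W, grad_fn, eta=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    M = np.zeros_like(W); V = np.zeros_like(W)
    for t in range(1, steps + 1):
        g = grad_fn(W)
        M = b1 * M + (1 - b1) * g            # moving average of the gradient ("mean")
        V = b2 * V + (1 - b2) * g**2         # moving average of the squared gradient ("variance")
        M_hat = M / (1 - b1**t)              # bias correction for zero initialization
        V_hat = V / (1 - b2**t)
        W = W - eta * M_hat / (np.sqrt(V_hat) + eps)
    return W

print(adam_updates(np.array([5.0, -3.0]), grad_fn=lambda W: W))
```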
4. CNN
Convolutional Neural Network (CNN) is a classic architecture for Image Processing