Machine Learning - NN



1. Step Function as Features

A linear classifier in 2D-space separates a group of training data with features $x^{(i)}=[x_1^{(i)},x_2^{(i)}]^T$ and labels $y^{(i)}\in\{0,1\}$ into two groups. If we plot the separated training data in 3D-space, with the label as the third axis, what we get is a step function in 3D-space

linear classifier

A step function can itself be used as a feature. The difference from the previous case is that this time the feature also has parameters of its own. Consider the two step functions below

$\phi_i(x)$ evaluates to 1 if the condition inside the curly braces is true, and to 0 otherwise

step function

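where

  • $\phi_i(x)$ is a step function of $x=[x_1,x_2]^T\in\mathbb{R}^2$

  • $y=w_i^Tx+w_{0,i}$ is the hyperplane of a linear classifier

Construct a feature vector for a data point out of the above two step functions

$\phi(x)=\begin{bmatrix}\phi_1(x)\\\phi_2(x)\end{bmatrix}\in\mathbb{R}^2$

Then we implement linear classification in 3D-space with this feature vector

$z=\theta^T\phi(x)+\theta_0=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}^T\begin{bmatrix}\phi_1(x)\\\phi_2(x)\end{bmatrix}+\theta_0=\theta_1\phi_1(x)+\theta_2\phi_2(x)+\theta_0$

We could assign a set of weight values and visualize this classification hypothesis

$z=\theta_1\phi_1(x)+\theta_2\phi_2(x)+\theta_0=1\cdot\phi_1(x)+1\cdot\phi_2(x)+(-0.5)$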

step linear combination
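The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each step function is taken to be the indicator of $w_i^Tx+w_{0,i}>0$, the offset is taken as $-0.5$, and the feature parameters in `FEATURES` are hypothetical values chosen for the example.

```python
def step_feature(w, w0, x):
    """phi_i(x): 1 if w^T x + w0 > 0, else 0 (assumed form of the step function)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0 else 0

# Hypothetical parameters (w_i, w_{0,i}) of the two step-function features
FEATURES = [([1.0, 0.0], -1.0),   # phi_1 fires when x1 > 1
            ([0.0, 1.0], -1.0)]   # phi_2 fires when x2 > 1
THETA, THETA0 = [1.0, 1.0], -0.5  # weights of the linear classifier on phi(x)

def hypothesis(x):
    """z = theta^T phi(x) + theta_0."""
    phi = [step_feature(w, w0, x) for w, w0 in FEATURES]
    return sum(t * p for t, p in zip(THETA, phi)) + THETA0

def classify(x):
    """Threshold z at 0 to get a label in {0, 1}."""
    return 1 if hypothesis(x) > 0 else 0
```

With these values, a point is labeled 1 as soon as at least one of the two step features fires.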

If we add a third feature like this

$\phi(x)=\begin{bmatrix}\phi_1(x)\\\phi_2(x)\\\phi_3(x)\end{bmatrix}\in\mathbb{R}^3$

3 features

We can see that there are some overlaps in the visualization of this hypothesis, but since we have fixed the threshold value at 0, the overlaps do not affect the classification results

In theory, though, we could be more "confident" when classifying training data lying in those regions

overlapped hypothesis

[TIP] To better understand the meaning of this process, think about it in some context:

We want to know which planets are suitable for human immigration, and the most decisive elements are

  • $x_1$: precipitation

  • $x_2$: temperature

Then we can interpret the three features given

  • $\phi_1(x)$: we don't want to immigrate if there's too much rain

  • $\phi_2(x)$: we don't want to immigrate if it's too hot

  • $\phi_3(x)$: we don't want to immigrate if it's too cold

So finally, in the top view graph on the right hand side, only planets lying in the green zone are habitable

 

 

 

2. Hypothesis

The process in 1. describes a Neural Network that has the following two layers

  1. Construct the Features

  2. Assign the Labels

A superscript in parentheses (in the form $L^{(i)}$) is used to represent the index of a layer

$i=1$ means we are at layer 1

  • 1st Layer, Construct the Features

    • Input

      One Single Data Point of Features

      where $m^{(1)}$ is the number of Features of a data point

      in our case, $m^{(1)}=2$

      $x=\left[x_1,\dots,x_{m^{(1)}}\right]^T\in\mathbb{R}^{m^{(1)}}$

      1st layer input

    • Output

      Vector of (Step Function) Features

      where $n^{(1)}$ is the number of (Step Function) Features

      in our case, $n^{(1)}=3$

      $A^{(1)}=\left[A_1^{(1)},\dots,A_{n^{(1)}}^{(1)}\right]^T\in\mathbb{R}^{n^{(1)}}$

      1st layer output

    • Computation

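      • The $i$th feature

        $A_i^{(1)}=f^{(1)}\left(w_i^{(1)T}x+w_{0,i}^{(1)}\right)$

        where the dimension of $w_i^{(1)}$ should be the same as that of $x$

        $w_i^{(1)}=\left[w_{i[1]}^{(1)},\dots,w_{i[m^{(1)}]}^{(1)}\right]^T\in\mathbb{R}^{m^{(1)}},\quad w_{0,i}^{(1)}\in\mathbb{R}$
      • All features at once

        $A^{(1)}=f^{(1)}\left(W^{(1)T}x+W_0^{(1)}\right)$

        where the $w_i^{(1)}$ are horizontally stacked (as columns) and the $w_{0,i}^{(1)}$ are vertically stacked

        $W^{(1)}=\left[w_1^{(1)}\ \cdots\ w_{n^{(1)}}^{(1)}\right]\in\mathbb{R}^{m^{(1)}\times n^{(1)}},\quad W_0^{(1)}=\left[w_{0,1}^{(1)},\dots,w_{0,n^{(1)}}^{(1)}\right]^T\in\mathbb{R}^{n^{(1)}}$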
  • 2nd Layer, Assign the Labels

    • Input

      Vector of (Step Function) Features, the output of the 1st layer,

      where $m^{(2)}=n^{(1)}$ is the number of (Step Function) Features

      in our case, $m^{(2)}=n^{(1)}=3$

      $A^{(1)}=\left[A_1^{(1)},\dots,A_{n^{(1)}}^{(1)}\right]^T\in\mathbb{R}^{m^{(2)}}=\mathbb{R}^{n^{(1)}}$

      2nd layer input

    • Output

      Vector of Labels

      where $n^{(2)}$ is the number of labels

      in our case, $n^{(2)}=1$ as we are just trying to decide "yes" or "no"

      $A^{(2)}=\left[A_1^{(2)},\dots,A_{n^{(2)}}^{(2)}\right]^T\in\mathbb{R}^{n^{(2)}}$

      2nd layer output

      Together, these two layers form a Neural Network, as shown in the graph above

    • Computation

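      • The $i$th label

        $A_i^{(2)}=f^{(2)}\left(w_i^{(2)T}A^{(1)}+w_{0,i}^{(2)}\right)$

        where

        $w_i^{(2)}=\left[w_{i[1]}^{(2)},\dots,w_{i[m^{(2)}]}^{(2)}\right]^T\in\mathbb{R}^{m^{(2)}}=\mathbb{R}^{n^{(1)}},\quad w_{0,i}^{(2)}\in\mathbb{R}$
      • All labels at once

        $A^{(2)}=f^{(2)}\left(W^{(2)T}A^{(1)}+W_0^{(2)}\right)$

        where the $w_i^{(2)}$ are horizontally stacked (as columns) and the $w_{0,i}^{(2)}$ are vertically stacked

        $W^{(2)}=\left[w_1^{(2)}\ \cdots\ w_{n^{(2)}}^{(2)}\right]\in\mathbb{R}^{m^{(2)}\times n^{(2)}}=\mathbb{R}^{n^{(1)}\times n^{(2)}},\quad W_0^{(2)}=\left[w_{0,1}^{(2)},\dots,w_{0,n^{(2)}}^{(2)}\right]^T\in\mathbb{R}^{n^{(2)}}$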
  • Summary

    • The 1st layer

      $A^{(1)}=f^{(1)}\left(W^{(1)T}x+W_0^{(1)}\right)$

    • The 2nd layer

      $A^{(2)}=f^{(2)}\left(W^{(2)T}A^{(1)}+W_0^{(2)}\right)$

    • The Whole Process

      $A^{(2)}=NN(x;W,W_0)$
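The two-layer hypothesis can be sketched as a forward pass in plain Python. This is a minimal sketch, not a definitive implementation: the weight values are hypothetical, matrices are stored as lists of rows, and the sigmoid is used for both $f^{(1)}$ and $f^{(2)}$ purely as an assumption for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, W0, a_prev, f):
    """Compute A = f(W^T a_prev + W0), with W given as m rows of n columns."""
    n = len(W0)
    z = [sum(W[r][c] * a_prev[r] for r in range(len(a_prev))) + W0[c]
         for c in range(n)]
    return [f(zc) for zc in z]

def nn(x, params):
    """Apply each (W, W0, f) layer in order; params is a list of such triples."""
    a = x
    for W, W0, f in params:
        a = layer(W, W0, a, f)
    return a

# Hypothetical weights: m(1)=2 inputs, n(1)=3 features, n(2)=1 label
W1 = [[1.0, -1.0, 0.5],
      [0.5, 2.0, -1.0]]           # shape m(1) x n(1)
W01 = [0.0, 0.1, -0.2]            # shape n(1)
W2 = [[1.0], [-1.0], [0.5]]       # shape n(1) x n(2)
W02 = [0.0]                       # shape n(2)
PARAMS = [(W1, W01, sigmoid), (W2, W02, sigmoid)]

prediction = nn([1.0, 2.0], PARAMS)   # a list holding the single output A_1^(2)
```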

 

 

 

3. Function Graph

The following is a function graph created based on the hypothesis we developed (see 2.)

We have only one label to predict, so the output is just $A_1^{(2)}$

function graph

Explanation

  • This is a Fully Connected, Feed-Forward neural network

    • Fully Connected:

      A fully connected layer means that every neuron in the layer receives the same inputs (all outputs of the previous layer)

      A fully connected network is composed of fully connected layers

    • Feed-Forward:

      Well, literally, all arrows are pointing forward

      (There are some NNs with arrows not pointing forward, like RNN)

  • Circles in the graph mean function evaluation. The summation signs and $Z_i^{(n)}$ inside refer to

    • 1st layer:

      $Z_i^{(1)}=w_i^{(1)T}x+w_{0,i}^{(1)}$
    • 2nd layer:

      $Z_i^{(2)}=w_i^{(2)T}A^{(1)}+w_{0,i}^{(2)}$

    Important Notation:

    $Z^{(l)}$ is the "Linear Part of the $l$th layer"; using the notation in 2., $Z^{(l)}\in\mathbb{R}^{n^{(l)}}$

 

 

3.1 "Neuron"

Neurons/Units/Nodes are basic elements of a Neural Network

The following graph is a neuron taken from the 1st layer

neuron

It represents the $i$th feature with the equation

$A_i^{(1)}=f^{(1)}\left(Z_i^{(1)}\right)=f^{(1)}\left(w_i^{(1)T}x+w_{0,i}^{(1)}\right)$

 

 

3.2 "Layer"

Multiple Neurons compose a layer

  • Annotations

    annotated neural network

  • Simplified Version

    with 3 features $\phi_1(x)$, $\phi_2(x)$, $\phi_3(x)$, each dependent on the 2 inputs $x_1$, $x_2$

    simplified

  • Multiple (Hidden) Layers

    for example, suppose you have $L$ layers in a Neural Network

    multiple layers

    Then each layer could be represented with

    $A^{(l)}=f^{(l)}\left(Z^{(l)}\right)=f^{(l)}\left(W^{(l)T}A^{(l-1)}+W_0^{(l)}\right)$

    where $l\in\{1,\dots,L\}$

 

 

 

4. Activation Function

4.1 Example

  • Identity Function

    "Do Nothing"

    $f(z)=z$
  • Sigmoid Function

    For classification, squashes values into the range $(0,1)$

    $f(z)=\sigma(z)=\frac{1}{1+e^{-z}}$
  • Hyperbolic Tangent

    For classification, squashes values into the range $(-1,1)$

    $f(z)=\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$
  • Rectified Linear Unit

    Similar to the Identity Function, but zeroes out all values less than 0

    It is non-differentiable at the single point $z=0$, but inputs rarely land exactly on that point

    $f(z)=ReLU(z)=\begin{cases}0&\text{for }z<0\\z&\text{otherwise}\end{cases}$

 

activation func

  • [Special] Softmax Function

    A popular choice for multi-class classification, see Special Topics - 1. for more information
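The four activation functions listed above can be written directly from their formulas. This is a small sketch using only the standard library; the function names are chosen here for illustration.

```python
import math

def identity(z):
    """f(z) = z."""
    return z

def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z)), values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """f(z) = (e^z - e^(-z)) / (e^z + e^(-z)), values in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """f(z) = 0 for z < 0, z otherwise."""
    return 0.0 if z < 0 else z
```

Writing `tanh` from the exponentials mirrors the formula in the notes; in practice `math.tanh` is the more numerically robust choice for large inputs.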

 

 

4.2 Function Selection

Technically, we cannot use Step Function as Activation Function due to the following concerns

  • Derivatives are either 0 or undefined

    Cannot apply (Stochastic) Gradient Descent

  • Cannot take a range of values

    Cannot apply Regressions

  • Value is completely binary with no values in between

    Cannot apply NLL Loss

The Hypothesis of the Neural Network in our example is (see 2.)

$h(x;W,W_0)=NN(x;W,W_0)\quad\text{with}\quad\begin{cases}\text{1st layer: }A^{(1)}=f^{(1)}\left(W^{(1)T}x+W_0^{(1)}\right)\\\text{2nd layer: }A^{(2)}=f^{(2)}\left(W^{(2)T}A^{(1)}+W_0^{(2)}\right)\end{cases}$

You can choose different Activation Functions based on what you want to do

  • Regression

    If $f^{(1)}$ can take on a range of values, then you can just use the Identity Function as $f^{(2)}$

    $f^{(2)}(z)=z$
  • NLL Loss

    The Sigmoid Function returns values in $(0,1)$, so...

    $f^{(2)}(z)=\sigma(z)$
  • (Stochastic) Gradient Descent

    Both of the above functions for $f^{(2)}$ have non-zero derivatives

    $f_1^{(2)}(z)=z,\quad f_2^{(2)}(z)=\sigma(z)$

    We also need to make sure $f^{(1)}$ is differentiable, like...

    $f_1^{(1)}(z)=\sigma(z),\quad f_2^{(1)}(z)=\tanh(z),\quad f_3^{(1)}(z)=ReLU(z)$

[DEMO] Choosing different activation functions could generate different hyperplanes, for example

  • $f^{(1)}(z)=\sigma(z)$

    f1

  • $f^{(2)}(z)=z$ or $f^{(2)}(z)=\sigma(z)$

    f2

 

 

 

5. How to Train

A general form of the Learning (or Optimization) Objective is

$J(W,W_0)=\frac{1}{n}\sum_{i=1}^{n}L\left(NN(x^{(i)};W,W_0),y^{(i)}\right)$

[Theorem] If the optimization objective is nice and convex, then (Stochastic) Gradient Descent performs well

BUT! Unfortunately, the objective of a Neural Network is a SUPER NON-CONVEX function of the parameters !!!

Except for a 1-layer NN...

 

 

5.1 Back-Propagation

For Supervised Learning, what we want to do is

$\min_{W,W_0}J(W,W_0)$

Since (Stochastic) Gradient Descent is usually preferred, what we need is

$\nabla_{W,W_0}L\left(NN(x^{(i)};W,W_0),y^{(i)}\right)$

And, yeah, Back-Propagation is a very elegant way to find Gradients!

 

5.1.1 Some Reminders
  • Chain Rule

    Back-Prop is based on the Chain Rule

    $\frac{da}{dc}=\frac{da}{db}\cdot\frac{db}{dc}$
  • Weight Hiding

    The Bias Parameter $W_0$ can be folded into $W$ to form a general weight, by appending a "1" to the end of the layer input $A^{(l-1)}$

    In this way the formula of the linear part of the $l$th layer simplifies to

    $Z^{(l)}=W^{(l)T}A^{(l-1)}$

    This is the notation we are using in this section

    (See Machine Learning - Basics 1.6.5)

  • Gradient with respect to What???

    Always find the Gradient of the Loss function with respect to the Weights of the $l$th layer $W^{(l)}$

    That tells us how much we want to change the weights in order to reduce the loss

 

5.1.2 Computation Analysis

Assume the Neural Network has L layers

  • [How Training Loss depends on Weights of the Last Layer]

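    Loss Function with the part of the $L$th Layer

    $Loss=L\left(A^{(L)},y^{(i)}\right),\quad A^{(L)}=f^{(L)}\left(Z^{(L)}\right),\quad Z^{(L)}=W^{(L)T}A^{(L-1)}$

    Gradient Computation with respect to the $L$th Layer

    $\frac{\partial Loss}{\partial W^{(L)}}=\underbrace{\frac{\partial Loss}{\partial A^{(L)}}}_{n^{(L)}\times 1}\cdot\underbrace{\frac{\partial A^{(L)}}{\partial Z^{(L)}}}_{n^{(L)}\times n^{(L)}}\cdot\underbrace{\frac{\partial Z^{(L)}}{\partial W^{(L)}}}_{m^{(L)}\times 1}$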

    Analysis

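    • First Term $\frac{\partial Loss}{\partial A^{(L)}}$

      According to 2., $A^{(L)}\in\mathbb{R}^{n^{(L)}}$

      It's just the derivative of the Loss, which depends on the choice of Loss Function

    • Second Term $\frac{\partial A^{(L)}}{\partial Z^{(L)}}$

      According to 3., $Z^{(L)}\in\mathbb{R}^{n^{(L)}}$

      We know each $A_i^{(L)}$ only depends on the corresponding $Z_i^{(L)}$, so this partial derivative is a diagonal matrix of size $n^{(L)}\times n^{(L)}$

      $\frac{\partial A^{(L)}}{\partial Z^{(L)}}=\begin{bmatrix}\frac{\partial A_1^{(L)}}{\partial Z_1^{(L)}}&0&\cdots&0\\0&\frac{\partial A_2^{(L)}}{\partial Z_2^{(L)}}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\frac{\partial A_{n^{(L)}}^{(L)}}{\partial Z_{n^{(L)}}^{(L)}}\end{bmatrix}$
    • Third Term $\frac{\partial Z^{(L)}}{\partial W^{(L)}}$

      According to 2., $A^{(L-1)}\in\mathbb{R}^{m^{(L)}}$

      If you just look at the equation of $Z^{(L)}$, you can see that this value is just

      $\frac{\partial Z^{(L)}}{\partial W^{(L)}}=A^{(L-1)}$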
    • Proper Computation Sequence

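      Obviously the dimensions of the terms of the above equation do not match

      The actual computation sequence is the arrangement below

      $\underbrace{\frac{\partial Loss}{\partial W^{(L)}}}_{m^{(L)}\times n^{(L)}}=\underbrace{\frac{\partial Z^{(L)}}{\partial W^{(L)}}}_{m^{(L)}\times 1}\cdot\underbrace{\left(\underbrace{\frac{\partial A^{(L)}}{\partial Z^{(L)}}}_{n^{(L)}\times n^{(L)}}\cdot\underbrace{\frac{\partial Loss}{\partial A^{(L)}}}_{n^{(L)}\times 1}\right)^T}_{1\times n^{(L)}}$

      Then we could find this nice and neat general notation

      $\underbrace{\frac{\partial Loss}{\partial W^{(L)}}}_{m^{(L)}\times n^{(L)}}=\underbrace{A^{(L-1)}}_{m^{(L)}\times 1}\cdot\underbrace{\left(\frac{\partial Loss}{\partial Z^{(L)}}\right)^T}_{1\times n^{(L)}}$

      By the above equation, we could see that

      since we have already found $A^{(l-1)}$ in the Forward Pass, we just need to find $\frac{\partial Loss}{\partial Z^{(l)}}$ to compute the gradient!!!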

      Yeah!

  • [How Training Loss depends on Weights of the First Layer]

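    It would be too tedious to keep expanding the chain of partial derivatives above...

    We can focus on finding $\frac{\partial Loss}{\partial Z^{(1)}}$, which gives the third line of the following equation

    $\frac{\partial Loss}{\partial W^{(1)}}=\frac{\partial Loss}{\partial A^{(L)}}\cdot\frac{\partial A^{(L)}}{\partial Z^{(L)}}\cdot\frac{\partial Z^{(L)}}{\partial A^{(L-1)}}\cdot\frac{\partial A^{(L-1)}}{\partial Z^{(L-1)}}\cdots\frac{\partial A^{(2)}}{\partial Z^{(2)}}\cdot\frac{\partial Z^{(2)}}{\partial A^{(1)}}\cdot\frac{\partial A^{(1)}}{\partial Z^{(1)}}\cdot\frac{\partial Z^{(1)}}{\partial W^{(1)}}$

    $=\left(\frac{\partial Loss}{\partial A^{(L)}}\cdot\frac{\partial A^{(L)}}{\partial Z^{(L)}}\cdot W^{(L)}\cdot\frac{\partial A^{(L-1)}}{\partial Z^{(L-1)}}\cdots\frac{\partial A^{(2)}}{\partial Z^{(2)}}\cdot W^{(2)}\cdot\frac{\partial A^{(1)}}{\partial Z^{(1)}}\right)\cdot x$

    $=\frac{\partial Loss}{\partial Z^{(1)}}\cdot x$

    where $x$ is the initial data point; in the notation of 2. it plays the role of "$A^{(0)}$"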

 

5.1.3 General Equation
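  • [Back-Propagation to $l$th Layer]

    Dimensions of terms match properly in this version

    $\frac{\partial Loss}{\partial Z^{(l)}}=\frac{\partial A^{(l)}}{\partial Z^{(l)}}\cdot W^{(l+1)}\cdot\frac{\partial A^{(l+1)}}{\partial Z^{(l+1)}}\cdots W^{(L-1)}\cdot\frac{\partial A^{(L-1)}}{\partial Z^{(L-1)}}\cdot W^{(L)}\cdot\frac{\partial A^{(L)}}{\partial Z^{(L)}}\cdot\frac{\partial Loss}{\partial A^{(L)}}$

    $\frac{\partial Loss}{\partial W^{(l)}}=A^{(l-1)}\left(\frac{\partial Loss}{\partial Z^{(l)}}\right)^T$
  • [Summary]

    • Forward Pass

      For layer $l=1\to L$ (starting from $A^{(0)}=x$)

      Compute all $A^{(l)}$, $Z^{(l)}$, and the Loss

    • Back-Propagation

      For layer $l=L\to 1$

      Compute all gradients $\frac{\partial Loss}{\partial W^{(l)}}$ and update the weights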

    back-prop
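The forward pass and back-propagation summarized above can be sketched for a small two-layer network. The setup is an assumption made for illustration: biases are omitted (as in the weight-hiding notation), the hidden layer uses a sigmoid, the output layer uses the identity, and the loss is the squared loss. The helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mat_t_vec(W, a):
    """W^T a for W stored as m rows of n columns."""
    return [sum(W[r][c] * a[r] for r in range(len(a))) for c in range(len(W[0]))]

def mat_vec(W, d):
    """W d for W stored as m rows of n columns."""
    return [sum(W[r][c] * d[c] for c in range(len(d))) for r in range(len(W))]

def outer(a, d):
    """a d^T, the m x n gradient matrix."""
    return [[ai * dj for dj in d] for ai in a]

def forward(x, W1, W2):
    a1 = [sigmoid(z) for z in mat_t_vec(W1, x)]   # A(1) = sigmoid(Z(1))
    a2 = mat_t_vec(W2, a1)                        # A(2) = Z(2), identity output
    return a1, a2

def loss(a2, y):
    return sum((a - t) ** 2 for a, t in zip(a2, y))

def backprop(x, y, W1, W2):
    a1, a2 = forward(x, W1, W2)
    # dLoss/dZ(2): dA/dZ is the identity, dLoss/dA = 2 (A - y)
    dz2 = [2.0 * (a - t) for a, t in zip(a2, y)]
    # dLoss/dZ(1) = dA(1)/dZ(1) . W(2) . dLoss/dZ(2); sigmoid' = a (1 - a)
    back = mat_vec(W2, dz2)
    dz1 = [a * (1.0 - a) * b for a, b in zip(a1, back)]
    # dLoss/dW(l) = A(l-1) (dLoss/dZ(l))^T
    return outer(x, dz1), outer(a1, dz2)
```

A quick way to gain confidence in such a sketch is to compare one entry of the returned gradient against a finite-difference estimate of the loss.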

 

 

5.2 Stochastic GD

Here is the pseudo-code implementation of a Neural Network using Stochastic Gradient Descent

SGD NN

 

5.2.1 Weight Initialization
  • Random

    Different layers of the Neural Network address different things, so do not start everything at the same weights!

    Otherwise this will lead you to some Local Optimum rather than the Global Optimum!

  • Not too Extreme

    1. Consider a Sigmoid as the activation function: if the weights are too big, the input to the Sigmoid will also be very big, which means the gradient will be almost 0!

    2. Consider a ReLU as the activation function: if the weights are too small (very negative), the input to the ReLU will lie in the left half-plane, where the gradient is 0!

    3. Worst Case:

      Extremely Big → Gradient Explodes

      Extremely Near 0 → Gradient Vanishes

      (see 5. for more information)

  • Generally, keep the initial weights as small random values to ensure the Gradients are non-zero

 

5.2.2 Loss and f(L)

Some common combinations of Loss Function with their respective Last Layer Activation Function

  • Squared Loss → Identity

    (need real numbers)

  • NLL Loss → Sigmoid

    (need two-class probability)

  • NLLM Loss → Softmax

    (need multi-class probability)

 

 

5.3 Mini-Batch GD

"Average" the benefits of Batch GD and Stochastic GD

by selecting k data points at random from dataset to compute gradients

"Mini-Batch" = "Epoch"

but people use the term "Epoch" differently in other cases, so ...

Comparisons on Weight Update Equations with the following conditions

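  • By 5.1.1, weight hiding $\hat{W}=(W,W_0)$

  • Assume dataset of size $n$

  • Assume mini-batch of size $k$

Batch Gradient Descent: $\hat{W}=\hat{W}-\eta\sum_{i=1}^{n}\nabla_WL\left(NN(x^{(i)};\hat{W}),y^{(i)}\right)$

Mini-Batch Gradient Descent: $\hat{W}=\hat{W}-\eta\sum_{i=1}^{k}\nabla_WL\left(NN(x^{(i)};\hat{W}),y^{(i)}\right)$

Stochastic Gradient Descent: $\hat{W}=\hat{W}-\eta\nabla_WL\left(NN(x^{(i)};\hat{W}),y^{(i)}\right)$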
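One common way to obtain the $k$ points for each mini-batch update is to shuffle the dataset once and then walk through it in chunks of size $k$. The sketch below assumes the dataset is a plain Python list; the function name `minibatches` and the explicit `rng` argument are choices made here for illustration.

```python
import random

def minibatches(data, k, rng):
    """Shuffle the indices once, then yield consecutive chunks of size k.

    The last chunk may be smaller than k when len(data) is not a multiple of k.
    """
    idx = list(range(len(data)))
    rng.shuffle(idx)
    for start in range(0, len(idx), k):
        yield [data[i] for i in idx[start:start + k]]

# Usage: one pass over a toy dataset of 10 points with k = 4
batches = list(minibatches(list(range(10)), 4, random.Random(0)))
```

Every data point appears in exactly one mini-batch per pass, which plain random sampling with replacement would not guarantee.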

[ Note ]

"Randomly select k data points" is hard, while "Randomly shuffle dataset & iterate" is easy

 

 

5.4 Regularization

If you keep training without stopping, the training loss keeps going down, but the validation loss starts going up at some point due to overfitting, and this is what Regularization deals with

 

5.4.1 Weight Decay

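Add a (Ridge Regression) Regularizer / Penalty $R(W)=\lambda\|W\|^2$ with $\lambda\ge 0$ to the Objective Function

(Examples / Theories in Machine Learning - Basics 3.2.4 / 3.3.4)

Then the general Objective Function becomes

$J(W)=\frac{1}{n}\sum_{i=1}^{n}L\left(NN(x^{(i)};W),y^{(i)}\right)+\lambda\|W\|^2$

Assume using SGD ($n=1$), then the gradient of the Objective Function is

$\nabla_WJ(W)=\nabla_W\left\{\frac{1}{n}\sum_{i=1}^{n}L\left(NN(x^{(i)};W),y^{(i)}\right)+\lambda\|W\|^2\right\}=\nabla_W\left\{L\left(NN(x^{(i)};W),y^{(i)}\right)+\lambda\|W\|^2\right\}=\nabla_WL\left(NN(x^{(i)};W),y^{(i)}\right)+2\lambda W$

Therefore the weight update equation becomes

$W_t=W_{t-1}-\eta\nabla_WJ(W)=W_{t-1}-\eta\left\{\nabla_WL\left(NN(x^{(i)};W_{t-1}),y^{(i)}\right)+2\lambda W_{t-1}\right\}=W_{t-1}-2\eta\lambda W_{t-1}-\eta\nabla_WL\left(NN(x^{(i)};W_{t-1}),y^{(i)}\right)$

$W_t=(1-2\eta\lambda)W_{t-1}-\eta\nabla_WL\left(NN(x^{(i)};W_{t-1}),y^{(i)}\right)$

Recall that the weight update equation without the regularizer is

$W_t=W_{t-1}-\eta\nabla_WL\left(NN(x^{(i)};W_{t-1}),y^{(i)}\right)$

You can see that the difference is that $W_{t-1}$ now has a Decay Coefficient $(1-2\eta\lambda)$ to make sure the weight is not too big

Here we assume that small weights are preferable because they help the model generalize better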
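The equivalence between "add the gradient of the penalty" and "shrink the weights by a decay coefficient" can be checked numerically. This is a small sketch with weights and gradients stored as flat lists; the function names are hypothetical.

```python
def sgd_step_with_penalty(w, grad, eta, lam):
    """W_t = W_{t-1} - eta * (grad_L + 2 * lam * W_{t-1})."""
    return [wi - eta * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_with_decay(w, grad, eta, lam):
    """W_t = (1 - 2 * eta * lam) * W_{t-1} - eta * grad_L."""
    decay = 1.0 - 2.0 * eta * lam
    return [decay * wi - eta * gi for wi, gi in zip(w, grad)]

# Both forms give the same update for illustrative values
w, grad = [1.0, -2.0, 0.5], [0.3, -0.1, 0.0]
a = sgd_step_with_penalty(w, grad, eta=0.1, lam=0.01)
b = sgd_step_with_decay(w, grad, eta=0.1, lam=0.01)
```

Note that the last weight has zero loss gradient and still shrinks toward 0, which is exactly the "decay" effect.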

 

5.4.2 Batch Normalization

This deals with Covariate Shift, which assumes that

  • The Distribution of the Input $P(x)$ is changed, or "shifted"

    Ex. For a Cat Classifier

    • Input Dataset 1 = cats in the daytime

    • Input Dataset 2 = cats at night

  • The Labeling Function $F(x,y)$ stays the same

    The cats in the above 2 datasets are from the same group

    Hence the classification results should correspond

What you need to do is just normalize the input data $X^{(i)}$ for the $j$th Mini-Batch at the $i$th layer of the Neural Network, using the mean $\mu_{MB}$ and standard deviation $\sigma_{MB}$ of that Mini-Batch

$\hat{X}_j^{(i)}=\frac{X_j^{(i)}-\mu_{MB}}{\sigma_{MB}}$
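The normalization step can be sketched as follows, assuming the mini-batch is a list of equally long feature vectors and that the mean and standard deviation are computed per feature over the mini-batch. The small `eps` added under the square root is an assumption made here to avoid division by zero; it is not part of the formula above.

```python
import math

def normalize_batch(batch, eps=1e-8):
    """Subtract the per-feature mini-batch mean and divide by the per-feature std."""
    k, d = len(batch), len(batch[0])
    mu = [sum(row[j] for row in batch) / k for j in range(d)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / k for j in range(d)]
    return [[(row[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(d)]
            for row in batch]

# Two features on very different scales end up on the same scale
normalized = normalize_batch([[1.0, 10.0], [3.0, 30.0]])
```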
5.4.3 Other Practices
  • Dropout

    On each forward pass, each node has a probability $p$ of being dropped out (by setting the corresponding weight $W=0$)

    The purpose is to make each node replaceable

    Nodes are just like members of a team with different capacities: you need to make sure that they can partially cover each other's specializations, so that when someone is absent due to sickness, the rest of the team can still make some progress.

    dropout

  • Perturbing Data

    Messing with the Training Data a bit, with a Normal Distribution or other stuff...

    $\hat{x}^{(i)}=x^{(i)}+N\quad\text{where }N\sim Gaussian(\mu=0,\ \sigma^2\ \text{small})$

    Sort of a form of Data Augmentation
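Both practices can be sketched with the standard library's random number generator. The sketch zeroes a dropped node's output, which has the same effect on the next layer as zeroing its outgoing weights; it does not include any rescaling of the surviving activations, and the function names are hypothetical.

```python
import random

def dropout(activations, p, rng):
    """Drop each node independently with probability p by zeroing its output."""
    return [0.0 if rng.random() < p else a for a in activations]

def perturb(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each feature."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

rng = random.Random(0)                          # seeded for reproducibility
hidden = dropout([0.2, 0.9, 0.5, 0.7], p=0.5, rng=rng)
noisy = perturb([1.0, 2.0], sigma=0.01, rng=rng)
```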

 

 

 

 

 

Special Topics

1. Softmax Regression

Softmax is a special activation function for Multi-Class Classification. The layer where this function is applied is called the "Softmax Layer", and its output looks like the following.

Assume at the $l$th layer

  • $C=4$: 4 classes available to be classified

  • $p(\text{Class }n\mid A^{(l-1)})$: Probability of being classified to Class $n$ given $A^{(l-1)}$

$A^{(l)}=f^{(l)}\left(W^{(l)T}A^{(l-1)}+W_0^{(l)}\right)=\begin{bmatrix}p(\text{Class }1\mid A^{(l-1)})\\p(\text{Class }2\mid A^{(l-1)})\\p(\text{Class }3\mid A^{(l-1)})\\p(\text{Class }4\mid A^{(l-1)})\end{bmatrix}$

The output of the above Softmax layer $A^{(l)}$ is a $4\times 1$ matrix whose elements are probabilities with a sum of 1

$1=\sum_{i=1}^{4}p(\text{Class }i\mid A^{(l-1)})$

The name "Softmax" is relative to "Hardmax"

"Hardmax" also outputs a $4\times 1$ matrix in the example above, but only assigns "1" to the predicted class and "0"s to the rest

$Hardmax\left(Z^{(l)}\right)=\begin{bmatrix}0\\1\\0\\0\end{bmatrix}$

"Just like One-Hot encoding"

 

 

1.1 Softmax Function

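Assume the number of classes to be classified is $C$

  • Linear Part of the Softmax Layer (input to the Softmax function)

    $Z^{(l)}=W^{(l)T}A^{(l-1)}+W_0^{(l)}$
  • Softmax Activation Function

    It's kind of doing normalization, mapping input elements to probabilities

    $A^{(l)}=Softmax\left(Z^{(l)}\right)=\frac{e^{Z^{(l)}}}{\sum_{j=1}^{C}e^{Z_j^{(l)}}}$

    The probability of the $i$th Class is

    $A_i^{(l)}=\frac{e^{Z_i^{(l)}}}{\sum_{j=1}^{C}e^{Z_j^{(l)}}}\in\mathbb{R}$

    Dimensions:

    • $A^{(l)},\ e^{Z^{(l)}}\in\mathbb{R}^C$

    • $A_i^{(l)},\ e^{Z_i^{(l)}}\in\mathbb{R}$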

[Example]

Given

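  • number of classes to be classified $C=4$

  • linear part of the Softmax Layer $Z^{(l)}=[5,2,-1,3]^T$

Then calculate some essential data

$e^{Z^{(l)}}=\begin{bmatrix}e^{5}\\e^{2}\\e^{-1}\\e^{3}\end{bmatrix}=\begin{bmatrix}148.4\\7.4\\0.4\\20.1\end{bmatrix},\quad\sum_{i=1}^{4}e^{Z_i^{(l)}}=176.3$

Now we can find the output with the above

$A^{(l)}=\begin{bmatrix}148.4/176.3\\7.4/176.3\\0.4/176.3\\20.1/176.3\end{bmatrix}=\begin{bmatrix}0.842\ (84.2\%)\\0.042\ (4.2\%)\\0.002\ (0.2\%)\\0.114\ (11.4\%)\end{bmatrix}$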
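The worked example can be reproduced in a few lines, taking $Z^{(l)}=[5,2,-1,3]^T$ (the value consistent with the exponentials listed above). Subtracting the maximum before exponentiating is a standard numerical-stability step added here; it does not change the result.

```python
import math

def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    m = max(z)                              # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([5.0, 2.0, -1.0, 3.0])
rounded = [round(p, 3) for p in probs]      # [0.842, 0.042, 0.002, 0.114]
```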

 

 

1.2 How to Train

  • Loss Function for the $i$th Data Point

    Assume having 4 classes to be classified, with the following example data point

    Label: $y^{(i)}=\begin{bmatrix}0\\1\\0\\0\end{bmatrix}$, Result: $\hat{y}^{(i)}=A^{(l)(i)}=\begin{bmatrix}0.3\\0.2\\0.1\\0.4\end{bmatrix}$

    The Loss Function is defined by

    $L\left(y^{(i)},\hat{y}^{(i)}\right)=-\sum_{j=1}^{4}\left[y_j^{(i)}\log\left(\hat{y}_j^{(i)}\right)\right]$

    This function seems to have terms for all four classes, but if you look at the label matrix $y^{(i)}$, you'll notice that only one class is assigned "1" and the rest are assigned "0"s, which means only the term related to the labeled class will be left

    Based on the above example, we will have

    $-\sum_{j=1}^{4}\left[y_j^{(i)}\log\left(\hat{y}_j^{(i)}\right)\right]=-y_2^{(i)}\log\left(\hat{y}_2^{(i)}\right)=-1\cdot\log\left(\hat{y}_2^{(i)}\right)=-\log\left(\hat{y}_2^{(i)}\right)$

    Therefore, you want to make $\hat{y}_2^{(i)}$ as big as possible to minimize this loss function

    a.k.a. Make a correct prediction!

  • Objective Function for the Training Set

    Assume the Training Set has $m$ data points

    $J\left(W^{(l)},W_0^{(l)}\right)=\frac{1}{m}\sum_{i=1}^{m}L\left(y^{(i)},\hat{y}^{(i)}\right)$

    Then you can just use gradient descent or other methods to learn the parameters...
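The loss for the example data point, and the objective as its average over a training set, can be sketched as below. Terms with a zero label are skipped, which matches the observation that only the labeled class contributes; the function names are hypothetical.

```python
import math

def softmax_loss(y, y_hat):
    """-sum_j y_j * log(y_hat_j) for a one-hot label y and predicted probabilities y_hat."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

def objective(labels, predictions):
    """Average loss over the m data points of a training set."""
    losses = [softmax_loss(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)

# The example data point from the text: only class 2 contributes
example_loss = softmax_loss([0, 1, 0, 0], [0.3, 0.2, 0.1, 0.4])   # equals -log(0.2)
```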

 

 

 

2. Multi-Class NLL Loss

2.1 Two-Class Version

This is the Loss Function for Logistic Regression (see Machine Learning - Basics 3.2.3), hence it is actually a 2-Class NLL Loss

Assume $y^{(i)}$ is the labeled value for the $i$th data point, and $g^{(i)}$ is the guessed value for the $i$th data point.

If $y^{(i)}\in\{0,1\}$, then the NLL Loss function could be defined as

$L_{nll}\left(g^{(i)},y^{(i)}\right)=-\left[y^{(i)}\log\left(g^{(i)}\right)+\left(1-y^{(i)}\right)\log\left(1-g^{(i)}\right)\right]$

 

 

2.2 Multi-Class Version

The definition in 2.1 could be extended to multi-class classification directly

Assume the number of classes to be classified is $C$, using Softmax as the output layer activation function, then

Label: $y^{(i)}=\left[y_1,y_2,\dots,y_C\right]^T\in\mathbb{R}^C$, Guess: $g^{(i)}=\left[g_1,g_2,\dots,g_C\right]^T\in\mathbb{R}^C$
  • $y^{(i)}$: one-hot encoded matrix with "1" representing the labeled class

  • $g^{(i)}$: output of the Softmax Layer representing the probability distribution over $C$ classes

The probability that the Neural Network makes the proper prediction is

$\prod_{j=1}^{C}\left(g_j^{(i)}\right)^{y_j^{(i)}}$

The definition of the Multi-Class Negative-Log-Likelihood Loss Function is

$L_{nllm}\left(g^{(i)},y^{(i)}\right)=-\sum_{j=1}^{C}y_j^{(i)}\log\left(g_j^{(i)}\right)$
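To see that the multi-class definition extends the two-class one, the two-class guess $g$ and label $y$ can be rewritten as the length-2 vectors $[g, 1-g]$ and $[y, 1-y]$. The sketch below checks that both losses then agree; the function names are chosen here for illustration.

```python
import math

def nll(g, y):
    """Two-class NLL: -[y log(g) + (1 - y) log(1 - g)], for 0 < g < 1."""
    return -(y * math.log(g) + (1 - y) * math.log(1 - g))

def nllm(g, y):
    """Multi-class NLL: -sum_j y_j log(g_j), skipping zero-label terms."""
    return -sum(yj * math.log(gj) for gj, yj in zip(g, y) if yj > 0)

g = 0.8
two_class = [nll(g, y) for y in (0, 1)]
multi_class = [nllm([g, 1 - g], [y, 1 - y]) for y in (0, 1)]
```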

 

 

 

3. Adaptive Step-Size

Step size η should be independent for each weight, based on the local view of gradient update

For example

  • η should be small when gradient change is small

  • η should be large when gradient change is large

or you might miss something important...like the red-little optimum...

extreme case

 

 

3.1 Moving Average

This is the theoretical basis for further discussion of step-size

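Define a sequence of $T$ data points $a_t=a_1,a_2,\dots,a_T$

Define the weighted average of the first $t$ data points $A_t=A_0,A_1,A_2,\dots,A_T$ with the following formulation

$A_0=0,\quad A_t=\gamma_tA_{t-1}+(1-\gamma_t)a_t$

where $\gamma_t\in(0,1)$ is the weight; assume a constant $\gamma_t=\gamma$

Then the weighted average of all $T$ data points is

$A_T=\gamma A_{T-1}+(1-\gamma)a_T=\gamma\left(\gamma A_{T-2}+(1-\gamma)a_{T-1}\right)+(1-\gamma)a_T=\gamma^2A_{T-2}+\gamma^1(1-\gamma)a_{T-1}+\gamma^0(1-\gamma)a_T$

$A_T=\sum_{t=1}^{T}\gamma^{T-t}(1-\gamma)a_t$

Data points closer to the end $T$ have more impact on the weighted average $A_T$

Note that $\gamma\in(0,1)$, therefore

  • $\gamma^a\to 0$ as $a\to\infty$

  • $\gamma^a\to 1$ as $a\to 0$

[ Note ]

Setting $\gamma_t=\frac{t-1}{t}$, the equation gives the actual average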
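The recursion is short enough to check directly. In this sketch the weight is passed as a function of the step index $t$, so that both the constant case $\gamma_t=\gamma$ and the case $\gamma_t=(t-1)/t$ from the note can be tried; the function name is hypothetical.

```python
def moving_average(a, gamma):
    """Return A_T for data a_1..a_T, with A_0 = 0 and A_t = g_t A_{t-1} + (1 - g_t) a_t."""
    A = 0.0
    for t, a_t in enumerate(a, start=1):
        g = gamma(t)
        A = g * A + (1.0 - g) * a_t
    return A

data = [1.0, 2.0, 3.0, 4.0]
constant = moving_average(data, lambda t: 0.5)           # recent points weigh more
plain_mean = moving_average(data, lambda t: (t - 1) / t)  # ordinary average
```

With a constant weight of 0.5, the result can also be computed from the closed-form sum over $\gamma^{T-t}(1-\gamma)a_t$, which gives the same value.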

 

 

3.2 Momentum

SGD and MBGD randomly select one or more data points to compute gradient, hence the direction of next step is also random

"Momentum" allows more effective weight update by taking past gradients into account

Momentum = Weighted Average of Current & Past Gradients

momentum

To be more specific, in the graph below

  • Red Arrows are the final choice of update of each step

  • Blue Dots are the "current gradients" computed from the Mini-Batch at each step

momentum detail

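Definition of Weight Update with Momentum

$V_0=0,\quad V_t=\gamma V_{t-1}+\eta\nabla_WJ(W_{t-1}),\quad W_t=W_{t-1}-V_t$

To arrange the above equations into the standard form of a Moving Average, define $\eta=\eta'(1-\gamma)$

  • Momentum Update

    $V_t=\gamma V_{t-1}+\eta\nabla_WJ(W_{t-1})$

    $V_t=\gamma V_{t-1}+\eta'(1-\gamma)\nabla_WJ(W_{t-1})$

    $\frac{V_t}{\eta'}=\gamma\frac{V_{t-1}}{\eta'}+(1-\gamma)\nabla_WJ(W_{t-1})$

    Define $M_t=V_t/\eta'$, then

    $M_t=\gamma M_{t-1}+(1-\gamma)\nabla_WJ(W_{t-1})$
  • Initial Momentum

    $M_0=\frac{V_0}{\eta'}=\frac{0}{\eta'}=0$
  • Weight Update

    $W_t=W_{t-1}-\eta'\frac{V_t}{\eta'}=W_{t-1}-\eta'M_t$
  • Summary

    Now this looks exactly like updating the weights with a step-size of $\eta'$ on a weighted moving average of the gradients

    $M_0=0,\quad M_t=\gamma M_{t-1}+(1-\gamma)\nabla_WJ(W_{t-1}),\quad W_t=W_{t-1}-\eta'M_t$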
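The equivalence of the two forms can be checked numerically. The sketch below writes the rescaled step size as `eta_prime`, with `eta = eta_prime * (1 - gamma)`, and uses the toy objective $J(w)=w^2$ on a single scalar weight as an assumption for the example.

```python
def grad(w):
    """Gradient of the toy objective J(w) = w^2."""
    return 2.0 * w

def run_v_form(w, eta, gamma, steps):
    """V_t = gamma V_{t-1} + eta grad(W_{t-1});  W_t = W_{t-1} - V_t."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad(w)
        w = w - v
    return w

def run_m_form(w, eta_prime, gamma, steps):
    """M_t = gamma M_{t-1} + (1 - gamma) grad(W_{t-1});  W_t = W_{t-1} - eta_prime M_t."""
    m = 0.0
    for _ in range(steps):
        m = gamma * m + (1.0 - gamma) * grad(w)
        w = w - eta_prime * m
    return w

gamma, eta_prime = 0.9, 0.5
w_v = run_v_form(1.0, eta_prime * (1.0 - gamma), gamma, steps=20)
w_m = run_m_form(1.0, eta_prime, gamma, steps=20)
```

Both runs follow the same trajectory up to floating-point rounding.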

 

 

3.3 AdaDelta / AdaGrad

Here is an intuitive way to understand this:

"Big Steps when Flat, Small Steps when Steep"

Just like the example provided in the beginning of this section!

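Definition of Weight Update with AdaDelta

$g_{t,j}=\nabla_WJ(W_{t-1})_j=\frac{\partial J(W_{t-1})}{\partial W_j}$

$G_{t,j}=\gamma G_{t-1,j}+(1-\gamma)g_{t,j}^2$

$W_{t,j}=W_{t-1,j}-\underbrace{\frac{\eta}{\sqrt{G_{t,j}+\epsilon}}}_{\text{scaled step-size}}g_{t,j}$

where

  • $g_{t,j}$: Gradient of the $j$th Weight, or the $j$th Component of the Gradient

  • $G_{t,j}$: Weighted Moving Average of the Squared Gradient of the $j$th Weight

    This is the measure of Curvature (Flat / Steep)

    • $g_{t,j}\uparrow\ \Rightarrow\ G_{t,j}\uparrow$, curvature is high, scale down the step-size in the weight update

    • $g_{t,j}\downarrow\ \Rightarrow\ G_{t,j}\downarrow$, curvature is low, scale up the step-size in the weight update

  • $\epsilon$: a very small "Safeguard" in case the step-size goes crazy when $G_{t,j}=0$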
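One step of this update can be sketched per weight as below. It follows the moving-average-of-squared-gradients form given above (a form also commonly known as an RMSProp-style update), with the safeguard placed inside the square root as an assumption; the function name and the default `eps` are illustrative.

```python
import math

def adaptive_step(w, G, grad, eta, gamma, eps=1e-8):
    """Update each weight with a step-size scaled by its own squared-gradient average."""
    G_new = [gamma * Gj + (1.0 - gamma) * gj * gj for Gj, gj in zip(G, grad)]
    w_new = [wj - eta / math.sqrt(Gj + eps) * gj
             for wj, Gj, gj in zip(w, G_new, grad)]
    return w_new, G_new

# Two weights whose gradients differ by a factor of 100
w_new, G_new = adaptive_step([1.0, 1.0], [0.0, 0.0], [0.1, 10.0], eta=0.1, gamma=0.9)
steps_taken = [1.0 - w_new[0], 1.0 - w_new[1]]
```

Although the raw gradients differ by a factor of 100, the two weights move by almost the same amount, which is the "big steps when flat, small steps when steep" behavior.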

 

 

3.4 Adam

Combines the ideas of Momentum and AdaDelta, by computing the Moving Averages of both the Gradient and the Squared Gradient

but it may actually violate the convergence conditions of SGD

Definition of Weight Update with Adam

$g_{t,j}=\nabla_WJ(W_{t-1})_j$

$m_{t,j}=B_1m_{t-1,j}+(1-B_1)g_{t,j}$

$v_{t,j}=B_2v_{t-1,j}+(1-B_2)g_{t,j}^2$

$\hat{m}_{t,j}=\frac{m_{t,j}}{1-B_1^t},\quad\hat{v}_{t,j}=\frac{v_{t,j}}{1-B_2^t}$

$W_{t,j}=W_{t-1,j}-\frac{\eta}{\sqrt{\hat{v}_{t,j}+\epsilon}}\hat{m}_{t,j}$

where

  • $g_{t,j}$: Gradient of the $j$th Weight, or the $j$th Component of the Gradient

  • $m_{t,j}$: Weighted Moving Average of the Gradient of the $j$th Weight

    the "Nominal" Mean

  • $v_{t,j}$: Weighted Moving Average of the Squared Gradient of the $j$th Weight

    the "Nominal" Variance

  • $\hat{m}_{t,j},\hat{v}_{t,j}$: Corrected "Mean" & "Variance"

    if the "Nominal" Mean & Variance use zero initialization ($m_0=0,v_0=0$) without being corrected

    then their values would always be too small!
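One Adam step, including the bias correction, can be sketched as follows. The defaults `b1=0.9`, `b2=0.999` and `eps=1e-8` are commonly used values assumed here, and the safeguard is placed inside the square root as an assumption about the formula's layout; the function name is hypothetical.

```python
import math

def adam_step(w, m, v, grad, t, eta, b1=0.9, b2=0.999, eps=1e-8):
    """Return updated (w, m, v) after step t (t starts at 1)."""
    m_new = [b1 * mj + (1.0 - b1) * gj for mj, gj in zip(m, grad)]
    v_new = [b2 * vj + (1.0 - b2) * gj * gj for vj, gj in zip(v, grad)]
    m_hat = [mj / (1.0 - b1 ** t) for mj in m_new]   # corrected "mean"
    v_hat = [vj / (1.0 - b2 ** t) for vj in v_new]   # corrected "variance"
    w_new = [wj - eta / math.sqrt(vh + eps) * mh
             for wj, mh, vh in zip(w, m_hat, v_hat)]
    return w_new, m_new, v_new

# First step from zero-initialized averages
w1, m1, v1 = adam_step([0.0], [0.0], [0.0], [4.0], t=1, eta=0.1)
```

On the first step the uncorrected average `m1` is only a tenth of the gradient, while the corrected values make the update size close to `eta`, which illustrates why the correction matters under zero initialization.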

 

 

 

4. CNN

Convolutional Neural Network (CNN) is a classic architecture for Image Processing

see this cheatsheet