Machine Learning - Basics



1. Introduction: Linear Classifier

1.1 Training Data

Training Data is a set of data points, usually given together with their labels in advance

  • We have $n$ data points, the $i$-th one written as $x^{(i)}$

  • Each data point has a feature vector containing $d$ features, with the $d$-th feature written as $x_d^{(i)}$

  • Each data point is associated with a label value written as $y^{(i)}$

Mathematically, Training Data with $n$ data points ($D_n$) is defined as the following

$$D_n=\{(x^{(1)},y^{(1)}),\ldots,(x^{(n)},y^{(n)})\}\quad\text{where}\quad\begin{cases}x^{(i)}=[x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)}]^T\in\mathbb{R}^d\\[2pt]y^{(i)}\in\{+1,-1\}\\[2pt]i\in\{1,2,\ldots,n\}\end{cases}$$

You can set the label values $y^{(i)}$ to whatever you favor; the binary labeling above is just an arbitrary example.

Assume each data point has only 2 features ($d=2$), that is

$$x^{(i)}=[x_1^{(i)},x_2^{(i)}]^T\in\mathbb{R}^2$$

We could visualize the provided training data, where we have

  • 2 axes $x_1$/$x_2$ to represent the features

    TIP: $x_1$/$x_2$ actually mean $x_1^{(i)}$/$x_2^{(i)}$

  • colored "$+$/$-$" signs to represent the label values

2-feature

 

 

 

1.2 Hypothesis

A Hypothesis is the model yielded by learning from the training data

Or we could also say it is a way to go from data to labels

By "Hypothesis" here, we actually mean a way to label NEW data points: it is a function that takes in a feature vector and returns a label... If we adopt the example above, then the hypothesis $h$ should be

$$h:\mathbb{R}^d\to\{+1,-1\}$$

Of course, you can also pass the data points of the training data into $h$ for testing... What I imply here is that $h(x^{(i)})$ might not give the same value as $y^{(i)}$!

The Hypothesis Class $H$ is the set of all hypotheses: $h\in H$

 

 

 

1.3 Basics

1.3.1 Hypothesis Class

An example of a hypothesis class for a linear classifier is

  • $H$: all hypotheses that label $+1$ on one side of a line and $-1$ on the other side

The following two lines both represent hypotheses in $H$

  • Data points in the green shadow are predicted to be $+1$

  • Data points in the red shadow are predicted to be $-1$

TIP: The labels of the two axes actually mean $x_1^{(i)}$ and $x_2^{(i)}$

Linear Classifiers

If we test both of them with the labeled data points in the training data, the hypothesis on the left performs terribly...

 

 

1.3.2 Math: Vector Projection

This is essential for 1.3.3

The Dot Product of two vectors $a$ and $b$ is defined as below, where $\theta$ is the angle between $a$ and $b$

$$a\cdot b=\|a\|\,\|b\|\cos(\theta)$$

The amount of projection of vector $a$ onto vector $b$ in the graph below is denoted by the scalar $l$

vec proj

We could derive the formula for vector projection with

$$l=\|a\|\cos(\theta)=\frac{\|a\|\cos(\theta)\,\|b\|}{\|b\|}=\frac{\|a\|\,\|b\|\cos(\theta)}{\|b\|}\;\Rightarrow\;l=\frac{a\cdot b}{\|b\|}$$

 

 

1.3.3 Math: Hyperplane

The "line", or the Linear Classifier Hyperplane could be represented using one single vector θ in the Feature Space

Hyperplane is always 1-Dimension lower than the the space dimension (leave one dimension to do classification)

  • The hyperplane in the 2-D feature space below is 1-D, which is a line

  • The hyperplane in the 3-D feature space would be 2-D

  • ... ...

LC_math_00

Notation in the graph:

  • $\theta,x^{(i)}\in\mathbb{R}^d$

    Vectors; $d=2$ in the case above, but the theory applies to higher dimensions

  • $b\in\mathbb{R}$

    Scalar, the amount of $x^{(i)}$ projected onto $\theta$

The amount of this projection ($b$) could be calculated using the equation in the graph: the dot product of the two vectors, divided by the length of $\theta$. We can also check the dimensions:

$$b=\frac{\theta^T x^{(i)}}{\|\theta\|}\;\Rightarrow\;\frac{\mathbb{R}^{1\times d}\cdot\mathbb{R}^{d\times 1}}{\mathbb{R}}=\mathbb{R}$$

Precisely because $b$ is a PROJECTION, the above equation gives exactly the SAME result for EVERY $x^{(i)}$ on the red dotted line perpendicular to $\theta$ in the graph below

LC_math_01

[ !! WARNING !! ] For the convenience of discussion, we could consider the projection onto the opposite direction of $\theta$, and denote the magnitude of this projection by $a$

and therefore the signed distance is $-a$

LC_math_02

The hyperplane represented by the red dotted line could be expressed with the following equation

$$\Big\{x^{(i)}:\frac{\theta^T x^{(i)}}{\|\theta\|}=-a\Big\}\;\Rightarrow\;\theta^T x^{(i)}=-a\|\theta\|\;\Rightarrow\;\theta^T x^{(i)}+a\|\theta\|=0$$

If we let $a\|\theta\|=\theta_0$, then it becomes:

$$\big\{x^{(i)}:\theta^T x^{(i)}+\theta_0=0\big\}$$

 

 

1.3.4 Math: Point-to-Hyperplane Distance

Q: What is the Signed Distance from a hyperplane to an arbitrary point $x$ in 2-D space?

margin

The purple distance in the graph is what we want, and all we need to do is subtract the signed distance between the hyperplane and the origin from the projection of $x$ onto $\theta$ (a small code sketch follows these notes)

$$\text{Distance}=l-b=\frac{\theta^T x}{\|\theta\|}-\Big(-\frac{\theta_0}{\|\theta\|}\Big)=\frac{\theta^T x+\theta_0}{\|\theta\|}$$

Tip: If the hyperplane is below the origin, as in the last picture of 1.3.3, then we need to change the notation of $b$ to $-a$, but the equation stays the same (remember that $a$ is the magnitude and $-a$ is the signed distance)

$$b=-a=-\frac{\theta_0}{\|\theta\|}$$

  • When the hyperplane is above the origin

    $$\frac{\theta^T x^{(i)}}{\|\theta\|}=b\;\Rightarrow\;b\|\theta\|+\theta_0=0\;\Rightarrow\;b\|\theta\|=-\theta_0\;\Rightarrow\;b=-\frac{\theta_0}{\|\theta\|}$$

  • When the hyperplane is below the origin

    $$\frac{\theta^T x^{(i)}}{\|\theta\|}=-a\;\Rightarrow\;-a\|\theta\|+\theta_0=0\;\Rightarrow\;a\|\theta\|=\theta_0\;\Rightarrow\;a=\frac{\theta_0}{\|\theta\|}$$
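
A small sketch of this distance computation, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def signed_distance(x, theta, theta0):
    """Signed distance from the hyperplane theta^T x + theta0 = 0 to point x."""
    return (theta @ x + theta0) / np.linalg.norm(theta)
```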

 

 

1.3.5 Definition of Linear Classifier

With the Hypothesis Class and the "Line" definition, we can derive the full definition of a linear classifier hypothesis

$$h(x^{(i)})=h(x^{(i)};\theta,\theta_0)=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)=\begin{cases}+1 & \text{for }\theta^T x^{(i)}+\theta_0>0\\-1 & \text{for }\theta^T x^{(i)}+\theta_0\le 0\end{cases}$$

where $\theta,\theta_0$ are parameters
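
As a minimal sketch in Python (NumPy assumed; names like `linear_classifier` are illustrative, not from the notes):

```python
import numpy as np

def linear_classifier(x, theta, theta0):
    """h(x; theta, theta0) = sign(theta^T x + theta0).

    Returns +1 when theta^T x + theta0 > 0, otherwise -1
    (the boundary itself is labeled -1, matching the definition above).
    """
    return 1 if theta @ x + theta0 > 0 else -1

# Example: the hyperplane x1 + x2 - 1 = 0 in a 2-D feature space
theta, theta0 = np.array([1.0, 1.0]), -1.0
print(linear_classifier(np.array([2.0, 2.0]), theta, theta0))  # +1
print(linear_classifier(np.array([0.0, 0.0]), theta, theta0))  # -1
```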

 

 

 

1.4 Loss Evaluation

1.4.1 Loss Function $L(g,a)$

$L$ quantifies the loss incurred on data points, whether a single one or many

$g$ = guess and $a$ = actual

You can choose whatever rule is appropriate for your data points

  • Ex. 0-1 Loss

    Very classic, very common, very boring...

    $$L(g,a)=\begin{cases}0 & \text{if }g=a\\1 & \text{else}\end{cases}$$

  • Ex. Asymmetric Loss

    The loss value depends on how the guess $g$ deviates from the actual value $a$

    $$L(g,a)=\begin{cases}1 & \text{if }g=1,\ a=-1\\100 & \text{if }g=-1,\ a=1\\0 & \text{else}\end{cases}$$

    For example, if the model is guessing whether a patient is sick, $-1$ indicates healthy and $1$ indicates not

    • If you find someone sick ($g=1$) but they're not ($a=-1$), then they might just have to do some unnecessary check-ups; not a very serious consequence, so the loss could be as small as 1

    • But if you find someone healthy ($g=-1$) but they're not ($a=1$), then... Well, 100 is a very fair loss value

 

 

1.4.2 Training Error

Assume we have $n$ data points in the Training Data, then the Training Error $E_n(h)$ is

$$E_n(h)=\frac{1}{n}\sum_{i=1}^{n}L\big(h(x^{(i)}),y^{(i)}\big)$$

  • We prefer hypothesis $h_a$ to $h_b$ if $E_n(h_a)<E_n(h_b)$
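
A sketch of the training error under 0-1 loss (assuming NumPy arrays `X` of shape (n, d) and labels `y`, plus the `linear_classifier` sketch from 1.3.5):

```python
def zero_one_loss(g, a):
    """0-1 loss: 0 if the guess equals the actual label, else 1."""
    return 0 if g == a else 1

def training_error(X, y, theta, theta0):
    """Average loss of the hypothesis over the n training points."""
    n = X.shape[0]
    losses = [zero_one_loss(linear_classifier(X[i], theta, theta0), y[i])
              for i in range(n)]
    return sum(losses) / n
```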

 

 

1.4.3 Test Error

Assume we have $n'$ new data points for testing, then the Test Error $E(h)$ is

$$E(h)=\frac{1}{n'}\sum_{i=n+1}^{n+n'}L\big(h(x^{(i)}),y^{(i)}\big)$$

Almost the same as the Training Error; the difference is just the index

 

 

 

1.5 Random Linear Classifier

Clumsy, but not so clumsy. Here comes the pseudo-code (a Python sketch follows the notes below)

rand-lin-class

Hyperparameter $k$ is the number of iterations, i.e. the number of sets of random parameters $(\theta,\theta_0)$ to generate

  • Cons

    Not very efficient because parameter sets are generated at random, and you might have to test some very crazy hypotheses

  • Pros

    Within one round of training, you'll always get a better (or equally good) $h$ returned as $k$ increases

    I won't say the same about multiple rounds of training...

    • First training, $k=1$

      You have a chance (although extremely small) to hit the optimal parameter set in the single iteration

    • Second training, $k=2$

      It's entirely possible that you get two sets of insanely awful parameters... and end up with a huge loss
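
A Python sketch of the algorithm under stated assumptions (the pseudo-code does not fix a sampling distribution; the standard normal here is my arbitrary choice):

```python
import numpy as np

def random_linear_classifier(X, y, k, rng=np.random.default_rng()):
    """Generate k random (theta, theta0) pairs and keep the pair
    with the lowest training error under 0-1 loss."""
    best, best_err = None, np.inf
    for _ in range(k):
        theta = rng.normal(size=X.shape[1])   # random direction
        theta0 = rng.normal()                 # random offset
        preds = np.where(X @ theta + theta0 > 0, 1, -1)
        err = np.mean(preds != y)             # fraction misclassified
        if err < best_err:
            best, best_err = (theta, theta0), err
    return best
```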

 

 

 

1.6 Perceptron

1.6.1 Algorithm

This algorithm is more organized, with less randomness (I guess); pseudo-code here, and a runnable Python sketch follows the notes below

perceptron

Hyperparameter $\tau$ is the number of times to iterate through all $n$ data points in the Training Data $D_n$

The key component here is the evaluation & update in the nested for-loop, where $a$ is the actual label value and $g$ is the guess label value

  • Evaluation

    $$\underbrace{y^{(i)}}_{a}\underbrace{(\theta^T x^{(i)}+\theta_0)}_{g}\le 0$$

    The evaluation is true in the initial case, or when the actual value does not correspond to the guess value

    Tip: sign-wise, $a,g\in\{+1,-1\}$, and $\theta,\theta_0$ are zero when initialized

    • Initial Case

      $$y^{(i)}(\theta^T x^{(i)}+\theta_0)=y^{(i)}\big([0,\ldots,0]^T x^{(i)}+0\big)=y^{(i)}\cdot 0=0$$

    • Actual ≠ Guess

      The two factors have opposite signs, so sign-wise

      $$y^{(i)}(\theta^T x^{(i)}+\theta_0)\;\Rightarrow\;(+1)(-1)\text{ or }(-1)(+1)=-1<0$$

    When all the predictions made on the training data set are correct, that means

    $$y^{(i)}=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)$$

    And then the evaluation becomes, sign-wise,

    $$y^{(i)}(\theta^T x^{(i)}+\theta_0)\;\Rightarrow\;(+1)(+1)\text{ or }(-1)(-1)=1>0$$

  • Update

    We update the values of $\theta$ and $\theta_0$ whenever the guess label of a data point does not equal its actual label

    $$\theta_{\text{neo}}=\theta+y^{(i)}x^{(i)}\qquad\theta_{0,\text{neo}}=\theta_0+y^{(i)}$$

    Then, if we evaluated the same data point again, the evaluation would become

    $$y^{(i)}\big(\theta_{\text{neo}}^T x^{(i)}+\theta_{0,\text{neo}}\big)=y^{(i)}\big[(\theta+y^{(i)}x^{(i)})^T x^{(i)}+(\theta_0+y^{(i)})\big]=y^{(i)}(\theta^T x^{(i)}+\theta_0)+\underbrace{(y^{(i)})^2}_{=1}\big(\underbrace{x^{(i)T}x^{(i)}}_{=\|x^{(i)}\|^2}+1\big)=y^{(i)}(\theta^T x^{(i)}+\theta_0)+\big(\|x^{(i)}\|^2+1\big)$$

    The evaluation formula with the updated $\theta$ and $\theta_0$ is "the original evaluation + a strictly positive term"

    $$\|x^{(i)}\|^2\ge 0\;\Rightarrow\;\|x^{(i)}\|^2+1>0$$

    So this update is definitely trying to push the evaluation towards false, moving the parameters $\theta$ and $\theta_0$ towards a more correct classification
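
A runnable sketch of the perceptron under the conventions above (NumPy assumed):

```python
import numpy as np

def perceptron(X, y, tau):
    """Perceptron: tau passes over the n data points, updating
    (theta, theta0) whenever a point is misclassified (or in the
    all-zero initial case, when the evaluation equals 0)."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(tau):
        for i in range(n):
            if y[i] * (theta @ X[i] + theta0) <= 0:  # evaluation
                theta = theta + y[i] * X[i]          # update theta
                theta0 = theta0 + y[i]               # update theta0
    return theta, theta0
```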

 

 

1.6.2 Linear Separability

A training data set $D_n$ is Linearly Separable if there exist $\theta,\theta_0$ such that, for every $i\in\{1,2,\ldots,n\}$, it satisfies any of the below

  • $y^{(i)}(\theta^T x^{(i)}+\theta_0)>0$

  • $h(x^{(i)};\theta,\theta_0)=y^{(i)}$

  • $E_n(h)=0$

Well, they are actually completely equivalent...

The data set in the picture below is definitely not linearly separable

not linearly separable

 

 

1.6.3 Margin of Training Data

The Margin of a training data set $D_n$ with respect to the hyperplane defined by $\theta,\theta_0$ is the minimum of the signed distances from the $n$ data points to the hyperplane

$$\text{Margin}=\min_{i\in\{1,2,\ldots,n\}}\;y^{(i)}\left(\frac{\theta^T x^{(i)}+\theta_0}{\|\theta\|}\right)$$

(Point-to-Hyperplane Distance - see 1.3.4)

This quantity is positive if and only if all data points are correctly predicted

If any data point is incorrectly classified, $y^{(i)}$ and $\frac{\theta^T x^{(i)}+\theta_0}{\|\theta\|}$ will have different signs, so their product will be negative, which guarantees the Margin is negative (the equation returns the minimum product)

 

 

1.6.4 Perceptron Convergence Theorem

Assume:

  • Hypothesis Class $H$: classifiers with hyperplanes passing through the origin ($\theta_0=0$)

    (way to achieve this: see 1.6.5)

  • There exist $\theta^*$ and $\gamma>0$ such that $y^{(i)}\left(\frac{\theta^{*T}x^{(i)}}{\|\theta^*\|}\right)\ge\gamma$ for $i\in\{1,2,\ldots,n\}$

    i.e. we have a minimum value for the point-to-hyperplane distance

  • All data points have a bounded magnitude: $\|x^{(i)}\|\le R$ for $i\in\{1,2,\ldots,n\}$

    i.e. we have a maximum range for the data points

theory

Conclusion:

  • The perceptron algorithm will make at most $\left(\frac{R}{\gamma}\right)^2$ updates to $\theta$, after which the Training Error of the hypothesis will be 0

 

 

1.6.5 Classifier Offset

We could expand the feature space to "get rid of" the offset $\theta_0$ in the hyperplane definition

  • Hyperplane with Offset

    $\theta\in\mathbb{R}^d$, $x^{(i)}\in\mathbb{R}^d$

    $$\big\{x^{(i)}:\theta^T x^{(i)}+\theta_0=0\big\}$$

    Notice that $\theta_0=\theta_0\times 1$

  • Hyperplane without Offset

    We could move $\theta_0$ and $1$ into the feature spaces of $\theta$ and $x$

    $\theta_{\text{neo}}\in\mathbb{R}^{d+1}$, $x_{\text{neo}}\in\mathbb{R}^{d+1}$

    • $\theta_{\text{neo}}=[\theta_1,\theta_2,\ldots,\theta_d,\theta_0]^T$

    • $x_{\text{neo}}^{(i)}=[x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)},1]^T$

    And it is exactly the same equation as the one with the offset (we just hide it)

    $$\theta_{\text{neo}}^T x_{\text{neo}}^{(i)}=\theta^T x^{(i)}+\theta_0$$

    With the above modification, we have the new definition for the hyperplane as below

    $$\big\{x_{\text{neo}}^{(i)}[1\!:\!d]:\theta_{\text{neo}}^T x_{\text{neo}}^{(i)}=0\big\}$$
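
A one-function NumPy sketch of this augmentation (names are illustrative):

```python
import numpy as np

def augment(X, theta, theta0):
    """Fold the offset into the parameter vector by appending a constant
    1-feature to every data point: theta_neo^T x_neo = theta^T x + theta0."""
    X_neo = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (n, d+1)
    theta_neo = np.append(theta, theta0)              # shape (d+1,)
    return X_neo, theta_neo
```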

 

 

 

1.7 Algorithm Evaluation

1.7.1 Nonlinear Boundary

For data sets that are not linearly separable, we could use polynomial features to create nonlinear boundaries for classification

One way to implement these nonlinear boundaries is to use a Taylor expansion to approximate a smooth function; below is the table of terms in the Taylor polynomial when each data point has $d$ features

taylor poly

Usually, a $k$th order Taylor polynomial could create a boundary for data separated into $k$ groups, like below

  • $k=2$ for the donut-shape example at the top

  • $k=3$ for the weird-shape example at the bottom

nonlinbound

As you can see, classifiers of training data in 2-D space could be expressed as hyperplanes in 3-D space; this applies not only to nonlinear boundaries, but also to linear ones

lin vs nonlin

Mathematically speaking, these kinds of boundaries are just nonlinear in $x^{(i)}$, but still linear in $\theta$, with $x^{(i)}$ in the equation

They are still linear classifiers!

  • Classifier of Linear Boundary - Example at the top

    Hyperplane expressed by

    $$z=\theta^T x^{(i)}+\theta_0$$

    where $x^{(i)}=[x_1^{(i)},x_2^{(i)}]^T$

  • Classifier of Nonlinear Boundary - Example at the bottom

    Hyperplane expressed by

    $$z=\theta^T\phi(x^{(i)})+\theta_0$$

    where $\phi(x^{(i)})=[x_1^{(i)},x_2^{(i)},(x_1^{(i)})^2,(x_2^{(i)})^2,x_1^{(i)}x_2^{(i)}]^T$
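
A sketch of the quadratic feature map $\phi$ above (NumPy assumed; the weights in the example are made up to draw a unit circle):

```python
import numpy as np

def phi(x):
    """Quadratic feature map for 2-D input: the boundary is nonlinear
    in x, but the classifier stays linear in theta."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# A linear classifier over phi(x) draws a nonlinear boundary in (x1, x2)
theta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # illustrative weights
theta0 = -1.0                                # x1^2 + x2^2 = 1: a circle
print(np.sign(theta @ phi(np.array([0.5, 0.5])) + theta0))  # inside: -1
```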

 

 

1.7.2 Overfitting

Actually, we could apply those super-flexible functions (or nonlinear boundaries) from 1.7.1 to any data set, but there are extreme cases where you don't want such power...

overfit

The Training Error would be 0 with the boundary we develop above, but are you sure this is what you want?

This is Overfitting

I'm 99% sure you can't generalize this boundary to new data points in the tests...

 

 

1.7.3 Cross Validation

This is a way to prevent overfitting

  1. Shuffle the training data $D_n$, then divide it into $k$ groups: $D_{n,1},D_{n,2},\ldots,D_{n,k}$

    sectioned data set

  2. Repeat the following steps k times

    1. Train your $h_i$ on all groups of $D_n$ except one group $D_{n,i}$

      The orange section of Dn

    2. Test $h_i$ on the group left out, and compute the Test Error $E(h_i,D_{n,i})$

      The blue section of Dn

     

  3. Compute the average Test Error $\frac{1}{k}\sum_{i=1}^{k}E(h_i,D_{n,i})$

! [WARNING] ! Do please shuffle the order of your data set, unless you want to train your $h$ on some very biased group of data like the one in the grey box in the graph below

biased data

Trust me, you don't want this...
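
A minimal sketch of the procedure, assuming a `train` function that returns a hypothesis and an `error` function that computes the average loss (both names are placeholders):

```python
import numpy as np

def cross_validate(X, y, k, train, error, rng=np.random.default_rng()):
    """Shuffle D_n, split it into k folds, train on k-1 folds,
    test on the held-out fold, and average the k test errors."""
    idx = rng.permutation(len(y))            # shuffle first!
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        rest = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train(X[rest], y[rest])          # train on everything else
        errors.append(error(h, X[held_out], y[held_out]))
    return np.mean(errors)
```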

 

 

 

 

 

2. Feature Encoding

The following is an example data table before encoding

data ex

 

 

 

2.1 Numerical Data

Order & Differences of data values Do Matter

In the example data given above, resting heart rate, age, and family income are numerical data

Although age does look categorical or ordinal, by the definition above, it's still numerical

Usually we just leave numerical data as it is, but...

 

 

2.1.1 Level of Detail

Please be careful when pre-processing the data...

When people deal with "vague" data like age in the table above, they might do something like rounding "40s" to "45"

DON'T DO THAT unless you are sure this rounding is needed and would definitely not negatively impact the result

  • One suggested practice is to just use "4" for "40s"

Gentle Reminder Again: Don't Do Anything Unnecessary

 

 

2.1.2 Data Standardization

Sometimes we need to visualize the data...Do you know what's gonna happen if we plot resting heart rate versus family income in the table above with their linear classifier?

no data std

You could find a linear classifier that works properly; it just looks like hell...

Solution: Standardize the data!

For numerical data, the simplest way is what we do to an arbitrary normal distribution to make it a standard normal distribution. The $d$th feature in the data could be standardized to $z_d^{(i)}$ as below

$$z_d^{(i)}=\frac{x_d^{(i)}-\mu_d}{\sigma_d}$$

where $\mu_d$ and $\sigma_d$ are the mean and the standard deviation of all the $d$th features in the data set
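
A sketch of this standardization over a whole feature matrix (NumPy assumed):

```python
import numpy as np

def standardize(X):
    """Standardize each feature column: subtract the column mean,
    then divide by the column standard deviation."""
    mu = X.mean(axis=0)      # mu_d for each feature d
    sigma = X.std(axis=0)    # sigma_d for each feature d
    return (X - mu) / sigma
```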

 

2.2 Categorical Data

Order of data values Does Not Matter

In the example data given above, pain?, job, medicines are categorical data

 

 

2.2.1 Boolean Encoding

A typical way to convert something like "Yes/No" or "True/False", for example

$$\text{pain?}:\{\text{Yes},\text{No}\}\to\{+1,-1\}$$

You can also use $\{1,0\}$ depending on your needs

 

 

2.2.2 One-Hot Encoding

Converting "Single Choice" data. Turn each category into a unique "0/1 Feature", the feature become 1 when we "select" it, otherwise that bit would always be 0

Example: job

We could turn 5 jobs into 5 features: j1,j2,j3,j4,j5 where ji{0,1}

If you want to group nurse and doctor together saying that they have some features in common, then you can consider Factored Encoding (see 2.2.3)
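
A sketch of one-hot encoding (the job list is invented for illustration):

```python
import numpy as np

JOBS = ["doctor", "nurse", "admin", "security", "janitor"]  # illustrative

def one_hot(job):
    """One-hot encode a single-choice category: exactly one bit is 1."""
    v = np.zeros(len(JOBS))
    v[JOBS.index(job)] = 1.0
    return v

print(one_hot("nurse"))  # [0. 1. 0. 0. 0.]
```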

 

 

2.2.3 Factored Encoding

Converting "Multiple Choices" data, rather than "Single Choice" data like the job we discussed above, for example

Example: medicine

It seems like we have 4 different medicine selection, but in fact, it is just combinations of 2 kinds of medicine, and therefore we only have to deal with 2 features

{pain,beta blocker}{m1,m2}

 

 

 

2.3 Ordinal Data

Order of data values Does Matter but Differences Do Not

Converts "scales" like "levels of pain" or "levels of agreement"

There is no example in the data table above, but we could provide one like this

ordinal

This kind of feature encoding method is called a "Thermometer Code" or "Unary Code"
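
A sketch of a thermometer code (assuming integer levels starting at 1):

```python
import numpy as np

def thermometer(level, num_levels):
    """Thermometer (unary) code: the first `level` bits are 1.
    Order is preserved, but differences between codes are not meaningful."""
    return np.array([1.0 if i < level else 0.0 for i in range(num_levels)])

print(thermometer(3, 5))  # pain level 3 of 5 -> [1. 1. 1. 0. 0.]
```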

 

 

 

2.4 Processed Data Example

The encoded and standardized example data table

encoded & std

 

 

 

 

 

3. Regression

3.1 Gradient Descent

3.1.1 Gradient

The Gradient tells us the direction that increases the "height of the function" (along the axis of the dependent variable) the most. Mathematically, it is a vector of partial derivatives with respect to each of the variables; the Gradient of $f$ with respect to $\Theta$ is

$$\nabla_\Theta f(\Theta)=\begin{bmatrix}\frac{\partial f(\Theta)}{\partial\Theta_1}\\\frac{\partial f(\Theta)}{\partial\Theta_2}\\\vdots\\\frac{\partial f(\Theta)}{\partial\Theta_m}\end{bmatrix}$$

where the parameter $\Theta=\{\Theta_1,\Theta_2,\ldots,\Theta_m\}\in\mathbb{R}^m$

gradient

In the picture above, the graph of the function $f(\Theta)$ with $\Theta\in\mathbb{R}^2$ is on the left-hand side; a contour graph (top view) of the same function is on the right-hand side, where the green arrow denotes the direction of the gradient at the red dot.

 

 

3.1.2 Convex Function

A real-valued function is Convex if the line segment connecting any two distinct points on the graph of this function lies entirely on or above the graph

Two sets of examples are presented below

convexity

When a function is convex, it is easier to minimize

 

 

3.1.3 (Batch) Gradient Descent

When saying "Gradient Descent", this is the version we talk about

Since we are talking about Descent, we are actually interested in how to minimize (instead of maximize) the function; in terms of the contour graph in 3.1.1, the red arrow denotes the desired direction

descent direction

And by going down the contour lines step by step, we could (probably) reach the minimum of the function, the "bottom of the valley". The picture below shows the simplest process of gradient descent

gradient descent demo

Here comes the pseudo-code (a Python sketch follows the notes below)

gd pseudo code

  • Parameter Explanation

    • $\Theta_{\text{init}}$ = initial "position"

    • $\Theta^{(t)}$ = parameters of $f(\Theta)$ at the $t$th step

    • $f(\Theta)$ = convex function

    • $\nabla_\Theta f(\Theta)$ = gradient of $f(\Theta)$ with respect to $\Theta$, as a function

    • $\eta$ = step size

  • Stop Condition

    You could choose different stop_conditions for the while loop here

    • Maximum number of iterations $T$

      Always a good choice

    • $|f(\Theta^{(t)})-f(\Theta^{(t-1)})|\le\epsilon$

      stop when the step difference of the function value is small enough

    • $\|\Theta^{(t)}-\Theta^{(t-1)}\|\le\epsilon$

      stop when you can hardly "improve" the parameter set of your convex function

    • $\|\nabla_\Theta f(\Theta^{(t)})\|\le\epsilon$

      stop when the gradient is small enough

  • Performance Theorem

    Based on the pseudo-code in 3.1.3, if we select any $\epsilon>0$, then

    • Assume

      • $\eta$ is sufficiently small

      • $f(\Theta)$ is sufficiently smooth and convex

      • $f(\Theta)$ has at least one global optimum

        I mean minimum here

    • Conclusion

      If the code runs long enough, Gradient Descent will return a value within $\epsilon$ of a global optimum $\Theta^*$
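
A minimal sketch matching the pseudo-code, using a fixed iteration count $T$ as the stop condition:

```python
import numpy as np

def gradient_descent(theta_init, eta, grad_f, T):
    """Batch gradient descent: step against the gradient T times."""
    theta = theta_init
    for _ in range(T):
        theta = theta - eta * grad_f(theta)  # move downhill
    return theta

# Example: minimize f(Theta) = ||Theta||^2, whose gradient is 2 * Theta
theta_star = gradient_descent(np.array([3.0, -2.0]), eta=0.1,
                              grad_f=lambda th: 2 * th, T=100)
print(theta_star)  # close to [0, 0]
```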

 

 

3.1.4 Stochastic Gradient Descent

Sort of a "Drunk" Gradient Descent

If you look at how gradients are computed in (Batch) Gradient Descent (3.2.5 & 3.3.5), you shall see that we sum the gradients over the whole training data set, then take the average

In Stochastic Gradient Descent, we randomly select one data point at a time, and just look at its gradient. This is noisier, but since we don't have to do the full summation, iterations become cheaper

GD vs SGD

Here's the pseudo-code for it

sgd pseudo code

About the changes in parameters

  • $\eta(t)$ gets smaller as $t$ gets larger

  • $\nabla_\Theta f_i(\Theta^{(t-1)})$ is the gradient at a single data point

[Note] In practice, SGD uses a small collection of data points at a time, not just one, called a minibatch
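
A sketch of SGD (the $1/(t+1)$ step-size schedule is an illustrative assumption; `grad_fi` computes the gradient at data point $i$):

```python
import numpy as np

def sgd(theta_init, grad_fi, n, T, rng=np.random.default_rng()):
    """Stochastic gradient descent: at each step, follow the gradient
    of one randomly chosen data point, with a shrinking step size."""
    theta = theta_init
    for t in range(T):
        i = rng.integers(n)          # pick one data point at random
        eta_t = 1.0 / (t + 1)        # eta(t) decays as t grows
        theta = theta - eta_t * grad_fi(theta, i)
    return theta
```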

 

 

 

3.2 Logistic Regression

Common linear classifiers can only divide linearly separable data; they have no notion of the uncertainty of $y^{(i)}$, or, to be more specific, of linearly non-separable data, where different label values $y^{(i)}$ are mixed together in some regions

linearly unseparable

For example, at what temperature would you wear a coat?

  • $x^{(i)}\in\mathbb{R}$ with only one feature: Temperature

  • $y^{(i)}\in\mathbb{R}$ is the probability of wearing a coat

logistic regression uncertainty

The graph on the left-hand side says that someone will wear a coat whenever the temperature is under a definite bound, and will not when it is above. But real-life experience says there is a certain range of temperatures in which you are uncertain about whether you need a coat, just like the graph on the right-hand side.

  • Definitely need a coat when under 10°C

  • Be 70% sure that you need a coat at 13°C, but on some very sunny days, you don't

  • Be only 30% sure that you need a coat at 17°C, but on some very rainy days, you do need one

  • Definitely do not need a coat when above 20°C

The shape of the graph on the right is a Sigmoid/Logistic Function, or Linear Logistic Classifier

 

 

3.2.1 Logistic Function
  • Definition

    $$\sigma(z)=\frac{1}{1+e^{-z}}$$

    logistic regression

  • Shifting/Scaling

    Add a scaling constant $\theta$ and a shifting constant $\theta_0$ to the variable of the Logistic Function

    $$\sigma(\theta z+\theta_0)=\frac{1}{1+e^{-(\theta z+\theta_0)}}$$

    Suppose we replace $z$ with a multi-feature vector $x^{(i)}\in\mathbb{R}^d$; then we would find something familiar...

    $$\sigma(\theta^T x^{(i)}+\theta_0)=\frac{1}{1+e^{-(\theta^T x^{(i)}+\theta_0)}}$$

    If $d=2$, then the graph of the function above would be the following

    logistic visualize

Tip: the Logistic Function has $\sigma(z)\in(0,1)$, and by itself it is not a classifier

 

 

3.2.2 Prediction Threshold

How do we make a prediction decision from $\{+1,-1\}$ with the Logistic Function?

  • Default Practice

    The output of the Logistic Function is a probability, so...

    • $+1$ if $\text{Probability}>0.5$

    • $-1$ if $\text{Probability}\le 0.5$

    The value 0.5 above is the Prediction Threshold

  • Simplification

    $$\text{Probability}>0.5\;\Rightarrow\;\sigma(\theta^T x^{(i)}+\theta_0)>0.5\;\Rightarrow\;\frac{1}{1+e^{-(\theta^T x^{(i)}+\theta_0)}}>0.5\;\Rightarrow\;e^{-(\theta^T x^{(i)}+\theta_0)}<1\;\Rightarrow\;\theta^T x^{(i)}+\theta_0>0$$

    The last inequality should be very familiar to you (see 1.3.5). We can see that the default hypothesis for the Logistic Function is exactly the same as that of the Linear Classifier!

    $$h(x^{(i)})=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)=\begin{cases}+1 & \text{if }\theta^T x^{(i)}+\theta_0>0\\-1 & \text{if }\theta^T x^{(i)}+\theta_0\le 0\end{cases}$$

 

 

3.2.3 NLL Loss

The loss function we define here is the negative log-likelihood (NLL). To be honest, this is not really a "loss" function, but a function that will help us learn a Linear Logistic Classifier

We know the output of the logistic function is in $(0,1)$, which is a probability; in our case, it is the probability that a data point is labeled "$+1$", or the probability that you need a coat

Assume the data points are completely independent; then we can find the probability that the whole training data set is properly labeled by multiplying the probabilities that each data point is properly labeled

  • The probability that a data point is labeled "$+1$" is denoted $g^{(i)}$

    $$g^{(i)}=\sigma(\theta^T x^{(i)}+\theta_0)$$

  • The probability that a data point is properly labeled is denoted $P(x^{(i)})$

    • if $y^{(i)}=+1$, then $P(x^{(i)})=g^{(i)}$

    • if $y^{(i)}\ne+1$, then $P(x^{(i)})=1-g^{(i)}$

We could find the probability that the whole training data set is properly labeled with the equation below

$$P(\text{data})=\prod_{i=1}^{n}P(x^{(i)})=\prod_{i=1}^{n}\begin{cases}g^{(i)} & \text{if }y^{(i)}=+1\\1-g^{(i)} & \text{if }y^{(i)}\ne+1\end{cases}=\prod_{i=1}^{n}\big(g^{(i)}\big)^{\mathbb{1}\{y^{(i)}=+1\}}\big(1-g^{(i)}\big)^{\mathbb{1}\{y^{(i)}\ne+1\}}$$

Explanation for why we need the negative log-likelihood

  • Q: Why logarithm?

    A: Computing long products hurts the numerical precision of results on a computer, so we would like to convert our formula from a product into a summation

  • Q: Why negative?

    A: Our objective is to maximize the probability that the whole training data set is properly labeled, but we prefer to minimize, which taking the negative sign achieves

The Loss of Data (not yet the NLL Loss) in general is defined as below

$$L(\text{data})=-\frac{1}{n}\log\big(P(\text{data})\big)=-\frac{1}{n}\log\Big[\prod_{i=1}^{n}\big(g^{(i)}\big)^{\mathbb{1}\{y^{(i)}=+1\}}\big(1-g^{(i)}\big)^{\mathbb{1}\{y^{(i)}\ne+1\}}\Big]=-\frac{1}{n}\sum_{i=1}^{n}\Big[\mathbb{1}\{y^{(i)}=+1\}\log\big(g^{(i)}\big)+\mathbb{1}\{y^{(i)}\ne+1\}\log\big(1-g^{(i)}\big)\Big]$$

Tip: Logarithm Properties

  • $\log_b(xy)=\log_b(x)+\log_b(y)$

  • $\log_b(x^p)=p\log_b(x)$

The NLL Loss in the equation above is defined by the following ($g$ is the guess value and $a$ is the actual value)

$$L_{\text{nll}}(g^{(i)},y^{(i)})=-\Big[\mathbb{1}\{y^{(i)}=+1\}\log\big(g^{(i)}\big)+\mathbb{1}\{y^{(i)}\ne+1\}\log\big(1-g^{(i)}\big)\Big]$$

You can use a logarithm of any base

Default base value $=e$

For convenience, if $y^{(i)}\in\{0,1\}$, the NLL Loss function could be defined more compactly as

$$L_{\text{nll}}(g^{(i)},y^{(i)})=-\Big[y^{(i)}\log\big(g^{(i)}\big)+(1-y^{(i)})\log\big(1-g^{(i)}\big)\Big]$$

We could reorganize $L(\text{data})$ into the following objective $J_{\text{lr}}(\Theta)$ to minimize

$$J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)$$

lr $=$ logistic regression

$\Theta=$ parameters

(a multi-class version is available in Machine Learning - NN Special Topics 2.)
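
A sketch of $L_{\text{nll}}$ and $J_{\text{lr}}$, assuming labels in $\{0,1\}$ as in the compact form above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    """NLL loss for a guess probability g and an actual label y in {0, 1}."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def J_lr(X, y, theta, theta0):
    """Average NLL over the training data: the objective to minimize."""
    g = sigmoid(X @ theta + theta0)
    return np.mean(nll_loss(g, y))
```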

 

 

3.2.4 Overfitting & Regularizer

First of all, $J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)$ is Differentiable & Convex

This is a claim, not an assumption

Assume we want to train a linear logistic classifier for the training data below

data example

Run the pseudo-code in 3.1.3 with the NLL Loss function defined in 3.2.3

  • Gradient_Descent $\big(\Theta_{\text{init}},\ \eta,\ J_{\text{lr}}(\Theta),\ \nabla_\Theta J_{\text{lr}}(\Theta),\ \epsilon\big)$

Then we need to ask: How far should we go?

  • The whole idea of the NLL Loss is to increase the probability of the whole data set

  • We can increase this probability by making the graph of $g^{(i)}$ steeper (see 3.2.3)

  • So ideally, we would always end up with the graph at Step 114514 in the picture below

logistic regression

We can plot the objective function and see what's happening (assume $\theta_0=0$)

loss function plot

The value of the NLL Loss keeps going down as $\theta$ increases

Even past some value of $\theta$ on the positive axis, where the graph of the function looks like a horizontal line, its value is still decreasing

But being overly certain is pointless; we don't need it.

The solution is to add a Regularizer / Penalty $R(\theta)=\lambda\|\theta\|^2$ with $\lambda\ge 0$ that penalizes being overly certain, while still keeping the objective $J_{\text{lr}}$ differentiable & convex

$R(\theta)$ is a quadratic function with a nonnegative coefficient... such a quadratic is always convex

$$J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)+\lambda\|\theta\|^2$$

You could see its effect as below

regularizer

 

 

3.2.5 Gradient Descent for Logistic Regression

We could set up our Logistic Regression Learning Algorithm based on the pseudo-code of gradient descent in 3.1.3

  • Gradients

    We need to update all the parameters ($\Theta=[\theta,\theta_0]$) with their respective gradients

    Tips:

    1. Default logarithm base value $=e$

    2. $g^{(i)}=\sigma(\theta^T x^{(i)}+\theta_0)$

    3. Derivatives

      $$\frac{d}{dx}\ln(x)=\frac{1}{x}\qquad\frac{d}{dx}\sigma(x)=\sigma(x)\big(1-\sigma(x)\big)$$

    • Gradient of $\theta$

      $$\begin{aligned}\nabla_\theta J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L_{\text{nll}}}{\partial\theta}+\nabla_\theta\big(\lambda\|\theta\|^2\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[y^{(i)}\frac{1}{g^{(i)}}\frac{\partial g^{(i)}}{\partial\theta}+(1-y^{(i)})\frac{1}{1-g^{(i)}}\frac{\partial(1-g^{(i)})}{\partial\theta}\Big]+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{y^{(i)}}{g^{(i)}}-\frac{1-y^{(i)}}{1-g^{(i)}}\Big]\frac{\partial g^{(i)}}{\partial\theta}+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\frac{y^{(i)}-y^{(i)}g^{(i)}-g^{(i)}+y^{(i)}g^{(i)}}{g^{(i)}(1-g^{(i)})}\,g^{(i)}\big(1-g^{(i)}\big)\,\nabla_\theta\big(\theta^T x^{(i)}+\theta_0\big)+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\big(y^{(i)}-g^{(i)}\big)x^{(i)}+2\lambda\theta\end{aligned}$$

      Reorganize the equation, and we have the following gradient equation

      $$\nabla_\theta J_{\text{lr}}(\Theta)=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]x^{(i)}+2\lambda\theta$$

    • Gradient of $\theta_0$

      $$\begin{aligned}\nabla_{\theta_0}J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L_{\text{nll}}}{\partial\theta_0}+\nabla_{\theta_0}\big(\lambda\|\theta\|^2\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[y^{(i)}\frac{1}{g^{(i)}}\frac{\partial g^{(i)}}{\partial\theta_0}+(1-y^{(i)})\frac{1}{1-g^{(i)}}\frac{\partial(1-g^{(i)})}{\partial\theta_0}\Big]\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{y^{(i)}}{g^{(i)}}-\frac{1-y^{(i)}}{1-g^{(i)}}\Big]\frac{\partial g^{(i)}}{\partial\theta_0}\\&=-\frac{1}{n}\sum_{i=1}^{n}\frac{y^{(i)}-y^{(i)}g^{(i)}-g^{(i)}+y^{(i)}g^{(i)}}{g^{(i)}(1-g^{(i)})}\,g^{(i)}\big(1-g^{(i)}\big)\,\frac{\partial}{\partial\theta_0}\big(\theta^T x^{(i)}+\theta_0\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\big(y^{(i)}-g^{(i)}\big)\cdot 1\end{aligned}$$

      Reorganize the equation, and we have the following gradient equation

      $$\nabla_{\theta_0}J_{\text{lr}}(\Theta)=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]$$

  • Summary

    $$\begin{aligned}J_{\text{lr}}(\theta,\theta_0)&=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)+\lambda\|\theta\|^2\\\nabla_\theta J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]x^{(i)}+2\lambda\theta\\\nabla_{\theta_0}J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]\end{aligned}$$
  • Algorithm

    Here's the pseudo-code of gradient descent for logistic regression

    gradient descent for logistic regression
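
A sketch implementing these gradients with batch gradient descent (labels assumed in $\{0,1\}$, matching the derivation; `sigmoid` as in the 3.2.3 sketch):

```python
import numpy as np

def logistic_regression_gd(X, y, lam, eta, T):
    """Batch gradient descent on J_lr with regularizer lam * ||theta||^2."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(T):
        g = sigmoid(X @ theta + theta0)                  # g_i for all i
        grad_theta = (X.T @ (g - y)) / n + 2 * lam * theta
        grad_theta0 = np.mean(g - y)
        theta -= eta * grad_theta
        theta0 -= eta * grad_theta0
    return theta, theta0
```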

 

 

 

3.3 Linear Regression

3.3.1 Definition
  • Hypothesis

    The hypothesis of Linear Regression is a hyperplane... but now we are just trying to fit the data, without drawing conclusions from this hyperplane the way we did for classification

    $$h(x^{(i)};\theta,\theta_0)=\theta^T x^{(i)}+\theta_0$$

  • Training Loss

    The loss function we use for linear regression is the Squared Loss

    $$L(g,a)=(g-a)^2$$

    while the optimization (minimization) objective uses the Mean Squared Loss

    $$J(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L\big(h(x^{(i)};\theta,\theta_0),\,y^{(i)}\big)=\frac{1}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)^2$$

    Notice that, using the same trick as in 1.6.5, the offset $\theta_0$ could be merged into $\theta^T x^{(i)}$ (with $\theta,x^{(i)}:\mathbb{R}^d\to\mathbb{R}^{d+1}$)

    Then we could stack the features $x^{(i)}$ and results $y^{(i)}$ as matrices $\tilde{X}\in\mathbb{R}^{n\times(d+1)}$ and $\tilde{Y}\in\mathbb{R}^{n}$

    $$\tilde{X}=\begin{bmatrix}x_1^{(1)} & \cdots & x_d^{(1)} & 1\\\vdots & & \vdots & \vdots\\x_1^{(n)} & \cdots & x_d^{(n)} & 1\end{bmatrix}\qquad\tilde{Y}=\begin{bmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{bmatrix}$$

    With the above, rewrite the optimization objective in matrix form as below

    $$J(\theta,\theta_0)=\frac{1}{n}\big(\underbrace{\tilde{X}\theta-\tilde{Y}}_{\in\,\mathbb{R}^n}\big)^T\big(\tilde{X}\theta-\tilde{Y}\big)=\frac{1}{n}\big\|\tilde{X}\theta-\tilde{Y}\big\|^2$$

 

 

3.3.2 Direct Solution

A function is uniquely minimized at a point if the Gradient at that point is 0 and the function "curves up"

The Gradient of the optimization objective in 3.3.1 is the following matrix derivative, with $\nabla_\theta J(\theta,\theta_0)\in\mathbb{R}^{d+1}$

$$\nabla_\theta J(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)$$

By setting $\nabla_\theta J(\theta,\theta_0)=0$, we could (probably) find the desired unique minimizer $\theta^*$

$$\frac{2}{n}\tilde{X}^T(\tilde{X}\theta-\tilde{Y})=0\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta-\tilde{X}^T\tilde{Y}=0\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{X}\theta=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y}\;\Rightarrow\;\theta^*=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y}$$

[IMPORTANT]

$(\tilde{X}^T\tilde{X})$ is a matrix of 2nd derivatives. If the function does not "curve up", $(\tilde{X}^T\tilde{X})$ will not be invertible!
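
A sketch of the direct solution. Solving the normal equations with `np.linalg.solve` instead of forming the inverse explicitly is the standard, numerically safer route:

```python
import numpy as np

def ols_direct(X, Y):
    """Direct least-squares solution theta* = (X^T X)^{-1} X^T Y,
    computed by solving the linear system X^T X theta = X^T Y."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])  # append the 1-feature
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)    # fails if X^T X is singular
```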

 

 

3.3.3 Demo & Possible Errors

This is the regular situation

regular situation

The following are possible errors you might encounter in real-life practice

  • No Unique Best Hyperplane

Very simple: the shape of your data happens to be awful...

    awful data shape

    • Noisy Data

      Maybe you could find a solution, but it's only there because of noise in the data, like the uncertainty of a thermometer...

      It looks like you have found the solution, but no, you shouldn't accept it!

      noisy data

  • Feature Encoding Problems

    • Correlated Real-life features

      Ex. Doctors and nurses might have the same probability of exposure to some diseases, but they are categorized with one-hot encoding...

    • Too Many Features

      If the training data has only one data point, with only one feature, then you could have infinitely many hyperplanes as solutions; this is very, very, very bad!

      so many features

      Generally speaking, your training data should have more data points than features to allow for a unique hyperplane as the solution... but sometimes the feature dimension is too high compared to the number of data points...

  • Strategy

    When we are unsure which hyperplane to choose, just choose a $\theta$ near 0 unless there's a strong reason not to

    The way to implement this strategy is Ridge Regression (see 3.3.4)

 

 

3.3.4 Ridge Regression

This addresses the problem from 3.3.3 that sometimes there isn't a unique best hyperplane as our solution (and also overfitting)

Under these situations, $(\tilde{X}^T\tilde{X})$ is usually not invertible

The solution is to add a Regularizer / Penalty $R(\theta)=\lambda\|\theta\|^2$ with $\lambda>0$ to the optimization objective in 3.3.1; this is the same thing we did for logistic regression (see 3.2.4)

$$J_{\text{ridge}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)^2+\lambda\|\theta\|^2=\frac{1}{n}\big\|\tilde{X}\theta-\tilde{Y}\big\|^2+\lambda\|\theta\|^2$$

The Gradient of the optimization objective of Ridge Regression, $\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)$, is

$$\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)+2\lambda\theta$$

Minimizing $J_{\text{ridge}}(\theta,\theta_0)$ by setting $\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=0$ leads to the following

$$\frac{2}{n}\tilde{X}^T(\tilde{X}\theta-\tilde{Y})+2\lambda\theta=0\;\Rightarrow\;\frac{1}{n}\tilde{X}^T\tilde{X}\theta-\frac{1}{n}\tilde{X}^T\tilde{Y}+\lambda\theta=0\;\Rightarrow\;\frac{1}{n}\tilde{X}^T\tilde{X}\theta+\lambda\theta=\frac{1}{n}\tilde{X}^T\tilde{Y}\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta+n\lambda\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;(\tilde{X}^T\tilde{X}+n\lambda I)\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;\theta^*=(\tilde{X}^T\tilde{X}+n\lambda I)^{-1}\tilde{X}^T\tilde{Y}$$

where $I$ is an identity matrix with $I\in\mathbb{R}^{(d+1)\times(d+1)}$

This is called Ridge Regression because we are adding a "ridge" of $\lambda$ values along the diagonal of the matrix before inverting it

[IMPORTANT]

  • $(\tilde{X}^T\tilde{X}+n\lambda I)$ is a matrix of 2nd derivatives. It is ALWAYS invertible when $\lambda>0$

  • When using Ridge Regression, we assume the features are standardized as in 2.1.2
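
A sketch of the closed-form ridge solution (again via `np.linalg.solve`; note that, with $\theta_0$ folded into $\theta$, this version penalizes the offset too, the same simplification the notes make):

```python
import numpy as np

def ridge_direct(X, Y, lam):
    """Closed-form ridge solution theta* = (X^T X + n*lam*I)^{-1} X^T Y."""
    n = X.shape[0]
    Xt = np.hstack([X, np.ones((n, 1))])           # append the 1-feature
    A = Xt.T @ Xt + n * lam * np.eye(Xt.shape[1])  # add the "ridge"
    return np.linalg.solve(A, Xt.T @ Y)            # always solvable for lam > 0
```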

 

 

3.3.5 Gradient Descent Solution

You may ask: "Why do we still need gradient descent when we could derive θ directly?" (see 3.3.2/4)

Here's why: Matrix Inversion has a time complexity of O(n2)O(n3)...

This Gradient Descent Solution is based on Ridge Regression

  • Gradient of $\theta$

    $$\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)+2\lambda\theta=\frac{2}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)x^{(i)}+2\lambda\theta$$

  • Gradient of $\theta_0$

    $$\nabla_{\theta_0}J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)$$

Here's the pseudo-code for Gradient Descent for Linear Regression (using Ridge Regression)

gradient descent for linear regression using ridge regression
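
A sketch of gradient descent with these ridge gradients (fixed step size and iteration count assumed):

```python
import numpy as np

def ridge_regression_gd(X, y, lam, eta, T):
    """Gradient descent for linear regression with the ridge penalty."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(T):
        err = X @ theta + theta0 - y                     # residuals
        grad_theta = (2 / n) * (X.T @ err) + 2 * lam * theta
        grad_theta0 = (2 / n) * err.sum()
        theta -= eta * grad_theta
        theta0 -= eta * grad_theta0
    return theta, theta0
```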