Machine Learning - Basics



1. Introduction: Linear Classifier

1.1 Training Data

Training Data is a set of data points, usually given together with their labels in advance

  • We have $n$ data points, the $i$-th one written as $x^{(i)}$

  • Each data point has a feature vector containing $d$ features, with the $d$-th feature written as $x_d^{(i)}$

  • Each data point is associated with a label value written as $y^{(i)}$

Mathematically, Training Data with $n$ data points ($D_n$) is defined as the following

$$D_n=\{(x^{(1)},y^{(1)}),\ldots,(x^{(n)},y^{(n)})\}\quad\text{where}\quad\begin{cases}x^{(i)}=[x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)}]^T\in\mathbb{R}^d\\[2pt]y^{(i)}\in\{+1,-1\}\\[2pt]i\in\{1,2,\ldots,n\}\end{cases}$$

You can set the label values $y^{(i)}$ to whatever you favor; the binary labeling above is just an arbitrary example.

Assume each data point has only 2 features ($d=2$), that is

$$x^{(i)}=[x_1^{(i)},x_2^{(i)}]^T\in\mathbb{R}^2$$

We could visualize the provided training data, where we have

  • 2 axes $x_1$/$x_2$ to represent the features

    TIP: $x_1$/$x_2$ actually mean $x_1^{(i)}$/$x_2^{(i)}$

  • colored "$+$/$-$" signs to represent the label values

2-feature

 

 

 

1.2 Hypothesis

A Hypothesis is the model yielded by learning from the training data

Or we could also say it is a way to go from data to labels

By "Hypothesis" here, we actually mean a way to label NEW data points: it is a function that takes in a feature vector and returns a label... If we adopt the example above, then the hypothesis $h$ should be

$$h:\mathbb{R}^d\to\{+1,-1\}$$

Of course, you can also pass the data points of the training data into $h$ for testing... What I imply here is that $h(x^{(i)})$ might not give the same value as $y^{(i)}$!

The Hypothesis Class $H$ is the set of all hypotheses: $h\in H$

 

 

 

1.3 Basics

1.3.1 Hypothesis Class

An example of a hypothesis class for a linear classifier is

  • $H$: all hypotheses that label $+1$ on one side of a line and $-1$ on the other side

The following two lines both represent hypotheses in $H$

  • Data points in the green shadow are predicted to be $+1$

  • Data points in the red shadow are predicted to be $-1$

TIP: The labels of the two axes actually mean $x_1^{(i)}$ and $x_2^{(i)}$

Linear Classifiers

If we test both of them with the labeled data points in the training data, the hypothesis on the left performs terribly...

 

 

1.3.2 Math: Vector Projection

This is essential for 1.3.3

The Dot Product of two vectors $a$ and $b$ is defined as below, where $\theta$ is the angle between $a$ and $b$

$$a\cdot b=\|a\|\,\|b\|\cos(\theta)$$

The amount of projection of vector $a$ onto vector $b$ in the graph below is denoted by the scalar $l$

vec proj

We could derive the formula for vector projection with

$$l=\|a\|\cos(\theta)=\frac{\|a\|\cos(\theta)\,\|b\|}{\|b\|}=\frac{\|a\|\,\|b\|\cos(\theta)}{\|b\|}\;\Rightarrow\;l=\frac{a\cdot b}{\|b\|}$$

 

 

1.3.3 Math: Hyperplane

The "line", or the Linear Classifier Hyperplane could be represented using one single vector θ in the Feature Space

Hyperplane is always 1-Dimension lower than the the space dimension (leave one dimension to do classification)

  • The hyperplane in the 2-D feature space below is 1-D, which is a line

  • The hyperplane in the 3-D feature space would be 2-D

  • ... ...

LC_math_00

Notation in the graph:

  • $\theta,x^{(i)}\in\mathbb{R}^d$

    Vectors; $d=2$ in the case above, but the theory applies to higher dimensions

  • $b\in\mathbb{R}$

    Scalar, the amount of $x^{(i)}$ projected onto $\theta$

The amount of this projection ($b$) could be calculated using the equation in the graph: the dot product of the two vectors, divided by the length of $\theta$. We can also check the dimensions:

$$b=\frac{\theta^T x^{(i)}}{\|\theta\|}\;\Rightarrow\;\frac{\mathbb{R}^{1\times d}\cdot\mathbb{R}^{d\times 1}}{\mathbb{R}}=\mathbb{R}$$

Precisely because $b$ is a PROJECTION, the above equation gives exactly the SAME result for EVERY $x^{(i)}$ on the red dotted line perpendicular to $\theta$ in the graph below

LC_math_01

[ !! WARNING !! ] For the convenience of discussion, we could consider the projection onto the opposite direction of $\theta$, and denote the magnitude of this projection by $a$

and therefore the signed distance is $-a$

LC_math_02

The hyperplane represented by the red dotted line could be expressed with the following equation

$$\Big\{x^{(i)}:\frac{\theta^T x^{(i)}}{\|\theta\|}=-a\Big\}\;\Rightarrow\;\theta^T x^{(i)}=-a\|\theta\|\;\Rightarrow\;\theta^T x^{(i)}+a\|\theta\|=0$$

If we let $a\|\theta\|=\theta_0$, then it becomes:

$$\big\{x^{(i)}:\theta^T x^{(i)}+\theta_0=0\big\}$$

 

 

1.3.4 Math: Point-to-Hyperplane Distance

Q: What is the Signed Distance from a hyperplane to an arbitrary point $x$ in 2-D space?

margin

The purple distance in the graph is what we want, and all we need to do is subtract the signed distance between the hyperplane and the origin from the projection of $x$ onto $\theta$ (a small code sketch follows these notes)

$$\text{Distance}=l-b=\frac{\theta^T x}{\|\theta\|}-\Big(-\frac{\theta_0}{\|\theta\|}\Big)=\frac{\theta^T x+\theta_0}{\|\theta\|}$$

Tip: If the hyperplane is below the origin, as in the last picture of 1.3.3, then we need to change the notation of $b$ to $-a$, but the equation stays the same (remember that $a$ is the magnitude and $-a$ is the signed distance)

$$b=-a=-\frac{\theta_0}{\|\theta\|}$$

  • When the hyperplane is above the origin

    $$\frac{\theta^T x^{(i)}}{\|\theta\|}=b\;\Rightarrow\;b\|\theta\|+\theta_0=0\;\Rightarrow\;b\|\theta\|=-\theta_0\;\Rightarrow\;b=-\frac{\theta_0}{\|\theta\|}$$

  • When the hyperplane is below the origin

    $$\frac{\theta^T x^{(i)}}{\|\theta\|}=-a\;\Rightarrow\;-a\|\theta\|+\theta_0=0\;\Rightarrow\;a\|\theta\|=\theta_0\;\Rightarrow\;a=\frac{\theta_0}{\|\theta\|}$$
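
A small sketch of this distance computation, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def signed_distance(x, theta, theta0):
    """Signed distance from the hyperplane theta^T x + theta0 = 0 to point x."""
    return (theta @ x + theta0) / np.linalg.norm(theta)
```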

 

 

1.3.5 Definition of Linear Classifier

With the Hypothesis Class and the "Line" definition, we can derive the full definition of a linear classifier hypothesis

$$h(x^{(i)})=h(x^{(i)};\theta,\theta_0)=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)=\begin{cases}+1 & \text{for }\theta^T x^{(i)}+\theta_0>0\\-1 & \text{for }\theta^T x^{(i)}+\theta_0\le 0\end{cases}$$

where $\theta,\theta_0$ are parameters
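
As a minimal sketch in Python (NumPy assumed; names like `linear_classifier` are illustrative, not from the notes):

```python
import numpy as np

def linear_classifier(x, theta, theta0):
    """h(x; theta, theta0) = sign(theta^T x + theta0).

    Returns +1 when theta^T x + theta0 > 0, otherwise -1
    (the boundary itself is labeled -1, matching the definition above).
    """
    return 1 if theta @ x + theta0 > 0 else -1

# Example: the hyperplane x1 + x2 - 1 = 0 in a 2-D feature space
theta, theta0 = np.array([1.0, 1.0]), -1.0
print(linear_classifier(np.array([2.0, 2.0]), theta, theta0))  # +1
print(linear_classifier(np.array([0.0, 0.0]), theta, theta0))  # -1
```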

 

 

 

1.4 Loss Evaluation

1.4.1 Loss Function $L(g,a)$

$L$ quantifies the loss incurred on data points, whether a single one or many

$g$ = guess and $a$ = actual

You can choose whatever rule is appropriate for your data points

  • Ex. 0-1 Loss

    Very classic, very common, very boring...

    $$L(g,a)=\begin{cases}0 & \text{if }g=a\\1 & \text{else}\end{cases}$$

  • Ex. Asymmetric Loss

    The loss value depends on how the guess $g$ deviates from the actual value $a$

    $$L(g,a)=\begin{cases}1 & \text{if }g=1,\ a=-1\\100 & \text{if }g=-1,\ a=1\\0 & \text{else}\end{cases}$$

    For example, if the model is guessing whether a patient is sick, $-1$ indicates healthy and $1$ indicates not

    • If you find someone sick ($g=1$) but they're not ($a=-1$), then they might just have to do some unnecessary check-ups; not a very serious consequence, so the loss could be as small as 1

    • But if you find someone healthy ($g=-1$) but they're not ($a=1$), then... Well, 100 is a very fair loss value

 

 

1.4.2 Training Error

Assume we have $n$ data points in the Training Data, then the Training Error $E_n(h)$ is

$$E_n(h)=\frac{1}{n}\sum_{i=1}^{n}L\big(h(x^{(i)}),y^{(i)}\big)$$

  • We prefer hypothesis $h_a$ to $h_b$ if $E_n(h_a)<E_n(h_b)$
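
A sketch of the training error under 0-1 loss (assuming NumPy arrays `X` of shape (n, d) and labels `y`, plus the `linear_classifier` sketch from 1.3.5):

```python
def zero_one_loss(g, a):
    """0-1 loss: 0 if the guess equals the actual label, else 1."""
    return 0 if g == a else 1

def training_error(X, y, theta, theta0):
    """Average loss of the hypothesis over the n training points."""
    n = X.shape[0]
    losses = [zero_one_loss(linear_classifier(X[i], theta, theta0), y[i])
              for i in range(n)]
    return sum(losses) / n
```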

 

 

1.4.3 Test Error

Assume we have $n'$ new data points for testing, then the Test Error $E(h)$ is

$$E(h)=\frac{1}{n'}\sum_{i=n+1}^{n+n'}L\big(h(x^{(i)}),y^{(i)}\big)$$

Almost the same as the Training Error; the difference is just the index

 

 

 

1.5 Random Linear Classifier

Clumsy, but not so clumsy. Here comes the pseudo-code (a Python sketch follows the notes below)

rand-lin-class

Hyperparameter $k$ is the number of iterations, i.e. the number of sets of random parameters $(\theta,\theta_0)$ to generate

  • Cons

    Not very efficient because parameter sets are generated at random, and you might have to test some very crazy hypotheses

  • Pros

    Within one round of training, you'll always get a better (or equally good) $h$ returned as $k$ increases

    I won't say the same about multiple rounds of training...

    • First training, $k=1$

      You have a chance (although extremely small) to hit the optimal parameter set in the single iteration

    • Second training, $k=2$

      It's entirely possible that you get two sets of insanely awful parameters... and end up with a huge loss
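
A Python sketch of the algorithm under stated assumptions (the pseudo-code does not fix a sampling distribution; the standard normal here is my arbitrary choice):

```python
import numpy as np

def random_linear_classifier(X, y, k, rng=np.random.default_rng()):
    """Generate k random (theta, theta0) pairs and keep the pair
    with the lowest training error under 0-1 loss."""
    best, best_err = None, np.inf
    for _ in range(k):
        theta = rng.normal(size=X.shape[1])   # random direction
        theta0 = rng.normal()                 # random offset
        preds = np.where(X @ theta + theta0 > 0, 1, -1)
        err = np.mean(preds != y)             # fraction misclassified
        if err < best_err:
            best, best_err = (theta, theta0), err
    return best
```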

 

 

 

1.6 Perceptron

1.6.1 Algorithm

This algorithm is more organized, with less randomness (I guess); pseudo-code here, and a runnable Python sketch follows the notes below

perceptron

Hyperparameter $\tau$ is the number of times to iterate through all $n$ data points in the Training Data $D_n$

The key component here is the evaluation & update in the nested for-loop, where $a$ is the actual label value and $g$ is the guess label value

  • Evaluation

    $$\underbrace{y^{(i)}}_{a}\underbrace{(\theta^T x^{(i)}+\theta_0)}_{g}\le 0$$

    The evaluation is true in the initial case, or when the actual value does not correspond to the guess value

    Tip: sign-wise, $a,g\in\{+1,-1\}$, and $\theta,\theta_0$ are zero when initialized

    • Initial Case

      $$y^{(i)}(\theta^T x^{(i)}+\theta_0)=y^{(i)}\big([0,\ldots,0]^T x^{(i)}+0\big)=y^{(i)}\cdot 0=0$$

    • Actual ≠ Guess

      The two factors have opposite signs, so sign-wise

      $$y^{(i)}(\theta^T x^{(i)}+\theta_0)\;\Rightarrow\;(+1)(-1)\text{ or }(-1)(+1)=-1<0$$

    When all the predictions made on the training data set are correct, that means

    $$y^{(i)}=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)$$

    And then the evaluation becomes, sign-wise,

    $$y^{(i)}(\theta^T x^{(i)}+\theta_0)\;\Rightarrow\;(+1)(+1)\text{ or }(-1)(-1)=1>0$$

  • Update

    We update the values of $\theta$ and $\theta_0$ whenever the guess label of a data point does not equal its actual label

    $$\theta_{\text{neo}}=\theta+y^{(i)}x^{(i)}\qquad\theta_{0,\text{neo}}=\theta_0+y^{(i)}$$

    Then, if we evaluated the same data point again, the evaluation would become

    $$y^{(i)}\big(\theta_{\text{neo}}^T x^{(i)}+\theta_{0,\text{neo}}\big)=y^{(i)}\big[(\theta+y^{(i)}x^{(i)})^T x^{(i)}+(\theta_0+y^{(i)})\big]=y^{(i)}(\theta^T x^{(i)}+\theta_0)+\underbrace{(y^{(i)})^2}_{=1}\big(\underbrace{x^{(i)T}x^{(i)}}_{=\|x^{(i)}\|^2}+1\big)=y^{(i)}(\theta^T x^{(i)}+\theta_0)+\big(\|x^{(i)}\|^2+1\big)$$

    The evaluation formula with the updated $\theta$ and $\theta_0$ is "the original evaluation + a strictly positive term"

    $$\|x^{(i)}\|^2\ge 0\;\Rightarrow\;\|x^{(i)}\|^2+1>0$$

    So this update is definitely trying to push the evaluation towards false, moving the parameters $\theta$ and $\theta_0$ towards a more correct classification
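
A runnable sketch of the perceptron under the conventions above (NumPy assumed):

```python
import numpy as np

def perceptron(X, y, tau):
    """Perceptron: tau passes over the n data points, updating
    (theta, theta0) whenever a point is misclassified (or in the
    all-zero initial case, when the evaluation equals 0)."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(tau):
        for i in range(n):
            if y[i] * (theta @ X[i] + theta0) <= 0:  # evaluation
                theta = theta + y[i] * X[i]          # update theta
                theta0 = theta0 + y[i]               # update theta0
    return theta, theta0
```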

 

 

1.6.2 Linear Separability

A training data set $D_n$ is Linearly Separable if there exist $\theta,\theta_0$ such that, for every $i\in\{1,2,\ldots,n\}$, it satisfies any of the below

  • $y^{(i)}(\theta^T x^{(i)}+\theta_0)>0$

  • $h(x^{(i)};\theta,\theta_0)=y^{(i)}$

  • $E_n(h)=0$

Well, they are actually completely equivalent...

The data set in the picture below is definitely not linearly separable

not linearly separable

 

 

1.6.3 Margin of Training Data

The Margin of a training data set $D_n$ with respect to the hyperplane defined by $\theta,\theta_0$ is the minimum of the signed distances from the $n$ data points to the hyperplane

$$\text{Margin}=\min_{i\in\{1,2,\ldots,n\}}\;y^{(i)}\left(\frac{\theta^T x^{(i)}+\theta_0}{\|\theta\|}\right)$$

(Point-to-Hyperplane Distance - see 1.3.4)

This quantity is positive if and only if all data points are correctly predicted

If any data point is incorrectly classified, $y^{(i)}$ and $\frac{\theta^T x^{(i)}+\theta_0}{\|\theta\|}$ will have different signs, so their product will be negative, which guarantees the Margin is negative (the equation returns the minimum product)

 

 

1.6.4 Perceptron Convergence Theorem

Assume:

  • Hypothesis Class $H$: classifiers with hyperplanes passing through the origin ($\theta_0=0$)

    (way to achieve this: see 1.6.5)

  • There exist $\theta^*$ and $\gamma>0$ such that $y^{(i)}\left(\frac{\theta^{*T}x^{(i)}}{\|\theta^*\|}\right)\ge\gamma$ for $i\in\{1,2,\ldots,n\}$

    i.e. we have a minimum value for the point-to-hyperplane distance

  • All data points have a bounded magnitude: $\|x^{(i)}\|\le R$ for $i\in\{1,2,\ldots,n\}$

    i.e. we have a maximum range for the data points

theory

Conclusion:

  • The perceptron algorithm will make at most $\left(\frac{R}{\gamma}\right)^2$ updates to $\theta$, after which the Training Error of the hypothesis will be 0

 

 

1.6.5 Classifier Offset

We could expand the feature space to "get rid of" the offset $\theta_0$ in the hyperplane definition

  • Hyperplane with Offset

    $\theta\in\mathbb{R}^d$, $x^{(i)}\in\mathbb{R}^d$

    $$\big\{x^{(i)}:\theta^T x^{(i)}+\theta_0=0\big\}$$

    Notice that $\theta_0=\theta_0\times 1$

  • Hyperplane without Offset

    We could move $\theta_0$ and $1$ into the feature spaces of $\theta$ and $x$

    $\theta_{\text{neo}}\in\mathbb{R}^{d+1}$, $x_{\text{neo}}\in\mathbb{R}^{d+1}$

    • $\theta_{\text{neo}}=[\theta_1,\theta_2,\ldots,\theta_d,\theta_0]^T$

    • $x_{\text{neo}}^{(i)}=[x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)},1]^T$

    And it is exactly the same equation as the one with the offset (we just hide it)

    $$\theta_{\text{neo}}^T x_{\text{neo}}^{(i)}=\theta^T x^{(i)}+\theta_0$$

    With the above modification, we have the new definition for the hyperplane as below

    $$\big\{x_{\text{neo}}^{(i)}[1\!:\!d]:\theta_{\text{neo}}^T x_{\text{neo}}^{(i)}=0\big\}$$
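
A one-function NumPy sketch of this augmentation (names are illustrative):

```python
import numpy as np

def augment(X, theta, theta0):
    """Fold the offset into the parameter vector by appending a constant
    1-feature to every data point: theta_neo^T x_neo = theta^T x + theta0."""
    X_neo = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (n, d+1)
    theta_neo = np.append(theta, theta0)              # shape (d+1,)
    return X_neo, theta_neo
```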

 

 

 

1.7 Algorithm Evaluation

1.7.1 Nonlinear Boundary

For data sets that are not linearly separable, we could use polynomial features to create nonlinear boundaries for classification

One way to implement these nonlinear boundaries is to use a Taylor expansion to approximate a smooth function; below is the table of terms in the Taylor polynomial when each data point has $d$ features

taylor poly

Usually, a $k$th order Taylor polynomial could create a boundary for data separated into $k$ groups, like below

  • $k=2$ for the donut-shape example at the top

  • $k=3$ for the weird-shape example at the bottom

nonlinbound

As you can see, classifiers of training data in 2-D space could be expressed as hyperplanes in 3-D space; this applies not only to nonlinear boundaries, but also to linear ones

lin vs nonlin

Mathematically speaking, these kinds of boundaries are just nonlinear in $x^{(i)}$, but still linear in $\theta$, with $x^{(i)}$ in the equation

They are still linear classifiers!

  • Classifier of Linear Boundary - Example at the top

    Hyperplane expressed by

    $$z=\theta^T x^{(i)}+\theta_0$$

    where $x^{(i)}=[x_1^{(i)},x_2^{(i)}]^T$

  • Classifier of Nonlinear Boundary - Example at the bottom

    Hyperplane expressed by

    $$z=\theta^T\phi(x^{(i)})+\theta_0$$

    where $\phi(x^{(i)})=[x_1^{(i)},x_2^{(i)},(x_1^{(i)})^2,(x_2^{(i)})^2,x_1^{(i)}x_2^{(i)}]^T$
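
A sketch of the quadratic feature map $\phi$ above (NumPy assumed; the weights in the example are made up to draw a unit circle):

```python
import numpy as np

def phi(x):
    """Quadratic feature map for 2-D input: the boundary is nonlinear
    in x, but the classifier stays linear in theta."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# A linear classifier over phi(x) draws a nonlinear boundary in (x1, x2)
theta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # illustrative weights
theta0 = -1.0                                # x1^2 + x2^2 = 1: a circle
print(np.sign(theta @ phi(np.array([0.5, 0.5])) + theta0))  # inside: -1
```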

 

 

1.7.2 Overfitting

Actually, we could apply those super-flexible functions (or nonlinear boundaries) from 1.7.1 to any data set, but there are extreme cases where you don't want such power...

overfit

The Training Error would be 0 with the boundary we develop above, but are you sure this is what you want?

This is Overfitting

I'm 99% sure you can't generalize this boundary to new data points in the tests...

 

 

1.7.3 Cross Validation

This is a way to prevent overfitting

  1. Shuffle the training data $D_n$, then divide it into $k$ groups: $D_{n,1},D_{n,2},\ldots,D_{n,k}$

    sectioned data set

  2. Repeat the following steps k times

    1. Train your $h_i$ on all groups of $D_n$ except one group $D_{n,i}$

      The orange section of Dn

    2. Test $h_i$ on the group left out, and compute the Test Error $E(h_i,D_{n,i})$

      The blue section of Dn

     

  3. Compute the average Test Error $\frac{1}{k}\sum_{i=1}^{k}E(h_i,D_{n,i})$

! [WARNING] ! Do please shuffle the order of your data set, unless you want to train your $h$ on some very biased group of data like the one in the grey box in the graph below

biased data

Trust me, you don't want this...
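
A minimal sketch of the procedure, assuming a `train` function that returns a hypothesis and an `error` function that computes the average loss (both names are placeholders):

```python
import numpy as np

def cross_validate(X, y, k, train, error, rng=np.random.default_rng()):
    """Shuffle D_n, split it into k folds, train on k-1 folds,
    test on the held-out fold, and average the k test errors."""
    idx = rng.permutation(len(y))            # shuffle first!
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        rest = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train(X[rest], y[rest])          # train on everything else
        errors.append(error(h, X[held_out], y[held_out]))
    return np.mean(errors)
```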

 

 

 

 

 

2. Feature Encoding

The following is an example data table before encoding

data ex

 

 

 

2.1 Numerical Data

Order & Differences of data values Do Matter

In the example data given above, resting heart rate, age, and family income are numerical data

Although age does look categorical or ordinal, by the definition above, it's still numerical

Usually we just leave numerical data as it is, but...

 

 

2.1.1 Level of Detail

Please be careful when pre-processing the data...

When people deal with "vague" data like age in the table above, they might do something like rounding "40s" to "45"

DON'T DO THAT unless you are sure this rounding is needed and would definitely not negatively impact the result

  • One suggested practice is to just use "4" for "40s"

Gentle Reminder Again: Don't Do Anything Unnecessary

 

 

2.1.2 Data Standardization

Sometimes we need to visualize the data...Do you know what's gonna happen if we plot resting heart rate versus family income in the table above with their linear classifier?

no data std

You could find a linear classifier that works properly; it just looks like hell...

Solution: Standardize the data!

For numerical data, the simplest way is what we do to an arbitrary normal distribution to make it a standard normal distribution. The $d$th feature in the data could be standardized to $z_d^{(i)}$ as below

$$z_d^{(i)}=\frac{x_d^{(i)}-\mu_d}{\sigma_d}$$

where $\mu_d$ and $\sigma_d$ are the mean and the standard deviation of all the $d$th features in the data set
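
A sketch of this standardization over a whole feature matrix (NumPy assumed):

```python
import numpy as np

def standardize(X):
    """Standardize each feature column: subtract the column mean,
    then divide by the column standard deviation."""
    mu = X.mean(axis=0)      # mu_d for each feature d
    sigma = X.std(axis=0)    # sigma_d for each feature d
    return (X - mu) / sigma
```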

 

2.2 Categorical Data

Order of data values Does Not Matter

In the example data given above, pain?, job, medicines are categorical data

 

 

2.2.1 Boolean Encoding

A typical way to convert something like "Yes/No" or "True/False", for example

$$\text{pain?}:\{\text{Yes},\text{No}\}\to\{+1,-1\}$$

You can also use $\{1,0\}$ depending on your needs

 

 

2.2.2 One-Hot Encoding

Converting "Single Choice" data. Turn each category into a unique "0/1 Feature", the feature become 1 when we "select" it, otherwise that bit would always be 0

Example: job

We could turn 5 jobs into 5 features: j1,j2,j3,j4,j5 where ji{0,1}

If you want to group nurse and doctor together saying that they have some features in common, then you can consider Factored Encoding (see 2.2.3)
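
A sketch of one-hot encoding (the job list is invented for illustration):

```python
import numpy as np

JOBS = ["doctor", "nurse", "admin", "security", "janitor"]  # illustrative

def one_hot(job):
    """One-hot encode a single-choice category: exactly one bit is 1."""
    v = np.zeros(len(JOBS))
    v[JOBS.index(job)] = 1.0
    return v

print(one_hot("nurse"))  # [0. 1. 0. 0. 0.]
```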

 

 

2.2.3 Factored Encoding

Converting "Multiple Choices" data, rather than "Single Choice" data like the job we discussed above, for example

Example: medicine

It seems like we have 4 different medicine selection, but in fact, it is just combinations of 2 kinds of medicine, and therefore we only have to deal with 2 features

{pain,beta blocker}{m1,m2}

 

 

 

2.3 Ordinal Data

Order of data values Does Matter but Differences Do Not

Converts "scales" like "levels of pain" or "levels of agreement"

There is no example in the data table above, but we could provide one like this

ordinal

This kind of feature encoding method is called a "Thermometer Code" or "Unary Code"
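
A sketch of a thermometer code (assuming integer levels starting at 1):

```python
import numpy as np

def thermometer(level, num_levels):
    """Thermometer (unary) code: the first `level` bits are 1.
    Order is preserved, but differences between codes are not meaningful."""
    return np.array([1.0 if i < level else 0.0 for i in range(num_levels)])

print(thermometer(3, 5))  # pain level 3 of 5 -> [1. 1. 1. 0. 0.]
```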

 

 

 

2.4 Processed Data Example

The encoded and standardized example data table

encoded & std

 

 

 

 

 

3. Regression

3.1 Gradient Descent

3.1.1 Gradient

The Gradient tells us the direction that increases the "height of the function" (along the axis of the dependent variable) the most. Mathematically, it is a vector of partial derivatives with respect to each of the variables; the Gradient of $f$ with respect to $\Theta$ is

$$\nabla_\Theta f(\Theta)=\begin{bmatrix}\frac{\partial f(\Theta)}{\partial\Theta_1}\\\frac{\partial f(\Theta)}{\partial\Theta_2}\\\vdots\\\frac{\partial f(\Theta)}{\partial\Theta_m}\end{bmatrix}$$

where the parameter $\Theta=\{\Theta_1,\Theta_2,\ldots,\Theta_m\}\in\mathbb{R}^m$

gradient

In the picture above, the graph of the function $f(\Theta)$ with $\Theta\in\mathbb{R}^2$ is on the left-hand side; a contour graph (top view) of the same function is on the right-hand side, where the green arrow denotes the direction of the gradient at the red dot.

 

 

3.1.2 Convex Function

A real-valued function is Convex if the line segment connecting any two distinct points on the graph of this function lies entirely on or above the graph

Two sets of examples are presented below

convexity

When a function is convex, it is easier to minimize

 

 

3.1.3 (Batch) Gradient Descent

When saying "Gradient Descent", this is the version we talk about

Since we are talking about Descent, we are actually interested in how to minimize (instead of maximize) the function; in terms of the contour graph in 3.1.1, the red arrow denotes the desired direction

descent direction

And by going down the contour lines step by step, we could (probably) reach the minimum of the function, the "bottom of the valley". The picture below shows the simplest process of gradient descent

gradient descent demo

Here comes the pseudo-code (a Python sketch follows the notes below)

gd pseudo code

  • Parameter Explanation

    • $\Theta_{\text{init}}$ = initial "position"

    • $\Theta^{(t)}$ = parameters of $f(\Theta)$ at the $t$th step

    • $f(\Theta)$ = convex function

    • $\nabla_\Theta f(\Theta)$ = gradient of $f(\Theta)$ with respect to $\Theta$, as a function

    • $\eta$ = step size

  • Stop Condition

    You could choose different stop_conditions for the while loop here

    • Maximum number of iterations $T$

      Always a good choice

    • $|f(\Theta^{(t)})-f(\Theta^{(t-1)})|\le\epsilon$

      stop when the step difference of the function value is small enough

    • $\|\Theta^{(t)}-\Theta^{(t-1)}\|\le\epsilon$

      stop when you can hardly "improve" the parameter set of your convex function

    • $\|\nabla_\Theta f(\Theta^{(t)})\|\le\epsilon$

      stop when the gradient is small enough

  • Performance Theorem

    Based on the pseudo-code in 3.1.3, if we select any $\epsilon>0$, then

    • Assume

      • $\eta$ is sufficiently small

      • $f(\Theta)$ is sufficiently smooth and convex

      • $f(\Theta)$ has at least one global optimum

        I mean minimum here

    • Conclusion

      If the code runs long enough, Gradient Descent will return a value within $\epsilon$ of a global optimum $\Theta^*$
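
A minimal sketch matching the pseudo-code, using a fixed iteration count $T$ as the stop condition:

```python
import numpy as np

def gradient_descent(theta_init, eta, grad_f, T):
    """Batch gradient descent: step against the gradient T times."""
    theta = theta_init
    for _ in range(T):
        theta = theta - eta * grad_f(theta)  # move downhill
    return theta

# Example: minimize f(Theta) = ||Theta||^2, whose gradient is 2 * Theta
theta_star = gradient_descent(np.array([3.0, -2.0]), eta=0.1,
                              grad_f=lambda th: 2 * th, T=100)
print(theta_star)  # close to [0, 0]
```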

 

 

3.1.4 Stochastic Gradient Descent

Sort of a "Drunk" Gradient Descent

If you look at how gradients are computed in (Batch) Gradient Descent (3.2.5 & 3.3.5), you shall see that we sum the gradients over the whole training data set, then take the average

In Stochastic Gradient Descent, we randomly select one data point at a time, and just look at its gradient. This is noisier, but since we don't have to do the full summation, iterations become cheaper

GD vs SGD

Here's the pseudo-code for it

sgd pseudo code

About the changes in parameters

  • $\eta(t)$ gets smaller as $t$ gets larger

  • $\nabla_\Theta f_i(\Theta^{(t-1)})$ is the gradient at a single data point

[Note] In practice, SGD uses a small collection of data points at a time, not just one, called a minibatch
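
A sketch of SGD (the $1/(t+1)$ step-size schedule is an illustrative assumption; `grad_fi` computes the gradient at data point $i$):

```python
import numpy as np

def sgd(theta_init, grad_fi, n, T, rng=np.random.default_rng()):
    """Stochastic gradient descent: at each step, follow the gradient
    of one randomly chosen data point, with a shrinking step size."""
    theta = theta_init
    for t in range(T):
        i = rng.integers(n)          # pick one data point at random
        eta_t = 1.0 / (t + 1)        # eta(t) decays as t grows
        theta = theta - eta_t * grad_fi(theta, i)
    return theta
```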

 

 

 

3.2 Logistic Regression

Common linear classifiers can only divide linearly separable data; they have no notion of the uncertainty of $y^{(i)}$, or, to be more specific, of linearly non-separable data, where different label values $y^{(i)}$ are mixed together in some regions

linearly unseparable

For example, at what temperature would you wear a coat?

  • $x^{(i)}\in\mathbb{R}$ with only one feature: Temperature

  • $y^{(i)}\in\mathbb{R}$ is the probability of wearing a coat

logistic regression uncertainty

The graph on the left-hand side says that someone will wear a coat whenever the temperature is under a definite bound, and will not when it is above. But real-life experience says there is a certain range of temperatures in which you are uncertain about whether you need a coat, just like the graph on the right-hand side.

  • Definitely need a coat when under 10°C

  • Be 70% sure that you need a coat at 13°C, but on some very sunny days, you don't

  • Be only 30% sure that you need a coat at 17°C, but on some very rainy days, you do need one

  • Definitely do not need a coat when above 20°C

The shape of the graph on the right is a Sigmoid/Logistic Function, or Linear Logistic Classifier

 

 

3.2.1 Logistic Function
  • Definition

    $$\sigma(z)=\frac{1}{1+e^{-z}}$$

    logistic regression

  • Shifting/Scaling

    Add a scaling constant $\theta$ and a shifting constant $\theta_0$ to the variable of the Logistic Function

    $$\sigma(\theta z+\theta_0)=\frac{1}{1+e^{-(\theta z+\theta_0)}}$$

    Suppose we replace $z$ with a multi-feature vector $x^{(i)}\in\mathbb{R}^d$; then we would find something familiar...

    $$\sigma(\theta^T x^{(i)}+\theta_0)=\frac{1}{1+e^{-(\theta^T x^{(i)}+\theta_0)}}$$

    If $d=2$, then the graph of the function above would be the following

    logistic visualize

Tip: the Logistic Function has $\sigma(z)\in(0,1)$, and by itself it is not a classifier

 

 

3.2.2 Prediction Threshold

How do we make a prediction decision from $\{+1,-1\}$ with the Logistic Function?

  • Default Practice

    The output of the Logistic Function is a probability, so...

    • $+1$ if $\text{Probability}>0.5$

    • $-1$ if $\text{Probability}\le 0.5$

    The value 0.5 above is the Prediction Threshold

  • Simplification

    $$\text{Probability}>0.5\;\Rightarrow\;\sigma(\theta^T x^{(i)}+\theta_0)>0.5\;\Rightarrow\;\frac{1}{1+e^{-(\theta^T x^{(i)}+\theta_0)}}>0.5\;\Rightarrow\;e^{-(\theta^T x^{(i)}+\theta_0)}<1\;\Rightarrow\;\theta^T x^{(i)}+\theta_0>0$$

    The last inequality should be very familiar to you (see 1.3.5). We can see that the default hypothesis for the Logistic Function is exactly the same as that of the Linear Classifier!

    $$h(x^{(i)})=\operatorname{sign}(\theta^T x^{(i)}+\theta_0)=\begin{cases}+1 & \text{if }\theta^T x^{(i)}+\theta_0>0\\-1 & \text{if }\theta^T x^{(i)}+\theta_0\le 0\end{cases}$$

 

 

3.2.3 NLL Loss

The loss function we define here is the negative log-likelihood (NLL). To be honest, this is not really a "loss" function, but a function that will help us learn a Linear Logistic Classifier

We know the output of the logistic function is in $(0,1)$, which is a probability; in our case, it is the probability that a data point is labeled "$+1$", or the probability that you need a coat

Assume the data points are completely independent; then we can find the probability that the whole training data set is properly labeled by multiplying the probabilities that each data point is properly labeled

  • The probability that a data point is labeled "$+1$" is denoted $g^{(i)}$

    $$g^{(i)}=\sigma(\theta^T x^{(i)}+\theta_0)$$

  • The probability that a data point is properly labeled is denoted $P(x^{(i)})$

    • if $y^{(i)}=+1$, then $P(x^{(i)})=g^{(i)}$

    • if $y^{(i)}\ne+1$, then $P(x^{(i)})=1-g^{(i)}$

We could find the probability that the whole training data set is properly labeled with the equation below

$$P(\text{data})=\prod_{i=1}^{n}P(x^{(i)})=\prod_{i=1}^{n}\begin{cases}g^{(i)} & \text{if }y^{(i)}=+1\\1-g^{(i)} & \text{if }y^{(i)}\ne+1\end{cases}=\prod_{i=1}^{n}\big(g^{(i)}\big)^{\mathbb{1}\{y^{(i)}=+1\}}\big(1-g^{(i)}\big)^{\mathbb{1}\{y^{(i)}\ne+1\}}$$

Explanation for why we need the negative log-likelihood

  • Q: Why logarithm?

    A: Computing long products hurts the numerical precision of results on a computer, so we would like to convert our formula from a product into a summation

  • Q: Why negative?

    A: Our objective is to maximize the probability that the whole training data set is properly labeled, but we prefer to minimize, which taking the negative sign achieves

The Loss of Data (not yet the NLL Loss) in general is defined as below

$$L(\text{data})=-\frac{1}{n}\log\big(P(\text{data})\big)=-\frac{1}{n}\log\Big[\prod_{i=1}^{n}\big(g^{(i)}\big)^{\mathbb{1}\{y^{(i)}=+1\}}\big(1-g^{(i)}\big)^{\mathbb{1}\{y^{(i)}\ne+1\}}\Big]=-\frac{1}{n}\sum_{i=1}^{n}\Big[\mathbb{1}\{y^{(i)}=+1\}\log\big(g^{(i)}\big)+\mathbb{1}\{y^{(i)}\ne+1\}\log\big(1-g^{(i)}\big)\Big]$$

Tip: Logarithm Properties

  • $\log_b(xy)=\log_b(x)+\log_b(y)$

  • $\log_b(x^p)=p\log_b(x)$

The NLL Loss in the equation above is defined by the following ($g$ is the guess value and $a$ is the actual value)

$$L_{\text{nll}}(g^{(i)},y^{(i)})=-\Big[\mathbb{1}\{y^{(i)}=+1\}\log\big(g^{(i)}\big)+\mathbb{1}\{y^{(i)}\ne+1\}\log\big(1-g^{(i)}\big)\Big]$$

You can use a logarithm of any base

Default base value $=e$

For convenience, if $y^{(i)}\in\{0,1\}$, the NLL Loss function could be defined more compactly as

$$L_{\text{nll}}(g^{(i)},y^{(i)})=-\Big[y^{(i)}\log\big(g^{(i)}\big)+(1-y^{(i)})\log\big(1-g^{(i)}\big)\Big]$$

We could reorganize $L(\text{data})$ into the following objective $J_{\text{lr}}(\Theta)$ to minimize

$$J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)$$

lr $=$ logistic regression

$\Theta=$ parameters

(a multi-class version is available in Machine Learning - NN Special Topics 2.)
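
A sketch of $L_{\text{nll}}$ and $J_{\text{lr}}$, assuming labels in $\{0,1\}$ as in the compact form above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    """NLL loss for a guess probability g and an actual label y in {0, 1}."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def J_lr(X, y, theta, theta0):
    """Average NLL over the training data: the objective to minimize."""
    g = sigmoid(X @ theta + theta0)
    return np.mean(nll_loss(g, y))
```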

 

 

3.2.4 Overfitting & Regularizer

First of all, $J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)$ is Differentiable & Convex

This is a claim, not an assumption

Assume we want to train a linear logistic classifier for the training data below

data example

Run the pseudo-code in 3.1.3 with the NLL Loss function defined in 3.2.3

  • Gradient_Descent $\big(\Theta_{\text{init}},\ \eta,\ J_{\text{lr}}(\Theta),\ \nabla_\Theta J_{\text{lr}}(\Theta),\ \epsilon\big)$

Then we need to ask: How far should we go?

  • The whole idea of the NLL Loss is to increase the probability of the whole data set

  • We can increase this probability by making the graph of $g^{(i)}$ steeper (see 3.2.3)

  • So ideally, we would always end up with the graph at Step 114514 in the picture below

logistic regression

We can plot the objective function and see what's happening (assume $\theta_0=0$)

loss function plot

The value of the NLL Loss keeps going down as $\theta$ increases

Even past some value of $\theta$ on the positive axis, where the graph of the function looks like a horizontal line, its value is still decreasing

But being overly certain is pointless; we don't need it.

The solution is to add a Regularizer / Penalty $R(\theta)=\lambda\|\theta\|^2$ with $\lambda\ge 0$ that penalizes being overly certain, while still keeping the objective $J_{\text{lr}}$ differentiable & convex

$R(\theta)$ is a quadratic function with a nonnegative coefficient... such a quadratic is always convex

$$J_{\text{lr}}(\Theta)=J_{\text{lr}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)+\lambda\|\theta\|^2$$

You could see its effect as below

regularizer

 

 

3.2.5 Gradient Descent for Logistic Regression

We could set up our Logistic Regression Learning Algorithm based on the pseudo-code of gradient descent in 3.1.3

  • Gradients

    We need to update all the parameters ($\Theta=[\theta,\theta_0]$) with their respective gradients

    Tips:

    1. Default logarithm base value $=e$

    2. $g^{(i)}=\sigma(\theta^T x^{(i)}+\theta_0)$

    3. Derivatives

      $$\frac{d}{dx}\ln(x)=\frac{1}{x}\qquad\frac{d}{dx}\sigma(x)=\sigma(x)\big(1-\sigma(x)\big)$$

    • Gradient of $\theta$

      $$\begin{aligned}\nabla_\theta J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L_{\text{nll}}}{\partial\theta}+\nabla_\theta\big(\lambda\|\theta\|^2\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[y^{(i)}\frac{1}{g^{(i)}}\frac{\partial g^{(i)}}{\partial\theta}+(1-y^{(i)})\frac{1}{1-g^{(i)}}\frac{\partial(1-g^{(i)})}{\partial\theta}\Big]+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{y^{(i)}}{g^{(i)}}-\frac{1-y^{(i)}}{1-g^{(i)}}\Big]\frac{\partial g^{(i)}}{\partial\theta}+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\frac{y^{(i)}-y^{(i)}g^{(i)}-g^{(i)}+y^{(i)}g^{(i)}}{g^{(i)}(1-g^{(i)})}\,g^{(i)}\big(1-g^{(i)}\big)\,\nabla_\theta\big(\theta^T x^{(i)}+\theta_0\big)+2\lambda\theta\\&=-\frac{1}{n}\sum_{i=1}^{n}\big(y^{(i)}-g^{(i)}\big)x^{(i)}+2\lambda\theta\end{aligned}$$

      Reorganize the equation, and we have the following gradient equation

      $$\nabla_\theta J_{\text{lr}}(\Theta)=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]x^{(i)}+2\lambda\theta$$

    • Gradient of $\theta_0$

      $$\begin{aligned}\nabla_{\theta_0}J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L_{\text{nll}}}{\partial\theta_0}+\nabla_{\theta_0}\big(\lambda\|\theta\|^2\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[y^{(i)}\frac{1}{g^{(i)}}\frac{\partial g^{(i)}}{\partial\theta_0}+(1-y^{(i)})\frac{1}{1-g^{(i)}}\frac{\partial(1-g^{(i)})}{\partial\theta_0}\Big]\\&=-\frac{1}{n}\sum_{i=1}^{n}\Big[\frac{y^{(i)}}{g^{(i)}}-\frac{1-y^{(i)}}{1-g^{(i)}}\Big]\frac{\partial g^{(i)}}{\partial\theta_0}\\&=-\frac{1}{n}\sum_{i=1}^{n}\frac{y^{(i)}-y^{(i)}g^{(i)}-g^{(i)}+y^{(i)}g^{(i)}}{g^{(i)}(1-g^{(i)})}\,g^{(i)}\big(1-g^{(i)}\big)\,\frac{\partial}{\partial\theta_0}\big(\theta^T x^{(i)}+\theta_0\big)\\&=-\frac{1}{n}\sum_{i=1}^{n}\big(y^{(i)}-g^{(i)}\big)\cdot 1\end{aligned}$$

      Reorganize the equation, and we have the following gradient equation

      $$\nabla_{\theta_0}J_{\text{lr}}(\Theta)=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]$$

  • Summary

    $$\begin{aligned}J_{\text{lr}}(\theta,\theta_0)&=\frac{1}{n}\sum_{i=1}^{n}L_{\text{nll}}\big(\sigma(\theta^T x^{(i)}+\theta_0),\,y^{(i)}\big)+\lambda\|\theta\|^2\\\nabla_\theta J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]x^{(i)}+2\lambda\theta\\\nabla_{\theta_0}J_{\text{lr}}(\Theta)&=\frac{1}{n}\sum_{i=1}^{n}\big[\sigma(\theta^T x^{(i)}+\theta_0)-y^{(i)}\big]\end{aligned}$$
  • Algorithm

    Here's the pseudo-code of gradient descent for logistic regression

    gradient descent for logistic regression
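
A sketch implementing these gradients with batch gradient descent (labels assumed in $\{0,1\}$, matching the derivation; `sigmoid` as in the 3.2.3 sketch):

```python
import numpy as np

def logistic_regression_gd(X, y, lam, eta, T):
    """Batch gradient descent on J_lr with regularizer lam * ||theta||^2."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(T):
        g = sigmoid(X @ theta + theta0)                  # g_i for all i
        grad_theta = (X.T @ (g - y)) / n + 2 * lam * theta
        grad_theta0 = np.mean(g - y)
        theta -= eta * grad_theta
        theta0 -= eta * grad_theta0
    return theta, theta0
```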

 

 

 

3.3 Linear Regression

3.3.1 Definition
  • Hypothesis

    The hypothesis of Linear Regression is a hyperplane... but now we are just trying to fit the data, without drawing conclusions from this hyperplane the way we did for classification

    $$h(x^{(i)};\theta,\theta_0)=\theta^T x^{(i)}+\theta_0$$

  • Training Loss

    The loss function we use for linear regression is the Squared Loss

    $$L(g,a)=(g-a)^2$$

    while the optimization (minimization) objective uses the Mean Squared Loss

    $$J(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}L\big(h(x^{(i)};\theta,\theta_0),\,y^{(i)}\big)=\frac{1}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)^2$$

    Notice that, using the same trick as in 1.6.5, the offset $\theta_0$ could be merged into $\theta^T x^{(i)}$ (with $\theta,x^{(i)}:\mathbb{R}^d\to\mathbb{R}^{d+1}$)

    Then we could stack the features $x^{(i)}$ and results $y^{(i)}$ as matrices $\tilde{X}\in\mathbb{R}^{n\times(d+1)}$ and $\tilde{Y}\in\mathbb{R}^{n}$

    $$\tilde{X}=\begin{bmatrix}x_1^{(1)} & \cdots & x_d^{(1)} & 1\\\vdots & & \vdots & \vdots\\x_1^{(n)} & \cdots & x_d^{(n)} & 1\end{bmatrix}\qquad\tilde{Y}=\begin{bmatrix}y^{(1)}\\\vdots\\y^{(n)}\end{bmatrix}$$

    With the above, rewrite the optimization objective in matrix form as below

    $$J(\theta,\theta_0)=\frac{1}{n}\big(\underbrace{\tilde{X}\theta-\tilde{Y}}_{\in\,\mathbb{R}^n}\big)^T\big(\tilde{X}\theta-\tilde{Y}\big)=\frac{1}{n}\big\|\tilde{X}\theta-\tilde{Y}\big\|^2$$

 

 

3.3.2 Direct Solution

A function is uniquely minimized at a point if the Gradient at that point is 0 and the function "curves up"

The Gradient of the optimization objective in 3.3.1 is the following matrix derivative, with $\nabla_\theta J(\theta,\theta_0)\in\mathbb{R}^{d+1}$

$$\nabla_\theta J(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)$$

By setting $\nabla_\theta J(\theta,\theta_0)=0$, we could (probably) find the desired unique minimizer $\theta^*$

$$\frac{2}{n}\tilde{X}^T(\tilde{X}\theta-\tilde{Y})=0\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta-\tilde{X}^T\tilde{Y}=0\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{X}\theta=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y}\;\Rightarrow\;\theta^*=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y}$$

[IMPORTANT]

$(\tilde{X}^T\tilde{X})$ is a matrix of 2nd derivatives. If the function does not "curve up", $(\tilde{X}^T\tilde{X})$ will not be invertible!
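
A sketch of the direct solution. Solving the normal equations with `np.linalg.solve` instead of forming the inverse explicitly is the standard, numerically safer route:

```python
import numpy as np

def ols_direct(X, Y):
    """Direct least-squares solution theta* = (X^T X)^{-1} X^T Y,
    computed by solving the linear system X^T X theta = X^T Y."""
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])  # append the 1-feature
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)    # fails if X^T X is singular
```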

 

 

3.3.3 Demo & Possible Errors

This is the regular situation

regular situation

The following are possible errors you might encounter in real-life practice

  • No Unique Best Hyperplane

Very simple: the shape of your data happens to be awful...

    awful data shape

    • Noisy Data

      Maybe you could find a solution, but it's only there because of noise in the data, like the uncertainty of a thermometer...

      It looks like you have found the solution, but no, you shouldn't accept it!

      noisy data

  • Feature Encoding Problems

    • Correlated Real-life features

      Ex. Doctors and nurses might have the same probability of exposure to some diseases, but they are categorized with one-hot encoding...

    • Too Many Features

      If the training data has only one data point, with only one feature, then you could have infinitely many hyperplanes as solutions; this is very, very, very bad!

      so many features

      Generally speaking, your training data should have more data points than features to allow for a unique hyperplane as the solution... but sometimes the feature dimension is too high compared to the number of data points...

  • Strategy

    When we are unsure which hyperplane to choose, just choose a $\theta$ near 0 unless there's a strong reason not to

    The way to implement this strategy is Ridge Regression (see 3.3.4)

 

 

3.3.4 Ridge Regression

This addresses the problem from 3.3.3 that sometimes there isn't a unique best hyperplane as our solution (and also overfitting)

Under these situations, $(\tilde{X}^T\tilde{X})$ is usually not invertible

The solution is to add a Regularizer / Penalty $R(\theta)=\lambda\|\theta\|^2$ with $\lambda>0$ to the optimization objective in 3.3.1; this is the same thing we did for logistic regression (see 3.2.4)

$$J_{\text{ridge}}(\theta,\theta_0)=\frac{1}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)^2+\lambda\|\theta\|^2=\frac{1}{n}\big\|\tilde{X}\theta-\tilde{Y}\big\|^2+\lambda\|\theta\|^2$$

The Gradient of the optimization objective of Ridge Regression, $\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)$, is

$$\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)+2\lambda\theta$$

Minimizing $J_{\text{ridge}}(\theta,\theta_0)$ by setting $\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=0$ leads to the following

$$\frac{2}{n}\tilde{X}^T(\tilde{X}\theta-\tilde{Y})+2\lambda\theta=0\;\Rightarrow\;\frac{1}{n}\tilde{X}^T\tilde{X}\theta-\frac{1}{n}\tilde{X}^T\tilde{Y}+\lambda\theta=0\;\Rightarrow\;\frac{1}{n}\tilde{X}^T\tilde{X}\theta+\lambda\theta=\frac{1}{n}\tilde{X}^T\tilde{Y}\;\Rightarrow\;\tilde{X}^T\tilde{X}\theta+n\lambda\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;(\tilde{X}^T\tilde{X}+n\lambda I)\theta=\tilde{X}^T\tilde{Y}\;\Rightarrow\;\theta^*=(\tilde{X}^T\tilde{X}+n\lambda I)^{-1}\tilde{X}^T\tilde{Y}$$

where $I$ is an identity matrix with $I\in\mathbb{R}^{(d+1)\times(d+1)}$

This is called Ridge Regression because we are adding a "ridge" of $\lambda$ values along the diagonal of the matrix before inverting it

[IMPORTANT]

  • $(\tilde{X}^T\tilde{X}+n\lambda I)$ is a matrix of 2nd derivatives. It is ALWAYS invertible when $\lambda>0$

  • When using Ridge Regression, we assume the features are standardized as in 2.1.2
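
A sketch of the closed-form ridge solution (again via `np.linalg.solve`; note that, with $\theta_0$ folded into $\theta$, this version penalizes the offset too, the same simplification the notes make):

```python
import numpy as np

def ridge_direct(X, Y, lam):
    """Closed-form ridge solution theta* = (X^T X + n*lam*I)^{-1} X^T Y."""
    n = X.shape[0]
    Xt = np.hstack([X, np.ones((n, 1))])           # append the 1-feature
    A = Xt.T @ Xt + n * lam * np.eye(Xt.shape[1])  # add the "ridge"
    return np.linalg.solve(A, Xt.T @ Y)            # always solvable for lam > 0
```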

 

 

3.3.5 Gradient Descent Solution

You may ask: "Why do we still need gradient descent when we could derive θ directly?" (see 3.3.2/4)

Here's why: Matrix Inversion has a time complexity of O(n2)O(n3)...

This Gradient Descent Solution is based on Ridge Regression

  • Gradient of $\theta$

    $$\nabla_\theta J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\tilde{X}^T\big(\tilde{X}\theta-\tilde{Y}\big)+2\lambda\theta=\frac{2}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)x^{(i)}+2\lambda\theta$$

  • Gradient of $\theta_0$

    $$\nabla_{\theta_0}J_{\text{ridge}}(\theta,\theta_0)=\frac{2}{n}\sum_{i=1}^{n}\big(\theta^T x^{(i)}+\theta_0-y^{(i)}\big)$$

Here's the pseudo-code for Gradient Descent for Linear Regression (using Ridge Regression)

gradient descent for linear regression using ridge regression
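
A sketch of gradient descent with these ridge gradients (fixed step size and iteration count assumed):

```python
import numpy as np

def ridge_regression_gd(X, y, lam, eta, T):
    """Gradient descent for linear regression with the ridge penalty."""
    n, d = X.shape
    theta, theta0 = np.zeros(d), 0.0
    for _ in range(T):
        err = X @ theta + theta0 - y                     # residuals
        grad_theta = (2 / n) * (X.T @ err) + 2 * lam * theta
        grad_theta0 = (2 / n) * err.sum()
        theta -= eta * grad_theta
        theta0 -= eta * grad_theta0
    return theta, theta0
```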