Machine Learning - Basics
1. Introduction: Linear Classifier
1.1 Training Data
Training Data are data points, usually given together with their labels in advance
We have $n$ data points, each in the form $x^{(i)}$
Each data point has a feature vector containing $d$ features, in the form $x^{(i)} = \left(x_1^{(i)}, \ldots, x_d^{(i)}\right)$
Each data point is associated with a label value in the form $y^{(i)}$
Mathematically, Training Data $\mathcal{D}_n = \left\{\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)\right\}$
You can set Label Values $y^{(i)} \in \{+1, -1\}$
Assume for each data point, there are only 2 features ($x_1$ and $x_2$)
We could visualize the provided training data, where we have
- 2 axes to represent the features
- colored "$+$" and "$-$" signs to represent labeled values
TIP: The labels of the two axes actually mean $x_1$ and $x_2$

1.2 Hypothesis
A Hypothesis is the resulting model yielded after learning from the training data
Or we could also say it is a way to go from data to labels
By "Hypothesis" here, we actually mean a way to label NEW data points: it is a function that takes in a couple of features and returns a label. If we adopt the example above, then the hypothesis is a function $h(x_1, x_2)$ that returns a label $y$
Of course, you can also pass the data points in the training data into $h$
1.3 Basics
1.3.1 Hypothesis Class
An example of a hypothesis class for linear classifiers is $\mathcal{H}$: all hypotheses that label $+1$ on one side of a line and $-1$ on the other side
The following two lines both represent hypotheses in this class
- Data points in the green shadow are predicted to be $+1$
- Data points in the red shadow are predicted to be $-1$
TIP: The labels of the two axes actually mean $x_1$ and $x_2$

If we test both of them with the labeled data points in the training data, the hypothesis on the left is terribly awful...
1.3.2 Math: Vector Projection
This is essential for 1.3.3
The Dot Product of two vectors $a$ and $b$ is $a \cdot b = \|a\|\|b\|\cos\theta$, where $\theta$ is the angle between them
The amount of projection of Vector $a$ onto Vector $b$ is $\|a\|\cos\theta$

We could induce the formula for vector projection with the dot product: $\|a\|\cos\theta = \dfrac{a \cdot b}{\|b\|}$
1.3.3 Math: Hyperplane
The "line", or the Linear Classifier Hyperplane, could be represented using one single normal vector $\theta$
A hyperplane is always 1 dimension lower than the space dimension (leaving one dimension to do classification)
The hyperplane in the 2-D feature space below is 1-D, which is a line
The hyperplane in a 3-D feature space would be 2-D
... ...

Notation in the graph:
- $\theta$ and $x$: vectors, 2-D in the case above, but the theory applies to higher dimensions
- $\dfrac{\theta \cdot x}{\|\theta\|}$: scalar, the amount of $x$ projected onto $\theta$
- The amount of this projection is the same for every point $x$ on the hyperplane, just because the hyperplane is perpendicular to $\theta$

[ !! WARNING !! ] For the convenience of discussion, we count the projection onto the opposite direction of $\theta$ as negative, and therefore the signed distance can be negative

The Hyperplane represented by the red dotted line could be expressed with the following equation: $\dfrac{\theta \cdot x}{\|\theta\|} = d$, where $d$ is the signed distance from the origin to the hyperplane
If we make $\theta_0 = -d\|\theta\|$, this becomes the familiar form $\theta^T x + \theta_0 = 0$
1.3.4 Math: Point-to-Hyperplane Distance
Q: What is the Signed Distance from a Hyperplane to an arbitrary point $p$?

The purple distance in the graph is what we want, and what we need to do is just to subtract the distance between the hyperplane and the origin from the amount of projection of $p$ onto $\theta$:
$$\text{signed distance} = \frac{\theta \cdot p}{\|\theta\|} - d = \frac{\theta^T p + \theta_0}{\|\theta\|}$$
Tip: If the hyperplane is below the origin, as in the last picture of 1.3.3, then $d$ changes sign, but the equation stays the same (remember that $\|\theta\|$ is a magnitude while the result is the Signed Distance)
- When the Hyperplane is above the origin, $d > 0$
- When the Hyperplane is below the origin, $d < 0$
1.3.5 Definition of Linear Classifier
With the Hypothesis Class and "Line" Definition, we could derive the full definition of a hypothesis of a Linear Classifier:
$$h(x; \theta, \theta_0) = \mathrm{sign}\left(\theta^T x + \theta_0\right) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise} \end{cases}$$
where $\theta$ is the normal vector of the hyperplane and $\theta_0$ is the offset
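As a quick sketch of this definition (the parameter values and points below are made up for illustration):

```python
# Hedged sketch of h(x; theta, theta_0) = sign(theta . x + theta_0).
# The feature values and parameters are invented for illustration.

def linear_classifier(x, theta, theta_0):
    """Return +1 if the point is on the positive side of the hyperplane, else -1."""
    activation = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return 1 if activation > 0 else -1

theta = [1.0, -1.0]   # normal vector of the line x1 - x2 + 0.5 = 0
theta_0 = 0.5

print(linear_classifier([2.0, 1.0], theta, theta_0))  # activation 1.5 > 0 -> 1
print(linear_classifier([0.0, 3.0], theta, theta_0))  # activation -2.5 < 0 -> -1
```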
1.4 Loss Evaluation
1.4.1 Loss Function
A loss function $L(g, a)$ measures how bad it is to guess $g$ when the actual label is $a$. You can choose whatever rule is appropriate for your data points
- Ex. 0-1 Loss
  Very Classic, Very Common, Very Boring...
  $$L(g, a) = \begin{cases} 0 & \text{if } g = a \\ 1 & \text{otherwise} \end{cases}$$
- Ex. Asymmetric Loss
  The loss value depends on the direction of deviation between $g$ and $a$
  For example, if the model is guessing whether the patient is sick, $+1$ indicates healthy and $-1$ indicates not
  If you find someone sick ($g = -1$) but he's not ($a = +1$), then he might just have to do some useless check-ups, not a very serious consequence, so the loss could just be small, like $1$
  But if you find someone healthy ($g = +1$) but he's not ($a = -1$), then... well, 100 is a very fair loss value
1.4.2 Training Error
Assume we have $n$ training data points. The Training Error is the average loss over the training set:
$$\mathcal{E}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L\left(h\left(x^{(i)}\right), y^{(i)}\right)$$
We prefer hypothesis $h_1$ to $h_2$ if $\mathcal{E}_n(h_1) < \mathcal{E}_n(h_2)$
1.4.3 Test Error
Assume we have $n'$ new data points that were not used in training
Almost the same as the Training Error; the difference is just the index range:
$$\mathcal{E}(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} L\left(h\left(x^{(i)}\right), y^{(i)}\right)$$
1.5 Random Linear Classifier
Clumsy, but not so clumsy. Here comes the pseudo-code

Hyperparameter: $k$, the number of random parameter sets to generate and test
- Cons
  Not very efficient, because parameter sets are generated at random, and you might have to test some very crazy hypotheses
- Pros
  In one round of training, you'll always get a better (or at least equally good) hypothesis returned as $k$ increases
  I won't say that with multiple rounds of training...
  - First Training: you have chances (although extremely small) to hit the optimal parameter set in the only one iteration
  - Second Training: it's highly possible that you get two sets of insanely awful parameters... and get a huge loss
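A sketch of the algorithm described above: sample $k$ random parameter sets, keep the one with the lowest training error. The tiny data set and the sampling range are made up for illustration:

```python
# Random linear classifier sketch: try k random (theta, theta_0) pairs,
# return the one with the smallest training error. Data set is invented.
import random

def sign(v):
    return 1 if v > 0 else -1

def training_error(data, theta, theta_0):
    wrong = sum(1 for x, y in data
                if sign(sum(t * xi for t, xi in zip(theta, x)) + theta_0) != y)
    return wrong / len(data)

def random_linear_classifier(data, k, d, rng):
    best = None
    for _ in range(k):
        theta = [rng.uniform(-1, 1) for _ in range(d)]
        theta_0 = rng.uniform(-1, 1)
        err = training_error(data, theta, theta_0)
        if best is None or err < best[0]:
            best = (err, theta, theta_0)
    return best

data = [([0.0, 0.0], -1), ([1.0, 1.0], 1), ([2.0, 2.0], 1), ([-1.0, -1.0], -1)]
err, theta, theta_0 = random_linear_classifier(data, k=100, d=2, rng=random.Random(0))
print(err)  # with 100 random tries, this easy data set is usually separated
```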
1.6 Perceptron
1.6.1 Algorithm
This algorithm is more organized with less randomness (I guess), pseudo-code here

Hyperparameter: $\tau$, the number of passes through the training data
The key component here is the evaluation & update in the nested for-loop, where
- Evaluation
  The evaluation $y^{(i)}\left(\theta^T x^{(i)} + \theta_0\right) \le 0$ is true in the initial case, or when the actual value does not correspond to the guessed value
  Tip: $\theta$ and $\theta_0$ are zero when initialized
  - Initial Case: $\theta^T x^{(i)} + \theta_0 = 0$, so the product is $0 \le 0$
  - Actual $y^{(i)}$ and guessed $\mathrm{sign}\left(\theta^T x^{(i)} + \theta_0\right)$ disagree: the product is negative
  When all the predictions made on the training data set are correct, that means $y^{(i)}$ and $\theta^T x^{(i)} + \theta_0$ always share the same sign
  And then the evaluation becomes false for every data point
- Update
  We update the values of $\theta$ and $\theta_0$ when the guessed label of any data point does not equal its actual label:
  $$\theta \leftarrow \theta + y^{(i)} x^{(i)}, \qquad \theta_0 \leftarrow \theta_0 + y^{(i)}$$
  Then for the same data point, the evaluation quantity would become
  $$y^{(i)}\left(\theta^T x^{(i)} + \theta_0\right) + \left\|x^{(i)}\right\|^2 + 1$$
  The evaluation formula with updated $\theta$ and $\theta_0$ is "original evaluation + a strictly positive term"
  So this update is definitely trying to push the evaluation towards false, moving the parameters $\theta$ and $\theta_0$ to achieve a more correct classification
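The evaluation-and-update loop above can be sketched as follows (the toy data set is made up; this is an illustrative implementation, not the original pseudo-code):

```python
# Minimal perceptron following the rule above: on each mistake,
# theta += y * x and theta_0 += y. Toy data set is invented.

def perceptron(data, tau):
    """Run tau passes over the data; return (theta, theta_0)."""
    d = len(data[0][0])
    theta = [0.0] * d
    theta_0 = 0.0
    for _ in range(tau):
        for x, y in data:
            # Evaluation: mistake (or the initial all-zero case) when
            # y * (theta . x + theta_0) <= 0
            if y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0) <= 0:
                # Update: pushes the evaluation quantity up by ||x||^2 + 1
                theta = [t + y * xi for t, xi in zip(theta, x)]
                theta_0 += y
    return theta, theta_0

data = [([1.0, 2.0], 1), ([2.0, 3.0], 1), ([-1.0, -1.5], -1), ([-2.0, -1.0], -1)]
theta, theta_0 = perceptron(data, tau=10)
# This toy set is linearly separable, so every point ends up classified correctly:
for x, y in data:
    assert y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0) > 0
```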
1.6.2 Linear Separability
A training data set $\mathcal{D}_n$ is linearly separable if there exist $\theta$ and $\theta_0$ such that $y^{(i)}\left(\theta^T x^{(i)} + \theta_0\right) > 0$ for every data point, i.e. some hypothesis classifies every training data point correctly
Well, the two statements are actually completely equivalent...
The data set in the picture below is definitely not linearly separable

1.6.3 Margin of Training Data
The Margin of training data set $\mathcal{D}_n$ with respect to $\theta, \theta_0$ is
$$\min_{i} \; y^{(i)} \cdot \frac{\theta^T x^{(i)} + \theta_0}{\|\theta\|}$$
(Point-to-Hyperplane Distance - see 1.3.4)
This quantity is positive if and only if all data points are correctly predicted;
if any data point is incorrectly classified, the margin is negative
1.6.4 Perceptron Convergence Theorem
Assume:
- Hypothesis Class: classifiers with hyperplanes passing through the origin ($\theta_0 = 0$) (way to achieve this: see 1.6.5)
- There exist $\theta^*$ and $\gamma > 0$ such that for all $i$, $y^{(i)}\dfrac{{\theta^*}^T x^{(i)}}{\|\theta^*\|} \ge \gamma$: we have a minimum value $\gamma$ for the Point-to-Hyperplane distance
- All data points have a magnitude bounded by some value $R$: $\left\|x^{(i)}\right\| \le R$ for all $i$, so we have a maximum range for data points

Conclusion:
Then the perceptron algorithm would make at most $\left(\dfrac{R}{\gamma}\right)^2$ updates to $\theta$, after which the Training Error of this Hypothesis would be 0
1.6.5 Classifier Offset
We could expand the Feature Space to "get rid of" the offset
- Hyperplane with Offset: $\theta^T x + \theta_0 = 0$. Notice that $\theta_0$ is just a scalar
- Hyperplane without Offset: we could move $\theta_0$ and a constant 1 into the parameter and feature vectors, letting $\theta_{new} = \left[\theta; \theta_0\right]$ and $x_{new} = \left[x; 1\right]$
  Then $\theta_{new}^T x_{new} = \theta^T x + \theta_0$, and it is exactly the same equation as the one with offset (we just hide it)
With the above modification, we have the new definition for the Hyperplane as below
$$\theta_{new}^T x_{new} = 0$$
1.7 Algorithm Evaluation
1.7.1 Nonlinear Boundary
For data sets that are not linearly separable, we could use polynomials to create nonlinear boundaries to do classification
One way to implement these nonlinear boundaries is to use a polynomial basis (as in a Taylor Expansion approximating a smooth function); below is the table of terms in the Taylor Polynomial for a given number of features and polynomial order

Usually, a low polynomial order is enough: one order for the donut-shape example at the top, a higher one for the weird-shape example at the bottom

As you can see, classifiers of training data in 2-D space could be expressed as hyperplanes in 3-D space; this applies not only to nonlinear boundaries, but also to linear ones

Mathematically speaking, these kinds of boundaries are just nonlinear in $x$; as functions of the transformed features $\phi(x)$,
they are still linear classifiers!
- Classifier of Linear Boundary - Example on the top
  Hyperplane expressed by $\theta^T \phi(x) + \theta_0 = 0$,
  where $\phi(x) = x$ (the identity transform)
- Classifier of Nonlinear Boundary - Example at the bottom
  Hyperplane expressed by $\theta^T \phi(x) + \theta_0 = 0$,
  where $\phi(x)$ contains polynomial terms such as $x_1^2,\; x_2^2,\; x_1 x_2,\; \ldots$
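A small sketch of this idea: a donut-shaped toy set that is not linearly separable in $(x_1, x_2)$ becomes separable after adding the polynomial feature $x_1^2 + x_2^2$ (points and parameters are invented for illustration):

```python
# Lift (x1, x2) into (x1, x2, x1^2 + x2^2): the inner/outer rings of a
# "donut" become separable by a plane in the lifted 3-D feature space.

def phi(x):
    # map (x1, x2) -> (x1, x2, x1^2 + x2^2)
    return [x[0], x[1], x[0] ** 2 + x[1] ** 2]

# Inner ring labeled +1, outer ring labeled -1 (made-up points)
data = [([0.1, 0.0], 1), ([0.0, -0.2], 1), ([2.0, 0.0], -1), ([0.0, -2.0], -1)]

# In the lifted space, the hyperplane x1^2 + x2^2 = 1 separates the rings:
theta, theta_0 = [0.0, 0.0, -1.0], 1.0
for x, y in data:
    z = phi(x)
    assert y * (sum(t * zi for t, zi in zip(theta, z)) + theta_0) > 0
print("separable after the polynomial feature map")
```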
1.7.2 Overfitting
Actually we could apply those super-flexible functions (or nonlinear boundaries) from 1.7.1 to any data set, but there are some extreme cases where you don't want such power...

Training Error would be 0 with the boundary we develop above, but are you sure this is what you want?
This is Overfitting
I'm 99% sure you can't generalize this boundary to new data points in the tests...
1.7.3 Cross Validation
This is a way to detect and prevent overfitting
- Shuffle the training data $\mathcal{D}_n$, then divide it into $k$ groups
- Repeat the following steps $k$ times:
  - Train your hypothesis on all groups of $\mathcal{D}_n$ except one group (the orange section of the figure)
  - Test the hypothesis on the group left out, and compute the Test Error (the blue section of the figure)
- Compute the average Test Error over the $k$ rounds
! [WARNING] ! Do please shuffle the order of your data set, unless you want to train your model on whatever biased ordering the data happened to come in

Trust me, you don't want this...
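The steps above can be sketched as follows; `train_fn` and `error_fn` are hypothetical stand-ins for your learning algorithm and your test-error computation, and the demo "learner" is made up:

```python
# k-fold cross validation sketch: shuffle, split into k folds, train on k-1
# folds, test on the held-out fold, average the k test errors.
import random

def cross_validate(data, k, train_fn, error_fn, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)          # the WARNING above: always shuffle first
    folds = [data[i::k] for i in range(k)]     # divide into k groups
    errors = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [p for j, f in enumerate(folds) if j != i for p in f]
        h = train_fn(train_folds)              # train on all groups except one
        errors.append(error_fn(h, test_fold))  # test on the group left out
    return sum(errors) / k                     # average Test Error

# Tiny demo with a constant-prediction "learner" (invented for illustration):
data = [(x, 1 if x > 5 else -1) for x in range(10)]
train = lambda d: (lambda x: 1)                        # always predict +1
err = lambda h, fold: sum(h(x) != y for x, y in fold) / len(fold)
avg = cross_validate(data, k=5, train_fn=train, error_fn=err)
print(avg)  # 6 of the 10 labels are -1, so the average error is 0.6
```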
2. Feature Encoding
The following is an example data table before encoding

2.1 Numerical Data
Order & Differences of data values Do Matter
In the example data given above, resting heart rate, age, family income are numerical data
Although age does look categorical or ordinal, by the definition above, it's still numerical
Usually we just leave numerical data as they are, but...
2.1.1 Level of Detail
Please be careful when pre-processing the data...
When people deal with "vague" data like age in the table above, they might do something like rounding "40s" to "45"
DON'T DO THAT unless you are sure this rounding is needed and would definitely not negatively impact the result
One suggested practice is to just use "4" for "40s"
Gentle Reminder Again: Don't Do Anything Unnecessary
2.1.2 Data Standardization
Sometimes we need to visualize the data...Do you know what's gonna happen if we plot resting heart rate versus family income in the table above with their linear classifier?

You could find a linear classifier that works properly, it just looks like hell...
Solution: Standardize the data!
For numerical data, the simplest way is what we do to an arbitrary normal distribution to make it a standard normal distribution:
$$x_{std} = \frac{x - \bar{x}}{\sigma}$$
where $\bar{x}$ is the mean and $\sigma$ is the standard deviation of the feature over the training data
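A minimal sketch of this standardization (the heart-rate values are made up; this version uses the population standard deviation):

```python
# Standardize a numerical feature: (x - mean) / std, per the formula above.
from statistics import mean, pstdev

def standardize(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

heart_rate = [60, 70, 80, 90, 100]   # invented example values
z = standardize(heart_rate)
print(z)  # mean 80, std sqrt(200): roughly [-1.414, -0.707, 0.0, 0.707, 1.414]
```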
2.2 Categorical Data
Order of data values Does Not Matter
In the example data given above, pain?, job, medicines are categorical data
2.2.1 Boolean Encoding
A typical way to convert something like "Yes/No" or "True/False"; for example, pain? could be encoded as $1$ for "Yes" and $0$ for "No"
You can also use $\{+1, -1\}$ depending on your need
2.2.2 One-Hot Encoding
Converting "Single Choice" data: turn each category into a unique "one-hot" vector that has a single 1
Example: job
We could turn 5 jobs into 5 features:
```
*job*    [j1 j2 j3 j4 j5]
nurse    [1, 0, 0, 0, 0]
admin    [0, 1, 0, 0, 0]
doctor   [0, 0, 1, 0, 0]
teacher  [0, 0, 0, 1, 0]
worker   [0, 0, 0, 0, 1]
```
If you want to group nurse and doctor together saying that they have some features in common, then you can consider Factored Encoding (see 2.2.3)
2.2.3 Factored Encoding
Converting "Multiple Choices" data, rather than "Single Choice" data like the job we discussed above, for example
Example: medicine
It seems like we have 4 different medicine selections, but in fact they are just combinations of 2 kinds of medicine, and therefore we only have to deal with 2 features
```
*selection*          [m1 m2]
pain                 [1, 0]
pain & beta blocker  [1, 1]
beta blocker         [0, 1]
none                 [0, 0]
```
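Both encodings can be sketched in a few lines (category lists taken from the tables above):

```python
# One-hot encoding for single-choice data; factored (multi-hot) encoding
# for multiple-choice data, as in the job and medicine tables above.

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

def factored(selection, components):
    return [1 if c in selection else 0 for c in components]

jobs = ["nurse", "admin", "doctor", "teacher", "worker"]
print(one_hot("doctor", jobs))                    # [0, 0, 1, 0, 0]

meds = ["pain", "beta blocker"]
print(factored({"pain", "beta blocker"}, meds))   # [1, 1]
print(factored(set(), meds))                      # [0, 0]
```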
2.3 Ordinal Data
Order of data values Does Matter, but Differences Do Not
Converting "Scales" like "Levels of Pain", "Levels of Agreement"
No example on the data table above, but we could provide one like this

This kind of feature encoding method is called "Thermometer Code" or "Unary Code"
2.4 Processed Data Example
The encoded and standardized example data table

3. Regression
3.1 Gradient Descent
3.1.1 Gradient
The Gradient tells us the direction in which the function value increases the most. Mathematically, it is a vector of partial derivatives with respect to each of the variables; the Gradient of $f$ is
$$\nabla_\theta f = \left[\frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \ldots, \frac{\partial f}{\partial \theta_m}\right]^T$$
where parameter $\theta = (\theta_1, \ldots, \theta_m)$ collects all the variables of $f$

The picture above shows the graph of the function together with its contour lines
3.1.2 Convex Function
A Real-Valued Function is Convex if the line segment connecting any two distinct points on the graph of this function lies completely above or on the graph
Two sets of examples are presented below

When a function is convex, it is easier to minimize
3.1.3 (Batch) Gradient Descent
When saying "Gradient Descent", this is the version we talk about
We are actually interested in how to minimize (instead of maximize) the function, since we are talking about Descent; in terms of the contour graph in 3.1.1, the red arrow denotes the desired direction, the negative gradient

And by going down the contour lines step by step, we could (probably) reach the minimum of the function, the "bottom of the valley". The picture below shows the simplest process of gradient descent

Here comes the pseudo-code

- Parameters Explanation
  - $\theta_{init}$: initial "position" parameters
  - $\theta^{(t)}$: parameters at step $t$
  - $f$: the convex function to minimize
  - $\nabla_\theta f$: the gradient of $f$ with respect to $\theta$, as a function
  - $\eta$: step size
- Stop Condition
  You could choose different stop_conditions for the while loop here
  - Maximum number of iterations: always a good choice
  - Stop when the step difference of the function value is small enough, $\left|f\left(\theta^{(t)}\right) - f\left(\theta^{(t-1)}\right)\right| < \epsilon$, i.e. stop when you can hardly "improve" the parameter set for your convex function
  - Stop when the gradient is small enough, $\left\|\nabla_\theta f\left(\theta^{(t)}\right)\right\| < \epsilon$
- Performance Theorem
  Based on the pseudo-code in 3.1.3, if we select any $\tilde{\epsilon} > 0$, then
  - Assume: the step size $\eta$ is sufficiently small; $f$ is sufficiently smooth and convex; $f$ has at least one global optimum (I mean minimum here)
  - Conclusion: if the code runs long enough, Gradient Descent will return a value within $\tilde{\epsilon}$ of the global optimum
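The loop described above can be sketched as follows, using the maximum-iterations stop condition on a made-up convex function $f(\theta) = (\theta - 3)^2$:

```python
# Batch gradient descent sketch: repeatedly step against the gradient.
# f and its gradient are invented for illustration.

def gradient_descent(theta_init, grad, eta, max_iters):
    theta = theta_init
    for _ in range(max_iters):          # stop condition: maximum number of iterations
        theta = theta - eta * grad(theta)
    return theta

f = lambda t: (t - 3.0) ** 2            # convex, minimized at t = 3
grad_f = lambda t: 2.0 * (t - 3.0)
theta = gradient_descent(theta_init=0.0, grad=grad_f, eta=0.1, max_iters=100)
print(theta)  # converges very close to the minimizer t = 3
```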
3.1.4 Stochastic Gradient Descent
Sort of a "Drunk" Gradient Descent
If you look at how gradients are updated in (Batch) Gradient Descent 3.2.5 & 3.3.5, you shall see that we sum the gradients of the whole training data set, then take the average
In Stochastic Gradient Descent, we randomly select one data point at a time, and just look at its gradient. This would be noisier, but since we don't have to do summations, iterations become cheaper

Here's the pseudo-code for it

About the changes in parameters: the step size $\eta(t)$ should get smaller as $t$ gets larger, and the gradient used at each step is that of a single data point
[Note] In practice, SGD uses a small collection of data points at each step, not just one; this is called a minibatch
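A sketch of SGD on a made-up one-parameter least-squares problem, with a step size $\eta(t)$ that shrinks as $t$ grows:

```python
# Stochastic gradient descent sketch: pick ONE random data point per step and
# follow its gradient. Fits theta in y ~ theta * x; the data are invented.
import random

def sgd(data, grad_point, eta, steps, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, steps + 1):
        x, y = rng.choice(data)              # one randomly selected data point
        theta -= eta(t) * grad_point(theta, x, y)
    return theta

# gradient of the single-point squared loss (theta*x - y)^2
grad_point = lambda theta, x, y: 2 * (theta * x - y) * x
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]    # perfectly fit by theta = 2
theta = sgd(data, grad_point, eta=lambda t: 0.01 / t ** 0.5, steps=2000)
print(theta)  # noisy path, but ends up close to 2
```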
3.2 Logistic Regression
Common linear classifiers could only divide linearly separable data, and they have no idea of the uncertainty of a prediction

For example, at what temperature would you wear a coat?
Consider a hypothesis with only one feature, Temperature, whose output is the probability of wearing a coat

The graph on the left hand side shows that someone will wear a coat when temperature is under a definite value bound, and will not when above. But the real life experience is that there is a certain range of temperature in which you are uncertain about if you need a coat, just like the graph on the right hand side.
Definitely need a coat when under 10C
Be 70% sure that you need a coat at 13C, but in some very sunny days, you don't
Be 30% sure that you do not need a coat at 17C, but in some very rainy days, you do need
Definitely do not need a coat when above 20C
The shape of the graph on the right is a Sigmoid/Logistic Function, or Linear Logistic Classifier
3.2.1 Logistic Function
- Definition
  $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- Shifting/Scaling
  Add a scaling constant $\theta$ and a shifting constant $\theta_0$ to the variable of the Logistic Function: $\sigma(\theta z + \theta_0)$
  Suppose we replace the scalar input with a multi-feature vector $x$, then we would find something familiar: $\sigma\left(\theta^T x + \theta_0\right)$
  For given values of $\theta$ and $\theta_0$, the graph of the function above would look like the following
Tip: the Logistic Function outputs values in $(0, 1)$, and it is not a classifier by itself
3.2.2 Prediction Threshold
How do we make a prediction decision from the probability output $\sigma\left(\theta^T x + \theta_0\right)$?
- Default Practice
  The output of the Logistic Function is a probability, so...
  predict $+1$ if $\sigma\left(\theta^T x + \theta_0\right) > 0.5$; predict $-1$ if $\sigma\left(\theta^T x + \theta_0\right) \le 0.5$
  The value $0.5$ above is the Prediction Threshold
- Simplification
  $$\sigma\left(\theta^T x + \theta_0\right) > 0.5 \iff \theta^T x + \theta_0 > 0$$
  The last equation should be very familiar to you (see 1.3.5). We could see the default hypothesis for the Logistic Function is exactly the same as that for the Linear Classifier!
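A small sketch tying the threshold rule to the sign rule (parameter values are made up):

```python
# Logistic function plus the default 0.5 prediction threshold; since
# sigma(z) > 0.5 exactly when z > 0, this reduces to sign(theta.x + theta_0).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta, theta_0, threshold=0.5):
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)) + theta_0)
    return 1 if p > threshold else -1

print(sigmoid(0.0))                    # 0.5, the decision boundary
print(predict([2.0], [1.0], -1.0))     # z = 1 > 0  -> +1
print(predict([0.5], [1.0], -1.0))     # z = -0.5 < 0 -> -1
```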
3.2.3 NLL Loss
The loss function we define here is negative log-likelihood (NLL). To be honest this is not a "Loss" Function in the usual sense, but a function that helps us learn a Linear Logistic Classifier
We know the output of the logistic function is in $(0, 1)$, so we can read it as a probability
Assume the data points are completely independent; then we could find the probability that the whole training data is properly labeled by multiplying the probability that each data point is properly labeled
- The probability that a data point is labeled as "$+1$" is denoted as $g^{(i)} = \sigma\left(\theta^T x^{(i)} + \theta_0\right)$
- The probability that data point $i$ is properly labeled is then:
  if $y^{(i)} = 1$, then $g^{(i)}$
  if $y^{(i)} = 0$, then $1 - g^{(i)}$
We could find the probability that the whole training data is properly labeled as the equation below (using labels $y^{(i)} \in \{0, 1\}$)
$$\prod_{i=1}^{n} \left(g^{(i)}\right)^{y^{(i)}} \left(1 - g^{(i)}\right)^{1 - y^{(i)}}$$
Explanation for why we need negative log-likelihood
- Q: Why logarithm?
  A: Calculating long products hurts numerical precision on a computer (many probabilities multiplied together underflow), so we would like to convert our formula from a product to a summation
- Q: Why negative?
  A: Our objective is to maximize the probability that the whole training data is properly labeled, but we prefer to minimize an objective, so we take the negative sign
The Loss of Data (not NLL Loss) in general is defined as below
$$-\sum_{i=1}^{n} \left[ y^{(i)} \log g^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - g^{(i)}\right) \right]$$
Tip: by the logarithm properties, this sum of logs equals the log of the product above
NLL Loss for a single guess $g$ and actual label $y$ is then
$$\mathcal{L}_{nll}(g, y) = -\left( y \log g + (1 - y) \log(1 - g) \right)$$
You can use a logarithm of any base; the default base value is $e$
For convenience, we could reorganize the hypothesis of logistic regression with parameters $\theta, \theta_0$ as $g = \sigma\left(\theta^T x + \theta_0\right)$
(a multi-class version is available in Machine Learning - NN Special Topics 2.)
3.2.4 Overfitting & Regularizer
First of all, a claim: on a linearly separable data set, minimizing the NLL loss will push the magnitude of $\theta$ to grow without bound
This is a claim, not an assumption
Assume we want to train a linear logistic classifier for the training data below

Run the pseudo-code in 3.1.3 with the NLL Loss function defined in 3.2.3:
Gradient_Descent with $f = \mathcal{L}_{nll}$
Then we need to ask: how far should we go?
The whole idea of NLL Loss is to increase the probability that the whole data set is properly labeled
We can increase this probability by making the sigmoid graph steeper, i.e. by scaling up the magnitude of $\theta$ (see 3.2.3)
So ideally, we will always end up with the graph at Step 114514 in the figure below

We can plot the objective function and see what's happening (varying the magnitude of $\theta$ while keeping its direction fixed)

The value of the NLL Loss will keep going down as $\|\theta\|$ grows
Even beyond some value of $\|\theta\|$, the loss still decreases, but only negligibly; the classifier is just becoming more and more certain
But being overly certain is pointless, we don't need it
The solution is to add a Regularizer / Penalty term $\lambda\|\theta\|^2$ to the objective
$\lambda\|\theta\|^2$ is a quadratic function... and a quadratic function is always convex
You could see its effect as below

3.2.5 Gradient Descent for Logistic Regression
We could set up our Logistic Regression Learning Algorithm based on the pseudo-code of gradient descent in 3.1.3
- Gradients
  We need to update all the parameters ($\theta$ and $\theta_0$) with their respective gradients
  Tip: the default logarithm base is $e$, so that $\frac{d}{dz}\ln z = \frac{1}{z}$
- Derivatives
  The key fact is the derivative of the sigmoid: $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$
  - Gradient of the objective with respect to $\theta$ (with labels $y^{(i)} \in \{0, 1\}$ as in 3.2.3): reorganize the equation, then we have
    $$\nabla_\theta J = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma\left(\theta^T x^{(i)} + \theta_0\right) - y^{(i)}\right)x^{(i)} + 2\lambda\theta$$
  - Gradient of the objective with respect to $\theta_0$: reorganize the equation, then we have
    $$\frac{\partial J}{\partial \theta_0} = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma\left(\theta^T x^{(i)} + \theta_0\right) - y^{(i)}\right)$$
- Algorithm
  Here's the pseudo-code of gradient descent for logistic regression
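The learning loop can also be sketched directly; this toy version uses one feature, labels in {0, 1} as in 3.2.3, and omits the regularizer for brevity (all data values are invented):

```python
# Gradient descent for logistic regression with NLL loss, one feature,
# labels in {0, 1}; gradients follow the formulas above with lambda = 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gd(data, eta, iters):
    theta, theta_0 = 0.0, 0.0
    n = len(data)
    for _ in range(iters):
        g_theta = sum((sigmoid(theta * x + theta_0) - y) * x for x, y in data) / n
        g_theta_0 = sum(sigmoid(theta * x + theta_0) - y for x, y in data) / n
        theta -= eta * g_theta
        theta_0 -= eta * g_theta_0
    return theta, theta_0

# y = 1 for negative x, y = 0 for positive x (made-up, symmetric data)
data = [(-2.0, 1), (-1.0, 1), (1.0, 0), (2.0, 0)]
theta, theta_0 = logistic_gd(data, eta=0.5, iters=2000)
print(theta)  # negative: the probability of y = 1 decreases as x grows
```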

3.3 Linear Regression
3.3.1 Definition
- Hypothesis
  The Hypothesis of Linear Regression is a Hyperplane: $h(x; \theta, \theta_0) = \theta^T x + \theta_0$
  But we are just trying to fit the data, without making any conclusions from this hyperplane like the classification we did previously
- Training Loss
  The loss function we use for linear regression is Squared Loss
  $$L(g, a) = (g - a)^2$$
  While the optimization (minimization) objective uses Mean Squared Loss
  $$J(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2$$
  Notice that, using the same trick as in 1.6.5, the offset $\theta_0$ could be merged into $\theta$ (with a constant feature 1 appended to each $x^{(i)}$)
  Then we could stack the features and labels as matrices $\tilde{X}$ (one row per data point) and $\tilde{Y}$
  With the above, rewrite the optimization objective in matrix form as below
  $$J(\theta) = \frac{1}{n}\left(\tilde{X}\theta - \tilde{Y}\right)^T\left(\tilde{X}\theta - \tilde{Y}\right)$$
3.3.2 Direct Solution
A function is uniquely minimized at a point if the Gradient at that point is 0 and the function "curves up"
The Gradient of the optimization objective in 3.3.1 is the following matrix derivative with respect to $\theta$
$$\nabla_\theta J = \frac{2}{n}\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right)$$
By setting this gradient to 0, we get the direct solution
$$\theta = \left(\tilde{X}^T\tilde{X}\right)^{-1}\tilde{X}^T\tilde{Y}$$
[IMPORTANT] This direct solution only exists when $\tilde{X}^T\tilde{X}$ is invertible
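A sketch of the direct solution on a made-up data set that is fit exactly by $y = 2x + 1$:

```python
# Direct solution theta = (X~^T X~)^(-1) X~^T Y~, with the constant-1
# column appended to absorb the offset. Data are invented.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([[3.0], [5.0], [7.0], [9.0]])          # exactly y = 2x + 1

X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append the constant feature 1
theta = np.linalg.inv(X_tilde.T @ X_tilde) @ X_tilde.T @ Y
print(theta.ravel())  # slope 2 and offset 1, up to floating-point error
```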
3.3.3 Demo & Possible Errors
This is the regular situation

The following are possible errors you might encounter in real-life practice
-
No Unique Best Hyperplane
Very simple, the shape of your data happens to be awful...

-
Noisy Data
Maybe you could find a solution, but it's just because of the noise of data like the uncertainty of thermometer...
It looks like you have found the solution, but no, you shouldn't accept it!

-
-
Feature Encoding Problems
-
Correlated Real-life features
Ex. Doctors and nurses might have the same probability of exposure to some diseases, but they are categorized with one-hot encoding...
-
Too Many Features
If the training data has only one data point, with only one feature, then you could have infinite hyperplanes as solutions, this is very very very bad!

Generally speaking, your training data should have more data points than features to allow for a unique hyperplane as the solution... but sometimes the feature dimension is just too high compared to the number of data points...
-
-
Strategy
When we are unsure which hyperplane to choose, just choose the one with $\theta$ near 0, unless there's a strong reason not to
The way to implement this strategy is Ridge Regression (see 3.3.4)
3.3.4 Ridge Regression
This is for the problem that sometimes there isn't a best Hyperplane as our solution in 3.3.3 (and also for overfitting)
Under these situations, $\tilde{X}^T\tilde{X}$ is not invertible (or barely so), and the direct solution breaks down
The solution is to add a Regularizer / Penalty $\lambda\|\theta\|^2$ to the objective
$$J_{ridge}(\theta) = \frac{1}{n}\left(\tilde{X}\theta - \tilde{Y}\right)^T\left(\tilde{X}\theta - \tilde{Y}\right) + \lambda\|\theta\|^2$$
The Gradient of the optimization objective of Ridge Regression is
$$\nabla_\theta J_{ridge} = \frac{2}{n}\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right) + 2\lambda\theta$$
Minimizing by setting this gradient to 0 yields
$$\theta = \left(\tilde{X}^T\tilde{X} + n\lambda I\right)^{-1}\tilde{X}^T\tilde{Y}$$
where $I$ is the identity matrix and $\lambda > 0$
This is called Ridge Regression because we are adding a "Ridge" of $n\lambda$ along the diagonal of $\tilde{X}^T\tilde{X}$
[IMPORTANT] $\tilde{X}^T\tilde{X} + n\lambda I$ is ALWAYS invertible when $\lambda > 0$. When using Ridge Regression, we assume features are standardized as in 2.1.2
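A sketch of the ridge solution; the example deliberately has more features than data points, where plain least squares would fail (the numbers are made up):

```python
# Ridge regression: theta = (X~^T X~ + n*lambda*I)^(-1) X~^T Y~.
# The added "ridge" keeps the matrix invertible even when X~^T X~ is singular.
import numpy as np

def ridge(X_tilde, Y, lam):
    n, d = X_tilde.shape
    return np.linalg.inv(X_tilde.T @ X_tilde + n * lam * np.eye(d)) @ X_tilde.T @ Y

# one data point, two columns: X~^T X~ is singular, but ridge still answers
X_tilde = np.array([[1.0, 1.0]])
Y = np.array([[2.0]])
theta = ridge(X_tilde, Y, lam=0.1)
print(theta.ravel())  # a small-norm solution, spread evenly over the two features
```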
3.3.5 Gradient Descent Solution
You may ask: "Why do we still need gradient descent when we could derive a closed-form solution?"
Here's why: Matrix Inversion has a time complexity of roughly $O(d^3)$, which becomes expensive when the number of features is large
This Gradient Descent Solution is based on Ridge Regression
- Gradient of the objective with respect to $\theta$: $\dfrac{2}{n}\tilde{X}^T\left(\tilde{X}\theta - \tilde{Y}\right) + 2\lambda\theta$
- Gradient of the objective with respect to $\theta_0$ (if kept separate): $\dfrac{2}{n}\displaystyle\sum_{i=1}^{n}\left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)$
Here's the pseudo-code for Gradient Descent for Linear Regression (using Ridge Regression)
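The gradients above can be dropped into the gradient descent loop; this one-feature sketch uses made-up data and $\lambda = 0$, so the answer matches the direct solution:

```python
# Gradient descent for (ridge) linear regression, one feature:
# grad_theta   = (2/n) * sum((theta*x + theta_0 - y) * x) + 2*lam*theta
# grad_theta_0 = (2/n) * sum(theta*x + theta_0 - y)

def linreg_gd(data, lam, eta, iters):
    theta, theta_0 = 0.0, 0.0
    n = len(data)
    for _ in range(iters):
        err = [theta * x + theta_0 - y for x, y in data]
        g_theta = 2 * sum(e * x for e, (x, _) in zip(err, data)) / n + 2 * lam * theta
        g_theta_0 = 2 * sum(err) / n
        theta -= eta * g_theta
        theta_0 -= eta * g_theta_0
    return theta, theta_0

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]     # exactly y = 2x + 1 (invented)
theta, theta_0 = linreg_gd(data, lam=0.0, eta=0.1, iters=5000)
print(round(theta, 3), round(theta_0, 3))  # converges to the direct solution (2, 1)
```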
