ML From Scratch, Part 1: Linear Regression (oranlooney.com)
301 points by olooney on Oct 19, 2019 | 38 comments


I tried to do it from scratch AND hands-on in PyTorch:

https://colab.research.google.com/github/stared/thinking-in-...

By all means, it is much more "from scratch".

It is a part of open-source "Thinking in Tensors, Writing in PyTorch": https://github.com/stared/thinking-in-tensors-writing-in-pyt...


I tried the popular Machine Learning course on Coursera, and found myself spending tons of time debugging Octave code because to me it was such an unfamiliar way of expressing things. It would be nice if a widely accepted alternative existed in a more popular programming language.


It does exist! There are lots of answers to what you’re asking for. But here’s the best answer:

https://course.fast.ai/

This course uses Python and PyTorch and will have you building something that can tell 30+ breeds of cats and dogs apart in the first lesson. While the old course from Andrew Ng you mentioned starts with theory and only gets to doing something useful by the end, the fastai course starts with best practices and moves into the underlying theory later.

Also, Andrew Ng has a newer deeplearning.ai course on Coursera that is all Python and doesn't use Octave.


Thanks.


You can see my post here:

https://news.ycombinator.com/item?id=21301117

I derive the normal equations in just two lines of matrix algebra, never use calculus, and assume nothing about probability.

The normal equations are plenty simple to program in nearly any programming language. For the linear algebra, just use LINPACK or, for a really simple approach, write your own Gaussian elimination linear equation solver (a rough sketch follows below).
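For concreteness, a rough sketch of that in plain Python (just an illustration, with a toy data set): build the normal equations (A^T A) b = A^T y and solve them with a small hand-rolled Gaussian elimination routine.

    def gauss_solve(M, rhs):
        """Solve M x = rhs by Gaussian elimination with partial pivoting."""
        n = len(M)
        aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]  # augmented matrix [M | rhs]
        for col in range(n):
            # Pivot: swap in the row with the largest entry in this column.
            piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
            aug[col], aug[piv] = aug[piv], aug[col]
            # Eliminate this column from the rows below.
            for r in range(col + 1, n):
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
        # Back substitution.
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
        return x

    def least_squares(A, y):
        """Return b minimizing ||Ab - y||^2 via the normal equations (A^T A) b = A^T y."""
        m, n = len(A), len(A[0])
        AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
        Aty = [sum(A[k][i] * y[k] for k in range(m)) for i in range(n)]
        return gauss_solve(AtA, Aty)

    # Toy check: recover the line y = 2x + 1 from noiseless data.
    A = [[float(x), 1.0] for x in range(10)]
    y = [2.0 * x + 1.0 for x in range(10)]
    print(least_squares(A, y))  # approximately [2.0, 1.0]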


You can find detailed python solutions for the Coursera course homework online, for example: https://github.com/suraggupta/coursera-machine-learning-solu...


I hate myself in advance for a reply like this, but I just scrolled randomly towards the loss function part of that document and it reads

"You can see also expected value notation, were E[x] means an average of x, that is ∑N−1i=0xi/N."

I think this is not what E[x] means, i.e. going from expectation to the average (using the sum over N−1, no less) involves a bit more than notation (in this case, assumptions about the DGP). Yes, I also hate posts like this, but in the interest of this being a course on mathematics in DL, I wanted to let you know.


Well, in this context, they are the same, and I don't see room for confusion.

(I could add a note about "uniform measure" and "arithmetic average", but for people learning this it would be too much, and for people who already know it the context is obvious anyway.)
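Spelled out: for a variable X taking values x_0, ..., x_{N−1} with probabilities p_i, the expectation is the probability-weighted sum, and it reduces to the arithmetic mean exactly when the measure is uniform:

    E[X] = ∑_{i=0}^{N−1} x_i p_i,   and with p_i = 1/N this becomes   E[X] = (1/N) ∑_{i=0}^{N−1} x_i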


It says it's from scratch, but it's using a lot of words and math symbols I don't understand when describing Linear Regression.


If you want a gentler introduction, I recommend the first module of Andrew Ng’s Machine Learning class on Coursera. He also explains things intuitively before jumping into the math. Even when he uses math, he makes sure you understand the notation and what each equation means.


Then you might prefer my

https://news.ycombinator.com/item?id=21301117

here.

I carefully define all symbols. I keep the math really simple; the only prerequisite is matrix multiplication; you don't need the matrix inverse. For the matrix transpose, I define that. For the squared length of a vector, I define that, too. I don't use calculus, derivatives, or probability. I derive the normal equations in just two lines of matrix algebra.


Maybe, in the mind of the author, the target audience is mathematicians or physicists.


At first I thought it was about implementing an ML-like language. Then the realisation hit pretty fast.


This always confuses me; I wish people wrote "Machine Learning" out in their titles.


Because linear regression is a common language implementation technique...


Understood everything up until this:

        ∇_Θ J = ∇_Θ (y − XΘ)^T (y − XΘ)
expanded into

        ∇_Θ J = ∇_Θ y^T y − (XΘ)^T y − y^T XΘ + Θ^T (X^T X) Θ

Can someone explain why ∇_Θ is only applied to the first term? Also, I have never seen gradient notation used like that before.


Looks like the equation in the article is missing enclosing parentheses. Read the following paragraph.


You don't need calculus or that notation. See my

https://news.ycombinator.com/item?id=21301117

in this thread. I keep the math and notation really simple. The only prerequisite is matrix multiplication. I use no calculus or probability. I derive the normal equations in just two simple lines of matrix algebra.

If you want software, there is lots of it, or you can write your own easily enough: for the matrix work, you might use LINPACK or just write a little Gaussian elimination routine yourself.


> Can someone explain why ∇Θ is only applied to the first term?

It's not, there's an implied parenthesis:

(∇_Θ) J = (∇_Θ) ( y^T y − (XΘ)^T y − y^T XΘ + Θ^T (X^T X) Θ ).

In the subsequent expressions you can see derivatives being taken of all the terms.
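Spelling the term-by-term gradients out with standard matrix-calculus identities (a summary, not a quote from the article):

    ∇_Θ (y^T y)          = 0
    ∇_Θ ((XΘ)^T y)       = X^T y
    ∇_Θ (y^T XΘ)         = X^T y
    ∇_Θ (Θ^T (X^T X) Θ)  = 2 (X^T X) Θ      [since X^T X is symmetric]

so ∇_Θ J = 2 (X^T X) Θ − 2 X^T y, and setting this to 0 gives the normal equations (X^T X) Θ = X^T y.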


Well, personally I prefer a different kind of notation. That said, there are several ways to get the same solution. In the following, an apostrophe, ', means transpose, and two symbols next to each other mean multiplication, so Ax = A (times) x. I can't figure out how to stop italics on this forum using a star.

1. Expansion. Let J(x) = norm(Ax-b)^2. Then,

J(x) = norm(Ax-b)^2 = (Ax-b)'(Ax-b) = (x'A'-b')(Ax-b) = x'A'Ax - b'Ax -x'A'b + b'b = x'A'Ax - 2b'Ax + b'b

Then,

grad J(x) = 2A'Ax-2A'b

Except, how did we know that grad(x'A'Ax) = 2A'Ax or grad(2b'Ax) = 2A'b? Well, there are at least two ways to derive it. One, go back to the limit definition. This is tedious. Two, cheat and use a Taylor series. Now, this assumes that we believe Taylor's theorem, and it's worthwhile to go through a proof someday. Today's not that day. Taylor's theorem asserts that, under the proper conditions:

J(x+dx) = J(x) + J'(x)dx + (1/2)J''(x)dxdx + R(x)

where J'(x) denotes the total derivative of J and R(x) is a remainder term. The total derivative is a linear operator: if J:X->Y, then J'(x)\in L(X,Y), where L(X,Y) denotes the space of linear operators with domain X and codomain Y. This means that J''(x)\in L(X,L(X,Y)), since we can play the same trick again, and hence that J''(x)dxdx \in Y. Anyway, why do we care? If we can write something that looks like a Taylor series, then we can read off the derivatives. For example, for K(x) = x'Bx, we have that:

K(x+dx) = (x+dx)'B(x+dx) = x'Bx + dx'Bx + x'Bdx + dx'Bdx = x'Bx + 2x'Bdx + dx'Bdx = K(x) + K'(x)dx + (1/2)K''(x)dxdx

(Combining dx'Bx and x'Bdx into 2x'Bdx uses dx'Bx = x'B'dx, so strictly this step needs B symmetric, which it is in our application since B = A'A.)

Hence, if we just match the terms:

K(x) = x'Bx, K'(x)dx = 2x'Bdx, and (1/2)K''(x)dxdx = dx'Bdx, or K''(x)dxdx = 2dx'Bdx

Great, except grad K(x) = 2B'x and not 2x'B. Why the transpose? Well, it's because K'(x) and grad K(x) are not the same thing. Recall, from above, that if K:X->R, where R means the real numbers, then K'(x)\in L(X,R). To be clear, this is a function that maps X to R. Say X was something like an m-dimensional vector space, R(m). Then K'(x) : R(m)->R. If we wanted to represent this as a matrix, the matrix would have size R(1,m). This is a row vector. Normally, the gradient is a column vector. Why? Well, most of the time, people view the gradient as a matrix of partial derivatives and they order them in a magical way.

To me, a better way to visualize it is as what's called the Riesz representative of the total derivative. Basically, there's a theorem that states that, under certain conditions, any linear functional can be represented as the inner product between an element of that space and the argument. That is, if f is a linear functional that maps to the real numbers, then, under the right conditions, we can write it as f(x) = <a,x>, where <.,.> denotes the inner product. In R(m), this inner product is just the transpose, so any linear functional on R(m) can be written as f(x) = a'x. Going back to the above, K'(x)dx = 2x'Bdx = (2B'x)'dx. This means that the linear functional C(dx) = K'(x)dx can be written as C(dx) = K'(x)dx = (2B'x)'dx. That element, 2B'x, is the gradient, by definition. It's the Riesz representative of the total derivative. And, again, you can define the gradient differently. I wouldn't, since this definition is clean and generalizes to Hilbert spaces, which are complete inner product spaces, which means it extends, with some restrictions, to infinite dimensions and function spaces.

OK, that was a lot, but then we can realize that we can just apply the Taylor series trick to the whole expansion:

J(x+dx) = norm(A(x+dx)-b)^2 = norm(Ax-b+Adx)^2 = norm(Ax-b)^2 + 2(Ax-b)'Adx + dx'A'Adx = J(x) + J'(x)dx + (1/2)J''(x)dxdx

After matching, we have that

J'(x)dx = 2(Ax-b)'Adx

or that

grad J(x) = 2A'(Ax-b) = 2A'Ax - 2A'b

2. But, wait, there's more. We could just use the chain rule. Recall, (f o g)'(x) = f'(g(x))g'(x), where o denotes function composition. In the above case, f(x) = norm(x)^2 and g(x) = Ax - b. Also playing Taylor series games, we'd eventually find out that f'(x)dx = 2x'I dx = 2x'dx, where I is the identity, and g'(x) = A. This means that

f'(g(x))g'(x) = 2(Ax-b)'A

which means the gradient is:

2A'(Ax-b)

3. If you really don't like this, you could pretend to be a physicist and do everything component wise with 1-D derivatives. That was a joke. This will produce the correct result, but one that I find particularly painful to derive. As such, I will punt and avoid that here.

Anyway, all teasing aside, there are lots of different ways to get the same result. Really, just use what works well for you and potentially adapt if you're working with an audience that views things differently. I don't claim what I wrote above is the best notation or derivations for you, but they work well for me and I wanted to provide some variety on a Saturday afternoon.
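For a quick sanity check of the result grad J(x) = 2A'(Ax - b), one can compare against central finite differences; a small sketch assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 4))
    b = rng.normal(size=50)
    x = rng.normal(size=4)

    def J(x):
        r = A @ x - b
        return r @ r                       # norm(Ax - b)^2

    analytic = 2 * A.T @ (A @ x - b)       # the gradient derived above

    # Central finite differences, one component at a time.
    eps = 1e-6
    numeric = np.array([(J(x + eps * e) - J(x - eps * e)) / (2 * eps) for e in np.eye(4)])

    print(np.max(np.abs(analytic - numeric)))   # tiny, on the order of 1e-8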


Yes, using the chain rule for the gradient of norm(Ax-b)^2 makes that derivation a lot shorter. Or just citing a theorem for the gradient of the L2 norm would have been fine. I used the FOIL approach because each of the 4 terms can be tackled using very elementary theorems in matrix calculus: a constant term, two linear terms, and a quadratic form. But in retrospect this was probably a mistake, seeing as several people have gotten hung up on it.


I could balance equations in chemistry and do some algebra and trigonometry at GCSE level. But this blows my mind. Higher tier GCSE Maths. I should have paid more attention. I think I know what regression is in simplistic terms: it's like taking a rolling graph and predicting the next point on the x/y axis, using the previous data as a training model. I hope.


You’re exactly right. It’s fitting a higher-dimensional line or curve to the data.
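A tiny sketch of the one-variable case, assuming NumPy, if a concrete picture helps: fit a straight line to noisy points and use it to guess the next one.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.arange(20.0)
    y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)   # noisy points near y = 3x + 5

    m, c = np.polyfit(x, y, deg=1)   # least-squares line fit
    print(m, c)                      # roughly 3 and 5
    print(m * 20 + c)                # prediction at the next x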


I found the notation here non-standard and confusing.

In my opinion, the Wikipedia article is more concise and provides more intuition (especially the multiple ways to derive the closed-form solution).

https://en.wikipedia.org/wiki/Ordinary_least_squares#Alterna...


The Wikipedia article has lots of nice material, but much easier than that or the OP here is my

https://news.ycombinator.com/item?id=21301117

also in this thread.


For a computer science PhD or master's candidate, I think the article, and the site overall, is very good.


For any student, middle school or later, who knows matrix multiplication but no calculus or probability, just use my

https://news.ycombinator.com/item?id=21301117

in this thread. It's really simple!


one of the best ‘from scratch’ articles I've read on LR


=== Introduction

Here I give some notes for a shorter, simpler, more general derivation of the basics of regression analysis, with no calculus or derivatives.

=== The Given Data

Let R be the set of real numbers (we can use the set of rational numbers instead if we wish).

We are given positive integers m and n.

In our typing, character '^' starts a superscript; character '_' starts a subscript.

We regard R^m as the set of all m x 1 matrices (column vectors); that is, each point in R^m is a matrix with m rows and 1 column with its components in set R.

We are given m x 1 y in R^m.

We are given m x n matrix A with components in R.

We will need to know what matrix transpose is: For an intuitive view, the transpose of A is A^T where each column of A becomes a row of A^T. Or to be explicit, for i = 1 to m and j = 1 to n, the component in row i and column j of A becomes the component in row j and column i of A^T.

We notice that each column of A is m x 1 and, thus, a point in R^m.

Suppose b is n x 1 with components in R.

Then the matrix product Ab is m x 1 in R^m and is a linear combination (with coefficients from b) of the columns of A.

=== Real World

So we have data A and y. To get this data, maybe we got some medical record data on m = 100,000 people.

For each person we got data on age, income, years of schooling, height, weight, birth gender (0 for male, 1 for female), years of smoking, blood sugar level, blood cholesterol level, race (1 to 5 from white, black, Hispanic, Asian, other), that is n = 10 variables. We get to select these variables and call them independent.

For person i = 1 to m = 100,000, we put the value of variable j = 1 to n = 10 in row i and column j of matrix A.

Or matrix A has one row for each person and one column for each variable.

Vector y has systolic blood pressure: For i = 1 to m = 100,000, row i of y, that is, y_i, has the systolic blood pressure of person i. We regard systolic blood pressure in y as the dependent variable.

What we want to do is use our data to get a mathematical function that for any person, in the m = 100,000 or not, takes in the values of the independent variables, 1 x n v, and returns the corresponding value of the dependent variable 1 x 1 z. To this end we want n x 1 b so that we have

z = vb

So for any person, we take their independent variable values v, apply coefficients b, and get their dependent variable value z.

We seek the b we will use on all people.

=== The Normal Equations

So, we have our given m x n A and m x 1 y, and we seek our n x 1 b.

We let

S = { Au | n x 1 u }

That is, S is the set of all the linear combinations of the columns of A, a hyperplane in R^m, a generalization of a plane in three dimensions or a line in two dimensions, a vector subspace of R^m, the vector subspace of R^m spanned by the columns of A.

We seek w in S to minimize the squared length

||y - w||^2

= \sum_{i = 1}^m (y_i - w_i)^2

(notation from D. Knuth's math typesetting software TeX), that is the sum (capital letter Greek sigma) from i = 1 to m of

(y_i - w_i)^2

where y_i is component i of m x 1 y in R^m and similarly for w_i.

That is, we seek the w in the hyperplane S that is closest to our dependent variable value y.

Well, from some simple geometry, the vector

y - w

has to be perpendicular to the hyperplane S.

Also from the geometry, w is unique, the only point in S that minimizes the squared length

||y - w||^2

For the geometry, there is one classic theorem -- in W. Rudin, Real and Complex Analysis -- that will mostly do: In quite general situations, more general than R^m, every non-empty, closed, convex set has a unique element of minimum norm (length).

Now that we have the definitions and tools we need, we derive the normal equations in just two lines of matrix algebra:

Since (y - w) is perpendicular to each column of A, we have that

A^T (y - w) = 0

where the right side is an n x 1 matrix of 0s.

Since w is in S, we can write w = Ab for some n x 1 b. Then

A^T y = A^T Ab

or with the usual writing order

(A^T A)b = A^T y

the normal equations where we have A and y and solve for b.
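As a concrete sketch of the recipe, assuming NumPy and made-up data in place of the medical records:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 1000, 10                       # stand-ins for the 100,000 people and 10 variables
    A = rng.normal(size=(m, n))           # one row per person, one column per variable
    true_b = rng.normal(size=n)
    y = A @ true_b + rng.normal(scale=0.1, size=m)   # the dependent variable, with a little noise

    b = np.linalg.solve(A.T @ A, A.T @ y)   # solve the normal equations (A^T A) b = A^T y
    w = A @ b                               # the point of S closest to y

    print(np.max(np.abs(b - true_b)))       # small
    print(np.max(np.abs(A.T @ (y - w))))    # ~0: (y - w) is perpendicular to the columns of A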

=== Results

Since w is in S, the equations do have a solution. Moreover, from the geometry, w is unique.

If the n x n square matrix (A^T A) has an inverse, then the solution b is also unique.

Note: Vector w DOES exist and is unique; b DOES exist; if the inverse of (A^T A) exists, then b is unique; otherwise b STILL exists but is not unique. STILL w is unique. How 'bout that!

Since w is in S and since (y - w) is perpendicular to all the vectors in S, we have that

(y - w)^T w = 0

y^T y = (y - w + w)^T (y - w + w)

= (y - w)^T (y - w) + (y - w)^T w + w^T (y - w) + w^T w

= (y - w)^T (y - w) + w^T w

or

[total sum of squares] = [error sum of squares] + [regression sum of squares]

or the Pythagorean theorem.
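That split is easy to check numerically; a self-contained sketch assuming NumPy, with w taken as the least-squares projection of y onto the column space of A:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(200, 5))
    y = rng.normal(size=200)

    b, *_ = np.linalg.lstsq(A, y, rcond=None)   # a least-squares b
    w = A @ b                                   # projection of y onto S

    # y'y = (y - w)'(y - w) + w'w
    print(np.isclose(y @ y, (y - w) @ (y - w) + w @ w))   # True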

So, now that we have found b, for any v, we can get our desired z from vb.

The b we found works best in the least squares sense for the m = 100,000 data we had. If the 100,000 are an appropriate sample of a billion, and if our b works well on our sample, then maybe b will work well on the billion. Ah, maybe could prove some theorems here, theorems with meager or no more assumptions than we have made so far!

"Look, Ma, no calculus! And no mention of the Gaussian distribution. No mention of maximum likelihood.

Except for regarding the m = 100,000 observations as a sample from a billion, we have mentioned no probability concepts. We never asked for a matrix inverse. And we never mentioned multicollinearity."

And, maybe the inverse of n x n (A^T A) does exist but is "numerically unstable". Sooooo, we begin to suspect: The b we get may be inaccurate but the w may still be fine and our

z = vb

may also still be accurate. Might want to look into this.

For a start, sure, the matrix (A^T A) being numerically unstable is essentially the same as the ratio of the largest and smallest (nonnegative; they are all nonnegative) eigenvalues being large. That is, roughly, the problem is that the small eigenvalues are, in the scale of the problem, close to 0. Even if they are 0, we have shown that our w is still unique.
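A small sketch of that last point, assuming NumPy: duplicate a column of A so that (A^T A) has a zero eigenvalue; then b is not unique, but the fitted w = Ab still is.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 200
    x1 = rng.normal(size=m)
    A = np.column_stack([x1, x1, np.ones(m)])    # two identical columns: (A^T A) is singular
    y = 3.0 * x1 + 1.0 + rng.normal(scale=0.1, size=m)

    print(np.linalg.eigvalsh(A.T @ A))           # smallest eigenvalue is (numerically) zero

    b, *_ = np.linalg.lstsq(A, y, rcond=None)    # picks one of the many valid b's
    w = A @ b

    # Shift weight between the two identical columns: a different b, the same w.
    b_alt = np.array([b[0] + 1.0, b[1] - 1.0, b[2]])
    print(np.allclose(A @ b_alt, w))             # True: w is unique even though b is not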


Hardly 'from scratch'; I was expecting an ELI5 article.


The submitted title, "Linear Regression from Scratch", was misleading. We've replaced it with the title on the page.

From-scratchness is relative I guess.


First you need to create a universe...


Easy Biscuits from Scratch:

Step 1) Plant a crop of wheat


Could you explain it like you're explaining to a high school senior?


Yeah, I wonder what the intended audience is. I would guess that if someone is familiar with the notation and concepts introduced, they would also be quite familiar with linear regression. Nice article nevertheless.


The LaTeX was LaToast, which is LaShame.


Are you blocking one of the JS libs that the page uses? The notation was screwed up for me until I temporarily unblocked them.


Ah.



