I. Starting with the story of linear regression

We have all heard the name of the famous naturalist Charles Darwin. The hero of today’s story is his cousin, Francis Galton.

Galton was a physiologist. In the late 19th century, he studied the heights of 1078 pairs of fathers and sons and found that they roughly satisfy a formula:

Y=0.8567+0.516*x

In this formula, x is the father’s height and Y is the son’s height. This is clearly the linear equation we learned in high school: plotted on a plane, it is a straight line.

From this formula, we might intuitively conclude that tall fathers always have tall sons and short fathers always have short sons. But Galton observed further that this is not the whole picture. The sons of exceptionally tall fathers tend to be somewhat shorter than their fathers, and the sons of exceptionally short fathers tend to be somewhat taller than theirs; heights do not keep getting taller or shorter from one generation to the next. This phenomenon is, in fact, regression: a trend does not continue forever but returns toward a center.
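As a rough illustration of this return toward a center, the sketch below plugs a short and a tall father into the formula above. It assumes heights are measured in meters (the text does not state the unit), and the names `predict_son_height` and `center` are made up for this example.

```python
# Galton's fitted line from the text; heights are ASSUMED to be in meters.
INTERCEPT = 0.8567
SLOPE = 0.516


def predict_son_height(father_height):
    """Return the son's height predicted by Y = 0.8567 + 0.516 * x."""
    return INTERCEPT + SLOPE * father_height


# Height at which the predicted son is exactly as tall as his father:
# x = INTERCEPT + SLOPE * x  =>  x = INTERCEPT / (1 - SLOPE)
center = INTERCEPT / (1 - SLOPE)

for father in (1.60, 1.90):
    son = predict_son_height(father)
    relation = "taller" if son > father else "shorter"
    print(f"father {father:.2f} m -> son {son:.4f} m ({relation} than his father)")
print(f"center of regression: about {center:.2f} m")
```

Because the slope is less than 1, fathers above the center get predicted sons shorter than themselves, and fathers below it get predicted sons taller than themselves.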

With this story in mind, you should have a rough idea of what linear regression is. Next, we look at the details.

II. Understanding Linear Regression

Here is the question: suppose we have a number of points, which are our existing data. We want to find the line that fits these points and then use it to predict the next point. How do we find this line?

h(x) = a0 + a1 * x (a1 is the slope, a0 is the intercept)

Put another way: how do we find a1 and a0?

Cost Function

Readers new to linear regression may not know what a cost function is. Whenever you meet an unfamiliar concept, think about two things: what it is and what it is used for. Once these two points are clear, you will at least not be confused.

What is the cost function?

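Let us first draw two arbitrary lines to fit those points, as shown in the figures. The line in figure 2 clearly fits better, that is, it is closer to our ideal line.

OK, look more closely: how does the line in figure 2 differ from the line in figure 1? Most obviously, in figure 1 the points are farther from the line when measured along the y axis, while in figure 2 the points are closer to the line.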

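The cost function is the average, over all points, of the squared error of each point, where the error is the distance from the point to the line measured along the y axis. The formula is as follows:

J(a0, a1) = (1/n) * Σ (pred(i) - y(i))^2, summed over i = 1..n

Here n is the number of points, pred(i) is the y-value of the line at the ith point, and y(i) is the y-value of the ith point itself. The difference is squared mainly so that negative errors do not cancel positive ones. This is the cost function.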
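The description above can be sketched in a few lines of Python. The cost here is taken to be the mean of the squared vertical errors; the sample points and the two candidate lines are invented for illustration and do not come from the article’s figures.

```python
def predict(a0, a1, x):
    """Value of the line h(x) = a0 + a1 * x."""
    return a0 + a1 * x


def cost(a0, a1, xs, ys):
    """Mean of the squared vertical errors between the line and the points."""
    n = len(xs)
    return sum((predict(a0, a1, x) - y) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical sample points lying roughly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

cost_bad = cost(0.0, 1.0, xs, ys)    # a line far from the points
cost_good = cost(1.0, 2.0, xs, ys)   # a line close to the points
print(cost_bad, cost_good)
```

The line that passes closer to the points gets the smaller cost, which is exactly the property we want to exploit when searching for a0 and a1.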

What is the cost function used for?

The cost function helps us figure out the best possible values for a0 and a1. As mentioned earlier, the cost function is the average of the squared distances, measured along the y axis, from each point to the straight line. Our goal is to minimize this value. In general, the cost function is convex, as shown in the following figure.

Does this function look familiar? When learning derivatives, you often see graphs like this, and the minimum is usually found by differentiation.

We started from the line y = a0 + a1 * x. Having written down the cost function, our goal is still the same: find a0 and a1 so that the line is as close to the points as possible (that is, minimize the cost function). We have not yet said how to minimize it, so let us turn to that now.

Gradient Descent

What is gradient descent?

Gradient descent is an iterative method that repeatedly updates a0 and a1 to reduce the cost function. By differentiating the cost function, we can see whether a0 or a1 should be increased or decreased.

The partial derivatives of the cost function with respect to a0 and a1 tell us the direction in which each parameter should move.

The update rule is: new a0 = a0 - α * (partial derivative of the cost function with respect to a0), and likewise for a1. The derivative tells us whether a0 and a1 should increase or decrease, and the rate α (alpha) controls by how much.
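For reference, the derivatives and the update step can be written out as follows. This assumes the cost is the mean squared error with pred(i) = a0 + a1 * x(i); the symbols J, n, x_i and y_i are notation chosen for this sketch.

```latex
\begin{aligned}
J(a_0, a_1) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(a_0 + a_1 x_i - y_i\bigr)^2 \\
\frac{\partial J}{\partial a_0} &= \frac{2}{n}\sum_{i=1}^{n}\bigl(a_0 + a_1 x_i - y_i\bigr) \\
\frac{\partial J}{\partial a_1} &= \frac{2}{n}\sum_{i=1}^{n}\bigl(a_0 + a_1 x_i - y_i\bigr)\,x_i \\
a_0 &\leftarrow a_0 - \alpha\,\frac{\partial J}{\partial a_0}, \qquad
a_1 \leftarrow a_1 - \alpha\,\frac{\partial J}{\partial a_1}
\end{aligned}
```

Both parameters are updated together in each iteration, using derivatives computed from the current a0 and a1.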

For example, suppose you are standing halfway up a hill and want to get down. The partial derivative of the cost function tells you which direction is downhill, and the rate α controls how big each step is.

Small steps (small α) mean walking down slowly; the disadvantage is that it takes a long time. Large steps (large α) are faster, but you may step too far and jump across the valley onto the opposite slope.
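The effect of the step size can be seen on a deliberately simple toy cost, f(a) = (a - 3)^2, whose minimum is at a = 3. This function, the starting point and the three α values are assumptions chosen only to make the behavior visible; they are not part of the regression problem itself.

```python
def descend(alpha, start=10.0, steps=20):
    """Run gradient descent on the toy cost f(a) = (a - 3) ** 2."""
    a = start
    for _ in range(steps):
        gradient = 2 * (a - 3)      # derivative of (a - 3) ** 2
        a = a - alpha * gradient    # step against the slope
    return a


small = descend(alpha=0.01)   # small steps: still far from 3 after 20 steps
medium = descend(alpha=0.4)   # moderate steps: very close to 3
large = descend(alpha=1.1)    # oversized steps: overshoots further each time

for name, value in (("small", small), ("medium", medium), ("large", large)):
    print(f"{name:>6}: a = {value:.4f}, distance from minimum = {abs(value - 3):.4f}")
```

With the oversized α, each step lands on the opposite slope farther from the bottom than before, so the run ends up farther from the minimum than where it started.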

What is gradient descent used for?

Gradient descent lets us find a locally optimal solution for a0 and a1. Why only locally optimal? Because real problems are often not as clear-cut as our example. You may look at one line and find it acceptable, then look at another and find it not bad either. The computer behaves the same way: once it considers a line good enough, it stops looking at other lines.

The problem we posed is a minimization problem (minimizing the cost function). But, as in the image on the right, the algorithm may reach a point that is only a local minimum: the cost rises both to the left and to the right, so it settles there and stops moving. Where it ends up depends on the random initial values and on the learning rate. For the squared-error cost of linear regression, which is convex as noted above, a local minimum is also the global minimum, so this caveat mainly matters for non-convex cost functions.

With a suitable α and enough update iterations, the algorithm theoretically reaches a bottom, which means the cost function is at its smallest within a certain range. The a0 and a1 at that point are the values we were looking for, and they give us a straight line that fits the points in the space.
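Putting the pieces together, a minimal sketch of the whole procedure might look like this. It assumes the mean squared error as the cost; the function name `fit_line`, the values of `alpha` and `iterations`, and the sample data are all choices made for this example.

```python
def fit_line(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = a0 + a1 * x by gradient descent on the mean squared error."""
    n = len(xs)
    a0, a1 = 0.0, 0.0                      # arbitrary starting point
    for _ in range(iterations):
        errors = [a0 + a1 * x - y for x, y in zip(xs, ys)]
        grad_a0 = 2 / n * sum(errors)
        grad_a1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        a0 -= alpha * grad_a0              # update both parameters together
        a1 -= alpha * grad_a1
    return a0, a1


# Hypothetical points generated exactly from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a0, a1 = fit_line(xs, ys)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
next_x = 5.0
print(f"prediction for x = {next_x}: {a0 + a1 * next_x:.3f}")
```

Since the data lie exactly on a line, the recovered a0 and a1 should land very close to the intercept and slope used to generate them, and the fitted line can then be used to predict the next point.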

Finally, what we have described so far is computation in two-dimensional space, that is, with only one feature. In reality there is often more than one feature, as in the following form:

h(x)=a0 + a1 * x1 + a2 * x2 + a3 * x3 …

However, the cost function and the solution method are similar. The amount of data simply grows and the computation becomes more complex.
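To show how little changes with several features, here is a sketch that extends the same gradient descent idea to h(x) = a0 + a1 * x1 + a2 * x2 + … in plain Python. The names `predict` and `fit`, the hyperparameters and the data are hypothetical; a real project would more likely use a numerical library.

```python
def predict(weights, features):
    """h(x) = a0 + a1 * x1 + a2 * x2 + ...; weights[0] is the intercept a0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def fit(rows, ys, alpha=0.05, iterations=5000):
    """Gradient descent on the mean squared error with several features."""
    n = len(rows)
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(iterations):
        errors = [predict(weights, row) - y for row, y in zip(rows, ys)]
        # Gradient for a0 uses a constant "feature" of 1; a_j uses feature x_j.
        grads = [2 / n * sum(errors)]
        for j in range(len(rows[0])):
            grads.append(2 / n * sum(e * row[j] for e, row in zip(errors, rows)))
        weights = [w - alpha * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical data generated exactly from y = 1 + 2 * x1 + 3 * x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]

weights = fit(rows, ys)
print([round(w, 3) for w in weights])
```

The only structural change from the single-feature case is that there is one partial derivative, and one update, per weight.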

OK, today we introduced linear regression starting from an example, then described the cost function and gradient descent, a method for minimizing it. In later posts we will run a linear regression in practice and give a first introduction to other kinds of regression analysis using sklearn.

That’s all ~
