2uData.com

Fundamentals of regression

In many cases, we would like to quantify the relationship between numerical variables, which means finding an equation that expresses this relationship. The tool for this is regression. On this page, we investigate the simple case of only two variables, `X` and `Y`.

Regression equation

 

To find the regression equation `y = f(x)` (sometimes called the model), we first have to choose its general form: linear, quadratic, cubic, exponential, logarithmic, and so on. This general form is chosen based on experience, a preliminary study of the data, or the literature.

Next, we determine the coefficients of this equation. For example:

  • for the quadratic equation `y=ax^2+bx+c` we have to determine `a`, `b`, and `c`,
  • for the exponential equation `y=m+n e^(px)` we have to determine `m`, `n`, and `p`.
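The two steps (choose a general form, then determine its coefficients) can be pictured with a minimal sketch in which each general form is a function whose coefficients are still free parameters. The function names and the coefficient values below are hypothetical, chosen only for illustration.

```python
import math


def quadratic(x, a, b, c):
    """General form y = a*x^2 + b*x + c; a, b, c are the coefficients to determine."""
    return a * x ** 2 + b * x + c


def exponential(x, m, n, p):
    """General form y = m + n*e^(p*x); m, n, p are the coefficients to determine."""
    return m + n * math.exp(p * x)


# Illustrative (made-up) coefficient values: once they are fixed,
# each general form becomes one concrete regression equation.
y_quad = quadratic(2.0, a=1.0, b=-3.0, c=2.0)
y_exp = exponential(0.0, m=1.0, n=2.0, p=0.5)
```

Regression is the search for the coefficient values that make such a function follow the data as closely as possible.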

To determine the coefficients of the regression equation, denoted generally as `a_k`, the least squares method is used.


Least squares method

 

Consider the simple case in which each value of `X` has one corresponding observed value of `Y`.

For each value `x_i`, we then have two values of `y` to compare (Fig. 1):

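[Figure: axes X and Y, with the observed point M at (`x_i`, `y_i`) and the point Mo on the regression curve at (`x_i`, `y_(io)`)]

Fig. 1 Illustration of the least squares method.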

  • `y_i` : the actual value, obtained from observation or measurement during the study. This value is represented by point M.
  • `y_(io)` : the predicted value, obtained from the regression equation `y=f(x)` and represented by point Mo :   `y_(io)=f(x_i)`.
  • In general, there is a difference between these two values, `y_i-y_(io)` (known as the residual), represented by the distance MMo.

The chosen curve is the one that lies “closest” to the points M. In the least squares method, “closest” means that the sum of the squared distances MMo, that is, the sum of the squares `(y_i - y_(io))^2`, is smallest. Denote this sum as `SS_E` (sum of squares of error):

`SS_E=sum_i (y_i-f(x_i))^2` (2)
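As a minimal sketch of equation (2), the function below evaluates `SS_E` for any candidate curve `f`. The data set and the two candidate lines are hypothetical, chosen so that the sums are easy to check by hand.

```python
def sum_of_squared_errors(f, xs, ys):
    """Return SS_E = sum_i (y_i - f(x_i))^2 for a candidate model f."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def exact_line(x):
    """Candidate passing through every point."""
    return 2.0 * x + 1.0


def shifted_line(x):
    """Candidate lying one unit above every point."""
    return 2.0 * x + 2.0


sse_exact = sum_of_squared_errors(exact_line, xs, ys)      # every difference is 0
sse_shifted = sum_of_squared_errors(shifted_line, xs, ys)  # every difference is -1
```

The first candidate gives `SS_E = 0` and the second gives `SS_E = 4` (four differences of -1, each squared), so the first is the "closer" curve in the least squares sense.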

In `SS_E`, besides the values `x_i` and `y_i`, which we already know, there are the coefficients `a_k`, which we do not know and have to find.

The least squares condition requires `SS_E` to be a minimum with respect to every coefficient, so for each `a_k`:

`(partial SS_E)/(partial a_k)=0` (3)

Solving equations (3), one equation per coefficient, gives the values of `a_k` and therefore the regression equation.
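As an illustration, take the linear form `y=ax+b`, which is one of the general forms mentioned earlier but is not worked out on this page. Condition (3) then gives two equations, `sum_i x_i (y_i-ax_i-b)=0` and `sum_i (y_i-ax_i-b)=0`, which can be solved directly for `a` and `b`. The sketch below implements that closed-form solution; the function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares coefficients (a, b) of y = a*x + b.

    Obtained by solving the two equations dSS_E/da = 0 and dSS_E/db = 0.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - a * sum_x) / n
    return a, b


# Hypothetical measurements, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
a, b = fit_line(xs, ys)
print(f"y = {a:.3f} x + {b:.3f}")
```

For these four points the sums are `sum x = 10`, `sum y = 19`, `sum x^2 = 30`, and `sum xy = 57`, which gives `a = 1.9` and `b = 0` up to floating-point rounding. For forms in which a coefficient appears nonlinearly, such as `p` in the exponential equation, equations (3) generally have no closed-form solution and are solved numerically.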


Coefficient `R^2`

 

To evaluate how well the regression equation conforms to the real data, the coefficient of determination, denoted as `R^2`, is used. To determine this coefficient, we proceed as follows:

  • Calculate the mean of all `y_i`, denoted as `bar y`.
  • Calculate `SS_T` (total sum of squares), defined as:
    `SS_T=sum_i (y_i-bar y)^2` (4)
  • The coefficient of determination `R^2` is defined as:
    `R^2=1-(SS_E)/(SS_T)` (5)

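For a least squares fit of a model that contains a constant term (as in the examples above), `R^2` lies between 0 and 1. This quantity represents how well the regression equation conforms to the data: the higher the value, the better the fit. A value of 1 means `SS_E=0`, that is, the curve passes through every data point.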
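The calculation in equations (4) and (5) can be sketched as follows. The observed values are hypothetical, and the predicted values are assumed to come from the line `y=1.9x` evaluated at `x = 1, 2, 3, 4`; neither is taken from a real study.

```python
def r_squared(ys, predicted):
    """Return R^2 = 1 - SS_E / SS_T for observed ys and model predictions."""
    mean_y = sum(ys) / len(ys)
    ss_e = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_t = sum((y - mean_y) ** 2 for y in ys)
    if ss_t == 0:
        raise ValueError("all observed values are equal; R^2 is undefined")
    return 1 - ss_e / ss_t


# Hypothetical observed values and the matching predictions of y = 1.9x.
ys = [2.0, 4.0, 5.0, 8.0]
predicted = [1.9, 3.8, 5.7, 7.6]

r2 = r_squared(ys, predicted)
print(f"R^2 = {r2:.4f}")
```

Here `bar y = 4.75`, `SS_E = 0.70`, and `SS_T = 18.75`, so `R^2` is about 0.9627.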






This web page was last updated on 03 December 2018.