In many cases, we would like to quantify the relationship between numerical variables, that is, to find an equation that expresses this relationship. This is the purpose of regression. In this page, we investigate the simple case of only two variables, `X` and `Y`.
Regression equation
In order to find the regression equation `y = f(x)` (sometimes called the model), we first have to determine its general form: linear, quadratic, cubic, exponential, logarithmic, ... This general form is chosen based on experience, preliminary study, references, ...
Next, we determine the coefficients of this equation. For example:
- for the quadratic equation `y=ax^2+bx+c` we have to determine `a`, `b`, and `c`,
- for the exponential equation `y=m+n e^(px)` we have to determine `m`, `n`, and `p`.
To determine the coefficients of the regression equation, denoted generally as `a_k`, the least squares method is used.
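As a minimal sketch (the function and parameter names below are my own, not from this page), the two example forms can be written as functions of `x` whose coefficients are the unknowns `a_k` to be found:

```python
import numpy as np

# Illustrative only: the coefficients (a, b, c) and (m, n, p) are the
# unknowns a_k that the least squares method has to determine.
def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

def exponential(x, m, n, p):
    return m + n * np.exp(p * x)
```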
Least squares method
Consider the simple case in which each value of `X` has one corresponding observed value of `Y`.
For each value `x_i`, there are then two corresponding values of `y` (Fig. 1):
Fig. 1 Illustration of the least squares method.
- `y_i` : the actual value, obtained from observation and measured during the study. It is represented by point M.
- `y_(io)` : the predicted value, obtained from the regression equation: `y_(io)=f(x_i)`. It is represented by point Mo.
- In general, there is a difference between these two values, `y_i-y_(io)` (known as the residual), represented by the distance MMo.
The chosen curve is the one that is "closest" to the points M: we require that the sum of the squared distances MMo, i.e. the sum of squares `(y_i-y_(io))^2`, is smallest. Denote this sum as `SS_E` (the sum of squares of error):
| `SS_E=sum_i (y_i-f(x_i))^2` | (2) |
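For concreteness, `SS_E` can be evaluated directly from equation (2); the data points and candidate coefficients below are made-up placeholders, not taken from this page:

```python
import numpy as np

# Made-up data points (x_i, y_i) and a candidate quadratic model f(x).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])

def f(x, a, b, c):
    return a * x**2 + b * x + c

# Equation (2): SS_E = sum_i (y_i - f(x_i))^2 for one choice of coefficients.
a, b, c = 2.0, 0.5, 1.0
SS_E = np.sum((y - f(x, a, b, c))**2)
print(SS_E)
```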
In `SS_E`, besides the values `x_i` and `y_i`, which we already know, there are the coefficients `a_k`, which we do not know and have to find.
To satisfy the least squares condition, we require:
| `(partial SS_E)/(partial a_k)=0` | (3) |
Solving equations (3), we obtain the coefficients `a_k` and therefore the regression equation.
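As a hedged sketch of this step for the quadratic model (reusing the same made-up data as above), the conditions (3) reduce to a linear system, the normal equations, which can be solved directly:

```python
import numpy as np

# Made-up data points, illustrative only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])

# For y = a*x^2 + b*x + c, setting dSS_E/da = dSS_E/db = dSS_E/dc = 0
# gives the normal equations X^T X [a, b, c]^T = X^T y, where each row
# of X is (x_i^2, x_i, 1).
X = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.solve(X.T @ X, X.T @ y)
print(a, b, c)

# np.polyfit(x, y, 2) solves the same least squares problem.
```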
Coefficient `R^2`
To evaluate how well the regression equation conforms to the real data, the coefficient of determination, denoted `R^2`, is used. To determine this coefficient, we proceed as follows:
- Calculate the mean of all `y_i`, symbolized as `bar y`.
- Calculate `SS_T` (total sum of squares) defined as:
| `SS_T=sum_i (y_i-bar y)^2` | (4) |

- The coefficient of determination `R^2` is defined as:

| `R^2=1-(SS_E)/(SS_T)` | (5) |
The value of `R^2` varies from 0 to 1 and represents how well the regression equation conforms to the data: the higher this value, the better the fit.
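Continuing the same made-up example, `R^2` follows directly from equations (2), (4), and (5):

```python
import numpy as np

# Made-up data points, illustrative only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 32.8])

a, b, c = np.polyfit(x, y, 2)          # least squares quadratic fit
y_pred = a * x**2 + b * x + c          # y_io = f(x_i)

SS_E = np.sum((y - y_pred)**2)         # equation (2)
SS_T = np.sum((y - np.mean(y))**2)     # equation (4)
R2 = 1.0 - SS_E / SS_T                 # equation (5)
print(R2)
```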