## Simple Linear Regression and Correlation

The relationship between two variables may be one of dependency. That is, the magnitude of one of the variables (the dependent variable) is assumed to be determined by the magnitude of the second variable (the independent variable). Sometimes, the independent variable is called the predictor or regressor variable, and the dependent variable is called the response or criterion variable. This dependent relationship is termed regression. However, in many types of biological data, the relationship between two variables is not one of dependency. In such cases, the magnitude of one of the variables changes with changes in the magnitude of the second variable, and the relationship is termed correlation. Both simple linear regression and simple linear correlation consider two variables. In simple regression, one variable is linearly dependent on a second variable, whereas in simple correlation neither variable is functionally dependent upon the other.

It is convenient to graph simple regression data, using the abscissa (X axis) for the independent variable and the ordinate (Y axis) for the dependent variable. The simplest functional relationship of one variable to another in a population is the simple linear regression:

$$Y_i = \alpha + \beta X_i$$

Here, $\alpha$ and $\beta$ are population parameters (constants) that describe the functional relationship between the two variables in the population. However, in a population the data are unlikely to lie exactly on a straight line, so Y may be related to X by

$$Y_i = \alpha + \beta X_i + \epsilon_i$$

where $\epsilon_i$ is referred to as an error or residual.
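As a minimal sketch of the population model above, the following Python snippet builds each observation as a straight-line component plus a random error. The parameter values, the normally distributed errors, and the name `simulate_observations` are illustrative assumptions, not taken from the text.

```python
import random

def simulate_observations(xs, alpha, beta, sigma, seed=0):
    """Return (ys, errors), where each y = alpha + beta * x + error.

    Errors are drawn from a normal distribution with mean 0 and
    standard deviation sigma (an assumption made for illustration).
    """
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [alpha + beta * x + e for x, e in zip(xs, errors)]
    return ys, errors

# Hypothetical X values and parameters, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys, errors = simulate_observations(xs, alpha=2.0, beta=0.5, sigma=1.0)

for x, y, e in zip(xs, ys, errors):
    print(f"X={x:.1f}  Y={y:.3f}  error={e:+.3f}")
```

Each printed Y differs from the line value by exactly its error term, which is why the points scatter around the line rather than lying on it.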

Generally, there is considerable variability of data around any straight line. Therefore, we seek to define a so-called "best-fit" line through the data. The criterion for "best fit" normally utilizes the concept of least squares. The least-squares criterion considers the vertical deviation of each point from the line, $(Y_i - \hat{Y}_i)$, where $\hat{Y}_i$ is the value of Y on the line at $X_i$, and defines the best-fit line as that which results in the smallest value for the sum of the squares of these deviations for all values of $Y_i$ with respect to $\hat{Y}_i$. That is, $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is to be a minimum, where $n$ is the number of data points. The sum of squares of these deviations is called the residual sum of squares (or the error sum of squares). Because it is impossible to possess all the data for the entire population, we have to estimate the parameters $\alpha$ and $\beta$ from a sample of $n$ data, where $n$ is the number of pairs of X and Y values. The calculations required to arrive at such estimates and to test a variety of important hypotheses involve the computation of sums of squared deviations from the mean. They also require a quantity referred to as the sum of the cross-products of deviations from the mean:

$$\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$$
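The two forms of the sum of cross-products can be checked against each other numerically. The sketch below uses a small hypothetical data set, and the function names are illustrative assumptions.

```python
def sum_cross_products(xs, ys):
    """Definitional form: sum of (X - mean X) * (Y - mean Y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def sum_cross_products_shortcut(xs, ys):
    """Computational form: sum of X*Y minus (sum X)(sum Y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

sxy_definition = sum_cross_products(xs, ys)
sxy_shortcut = sum_cross_products_shortcut(xs, ys)
print(sxy_definition, sxy_shortcut)
```

The two forms are algebraically identical. In floating-point arithmetic the definitional form tends to be the more accurate of the two when the values are large relative to their spread.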

The parameter $\beta$ is termed the regression coefficient, or the slope of the best-fit regression line. The best estimate of $\beta$ is

$$b = \frac{\sum xy}{\sum x^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sum X_i^2 - (\sum X_i)^2/n}$$

Although the denominator in this calculation is always positive, the numerator may be positive, negative, or zero, so the slope takes the sign of the numerator. The regression coefficient expresses what change in Y is associated, on the average, with a unit change in X.

A line can be defined uniquely by stating, in addition to $\beta$, any one point on the line, conventionally the point where $X = 0$. The value of Y in the population at this point is the parameter $\alpha$, which is called the Y intercept. The best estimate of $\alpha$ is

$$a = \bar{Y} - b\bar{X}$$
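The estimates of slope and intercept described above can be sketched in a few lines of Python. The data are hypothetical, and `fit_line` and `residual_ss` are illustrative names. The sketch also nudges the fitted line slightly to show that each nudge increases the sum of squared vertical deviations.

```python
def fit_line(xs, ys):
    """Return least-squares estimates (a, b) of intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # sum_x2 is zero only if every X is identical; no line can be fit then.
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_x2
    a = y_bar - b * x_bar
    return a, b

def residual_ss(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(xs, ys)
best = residual_ss(a, b, xs, ys)

# Small perturbations of the intercept and slope, for comparison.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged = [residual_ss(a + da, b + db, xs, ys) for da, db in nudges]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"residual SS at fit = {best:.3f}, smallest nudged = {min(nudged):.3f}")
```

For these data the sample means are 3 and 4, the sum of cross-products is 6, and the sum of squares of X is 10, so the slope is 0.6 and the intercept is 2.2.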

By specifying both a and b, a line is uniquely defined. Because a and b are calculated using the criterion of least squares, the residual sum of squares from this line is the smallest possible. Regression analysis rests on certain basic assumptions (Graybill and Iyer, 1994; Sen and Srivastava, 1997):

1. For any value of X there exists in the population a normal distribution of Y values and, therefore, a normal distribution of $\epsilon$'s.

2. The variances of these population distributions of Y values (and of $\epsilon$'s) must all be equal to one another, that is, there is homogeneity of variances.

3. In the population, the mean of the Y's at a given X lies on a straight line with all other mean Y's at the other X's; that is, the actual relationship in the population is linear.

4. The values of Y are to have come at random from the sampled population and are to be independent of one another.

5. The measurements of X are obtained without error. In practice, it is assumed that the errors in X data are at least small compared with the measurement errors in Y.

For ANOVA, the overall variability of the dependent variable, termed the total sum of squares, is calculated as the sum of squares of deviations of the Y values from their mean, $\bar{Y}$:

$$\text{total SS} = \sum y^2 = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$

Then, one determines the amount of variability among the Y values that results from there being a linear regression; this is termed the linear regression sum of squares:

$$\text{regression SS} = \frac{(\sum xy)^2}{\sum x^2} = \frac{\left[\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n\right]^2}{\sum X_i^2 - (\sum X_i)^2/n}$$

The value of the regression SS will be equal to that of the total SS only if each data point falls exactly on the regression line. The scatter of data points around the regression line is measured by the residual sum of squares, which is calculated as the difference between the total and linear regression sums of squares:

$$\text{residual SS} = \sum (Y_i - \hat{Y}_i)^2 = \text{total SS} - \text{regression SS}$$
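The partition of the total sum of squares can be verified numerically. The following sketch, on hypothetical data and with an illustrative function name, computes the residual sum of squares directly from the fitted line and compares it with the difference between the total and regression sums of squares.

```python
def sums_of_squares(xs, ys):
    """Return (total SS, regression SS, residual SS computed directly)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)

    total_ss = sum((y - y_bar) ** 2 for y in ys)
    regression_ss = sum_xy ** 2 / sum_x2

    # Fitted line, used to compute the residuals directly.
    b = sum_xy / sum_x2
    a = y_bar - b * x_bar
    residual_direct = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return total_ss, regression_ss, residual_direct

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

total_ss, regression_ss, residual_direct = sums_of_squares(xs, ys)
print(total_ss, regression_ss, residual_direct)
print("difference:", total_ss - regression_ss)
```

Up to floating-point rounding, the directly computed residual SS agrees with total SS minus regression SS.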

The degrees of freedom associated with the total variability of Y values are $n - 1$. The degrees of freedom associated with the variability among $Y_i$'s due to regression are always 1 in a simple linear regression. The residual degrees of freedom are calculable as residual DF = total DF $-$ regression DF = $n - 2$.

Once the regression and residual mean squares are calculated (MS = SS/DF), the null hypothesis of no linear dependence of Y on X ($H_0$: $\beta = 0$) may be tested by determining

F = regression MS/residual MS

This calculated F value is then compared to the critical value, $F_{\alpha(1),\nu_1,\nu_2}$, where $\nu_1$ = regression DF = 1 and $\nu_2$ = residual DF = $n - 2$. (Here $\alpha$ denotes the significance level, not the Y intercept.) The residual mean square is often written as $s^2_{Y \cdot X}$, a representation denoting that it is the variance of Y after taking into account the dependence of Y on X. The square root of this quantity, that is, $s_{Y \cdot X}$, is called the standard error of estimate (occasionally termed the standard error of the regression). The ANOVA calculations are summarized in Table 2.2.
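The ANOVA steps above (sums of squares, degrees of freedom, mean squares, F, and the standard error of estimate) can be sketched as follows. The data are hypothetical, and the function name and dictionary keys are illustrative assumptions.

```python
import math

def regression_anova(xs, ys):
    """Return the ANOVA quantities for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)

    total_ss = sum((y - y_bar) ** 2 for y in ys)
    regression_ss = sum_xy ** 2 / sum_x2
    residual_ss = total_ss - regression_ss

    regression_df = 1
    residual_df = n - 2  # requires n > 2, otherwise no residual MS exists

    regression_ms = regression_ss / regression_df
    residual_ms = residual_ss / residual_df
    return {
        "total_ss": total_ss,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "regression_df": regression_df,
        "residual_df": residual_df,
        "regression_ms": regression_ms,
        "residual_ms": residual_ms,
        "F": regression_ms / residual_ms,
        "std_error_of_estimate": math.sqrt(residual_ms),
    }

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

table = regression_anova(xs, ys)
for name, value in table.items():
    print(f"{name}: {value}")
```

The computed F would then be compared with the critical value from an F table for 1 and $n - 2$ degrees of freedom. If SciPy is available, `scipy.stats.f.ppf(1 - alpha, 1, n - 2)` is one way to obtain that value; it is left out here to keep the sketch free of third-party dependencies.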

The proportion of the total variation in Y that is explained or accounted for by the fitted regression is termed the coefficient of determination, $r^2$, which may be thought of as a measure of the strength of the straight-line relationship:

$$r^2 = \frac{\text{regression SS}}{\text{total SS}}$$

The quantity $r$ is the correlation coefficient, which is calculated as

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$
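As a brief check on the two formulas above, the sketch below computes $r$ from the sums of squares and cross-products and confirms that its square equals regression SS divided by total SS. The data are hypothetical and the function name is an illustrative assumption.

```python
import math

def correlation(xs, ys):
    """Return (r, regression SS / total SS) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)
    sum_y2 = sum((y - y_bar) ** 2 for y in ys)  # this is the total SS

    r = sum_xy / math.sqrt(sum_x2 * sum_y2)
    regression_ss = sum_xy ** 2 / sum_x2
    return r, regression_ss / sum_y2

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r, r_squared = correlation(xs, ys)
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, r*r = {r * r:.4f}")
```

Because the denominator is positive, $r$ carries the sign of the sum of cross-products, and therefore the same sign as the slope $b$.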

TABLE 2.2. ANOVA Calculations of Simple Linear Regression

| Source of Variation | Sum of Squares, SS | DF | Mean Square, MS |
| --- | --- | --- | --- |
| Total | $\sum (Y_i - \bar{Y})^2$ | $n - 1$ | |
| Linear regression | $(\sum xy)^2 / \sum x^2$ | 1 | Regression SS/regression DF |
| Residual | $\sum (Y_i - \hat{Y}_i)^2$ | $n - 2$ | Residual SS/residual DF |