Above is a screenshot of the linear model summary fitted on the Boston dataset (MASS library).
Dependent variable: medv
The initial lines of the linear model summary display summary statistics of the residuals: the minimum, 1st quartile, median, 3rd quartile and maximum. Residuals are the deviations of the model's predictions from the actual training data.
Then comes the significance listing of the estimated coefficients. For each coefficient, a null hypothesis test checks how likely it would be to see an estimate at least this large purely by chance, i.e. if there were no relation between the dependent and independent variables. That probability is the p-value. I will try to clarify this in the following steps.
H0 (Null hypothesis): There is no relation between the dependent and the independent variable; the true coefficient is zero.
H1 (Alternative hypothesis): There is some relation between the dependent and independent variables; the coefficients estimated by the model are not obtained just by chance.
We now assume that H0 is true and try to check whether the data are consistent with that assumption. Imagine drawing numerous samples from the population underlying the training data and estimating the coefficients on each sample. By the central limit theorem, the sampling distribution of each coefficient should be approximately normal (more precisely, t-distributed here, since the error variance is itself estimated). For example, the sampling distribution of the zn coefficient across samples would look like:
where μ = mean of the sampling distribution = the true zn coefficient of the population = 0 (assuming no variable dependency, as per H0)
and σ = standard deviation of the sampling distribution = standard error.
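This sampling-distribution idea can be checked with a quick simulation (a Python sketch rather than R; the sample size, predictor range and noise level are made up for illustration):

```python
import random
import statistics

random.seed(42)

def fit_slope(xs, ys):
    # Ordinary least-squares slope: cov(x, y) / var(x)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Under H0 the response is pure noise, unrelated to the predictor,
# so the true slope is 0.  Refit on many independent samples and
# inspect the sampling distribution of the estimated slope.
slopes = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(50)]
    ys = [random.gauss(0, 1) for _ in range(50)]  # no dependence on xs
    slopes.append(fit_slope(xs, ys))

print(statistics.fmean(slopes))  # close to 0, as H0 predicts
print(statistics.stdev(slopes))  # the standard error of the slope
```

The estimates cluster symmetrically around zero, and their standard deviation is exactly the "standard error" reported in the summary table.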
Now, looking at the figures in summary(fit) screenshot:
Coefficient estimate for zn = 4.642e-02
Standard error σ = 1.373e-02
Here, we are finally interested in the probability of seeing a value as extreme as 4.642e-02 for the zn coefficient under the assumption of H0, which corresponds to the area under the sampling distribution curve beyond the 4.642e-02 mark on the horizontal axis (R in fact reports the two-sided probability, counting deviations this large in either direction). So we need to know how many standard deviations away from the mean the value 4.642e-02 falls; this quantity is called the t-statistic or t-value.
Thus, t-statistic or t-value = (Coefficient estimate for zn - μ)/σ = (4.642e-02 - 0)/(1.373e-02) ≈ 3.382 (the same value as observed in the summary(fit) screenshot).
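The arithmetic above can be reproduced in a couple of lines (Python here for illustration; the numbers are the ones read off the summary(fit) screenshot):

```python
# Values from the summary(fit) screenshot for the zn coefficient
est = 4.642e-02   # coefficient estimate
se  = 1.373e-02   # its standard error

# t-value: how many standard errors the estimate lies from mu = 0 (H0)
t = (est - 0) / se
print(t)  # ~3.381; R shows 3.382 because it divides the unrounded values
```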
Now, having these values with us, we can refer to the t-table (with 492 residual degrees of freedom here) to find the area under the curve beyond the 3.382 mark, which is the probability of seeing an estimate at least as extreme as 4.642e-02 for the zn coefficient under the H0 assumption. This probability is called the p-value in terms of hypothesis testing.
The p-value thus obtained = 0.000778
In other words, there is only a 0.0778% chance of seeing an estimate as extreme as 4.642e-02 for the zn coefficient in the data samples, if we assume there is no relationship between the dependent and independent variables (H0). Such a low probability lets us reject H0. Thus, we can say that the dependent variable depends significantly on zn, hence the "***" against the estimated zn coefficient in the summary(fit) screenshot.
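With 492 residual degrees of freedom, the t distribution is nearly normal, so the two-sided tail probability can be sketched with just the standard library (a normal approximation, which comes out slightly below the exact t-based 0.000778):

```python
import math

t = 3.382  # t-value of the zn coefficient from the screenshot

# Two-sided tail area of the standard normal beyond |t|:
# P(|Z| >= t) = erfc(t / sqrt(2)).  The exact t(492) answer is 0.000778;
# the normal approximation lands a touch lower because it has thinner tails.
p = math.erfc(abs(t) / math.sqrt(2))
print(p)  # roughly 7e-04
```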
Residual standard error: This is an estimate of the standard deviation of the residuals, i.e. how far the observed values typically fall from the fitted regression line, measured in the units of the dependent variable. It reflects the variation left unexplained by the model. In practice, a residual standard error of 0 on the training data would signal overfitting rather than a perfect model.
R-squared: R-squared estimates the proportion of variance of the dependent variable that is explained by the model (or the independent variable).
Multiple R-squared: It is the measure of R-squared for a model having multiple independent variables.
Adjusted R-squared: As the number of predictor (independent) variables in a model increases, multiple R-squared also increases, since each new predictor explains some additional proportion of the variance of the target (dependent) variable. That contribution may not be significant, yet it still inflates multiple R-squared. Adjusted R-squared therefore applies a penalty for the number of predictors.
The difference between multiple and adjusted r-squared values will not be large for a good fit model. A large difference observed between these values, might indicate an overfitted model.
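The penalty is a degrees-of-freedom correction. As a sketch, using the usual figures for the full Boston fit (Multiple R-squared 0.7406, n = 506 observations, p = 13 predictors; treat these as assumptions if your fit differs):

```python
r2 = 0.7406      # multiple R-squared of the full model (assumed)
n, p = 506, 13   # observations and predictors in the Boston data

# Adjusted R-squared penalises each extra predictor through the
# degrees-of-freedom ratio (n - 1) / (n - p - 1).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # ~0.7337, i.e. the reported 0.7338 up to rounding
```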
Finally, the F-test compares the proportion of the target's variance explained by the model against the proportion left unexplained, each scaled by its degrees of freedom; its outcome is the F-statistic. The high F-statistic here (108.1, as seen in the screenshot above) signifies that the model explains far more of the target's variance than would be expected by chance, so the model is a reasonable fit overall. Further, the tiny p-value of the F-test (< 2.2e-16) says that the probability of getting such a high F-statistic under H0 (all coefficients zero) is vanishingly small, so H0 can be rejected and H1 holds: at least one predictor is genuinely related to the response.
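The F-statistic can likewise be recovered from R-squared and the degrees of freedom (again assuming Multiple R-squared 0.7406 for this fit):

```python
r2 = 0.7406      # multiple R-squared (assumed from the full Boston fit)
n, p = 506, 13   # observations and predictors

# Explained variance per model DF over unexplained variance per residual DF
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
print(f_stat)  # ~108, matching the 108.1 in the screenshot up to rounding
```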