What Is the Most Important Part of an External Audit
Essay by David Liao • March 12, 2016 • Research Paper
Regression and Generalised Linear Models (ST300) Project
Candidate number 33845
Introduction
Aim
The aim of this project is to explain the relationship between height and a selection of variables describing the physical dimensions of the individuals in a data set. The data set covers 507 individuals who exercised regularly, split about evenly between males and females.
Methodology
We began by selecting the variables that we considered to have a significant effect on the response variable (here, the height of the individual)[1].
We combined several methods to arrive at a selection of 8 variables. We then identified and removed outliers that we believed might adversely affect our model. The candidate models were compared using graphical diagnostics, including Q-Q plots and standardised residual plots.
The final section of the report is devoted to further analysis of what the model tells us about the relationship and why the model is suitable.
Result
The final model we obtained was
lm(formula = height ~ biacromial + waist.girth + thigh.girth + bicep.girth + calf.girth + weight + gender + chest.girth)
with the following coefficients:
(Intercept)   203.34501
biacromial      0.44517
waist.girth    -0.63473
thigh.girth    -0.71210
bicep.girth    -0.72215
calf.girth     -0.47886
weight          1.35938
gender          3.79164
chest.girth    -0.17182
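To make the coefficients concrete, the sketch below plugs one set of measurements into the fitted equation. The individual is hypothetical and not taken from the data set, and both the units (centimetres and kilograms) and the coding gender = 1 for male are assumptions, since the report does not state them.

```r
# Coefficients as reported for the final model.
coefs <- c(
  "(Intercept)" = 203.34501,
  biacromial    = 0.44517,
  waist.girth   = -0.63473,
  thigh.girth   = -0.71210,
  bicep.girth   = -0.72215,
  calf.girth    = -0.47886,
  weight        = 1.35938,
  gender        = 3.79164,
  chest.girth   = -0.17182
)

# Hypothetical individual (illustration only). Units of cm and kg, and
# gender = 1 meaning male, are assumptions not confirmed by the report.
person <- c(
  "(Intercept)" = 1, biacromial = 40, waist.girth = 80, thigh.girth = 57,
  bicep.girth = 32, calf.girth = 37, weight = 75, gender = 1,
  chest.girth = 95
)

# The fitted height is the sum of coefficient * measurement over all terms.
predicted_height <- sum(coefs * person[names(coefs)])
print(predicted_height)
```

Each coefficient is the change in fitted height for a one-unit change in that variable with the other seven held fixed, which is why the girth terms can be negative even though larger people tend to be taller: weight is already in the model.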
Variable selection
We considered the standard methods of variable selection separately and, based on the regression summary of each reduced set of variables, chose a set of variables to use.
Forward, backward and stepwise selection
We first attempted stepwise selection. (We later found that, for this data set, all 3 selection procedures gave the same result.)
The result was a set of 17 variables, which we considered too large for the model to be easily interpreted. We can nevertheless conclude that the selection process excluded variables with very small t values: the 7 excluded[2] variables all had t values of less than 1 in magnitude, and hence p values of more than 5 percent. Judging variables individually can miss effects that only appear in the regression as a whole, but since these t values are so small we considered it safe to assume that these variables were not significant in explaining height.
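The three procedures can be sketched in R with `step()` from the base `stats` package. This is an illustration on simulated data, not the report's code: the predictors `x1` to `x6` are hypothetical, and `step()` compares models by AIC, which may differ from the criterion used in the report.

```r
set.seed(1)
n <- 200
# Simulated stand-in data: x1..x6 are hypothetical predictors, and only
# x1 and x2 actually drive the response y.
dat <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 170 + 5 * dat$x1 - 3 * dat$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = dat)  # all candidates
null <- lm(y ~ 1, data = dat)                            # intercept only

# Stepwise search from the full model: terms may be dropped or re-added.
both <- step(full, direction = "both", trace = 0)
# Backward elimination from the full model.
backward <- step(full, direction = "backward", trace = 0)
# Forward selection from the null model, growing at most to the full model.
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Terms retained by each procedure.
print(names(coef(both)))
print(names(coef(backward)))
print(names(coef(forward)))
```

Comparing the three sets of retained terms is a quick way to see whether the procedures agree, as they did for the data set in this report.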
Regression Subset
[pic 1]
From the diagram above we can see that the adjusted R-squared is fairly stable from around 6 to 24 variables. We can therefore choose a model in the region of 6 to 9 variables, a range that is a fair compromise between good predictive power and ease of interpretation.
A problem with subset selection is that the best model of a given size can come out on top by chance, and a model that fits almost as well can contain very different variables. The model returned by subset selection may therefore be better than its competitors by only a small margin. With this in mind, we try to select a model whose variables are retained as more variables are added. For example, at 8 variables we have waist.girth, thigh.girth, bicep.girth, weight, gender, biacromial, calf.girth, gender. In the best R-squared model with 9 variables, the previous 8 are all included, with chest girth being the new addition. This does not rule out a model with different variables and a good R-squared, but it most likely means that the variables chosen by best subset selection did not achieve the best R-squared by chance.
[pic 2]
On this basis we selected the 9-variable model: it still had a fair R-squared, and all of its variables were also in the set of 10 given by subset selection.
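The stability check described above can be sketched as follows. This is a brute-force stand-in written in base R on simulated data, not the routine used in the report: the predictors `x1` to `x5` and the helper `best_of_size` are hypothetical, and exhaustive search is only practical for a small number of candidate variables.

```r
set.seed(2)
n <- 200
# Simulated stand-in data with hypothetical predictors x1..x5;
# the response depends on x1, x2 and x3 only.
dat <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(dat) <- paste0("x", 1:5)
dat$y <- 170 + 5 * dat$x1 - 3 * dat$x2 + 2 * dat$x3 + rnorm(n)
predictors <- paste0("x", 1:5)

# For a given subset size k, fit every k-variable model and keep the one
# with the highest adjusted R-squared.
best_of_size <- function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  adj_r2 <- sapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "y"), data = dat)
    summary(fit)$adj.r.squared
  })
  list(vars = subsets[[which.max(adj_r2)]], adj_r2 = max(adj_r2))
}

best <- lapply(seq_along(predictors), best_of_size)

# Stability check: is each best subset contained in the next larger one?
nested <- sapply(seq_len(length(best) - 1), function(k) {
  all(best[[k]]$vars %in% best[[k + 1]]$vars)
})

for (k in seq_along(best)) {
  cat(k, "variables:", paste(best[[k]]$vars, collapse = ", "),
      " adjusted R-squared =", round(best[[k]]$adj_r2, 3), "\n")
}
print(nested)
```

When `nested` is TRUE at the sizes of interest, the best subsets grow by adding one variable at a time, which is the consistency the report looks for before trusting a particular subset.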
Running the regression on all data points, we see that pelvic breadth has a poor p value, so we remove it from the list of variables.
To check whether the removal improved the model, we compared p values: before the removal, 2 coefficients in the regression had a p value of over 0.001; afterwards, none did.
Thus we choose the following eight variables: biacromial, waist.girth, thigh.girth, bicep.girth, calf.girth, weight, gender, chest.girth.
Regression Diagnostics
[pic 3]
[pic 4]
We fitted 4 models to the data: fit is the original model, fit2 is the model after removal of the high leverage points, fit3 is the model without the points with high Cook's distance, and fit4 removes both the high leverage points and the points with high Cook's distance. There were 32 high leverage points, 26 points with high Cook's distance, and 46 points that had high leverage, high Cook's distance, or both. This is within the reasonable range of 10 percent of the data, so we proceed with removing the points and checking for significant changes in the coefficients or improvements in the residuals.
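A minimal sketch of how four such fits might be produced in R is given below, using `hatvalues()` and `cooks.distance()` from the base `stats` package. The data are simulated, and the cutoffs (leverage above 2p/n, Cook's distance above 4/n) are common rules of thumb assumed here for illustration; the report does not state which thresholds it used.

```r
set.seed(3)
n <- 200
# Simulated stand-in data with two hypothetical predictors.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 170 + 5 * dat$x1 - 3 * dat$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = dat)
p <- length(coef(fit))        # number of estimated coefficients

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # Cook's distance of each observation

# Assumed rule-of-thumb cutoffs (not taken from the report).
high_lev  <- lev  > 2 * p / n
high_cook <- cook > 4 / n

# Refit without the flagged points.
fit2 <- lm(y ~ x1 + x2, data = dat[!high_lev, ])                # no high leverage
fit3 <- lm(y ~ x1 + x2, data = dat[!high_cook, ])               # no high Cook's distance
fit4 <- lm(y ~ x1 + x2, data = dat[!(high_lev | high_cook), ])  # neither

# Number of flagged points and the share of the data removed in fit4.
counts <- c(leverage = sum(high_lev), cook = sum(high_cook),
            either = sum(high_lev | high_cook))
print(counts)
print(unname(counts["either"]) / n)

# Coefficients side by side, to see whether removal changes them much.
print(round(cbind(fit = coef(fit), fit2 = coef(fit2),
                  fit3 = coef(fit3), fit4 = coef(fit4)), 3))
```

The count of points flagged by either rule can be no larger than the sum of the two separate counts, because some points are flagged by both.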
We see that without deleting any data points (top left) there are some outliers at large fitted values. After deleting the high leverage points (top right), the plot shows that some outliers remain.
The last 2 plots (high Cook's distance points deleted, and both sets of points deleted) show no pattern, with values centred on 0 and fewer outliers. The spread in these 2 graphs suggests a fairly constant variance, with perhaps slightly higher variance than we would expect for the tallest individuals.
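Plots of this kind can be drawn in R roughly as follows. This is a sketch on simulated data with hypothetical predictors, using `rstandard()` for the standardised residuals; it is not the code behind the figures in the report.

```r
set.seed(4)
n <- 200
# Simulated stand-in data with two hypothetical predictors.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 170 + 5 * dat$x1 - 3 * dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

std_res <- rstandard(fit)   # standardised residuals

op <- par(mfrow = c(1, 2))  # two panels side by side
# Residuals against fitted values: look for no trend and an even
# spread around 0 across the range of fitted values.
plot(fitted(fit), std_res,
     xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points close to the line suggest roughly normal errors.
qqnorm(std_res)
qqline(std_res)
par(op)

# Crude numeric check: share of standardised residuals beyond +/- 2.
print(mean(abs(std_res) > 2))
```

A funnel shape in the left panel, with the spread widening at large fitted values, would correspond to the slightly higher variance noted for the tallest individuals.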
...
...