Thursday, 2 May 2013

Thinking about numbers for Multiple Regression Analysis

I have had one of those emails that makes you wonder: someone trying to fit 25 explanatory variables to a data set that has just 20 observations. It is one of those days when you say "Stop trying and do something simpler". However, it set me wondering whether there was a better rule of thumb for sample size in regression. Figures for the existing rules are available in Samuel B. Green (1991): How Many Subjects Does It Take To Do A Regression Analysis, Multivariate Behavioral Research, 26:3, 499-510.

Well, I have discovered some things. Firstly, most rules are of the form A + B.M, where M is the number of explanatory variables. I got some things right straight off. My idea was that the sample size needed for a given power does not grow with M but with the square root of M, and indeed this does seem to be closer to the truth. However, I fairly soon realised that M+1 rather than M is the better quantity to take the square root of. I played around a bit, and a figure of 17.5 seems to work as the constant to match the G*Power calculations.
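To make the comparison concrete, here is a minimal sketch of the two kinds of rule. The linear form A + B.M is as described above, with A = 50 and B = 8 used as an example of a rule of that shape (one of those Green discusses for testing the overall fit). The square-root form N = 17.5 x sqrt(M + 1) is only one reading of the paragraph above: the constant and the M+1 come from the text, but the exact formula is an assumption, as are the function names.

```python
import math

def linear_rule(m, a, b):
    """Linear rule of thumb: N = A + B*M for M explanatory variables."""
    return a + b * m

def sqrt_rule(m, c=17.5):
    """ASSUMED form of the square-root rule: N = c * sqrt(M + 1), rounded up.
    The constant c = 17.5 is from the post; the formula itself is a guess."""
    return math.ceil(c * math.sqrt(m + 1))

# Compare the rules for a few model sizes, including the 25-variable email.
for m in (1, 3, 5, 10, 25):
    print(m, linear_rule(m, 50, 8), sqrt_rule(m), sqrt_rule(m, c=20))
```

On this reading, the 25-variable model from the email would want about 90 observations (102 with a constant of 20), rather than the 20 available.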

So I ended up with a graph that looked like the one below:


The Cartledge method is interesting. It is not in the statistics literature but comes from a modelling background. It is not actually meant for straight regression; it gives the number of cases you need, if you model all interactions, for the mathematics to work. Note that word mathematics! A statistician would want five times that amount (we do not like fitting a dimension with only two points, sorry mathematicians). Be warned: interaction terms should be added only with the greatest of caution.
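As a rough illustration of why modelling all interactions gets expensive so quickly, the sketch below counts the coefficients in a model with M variables and every possible interaction, which is 2^M once the intercept is included. Treating that count as the mathematical minimum number of cases is an assumption made here for illustration, not a statement of the Cartledge formula; the five-fold multiplier is the one mentioned above, and the function names are hypothetical.

```python
from itertools import combinations

def full_interaction_terms(m):
    """Count coefficients in a model with m variables and every interaction.
    Each subset of the variables is one term; the empty subset is the intercept."""
    return sum(1 for k in range(m + 1) for _ in combinations(range(m), k))

def cases_needed(m, multiplier=5):
    """Return (minimum, comfortable) case counts.
    ASSUMPTION: the minimum is one case per coefficient; the multiplier of 5
    is the statistician's margin mentioned in the text."""
    minimum = full_interaction_terms(m)
    return minimum, multiplier * minimum

for m in range(2, 7):
    print(m, cases_needed(m))
```

Even six variables with all interactions already asks for 64 cases on the bare count and 320 with the margin, which is a good reason to add interaction terms sparingly.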

Another point is that I am not wedded to 17.5; I would quite happily consider using 20 instead. This would allow some leeway for fitting different models. I wonder how you would correct for the technique used to select the independent variables.
