Thursday, 24 January 2013

Sometimes thinking visually is slow: PCA and Regression

I was in a discussion over planning a course on Statistics for Linguists where someone said they did not understand factor analysis and then went on to describe problems over how it decided what was important. The problem seemed to be all about scaling. I came away feeling oddly as if I knew the answer but could not quite think of it.

Right, the first thing to realise is Principal Component Analysis does exactly what it says on the can. If you see linear regression as minimizing the vertical distance from the points to the line, then the equivalent for Principal Component Analysis is minimizing the perpendicular distance from the points to the line!
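That contrast can be seen numerically. The sketch below (using made-up data, with numpy) compares the ordinary regression slope with the slope of the first principal component; because regression only penalises vertical distance, its slope is attenuated relative to the perpendicular-fitting PCA slope:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# Regression slope: minimizes the vertical (y) distances to the line
b_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# First principal component: minimizes the perpendicular distances
cov = np.cov(x, y, ddof=1)
vals, vecs = np.linalg.eigh(cov)
pc1 = vecs[:, np.argmax(vals)]   # eigenvector with the largest eigenvalue
b_pca = pc1[1] / pc1[0]          # slope of the first component

print(b_ols, b_pca)
```

The PCA slope is always at least as steep (in magnitude) as the regression slope, since regression "gives up" on the error in x entirely.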

The problem is then the problem there has always been: how to scale the X and Y axes to allow for the variation. There are two ways to do this. The easiest is not to do anything! That way you take the Principal Components of the Covariance Matrix. The main alternative is to divide each variable by its own standard deviation or, in other words, to use standardised scores. To do this you just find the Principal Components of the Correlation Matrix.
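A quick sketch of why the choice matters, using two invented variables on very different scales: on the covariance matrix the large-scale variable swamps the first component, while on the correlation matrix both variables start on an equal footing (the eigenvalues then sum to the number of variables):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical variables on wildly different scales
x = rng.normal(scale=100.0, size=300)             # e.g. a raw count
y = 0.001 * x + rng.normal(scale=0.1, size=300)   # e.g. a small rate

data = np.column_stack([x, y])

# Option 1: PCA on the covariance matrix (no scaling)
cov_vals = np.linalg.eigvalsh(np.cov(data, rowvar=False))

# Option 2: PCA on the correlation matrix (standardised scores)
corr_vals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))

# Covariance: the big-variance variable dominates the first component.
# Correlation: the eigenvalues sum to 2, one per standardised variable.
print(cov_vals, corr_vals)
```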

Sometimes once I can see things visually, they become very easy, but it has probably taken the best part of twenty years for that visualization to coalesce.

Friday, 11 January 2013

Fiddle Factors and Adjusting logged data

I have had a query on what sort of an offset you should use when taking logarithms of variables with zeros in them. I have spent quite some time searching for papers on this without turning up anything. So today I decided to have a go and see what would happen in a situation where a log is normally the suggested transform.

Thus I decided to generate Poisson data and see which fiddle factors actually behaved best. My hunch was that the smaller the mean, the smaller the fiddle factor that would work best. But it was only a hunch.

So I generated samples of Poisson data with means 0.1, 0.2, 0.5, 1, 2, 5, 10 and 20. I then looked at taking log transforms with offsets of 0.05, 0.1, 0.2, 0.5 and 1, as well as the raw data. I then calculated the Shapiro-Wilk statistic for all these transformed data sets, treating the untransformed data as an offset of zero. The results are not as expected:
Right, the first thing to say is that a lot of the time the untransformed data performs as well as any of the log transforms. The only case where the log transform seems to be effective is for Poisson means around 0.5 and 1, i.e. those with a fairly high level of skewness. Beyond this it does nothing to improve the data, and when the data is less skewed the logarithm actually performs worse than the untransformed data.
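The simulation is easy to reproduce; a minimal sketch (my own sample size and seed, not necessarily what was used here) with numpy and scipy would be:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
means = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20]
offsets = [0, 0.05, 0.1, 0.2, 0.5, 1]  # 0 = untransformed raw data

for mu in means:
    sample = rng.poisson(mu, size=500).astype(float)
    row = []
    for c in offsets:
        # Apply log(x + c), or leave the data raw when the offset is zero
        values = sample if c == 0 else np.log(sample + c)
        w, _ = stats.shapiro(values)  # higher W = closer to normal
        row.append(round(w, 3))
    print(mu, row)
```

Comparing each row's W statistics across offsets shows directly when the log transform helps and when it hurts.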

These are very preliminary findings, but my advice given them would be:
  • If the majority of your data are zeros, consider using a binary logistic model or some other more complex way of modelling zeros.
  • If the data is reasonably normal then taking a log is unlikely to add anything and may actually make things worse.
  • It is only in cases where the data is skewed and the zeros are a minority that a transform is worth doing. In these cases a fiddle factor of the lowest nonzero value as the offset may well be the best option.