Thursday, 6 June 2013

Rethinking Principal Components Analysis (PCA)

Right, I have used PCA quite a bit in the past. The first thing I use it for is, quite simply, to check the underlying structure of a scale. Many researchers put scales together where they want to sum up the scores. They run a Cronbach's alpha and find it is something like 0.437. They then go to a statistician in tears because they have collected all this data and now the scale does not work. Or at least the sensible ones do. The statistician's job is to sort out why it does not work and to save the day by suggesting a workaround. There are a variety of tools that we can use. The first is to check that the items are all scored in the same direction! It is amazing how many people try to create a scale by adding variables even when some are deliberately scored in the other direction. However, if that does not fix it, then often a straightforward PCA will lead you to the problem and you can then talk about ways of fixing it.
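To make the direction check concrete, here is a minimal sketch in Python. The responses are made up purely for illustration, and the function names `cronbach_alpha` and `reverse_item` are my own, not from any package. It computes alpha from the item variances and the variance of the summed score, once with a reverse-worded item left as it is and once after flipping it.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                        # one tuple of scores per item
    item_var = sum(variance(col) for col in items)  # sum of the item variances
    total_var = variance([sum(r) for r in rows])    # variance of the summed scale
    return k / (k - 1) * (1 - item_var / total_var)


def reverse_item(rows, index, low=1, high=5):
    """Flip one item scored low..high so it points the same way as the others."""
    return [[(low + high - v) if j == index else v for j, v in enumerate(r)]
            for r in rows]


# Hypothetical 1-5 responses; the third item is worded in the opposite direction.
responses = [
    [5, 4, 1],
    [4, 4, 2],
    [2, 3, 4],
    [1, 2, 5],
    [3, 3, 3],
]

alpha_raw = cronbach_alpha(responses)                     # item left reversed
alpha_fixed = cronbach_alpha(reverse_item(responses, 2))  # item flipped first
print(round(alpha_raw, 3), round(alpha_fixed, 3))
```

With this toy data alpha is negative before the flip and high afterwards, which is the sort of jump you hope to see when scoring direction is the only problem.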

The second use is when the data are known to be highly correlated but the actual patterns are not known. The classic example is food intakes: not nutrient intakes, but the actual food items such as bread, sugar, milk, eggs etc. Nobody really believes there are factors underlying this, but there are patterns, and distinguishing those patterns should work. We normally have to standardize the variables, since most people eat a lot more bread than they drink fruit juice! Quite often it is also a good idea to take logs of the measurements as part of standardizing. In this case I will apply a rotation and then hope the patterns make sense.
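As a rough sketch of the standardizing step only (not the PCA or the rotation), something like the following would do. The intake figures are invented, and using log(1 + x) rather than a plain log is my assumption here, chosen so that people who report zero of an item do not break the transform.

```python
from math import log1p
from statistics import mean, stdev


def log_standardize(values):
    """Apply log(1 + x), then centre and scale to unit standard deviation."""
    logged = [log1p(v) for v in values]
    m, s = mean(logged), stdev(logged)
    return [(v - m) / s for v in logged]


# Hypothetical weekly intakes (grams or millilitres) for five people.
intakes = {
    "bread": [700, 560, 840, 420, 980],
    "fruit_juice": [0, 150, 50, 600, 20],
}

# After this step every food item is on the same scale, so bread no longer
# dominates simply because people eat more of it.
standardized = {food: log_standardize(v) for food, v in intakes.items()}
```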

Now last week I was co-teaching a class for linguists with Nick Fieller. This was fun, partly because, oddly enough, Nick and I are fairly similar in our attitude to statistics, although he is by far the brighter guy, and partly because it was in an area of his expertise. He knew far more about MDA than me, but his use of PCA was interesting. What he suggested, and what I had not thought of, was to use PCA and plotting as a method for doing exploratory statistics before carrying out a Repeated Measures ANOVA. Let me explain. The big problem with Repeated Measures ANOVA is that there are few graphs that are actually useful, because the data have at least two levels of variation. Therefore I have tended to take a suck-it-and-see approach. The principal components actually summarise the variation within subjects into several dimensions. The graphs are for a statistician, but I will work through an example. Here I am looking at a variable, UMMA, measured at 5 time points. The first was prior to being given medication, and everyone was given the same medication, the aim being to lower their UMMA. There were, however, two groups of subjects. So what we got was the following table:

Total Variance Explained (Raw)

            Initial Eigenvalues                    Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %   Total   % of Variance   Cumulative %
1           2.972   80.789           80.789        2.972   80.789          80.789
2            .454   12.345           93.134         .454   12.345          93.134
3            .109    2.971           96.104
4            .092    2.509           98.614
5            .051    1.386          100.000

Now there are some things you should note. Firstly, the first component is the most important, explaining 81% of the variance. Secondly, the second explains around another 12%, so it seems to me worth having even though it is relatively small.
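If you want to see where those percentages come from, each one is just the eigenvalue divided by the sum of all five eigenvalues. A quick check in Python, using the rounded eigenvalues from the "Total" column (so the results differ from the printed output in the second decimal place):

```python
from itertools import accumulate

# Eigenvalues as printed in the "Total" column of the table (rounded to 3 d.p.).
eigenvalues = [2.972, 0.454, 0.109, 0.092, 0.051]

total = sum(eigenvalues)
percent = [100 * ev / total for ev in eigenvalues]  # % of variance per component
cumulative = list(accumulate(percent))              # running total

for i, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Component {i}: {p:6.2f}%  cumulative {c:6.2f}%")
```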

Now when we look at the components we get:

Component Matrix (Raw)

            Component
Variable    1        2
umma_v1     1.249    -.454
umma_V2      .692     .204
umma_V3      .671     .241
umma_V4      .537     .328
UMMA_V5      .441     .201
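A small aside on how to read these numbers. As far as I understand the raw (covariance-based) output, each column holds the eigenvector scaled by the square root of its eigenvalue, so the squared loadings in a column should add up to that component's eigenvalue. That is my reading rather than something stated in the output itself, but it is easy to check against the figures in the two tables:

```python
# Loadings copied from the Component Matrix: (component 1, component 2).
loadings = {
    "umma_v1": (1.249, -0.454),
    "umma_V2": (0.692, 0.204),
    "umma_V3": (0.671, 0.241),
    "umma_V4": (0.537, 0.328),
    "UMMA_V5": (0.441, 0.201),
}

# Sum of squared loadings down each column.
ss_component_1 = sum(a * a for a, _ in loadings.values())
ss_component_2 = sum(b * b for _, b in loadings.values())

# These should be close to the eigenvalues 2.972 and .454 reported earlier.
print(round(ss_component_1, 3), round(ss_component_2, 3))
```

The sums agree with the reported eigenvalues to within rounding, which also explains why the first time point, with the largest variance, carries the largest loading.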

The first component is basically the mean with some extra weighting towards the first time point. The second compares the first value, which is pre-treatment, with the rest. So a fairly nice interpretation. Let's have a look at the scatter graph of the first against the second, but this time with the two groups C & D.


The first thing you notice is that two members of group C are a lot higher than the rest. Also there are about two values where the second component is much more extreme than the rest; these points are outliers. Apart from this, however, there does not seem to be much difference between Group C and Group D, which is precisely what we concluded in the Repeated Measures ANOVA.
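For anyone who wants to try the same idea outside SPSS, here is a minimal sketch in Python with NumPy. The data are simulated, not the UMMA data, and the function names `pca_scores` and `plot_scores` are mine. It runs a PCA on the covariance matrix of a subjects-by-visits array and returns the component scores, which can then be plotted by group.

```python
import numpy as np


def pca_scores(data, n_components=2):
    """PCA on the covariance matrix of a subjects x time-points array."""
    centred = data - data.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centred @ eigvecs[:, :n_components]
    explained = 100 * eigvals / eigvals.sum()
    return scores, explained


def plot_scores(scores, groups):
    """Scatter of component 1 against component 2, one colour per group."""
    import matplotlib.pyplot as plt  # only needed if you actually plot
    labels = np.array(groups)
    for g in sorted(set(groups)):
        mask = labels == g
        plt.scatter(scores[mask, 0], scores[mask, 1], label=f"Group {g}")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.legend()
    plt.show()


# Simulated data only: 40 subjects, 5 visits, two groups labelled C and D.
rng = np.random.default_rng(0)
subject_level = rng.normal(0.0, 1.0, size=(40, 1))   # between-subject variation
visits = subject_level + rng.normal(0.0, 0.4, size=(40, 5))
visits[:, 0] += 1.0                                  # pre-treatment visit sits higher
groups = ["C"] * 20 + ["D"] * 20

scores, explained = pca_scores(visits)
```

Calling `plot_scores(scores, groups)` then gives the kind of scatter described above, where outlying subjects and any separation between the groups should be visible before you fit the Repeated Measures ANOVA.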

So thanks, Nick: that is another tool in my kit for tackling some of the trickiest forms of analysis there are.


