Thursday, 20 June 2013

When Dummy variables make sense

Back in the early days of computers, most statisticians used to be dab hands at calculating dummy variables. It was a neat trick by which you could get a regression procedure to conduct an ANOVA! What is more, there was plenty of argument over which was the best way to calculate dummy variables. If you look in SPSS under General Linear Models and look under contrasts you can still see many of the options. They have some really complex ones, such as the Helmert, that have some nice properties like orthogonality, which is actually very nice if you have a balanced design but of no particular use if you do not! The imbalance in the design destroys the nice properties of orthogonality. The same actually happens if the sum of all the options is 1!
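As a small sketch of why balance matters (using one common Helmert convention; the exact signs and scaling vary by package), the contrast columns for a three-level factor sum to zero and, with equal group sizes, are orthogonal:

```python
# Helmert-style contrasts for a three-level factor (one common
# convention; sign and scaling vary by package).
# Rows are factor levels, columns are the two contrasts.
helmert = [(2, 0), (-1, 1), (-1, -1)]

col_sums = [sum(row[j] for row in helmert) for j in range(2)]
dot = sum(row[0] * row[1] for row in helmert)  # column 1 . column 2

print(col_sums)  # [0, 0]: each contrast sums to zero
print(dot)       # 0: with a balanced design the contrasts are orthogonal
```

With unequal group sizes the group-weighted dot product is no longer zero, which is exactly the loss of orthogonality described above.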

So a dummy variable is a simple way to assign values to particular groups. Let's suppose we have a choice of lunch boxes and we want to know what role colour and menu play in children's choices. So we have red, blue and green lunch boxes, and we have healthy (wholemeal bread, low fat spread, tuna and sweetcorn sandwich with a plain low fat yoghurt and some grapes), normal (brown bread ham sandwich with a fruit yoghurt) and bad (white bread jam sandwich with a chocolate mousse pudding) menu options. Kids are given a lunch box at random and asked to rate it. We also want to compare girls with boys.
         contrast
Boys     0
Girls    1
Now old statisticians used to know that we could have a factor with girls/boys in it, or we could have a variable that assigned 1 to girls and 0 to boys, and this would equally well tell us the difference. Now this can be extended easily to the colour of the lunch box. So for lunch boxes we might get

         contrast blue    contrast green
Red      0                0
Blue     1                0
Green    0                1
Contrast blue indicates whether the box is blue, and contrast green whether the box is green. Red has become the default value. However, for the meals there is actually a progression, so you code slightly differently:
         contrast normal    contrast bad
good     0                  0
normal   1                  0
bad      1                  1
This allows us to compare the difference that lies between good and normal and that which lies between normal and bad!
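The two coding schemes above can be sketched in a few lines of Python (hypothetical helper; the level names follow the tables above):

```python
# Treatment-style coding for colour (red = reference) and
# successive-difference-style coding for menu (good=00, normal=10, bad=11).
colour_codes = {"red": (0, 0), "blue": (1, 0), "green": (0, 1)}
menu_codes = {"good": (0, 0), "normal": (1, 0), "bad": (1, 1)}

def dummies(colour, menu):
    """Return the four main-effect dummy variables for one lunch box."""
    return colour_codes[colour] + menu_codes[menu]

print(dummies("blue", "bad"))  # (1, 0, 1, 1)
print(dummies("red", "good"))  # (0, 0, 0, 0): the reference cell
```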

The interaction between colour and menu is calculated by just multiplying one by the other, or, if you prefer, by the matrix below.
colour   menu     blue-normal   blue-bad   green-normal   green-bad
red      good     0             0          0              0
red      normal   0             0          0              0
red      bad      0             0          0              0
blue     good     0             0          0              0
blue     normal   1             0          0              0
blue     bad      1             1          0              0
green    good     0             0          0              0
green    normal   0             0          1              0
green    bad      0             0          1              1
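The multiplication rule can be sketched directly (hypothetical codes matching the tables above): each interaction column is the element-wise product of one colour dummy and one menu dummy.

```python
colour_codes = {"red": (0, 0), "blue": (1, 0), "green": (0, 1)}
menu_codes = {"good": (0, 0), "normal": (1, 0), "bad": (1, 1)}

def interaction(colour, menu):
    """Multiply every colour dummy by every menu dummy.
    Column order matches the table: blue-normal, blue-bad,
    green-normal, green-bad."""
    return tuple(c * m for c in colour_codes[colour]
                 for m in menu_codes[menu])

print(interaction("blue", "bad"))   # (1, 1, 0, 0)
print(interaction("green", "bad"))  # (0, 0, 1, 1)
print(interaction("red", "bad"))    # all zero: red is the reference
```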


This sort of recoding is time consuming, and most statistics packages now do it for you! That's right, the recoding still happens, but the computer does it rather than you needing to think about it, and packages can choose the contrasts with nice properties and then encourage you to use multiple comparison tests to find where the differences lie. This is even true of R. However, if you really want to, you can always use dummy variables rather than trusting the computer.

However, what happens when you have structural zeros, that is, combinations that do not occur? Packages like SAS and SPSS actually do the thinking for you: they set the contrast to zero and take it off the degrees of freedom. Others say "sorry, we can't fit this".

Well, you can fit it, because you just have to calculate the dummy variables yourself. Then you run a frequency on each dummy variable, drop those that contain only 0, and add the remaining terms to the model. Simple really, isn't it? I need at some stage to do the R code to demonstrate this!
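I have not written the R version yet, but the idea can be sketched in Python with made-up data: build the dummy columns, then drop any that are identically zero in the observed combinations.

```python
# Observed (colour, menu) combinations -- note no green boxes occur,
# a structural zero (entirely hypothetical data).
data = [("blue", "normal"), ("blue", "bad"), ("red", "good")]

colour_codes = {"red": (0, 0), "blue": (1, 0), "green": (0, 1)}
menu_codes = {"good": (0, 0), "normal": (1, 0), "bad": (1, 1)}

rows = [colour_codes[c] + menu_codes[m] for c, m in data]
cols = list(zip(*rows))                      # one tuple per dummy variable
keep = [j for j, col in enumerate(cols) if any(col)]

print(keep)  # indices of the dummies that carry information
```

The all-zero columns (here the green dummy) are exactly the ones a frequency table would flag, and they are simply left out of the regression.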

Thursday, 13 June 2013

Name that study design?

Design

You want to know if men outperform women in a class on physics. So you take the class at the start of the semester and give it a test on physics, give a second test halfway through, and then record exam results at the end of the semester.

Challenge

Find a protocol for such a study against which you can check your design!

Ruling out the obvious

There are several protocols you can check against which it simply will not fit. It is not a clinical trial, as there is no randomisation; subjects on entering are either male or female, and I am afraid giving gender-altering drugs would be unethical! It is not a case-control study, because the outcome variable is not male or female and there is no matching. It is not a cross-sectional study, as it does have time as a related variable.

In other words, none of the accepted study protocols apply here. This feels like madness, because the study as presented here is very close to the classic case for using the t-test in Statistics 101. What is wrong is that rather than using random allocation, the study design assumes random recruitment. That is, from the pool of all likely male students to take the course, everyone is equally likely to choose or not choose the course, and the same for all likely female students. This is of course somewhat dubious, which is why clinical trials like to have random allocation; but when you cannot alter a person's status, you cannot use random allocation.

Moving Forward

So how do you tackle it? Well, I can tell you the way we came up with. Firstly, we stated it was a comparative study. Secondly, we found that such studies are actually quite common in bio-equivalence work, although unlike those studies our null hypothesis was that the two groups were the same. So we could point to other studies with the same design.

Will the Journal understand?

I am not sure. We are being open with them. Yes, there is a double factorial design we could have used if we had also run a trial without the intervention (equivalent to testing a group of students who did not sit the physics class), and if we had done that there would have been some random allocation, but only within groups. The problem is that in finding even one group of subjects we were looking for a fairly rare event, and we might well have had to sacrifice power quite substantially if we had done that. However, the study does not tick any of the tidy boxes the journal has, and journals are getting more and more tick-box orientated.

Thursday, 6 June 2013

Rethinking Principal Components Analysis (PCA)

Right, I have used PCA quite a bit in the past. The first thing I use it for is quite simply to check the underlying structure of a scale. Many researchers put scales together where they want to sum up the scores. They run a Cronbach's alpha and then find it is something like 0.437. They then go to a statistician in tears because they have collected all this data and now the scale does not work. Or at least the sensible ones do. The statistician's job is to sort out why it does not work and to save the day by suggesting a workaround. There are a variety of tools we can use. The first is to check the scaling is all in the same direction! It is amazing how many people try to create a scale by adding variables even when some are deliberately scaled in the other direction. However, if that does not work, then often a straightforward PCA will lead you to the problem and you can then talk about ways of fixing it.
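As a hypothetical illustration of the wrong-direction problem, here is a tiny made-up four-item scale where item 4 is worded in the opposite direction; Cronbach's alpha collapses until that item is reverse-scored:

```python
from statistics import variance

# Made-up scores on a 1-5 scale; items 1-3 agree, item 4 runs backwards.
items = [
    [4, 5, 4, 1],
    [2, 2, 3, 4],
    [5, 4, 5, 1],
    [3, 3, 3, 3],
    [1, 2, 1, 5],
]

def cronbach_alpha(rows):
    """Standard formula: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(rows[0])
    cols = list(zip(*rows))
    item_var = sum(variance(c) for c in cols)
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var / total_var)

alpha_raw = cronbach_alpha(items)
fixed = [r[:3] + [6 - r[3]] for r in items]   # reverse-score item 4
alpha_fixed = cronbach_alpha(fixed)
print(alpha_raw, alpha_fixed)  # alpha jumps once the item is reversed
```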

The second use is when the data are known to be highly correlated but the actual patterns are not known. The classic example is food intakes, not nutrient intakes; these are the actual food items such as bread, sugar, milk, eggs etc. Nobody really believes there are factors underlying this, but there are patterns, and distinguishing patterns should work. We normally have to standardize the variables, as most people eat a lot more bread than they drink fruit juice! Also, quite often it is a good idea to take logs before standardizing. In this case I will apply a rotation and then hope the patterns make sense.
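A minimal sketch of the log-then-standardize step, on entirely simulated intake data (made-up numbers; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated intakes for 200 people: spread use tracks bread, juice is separate.
bread = rng.lognormal(mean=4.0, sigma=0.3, size=200)            # grams/day
spread = bread * rng.lognormal(mean=-2.0, sigma=0.2, size=200)  # tracks bread
juice = rng.lognormal(mean=1.0, sigma=0.5, size=200)            # glasses/day
X = np.column_stack([bread, spread, juice])

Z = np.log(X)                                      # log first...
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # ...then standardize

eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
print(eigvals / eigvals.sum())  # the bread/spread pattern dominates PC1
```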

Now last week I was co-teaching a class for linguists with Nick Fieller. This was fun, partly because, oddly enough, Nick and I are fairly similar in our attitude to statistics (although he is by far the brighter guy) and partly because it was in an area of his expertise. He knew far more about MDA than me, but his use of PCA was interesting. What he suggested, and what I had not thought of, was to use PCA and plotting as a method for doing exploratory statistics before carrying out a repeated measures ANOVA. Let me explain: the big problem with repeated measures ANOVA is that there are few graphs that are actually useful, because the data has at least two levels of variation. Therefore I have tended towards a suck-it-and-see approach. The principal components actually summarise the variation within subjects into several dimensions. The graphs are for a statistician, but I will work through an example. Here I am looking at a variable UMMA measured at 5 time points. The first was prior to being given medication, and everyone was given the same medication, the aim being to lower their UMMA. There were however two groups of subjects. So what we got was the following table:

Total Variance Explained (Raw)

                Initial Eigenvalues                  Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %     Total   % of Variance   Cumulative %
1           2.972       80.789          80.789       2.972       80.789          80.789
2            .454       12.345          93.134        .454       12.345          93.134
3            .109        2.971          96.104
4            .092        2.509          98.614
5            .051        1.386         100.000
Now there are some things you should note: firstly, the first component is the most important, explaining 81% of the variance; secondly, the second component explains around another 12%, so it seems to me worth having even though it is relatively small.

Now when we look at the components we get:

Component Matrix (Raw)

            Component 1    Component 2
umma_v1         1.249          -.454
umma_V2          .692           .204
umma_V3          .671           .241
umma_V4          .537           .328
UMMA_V5          .441           .201

The first component is basically the mean, with some weighting towards the first time point. The second compares the first value, which is pre-treatment, with the rest. So a fairly nice interpretation. Let's have a look at the scatter graph of the first against the second, but this time with the two groups C & D.
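I cannot share the UMMA data, but the same kind of score plot can be reproduced on simulated data with a similar structure: a subject-level mean plus a drop after the first, pre-treatment measurement (made-up numbers; assumes NumPy).

```python
import numpy as np

rng = np.random.default_rng(1)
# 40 simulated subjects x 5 time points: time 1 is pre-treatment,
# later times drop by a subject-specific amount.
level = rng.normal(10.0, 2.0, size=(40, 1))
drop = rng.normal(3.0, 1.0, size=(40, 1))
X = np.hstack([level, level - drop + rng.normal(0.0, 0.3, size=(40, 4))])

# "Raw" PCA, i.e. eigen-decompose the covariance matrix of the raw scores
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

scores = (X - X.mean(axis=0)) @ vecs[:, :2]  # the points for the scatter plot
print(vals / vals.sum())                     # first component dominates
```

Plotting the two columns of `scores`, coloured by group, gives the kind of graph discussed below: the overall level on one axis and the pre- vs post-treatment contrast on the other.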


The first thing you notice is that two members of group C are a lot higher than the rest. Also there are about two values where the second component is much larger; these points are outliers. There does not, however, seem to be much difference between Group C and Group D apart from this, which is precisely what we concluded in the repeated measures ANOVA.

So thanks, Nick, that is another tool in my kit for tackling some of the trickiest forms of analysis there are.