Thursday, 20 June 2013

When Dummy variables make sense

Back in the early days of computers, most statisticians were dab hands at calculating dummy variables. It was a neat trick by which you could get a regression procedure to conduct an ANOVA! What is more, there was plenty of argument over which was the best way to calculate dummy variables. If you look in SPSS under General Linear Models and then under contrasts you can still see many of the options. They include some really complex ones, such as the Helmert, that have nice properties like orthogonality, which is very handy if you have a balanced design but of no particular use if you do not! The imbalance in the design destroys orthogonality. The same happens if the codes of a contrast sum to one rather than zero!

So a dummy variable is a simple way to assign values to particular groups. Let's suppose we have a choice of lunch boxes and we want to know what role colour and menu play in children's choices. We have red, blue and green lunch boxes, and three menu options: healthy (wholemeal bread, low-fat spread, tuna and sweetcorn sandwich with a plain low-fat yoghurt and some grapes), normal (brown bread ham sandwich with a fruit yoghurt) and bad (white bread jam sandwich with a chocolate mousse pudding). Kids are given a lunch box at random and asked to rate it. We also want to compare girls with boys.
        contrast
Boys    0
Girls   1
Now statisticians of old knew that we could either have a factor with girls/boys in it, or a variable that assigned 1 to girls and 0 to boys, and the two would tell us equally about the difference. This extends easily to the colour of lunch box. So for lunch boxes we might get

        contrast blue   contrast green
Red     0               0
Blue    1               0
Green   0               1
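Hand-coding these indicator columns is straightforward; here is a minimal Python sketch (the function name `treatment_code` is my own invention, not from any package):

```python
def treatment_code(values, reference):
    """Treatment-code a factor: one 0/1 column per non-reference level."""
    # dict.fromkeys keeps the first-seen order of the levels.
    levels = [v for v in dict.fromkeys(values) if v != reference]
    # Each column is 1 where the observation matches that level, else 0.
    return {f"contrast_{lvl}": [int(v == lvl) for v in values]
            for lvl in levels}

boxes = ["red", "blue", "green", "blue", "red"]
dummies = treatment_code(boxes, reference="red")
print(dummies["contrast_blue"])   # [0, 1, 0, 1, 0]
print(dummies["contrast_green"])  # [0, 0, 1, 0, 0]
```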
Contrast blue indicates whether the box is blue, and contrast green whether it is green; red has become the default (reference) value. For the meals, however, there is actually a progression, so you code slightly differently:
        contrast normal  contrast bad
good    0                0
normal  1                0
bad     1                1
This allows us to estimate separately the difference that lies between good and normal and that which lies between normal and bad!
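You can check that the coefficients under this cumulative coding really are the successive differences by solving the tiny linear system directly (the group means 5, 4 and 2 below are made-up numbers, not from any real lunch-box study):

```python
import numpy as np

# Design matrix: intercept, contrast normal, contrast bad.
# Rows are the good, normal and bad menus under the cumulative coding.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
means = np.array([5.0, 4.0, 2.0])  # hypothetical mean ratings per menu

b = np.linalg.solve(X, means)
print(b)  # [ 5. -1. -2.]: good mean, normal minus good, bad minus normal
```

The intercept recovers the reference mean, and each coefficient is the step from one menu to the next.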

The interaction between colour and menu is calculated by simply multiplying one set of dummy variables by the other, giving the matrix below.
colour  menu    blue-normal  blue-bad  green-normal  green-bad
red     good    0            0         0             0
red     normal  0            0         0             0
red     bad     0            0         0             0
blue    good    0            0         0             0
blue    normal  1            0         0             0
blue    bad     1            1         0             0
green   good    0            0         0             0
green   normal  0            0         1             0
green   bad     0            0         1             1
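The multiplication is easy to check by hand; here is a small sketch in Python (the column names and the nine-row layout are mine, following the matrix above):

```python
# Dummy columns with one row per colour/menu combination, in the order
# red/blue/green crossed with good/normal/bad.
blue   = [0, 0, 0, 1, 1, 1, 0, 0, 0]
green  = [0, 0, 0, 0, 0, 0, 1, 1, 1]
normal = [0, 1, 1] * 3   # cumulative coding: normal and bad both score 1
bad    = [0, 0, 1] * 3

# Interaction columns are elementwise products of the main-effect columns.
blue_normal = [b * n for b, n in zip(blue, normal)]
blue_bad    = [b * m for b, m in zip(blue, bad)]
print(blue_normal)  # [0, 0, 0, 0, 1, 1, 0, 0, 0]
print(blue_bad)     # [0, 0, 0, 0, 0, 1, 0, 0, 0]
```

These two products reproduce the blue-normal and blue-bad columns of the matrix; the green columns work the same way.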


This sort of recoding is time consuming, and most statistics packages now do it for you! That's right: the recoding still happens, but the computer does it rather than you needing to think about it, and the packages can choose contrasts with nice properties and then encourage you to use multiple comparison tests to find where the differences lie. This is even true of R. However, if you really want to, you can always calculate the dummy variables yourself rather than trusting the computer.
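For instance, in Python, pandas will build the columns for you; a small sketch (using `pd.Categorical` only so that red becomes the reference level, as in the tables above):

```python
import pandas as pd

colour = pd.Series(pd.Categorical(["red", "blue", "green", "blue"],
                                  categories=["red", "blue", "green"]))
# drop_first=True drops the reference level, exactly like hand coding.
d = pd.get_dummies(colour, drop_first=True).astype(int)
print(list(d.columns))     # ['blue', 'green']
print(d["blue"].tolist())  # [0, 1, 0, 1]
```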

However, what happens when you have structural zeros, that is, combinations that do not occur? Packages like SAS and SPSS actually do the thinking for you: they set the contrast to zero and take it off the degrees of freedom. Others simply say "sorry, we can't fit this".

Well, you can, because you just have to calculate the dummy variables. Then you run a frequency on each dummy variable, drop those that contain only zeros, and fit with the remaining terms. Simple, really, isn't it? I need at some stage to do the R code to demonstrate this!
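The recipe can be sketched in Python (the post promises R code; this is my own numpy version, with made-up data in which the green/bad combination never occurs):

```python
import numpy as np

# One row per observed cell: (colour dummies, menu dummies) with red/good
# as reference and the green/bad cell structurally empty.
rows = [  # blue, green, normal, bad
    (0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1),  # red: good, normal, bad
    (1, 0, 0, 0), (1, 0, 1, 0), (1, 0, 1, 1),  # blue: good, normal, bad
    (0, 1, 0, 0), (0, 1, 1, 0),                # green: good, normal only
]
main = np.array(rows, dtype=float)

# Interaction columns: products of colour and menu dummies.
inter = np.column_stack([main[:, i] * main[:, j]
                         for i in (0, 1) for j in (2, 3)])
X = np.column_stack([np.ones(len(main)), main, inter])

# "Do a frequency on each dummy variable": drop columns that are all zero.
keep = X.any(axis=0)
X_reduced = X[:, keep]
print(X.shape[1] - X_reduced.shape[1])  # 1 column (green-bad) dropped
```

Only the green-bad interaction column is all zeros, so it gets dropped, and the reduced design matrix can be fitted as usual.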
