Back in the early days of computers, most statisticians were dab hands at calculating dummy variables. It was a neat trick by which you could get a regression procedure to conduct an ANOVA! What is more, there was plenty of argument over the best way to calculate the dummy variables. If you look in SPSS under General Linear Models and look at the contrasts, you can still see many of the options. They include some really complex ones, such as the Helmert, that have nice properties like orthogonality, which is very nice if you have a balanced design but of no particular use if you do not! The imbalance in the design destroys the nice properties of orthogonality. The same actually happens if the options all sum to 1!
So a dummy variable is a simple way to assign values to particular groups. Let's suppose we have a choice of lunch boxes and we want to know what role colour and menu play in children's choices. So we have red, blue and green lunch boxes, and we have healthy (wholemeal bread, low-fat spread, tuna and sweetcorn sandwich with a plain low-fat yoghurt and some grapes), normal (brown bread ham sandwich with a fruit yoghurt) and bad (white bread jam sandwich with a chocolate mousse pudding) menu options. Kids are given a lunch box at random and asked to rate it. We also want to distinguish girls from boys.
Now the old statistician knew that we could have a factor with girls/boys in it, or we could have a variable that assigned 1 to girls and 0 to boys, and this would equally tell us about the difference. This can be extended easily to the colour of lunch box. So for lunch boxes we might get
| colour | contrast blue | contrast green |
| --- | --- | --- |
| red | 0 | 0 |
| blue | 1 | 0 |
| green | 0 | 1 |
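The colour table above is easy to build by hand. Here is a minimal sketch in plain Python (the function name and example data are my own, for illustration only):

```python
# Treatment (dummy) coding by hand: one 0/1 column per non-reference level.
# The function name and example data are invented for illustration.

def dummy_code(values, reference):
    """Return one 0/1 indicator column per level other than the reference."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != reference]
    return {f"contrast {lvl}": [1 if v == lvl else 0 for v in values]
            for lvl in levels}

boxes = ["red", "blue", "green"]
print(dummy_code(boxes, reference="red"))
# {'contrast blue': [0, 1, 0], 'contrast green': [0, 0, 1]}
```

Red gets no column of its own and so becomes the reference level, exactly as in the table.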
Contrast blue indicates whether the box is blue and contrast green whether the box is green; red has become the default (reference) value. However, for the meals there is actually a progression, so you code slightly differently:
| menu | contrast normal | contrast bad |
| --- | --- | --- |
| healthy | 0 | 0 |
| normal | 1 | 0 |
| bad | 1 | 1 |
This allows us to compare the difference that lies between healthy and normal, and that which lies between normal and bad!
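A sketch of this progression coding in the same hand-rolled style (this scheme is sometimes called successive or split coding; the function and variable names here are my own):

```python
# Progression coding for an ordered factor: a contrast switches on once a
# level has passed that step in the ordering. Names are invented for
# illustration.

def progression_code(values, order):
    rank = {lvl: i for i, lvl in enumerate(order)}
    return {f"contrast {step}": [1 if rank[v] >= rank[step] else 0
                                 for v in values]
            for step in order[1:]}

menus = ["healthy", "normal", "bad"]
print(progression_code(menus, order=["healthy", "normal", "bad"]))
# {'contrast normal': [0, 1, 1], 'contrast bad': [0, 0, 1]}
```

With this coding, the coefficient on contrast normal estimates the healthy-to-normal step and the coefficient on contrast bad the normal-to-bad step.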
The interaction between colour and menu is calculated by just multiplying one contrast by the other, or if you prefer, by the matrix below.
| colour | menu | blue-normal | blue-bad | green-normal | green-bad |
| --- | --- | --- | --- | --- | --- |
| red | healthy | 0 | 0 | 0 | 0 |
| red | normal | 0 | 0 | 0 | 0 |
| red | bad | 0 | 0 | 0 | 0 |
| blue | healthy | 0 | 0 | 0 | 0 |
| blue | normal | 1 | 0 | 0 | 0 |
| blue | bad | 1 | 1 | 0 | 0 |
| green | healthy | 0 | 0 | 0 | 0 |
| green | normal | 0 | 0 | 1 | 0 |
| green | bad | 0 | 0 | 1 | 1 |
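The multiplication is literal: each interaction column is the element-wise product of one colour contrast and one menu contrast. A small sketch, with rows running red/blue/green by healthy/normal/bad as in the table above (the names are my own):

```python
# Interaction dummies as element-wise products of the main-effect dummies.
# Rows run red/blue/green by healthy/normal/bad, as in the table above.

def interact(a, b):
    return [x * y for x, y in zip(a, b)]

blue   = [0, 0, 0, 1, 1, 1, 0, 0, 0]   # colour contrast
normal = [0, 1, 1, 0, 1, 1, 0, 1, 1]   # menu contrast (progression coding)
bad    = [0, 0, 1, 0, 0, 1, 0, 0, 1]

print(interact(blue, normal))  # blue-normal column: [0, 0, 0, 0, 1, 1, 0, 0, 0]
print(interact(blue, bad))     # blue-bad column:    [0, 0, 0, 0, 0, 1, 0, 0, 0]
```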
This sort of recoding is time consuming, and most statistics packages now do it for you! That's right: the coding still happens, but the computer does it rather than you needing to think about it, and it can choose the contrasts with nice properties and then encourage you to use multiple comparison tests to find where the differences lie. This is even true of R. However, if you really want to, you can always construct the dummy variables yourself rather than trusting the computer.
However, what happens when you have structural zeros, that is, combinations that do not occur? Packages like SAS and SPSS actually do the thinking for you: they set the contrast to zero and take it off the degrees of freedom. Others just say "sorry, we can't fit this".
Well, you can fit it, because you just have to calculate the dummy variables yourself. Then you run a frequency count on each dummy variable; any that contain only zeros get dropped, and you simply add the remaining terms to the model. Simple really, isn't it? I need at some stage to write the R code to demonstrate this!
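A sketch of that workaround in the same hand-rolled style: build every dummy column, frequency-check each one, and drop the all-zero columns (the function name and data are my own invention):

```python
# Structural zeros: a dummy column for a combination that never occurs is
# all zeros, so a simple frequency check finds and drops it. The data are
# invented for illustration.

def drop_empty_dummies(columns):
    """Keep only dummy columns that actually contain a 1 somewhere."""
    return {name: col for name, col in columns.items() if any(col)}

dummies = {
    "blue-normal":  [0, 1, 0, 0],
    "blue-bad":     [0, 0, 0, 0],   # structural zero: never observed
    "green-normal": [0, 0, 1, 0],
    "green-bad":    [0, 0, 0, 1],
}
kept = drop_empty_dummies(dummies)
print(sorted(kept))  # ['blue-normal', 'green-bad', 'green-normal']
```

The surviving columns then go into the regression as usual; the fitted model simply has correspondingly fewer degrees of freedom.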