Thursday, 10 October 2013

Excel Rant

I am going to put on my professional hat briefly and repeat something that is already well documented.

Excel should always be used with caution. It annoys me that for graphs I often end up using Excel (I keep promising myself to learn an R graphics package well enough not to). I do not use it for serious statistical analysis, and I do not use it for handling large data sets, and here is why.

There is a series of papers on using Excel for statistics, and to give you some idea of the tone, the paper on Excel 2007 says
No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.
and that is in the abstract. Now things are not quite as bad as they seem, as the 2010 paper reports an improvement. It is just that I know of other errors which have not been reported, owing to Microsoft's tardiness in fixing the reported ones. Note that you need to use the new procedures.

However, I equally do not like to use Excel for data handling:
  1. It tries to load every single value into memory; yes, that means the memory is working overtime. It is also a sign to old-school programmers that it is using inefficient techniques for calculating values. That is, the mean is calculated by summing and then dividing by the number of cases. Most statistical software uses efficient updating routines such as this one by C. C. Spicer. Yes, that is from 1972, but it is more accurate as well as more efficient. If you want a twenty-minute lecture on machine accuracy I can give it to you. The short answer is that it avoids very small and very large numbers.
  2. Secondly, if you alter a value in the data sheet it then revises all the calculations in the sheet, whether or not they depend on that value. If you want to test this out, put the formula =RAND() in a cell and see how often it changes.
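The updating approach mentioned in point 1 can be sketched briefly. This is a hedged illustration in Python, not Spicer's exact published routine: the mean is revised one value at a time, so no single very large sum is ever built up.

```python
def running_mean(values):
    """Update the mean incrementally, one value at a time (in the
    spirit of Spicer's 1972 updating algorithm), rather than summing
    everything first and dividing. This avoids accumulating one very
    large intermediate sum and so loses less floating-point accuracy."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n   # incremental update using the count so far
    return mean

print(running_mean([2.0, 4.0, 6.0, 8.0]))  # 5.0
```

The update `mean += (x - mean) / n` is algebraically the same as the naive sum-then-divide, but each intermediate quantity stays close to the data's own scale.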
Those are my gripes, but the serious data handlers have their own list, and there too there is a solution in a free bit of software you can download from Microsoft. There are still problems, but Microsoft has gone some way towards solving them. That said, it has taken them over a decade, and if I know anything about data collection it is that if those new tools are not scalable then in less than a decade the current fixes will be broken.

Look, as statisticians go I have very little pride; my main package is SPSS. There are many statisticians out there who would not sully their hands with that. However, if you said the only alternative was Excel, then suddenly SPSS would look infinitely preferable.

Wednesday, 4 September 2013

Slow Down and Take Your Time

This will be a bit of a rant. I am used to people wanting everything done by yesterday! Let me be honest now: a good analysis takes time. I know data collection is time consuming, I know that computerising that information takes skill, but please do not think that you can then have your analysis in five minutes. Analysis takes time as well.

This morning I did one in an hour. This was exceptional, and there were very good reasons why I was limiting what I was doing. Firstly, this was not the first time I had done this analysis; it was really one extra variable on a dataset I had analysed at least twice already. Secondly, I did not want to look very far, as I was well aware that the numbers were small. I did not test normality, I assumed it, as the sample size was too small to test.

I am not particularly happy with the result. With small data sets such as this you can often see why things happen if you do quite specific plots. I did one, and I could not rule out that the significance found was due to just one case. So if I am honest I would be very careful about reporting the results. Rather than giving a conclusion, it looks like we have something we should investigate further. Fortunately for me I know the main researcher, and that will be his instinct as well.

However, this is the exception, and we now have to go on to the rule. When someone comes to me with less than a week's deadline I dread it. Seriously, I probably need three days out of that week just to decide on the analysis. Then once I have decided there is another run through, and then there is presenting the results. Now that is for statistics.

If you are talking qualitative, then please: you need to spend at least a week just doing the structural and first coding. This is important, this is about familiarisation, this is about getting to know your data. It takes me two to three days with limited quantitative analysis. It takes two or three times that with complex multifaceted qualitative data. What is more, it cannot be rushed. Then you have secondary and final coding, with checking facilities and further analysis.

Please if you have an analysis give it at least a month from final data entry to written reports!




Thursday, 29 August 2013

Using a Standard Measurement Scale in a Survey within a Fresh Population

Many years ago, a researcher came to me with a simple scale of overall well-being. The first question asked how well overall they felt, and the rest asked about performing specific tasks such as climbing stairs. The aim was to measure overall well-being. She was investigating the response of a specific group of patients who had leg injuries from motorcycle accidents. The scale did not seem to behave as expected. Upon investigation I found the first item was routinely scored as doing fine while the rest could have quite negative responses. This was because the group heard the general question as having the words "except for your leg" in it, even though the words were not there.

If the culture of the population you are investigating is different from the one that the measurement scale was developed on, then you need to check that you can apply it. This is particularly true if you are translating the survey, even if you back translate to check equivalence. No matter how careful your translation is, if the topic you are dealing with has been conceptualised differently then you will have trouble with the scale.

There are a number of things you can do to check it. Firstly, you can simply run a Cronbach's alpha, and if this fails then you know you are in trouble. If you want to be more thorough you can carry out a Confirmatory Factor Analysis (CFA). What you should not do is just assume that ideas from one setting translate unproblematically into another.
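The Cronbach's alpha check above is simple enough to sketch by hand. This is an illustrative Python version of the classic formula (the data below is invented, and any package would give you this for free); it also shows how a single reversed item wrecks the statistic:

```python
def cronbach_alpha(items):
    """items: a list of equal-length lists, one per scale item.
    Classic formula: alpha = k/(k-1) * (1 - sum(item variances) / var(total))."""
    k = len(items)
    n = len(items[0])
    def var(xs):                       # sample variance (n-1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Three items that agree perfectly give alpha = 1.0 ...
good = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
# ... but reverse-scoring the third item drives alpha strongly negative:
bad = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]])
```

A negative or very low alpha like this is exactly the cue to go back and check item directions before reaching for anything fancier.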

I came across another one this week. The person was looking at the way disability is viewed in this country and in a Middle Eastern country. The CFA, Cronbach's alpha and exploratory factor analysis all suggested problems. This was found after she had conducted two large surveys, and from what I can see there is also an intervention in one survey. Plus she has sampled from several subpopulations, such as health care professionals and parents of disabled children.

In some ways she is fortunate: the analysis conclusively demonstrates that disability is constructed differently in the two cultures. However, it is not enough to say they differ; we also have to say how they differ and probably why they differ. The only clue to this is to go back to each item and compare it across the cultures. So not the fancy techniques, but a great deal of fairly simple statistics. Then you try to make sense of it!

Thursday, 8 August 2013

Reviewer's comment answers questions

I have had a tough paper on the go; it is the sort that drives you up one wall and down the other. Now this paper is from a designed study. A designed study normally takes a couple of afternoons to analyse. Not this one. I have probably spent a couple of weeks on it. We are at least on our tenth round of analysis. Indeed, we sent off the paper to a journal knowing there were problems with it, but also knowing we were going around in circles analysing it.

It was rejected. Well no real surprise there! However we got the reviewers comments back.

I should explain we have an expert on the topic who conducted the study, but she is not writing the paper. She probably thinks it is too simple. The thing was, we were having to classify people after they had been recruited into analytic groups. Although we deliberately went out to recruit people with specific characteristics, we found that people were unstable between recruitment and the trial. That is, they swapped groups.

What happened was that one of the reviewer's comments stood out. It suggested an analysis we had not done, and this was after we thought we had analysed the data every which way. We did it, and the analysis popped out with the results just as we felt they should. No playing, no kludging, straight there in broad daylight.

You see the reviewer had two things:
1) Distance: we had got too close to the data and could not see the wood for the trees.
2) Wider perspective: we were too narrowly focused on the design of the study and how to analyse it. The reviewer was able to say, basically, "now X might well be interesting" and put his/her finger on the one thing we were overlooking.

So to the reviewers who spend the time to give a useful review, thank you! Even when you do not recommend acceptance!





Thursday, 20 June 2013

When Dummy variables make sense

Back in the early days of computers, most statisticians used to be dab hands at calculating dummy variables. It was a neat trick by which you could get a regression procedure to conduct an ANOVA! What is more, there was plenty of argument over which was the best way to calculate dummy variables. If you look in SPSS under General Linear Models and look under contrasts you can still see many of the options. There are some really complex ones, such as the Helmert, that have nice properties like orthogonality, which is very useful if you have a balanced design but of no particular use if you do not! The imbalance in the design destroys the nice properties of orthogonality. The same actually happens with the other options whose codes sum to 1!

So a dummy variable is a simple way to assign values to particular groups. Let's suppose we have a choice of lunch boxes and we want to know what role colour and menu play in children's choices. So we have red, blue and green lunch boxes, and we have healthy (wholemeal bread, low-fat spread, tuna and sweetcorn sandwich with a plain low-fat yoghurt and some grapes), normal (brown bread ham sandwich with a fruit yoghurt) and bad (white bread jam sandwich with a chocolate mousse pudding) menu options. Kids are given a lunch box at random and asked to rate it. We also want to compare girls with boys.
         contrast
Boys     0
Girls    1
Now the old statisticians used to know that we could have a factor with girls/boys in it, or we could have a variable that assigned 1 to girls and 0 to boys, and this would equally tell us of the difference. Now this can be extended easily to the colour of the lunch box. So for lunch boxes we might get

         contrast blue   contrast green
Red      0               0
Blue     1               0
Green    0               1
Contrast blue looks at whether the box is blue and contrast green looks at whether the box is green. Red has become the default value. However, for the meals there is actually a progression, so you code slightly differently:
         contrast normal   contrast bad
good     0                 0
normal   1                 0
bad      1                 1
This allows us to compare the difference that lies between good and normal and that which lies between normal and bad!

The interaction between colour and menu is calculated by just multiplying one by the other, or if you prefer, the matrix below.
colour   menu     blue-normal   blue-bad   green-normal   green-bad
red      good     0             0          0              0
red      normal   0             0          0              0
red      bad      0             0          0              0
blue     good     0             0          0              0
blue     normal   1             0          0              0
blue     bad      1             1          0              0
green    good     0             0          0              0
green    normal   0             0          1              0
green    bad      0             0          1              1
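The contrast tables above can be built directly in code. Here is a hedged Python illustration (the function and variable names are mine, not from any package): treatment-style dummies for colour, cumulative coding for the ordered menu, and the interaction columns formed by multiplying one by the other.

```python
# Hand-built dummy coding for the lunch-box example (illustrative sketch only).
COLOUR_DUMMIES = {"red":   (0, 0),   # red is the reference level
                  "blue":  (1, 0),   # contrast blue
                  "green": (0, 1)}   # contrast green

# Menu has a natural progression, so cumulative coding is used:
# column 1 separates good from normal, column 2 separates normal from bad.
MENU_DUMMIES = {"good":   (0, 0),
                "normal": (1, 0),
                "bad":    (1, 1)}

def design_row(colour, menu):
    """Return one row of the design matrix:
    colour dummies + menu dummies + their products (the interaction)."""
    c, m = COLOUR_DUMMIES[colour], MENU_DUMMIES[menu]
    interactions = [ci * mj for ci in c for mj in m]  # blue-normal, blue-bad, green-normal, green-bad
    return list(c) + list(m) + interactions

print(design_row("blue", "bad"))   # [1, 0, 1, 1, 1, 1, 0, 0]
```

Each row matches the corresponding line of the interaction matrix above; the regression procedure never needs to know "blue" or "bad" existed, only the 0/1 columns.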


This sort of recoding is time consuming, and most statistics packages now do it for you! That is right: they still do it, but the computer does it rather than you needing to think about it, and it can choose the contrasts with nice properties and then encourage you to use multiple comparison tests to find where the differences lie. This is even true of R. However, if you really want to, you can always use dummy variables rather than trusting the computer.

However, what happens when you have structural zeros, that is, combinations that do not occur? Packages like SAS and SPSS actually do the thinking for you: they set the contrast to zero and take it off the degrees of freedom. But others say "sorry, we can't fit this".

Well you can, because you just have to calculate the dummy variables yourself. Then you do a frequency on each dummy variable, and those which only have 0 in them get dropped, and you just add the other terms. Simple really, isn't it? I need at some stage to do the R code to demonstrate this!
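Until the promised R code appears, the idea can be sketched in Python (a hedged illustration with invented names and data): run a frequency over each dummy column and drop any column that is zero for every observed case, since it corresponds to a structural zero.

```python
def drop_empty_dummies(rows, names):
    """rows: list of dummy-coded observation rows; names: the column names.
    A dummy column that is 0 for every observed case corresponds to a
    structural zero (a combination that never occurs), so drop it
    before handing the design matrix to a regression routine."""
    keep = [j for j in range(len(names)) if any(r[j] for r in rows)]
    trimmed = [[r[j] for j in keep] for r in rows]
    return trimmed, [names[j] for j in keep]

# Toy data: the middle dummy never fires in the observed cases.
rows = [[1, 0, 0], [0, 0, 1], [1, 0, 1]]
new_rows, new_names = drop_empty_dummies(rows, ["a", "b", "c"])
print(new_names)   # ['a', 'c']
```

The dropped columns are exactly the degrees of freedom that SPSS or SAS would silently remove for you.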

Thursday, 13 June 2013

Name that study design?

Design

You want to know if men outperform women in a class on physics. So you take the class at the start of semester and give it a test on physics, give a second test half way through, and then record exam results at the end of the semester.

Challenge

Find a protocol for such a study against which you can check your design!

Ruling out the obvious

There are several protocols that you can check against which it simply will not fit. It is not a clinical trial, as there is no randomisation; subjects on entering are either male or female, and I am afraid giving gender-altering drugs would be unethical! It is not a case-control study because the outcome variable is not male or female and there is no matching. It is not a cross-sectional study as it does have time as a related variable.

In other words, none of the accepted study protocols apply here. This feels like madness, because the study as presented here is very close to the classic case for using the t-test in Statistics 101. What is wrong is that rather than using random allocation, the study design assumes random recruitment. That is, from the pool of all likely male students to take the course, everyone is equally likely to choose or not choose the course, and the same for all likely female students. This is of course somewhat dubious; that is why clinical trials like to have random allocation. But when you cannot alter a person's status then you cannot use random allocation.
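For concreteness, the Statistics 101 comparison mentioned above can be sketched as a Welch two-sample t statistic. The exam scores below are entirely invented for illustration, and this pure-Python version stops at the statistic and approximate degrees of freedom (a p-value would come from the t distribution):

```python
import math

def welch_t(x, y):
    """Welch two-sample t statistic and Welch-Satterthwaite df.
    Unequal variances are allowed; no p-value is computed here."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical end-of-semester exam scores, purely illustrative:
men   = [62.0, 55.0, 71.0, 60.0, 58.0]
women = [66.0, 70.0, 64.0, 72.0, 61.0]
t, df = welch_t(men, women)
```

The arithmetic here is exactly the textbook comparison; the whole point of the post is that its validity rests on the random-recruitment assumption, which no formula can rescue.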

Moving Forward

So how do you tackle it? Well, I can tell you the way we came up with. Firstly, we stated it was a comparative study. Secondly, we found that such studies are actually quite common in bio-equivalence, although unlike them our null was that the two groups were the same. So we could point to other studies with the same design.

Will the Journal understand?

I am not sure. We are being open with them. Yes, there is a double factorial design that we could have done if we had also run a trial without intervention (equivalent to testing a group of students who did not sit the physics class), and if we did that there would have been some random allocation, but only within groups. The problem being that in finding one group of subjects we were looking for a fairly rare event, and we might well have had to sacrifice power quite substantially if we did that. However, the study does not tick any of the tidy tick boxes the journal has, and journals are getting more and more tick-box orientated.

Thursday, 6 June 2013

Rethinking Principal Component Analysis (PCA)

Right, I have used PCA quite a bit in the past. The first thing I use it for is quite simply to check the underlying structure of a scale. Many researchers put scales together where they want to sum up the scores. They run a Cronbach's alpha and then find it is something like 0.437. They then go to a statistician in tears because they have collected all this data and now the scale does not work. Or at least the sensible ones do. The statistician's job is to sort out why it does not work and to save the day by suggesting a workaround. There are a variety of tools that we can use. The first is to check the scaling is all in the same direction! It is amazing how many people try to create a scale by adding variables even when some are deliberately scaled in the other direction. However, if that does not work, then often a straightforward PCA will lead you to the problem, and you can then talk about ways of fixing it.

The second use is when the data is known to be highly correlated but the actual patterns are not known. The classic example is food intakes, not nutrient intakes; these are the actual food items such as bread, sugar, milk, eggs etc. Nobody really believes there are factors underlying this, but there are patterns, and distinguishing patterns should work. We normally have to standardize the variables; most people eat a lot more bread than they drink fruit juice! Also quite often it is a good idea to take logs before standardizing. In this case I will apply a rotation and then hope the patterns make sense.
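The log-then-standardize step can be sketched with a toy example. The intakes below are invented, and to keep it self-contained I use the fact that for two standardized variables the PCA eigenvalues of the correlation matrix are simply 1 + r and 1 - r, so the first component's share of variance is (1 + r)/2:

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

# Invented daily intakes in grams for two foods; log-transform first,
# since intakes are skewed and on very different scales.
bread = [math.log(v) for v in [120.0, 90.0, 150.0, 110.0, 130.0]]
sugar = [math.log(v) for v in [40.0, 30.0, 55.0, 35.0, 45.0]]

r = correlation(bread, sugar)
# With two standardized variables the correlation matrix is [[1, r], [r, 1]],
# whose eigenvalues are 1 + r and 1 - r, so:
explained_first = (1 + r) / 2   # proportion of variance on the first component
```

With more than two foods you would hand the correlation matrix of the log-standardized intakes to a real PCA routine and rotate, exactly as described above; the two-variable case just makes the eigenvalue structure visible by hand.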

Now, last week I was co-teaching a class for linguists with Nick Fieller. This was fun, partly because oddly enough Nick and I are fairly similar in our attitude to statistics (although he is by far the brighter guy) and partly because it was in an area of his expertise. He knew far more about MDA than me, but his use of PCA was interesting. What he suggested, and what I had not thought of, was to use PCA and plotting as a method for doing exploratory statistics before carrying out a Repeated Measures ANOVA. Let me explain: the big problem with Repeated Measures ANOVA is that there are few graphs that are actually useful, because the data has at least two levels of variation. Therefore I have tended to take a suck-it-and-see approach. The principal components actually summarise the variation within subjects into several dimensions. The graphs are for a statistician, but I will work through an example. Here I am looking at a variable UMMA measured at 5 time points. The first was prior to being given medication, and everyone was given the same, the aim being to lower their UMMA. There were, however, two groups of subjects. So what we got was the following table:

Total Variance Explained

                  Initial Eigenvalues                      Extraction Sums of Squared Loadings
Component         Total    % of Variance   Cumulative %    Total    % of Variance   Cumulative %
Raw   1           2.972    80.789           80.789         2.972    80.789           80.789
      2            .454    12.345           93.134          .454    12.345           93.134
      3            .109     2.971           96.104
      4            .092     2.509           98.614
      5            .051     1.386          100.000
Now there are some things you should note. Firstly, the first component is the most important, explaining 81% of the variance. Secondly, the second explains around another 12%, so it seems to me worth keeping even though it is relatively small.

Now when we look at the components we get:

Component Matrix

                    Component
                    1         2
Raw   umma_v1       1.249     -.454
      umma_V2        .692      .204
      umma_V3        .671      .241
      umma_V4        .537      .328
      UMMA_V5        .441      .201
The first component is basically the mean, with some weighting towards the first time point. The second compares the first value, which is pre-treatment, with the rest. So a fairly nice interpretation. Let's have a look at the scatter graph of the first component against the second, but this time with the two groups C and D.


The first thing you notice is that two members of group C are a lot higher than the rest. Also, there are about two values where the second component is much larger; these points are outliers. There does not, however, seem to be much difference apart from this between group C and group D, which is precisely what we concluded in the Repeated Measures ANOVA.
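The outlier-spotting done by eye on that scatter graph can also be done mechanically. This is a hedged sketch (invented scores, my own function name, and a simple 2-standard-deviation rule rather than anything Nick suggested):

```python
def flag_outliers(scores, cutoff=2.0):
    """scores: list of (pc1, pc2, ...) tuples of component scores,
    one per subject. Flag subjects more than `cutoff` standard
    deviations from the mean on any retained component -- the
    points that stand out on the scatter plot."""
    k = len(scores[0])
    n = len(scores)
    flagged = set()
    for j in range(k):
        col = [row[j] for row in scores]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        for i, v in enumerate(col):
            if abs(v - mean) > cutoff * sd:
                flagged.add(i)
    return sorted(flagged)

# Ten unremarkable subjects plus one extreme on the first component:
scores = [(0.1, 0.0), (0.2, -0.1), (-0.1, 0.1), (0.0, 0.1), (-0.2, 0.0),
          (0.1, -0.1), (0.0, 0.0), (-0.1, -0.1), (0.2, 0.1), (-0.2, 0.1),
          (6.0, 0.0)]
print(flag_outliers(scores))   # [10]
```

A flag like this is only a prompt to go back and look at the subject, exactly as the plot was; it decides nothing on its own.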

So thanks, Nick: that is another tool in my kit for tackling some of the trickiest forms of analysis there are.