Thursday, 9 January 2014

Checking your work when using procedures (or watch out for star join)

Now I know most people's data sets are relatively small scale; you are lucky if you have a hundred cases. However, every so often you get huge ones. We have them for things like computer usage within the department, linguistics corpus data produces them, and I have a number of purchase transactions from a cafeteria.

Now I tend to handle these in SPSS, and until recently SPSS was reliable in handling them. That is, if the files got corrupted while running a procedure, 99% of the time it was me who had messed up! Most of the time when I merged files, if there was a problem it was either:
  1. Due to the fact that I had not sorted the data, or
  2. Unexpected duplicate cases: either extra blank cases at the bottom, or a duplicate where there should not be one.
However, SPSS just used to tell me I was doing things wrong: go back and sort it out.
Sorting data sets, especially large ones, before you merge is tedious, but that is all. Yes, I have been in the make-a-cup-of-coffee crowd and seen how much you drink while SPSS is sorting. I handled data sets with over 600,000 cases almost two decades ago on PCs. The machine used to sit in the office chuntering as it processed the data and I got on with other work. I would check it every time I wanted a coffee, but normally it ran quite happily on its own. Due to the way that analysis worked I had to invert the file at one stage by sorting.

Now SPSS decided that forcing people to do this sort of tedious work and wait around was not on, so it developed "star join". Not only that, they put it up as the default option via the menus in SPSS. It may work for small datasets with a few cases in them. It does not work for large ones, in my experience. The result is that instead of one tedious sort command beforehand, I now have to either:
  1. Run a frequency and other checks that the data is as expected, or
  2. Sort the data as I always did and remember to uncheck the boxes so the dialogue actually uses the old way that worked.
No, it has not improved things, and as I missed this until it started giving me what I knew were wrong results, I am not happy. If you are going to change a default then change it to something that works. Before someone says take this up with SPSS: I beta tested this particular version of SPSS (21) and I reported the problem then.
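The two checks in the list above are easy to automate outside SPSS as well. Here is a minimal Python sketch, with made-up keys rather than any real data, of the sorted-order and duplicate-key checks I would want before trusting any merge:

```python
# Hypothetical mini data sets to be merged on case_id (made-up values).
left = [(3, 5.0), (1, 2.5), (2, 4.0)]    # (case_id, spend)
right = [(1, "A"), (2, "B"), (2, "B")]   # (case_id, site)

left_keys = [k for k, _ in left]
right_keys = [k for k, _ in right]

# Check 1: is each file sorted on the merge key?
left_sorted = left_keys == sorted(left_keys)
right_sorted = right_keys == sorted(right_keys)

# Check 2: unexpected duplicate keys, a classic cause of merge trouble.
seen, dup_keys = set(), []
for k in right_keys:
    if k in seen:
        dup_keys.append(k)
    seen.add(k)
```

Here the left file fails the sort check and the right file has a duplicate key, which is exactly the sort of thing you want flagged before the merge runs, not after.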


Thursday, 10 October 2013

Excel Rant

I am going to briefly put on my professional hat and repeat something that is clearly out there.

Excel should always be used with caution. It annoys me that for graphs I often end up using Excel (I keep promising myself to learn an R graphics package well enough not to). I do not use it for serious statistical analysis, and I do not use it for handling large data sets, and here is why.

There is a series of papers on using Excel for statistics, and to give you some idea of the tone, the paper on Excel 2007 says:
"No statistical procedure in Excel should be used until Microsoft documents that the procedure is correct; it is not safe to assume that Microsoft Excel's statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package."
and that is in the abstract. Now things are not quite as bad as they seem, as the 2010 paper reports an improvement. It is just that I know of other errors which have not been reported, due to Microsoft's tardiness in fixing the reported ones. Note you need to use the new procedures.

However, I equally do not like to use Excel for data handling:
  1. It tries to load every single value into memory; yes, that means the memory is working overtime. It is also a sign to old-school programmers that it is using inefficient techniques for calculating values. That is, the mean is calculated by summing everything and then dividing by the number of cases. Most statistical software uses efficient updating routines, such as the one by C. C. Spicer. Yes, that is from 1972, but it is more accurate as well as more efficient. If you want a twenty-minute lecture on machine accuracy I can give it to you. The short answer is that it avoids very small and very large numbers.
  2. Secondly, if you alter a value in the data sheet it then revises all the calculations in the sheet, whether or not they depend on that value. If you want to test this out, put the formula =RAND() in a cell and see how often it changes.
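To show what I mean by an updating routine, here is a Python sketch of a one-pass mean-and-variance update (a Welford-style update, which is a close relative of the Spicer approach rather than his exact algorithm). The point is that the running values are revised case by case, so no enormous intermediate sums ever appear:

```python
# One-pass updating mean and variance: no giant running sum, so the
# intermediate values stay the same order of magnitude as the data.
def running_mean_var(values):
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n          # update the mean in place
        m2 += delta * (x - mean)   # accumulate squared deviations
    var = m2 / (len(values) - 1) if len(values) > 1 else 0.0
    return mean, var

m, v = running_mean_var([2.0, 4.0, 6.0, 8.0])  # mean 5.0, variance 20/3
```

Sum-then-divide gives the same answer on toy data like this; the difference shows up with hundreds of thousands of cases, where the naive sum grows huge and starts to lose precision.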
Those are my gripes, but the serious data handlers have their own list, and there too is a solution in a free bit of software you can download from Microsoft. There are still problems, but Microsoft has gone some way to solving them. That said, it has taken them over a decade, and if I know anything about data collection, it is that if those new things are not scalable then in less than a decade the current fixes will be broken.

Look, as statisticians go I have very little pride; my main package is SPSS. There are many statisticians out there who would not sully their hands with that. However, if you said the only alternative was Excel, then suddenly SPSS looks infinitely preferable.

Wednesday, 4 September 2013

Slow Down and take Your Time

This will be a bit of a rant. I am used to people wanting everything done by yesterday! Let me be honest now: a good analysis takes time. I know data collection is time consuming, I know that computerising that information takes skill, but please do not think that you can then get your analysis in five minutes. Analysis takes time as well.

This morning I did one in an hour. This was exceptional. There were also very good reasons why I was limiting what I was doing. Firstly, this was not the first time I had done this analysis; it was really just an extra variable on a dataset I had analysed at least twice already. Secondly, I did not want to look very far, as I was well aware that the numbers were small. I did not test normality, I assumed it, as the sample size was too small to test.

I am not particularly happy with the result. With small data sets such as this you can often see why things happen if you do quite specific plots. I did one, and I could not rule out that the significance found was due to just one case. So if I am honest I would be very careful about reporting the results. Rather than giving a conclusion, it looks like we have something we should investigate further. Fortunately for me I know the main researcher, and that will be his instinct as well.
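The worry about one case driving a result can be checked quite mechanically: drop each case in turn and see how far the test statistic moves. Here is a Python sketch with entirely made-up numbers (not the study's data) and a hand-rolled pooled t statistic, purely to show the shape of the check:

```python
import math

def t_stat(x, y):
    # Plain two-sample t statistic with pooled variance; the statistic
    # alone is enough for a leave-one-out sensitivity check.
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    sx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    sy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = math.sqrt(((nx - 1) * sx + (ny - 1) * sy) / (nx + ny - 2))
    return (mx - my) / (sp * math.sqrt(1 / nx + 1 / ny))

# Made-up illustrative groups; 6.8 is the suspicious-looking case.
a = [5.1, 4.9, 5.0, 5.2, 4.8]
b = [5.3, 5.4, 5.2, 5.5, 6.8]

baseline = t_stat(a, b)
# Drop each case from b in turn and recompute the statistic.
loo = [t_stat(a, b[:i] + b[i + 1:]) for i in range(len(b))]
```

If one of the leave-one-out statistics sits far away from the rest, the result hangs on that single case, and you should report it with exactly the caution described above.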

However, this is the exception, and we now have to go on to the rule. When someone comes to me with less than a week's deadline I dread it. Seriously, I probably need three days out of that week just to decide on the analysis. Then once I have decided there is another run through, and then there is presenting the results. And that is for statistics.

If you are talking qualitative, then please, you need to spend at least a week just doing the structural and first coding. This is important: this is about familiarisation, this is about getting to know your data. It takes me two to three days with limited quantitative analysis. It takes two or three times that with complex, multifaceted qualitative data. What is more, it cannot be rushed. Then you have secondary and final coding, with checking, and further analysis.

Please if you have an analysis give it at least a month from final data entry to written reports!




Thursday, 29 August 2013

Using a standard Measurement Scale in a Survey within a fresh Population

Many years ago, a researcher came to me with a simple scale of overall well-being. The first question asked how well overall they felt, and the rest asked about performing specific tasks such as climbing stairs. The aim was to measure overall well-being. She was investigating the responses of a specific group of patients who had leg injuries from motorcycle accidents. The scale did not seem to behave as expected. Upon investigation I found the first item was routinely scored as doing fine while the rest could have quite negative responses. This was because the group heard the general question as having the words "except for your leg" in it, even though the words were not there.

If the culture of the population you are investigating is different from the one that the measurement scale was developed on, then you need to check you can apply it. This is particularly true if you are translating the survey, and it remains true even if you back-translate to check equivalence. No matter how careful your translation is, if the topic you are dealing with has been conceptualised differently then you will have trouble with the scale.

There are a number of things you can do to check it. Firstly, you can simply run a Cronbach's alpha, and if this fails then you know you are in trouble. If you want to be more thorough you can carry out a Confirmatory Factor Analysis (CFA), but what you should not do is just assume that ideas from one setting translate unproblematically into another.
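Cronbach's alpha is simple enough that it is worth seeing the whole calculation in one place. A Python sketch with hypothetical toy scores (three items, five respondents; any real check would of course use your survey data):

```python
def cronbach_alpha(items):
    # items: one list per scale item, each holding one score per respondent.
    k = len(items)
    n = len(items[0])

    def var(xs):
        # Sample variance (n - 1 denominator).
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Each respondent's total score across all items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

alpha = cronbach_alpha([[4, 5, 3, 4, 5],
                        [4, 4, 3, 5, 5],
                        [5, 5, 2, 4, 4]])
```

A very low (or negative) alpha in the new population is the quick warning sign that the scale is not holding together the way it did where it was developed.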

I came across another one this week. The person was looking at the way disability was viewed in this country and in a Middle Eastern country. The CFA, Cronbach's alpha and an exploratory factor analysis all suggested problems. This was found after she had conducted two large surveys, and from what I can see there is also an intervention in one survey. Plus she has sampled from several subpopulations, such as health care professionals and parents of disabled children.

In some ways she is fortunate: the analysis conclusively demonstrates that disability is constructed differently in the two cultures. However, it is not enough to say they differ; we also have to say how they differ, and probably why they differ. The only way to get at this is to go back to each item and compare it across the cultures. So not the fancy techniques, but a huge amount of fairly simple statistics. Then you try to make sense of it!

Thursday, 8 August 2013

Reviewer's comment answers questions

I have had a tough paper on the go; it is the sort that drives you up one wall and down the other. Now this paper is from a designed study. A designed study normally takes a couple of afternoons to analyse. Not this one. I have probably spent a couple of weeks on it. We are at least on our tenth round of analysis. Indeed, we sent off the paper to a journal knowing there were problems with it, but also knowing we were going around in circles analysing it.

It was rejected. Well no real surprise there! However we got the reviewers comments back.

I should explain we have an expert on the topic who conducted the study, but she is not writing the paper. She probably thinks it is too simple. The thing was, we were having to classify people after they had been recruited into analytic groups. Although we deliberately went out to recruit people with specific characteristics, we found that people were unstable between recruitment and the trial. That is, they swapped groups.

What happened was that one of the reviewer's comments stood out. It suggested an analysis we had not done, and this was after we thought we had analysed the data every which way. We did it, and the analysis popped out with the results just as we felt they should be. No playing, no kludging, straight there in broad daylight.

You see, the reviewer had two things:
1) Distance: we had got too close to the data and could not see the wood for the trees.
2) Wider perspective: we were too narrowly focused on the design of the study and how to analyse it. The reviewer was able to say, basically, "now X might well be interesting" and put his or her finger on the one thing we were overlooking.

So to reviewers who spend the time to give a useful review: thank you! Even when you do not recommend acceptance!





Thursday, 20 June 2013

When Dummy variables make sense

Back in the early days of computers, most statisticians used to be dab hands at calculating dummy variables. It was a neat trick by which you could get a regression procedure to conduct an ANOVA! What is more, there was plenty of argument over the best way to calculate dummy variables. If you look in SPSS under General Linear Models and look under contrasts you can still see many of the options. They have some really complex ones, such as the Helmert, that have some nice properties like orthogonality, which is actually very nice if you have a balanced design but of no particular use if you do not! The imbalance in the design destroys the nice properties of orthogonality. The same actually happens to all the other options that sum to 1!

So a dummy variable is a simple way to assign values to particular groups. Let's suppose we have a choice of lunch boxes and we want to know what role colour and menu play in children's choices. So we have red, blue and green lunch boxes, and we have healthy (wholemeal bread, low fat spread, tuna and sweetcorn sandwich with a plain low fat yoghurt and some grapes), normal (brown bread ham sandwich with a fruit yoghurt) and bad (white bread jam sandwich with a chocolate mousse pudding) menu options. Kids are given a lunch box at random and asked to rate it. We also want to compare girls with boys.
        contrast
Boys    0
Girls   1
Now the old statisticians used to know that we could have a factor with girls/boys in it, or we could have a variable that assigned 1 to girls and 0 to boys, and this would equally tell us the difference. This can be extended easily to the colour of the lunch box. So for lunch boxes we might get

        contrast blue   contrast green
Red     0               0
Blue    1               0
Green   0               1
Contrast blue looks at whether the box is blue and contrast green looks at whether the box is green. Red has become the default value. However, for the meals there is actually a progression, so you code slightly differently:
        contrast normal   contrast bad
good    0                 0
normal  1                 0
bad     1                 1
This allows us to compare the difference that lies between good and normal and that which lies between normal and bad!
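In code, the two coding schemes above amount to nothing more than a pair of lookup tables. A Python sketch using the hypothetical lunch boxes:

```python
# Indicator ("treatment") coding for colour: red is the reference level.
colour_codes = {"red": (0, 0), "blue": (1, 0), "green": (0, 1)}

# Cumulative coding for menu, reflecting the good -> normal -> bad progression.
menu_codes = {"good": (0, 0), "normal": (1, 0), "bad": (1, 1)}

def dummies(colour, menu):
    # The four main-effect dummy variables for one lunch box.
    return colour_codes[colour] + menu_codes[menu]

row = dummies("blue", "bad")  # (1, 0, 1, 1)
```

With the cumulative menu coding, the "normal" dummy carries the good-to-normal step and the "bad" dummy carries the normal-to-bad step, which is exactly the pair of comparisons described above.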

The interactions between colour and menu are calculated by just multiplying one set of dummies by the other, or if you prefer, by the matrix below.
colour  menu    blue-normal  blue-bad  green-normal  green-bad
red     good    0            0         0             0
        normal  0            0         0             0
        bad     0            0         0             0
blue    good    0            0         0             0
        normal  1            0         0             0
        bad     1            1         0             0
green   good    0            0         0             0
        normal  0            0         1             0
        bad     0            0         1             1


This sort of recoding is time consuming, and most statistics packages now do it for you! That is right: they still do it, but the computer does it rather than you needing to think about it, and the packages can choose contrasts with nice properties and then encourage you to use multiple comparison tests to find where the differences lie. This is even true of R. However, if you really want to, you can always use dummy variables rather than trusting the computer.

However, what happens when you have structural zeros, that is, combinations that do not occur? Packages like SAS and SPSS actually do the thinking for you: they set the contrast to zero and take it off the degrees of freedom. Others just say "sorry, we can't fit this".

Well, you can fit it, because you just have to calculate the dummy variables yourself. Then you do a frequency on each dummy variable, drop those that only contain 0, and add the other terms to the model. Simple really, isn't it? I need at some stage to do the R code to demonstrate this!
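In the meantime, here is the same trick sketched in Python against the lunch-box example, with green/bad assumed never to occur (the structural zero is an assumption made purely for illustration):

```python
# Build every dummy (main effects and interactions), then drop any
# column that is all zero before fitting: the structural-zeros trick.
def interaction(colour_d, menu_d):
    # Products in the order blue-normal, blue-bad, green-normal, green-bad.
    return tuple(c * m for c in colour_d for m in menu_d)

colour_codes = {"red": (0, 0), "blue": (1, 0), "green": (0, 1)}
menu_codes = {"good": (0, 0), "normal": (1, 0), "bad": (1, 1)}

# Hypothetical observed combinations: green/bad never occurs.
observed = [("red", "good"), ("blue", "normal"), ("blue", "bad"),
            ("green", "good"), ("green", "normal")]

rows = [colour_codes[c] + menu_codes[m]
        + interaction(colour_codes[c], menu_codes[m])
        for c, m in observed]

# The "frequency" step: keep only columns with at least one non-zero entry.
keep = [j for j in range(len(rows[0])) if any(r[j] for r in rows)]
design = [[r[j] for j in keep] for r in rows]
```

The green-bad interaction column comes out all zero, so it gets dropped, and the remaining columns go into the model, which is exactly the set-to-zero, lose-a-degree-of-freedom behaviour SAS and SPSS give you automatically.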

Thursday, 13 June 2013

Name that study design?

Design

You want to know if men outperform women in a class on physics. So you take the class at the start of the semester and give it a test on physics, give a second test half way through, and then record exam results at the end of the semester.

Challenge

Find a protocol for such a study against which you can check your design!

Ruling out the obvious

There are several protocols that you can check against which it simply will not fit. It is not a clinical trial, as there is no randomisation; subjects on entering are either male or female, and I am afraid giving gender-altering drugs would be unethical! It is not a case-control study, because the outcome variable is not male or female and there is no matching. It is not a cross-sectional study, as it does have time as a related variable.

In other words, none of the accepted study protocols apply here. This feels like madness, because the study as presented here is very close to the classic case for using the t-test in Statistics 101. What is wrong is that rather than using random allocation, the study design assumes random recruitment. That is, from the pool of all likely male students to take the course, everyone is equally likely to choose or not choose the course, and the same for all likely female students. This is of course somewhat dubious; that is why clinical trials like to have random allocation, but when you cannot alter a person's status then you cannot use random allocation.

Moving Forward

So how do you tackle it? Well, I can tell you the way we came up with. Firstly, we stated it was a comparative study. Secondly, we found that such studies are actually quite common in bio-equivalence, although unlike them our null was that the two groups were the same. So we could point to other studies with the same design.

Will the Journal understand?

I am not sure. We are being open with them. Yes, there is a double factorial design that we could have used if we had also run a trial without the intervention (equivalent to testing a group of students who did not sit the physics class), and if we had done that there would have been some random allocation, but only within groups. The problem is that in finding one group of subjects we were looking for a fairly rare event, and we might well have had to sacrifice power quite substantially if we did that. However, the study does not tick any of the tidy tick boxes the journal has, and journals are getting more and more tick-box orientated.