Rumours of Research: September 2012

Wednesday, 19 September 2012

Rant: On Simple Statistical Packages

There was a time when I was ready to look at every new statistical package that came along, well, almost. The longer I have been supporting statistical packages, the more cautious I have become. Let me be straight with you: even the big grown-up packages like SPSS get it wrong surprisingly often, and new packages aimed at those doing simple statistics get it wrong a lot more of the time.

Now let me be honest, there are some superb statistics packages out there that are newer. I think of Stata for starters. The thing that distinguishes them is that they are not trying to produce a simple tool for researchers but take statistical complexity seriously! These packages are not any easier than SPSS to use, but when used properly they are powerful tools. What really gets me is the people who think "let's put together something simple for the researcher and forget about the rest of the statistics".

Today I had a first query on a package that is popular with the medical school. The person was doing a simple one-way ANOVA and then doing post hoc tests. He came to me with his data because he could not get p-values out for the Bonferroni post hoc tests. I suppose I could have explained that Bonferroni does not produce p-values but alters significance levels, and that was why he was not getting any results. He also had a significant F value and no significant post hoc comparisons.

Well, Bonferroni is highly conservative; even the more accurately calculated p-value of Sidak is less conservative, and yes, the difference between the two is that Sidak calculates an actual p-value while Bonferroni uses an approximation. Fortunately the package gave t-values (but no degrees of freedom), so I was able to find LSD p-values and then carry out a Holm sequential test and calculate an inverse Sidak (many of the inverse Bonferroni values would have been greater than 1; I told you it was an approximation). All this is do-able if you are willing to kludge it in Excel.
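For anyone who would rather not kludge it in Excel, here is a rough sketch of the three adjustments in Python. It is not what the package or I actually ran: the function names and the three raw p-values are made up for illustration, and I am assuming the "inverse" versions mean the usual adjusted p-values (raw p times the number of comparisons for Bonferroni, 1 - (1 - p)^m for Sidak).

```python
def bonferroni(p_values, cap=True):
    """Bonferroni-adjusted p-values: raw p times the number of comparisons.

    With cap=False the products are left alone, so they can exceed 1.
    """
    m = len(p_values)
    adjusted = [p * m for p in p_values]
    return [min(1.0, p) for p in adjusted] if cap else adjusted


def sidak(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)^m, which can never exceed 1."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def holm(p_values):
    """Holm sequential adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical unadjusted (LSD-style) p-values for three pairwise comparisons.
raw = [0.012, 0.030, 0.450]

print("Bonferroni (uncapped):", bonferroni(raw, cap=False))
print("Bonferroni (capped):  ", bonferroni(raw))
print("Sidak:                ", sidak(raw))
print("Holm:                 ", holm(raw))
```

With these made-up numbers the uncapped Bonferroni value for the third comparison comes out above 1, the Sidak values sit just below the Bonferroni ones, and Holm is less conservative than either for everything except the smallest p-value.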

What would have been more sensible is if it had done what packages like SPSS do, which is to give a variety of post hoc tests. He could have done a standard one such as Tukey and got p-values straight out. This is what I did once I had established that was what he wanted, and I also showed him how to draw the graph, using SPSS to estimate the terms and a graphics package to actually draw it (well, Excel in this case, but any graphics package does this). That was so simple he was delighted.

So the first thing against the package is that it does not even give the standard comparative post hocs that many of the main ones do, instead using an old-fashioned (think over fifty years ago) technique which has long been discarded unless there are no other options.

However, I saw more: to do this ANOVA the user had to put each group into a separate column. This immediately sets off warning bells with me. The one thing I find nearly all these packages don't do is deal with levels of variation. When they come to ANOVA they basically treat a paired t-test as a group t-test. NOT GOOD. This leads to wrong interpretation. In fact the wrong presentation of within-subject comparisons is a major bugbear in the literature, and in this case confidence intervals have not been an improvement!
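To show why this matters, here is a small sketch with invented numbers (five subjects measured twice; the data and function names are purely illustrative, not from the query I was dealing with). It computes the t statistic both ways using only the Python standard library: once treating the two columns as unrelated groups, and once as paired measurements on the same subjects.

```python
from math import sqrt
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Pooled-variance two-sample t statistic (treats a and b as unrelated groups)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def paired_t(a, b):
    """Paired t statistic: a one-sample t on the within-subject differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Invented data: the same five subjects measured before and after.
# Subjects differ a lot from each other, but each changes by only 1 or 2.
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [11.0, 22.0, 31.0, 42.0, 51.0]

print("treated as two groups:", round(independent_t(before, after), 3))
print("treated as paired:    ", round(paired_t(before, after), 3))
```

The mean difference is the same in both cases; what changes is the variation it is compared against. The group version swamps the small, consistent change with between-subject variation, while the paired version only looks at variation in the differences.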

To be polite, I will not be recommending this particular package for statistical analysis. I suspect that if you want to do a proper statistical analysis of all but the most constrained of research, then you REALLY do need to put in the time and effort to learn a proper statistical package. I am not fussed which; I will even allow SPSS*, but please do not bring me something that you say is easier. So far all such packages I have seen, in their attempts to simplify, mislead researchers into doing WRONG analyses.

*When I was a postgrad SPSS had a reputation for doing this sort of thing; now, some twenty years on, SPSS has developed and there are so many more very poor packages out there that I do not feel this can be sustained.

Thursday, 13 September 2012

Authorship and Culture in Research

I am in the funny position of being a researcher whose value is not assessed on the number of papers I am an author on. Therefore I am not particularly interested in authorship, but it is useful to know the number of papers and acknowledgements, as it becomes a measure for the department to say that it is contributing to research at the University. Eventually I suspect that I will have to keep detailed notes, but at present it is just a matter of reporting.

There is, however, a distinct divide between medicine and the arts and social sciences. In medicine these days I get authorship on about half the papers. The reason for this is that I am a statistician, and we have found over the years that having a statistician as an author tends on the whole to reduce the number of analysis queries from reviewers. Then in medicine it is normal to have many authors on a paper. Also it is easy to track papers in medicine as they are all online!

Now come to the arts and social sciences. I rarely if ever am an author, but I am frequently cited in acknowledgements. This is fine. It is the culture of the discipline and the way it works. There is a slight problem in that usually I do not know if I am acknowledged unless I search for myself and then find it. However, the proportion of published papers that are online in the arts and social sciences is a lot lower. It would be a courtesy if they sent me an email when they acknowledged me in a paper!

Thursday, 6 September 2012

Systems and getting things coordinated

This is a time when I am going to talk of the frustrations of running something efficiently. One software package I keep a watching eye on is SAS. Now let me be honest, I do not use SAS, but I do give some sort of support and, more importantly, I do look after the distribution of license codes at this university. As far as the University is concerned SAS is a bit of an awkward package. It is expensive (it costs us about twice what SPSS does) and it is used by very few people here. I make it around the 30 mark; given that we have hundreds for SPSS, that is a very small population, and we probably have a larger Stata one, who buy it themselves. However, those users are scattered across many departments! The result is that if we buy a license, it costs about a third of what it would cost the University if each group bought their own.

However, for a group this size it is not worth putting a technically complicated system in place. So we have an email list, and each year I subscribe the people who ask for the SAS license to it, removing people who had it the previous year after I have sent the email that says the new codes are in. In my absence the helpdesk can do this.

You'd think, given that there are so few of them, they would be on a single version, but no, we have people on 9.1.3, 9.2 and 9.3 (some of them are running two versions). Then when I do this I have to keep everyone in the department who needs to know in the loop.

I do not know when the codes should come. I used to, but SAS kept changing the date, so this year I was not aware we were running out until I started getting emails. Then I have to ask for the codes from the person in charge of licensing, and finally I need to contact everyone to say the new codes are in.

Then people of course start asking for codes which we did not get first off, so I have to go through the loop again. Then there are a couple of weeks while people get around to asking for the codes. So it takes time.