Thursday, 25 April 2013

Something lost between analysis and reviewer

This is an unfinished tale. About a year ago I did an analysis for human nutrition. I am not going into the topic of the paper as it has not been published. I am going to try to make sense of a reviewer's comments.

Let's get this clear: normally reviewers' comments are useful; they strengthen a paper. The problem is that I can't relate this reviewer's comments to the analysis. He talks of omnibus tests and then comparisons within them. I did no omnibus tests! I have been through the output several times and they do not exist.

So I am now trying to find what he is talking about. This means sorting through the differences between what I have written in the report, what has been lifted for the paper and what the reviewer's comments are. Somewhere between the three is something I am not getting, and I have not been able to place it. Until it is placed I can't deal with what I need to in the analysis.

At present I suspect that the reviewer has seen the raw data and worked out the chi-squared for that, then has gone to the paper and assumed that the chi-squared quoted there comes from the raw data, when in fact it is the chi-squared for the adjusted logistic regression.

The alternative is that they have assumed that something is a set of groups when in fact it is a set of presences and absences. There are two different ways of coding for different traits. If the traits are mutually exclusive you can code them within one variable, with each trait given a different coding category. So for instance, if you asked for favourite colour you might have the options: red, blue, green, yellow, pink, purple, brown and black. You might still get the awkward respondent who says "orange", but that is just another category. As you have asked for a favourite, you do not get the answer "red and blue". However, if you ask the question "what colour crayons do you have?" then the answer "red and blue" is totally sensible. In other words, having a red crayon does not exclude someone from having a blue crayon, so they are not mutually exclusive categories.
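In R terms, the two codings look quite different. A minimal sketch with made-up data: the favourite-colour case is a single factor, while the crayon case is one presence/absence indicator per colour.

```r
# Mutually exclusive: one variable, each respondent gets exactly one category
favourite <- factor(c("red", "blue", "red", "green"),
                    levels = c("red", "blue", "green", "yellow"))
table(favourite)

# Not mutually exclusive: one 0/1 indicator per trait,
# so a respondent can have both red and blue crayons
crayons <- data.frame(red   = c(1, 1, 0, 1),
                      blue  = c(1, 0, 0, 1),
                      green = c(0, 0, 1, 0))
colSums(crayons)  # counts per colour; rows can sum to more than 1
```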

Now when you have mutually exclusive categories you get omnibus tests. When you do not have exclusive categories you do not. Now I cannot recall any mutually exclusive categories; all categories could be present or absent in any individual. Indeed I created a variable in one part that specifically counted up the number of different categories of other variables that each individual had, and split it between those that had only one and those that had multiple. There were about eight different characteristics. That means covering every possible combination of present and absent would give 2^8 - 1 = 255 different combinations, which implies an omnibus test with 255 different levels.
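That blow-up is easy to check in R by enumerating every present/absent pattern across eight traits:

```r
# All 0/1 patterns over 8 traits, minus the all-absent pattern
patterns <- expand.grid(rep(list(c(0, 1)), 8))
nrow(patterns) - 1  # 2^8 - 1 = 255 combinations
```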

No thanks.

Oh well, maybe my brain will be clearer on Monday and I can find out what is actually going on. That, or I am going to ask the prof if the reviewer would mind making their comments clearer.

Thursday, 4 April 2013

Learning R, the Good, the Bad and the Ugly

A long time ago, when I was a young statistician, there was a package called GLIM; it was a very nice package in some ways, although limited. What was nice about it was that it allowed a statistician to follow a coherent approach when modelling, without most of the time having to work with complex matrices. I got used to making it sing and dance for me in my first job. So there is part of me that wants a powerful version of that package.

However my jobbing day-to-day work has been done in SPSS. Now SPSS is decent enough, but it shows its complex development history. From what I can see the original code was written in Fortran (which is not nice for programming statistics in); they then went through a stage of using C++ before going over to Java. There is also a general program syntax which basically gives you simple data manipulation and the ability to call routines, a matrix language, an output manipulation language (OMS) and interfaces to Python and R. In other words there are a lot of bits and pieces that do things differently. What it does have besides this is enough staff to actually check the routines are working and some knowledge of statistical programming. If you were designing a package from scratch you would hope to end up with something simpler.

Now, about a decade and a half ago, there were a couple of statistical programming languages being developed. One was called S and one was called xlisp-stat. S was commercial and xlisp-stat was open source. Eventually people started putting a nice front end onto S and calling it S+, and with that it gained popularity. This was to such an extent that an open source version called R was created, which basically reimplemented the programming language.

So in my mind R was this powerful version of GLIM.

It isn't. Let me give you the Good, the Bad and the Ugly.

If I selected something for Good it would be the lm command, the linear model command. The nice thing about this is that it covers both regression and ANOVA, which means there is a consistent way of specifying many different models. Not just that, but the same formula interface is extended to glm, gam and nls (and probably a whole host of others). This means there is a neat, consistent way to deal with 90% of statistical modelling.
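A minimal sketch of that consistency, using R's built-in mtcars data: the same formula notation drives both an ordinary linear model and a generalised linear model.

```r
# One formula interface for regression/ANOVA...
fit_lm <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# ...and for a logistic (generalised linear) model
fit_glm <- glm(am ~ wt, family = binomial, data = mtcars)

summary(fit_lm)
summary(fit_glm)
```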

I can give two things that are simply bad. The first is the multiple implementation of various fairly simple commands. For instance the functions anova and Anova do much the same but have different parameters you can set! I am also somewhat frustrated that there is a package called lawstat which does some things I really want to do, but does them only at the expense of disabling useful bits of another package! This is an inherent part of open source software.

The next is just poor management by the central people. When R came along a lot of statisticians had written code in Fortran or C++. The good way to deal with this would have been to insist it was ported into R, thus also making it open source. That however meant work for those statisticians. So instead there is an interface to compiled Fortran and C++ programs. You cannot publicly check this code; it is in a sealed box, yet it exists as part of the R suite. How do I know this? Because it is there in the user notes for those routines!
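To make the anova/Anova overlap concrete: anova() is base R and produces sequential (Type I) tests, while Anova() comes from the car package (not loaded by base R; this sketch assumes car is installed) and takes different arguments, defaulting to Type II tests.

```r
library(car)  # provides Anova(); a separate add-on package

fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

anova(fit)               # base R: sequential (Type I) tests
Anova(fit, type = "II")  # car package: Type II tests, different parameters
```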

The way I am using R, it is very much just another statistics package like SPSS. It has different standards in different parts. For instance, up to today, if I wanted to work with a specific data set I put the term "data=mydata" into the command, where "mydata" is the name of the data set. However, today I tried to plot interactions this way and it did not work. I noticed that sometimes there seemed to be an extra term around an interaction.plot call which said "with". So I looked that up, and it is the way with that command to make it use a specific data set. I would not mind, but originally I had started using mydata$varname and had already switched to the data approach because that did not work. This is simply ugly!
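The inconsistency looks like this in practice (using mtcars as a stand-in for mydata):

```r
# Modelling functions accept a data argument...
fit <- lm(mpg ~ wt, data = mtcars)

# ...but interaction.plot() has no data argument, so the data set
# has to be supplied by wrapping the call in with()
with(mtcars, interaction.plot(cyl, am, mpg))

# The third style: fully qualified names via $
interaction.plot(mtcars$cyl, mtcars$am, mtcars$mpg)
```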