A long time ago, when I was a young statistician, there was a package called GLIM. It was a very nice package in some ways, although limited. What was nice about it was that it let a statistician follow a coherent approach when modelling something mathematical without, most of the time, having to work directly with complex matrices. I got used to making it sing and dance for me in my first job. So there is a part of me that still wants a more powerful version of that package.
However, my jobbing day-to-day work has been done in SPSS. Now SPSS is decent enough, but it shows its complex development history. From what I can see the original code was written in Fortran (which is not a nice language for programming statistics in); they then went through a stage of using C++ before going over to Java. There is also a general program syntax which basically gives you simple data manipulation and the ability to call routines, a matrix language, an output manipulation language (OMS), and interfaces to Python and R. In other words, there are a lot of bits and pieces that all do things differently. What it does have, besides all this, is enough staff to actually check that the routines are working, and some knowledge of statistical programming. If you were designing a package from scratch you would hope to end up with something simpler.
Now, about a decade and a half ago, there were a couple of statistical programming languages being developed. One was called S and one was called xlisp-stat. S was commercial and xlisp-stat was open source. Eventually people started putting a nice front end onto S and calling it S+, and with that it gained popularity. This was to such an extent that an open source version called R was created, which was basically just the programming language.
So in my mind R was this powerful version of GLIM.
It isn't. Let me give you the Good, the Bad and the Ugly.
If I had to select something for Good it would be the lm command, which fits linear models. The nice thing about this is that it covers both regression and ANOVA, which means there is a consistent way of specifying many different models. Not just that, but the same formula interface extends to glm, gam and nls (and probably a whole host of others). This means there is a neat, consistent way to deal with 90% of statistical modelling.
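To show what I mean by a consistent interface, here is a small sketch on a made-up data set (the data frame `d` and its variables are invented for illustration): the same formula syntax drives regression, ANOVA and a generalised linear model.

```r
# Hypothetical toy data: a numeric predictor and a two-level factor.
set.seed(42)
d <- data.frame(
  x = rnorm(20),
  g = factor(rep(c("a", "b"), each = 10))
)
d$y <- 1 + 2 * d$x + ifelse(d$g == "b", 0.5, 0) + rnorm(20, sd = 0.1)

# Regression and ANOVA share one formula language:
fit_lm  <- lm(y ~ x + g, data = d)    # linear model (regression)
fit_aov <- aov(y ~ g, data = d)       # one-way ANOVA

# The identical formula carries straight over to glm():
fit_glm <- glm(y ~ x + g, data = d, family = gaussian())

coef(fit_lm)
```

The point is not the particular models, just that `y ~ x + g` means the same thing everywhere, so learning the formula notation once covers most of the modelling functions.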
I can give two things that are simply bad. The first is the multiple implementations of various fairly simple commands. For instance, the functions anova and Anova do much the same thing but have different parameters you can set! I am also somewhat frustrated that there is a package called lawstat which does some things I really want to do, but does so only at the expense of disabling useful bits of another package! This is an inherent part of open source software. The second is just poor management by the central people. When R came along, a lot of statisticians had written code in Fortran or C++. The good way to deal with this would have been to insist that the code was ported into R, thus making it also open source. That, however, meant work for those statisticians. So instead there is an interface to compiled Fortran and C++ programs. You cannot publicly check this code; it sits in a sealed box, yet it exists as part of the R suite. How do I know this? Because it is there in the user notes for those routines!
The way I am using R, it is very much just another statistics package like SPSS, and it has different standards in different parts. For instance, up to today, if I wanted to work with a specific data set I put the term "data=mydata" into the command, where "mydata" is the name of the data set. However, today I tried to plot interactions this way and it did not work. I noticed that there sometimes seemed to be an extra term, "with", wrapped around an interaction.plot call. So I looked that up, and that turns out to be how you point that particular command at a specific data set. I would not mind, but originally I had started using mydata$varname and had already switched to the data approach because that did not work. This is simply ugly!
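For anyone who has not hit this, a minimal sketch of the inconsistency (the data frame `mydata` and its variables are invented for illustration): modelling functions take a `data` argument, but interaction.plot does not, so you have to wrap it in with() instead.

```r
# Hypothetical data set: two factors and a numeric outcome.
mydata <- data.frame(
  dose    = factor(rep(c("low", "high"), times = 6)),
  group   = factor(rep(c("ctl", "trt"), each = 6)),
  outcome = c(5, 7, 6, 8, 5, 9, 6, 10, 7, 11, 6, 12)
)

# Modelling functions accept data= directly:
fit <- lm(outcome ~ dose * group, data = mydata)

# interaction.plot() has no data argument, so the idiom is with():
with(mydata, interaction.plot(dose, group, outcome))
```

Both calls end up operating on the same data set; they just use two unrelated mechanisms to say so, which is exactly the inconsistency complained about above.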