Thursday, 31 May 2012

Tedium of Finishing Off

This week has been a finishing-off week: two papers were coming up to that line which means they are ready to be sent out. On the surface you would think that they were both the same, but the feel was very, very different. I also saw a PhD student who is planning to submit next week; I sincerely hope that he gets through his viva without major reanalysis. I have come to dread opening his data set, as there are always more errors in it. However, I prefer not to think about that.

So let's take the first paper: it was a short communication of a piece of analytic work. The problem was that, despite having written reports and such from the data, we had always been so constrained that we had never played with it. So we had to sit down and check what we were finding. We took a very specific route, and the statistics suddenly started giving useful results. This is a data set that, by the standards of quantitative work, was messy and likely to contain a lot of recording error because of the collection techniques. So it was with some surprise that things started to make sense. It took us some time to get our heads around things (inverse correlations always seem to be more difficult to interpret), but once done we were able to write in a very short space of time.

The other paper was a fairly simple qualitative paper which had been to one journal and been rejected, so it needed quite a bit of rewriting. It felt completely different: the work was much more about balancing our analysis against the findings in the literature, and we regularly had to take a sentence, ask what was being said, and then revise it into something that actually was of academic standard. There was also a need to update all the references. We did go back to the analysis, but when we did it was simply to the output files from last summer. Much nicer, much easier, and it means I know I am using a consistent version of SPSS. So no new analysis, but plenty of detailed writing.

Anyway, hopefully both papers will be off to the journals in the next week or so. I hope we have some success.

Wednesday, 23 May 2012

Water and Long Haul Statistics

Last week the paper Water intake and post-exercise cognitive performance: an observational study of long-distance walkers and runners was published, and I am one of the authors, as Margo felt the analysis was going to be tricky. I suppose I could have talked about it then, but last week was dominated by the move, so here is my reflection as the statistician on the project.

It was a paper that started out as a student project, or two student projects actually, although in the end it was only the water half we concentrated on; the other half, which looked at nutrition in general, did not make it. What was unusual about this study was that it was a field trial (to adopt the agricultural terminology), not a laboratory/clinic-based trial. Field trials are known to have large levels of complexity, and this was the case here.

Let's start with the basics: this trial took pre- and post-race measures of cognitive performance for long-distance runners. Firstly, the students had no ability to determine who went into which race; that was the choice of the athlete, who decided to join the study after entering a race the students were attending. Secondly, we had no control over liquid intake prior to or during the race; we had to rely on self-reporting by the athletes. Thirdly, we found there are few reliable ways of establishing someone's hydration status in a one-off setting. Most of the highly technical methods either are not appropriate for a race situation or are only reliable if repeatedly administered.

This is one of those cases where it looks simple, a paired t-test will do, and it quickly becomes apparent that no, it won't. The biggest problem was removing the confounding. I went through several analyses where the results were incoherent (they did not make sense), yet consistent enough to suggest there was something going on, but not enough that I could satisfactorily characterize it. So I had to sit back down, reconsider the assumptions we had made, and then re-analyze. I think it was on the fifth attempt that it finally hit me that we had a problem with the pre-scores and the water intake before the race. If cognitive scores are dependent on hydration, then this is true of pre-race scores as well as post-race ones. So to get a cognitive performance score that was neutral before the race, I had to adjust the scores for the water intake over the previous twenty-four hours; I could not just use the raw scores. It worked: at last we got a consistent and coherent story. Then all we had to do was write the paper.
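The adjustment step can be sketched in a few lines. This is only a hypothetical illustration, with toy numbers rather than the study's data, of one common way to make a baseline score hydration-neutral: regress the pre-race score on prior water intake and keep the residual as the adjusted score.

```python
def adjust_for_intake(scores, intake):
    """Return scores with the linear effect of intake regressed out,
    using simple least squares (toy sketch, not the published analysis)."""
    n = len(scores)
    mx = sum(intake) / n
    my = sum(scores) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(intake, scores))
    sxx = sum((x - mx) ** 2 for x in intake)
    slope = sxy / sxx
    intercept = my - slope * mx
    # The residual is the hydration-neutral baseline.
    return [y - (intercept + slope * x) for x, y in zip(intake, scores)]

# Invented numbers: pre-race scores that rise with 24-hour water intake.
water_24h = [1.5, 2.0, 2.5, 3.0, 3.5]
pre_score = [56.1, 58.0, 60.2, 61.9, 64.1]
pre_adjusted = adjust_for_intake(pre_score, water_24h)
```

By construction the adjusted scores average zero and carry no linear relationship with intake, which is exactly the "neutral before the race" property the analysis needed.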

However, in writing it we discovered that we should have taken pre- and post-race weight, and I suspect self-reported normal weight as well. It is one of those cases where you only see what you should have measured once the analysis is done.

It took one year from student placement to analysis and another year for the paper to be produced, but what had been very much a first simple trial at doing something had borne fruit.

Thursday, 17 May 2012

On Chucking Things Out

This week I am moving offices. Not just that: the department has decided that we need a massive clear-out before we move, as they do not want to pay to move things that will never be used. I am having some problems with this edict.

Firstly, I am a statistician by training, and culturally statisticians are hoarders. That is, they tend to store data indefinitely on computers, keep copies of long-forgotten analyses, and so on. When I first started working as a statistician, half the office the four of us shared (it wasn't big) was taken up with computer output; there were stores elsewhere as well, I believe. Even when we got half of a much bigger office, under all the waist-height shelves were boxes full of output and paper copies of data. You quickly learn the cost of deleting something too soon, and I did it often enough in those early years.

When I was in charge of an analysis I developed a method of working built around big ring binders. I would do an analysis, print it out, put it into a ring binder, and then summarize the results at the front of the binder. If you knew my approach to most other things, you would realise how thoroughly this was taking on the culture. The ring binders were then kept, forever, just in case there was a need to return to the analysis. To statisticians there is no such thing as a dead analysis whose output can be junked.

I have had to be radical with this and say that files where the analysis is several years old, has not been recently consulted, and has been published really should be counted as dead. If I did not know whether the work had been published, I was to presume it dead as well.
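Written out, the triage rule looks something like the sketch below. The thresholds and the function name are my own invention for illustration; nothing here was actually codified by the department.

```python
from datetime import date
from typing import Optional

def analysis_is_dead(analysis_age_years: float,
                     last_consulted: date,
                     published: Optional[bool]) -> bool:
    """Hypothetical triage rule: an analysis counts as dead if it is
    several years old, has not been consulted recently, and has been
    published. Unknown publication status is presumed published."""
    old = analysis_age_years >= 3                      # "several years old"
    stale = (date.today() - last_consulted).days > 365  # not recently consulted
    presumed_published = published is None or published
    return old and stale and presumed_published

# A five-year-old, long-untouched, published analysis goes in the skip.
verdict = analysis_is_dead(5, date(2008, 1, 1), True)
```

Anything consulted within the last year, however old, survives the cull under this rule.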

I am also semi-pathological about data, but these days data is computerised. Until fairly recently, though, I kept all the hard copies of surveys conducted about fifteen years ago, the reasoning being that if the worst came to the worst and the computer network had been wiped out, we could always re-enter the data. The oldest data set I have on the computer is one of those sets; it dates from August 1996. So the computer copy outlasted the paper ones in this case.

These are on computer, so none are being lost, although I will back up my hard disk before we move. I must also look at creating a better backup system. The work is now fairly organised, at least as far as the data goes: by senior researcher, then folders for topics, or, for students, under queries and then the student's name. The main problem is that a lot of the time the data is shared between my computer and the consultee's, and there can be problems knowing where a file is.

Secondly, manuals are a tool of my trade. I have used them for answering users' queries since I first started this job, and my reading of them gave me a head start in my previous ones. Sometimes you do not even need a copy of the software if you have a good manual. Part of my job seems to be to RTFM for other people. Needless to say, I like having manuals around; they feel like a safety net for the times I do not know the answer.

However, paper copies of manuals are disappearing, so much so that I sometimes print off copies of PDFs when I need a paper copy and cannot buy the manual. Also, I do not use manuals equally: the SPSS syntax manual is the one I consult most frequently (I wonder if it is time I got a new copy, as it is ten versions out of date, although syntax changes more slowly than most people think, and whether I can even get one). At the other extreme are the S-Plus extra modules, which after a decade are not even out of their wrapper, and we do not even support the software anymore. Some of my manuals are old, very old; some had been in the department longer than me.

However, the move has been towards more and more online manuals. I have a similar attitude to master copies of software, especially having been caught out without them. Universities seem to be the home of historic software users.

So I have had to have a cull: getting rid of everything where the software is no longer supported, or where more up-to-date alternatives were available online. This meant chucking out one huge set of manuals that were over twenty years old, three versions out of date, and that I had not consulted in the last two years.

The odd thing is that I am only doing this because these things are on paper. If they were computerised, there is no way I would be doing it; there would be pressure on me to preserve them even longer (well, maybe not the manuals, but definitely the analyses). It seems to me the demand to archive data is an outcome of the present computer age; before then, archiving was difficult and therefore not attempted. Only the highly methodical kept notes and data carefully enough that they were still around when they died, and then, if they were important, they were archived. Things were not transportable, so keeping them was often not an option.

Yes, I understand fully why we want to keep things, but I am becoming aware that a sensible deletion policy might well be part of a good data management plan!

Thursday, 10 May 2012

The Importance of having metadata

There is a lot of interest at present in Research Data Management; it is one of those phrases you hear talked about either with excitement or with dread. You can learn more from the Digital Curation Centre. There are buzzwords around like "Open Access", "research outputs", data life (I wish they would talk of data half-life, but that is just me) and metadata. Now, I am not going to give a detailed account of Research Data Management; what I am going to do is tell this week's story, which highlights the need for ongoing metadata keeping in a live research project.

I have had a long-term research relationship with Margo Barker that started soon after I arrived about 19 years ago and has covered many research projects. This one started about four years ago, when Margo was able, for a short while, to employ someone to look at magazine messages for her. That person did a good job and produced some results in the short time available. However, things moved fast, and when she left, Margo found the project was largely swamped by other things. It was not until last summer, when another person working on a short-term project had some spare time to look at it, that it was revitalised. He managed to get the paper ready for presentation, but it took until the autumn before it was ready to be submitted. During this process little new analysis was done.

The paper was sent back for editing, and in response to the reviewers' comments we decided to alter the analysis by removing some of the data. That was fairly easy for the material I had on my machine (I had saved the SPSS programs that did the original analysis), but there was other data I had not seen. A good part of this week has therefore been spent searching computer storage and opening likely files to see if they held the original data. We found it eventually, but it took us a long time.

So what difference would metadata have made to this project? The metadata would ideally contain the creation date, the last-edited date, and what was actually in each file: not just what data was there, but what analysis had been carried out, any changes made, and so on. If this had been stored in a central file for the project, we could have opened it and looked up which files contained the information on the number of magazines sampled. We would have known what each variable actually was, and could probably have answered the question of what counted as an article about nutrition (fortunately the original researcher was good at documenting the data, so we found that she clearly stated she had removed articles with only a small amount on nutrition, but this was hidden away in a file and we came across it by coincidence). We might also, for instance, not have spent quite so much time wondering whether supplements and weight-loss products were one and the same (they weren't, but it took a lot of time to find that out, because we did not know we were still missing one of the data sets)!
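To make the idea concrete, here is the kind of record I mean: one entry per project file, kept as a line of JSON in a central log. Every filename, date, and variable name below is invented for illustration, not taken from the actual project.

```python
import json

# A hypothetical metadata record for one file in the magazine project.
record = {
    "file": "magazine_sample.sav",
    "created": "2008-06-12",
    "last_edited": "2011-08-03",
    "contents": "Number of magazines sampled, by title and issue",
    "analysis": "Descriptive frequencies; articles with only a passing "
                "mention of nutrition were excluded",
    "variables": {
        "mag_id": "magazine identifier",
        "n_articles": "count of nutrition articles in the issue",
    },
}

# One JSON line per record; appending keeps the central log simple and
# the whole project searchable years later.
line = json.dumps(record)
restored = json.loads(line)
```

A plain-text log like this is deliberately low-tech: anyone on the project can grep it for "magazines sampled" instead of opening every candidate file in turn.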

Moral of this story: most researchers are human, and humans forget. Having your metadata in a usable form makes it easier to pick up an analysis after time has elapsed, and saves you time because you have a record of what data is where.

Thursday, 3 May 2012

Where to begin

Most people in the department who think of me probably think of me as the person who deals with SPSS (it doesn't bite, honestly), and, if they are a bit more knowledgeable, know that I tend to work with researchers and also look after something called NVivo.

Well, I do do that, but in doing it and the work around it I do so much more. Just today, for instance, I have:
  • dealt with a doctoral student using mixed methods (both SPSS and NVivo), helping her enter data so we can run some sort of analysis in SPSS (her data was lost and she has only got outputs);
  • helped another student deal with the analysis of their green roofs in SPSS;
  • looked briefly at some data on advertising in women's magazines, which is in Excel;
  • spent some time discovering what the MaxDiff methodology was.
That is a slightly heavy day, but it is not exceptional. The topics people bring to me are varied; probably my main skill is not using SPSS or NVivo but the ability to extract from someone what they understand themselves to be doing with their research. I therefore tend to work very closely and intensely with the researchers I am advising.

I have been doing the job for close on twenty years, and there are really two core reasons for doing it: firstly, the character of the people you are dealing with; secondly, the stimulus of the work involved.


It is good to work with researchers. Researchers, whether they are doctoral students or professors, are usually passionate about what they are doing. If someone shows an interest in their work they become animated, reflecting the high level of personal involvement in their research. Often, regardless of how cynical they have become about academia and the outputs of their research, they remain people who want to make a difference.

Secondly, there is the fact that I am perpetually curious: I like learning new things, and, even better, I like discovering, or helping to discover, new things. However, I am not good at sitting down and working methodically in one direction; there are too many shiny new things out there to distract me. With this job I get to change what I am working on several times a day, whereas most researchers work on a small number of topics for a long period of time. Plenty of stimulus for my brain, then.

So over the coming weeks I hope to bring to this blog some of the stories of the research at the University of Sheffield I am involved with and also some of my reflections on supporting researchers in the use of computers.