Thursday, 13 December 2012

Not so Joined Up Thinking

Right, last December IBM SPSS wrote to all academic sites in the UK saying they were going to stop issuing license codes for very old versions of the software. I was largely unconcerned about this. We were upgrading the student system in the summer, the staff system was being upgraded over the summer too, and really, sometimes moving staff onto new versions is a good idea. SPSS users tend to be research staff, so they are largely running their own machines.

Now what I should have done is consider the small number of people who do not fall into that group. There are some postgraduate students who use departmental machines, and there are about half a dozen support staff who have managed desktops because they are supporting students in the use of SPSS. Overlooking them was a mistake.

At the same time a decision was taken over the summer that SPSS on the old desktop would not be upgraded to a newer version, because of the depth of the folder structure it installs. This did not reach my surface consciousness, although I may have been told. It would not have seemed major anyway; we were planning to be on the Windows 7 service.

At the same time I am getting a lot of staff saying SPSS is not forward compatible with datasets, and then everyone gets worried. The thing is that data files actually are forward and backward compatible right back to version 9 or beyond. What is problematic is the output files. There are several solutions to this: you could open them all and export them, or you can install the Legacy Viewer, which allows you to view the files. A further option, if you have syntax, is to send the syntax file and the data file to the user so they can generate the output for themselves. Hardly a big problem, you'd have thought.

Now let's move forward to this year's renewal. The new version of the Desktop is out, running a version of SPSS we have codes for. However there are a number of machines not on the new desktop. These are running the old desktop, and we cannot get codes for the version of SPSS on that. The intention is to move them on in the next six months or so, but there is a gap between where we want to be and where we are.

Well, what are the options? Releasing the version of SPSS that has a license code onto the old desktop would mean users lose space in their profile to all those directories. But if we do not release it, we have to deal with users who have been using SPSS finding the license code has expired. Most are postgraduate students, and telling them to use one of our computer rooms, if they cannot persuade their department to install SPSS specially in the departmental rooms, seems a fair option. That still leaves members of staff.

The first response is that they should not be on the managed service. Actually, once I looked into who it was, it raised other questions. The group of staff I had overlooked were staff in MASH, who support students in the use of statistics and mathematics during their courses. Their work is largely remedial, in that it is getting students up to the standard where they can complete the requirements of their course. They have managed machines not because they are using administrative software that needs one, but because with a managed machine they are using the same thing as the students. In other words they may well need upgrading to MANW7 sooner rather than later. We should be contacting them and sorting out a way forward in the next few days.

We need to think this through. We also need to think about how we deal with upgrading in future, because the need to upgrade SPSS will not go away. We will have to move to a newer version next summer, as IBM SPSS will no longer be giving out codes for the version we are on at present.


Friday, 7 December 2012

Reflections on Difficulties with providing Qualitative Software Training

This week I had one of those training sessions that was a miserable experience for those being trained and also a miserable experience for me as the trainer. Let's be clear, I get a kick out of training when I have managed to conduct a session in such a way that people have really got something out of it. However sometimes a session just does not deliver, and this was one of them. I also read In Praise of Evaluation in my denomination's magazine this morning, and what follows here is my evaluation of what went wrong.

Last Minute

Too much of the preparation was left to the day itself. That was not just the people who were sorting it out technically, but also me! A course that relies on a technical fix needs to be tested at some stage at least a couple of days beforehand, and this was just such a course. The lack of testing showed in the fact that the instructor's PC did not have the software on it, and in the fact that I did not give out the correct names and passwords for the special accounts.

I have a real blind spot where courses are concerned and nearly always end up doing things at the last minute. This time I had the course notes printed, but I needed to print the usernames and passwords at the last minute. I had not given any thought to how to do this and ended up doing it in a hurry. I therefore messed it up.

Bussed in teaching

Right, there was a communication problem. I discovered a week beforehand that students were being told they were getting a workshop when in actual fact they were getting a very full training session. I have taught the course six times in the last year, and it takes 2.5 hours to cover what I cover! I had two hours to do it in. The students were not going to get to work on their own projects in that time, and they really should not have been given the expectation that they could.

The problem with this is that the training happens in a wider context, but nobody has really thought about how the use of computer programs integrates into the wider course. Rather, they have employed an "expert" to give the training, and that has failed to meet the requirements of the course.

What is equally troubling is that the students had expectations of me to teach them things that are really not the task of the computer software expert. I teach people how to use the software. I do not teach them how to construct their research project.

The separation of software training from research methodology

This is the real biggie. The course is basically in three separate parts, and those parts do not hang together. There are a number of methodology lectures which talk around collecting and analysing specific types of qualitative data. There is the "workshop" I give on using software to support doing research. Finally there is the student project, where they have to show they have engaged with doing some sort of qualitative research. None of the three hangs together with the others.

At present the focus of the course is on the style of data collected. Now most qualitative researchers collect multiple sorts of data, so at one level that is sensible; on the other hand it puts all the emphasis on the style of gathering data. The analysis is often treated as self-explanatory. It isn't.

Secondly, software has developed; today it deals with everything from the recording right through to writing the report. Using software is no longer just about coding. The standard line these days is that teaching software needs to be fully integrated into the course on qualitative research methods. That means you need to run sessions on organising data, transcribing interviews, coding, reviewing and checking coding, developing theory, exploring data and finally reporting. These sessions probably work best if the first hour is spent with someone doing the theory and the second hour is spent with students getting a demonstration and working in detail. All of these need a sound theoretical underpinning.

Conclusion

  1. I need to explore ways to discipline myself not to leave things so last minute
  2. I need to be clear about what I am offering, and start the conversation much earlier about what they are expecting people to do
  3. Really the course needs a redesign to increase the emphasis on the analytic questions a qualitative researcher faces, so that students know what they are doing when they are taught to do something with the software.

Friday, 30 November 2012

Provider vs Facilitator: the difference between learning and research computing support

I have been mulling this over this last week. What I write now is an attempt to get some clarity on the difference between learning computer support and research computer support. Now, learning and research are two functions of the University. Historically it was "teaching", but I suspect that today's educational climate prefers to characterise what goes on with taught students from their perspective rather than that of the staff, so I am using "learning". However I do want to draw the boundary between learning and research, because research is often seen as just a form of learning.

A person who comes to learn a subject is learning how to navigate through that subject. They can become very proficient at this and highly skilled within the subject. Indeed at the higher levels they will quite often go to books and journals on their own in order to gain the knowledge they want. The academic staff in the university may well not have that knowledge but, and this is the crucial thing, someone somewhere else does. That is why they can go to the books and the journals to find it. A researcher is not learning to navigate a topic; they are an explorer. That is, they are trying to find or create knowledge that nobody knew before. The primary material they are interested in isn't in the books or the journals, because books and journals only tell you where somebody else has been. That is not to say that knowing the subject thoroughly and doing literature searches are a waste of time; if you did not do them, you would not know what was old territory and what was new.

This means that when the University's central services support research, it is not the same as supporting learning. When we support learning we aim to provide the necessary computer tools for an individual to become proficient at navigating the subject they have chosen to study at University level. That is not to say students may not decide to use other tools. I can remember back in the 1980s sometimes buying alternative textbooks when I was struggling with the ones recommended for the class. The recommended texts were in the University library; the alternative texts weren't. So too with central learning computer support. It does not support everything a student might do, but it supports enough that you should not have to get other computer resources. However doing this relies on the fact that students are learning to navigate a known subject and we already have access to expert navigators of that subject (i.e. academic staff), so we can ask the staff what tools they need and work with them to provide a good enough set.

With researchers we cannot do this. Yes, we can provide the navigation tools that we provide to learners, but those will only take you so far. There are specific generic skills researchers need to develop, e.g. an ability to use Google with more than average skill, literature database searching, time management and writing skills, but when it comes to actually going beyond the boundary of the known into the unknown, the expert is the individual concerned. So it is no good coming to a central service and expecting them to know exactly what your computing needs are. The person who decides what those needs are is you, the researcher.

That means that for learners, a centralised computing facility can provide a fairly standard desktop PC that can be used by multiple users. That desktop PC may also allow you to link to more specialist machines where you can run specific packages that are part of what your course expects you to use. That is relatively neat and tidy.

First, researchers really should not have a standard machine setup; their main computer should be their own machine, not a shared one. They may purchase it themselves or their department may purchase it for them, and it need not necessarily be high spec, although it is wrong to think that high spec is for science and engineering while low spec is for social science and the arts. The latest version of NVivo, a qualitative software package, has higher spec requirements than SPSS, a package widely used by quantitative researchers. When you look at this as a computer person it is how things should be: NVivo does a lot that is technically more difficult than the complex mathematics in SPSS; it is far more resource intensive, from the computer's perspective, to play a video than to invert a matrix. However the SPSS user often has more computer knowledge than the individual using NVivo. The old correlation between the people who need high-powered computers and the people who know about computing is beginning to break down, and I think it will continue to do so as more and more people use computers for things other than numbers. Numbers are easy for computers to handle; other things are far more difficult. What is important is that the responsibility for running that computer remains with the researcher. This also means that Doctoral Research Scholarships should include money for a computer satisfactory for the student to carry out the research on.

Second, central services are not going to buy all the software a researcher needs. Central services might well buy licenses for research software where the software is widely used (something around ten departments asking for it, spread over at least two faculties) and where central purchase cuts the cost for the whole University. So check what is provided; as a rule, if a piece of software has central support then you will get more support if you run into difficulties. However if what is centrally provided is not what you want, or does not work for you, please feel free to do your own investigation. If you want advice about software then it may well be a good idea to talk to someone from central services, who may be able to tell you what to look for even though funding is unlikely to be forthcoming. The one thing a central service computer department has is a lot of people who know a lot about computers in various ways. We also have a lot of people with a lot of diverse interests. Actually, cost is becoming less of an issue with the amount of open source software available. You may well find what you are looking for, for free; however you still need the machine to run it on. The other thing to realise is that if you do not like a standard offering, e.g. Word, there is no reason why you must have it on your PC and why you might not have Open Office instead. Central services do not give this sort of freedom to learners.

We are also providers of infrastructure that you use. You may buy the PC on your desk, but the wires or wireless signal that connect it to the network come from central services. Every time a researcher today reads a journal article by downloading it to their computer, rather than walking over to the library and photocopying it, they are using central computing services. We make provision for email and other methods of making contact with colleagues through the internet, and may provide collaborative spaces where you can work together online. We certainly have people with experience of developing such spaces whose knowledge can be tapped, so if you are interested in such a space it might well be worth talking to us. This is the sort of thing that can often come as an offshoot of our provision for learners. We also provide shared space for departments, which you may well use, and if you want space to make a good backup of your data then please come and talk with central services, as we can do more than you would think and the prices are dropping. In other words, do not leave data backup to chance. Central services also provide access to large computing facilities, both at this University and in wider academia. This is again because doing this cooperatively saves money! That said, there are often charges for the use of these services, but they are often built into research bids when they go out.

The tone of the last two paragraphs might surprise you: the initiative on the whole remains very firmly with the researcher. We are not going to hunt you out and ask about your research computing needs; you have to come to us with those needs, and then we will see if we can help you with them. If you know at the bid stage that you are going to need significant support, will be collecting large archives, or need to store or release research outputs online, then it really is a good idea to talk to us before the bid goes in. It makes more things possible and will allow you to fund this sort of thing properly.

In the end, central services' computing support for researchers is not really that of a provider but of a facilitator, with the researcher firmly in the driving seat. If you like, the important thing to remember is that it is the researcher's job to pull, not central services' job to push. As a general rule, the more you make a conscious effort to approach us with queries, the more we are able to help.

Thursday, 22 November 2012

Users and Users

I ended up this week in a discussion about new machines in a training room. It was one of those meetings where everyone wants things to be as simple as possible but nobody has really thought about the complexity of the situation. The idea had been mooted that we should use virtual machines for those applications that do not run on the managed desktop, with the desktop linking to the virtual machines.

Well, it sounds great: automatic rollback of files and such, no unsecured machines, and lots more. Let's be honest, if it would work it would be a good solution. However, what does "work" mean?

Here I need to express my stance, which is basically that I want users to be able to run SPSS and NVivo on these machines, as the room is the one I teach the majority of my courses in.

First there are the technical difficulties. I am somewhat cautious over whether NVivo would run in a virtual environment, although it did for a while run in virtual Windows environments on Macs. With SPSS I envisage NO problems whatsoever. In other words, as far as "can we technically do this" goes, I need persuading that NVivo would run (it is notoriously difficult to run in non-standard settings) but do not see other problems.

However these pale into insignificance when we start looking at users using the system. Many of the people in my department tend to think that the world is made up of people like us. That is not to say all of us are wizards on computers; some are, but most of us are just what I would call proficient users, able to use computers pretty well to carry out our own jobs. The wizards need to be wizards to carry out theirs. But that means that in our day-to-day experience we are not working with a large portion of the population, and these people really do not trust computers. The closest I can get to the feeling I pick up from them is that they suspect the computer of being the sort of sly operator who lulls you into thinking you are doing all right and then pulls a fast one on you. Therefore they do not trust whatever they do on computers, and if ANYTHING changes they freeze in an instant state of panic.

Now imagine putting a virtual machine in front of just such a person. What do you get? Well, two Start buttons just to begin with; not only that, but two filestores, and the second one has a different C: drive from the first. This is not supposed to happen. I can see people having problems just knowing how to get to the second Start button.

What I am not saying is that these people are stupid; many of them are far from it, and have degrees to prove it. That is why I characterize it as a lack of trust that undermines all their understanding: in the end they do not trust themselves to work with the machine, or the machine to give consistent outcomes. This group of people is diminishing as people grow up using computers, or at least as the age at which that starts keeps falling. When I first came here twenty years ago I quite literally started SPSS courses by explaining how to use a computer mouse! But they are not extinct, and I suspect they will not be extinct in my lifetime.

The problem with NVivo is that it is a powerful package (what it does is quite complex) targeted at a particular group with a high propensity towards this freezing. These are the people who read books rather than played with computers as kids! There are more maths-phobes among them than in the population in general, and maths-phobia often seems to go together with this distrust. These are people who have chosen to do qualitative analysis, and often part of that reasoning is that they found quantitative work scary! They just lump computers into the "scary" category as well.

Whatever I do, I must not put extra hurdles between these people and the software. They need to feel they are in control, and "difficult" things are not a good place to start.




Thursday, 15 November 2012

Linguists and Statistics

This week we seem to be entering a new phase in dealing with statistics in Linguistics and Modern Languages, so I have been reviewing the statistical texts to see what they are like. I have been dealing with statistics in this research area, on and off, for the best part of twenty years. The comments below are anecdotal but I think they are fair.

Linguistics is rather different from many other areas where statistics is applied. The normal approaches do not necessarily work. First off, let me list the types of variables:
  • counts - normally of words in various ways
  • presence/absence variables, the sort you answer yes or no to, but also some categorical ones
  • marks out of a total, either a straight test score or the number of times one form is chosen in preference to another
  • Likert-type preference scales
The problem is that none of these types of variables is automatically Gaussian (normally) distributed.

Second, there are differing levels of variation, and the question of what a unit, a member of your population of interest, actually looks like. It often takes a lot longer with linguists than with other people to work this out. A member of your population may be an individual, a conversation, a text, a sentence or an exchange. If it is, say, a conversation, what about the variation that relates to the people conversing? In other words you have a huge amount to think about before you get close to being able to enter the data.

Third, linguistics scholars use R! Yes, I know, you have these very arts-based people using a programming language. But R is free, therefore they use R.

So I want a basic textbook. I go to Amazon, start checking the books for linguistics, and order one that looks hopeful and includes SPSS. The first section, on research methodology, appears vaguely all right, nothing completely wrong, but then I get to the statistics. It covers mean, median and mode, but when it gets to measures of dispersion it goes straight in with the standard deviation and on to a t-test. Hang on: a t-test is based on a Gaussian distribution assumption, and these people more often than not have variables that are not Gaussian distributed even when scalar. Actually, my alarm bells started ringing when the book said that the difference between interval and ratio does not matter for linguistics. It does matter: it matters for low counts, and it matters for test scores when marks close to zero or full marks are achieved. You cannot just assume Gaussian here. Sometimes your categorical variable will be your outcome.
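
To make the point concrete, here is a minimal sketch in Python (invented word-count data, nothing from the book in question): on the sort of low, skewed counts linguists collect, a t-test and a rank-based alternative such as Mann-Whitney need not agree.

    # A minimal sketch, assuming invented word-count data for two groups
    # of speakers; the point is only that low counts are skewed, not Gaussian.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    group_a = rng.poisson(lam=1.5, size=30)   # low counts: heavily skewed
    group_b = rng.poisson(lam=3.0, size=30)

    # The t-test assumes roughly Gaussian data within each group...
    t, p_t = stats.ttest_ind(group_a, group_b)
    # ...whereas Mann-Whitney uses only ranks, so makes no such assumption.
    u, p_u = stats.mannwhitneyu(group_a, group_b)
    print(f"t-test p = {p_t:.4f}, Mann-Whitney p = {p_u:.4f}")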

Unfortunately this heavy emphasis on Gaussian-based tests seems to be common in the textbooks. It is back to the drawing board, I think. It looks like I will be learning R properly.

Friday, 9 November 2012

Some resources for Statistics Teaching and Learning

There are a number of resources out there for learning statistics. This is not going to be an exhaustive list but here are a few that I am aware of.

Statlib

This is a really old website, so do not expect anything fancy as you would on more modern ones, but it is a repository of all things statistical from the community over the years. There is therefore a lot of stuff on it; for instance there is a large data archive with many datasets available for download. Many of these are published datasets that go with specific books. There are also an R archive and an S archive, but plenty of other software as well.

Pmean

This is where Steve's Attempt to Teach Statistics (StATS) is now housed. Steve Simon, who created both this site and the former StATS site, is a statistical consultant who spent many years working at the Mercy Children's Hospital training medical researchers in the use of statistics. There are many pages on the website; just to give you a flavour, he has a page linking to teaching resources, so you can see whether the free stats books are any good.

StatSoft Text book

StatSoft developed Statistica, but they have also provided a free textbook online which covers a wide variety of topics, and I often end up there when I need to give a student a reference to something in particular. No, it does not require you to buy Statistica.

Rice Virtual Lab in Statistics

Another useful website. They too have a textbook, but even better are some of their demonstrations, and better still, I find, are their case studies, which look to me as if they might have come out of Glasgow's STEPS initiative.

Scholarship of Learning and Teaching Statistics at Glasgow

A long time ago, before the e-Science initiative, or even the teaching centres of excellence before that, there was a group of people in Glasgow who were experts in developing teaching software for statistics. They might not be so high profile these days but they still exist. It is worth checking them out occasionally.

Thursday, 1 November 2012

Getting it all together

There has been a heavy NVivo theme to this week. The problem seems to be the sheer energy involved in trying to join everything up. I am failing at this.

Some of it is easy. I have almost got myself into a position to record the next NVivo video. It is coming along, and I have now "found" some items through Creative Commons Search that I can legally use as examples to add to the project. It should take me less time to prepare future videos. The problem is that while it was easy to find a photo or a video, it was a lot harder to find ones that fitted the context of the project. I was however so successful that I got a video that actually links to a document already in the project. It was much easier with NVivo 8, because then you were given videos, pictures and transcripts that you could incorporate.

However, two other things were on the go as well. Firstly, I have said I will teach a session as part of the Social Research Masters programme in December. This was at a time when our technicians were confident they could get NVivo onto the Managed Desktop, which is what is on the machines in the open access rooms. Let me be clear, the Managed Desktop is an unusual setup: we do not have around 500 staff, so we do not have the time to sort out the problems that occur on machines when people have relative freedom to do what they want with an open access machine. Therefore a lot of the system is tied down with pretty tight security. In particular we like to be able to control where people are able to write to; it stops a lot of crap being put all over the system.

However, NVivo is not a tidy program. It does a lot, but a lot of its functionality is built on secondary suppliers' software. The one I most suspect of causing the problem with writing begins with M, and it is responsible for the database that QSR use to store everything in. Basically our technical guys have not found a way to install NVivo without giving it rights that they see as dangerous. That means it is not on the Managed Desktop yet.

This is in marked contrast with the phone calls I am getting from QSR, who are trying to sell this University NVivo in the Classroom, which is basically training for those who wish to integrate NVivo into their qualitative analysis courses. It covers things like course design and marking. These are not issues I have to worry about in my Introduction to NVivo, as it is one session and really only introduces the software. But it makes little sense at present for lecturers to include NVivo in their courses if we can't get the software onto the managed desktop so their students can use it.

I also know that to find a solution to this impasse I need to establish a demand for the software as a taught package. So I am going around in ever decreasing circles. My saying there is demand won't get it done; it is only when someone outside the department starts making a fuss that a solution will be found. I can run a fairly decent NVivo service for researchers without it being on the Managed Service, but as students on taught masters and final year projects are increasingly expected to use it, this starts to fail.

Thursday, 25 October 2012

The Individuality of Researchers

One thing about working in research support rather than in straight research is that you get to collaborate with a wide variety of researchers. I must have worked with several hundred over the years, and they stretch across faculty, gender, age and experience. I have seen everyone from undergraduates doing their research projects to very senior academics indeed. Let me now state the obvious: they were all different, and long may that continue, because their difference is what gives savour to the job I do.

The thing is, it is coming to be believed that a researcher is a researcher is a researcher, and that you can interchange one researcher for another. This mistakes one small truth for a much bigger one. What I will grant is that a fair number of research skills are generic. The ability to locate appropriate literature (information), the ability to cite literature properly, the design of experiments, the use of statistics, writing up research accurately and clearly, and even the approaches to qualitative analysis are actually fairly similar wherever you come across them. So are thoroughness, the ability to concentrate on detail, and the ability to self-motivate and self-manage. The first list is taught, along with numerous other research tasks; the second list, I suspect, comes by natural selection. You simply do not survive in academia without them.

That said, I have never given the same advice to two different academics, even in the extreme case of masters students in Nephrology, where two students were often answering two very closely related questions from the same dataset. The reason is that, as far as I was concerned, the big problem was to push their level of understanding that bit further than it currently was, so that they really would be able to interpret their results. I am pretty sure that with some of them I could have shown them complex techniques and they would have gone away, done them, and not understood the results. My response was therefore shaped not just by the research question but by the abilities and understanding of the researcher. No two researchers bring quite the same combination of skills, generic or otherwise, to a research problem. The path of the research is therefore partly formed by the skills they have.

Secondly, a researcher is nearly always powered by, shall we say, curiosity. Now do not imagine this as some mild sort of interest, the sort of enquiry you might make after the son or daughter of an acquaintance. This is usually completely the other end of the scale. Perhaps it is better to say they are pathologically curious: their questions are rarely idle, and quite often when they find one that is satisfying they feel a compulsion to push it forward. I have seen senior professors literally bouncing on their chairs when an answer comes into sight or an intuition works out. It is this sort of curiosity that really pushes the research. With a doctoral student it is normally fairly tightly focused on their immediate research problem; with more mature researchers the focus is more general and less prescriptive. There may be half a dozen questions they are working on, or they may seek employment across a range of different focuses, but the important question is: does tackling this issue interest me?

This second point is important to grasp. Researchers enjoy recognition, like to be well paid, and often quite like being a bit of a celebrity, but their prime motivational driver is often this curiosity. You will find researchers working longer hours, in poorer conditions, for less financial reward than they could have got elsewhere, simply because being where they are allows them to follow that curiosity. A researcher may turn down a substantial promotion if it would lead them into an area of research that does not interest them. Forcing a researcher to go against their passionate curiosity is often the best way to upset them and get a drop in their productivity. It may also be the best way to stop the research.

This does not mean researchers can't be directed. Most people, even doctoral students, have a wide enough curiosity that it can be directed. If one line looks more profitable than another, they are more than willing to follow it. Also, research interests overlap; we know this, it is why we have research teams and research communities. All right, so these are largely the academic equivalent of trainspotter congresses, but they still have shared interests they are passionate about. Therefore there should be several researchers who could fill any research job, and it is a matter of picking the best for that role.

However, it is a bad idea to employ someone for one research role and then, because of circumstances, place them instead into a second role where they do not fit. Early in my time at this University I ended up "supervising" John, a doctoral student on his third supervisor, as the previous two had left. The first, if I recall correctly, was a supervisor of choice, but the second set him off measuring airborne chalk deposits around various quarries, in order to measure the impact of super quarries on the environment. This he did, but his heart was not in it. The third supervisor was disinterested. He eventually came to me for statistics on a survey he had devised and carried out himself, at his own expense and involving a lot of time and effort on his part. The statistics were pretty easy and simple, but as he came back repeatedly he started to talk of other parts of the survey, where people had given free text answers to questions. These free text questions had elicited highly complex responses (including drawings and such). It was in looking at these that he was finally able to assess the way people thought about the countryside surrounding the quarries, and began to understand what impact further quarries might have. Now I have no doubt another researcher would have found ways to use the metering of the dust, but that was not where John was, and it largely became a wasted year for him.

Thursday, 11 October 2012

ANOVAs or what they did not tell you at Uni

Yesterday I taught a course on ANOVA to the MASH statistical advisors. All the advisors have a statistical background; quite a number of them are doctoral students in the Probability and Statistics department here. So why was I giving a course on such a simple technique as ANOVA, which is normally taught in the first or second year of an undergraduate degree?

Well, I was not teaching them how to do the mathematics, which is what they get at university, that and a couple of neat examples to have a go at. What I was teaching was the application in a research (non-statistical) environment. This meant I had to cover:
  1. The assumptions that ANOVA makes and which are important
  2. How to check these assumptions and why some of the ways people are taught to check these assumptions are wrong!
  3. What to do if the analyses do not meet the assumptions (transformations, non-parametric tests, robust methods)
  4. Extra worries when the ANOVA is multi-way
  5. Carrying out more complex analyses with statistical packages, in this case SPSS
At no point would I say I was an expert, but I am an experienced user. I have learnt from my years of consultancy, but even so I was doing some in-depth revision in order to prepare for the course. Sometimes there are no real answers, so you make the best decision you can on the information you have to hand. It is not helped by the fact that research statistical packages are about thirty years behind the literature, and products like Excel and noddy statistical packages are even worse. If you want to do a modern analysis by statistical standards then you really need to look at something like R; if you want something more user friendly, try something like Stata, or maybe Statistica, which seem to be aimed at statisticians. However, in the non-statistical research environment we are not often going to find people prepared to use them, as they require a significant investment in knowledge of statistics and often of programming. Rather, the attitude is something along the lines of "if t-tests were good enough for the Prof, they are good enough for me". In such an environment all we can try to do is move people up a stage in accepted techniques and explain why t-tests are not good enough. They normally get it when you start explaining multiple comparisons to them. You can then move them on to ANOVA, but then you have to know about ANOVA, and that is where this course comes in. Researchers' data is habitually more complex than they think, so the ANOVA is often not straightforward. Hopefully I have started a lot of people on sorting out their own advice, and we will thereby get a higher level of statistics at the University.
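
For what it is worth, the check-the-assumptions-then-decide workflow the course pushes can be sketched in a few lines of Python (invented data; the course itself used SPSS, so this is just the shape of the logic, not what we taught):

    # A minimal sketch of the workflow, assuming three invented treatment groups.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    groups = [rng.normal(loc=m, scale=s, size=25)
              for m, s in [(10, 1), (11, 1), (12, 3)]]

    # Levene's test for homogeneity of variance, one of the ANOVA assumptions.
    _, p_levene = stats.levene(*groups)
    if p_levene > 0.05:
        stat, p = stats.f_oneway(*groups)   # classic one-way ANOVA
        print(f"ANOVA: F = {stat:.2f}, p = {p:.4f}")
    else:
        # Variances differ, so fall back on a rank-based test.
        stat, p = stats.kruskal(*groups)
        print(f"Kruskal-Wallis: H = {stat:.2f}, p = {p:.4f}")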

Wednesday, 19 September 2012

Rant: On Simple Statistical Packages

There was a time when I was ready to look at every new statistical package that came along, well, almost. The longer I have been supporting statistical packages, the more cautious I have become. Let me be straight with you: even the big grown-up packages like SPSS get it wrong surprisingly often, and new packages aimed at those doing simple statistics get it wrong a lot more of the time.

Now let me be honest, there are some superb newer statistics packages out there; I think of Stata for starters. The thing that distinguishes them is that they are not trying to produce a simple tool for researchers but take statistical complexity seriously! These packages are not any easier than SPSS to use, but when used properly they are powerful tools. What really gets me is the people who think "let's put together something simple for the researcher and forget about the rest of the statistics".

Today I had a first query on a package that is popular with the medical school. The person was doing a simple one-way ANOVA and then post hoc tests. He came to me with his data because he could not get p-values out for the Bonferroni post hoc tests. I suppose I could have explained that Bonferroni does not produce p-values but alters significance levels, and that was why he was not getting any results. He also had a significant F value and no significant post hoc comparisons.

Well, Bonferroni is highly conservative; even the more accurately calculated p-value of Sidak is less conservative, and yes, the difference between the two is that Sidak calculates an actual p-value while Bonferroni uses an approximation. Fortunately the package gave t-values (but no degrees of freedom), so I was able to find LSD p-values, carry out a Holm sequential test, and calculate inverse Sidak values (many of the inverse Bonferroni values would have been greater than 1; I told you it was an approximation). All this is doable if you are prepared to kludge it in Excel.
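
To see why I call Bonferroni an approximation, here is a quick illustration in Python with made-up p-values: the Sidak adjustment 1 - (1 - p)^m is the exact calculation under independence, while the Bonferroni adjustment m * p is its first-order approximation, and it happily sails past 1.

    # Purely illustrative: Bonferroni versus Sidak adjusted p-values for
    # m = 6 comparisons, using made-up raw (LSD-style) p-values.
    m = 6
    raw_p = [0.004, 0.02, 0.09, 0.2, 0.5, 0.9]

    for p in raw_p:
        bonferroni = m * p            # approximation; can exceed 1
        sidak = 1 - (1 - p) ** m      # exact under independence
        print(f"raw={p:.3f}  bonferroni={bonferroni:.3f}  sidak={sidak:.3f}")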

What would have been more sensible is if the package had done what packages like SPSS do, and offered a variety of post hoc tests. He could have run a standard one such as Tukey and got p-values straight out. This is what I did once I had established that this was what he wanted, and I also showed him how to draw the graph, using SPSS to estimate the terms and a graphics package to actually draw it (well, Excel in this case, but any graphics package can do this). It was so simple he was delighted.

So the first thing against the package is that it does not even give the standard comparative post hocs that most of the main packages do, instead using an old-fashioned (think over fifty years ago) technique which has long been discarded unless there are no other options.

However, I saw more: to do this ANOVA the user had to put each group into a separate column. This immediately sounds warning bells for me. The one thing I find nearly all these packages fail to do is deal with levels of variation. When they come to ANOVA they basically treat a paired t-test as a group t-test. NOT GOOD. This leads to wrong interpretation. In fact the wrong presentation of within-subject comparisons is a major bugbear in the literature, and in this case confidence intervals have not been an improvement!
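
The paired versus group distinction is easy to demonstrate. A small sketch in Python with made-up before/after measurements shows how treating paired data as two independent groups can bury a perfectly real within-subject effect:

    # Made-up before/after data: a small but consistent gain per subject,
    # swamped by large between-subject differences.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    baseline = rng.normal(loc=100, scale=15, size=20)        # subjects differ a lot
    after = baseline + rng.normal(loc=2, scale=1, size=20)   # small real gain

    _, p_paired = stats.ttest_rel(baseline, after)   # respects the pairing
    _, p_group = stats.ttest_ind(baseline, after)    # pretends two separate groups
    print(f"paired p = {p_paired:.4f}, unpaired p = {p_group:.4f}")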

To be polite, I will not be recommending this particular package for statistical analysis. I suspect that if you want to do a proper statistical analysis of all but the most constrained research then you REALLY do need to put in the time and effort to learn a proper statistical package. I am not fussed which; I will even allow SPSS*. But please do not bring me something that you say is easier. So far, every such package I have seen has, in its attempt to simplify, misled researchers into doing WRONG analyses.

*When I was a postgrad SPSS had a reputation for this sort of thing. Now, some twenty years on, SPSS has developed, and there are so many far poorer packages out there that I do not feel the criticism can be sustained.

Thursday, 13 September 2012

Authorship and Culture in Research

I am in the funny position of being a researcher whose value is not assessed on the number of papers I am an author on. I am therefore not particularly interested in authorship, but it is useful to count papers and acknowledgements, as they become a measure by which the department can show it is contributing to research at the University. Eventually I suspect I will have to keep detailed notes, but at present it is just a matter of reporting.

There is however a distinct divide between the two areas. In medicine these days I get authorship on about half the papers. The reason is that I am a statistician, and we have found over the years that having a statistician as an author tends on the whole to reduce the number of analysis queries from reviewers. Besides, in medicine it is normal to have many authors on a paper. It is also easy to track papers in medicine, as they are all online!

Now come to the arts and social sciences. I am rarely if ever an author, but I am frequently mentioned in the acknowledgements. This is fine; it is the culture of the discipline and the way it works. There is a slight problem in that usually I do not know I have been acknowledged unless I search for myself and find it, and the proportion of published papers that are online in the arts and social sciences is a lot smaller. It would be a courtesy if they sent me an email when they acknowledged me in a paper!

Thursday, 6 September 2012

Systems and getting things coordinated

This time I am going to talk about the frustrations of running something efficiently. One software package I keep a watching eye on is SAS. Now let me be honest, I do not use SAS, but I do give some sort of support and, more importantly, I look after the distribution of license codes at this university. As far as the University is concerned, SAS is a bit of an awkward package. It is expensive (it costs us about twice what SPSS does) and it is used by very few people here (I make it around the 30 mark; given that we have hundreds of SPSS users that is a very small population, and we probably have a larger Stata one, and they buy it themselves). However, those users are scattered across many departments! The result is that if we buy a license it costs about a third of what it would cost the University if each group bought their own.

However, for a group this size it is not worth putting a technically complicated system in place. So we have an email list, and each year I subscribe the people who ask for the SAS license to it, removing people who had it the previous year once I have sent the email saying the new codes are in. In my absence the helpdesk can do this.

You'd think, given that there are so few of them, that they would all be on a single version, but no, we have people on 9.1.3, 9.2 and 9.3 (some of them running two versions). And while I do all this I have to keep everyone in the department who needs to be in the loop, in the loop.

I am never sure when the codes should arrive; I used to be, but SAS kept changing the date, so this year I was not aware we were running out until I started getting emails. Then I have to ask for the codes from the person in charge of licensing, and finally I need to contact everyone to say the new codes are in.

Then of course people start asking for codes we did not get the first time round, so I have to go through the loop again. And then there are a couple of weeks while people get around to asking for the codes. So it takes time.

Wednesday, 29 August 2012

Playing with Google Scholar Profile

Now, I know I am academic-related staff, but because I am a statistician I have for a while now quite often been a minor author on academic papers. There are advantages to this for the other authors: statistical reviewers are far less likely to pick a fight over statistical niceties if they know you have consulted a statistician, and if a statistician is there as an author then that is clearly the case. Mind you, I can think of at least one case where I have had to redo tables because, although the authors used me for the statistical analysis, which was all carefully adjusted, they then went and produced the tables from the raw data without any adjustment.

Well, a few weeks ago I found out that Google now has Google Scholar profiles. A profile actually keeps tabs on the times your papers are cited. So I decided to have a play and see what I could turn up. I am using my work email, which is NOT recommended by Google, as academics move around and universities tend to close email accounts when staff leave. Remember I am academic-related, not an academic, and I am therefore not playing the same games as most academics; my performance really is not assessed by my citation index. Secondly, when I am doing my own research it rarely has any relationship to any of my papers. I suspect I will at some stage have to put together my other research papers, but as at present they amount to exactly one, in a very small journal, I am in no hurry.

So what do I get? Well, you can view my Citation Profile, which will give you an idea of the papers I have written, and yes, I really am second author on that first paper. It was my first ever paper, and I was a very junior statistician who happened to be good at writing databases. That study needed ongoing statistical and database support, so I was allocated to it by my boss. It was also very much a one-person team, and A Webb really is the sole researcher.

Now here are a couple of niceties. When I go onto Google Scholar logged in on that account, I get the following screen:
[Screenshot: Google Scholar page]
There are some things you should note: firstly the bit that says "New! Scholar Updates: Recommended articles for you", which links to this entry on the Google Scholar blog, and secondly the "My Updates" link at the top, which leads to a list of papers I might be interested in, shown below:
[Screenshot: My Updates list]
Now, I do not mind showing you this; it will not tell you much about my research interests, and it should be unfocused, as I work with a number of research teams. I can't really see myself using this regularly, although I must admit I can see, in some of the studies, areas of interest that one research group I work with may develop. I suspect that when the next paper is out things will change again, as it suddenly starts thinking I am working in another direction. However, if I were an established researcher in a single field it might well be a way to keep up to date with who is working in that field.

Thursday, 16 August 2012

Cautions over comparing coding in NVivo

This is a blog post in two halves. The first is totally practical and will tell you some things you must watch out for if you are going to compare codings. The second is theoretical: I want to sound a warning over the use of a particular statistic in qualitative analysis for comparing codings by different raters.

Right, the first part is probably best told as a cautionary tale. A student turned up this week wanting to compare his coding with that of another student. The problem was that when he imported the other project into NVivo 9 it saw the documents in the two projects as different documents, making it very difficult to compare the coding schemes. It took us a while to find out why, and in the end it was because he had made minor edits to the files after the copy was taken for his friend to code on. That meant he had to go away and transfer all his friend's coding onto his own data files before he could do the comparison. There are ways to avoid this.

Firstly, if you have coded your data and are now going to take a clean copy for a friend to continue coding on, I suggest you also take a backup copy of your own project to compare against, just in case you get itchy fingers.

The second is a tip for getting around this in NVivo by making sure NOBODY edits the files, including yourself: print them to PDF before you import, and only import the PDFs. To do this you need to download a free print-to-PDF utility such as CutePDF and use it to produce the PDFs. Once these are imported into NVivo you cannot edit them, so there should be no problem with anyone editing them.

However, now my concern. He went on to say he was going to calculate a Cohen's Kappa, and was surprised that I knew what it was. I know it because it came up regularly in my statistics degree and I have calculated it a fair number of times. There has nearly always been debate about its applicability, and when I was a junior statistician it was usual to report percentage agreement alongside it. The thing is, I have always used it for agreement on categorical data where I have clearly defined units, such as children being graded by two teachers, or two radiologists classifying tumours in the same way from x-rays. Child and x-ray are clearly defined coding units, and the grades are normally exclusive.
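
In that clearly-defined-units setting the calculation itself is routine. A minimal sketch in Python (invented grades, two hypothetical raters, not anyone's real data):

    # Invented data: two raters each grade the same ten children A/B/C.
    # With defined units and exclusive categories, kappa is unproblematic.
    from sklearn.metrics import cohen_kappa_score

    rater1 = ["A", "A", "B", "B", "B", "C", "A", "C", "B", "A"]
    rater2 = ["A", "B", "B", "B", "C", "C", "A", "C", "B", "A"]

    kappa = cohen_kappa_score(rater1, rater2)
    agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
    print(f"kappa = {kappa:.2f}, raw agreement = {agreement:.0%}")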

Now, the big problem I have with qualitative coding is that there is no nicely defined coding unit, and it quite often happens that the same bit of text is coded to two different codes. Think about it: you might want to code the short sentence "John wore a red dress to the party" to both cross-dressing and the use of red as a colour. This is perfectly sensible qualitative coding. One might simply look at the presence and absence of coding on a sentence, but then there is the problem of the un-coded units. That is perhaps easy to deal with if the coding unit is defined as a sentence, but if you have "red dress" coded to red and "John wore a red dress" coded to cross-dressing, then what is coded in both instances is less than a sentence. Equally, there may be times when a code covers several sentences. And if being a coding unit implies being coded to something, how do we know how many coding units there are that neither rater has found? I don't have a clue.

So I checked the literature and came across the paper "Content Analysis Research: An Examination with Directives for Improving Research Reliability and Objectivity" by Richard H Kolbe and Melissa S Burnett, which says:
"However, the use of kappa is difficult in content analysis because a key value, the number of chance agreements for any particular category, is generally unknown."
This is basically what I am getting at: the underlying fluidity of the coded unit means I would have grave reservations about using Cohen's Kappa. You can read the paper to see what alternatives there were at the time, and I suspect more have come about since. I think I probably need to do some serious literature searching on this when I have the time, so that I can give accurate advice.


Thursday, 9 August 2012

Finishing an analysis

What does it mean to finish an analysis? There are perhaps three different points at which you finish one, and I have hit all three this last fortnight.

When you stop analysing: this was actually the last of the three to happen this week, but it is when you draw a line under the analysis and say I am not going to explore anything more, no matter how interesting it appears. I had this with someone for whom I just have to do the confirmatory factor analysis and write it up; that will be the last stage of an analysis that has covered everything from simple descriptives through principal components analysis, correlation and so on. I am not writing this one up, for which I am truly grateful, as it is a massive sprawling analysis and I am not sure it tells us anything particularly new.

When you write the report: this is different. In some sense I got to the previous stage before; all I am doing now is repeating it, making sure I dot the "i"s and cross the "t"s. I know the story the data tells; all I am doing is thinking how to present it so that other people can incorporate the analysis I have done into their work. Almost essential for writing publications. This I do only when I am analysing the data for someone else. I had this with a dataset I got about four months ago that just needed me to tackle it with a clear head.

When you do the bits after a journal review: when you make the alterations to a paper after a review. The report was written months ago; the academic (in this case a prof) has written it up and submitted it to a journal. The journal has come back with reviewers' comments, and some of those involve the stats. Normally this is the point at which you discover they decided to present the basic data, because you had not included the nice tidy table that they wanted, and rather than trouble you they went and did it themselves. This means you have to sit down and produce the results exactly as they came from the analysis. If you are organised you have all the analysis neatly put together in a folder. Unfortunately, all I could find was output that was not the exact output, so I had to recreate it. This is always risky, as there are so many nuances around an analysis.

So which is the end? I hope the first, but often it is several passes through the third before you can put the data away safely.

Anyway, I also started one analysis this week, and I am quite sure there will be more along shortly.

Thursday, 2 August 2012

When Answering a Query is not straightforward

This is the story of what should have been a straightforward query, which still has not been answered well over a week after it was asked.

Way, way back (many centuries ago, well, not quite, but well over a decade), Microsoft Windows included a technology called ODBC (if I believe the link, it is actually a C language API, but it came with Windows). If a program in Windows could use ODBC then it could access all sorts of other data, in the case of the current query a Microsoft Access database. It allowed you to write SQL queries against a database without being in that database's native environment. This was great for getting data into SPSS when the driver existed, but there were some awkward bits to it. It was definitely the sort of programming task you tended to leave to programmers. I used to have notes on how to do it.
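
For a flavour of what ODBC looks like from the programming side, here is a minimal sketch in Python using the pyodbc library; the database path, table and column names are placeholders I have made up, not the ones from this query:

    # A minimal sketch of reading an Access table over ODBC, assuming the
    # Microsoft Access ODBC driver is installed; names are placeholders.
    import pyodbc

    conn_str = (
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
        r"DBQ=C:\data\study.accdb;"
    )
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    # Ordinary SQL, without ever opening Access itself.
    cursor.execute("SELECT patient_id, age, outcome FROM results")
    for row in cursor.fetchall():
        print(row)
    conn.close()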

So a query came in saying this was not working. So I think: go into SPSS, have a play, and see if you can remember what the problem is. First the good news: it is easy to duplicate the fault. Now the bad news: the workaround isn't working, because the original drivers are not installed.

Next step: try installing the drivers again. As far as I can recall, the drivers came on an SPSS installation disk. Only they do not come with the latest one, nor do they appear to have come with the previous version. Hmm. Time for me to go and see if I can download them.

Then I hit trouble, big time. Let me explain: I am the basic support person for SPSS at this University, but the person who looks after the license (because that is finance) is someone else. When the transfer from SPSS to IBM was sorted out, it was done totally in the name of the finance person, who also downloads the software for distribution. All would be fine, except I am on the IBM SPSS system as a beta tester. The result is confusion in the system. I am there, but I am not there. Added to which, the support system is complex, and I am never sure where I am supposed to go and where not.

After two days and numerous phone calls involving a whole range of people both here and at IBM, I am able to download the drivers. I still have to install them and then see if I can get them to work. Once I have got there I can try to sort the query.

Just a thought: I wonder if they can export the query results from Access as a comma- or tab-delimited file and then read that into SPSS. When in doubt, the cludge often works.
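
If they can, the SPSS side is straightforward. A sketch, with the file name and variable list invented for illustration:

    * Read a comma-delimited export from Access.
    GET DATA
      /TYPE=TXT
      /FILE='C:\data\access_export.csv'
      /DELIMITERS=","
      /ARRANGEMENT=DELIMITED
      /FIRSTCASE=2
      /VARIABLES=id F8.0 age F3.0 score F8.2.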

Thursday, 26 July 2012

One of the highs of Research Support

This really should have been last week's blog, as it happened last week, but I needed to finish off the already-told story so it had to wait, and nothing better has happened this week. So I will tell of an incident that is one of the highs of doing my sort of research support and keeps me passionate about it.

I have worked on and off with students from the Sheffield Kidney Institute, and last week one of them, whom I had seen before when he was doing his masters, came with another student who is in the final year of her doctorate while he is only starting his. Anyway, it was one of those cases where the data was not finalised yet. That in itself is a good sign: a student who knows they can talk with a statistician before they finish entering their data can save themselves an awful lot of time and effort later on.

Anyway it was good to have a play, and I was able to demonstrate some of SPSS's abilities, including drawing ROC curves, which measure how well a test distinguishes signal from noise. This is important because their research is about finding ways to distinguish someone who is likely to deteriorate from someone who is going to maintain their current level of kidney disease. They are dealing with people who are at various stages of the disease, from various causes.
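
For the record, a ROC curve in SPSS is a single command. A minimal sketch with invented names, where test_score is the candidate marker and deteriorated is coded 1 for the patients who got worse:

    * ROC curve: how well does test_score separate the two groups?
    ROC test_score BY deteriorated (1)
      /PLOT=CURVE(REFERENCE)
      /PRINT=SE COORDINATES.

The area under the curve is the headline figure: 0.5 means the test is no better than a coin toss, 1.0 means perfect discrimination.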

The focus of the final-year student's doctorate had been to develop a new and better predictive test; standard tests already exist. She only has a small part of the data in so far, but I did a ROC curve for the standard test and then chose one of the other candidate measures more or less at random (well, I think it was first in the list, but apart from that there was nothing to tell it from any of the others). The curve came back clearly performing better than the standard, and her face lit up.

She had spent the last three-plus years developing that test with no idea whether it would be better than, as good as, or worse than the current one. What that one simple graph was telling her was that her hard work had paid dividends. No doubt the analysis for the thesis will be a lot more detailed, a lot more complex, and a lot of hard work, but knowing that in the end it worked somehow makes the statistics appear less of a chore.

Let's be clear: it's her hard work that has done it, and none of my bits of wizardry would be any good without it, but being able to tell people they have succeeded in what they have been working hard at is a great buzz!

Yes, one of them is back already asking for further help.

Thursday, 19 July 2012

Why we did not use Canonical Correlations in the end

Last week I wrote about learning to use Canonical Correlations. This week I am going to tell you why in the end we did not use them for the study.

Let me give you some background to the study first. It is a doctorate from modern languages looking at the acquisition of English by students in Pakistan. The student has collected multiple data sources, among them two surveys, one of teachers and one of students. The student is hard-working and is really trying to link up the data. Both questionnaires had sub-questionnaires covering things such as beliefs and classroom practice; the students' one also contained a sub-questionnaire covering their own learning practice. Although these sub-questionnaires often had thirty or more questions, only forty teachers answered, while there were over three hundred students.

Firstly, I did spend quite a bit of time reading up on Canonical Correlation, and that reading persuaded me that I really did need to get the Canonical Correlation macro running in SPSS. This was not as straightforward as it sounds. There have been changes to the way SPSS handles files over the years, and the macro had not kept up with all the iterations, so I had to sit down and work out why it was going wrong. In the end I cludged a solution so it ran; by cludged I mean I put in a fix that works in this specific case. It does not mean I can pass the file to anyone else and it will run for them. It won't, and if I am honest I really should go back and revise it so it does. I got it working on a test set.

The second stage was to run it on the teachers' data. There were two ways we could do this: with the raw questions, or with the factors we had previously calculated from the data. Canonical Correlations did not work for the first approach, as the questionnaires are too long for the number of teachers, so the correlation matrix was indeterminate (you need at least as many cases as questions). So we went on to the factors. This ran, but the results were uninterpretable, because the equations were operating on already-rotated factors.

So we did a rethink.

The factors were created using Principal Components Analysis with a Varimax rotation to aid interpretability. The nice thing about this analysis is that the factors you get out at the end are uncorrelated. So we had two sets of factors with no correlation within either set, and we very simply correlated one set of factors with the other. Since within each set the factors are independent of each other, the results were easily interpretable.
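
In syntax terms the whole fallback is two stock commands. A sketch with invented variable names, run once per questionnaire so that each run saves its factor scores (FAC1_1, FAC1_2 and so on are the names SPSS generates for saved scores):

    * Extract uncorrelated factors from one sub-questionnaire, save the scores.
    FACTOR
      /VARIABLES=belief1 TO belief30
      /EXTRACTION=PC
      /ROTATION=VARIMAX
      /SAVE=REG(ALL).
    * After a second FACTOR run on the other set, correlate the two sets of scores.
    CORRELATIONS
      /VARIABLES=FAC1_1 FAC2_1 FAC3_1 WITH FAC1_2 FAC2_2 FAC3_2.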

It's almost Canonical Correlation, but not quite, and the student could write about it quite clearly. One happy student; one much more relaxed statistician.

Thursday, 12 July 2012

Getting my head around Canonical Correlations

This week has been easier in that I have actually found some time to do some development work. Not much, but some. There was a session run by our Creative Media team, although if I want to hire any of the equipment I must go to Audio and Visual as I am staff. That is not going to be a problem come the autumn, as they will be working from the same building as me. There was also an introduction to Screenr, which may be a quick and dirty way to get how-to videos out to people. I still need to think about that.

The other thing I have been doing is looking into Canonical Correlations. I knew they existed beforehand; they were briefly mentioned in a Multivariate Analysis course I did as part of my Masters at Reading University over twenty years ago, but we were introduced to a huge number of techniques in a very short time, and all I retained was that they existed and that there are ways of carrying them out in SPSS.

At first glance they look like a beguiling solution to a lot of problems and a natural extension of the General Linear Models most people use. Examples of General Linear Models include the t-test, ANOVA, linear regression, and multiple linear regression; the extension that includes Canonical Correlations would also include factor analysis and Discriminant Analysis. They basically allow you to have multiple dependent as well as explanatory variables in a regression. As such they seem to do for regression what MANOVA does for ANOVA. This is reinforced in SPSS by the fact that the "easy" way to carry out the analysis is with the MANOVA procedure.
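
For reference, the MANOVA route is the recipe given in the textbooks rather than a dedicated procedure, and it is syntax-only. A sketch with invented variable names standing in for the two sets:

    * Canonical correlation via MANOVA: one set WITH the other.
    MANOVA attitude1 attitude2 attitude3 WITH practice1 practice2 practice3
      /DISCRIM ALL ALPHA(1)
      /PRINT=SIG(EIGEN DIM).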

The problem is that the complexity of the results makes it difficult to interpret exactly what is related to what. I think there are three sets of equations: one gives the "factors" from the dependent variables, another the "factors" from the explanatory variables, and finally there is the relationship between these. I found a paper by Alissa Sherry and Robin Henson which gives a fairly gentle introduction, and I am now working through Tabachnick and Fidell, which is a standard textbook for people like me. It teaches you all the things to worry about.

To some extent I am beginning to get there. The challenge is to use it in anger next week with real data, this time on learning English in Pakistan, and see if it will give us interpretable results.

Thursday, 5 July 2012

The Challenge of an introductory course

For good or ill, I have been teaching what I have called "An Introduction to NVivo" for the last two years. It was put on due to popular demand, and at present I have to run it every two months to keep up (technically I am not quite managing that).

The odd thing is that in many ways a two-hour talking-head course is NOT a good way to teach NVivo; it would not be my preferred way. It is really meant as a taster, something that can give people a flavour of NVivo and get them to explore it for themselves. Even as that it is barely acceptable, and it leaves me drained when I do it. The fact that I am doing it on out-of-date software does not help.

Yet the course is hugely popular, and I regularly leave people enthused about the package. It is one of the ironies of life. The group I taught today was relatively easy; there was at least one existing user among them, and his enthusiasm was infectious. I think we might even have persuaded some of the computer-shy people that it was worth a try.

What I have to remember for next time is to be quite deliberate about pointing people to other resources. Even the guy who was already a user was interested to hear of the two-day courses where you can take along your own data and get help from an NVivo expert. The problem at Sheffield is that I am the NVivo expert, and I am very aware of huge chunks of the package I just don't know about. The student who was overwhelmed by the course was guided by him to the help on the QSR website and was able to see that they had videos and tutorials that might really help her, and which she could take in small, bite-sized pieces. I think I have at least three more users this time, which is all I aim to achieve.

If my aim with the SPSS course is to get people from paranoia to where they feel it's merely a chore, then my aim with this course is to get people to the stage where they feel it is worth giving NVivo a trial. To do that I have to overcome a level of scepticism and persuade users that the program is not going to take over the analysis (and disappoint them that it won't produce p-values); I also have to persuade them that being computerised does not imply distance. As one guy on my last course said, "this is not about taking you away from your data, it's all about keeping you close to it". If I can get that across, no matter how tired and drained I am, then the course has succeeded.

Thursday, 28 June 2012

Transforms, Endpoints and fixing things

Statisticians are often slightly inconsistent, and I am going through one of those times at present. Yesterday I was dealing with a data set in which the outcome variable was a count from 0 to 5 (how many of the five plants in a block had survived), and I was quite happy ignoring my sensitivities about underlying distributions and just putting the numbers in. Well, that is not quite true: I was having difficulty getting SPSS to give me results that allowed for the scale, and when I got an analysis that was vaguely right it refused to give estimates. So I went back to "let's pretend it's normal and see what we get". Two things made me confident this was OK. First, we had three replicates, so the estimates were not for five plants but for three plots each with five plants in it, which means the means are likely to be closer to normal. Secondly, the overall results were consistent with the more complex analysis.
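
For completeness, the "proper" model I was struggling towards treats the count surviving out of five as binomial. A sketch of the two approaches with invented names (GENLIN is the generalized linear models procedure in recent SPSS versions; whether it matches exactly what I attempted is another matter):

    * The model I was aiming at: survivors out of five plants per plot.
    GENLIN survived OF n_plants BY treatment
      /MODEL treatment DISTRIBUTION=BINOMIAL LINK=LOGIT.
    * The pragmatic fallback: pretend the counts are normal.
    UNIANOVA survived BY treatment block
      /DESIGN=treatment block.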

Today I am having problems with a different set of data: this has a 30-point scale, although nobody scores below 4. However, there are many people who score 30. When groups start hitting the top end of a scale, the scale becomes insensitive. There are ways of "fixing" this to some extent. The logit transform and the allied logistic transform both help a little, by stretching out the points that are approaching the end of the scale. They are not perfect, because you still have to keep everyone who scored 30 together, and you know that some of them really should have scored 31, 32, etc. The problem is you do not know which.
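
The transform itself is one line of syntax. A sketch, where the half-unit offset is an arbitrary choice for illustration; it keeps the endpoints away from 0 and 1, where the logit is undefined:

    * Rescale a 0-30 score into (0,1), then apply the logit.
    COMPUTE p = (score + 0.5) / 31.
    COMPUTE logit_score = LN(p / (1 - p)).
    EXECUTE.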

However, there is a bigger problem with this data set than that. There is a condition, and people with that condition automatically score 27 or less; that is, normals/controls score 28, 29 or 30. Now I really start asking whether this scale is measuring anything in controls except perhaps what sort of a day a normal person is having. In other words, the sensitivity of the scale with normal people is so limited that I suspect it is swamped by other effects. They do have pretty big data sets, but there does not seem to me to be any evidence that anything else is going on in the controls. I do not think the logit or logistic transform is doing anything here.

So there I am, with one dataset quite merrily ignoring what many would see as obvious problems, while with another I am worried to the extent of wondering whether there was any point in collecting the control data at all. Oh, the contrariness of being a statistician.


Friday, 22 June 2012

Struggles with getting things from the Web in NVivo 9

All right, NVivo 10 is out and it has web integration, but I have not tried that out except very peripherally. What I do have is a collaborative project that is using the web. This is largely about static web pages, and it should be fairly easy to handle those, right? Well, in my experience, wrong.

I started off on the preliminaries using fileshot pro. You'd think it was ideal: all sorts of facilities beyond your normal PDF printer, specially designed for websites. You'd think. However, when we came to use it on commercial websites listing products we had searched for, we got completely black PDFs. Needless to say I did not shell out for a copy when the trial period expired, and when I remember I will be uninstalling it from my browser.

So I went to the Bullzip pdf printer, and this works fine, or at least it does on my machine: it produces pages made up of text and pictures which import successfully into NVivo, where I can use the coding facilities on them. When the student who is actually doing the hands-on work comes to use it, it produces pure pictures. Why? I do not know. I have not found any settings I can change that will let her produce PDFs like the ones on my machine. Cutepdf does exactly the same trick. This means I have to download all the webpages.

Then we come to problem number two: of the twenty or so pages we are actually looking at, three have a video on them. Having those videos would be nice. What is more, it is fairly simple to download the videos in mp4 format, and officially NVivo reads mp4, so it should be straightforward to include them. Only when we try, we get the message that NVivo does not recognise the format. Oh dear, now we have to deal with that. So we looked for a converter that would work. The first didn't, and we promptly uninstalled it, but FormatFactory does, and will convert these files to avi format, which NVivo 9 will read.

Of course this is all technical stuff; it won't get mentioned in any paper or thesis we write, and it will not appear at all in the how-to books, but it has taken about three hours of work for two of us to get that far, plus many hours wasted by the student because the software was not working together. Maybe next week we will be making progress on the actual analysis.

Thursday, 14 June 2012

New Versions of Research Software are coming up

This week has been dominated by new versions of software. It never rains but it pours in this area, and all I need now is for Minitab to announce a new release. There are new versions coming of both the major packages I support, SPSS and NVivo.

Firstly, I have been helping to beta test SPSS 21. That is right, another version is coming soon; I expect it before Christmas, and I hope they are sticking to a once-a-year release. Some of the new bits are very nice. You can now easily get descriptives for any variable just by clicking on the variable's row header and selecting the option. This is nice for people like me who end up dealing a lot with other people's data: when something goes wrong in an analysis, quite often the first thing I want to do is look at the basic frequencies and descriptive statistics to see if the numbers are what I was expecting. They have also integrated SQL terminology into the dataset-merging facilities. SPSS's "Add Cases" and "Match Files" commands pre-date SQL. I can remember a presentation from around 1990 in which someone from one of the big database companies talked about how they had just introduced SQL into their product. Although I was relatively new to SPSS then, its merging abilities had already been there for quite a time and were mature technology. They were admittedly a poor cousin to those in SAS (even then it was difficult to think of a way of reorganising data that SAS could not manage), but they were there. If this works from the menus then it will be good, but at present there are definite glitches and I am not holding my breath. It also looks as if there may be some simulation algorithms attached, for the statistical techies who do forecasting. My guess, from CICS's point of view, is that SPSS 21 will be available for purchase about February 2013, with the possibility of an upgrade on the managed desktop for September 2013.
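
For anyone who has not met them, the two existing merge commands look like this (file and key names invented for illustration):

    * Add cases: stack two files with the same variables.
    ADD FILES /FILE='wave1.sav' /FILE='wave2.sav'.
    * Match files: a key join, roughly SQL's join on id.
    * Both files must already be sorted by the key.
    MATCH FILES /FILE='demographics.sav' /TABLE='scores.sav' /BY id.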

Secondly, QSR are in the final stages of launching NVivo 10. I have downloaded a trial copy and am having a play with it on my machine. It has quite a lot of features for working with the Web, including Evernote. Now, I have not used Evernote, but quite a few people are promoting it for academics, and this will be useful if one of the research bids I am involved in comes off. The other thing is that it claims to handle more data, and to do it quicker. It is still fairly slow on my machine, which is a decent machine, and large datasets did cause problems with version 9; the problem is that QSR added functionality without realising the resource demands of large datasets when they include things like videos. So yes, I will be glad if this makes it more reliable. You can find out more from QSR's What is New in NVivo 10. If you used 9 and upgrade to 10, your knowledge is still applicable; the only thing you need to watch is that you do not revert to working in 9, as 9 can't read NVivo 10 datasets. It is not on campus yet, but if you want to play you can download a trial version for thirty days. I expect there will be a wait for the full licensed copy to come on site. We have just renewed for five years, so don't worry, it will come.

Then there are other bits and pieces. Research has been revolutionised by the web in ways that are difficult to comprehend. When I started twenty years ago, most researchers spent a fair amount of time physically searching for articles in the library. It meant that your information was controlled very much by what other people in your discipline were talking about, as you had to come across a reference to a paper to find out that it existed. First came big searchable databases of references, so that you could look for keywords; then came the likes of Google. Today, if I want to know what is written on a topic, my first port of call is Google Scholar; only if I can't get something online do I use the e-library, and only when that fails does a trip to the library follow. So it is with interest that I hear about scholr.ly, which uses Google but tries to present the results in a scholar-friendly way. I will give it a go and see if it adds anything.

Thursday, 31 May 2012

Tedium of Finishing Off

This week has been a finishing-off week: two papers were coming up to the line where they are ready to be sent out. On the surface you'd think they were both the same, but the feel was very, very different. I also saw a PhD student who is planning to submit next week; I sincerely hope he gets through his viva without major reanalysis. I have come to dread opening his data set, as there are always more errors in it. However, I prefer not to think about that.

So let's take the first paper: it was a short communication on a piece of analytic work. The problem was that, despite having written reports and such from the data, we had always been so constrained that we had never played with it. So we had to sit down and check what we were finding. We took a very specific route, and the statistics suddenly started giving useful results. This is a data set that, by the standards of quantitative work, was messy and likely to contain a lot of recording error because of the collection techniques, so it was with some surprise that things started to make sense. It took us some time to get our heads around things (inverse correlations always seem to be more difficult to interpret), but once done we were able to write in a very short space of time.

The other paper was a fairly simple qualitative paper which had been to one journal, been rejected, and thus needed quite a bit of rewriting. It felt completely different: the work was much more about balancing our analysis against the findings in the literature, and we regularly had to take a sentence, ask what was being said, and then revise it into something that actually was of academic standard. There was also a need to update all the references. We did go back to the analysis, but when we did it was simply to the output files from last summer. Much nicer, much easier, and it means I know I am using a consistent version of SPSS. So no new analysis, but plenty of detailed writing.

Anyway, hopefully both papers will be off to the journals in the next week or so. I hope we have some success.

Wednesday, 23 May 2012

Water and Long Haul Statistics

Last week the paper Water intake and post-exercise cognitive performance: an observational study of long-distance walkers and runners was published, and I am one of the authors, as Margo felt the analysis was going to be tricky. I suppose I could have talked about it then, but last week was dominated by the move, so here is my reflection as the statistician on the project.

It was a paper that started out as a student project, well, two student projects actually, although in the end we concentrated only on the water half; the other half, which looked at nutrition in general, did not make it in. What was unusual about this study was that it was a field trial (to adopt the agricultural terminology), not a laboratory- or clinic-based trial. Field trials are known for their complexity, and that was the case here.

Let's start with the basics: this trial took pre- and post-race measures of cognitive performance in long-distance runners. Firstly, the students had no ability to determine who went into which race; that was the choice of the athlete, who decided to join the study after entering a race the students were attending. Secondly, we had no control over liquid intake before or during the race; we had to rely on self-reporting by the athletes. Thirdly, we found there are few reliable ways of establishing someone's hydration status in a one-off setting; most of the highly technical methods are either not appropriate for a race situation or only reliable if repeatedly administered.

This is one of those cases where it looks simple, a paired t-test will do, and it quickly becomes apparent that no, it won't. The biggest problem was removing the confounding. I went through several analyses where the results were incoherent (did not make sense), yet consistent enough to suggest there was something going on, though not enough for me to characterise it satisfactorily. So I had to sit back down, reconsider the assumptions we had made, and re-analyse. I think it was on the fifth attempt that it finally hit me that we had a problem with the pre-race scores and pre-race water intake. If cognitive scores depend on hydration, then that is true of the pre-race scores as well as the post-race ones. So, to get a cognitive performance score that was neutral before the race, I had to adjust the scores for water intake over the previous twenty-four hours; I could not just use the raw scores. It worked: at last we got a consistent and coherent story. Then all we had to do was write the paper.
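
One standard way of making that kind of adjustment, and this is a sketch of the idea with invented variable names rather than the exact analysis in the paper, is to regress the score on prior intake and keep the residuals as the adjusted score:

    * Adjust the pre-race score for water intake over the previous 24 hours.
    * The saved residual is the score with the intake effect removed.
    REGRESSION
      /DEPENDENT pre_score
      /METHOD=ENTER water_24h
      /SAVE RESID(pre_adj).

The adjusted score then feeds into whatever comparison comes next.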

However, in writing it up we discovered that we should have taken pre- and post-race weight, and I suspect self-reported normal weight as well. It is one of those cases where hindsight is a wonderful thing.

It took one year from student placement to analysis and another year for the paper to be produced, but what had been very much a first simple attempt at doing something had borne fruit.

Thursday, 17 May 2012

On Chucking Things Out

This week I am moving offices. Not just that: the department has decided we need to do a massive clear-out before we move, as they do not want to pay to move things that will never be used. I am having some problems with this edict.

Firstly, I am a statistician by training, and culturally statisticians are hoarders. That is, they tend to store data indefinitely on computers, keep copies of long-forgotten analyses, and so on. When I first started working as a statistician, half the office the four of us shared (it wasn't big) was taken up with computer output, and I believe there were stores elsewhere as well. Even when we got half of a much bigger office, under all the waist-height shelves were boxes full of output and paper copies of data. You learn quickly about deleting things too soon, and I did that often enough in those early years.

I developed a method of working when I was in charge of an analysis; it involved big ring binders. I would do the analysis, print it out, put it into a ring binder, and then summarise the results at the front of the binder. These analyses were kept. If you knew my approach to most other things, you would realise what a wholehearted embrace of the culture this was. The ring binders were then kept forever, just in case there was a need to return to the analysis. To statisticians there is no such thing as a dead analysis whose output can be junked.

I have had to be radical about this and say that files where the analysis is several years old, has not been recently consulted, and has been published really should be counted as dead. If I did not know whether it had been published, I was to presume it dead as well.

I am also semi-pathological about data, but these days data is computerised. However, until fairly recently I kept all the hard copies of surveys conducted about fifteen years ago: if the worst came to the worst, and we wanted to go back to them after the computer network had been wiped out, we could always re-enter the data. The oldest data set I have on the computer is one of those sets; it dates from August 1996. So the computer copy outlasted the paper ones in this case.

These are on computer, so none are being lost, although I will back up my hard disk before we move. I must also look into creating a better backup system. The work is now fairly organised, at least as far as the data goes: by senior researcher, then folders for topics; or, if a student, under queries and then the student's name. The main problem is that a lot of the time the data is shared between my computer and the consultee's, and there can be problems knowing where a file is.

Secondly, manuals are a tool of my trade. I have used them for answering users' queries since I first started this job, and my reading of them gave me a head start in my previous ones. Sometimes, with a good manual, you don't even need a copy of the software. Part of my job seems to be to rtfm for other people. Needless to say I like having manuals around; they feel like a safety net for the times I do not know the answer.

However, paper copies of manuals are disappearing (so much so that I sometimes print copies of the PDFs when I need a paper copy and can't buy the manual). Also, I do not use all manuals equally. The SPSS syntax manual is the one I consult most frequently (I wonder whether it is time I got a new copy, as mine is ten versions out of date, although syntax changes more slowly than most people think, and whether I even can get one). At the other extreme are the S-Plus extra-module manuals, which after a decade aren't even out of their wrapper, and we don't support the software anymore. Some of my manuals are old, very old; some had been in the department longer than me.

However, the move has been towards more and more online manuals. I have a similar attitude to master copies of software, especially having been caught out without them. Universities seem to be the place for users of historic software.

So I have had to have a cull: I got rid of everything for software we no longer support, or where more up-to-date alternatives were available online. This meant chucking out one huge lot of manuals that were over twenty years old, three versions out of date, and which I had not consulted in the last two years.

The thing is, I am doing this only because these things are on paper. If they were computerised there is no way I would be; there would be pressure on me to preserve them even longer, well, maybe not the manuals, but definitely the analyses. It seems to me the demand to archive data is an outcome of the present computer age; before it, archiving was difficult and therefore not attempted. Only the highly methodical kept notes and data carefully enough that they were still around when they died, and then, if they were important, they were archived. Things weren't transportable, so keeping them was often not an option.

Yes, I fully understand why we want to keep things, but I am becoming aware that a sensible deletion policy might well be part of a good data management plan!

Thursday, 10 May 2012

The Importance of having metadata

There is a lot of interest at present in Research Data Management; it's one of those terms you hear talked about either with excitement or with dread. You can learn more from the Digital Curation Centre. There are buzzwords around like "Open Access", "research outputs", data life (I wish they would talk of data half-life, but that is me) and metadata. Now, I am not going to give a detailed account of Research Data Management; what I am going to do is tell this week's story, which highlights the need for ongoing metadata-keeping in a live research project.

I have had a long-term research relationship with Margo Barker; it started soon after I came here, about 19 years ago, and has covered many research projects. This one started about four years ago, when Margo was able for a short while to employ someone to look at magazine messages for her. That person did a good job and produced some results in the short time available. However, things moved fast, and when she left, Margo found the project largely swamped by other things. It was not until last summer, when another person working on a short-term project had some spare time to look at it, that it was revitalised. He managed to get the work ready for presentation, but it took into the autumn before a paper was ready to be submitted. During this process little new analysis was done.

The paper was sent back for editing, and in the light of the reviewers' comments we decided to alter the analysis by removing some of the data. That was fairly easy for the material I had on my machine (I had all the SPSS programs that did the original analysis saved), but there was other data I had not seen. A good part of this week has therefore been spent searching computer storage and opening likely files to see if they held the original data. We found it eventually, but it took a long time.

So what difference would metadata have made to this project? The metadata would ideally contain the creation date, the last-edited date, and what was actually in each file: not just what data was there, but what analysis had been carried out, any changes made, and so on. If this had been stored in a central file for the research project, we could have opened that file and looked up which files contained the information on the number of magazines sampled. We would have known what each variable actually was, and could probably have answered the question of what counted as an article about nutrition (fortunately the original researcher was good at documenting the data, so we found that she had clearly stated she removed articles with only a small amount on nutrition, but this was hidden away in a file and we came across it by coincidence). We might also, for instance, not have spent quite so much time wondering whether supplements and weight-loss products were one and the same (they weren't, but it took a lot of time to find that out, because we did not know we were still short one of the data sets)!
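
SPSS itself gives you somewhere to keep at least part of this: the DOCUMENT command embeds free-text notes in a .sav file, and DATAFILE ATTRIBUTE attaches named properties, so the notes travel with the data. A sketch, with invented names and values:

    * Attach notes and named properties to the active data file.
    DOCUMENT Magazine study - articles with only passing nutrition
        content were removed before analysis.
    DATAFILE ATTRIBUTE ATTRIBUTE=CreatedBy('A. Researcher')
        LastEdited('2012-05-08').
    SAVE OUTFILE='C:\data\magazines.sav'.
    * To read the notes back later.
    DISPLAY DOCUMENTS.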

Moral of this story: most researchers are human, and humans forget. Having your metadata in a usable form makes it easier to pick up an analysis after time has elapsed, and saves you time because you have a record of what data is where.

Thursday, 3 May 2012

Where to begin

Most people in the department who think of me probably think of me as the person who deals with SPSS (it doesn't bite, honestly), and, if they are a bit more knowledgeable, as someone who tends to work with researchers and also looks after something called NVivo.

Well, I do do that, but in and around it I do so much more. Just today, for instance, I have:
  • dealt with a doctoral student using Mixed Methods (both SPSS and NVivo), helping her enter data so we can run some sort of analysis in SPSS (her data was lost and she only has the outputs);
  • helped another student with the analysis of their green-roof data in SPSS;
  • looked briefly at some data, held in Excel, on advertising in women's magazines;
  • spent some time discovering what MaxDiff methodology is.
That is a slightly heavy day, but it is not exceptional. The topics people bring to me are varied; probably my main skill is not using SPSS or NVivo but the ability to draw out from someone what they understand themselves to be doing with their research. I therefore tend to work very closely and intensively with the researchers I am advising.

I have been doing the job for close on twenty years, and there are really two core reasons for doing it: firstly, the character of the people you are dealing with; secondly, the stimulus of the work involved.


It is good to work with researchers. Researchers, whether they are doctoral students or professors, are usually passionate about what they are doing. If someone shows an interest in their work they become animated, reflecting the high level of personal involvement in their research. Often, regardless of how cynical they have become about academia and the outputs of their research, they remain people who want to make a difference.

Secondly, there is the fact that I am perpetually curious. I like learning new things; even better, I like discovering, or helping to discover, new things. However, I am not good at sitting down and working methodically in one direction; there are too many shiny new things out there to distract me. With this job I get to change what I am working on several times a day, whereas most researchers work on a small number of topics for a long period. Plenty of stimulus for my brain, then.

So over the coming weeks I hope to bring to this blog some stories of the research at the University of Sheffield that I am involved with, and also some of my reflections on supporting researchers in the use of computers.