Thursday, 16 May 2013

Archive all data?

I saw this post today on the LSE Impact of the Social Sciences blog, and broadly speaking I am all for initiatives to archive data for re-use. However, I wonder if we are getting somewhat carried away with this. There is indeed a case for archiving the data from large, publicly funded studies, whether in the Social Sciences, Science, Medicine, Engineering or the Arts. That is not in dispute. It is below that level that things get murky. So let me sketch perhaps four levels of research data.

The Big Study

This is the above: it is funded by a government agency, charity or other large body. It probably involves researchers from several institutions and may well include international collaboration. The team of researchers will not be fewer than twenty and may run into the hundreds. It is big, well funded and expected to make an impact. Yes, this should be archived.

The Team Study

This covers a lot of research; it is smaller in scale. It is externally funded, but as well as research councils, government agencies and charities, there may be other sources of funds such as industry, campaign groups, even local councils. The research team can vary from about 2 to 50, but usually all are based within a single institution. As a rule these should also be archived unless there is a very clear reason not to. The archive may have restricted access in cases where the research is sensitive.

Personal Research

This is small scale. It may be externally funded, it may not, or it may be funded by consultancy fees and the like. Often an individual is working on their own, but they may work with two or three others. It is based solely within a single institution, and probably within a single department of that institution. This is where archiving becomes more difficult. The first aim should be to develop tools which encourage researchers to keep a note of what data they have, and also to follow sensible backup procedures. The next step is to find a way that data can be archived efficiently. Could open publishing perhaps be used as a way to gain proper access to this data? For instance, a model where the papers are published openly but a charge is made for the release of the associated data.

Student Researcher

At the lowest level, at the other end of the scale, are the millions of projects done each year by students. They should be taught good practice, but they are learners. Learners make mistakes; the data sets are often small and sometimes quite idiosyncratic, e.g. a convenience sample of purchases of organic food from a local specialist shop. The best are up there with the personal researchers, and most doctoral students should be aiming at that level. I think in this case the rule is the opposite of the team-study one: the presumption is that data should not be archived unless there are significant outcomes from it. The best test of that seems to be whether a paper has been published from it.

So while wishing for more data to be archived, I think we need to be seriously circumspect about where we draw the line on which data should be archived. The purist "all data" approach will just increase the amount of noise that future data archaeologists have to hunt through.


Thursday, 9 May 2013

It's taken a long time to get here.

Well, last week I was an author on another paper, this time on food messages in women's magazines from 1950 to 2000. It was a study I first got involved in back in about 2006/7, when the second author had collected the data. There were problems, as the frequency of adverts differed over the decades, so 10 adverts talking about cheese in 1955 was not the same as 10 adverts talking about cheese in 1995. The question was how you adjust for the changing rate of adverts.
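To make the problem concrete (this is not the adjustment used in the paper, just an illustration with made-up totals), one simple way to put decades on a common footing is to express each count as a rate per 100 adverts sampled:

```r
# Made-up figures: the same count of cheese adverts against very different advert volumes
adverts <- data.frame(
  decade = c("1950s", "1990s"),
  cheese = c(10, 10),
  total  = c(80, 400)   # hypothetical total adverts sampled in each decade
)

# Express each count as a rate per 100 adverts so the decades become comparable
adverts$cheese_per_100 <- 100 * adverts$cheese / adverts$total
adverts
```

The same ten cheese adverts then come out as a much larger share of the 1950s sample than of the 1990s one, which is the effect that had to be dealt with.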

Well, it has been through a variety of states since then. The first author presented the results several times at public lectures and conferences, but publication kept getting put off. It was also connected with another paper which looked far more closely at the messages about food given during the war. Together they showed the ways in which different attitudes to food have been expressed through advertising and other elements of women's magazines. The 1950s concentration on people getting enough calories would seem very odd in today's society, which worries about an obesity epidemic.

Of course the idea would have been to get this published around 2008, but it has taken until 2013. Each time we returned to the data we refined the analysis. We started by comparing all the data to the 2000s, but we only had five years of data for that decade compared to ten for the others, so we switched to using the 1950s as our baseline. We carefully selected which advertising topics to show, to get those which made a coherent story across more than one type and to remove the categories which were very sparse. Finally we dropped the 2000s altogether and just concentrated on the decades before.

One real nightmare was getting the graphs out. The analysis was done in SPSS, but that is a hopeless package for presentation graphics. Initially the graphs were done in Excel (which is better), but they would not come out at good enough quality, so we had the job of raising the quality. Late nights trying to get them to work. Sigmaplot is only marginally better, and I am on the lookout for a piece of software that does academic plots well. At the moment I am seriously considering learning to use some of the very specialist packages that come with R, perhaps ggplot2. The thing is you need to draw quite specific plots, give precise labels (often using symbols), and then produce them as a high-quality graphic.
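As a rough sketch of the sort of thing I have in mind (the data frame, column names and file name here are entirely hypothetical), ggplot2 lets you set precise labels, put mathematical symbols into them, and write out a print-quality file in one go:

```r
library(ggplot2)

# Hypothetical data: rate of adverts per 100 for two topics across decades
dat <- data.frame(
  decade = rep(c("1950s", "1960s", "1970s", "1980s", "1990s"), 2),
  topic  = rep(c("Cheese", "Convenience food"), each = 5),
  rate   = c(12, 10, 8, 6, 5, 2, 4, 7, 11, 14)
)

p <- ggplot(dat, aes(x = decade, y = rate, group = topic, linetype = topic)) +
  geom_line() +
  geom_point() +
  labs(x = "Decade",
       y = expression("Adverts per 100 (" * hat(p) %*% 100 * ")"),  # symbols in the axis label
       linetype = "Topic") +
  theme_bw(base_size = 11)

# Write a high-resolution file suitable for print
ggsave("food_messages.tiff", p, width = 160, height = 100, units = "mm", dpi = 600)
```

The appeal is that once the plot is specified in code, regenerating it at a different size or resolution is a one-line change rather than another late night.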

It's out now, so on to something new.

Thursday, 2 May 2013

Thinking about numbers for Multiple Regression Analysis

I have had one of those emails that makes you wonder: someone trying to fit 25 explanatory variables to a data set that has just 20 observations. It is one of those days when you say "stop trying and do something simpler". However, it set me wondering whether there is a better rule of thumb for the sample size needed for regression. The figures are available in Samuel B. Green (1991), "How Many Subjects Does It Take To Do A Regression Analysis?", Multivariate Behavioral Research, 26(3), 499-510.
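To see why the email's plan cannot work, here is a minimal simulated example (the data are random, purely for illustration): with 25 predictors and only 20 observations the model matrix is rank deficient, so lm() can only estimate 20 of the 26 coefficients and fits the data perfectly but meaninglessly.

```r
set.seed(1)
n <- 20; m <- 25

# Purely illustrative random data: 25 candidate predictors, 20 observations
dat <- as.data.frame(matrix(rnorm(n * m), nrow = n))
dat$y <- rnorm(n)

fit <- lm(y ~ ., data = dat)
sum(is.na(coef(fit)))  # 6 of the 26 coefficients (intercept + 25 slopes) cannot be estimated
deviance(fit)          # residual sum of squares is (numerically) zero: a perfect, meaningless fit
```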

Well, I have discovered some things. Firstly, most of the rules are of the form A + B·M, where M is the number of explanatory variables. I got some things right straight off. My idea was that the sample size needed for a given power varies not with M but with the square root of M, and indeed this does seem to be closer to the truth. However, I fairly soon realised that M + 1 rather than M is a better measure. I played around a bit, and a figure of 17.5 seems to work as the constant to match the G*Power calculations.
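A quick sketch of what that comparison looks like. The linear rule below is the often-quoted N ≥ 50 + 8M associated with Green (1991); the square-root form is the 17.5·√(M + 1) described above. The effect size and power behind the G*Power figures are not restated here, so treat the two curves as shapes rather than recommendations:

```r
m <- 1:25

# A widely quoted linear rule of the A + B*M form (Green 1991): N >= 50 + 8*M
n_linear <- 50 + 8 * m

# The square-root form discussed above: N ~ 17.5 * sqrt(M + 1)
n_sqrt <- 17.5 * sqrt(m + 1)

# Tabulate and plot the two curves
round(data.frame(M = m, linear = n_linear, sqrt_form = n_sqrt))
matplot(m, cbind(n_linear, n_sqrt), type = "l", lty = 1:2, col = 1,
        xlab = "Number of explanatory variables (M)", ylab = "Suggested N")
legend("topleft", legend = c("50 + 8M", "17.5 * sqrt(M + 1)"), lty = 1:2, col = 1)
```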

So I ended up with a graph that looked like the one below:


The Cartledge method is interesting. It is not in the statistics literature but comes from a modelling background. It is not actually for straight regression but gives the number of cases you need if you model all the interactions, in order for the mathematics to work. Note that word: mathematics! The statistics would want five times that amount (we do not like fitting a dimension with only two points, sorry mathematicians). Be warned: interaction terms should be added only with the greatest of caution.
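I will not reproduce the Cartledge calculation here, but the general point is easy to see: with M explanatory variables, a model containing every interaction has 2^M terms (intercept included), so you need at least that many cases for the fit to be possible at all, and far more on the five-times rule. A small check in R (the variables are just placeholders):

```r
# Placeholder data with m = 4 explanatory variables; the values themselves are irrelevant
m <- 4
dummy <- as.data.frame(matrix(rnorm(10 * m), ncol = m))

# A model containing every interaction among the m variables has 2^m terms (intercept included)
n_terms <- ncol(model.matrix(~ V1 * V2 * V3 * V4, data = dummy))
n_terms       # 16: the bare minimum number of cases for the mathematics to work at all
5 * n_terms   # 80: roughly what the "five times" statistical rule would ask for
```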

Another point: I am not wedded to 17.5; I would quite happily consider using 20 instead. That would allow some leeway for fitting different models. I also wonder how you correct for the technique used to select the independent variables.