Thursday, 9 January 2014

Checking your work when using procedure (or watch out for * join)

Now I know most people's data sets are relatively small scale; you are lucky if you have a hundred cases. However, every so often you get huge ones. We have them for things like computer usage within the department, linguistics corpus data has them, and I have a number of purchase transactions from a cafeteria.

Now I tend to handle these in SPSS, and until recently SPSS was reliable in handling them. That is, if the files got corrupted while running a procedure, 99% of the time it was me who had messed up! Most of the time when I merged files, if there was a problem it was either:
  1. Due to the fact I had not sorted the data
  2. Unexpected duplicate cases, either extra blank cases at the bottom of the file or a duplicate where there should not be one.
However, SPSS just used to tell me I was doing things wrong and to go back and sort it out.
Sorting data sets, especially large ones, before you merge is tedious, but that is all. Yes, I have been in the "make a cup of coffee and see how much you drink while SPSS is sorting" crowd. I handled data sets with over 600,000 cases almost two decades ago on PCs. The machine used to sit in the office chuntering as it processed the data while I got on with other work. I would check it every time I wanted a coffee, but normally it ran quite happily on its own. Due to the way that analysis worked, I had to invert the file at one stage by sorting.
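The two merge problems listed above (unsorted keys, and unexpected blank or duplicate cases) can be checked for before merging. What follows is only a rough sketch in Python rather than SPSS syntax; the function name `check_merge_keys` and the receipt IDs are hypothetical, made up purely for illustration.

```python
from collections import Counter


def check_merge_keys(keys):
    """Report whether the key values are sorted, duplicated, or blank.

    `keys` is the list of key values from one file, in file order.
    A blank case is assumed to show up as None (an assumption of this sketch).
    """
    blanks = sum(1 for k in keys if k is None)
    present = [k for k in keys if k is not None]
    # Sorted means every key is <= the key that follows it.
    is_sorted = all(a <= b for a, b in zip(present, present[1:]))
    # Any key value that occurs more than once is a duplicate.
    duplicates = sorted(k for k, n in Counter(present).items() if n > 1)
    return {"sorted": is_sorted, "duplicates": duplicates, "blanks": blanks}


# Hypothetical receipt IDs: one duplicate, one out of order, one blank at the end.
receipt_ids = [101, 102, 102, 105, 104, None]
report = check_merge_keys(receipt_ids)
print(report)
```

If any of the three entries in the report is not what you expect, that is the point to go back and sort or clean the file before attempting the merge.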

Now SPSS decided that forcing people to do this sort of tedious work and wait around was not on, so it developed "star sort". Not only that, they made it the default option in the SPSS menus. It may work for small datasets with only a few cases. It does not work for large ones in my experience. The result is that instead of a tedious sort command beforehand, I now have to either:
  1. Run frequencies and other checks that the data are as expected
  2. Sort the data as I always did and remember to uncheck the boxes so that the dialogue actually uses the old way that worked.
No, it has not improved things, and as I missed this until it started giving me what I knew were wrong results, I am not happy. If you are going to change a default, then change it to something that works. Before someone says take this up with SPSS: I beta tested this particular version of SPSS (21) and I reported the problem then.
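For anyone wondering what a frequency check after a merge might look like, here is a minimal sketch, again in Python rather than SPSS syntax. The idea is simply to count each value of a variable before and after the merge and flag anything that has changed. The name `compare_frequencies` and the till codes are hypothetical.

```python
from collections import Counter


def compare_frequencies(before, after):
    """Return {value: (count_before, count_after)} for values whose counts differ."""
    b, a = Counter(before), Counter(after)
    # Counter returns 0 for a missing value, so values present in only
    # one of the two files are reported as well.
    return {v: (b[v], a[v]) for v in sorted(set(b) | set(a)) if b[v] != a[v]}


# Hypothetical till codes before and after a merge that duplicated one case.
till_before = ["A", "A", "B", "C"]
till_after = ["A", "A", "B", "B", "C"]
changed = compare_frequencies(till_before, till_after)
print(len(till_before), len(till_after), changed)
```

An empty result, together with matching case counts, is the outcome you are hoping for; anything else means the merged file needs a closer look.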