In practice: Analysing large datasets and developing methods for that

A quick post, but one that seeks to place the rather polemical and borderline-ranty previous post about realising the potential of CAQDAS tools into an applied rather than abstract context.

Here’s a quote that I really like:

The signal characteristic that distinguishes online from offline data collection is the enormous amount of data available online….

Qualitative analysts have mostly reacted to their new-found wealth of data by ignoring it. They have used their new computerized analysis possibilities to do more detailed analysis of the same (small) amount of data. Qualitative analysis has not really come to terms with the fact that enormous amounts of qualitative data are now available in electronic form. Analysis techniques have not been developed that would allow researchers to take advantage of this fact.

(Blank, 2008, p258)

I’m working on a project to analyse the NSS (National Student Survey) qualitative textual data for Lancaster University (around 7000 comments). Next steps include analysing the PRES and PTES survey comments. But that’s small fry – the biggie is looking at the module evaluation data for all modules for all years (~130,000 comments!)

This requires using tools to help automate the classification, sorting and sampling of that unstructured data in order to be able to engage with interpretations. This sort of work NEEDS software, yet there’s a prevailing view that it either can’t be done (you can only work with numbers) or that software will only quantify the data and somehow corrupt it and make it non-qualitative.
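To make "classification, sorting and sampling" a bit more concrete, here is a minimal sketch in Python of a dictionary-based approach. It is not how any of the tools mentioned in this post work internally. The category names, keywords and function names are all hypothetical and invented for illustration, and a real project would need a far richer categorisation dictionary.

```python
import random
import re
from collections import defaultdict

# Hypothetical categorisation dictionary: category -> trigger keywords.
# These categories and keywords are invented for illustration only.
CATEGORIES = {
    "feedback": {"feedback", "marking", "marks"},
    "teaching": {"lecturer", "lecture", "seminar", "teaching"},
    "resources": {"library", "books", "computers"},
}


def classify(comment):
    """Return the sorted categories whose keywords appear in the comment."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    matched = [name for name, keywords in CATEGORIES.items() if words & keywords]
    return sorted(matched) or ["unclassified"]


def sort_and_sample(comments, per_category=2, seed=0):
    """Group comments by category, then draw a random sample from each group."""
    groups = defaultdict(list)
    for comment in comments:
        for category in classify(comment):
            groups[category].append(comment)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        category: rng.sample(items, min(per_category, len(items)))
        for category, items in groups.items()
    }
```

The point of the sample is that a human still reads and interprets the comments; the automation only decides which pile each comment lands in and which comments get pulled out of each pile for close reading.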

I would argue that isn’t the case. The tools I’m testing and comparing (the ProSUITE from Provalis, which includes QDA Miner and WordSTAT; Leximancer; and NVivo Plus, which incorporates Lexalytics) enable this sort of work with large datasets, based on principles of content analysis and data mining.

However, these only go so far: they enable the classification and sorting of data, but there is still a requirement for more traditional qualitative methods of analysis and synthesis. I’ve been using (and hacking) framework matrices in NVivo Plus in order to synthesise and summarise the comments. This is an application of a method that is more overtly “qualitative data analysis” in a much more traditional vein, yet applied to and mediated by tools that enable its application to much, MUCH larger datasets than would perhaps normally be used in qual analysis.
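For anyone unfamiliar with the idea, a framework matrix is essentially a grid with cases as rows and themes as columns, where each cell gathers the material the analyst then summarises by hand. A rough sketch of that structure in Python, assuming the comments have already been coded as (case, theme, comment) triples, might look like the following. The function names and the example cases are hypothetical, and this is not a description of NVivo's implementation.

```python
from collections import defaultdict


def build_framework_matrix(coded_comments):
    """Build {case: {theme: [comments]}} from (case, theme, comment) triples."""
    matrix = defaultdict(lambda: defaultdict(list))
    for case, theme, comment in coded_comments:
        matrix[case][theme].append(comment)
    # Convert to plain dicts so missing cells are not created on lookup.
    return {case: dict(themes) for case, themes in matrix.items()}


def cell_overview(matrix, case, theme, preview=2):
    """Return the comment count and first few comments for one cell."""
    comments = matrix.get(case, {}).get(theme, [])
    return {"count": len(comments), "preview": comments[:preview]}


# Hypothetical coded data: the case is a department, the theme a category.
coded = [
    ("History", "feedback", "Marking was slow"),
    ("History", "feedback", "Comments on essays were helpful"),
    ("Physics", "teaching", "Great lab sessions"),
]
matrix = build_framework_matrix(coded)
```

The summary that goes into each cell is still written by a person; the code only assembles what that person needs to read.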

And this is the sort of thing I’m talking about in terms of enabling the potential of the tools to guide the strategies and tactics used. But it took an awareness of the capabilities of these tools, and an extended period of playing with them to find out what they could do, in order to scope the project and consider which sorts of questions could be meaningfully asked, considered and explored. This seems to run counter to some of the prescriptions in the 5LQDA views about defining strategies separately from the capabilities of the tools, and that is one of the reasons for taking this stance and considering it here.

Interestingly, this has also led to a rejection of some tools (e.g. MaxQDA and ATLAS.ti) precisely because they lack functions for this sort of automated classification. Again, capabilities and features are a key consideration prior to defining strategies. However, I’m now reassessing this, as MaxQDA can do lemmatisation, which is more advanced than what NVivo Plus offers…
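As a quick illustration of why lemmatisation matters for this kind of work: it maps the different surface forms of a word to one base form, so that "marks", "marked" and "marking" are counted together. The sketch below uses a tiny hand-made lookup table that I have invented purely for illustration. Real lemmatisers rely on much larger lexicons and usually on part-of-speech information, and this is not how MaxQDA or any other package mentioned here does it.

```python
import re
from collections import Counter

# Toy lemma lookup, invented for illustration; real lexicons are far larger.
LEMMAS = {
    "marks": "mark",
    "marked": "mark",
    "marking": "mark",
    "lectures": "lecture",
    "is": "be",
    "was": "be",
    "were": "be",
}


def lemmatise(text):
    """Lowercase and tokenise the text, then map each token to its lemma."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def lemma_counts(comments):
    """Count lemmas across all comments so word forms are pooled together."""
    counts = Counter()
    for comment in comments:
        counts.update(lemmatise(comment))
    return counts
```

Without this step, a frequency count or keyword-based classification treats each word form as a separate term, which fragments exactly the patterns you are trying to find.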

This is just one example, but to me it seems an important one for considering what could be achieved if we explore features and opportunities first, rather than defining strategies that don’t account for them. In other words: a symbiotic exploration of the features and potentials of tools, used to shape and define strategies and tactics, can open up possibilities that were previously rejected. The alternative is to treat those tools and features as necessarily or properly subservient to strategies that fail to account for their possibilities.

On data mining and content analysis

I would highly recommend reading Leetaru (2012) for a good, accessible overview of data mining methods and how these are used in content analysis. It gives a clear insight into the methods, assumptions, applications and limitations of the aforementioned tools, helping to demystify and open up what can otherwise seem to be a black box that automagically “does stuff”.

Krippendorff’s (2013) book is also an excellent overview of content analysis, with several considerations of human-centred analysis using, for example, ATLAS.ti or NVivo, as well as automated approaches like those available in the tools above.


Blank, G. (2008). Online research methods and social theory. In: Fielding, N., Lee, R. M. and Blank, G. (eds), The SAGE handbook of online research methods. Los Angeles, Calif.: SAGE, 537-549.

Preview of Ch1 available at

Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Sage.

Preview chapters available at

Leetaru, K. (2012). Data mining methods for the content analyst: An introduction to the computational analysis of content. New York: Routledge.

Preview available at