TL;DR: Large language models like ChatGPT are increasingly used to automate or assist with qualitative research tasks (coding, categorization, and thematic analysis), with promising but uneven reliability.

What it means

Qualitative content analysis is notoriously time-intensive: researchers read, label, and iteratively refine codes across large bodies of text. LLMs offer a path to automating the mechanical parts while keeping humans in the loop for interpretation and validation.

The key workflow is:

  1. Data extraction — identify relevant passages (change mechanisms, themes, etc.)
  2. Coding scheme development — organize extracted data into discrete, mutually exclusive categories
  3. Annotation — apply the coding scheme to the full dataset
  4. Reliability evaluation — measure consistency via intercoder agreement (Cohen's κ)

LLMs can assist at every step, though their reliability varies by task type and approach.
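
A minimal sketch of step 4, assuming the human and LLM labels are already aligned passage-by-passage; the label lists and category names here are illustrative:

```python
# Intercoder agreement between a human coder and an LLM, via Cohen's kappa.
# Requires scikit-learn: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

# Illustrative labels applied to the same five passages by each coder.
human_codes = ["barrier", "facilitator", "barrier", "neutral", "facilitator"]
llm_codes   = ["barrier", "facilitator", "neutral", "neutral", "facilitator"]

kappa = cohen_kappa_score(human_codes, llm_codes)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```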

Inductive vs. deductive

Two main approaches in content analysis:

  • Inductive (data-driven): categories emerge from the data. LLMs perform better here because they can generate rich, example-laden category labels that improve consistency across runs.
  • Deductive (theory-driven): codes are mapped to a predefined framework (e.g., the Theoretical Domains Framework). LLMs struggle more here, especially with structured coding matrices, because overlapping framework categories and sparse semantic labels create ambiguity; a sketch of such a scheme follows this list.
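
To see why deductive coding strains the model, it helps to view the scheme as a data structure: a handful of abstract labels with brief definitions the LLM must keep apart. The category names below echo TDF domains, but the definitions and synonyms are paraphrased assumptions for illustration:

```python
# Illustrative deductive coding scheme: predefined framework categories,
# each with a short definition and synonyms to anchor the LLM's matching.
# Note how "social_influences" and "environmental_context" can overlap.
DEDUCTIVE_SCHEME = {
    "knowledge": {
        "definition": "awareness or understanding of something",
        "synonyms": ["awareness", "understanding", "familiarity"],
    },
    "social_influences": {
        "definition": "interpersonal processes that shape thoughts or behavior",
        "synonyms": ["peer pressure", "social norms", "support from others"],
    },
    "environmental_context": {
        "definition": "features of a person's situation or setting",
        "synonyms": ["resources", "physical barriers", "organizational climate"],
    },
}
```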

(bijker-chatgpt-qca-2024) reported κ of 0.72–0.82 for inductive schemes versus 0.58–0.73 for deductive approaches using GPT-3.5 Turbo.

The role of prompt engineering

Output quality depends heavily on prompt engineering. Structured, iterative prompts with clear instructions, relevant synonyms, and explicit examples significantly improve LLM performance. The “garbage in, garbage out” principle applies: vague prompts produce inconsistent coding.
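
As a sketch of what “structured” means here: the prompt states the task, enumerates categories with synonyms, and includes a worked example. The template, categories, and model name below are all illustrative assumptions, using the OpenAI Python client:

```python
# A structured coding prompt, instead of a one-line instruction.
# Requires the OpenAI Python client: pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are coding qualitative data.
Task: assign exactly one category to the passage below.
Categories (with synonyms):
- barrier (obstacle, hindrance, difficulty)
- facilitator (enabler, support, help)
- neutral (descriptive, no clear valence)
Example: "The app kept crashing during sign-up" -> barrier
Respond with the category name only.

Passage: {passage}"""

def code_passage(passage: str, model: str = "gpt-4o-mini") -> str:
    """Return a single category label for one passage."""
    response = client.chat.completions.create(
        model=model,    # illustrative model choice
        temperature=0,  # reduce run-to-run variation
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(passage=passage)}],
    )
    return response.choices[0].message.content.strip()
```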

Limitations and risks

  • Validity gap: most studies assess reliability (consistency), not validity (accuracy relative to ground truth)
  • Temporal instability: LLM output can vary across time, model versions, and API accounts (a simple stability check is sketched after this list)
  • Ethical concerns: data privacy, transparency about AI involvement, and potential training data biases all require attention
  • Data type sensitivity: messy naturalistic data (forums, social media) is harder to code reliably than structured interview transcripts
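
The stability check mentioned above, sketched under the assumption that the illustrative code_passage() helper from the prompt-engineering example is available: re-code the same passages several times and count how often all runs agree.

```python
# Run-to-run stability: fraction of passages for which repeated
# LLM runs all return the same label. code_passage() is the
# illustrative helper defined in the prompt-engineering sketch.
def stability_check(passages: list[str], n_runs: int = 3) -> float:
    stable = 0
    for passage in passages:
        labels = [code_passage(passage) for _ in range(n_runs)]
        if len(set(labels)) == 1:  # every run agreed
            stable += 1
    return stable / len(passages)
```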

Benchmarks and model comparisons

(bennis-ai-thematic-analysis-2025) goes further, benchmarking nine models against expert human analysis, with some achieving Jaccard = 1.00. ChatGPT o1-Pro led, and the pace of improvement is rapid: months, not years.
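
Jaccard here measures overlap between the model's theme set and the expert's: the size of the intersection divided by the size of the union, so 1.00 means the sets match exactly. A minimal sketch with made-up theme names:

```python
# Jaccard similarity between model-identified and expert-identified themes.
def jaccard(model_themes: set[str], expert_themes: set[str]) -> float:
    union = model_themes | expert_themes
    return len(model_themes & expert_themes) / len(union) if union else 1.0

expert = {"access barriers", "trust in providers", "cost concerns"}
model = {"access barriers", "trust in providers", "stigma"}
print(f"{jaccard(model, expert):.2f}")  # 2 shared / 4 total -> 0.50
```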

The critical counterargument

(brailas-ai-qualitative-research-2025) argues that optimizing for reliability metrics misses the point. LLMs produce what is statistically probable, not conceptually novel. See epistemic-flattening for the core risk.

The field-level debate sharpened in 2025–2026 into a direct confrontation:

  • (jowsey-et-al-2025-we-reject), with 419 signatories, argued that LLMs are categorically incompatible with Big-Q reflexive qualitative research: AI cannot make meaning, reflexive research must remain distinctly human, and the environmental and social justice costs are unacceptable.
  • (de-paoli-reject-rejection-2026) countered philosophically: human exceptionalism is a position in philosophy of mind, not a methodological claim.
  • (greenhalgh-2026-beyond-the-binary) reframed the question: not whether AI can make meaning, but whether AI use displaces the researcher’s reflexive engagement.
  • (wise-et-al-2026-ai-not-the-enemy) offered the most constructive response, mapping LLM architectural properties to interpretivist commitments to argue that AI can deepen, not replace, interpretive work.

(carlsen-ralund-computational-grounded-theory-2022) provides the methodological grounding: unsupervised, computer-led approaches (like topic modeling) have fundamental problems with discovery, immersion, and validation. Their CALM framework is the most rigorous articulation of how computers and humans should divide the labor. See computational-grounded-theory.

(anis-french-ai-qualitative-research-2023) offers a middle path: embrace AI for efficiency, use its failures as analytical insight (algorithmic failure cases flag ambiguous, complex passages worth reading closely), but keep interpretation with the human.
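
A sketch of that idea: instead of scoring disagreement as failure, surface it as a reading list. Again assumes the illustrative code_passage() helper from the prompt-engineering sketch:

```python
# Algorithmic failure as analytical signal: passages where repeated
# LLM runs disagree are often the ambiguous, complex cases that
# most reward a human close reading.
def flag_for_close_reading(passages: list[str], n_runs: int = 3) -> list[str]:
    flagged = []
    for passage in passages:
        labels = {code_passage(passage) for _ in range(n_runs)}
        if len(labels) > 1:  # the model is unstable on this passage
            flagged.append(passage)
    return flagged
```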

See also