| TL;DR | Large language models like ChatGPT are increasingly used to automate or assist with qualitative research tasks — coding, categorization, and thematic analysis — with promising but uneven reliability. |
|---|---|
What it means
Qualitative content analysis is notoriously time-intensive: researchers read, label, and iteratively refine codes across large bodies of text. LLMs offer a path to automating the mechanical parts while keeping humans in the loop for interpretation and validation.
The key workflow is:
- Data extraction — identify relevant passages (change mechanisms, themes, etc.)
- Coding scheme development — organize extracted data into discrete, mutually exclusive categories
- Annotation — apply the coding scheme to the full dataset
- Reliability evaluation — measure consistency via intercoder-agreement (κ)
LLMs can assist at every step, though their reliability varies by task type and approach.
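The reliability step above is typically quantified with Cohen's κ, which corrects raw agreement for agreement expected by chance. A minimal sketch, using hypothetical codes from a human coder and an LLM coder (labels and data are illustrative, not from any cited study):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items coded identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for six passages:
human = ["barrier", "facilitator", "barrier", "barrier", "neutral", "facilitator"]
llm   = ["barrier", "facilitator", "barrier", "neutral", "neutral", "facilitator"]
print(round(cohens_kappa(human, llm), 3))  # → 0.75
```

Raw agreement here is 5/6 ≈ 0.83, but κ = 0.75 once chance agreement is discounted, which is why κ, not percent agreement, is the standard metric for evaluating LLM coders.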
Inductive vs. deductive
Two main approaches in content analysis:
- Inductive (data-driven): categories emerge from the data. LLMs perform better here because they can generate rich, example-laden category labels that improve consistency across runs.
- Deductive (theory-driven): codes are mapped to a predefined framework (e.g., Theoretical Domains Framework). LLMs struggle more here, especially with structured coding matrices, because overlapping framework categories and sparse semantic labels create ambiguity.
(bijker-chatgpt-qca-2024) found κ values of 0.72–0.82 for inductive schemes versus 0.58–0.73 for deductive approaches using GPT-3.5 Turbo.
The role of prompt engineering
Output quality depends heavily on prompt-engineering. Structured, iterative prompts with clear instructions, relevant synonyms, and explicit examples improve LLM performance significantly. The “garbage in, garbage out” principle applies: vague prompts produce inconsistent coding.
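A structured coding prompt of the kind described above can be sketched as a template: explicit role, a fixed category list with synonyms, worked examples, and a constrained output format. Everything here (the category names, synonyms, and example passages) is hypothetical and not drawn from any cited study:

```python
# Hypothetical structured coding prompt; not tied to any specific API.
CODING_PROMPT = """\
You are a qualitative coder applying a fixed coding scheme.

Categories (choose exactly one per passage):
- BARRIER: obstacles to change (synonyms: hurdle, blocker, constraint)
- FACILITATOR: enablers of change (synonyms: support, driver, resource)
- NEUTRAL: no clear barrier or facilitator

Examples:
Passage: "I couldn't afford the sessions." -> BARRIER
Passage: "My partner kept me motivated."  -> FACILITATOR

Now code the passage below. Reply with the category label only.
Passage: "{passage}"
"""

def build_prompt(passage: str) -> str:
    # Insert one passage into the fixed template.
    return CODING_PROMPT.format(passage=passage)

print(build_prompt("The clinic was two hours away."))
```

Constraining the reply to a bare label is what makes downstream reliability evaluation (κ against a human coder) mechanical rather than interpretive.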
Limitations and risks
- Validity gap: most studies assess reliability (consistency), not validity (accuracy relative to ground truth)
- Temporal instability: LLM output can vary across time, model versions, and API accounts
- Ethical concerns: data privacy, transparency about AI involvement, and potential training data biases all require attention
- Data type sensitivity: messy naturalistic data (forums, social media) is harder than structured interview transcripts
Benchmarks and model comparisons
(bennis-ai-thematic-analysis-2025) goes further, benchmarking nine models against expert human analysis; some achieved perfect concordance (Jaccard = 1.00). ChatGPT o1-Pro led, and the pace of improvement is rapid: months, not years.
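The concordance metric used here, the Jaccard index, is simply the overlap between two sets of themes divided by their union; Jaccard = 1.00 means the model's theme set matched the expert set exactly. A minimal sketch with hypothetical theme sets:

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |intersection| / |union| of two theme sets."""
    if not set_a and not set_b:
        return 1.0  # two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical theme sets from a human analyst and a model:
human_themes = {"access to care", "stigma", "cost", "trust in clinicians"}
model_themes = {"access to care", "stigma", "cost"}
print(jaccard(human_themes, model_themes))  # 3 shared / 4 total → 0.75
```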
The critical counterargument
(brailas-ai-qualitative-research-2025) argues that optimizing for reliability metrics misses the point. LLMs produce what is statistically probable, not conceptually novel. See epistemic-flattening for the core risk.
The field-level debate sharpened in 2025–2026 into a direct confrontation:
- jowsey-et-al-2025-we-reject (419 signatories) argued that LLMs are categorically incompatible with Big-Q reflexive qualitative research: AI cannot make meaning, reflexive research must remain distinctly human, and the environmental and social justice costs are unacceptable.
- de-paoli-reject-rejection-2026 countered philosophically: human exceptionalism is a position in philosophy of mind, not a methodological claim.
- greenhalgh-2026-beyond-the-binary reframed the question: not whether AI can make meaning, but whether AI use displaces the researcher's reflexive engagement.
- wise-et-al-2026-ai-not-the-enemy offered the most constructive response, mapping LLM architectural properties to interpretivist commitments to argue that AI can deepen, not replace, interpretive work.
(carlsen-ralund-computational-grounded-theory-2022) provides the methodological grounding: unsupervised, computer-led approaches (like topic modeling) have fundamental problems with discovery, immersion, and validation. Their CALM framework is the most rigorous articulation of how computers and humans should divide the labor. See computational-grounded-theory.
(anis-french-ai-qualitative-research-2023) offers a middle path: embrace AI for efficiency, and use its failures as analytical insight (algorithmic failure cases flag ambiguous, complex passages worth reading closely). But keep interpretation with the human.
See also
- empirical-findings — standing page synthesizing all empirical studies: benchmarks, tool comparisons, practitioner behavior, complementarity findings
- bijker-chatgpt-qca-2024 — empirical test of ChatGPT for qualitative content analysis
- bennis-ai-thematic-analysis-2025 — nine-model benchmark; some reaching perfect concordance
- ayik-et-al-2026-human-vs-ai-ta-tools — four-tool empirical comparison; tool choice encodes epistemological orientation
- brailas-ai-qualitative-research-2025 — critical-epistemological argument against outsourcing analysis
- carlsen-ralund-computational-grounded-theory-2022 — CALM framework for computer-assisted analysis
- anis-french-ai-qualitative-research-2023 — 3E case for AI: efficient, explicatory, equitable
- jowsey-et-al-2025-we-reject — categorical rejection; 419 signatories; AI incompatible with Big-Q research in principle
- de-paoli-reject-rejection-2026 — philosophical rebuttal of categorical rejection
- greenhalgh-2026-beyond-the-binary — governance reframe; four questions for the community
- wise-et-al-2026-ai-not-the-enemy — AI-in-the-loop analysis; interpretivist case for AI
- dellafiore-et-al-2025-expert-interviews — expert practitioners’ perceptions; shame culture; “illusion of meaning”
- prompt-engineering — how to write prompts that produce reliable research output
- intercoder-agreement — Cohen’s κ and why it matters for evaluating LLM coders
- epistemic-flattening — the risk of AI reproducing dominant narratives
- computational-grounded-theory — the computer-led paradigm and its critique
- ai-research-ethics — privacy, consent, bias, and transparency
- qualitative-ai-methods — full taxonomy of AI roles and approaches
- epistemology — epistemological stances represented in the literature
- human-ai-collaboration — frameworks for dividing analytic labor
- validity-trustworthiness — reliability vs. validity across the corpus
- contested-claims — active intellectual disputes in the field
What links here
- AI Research Ethics
- Anis & French (2023) — Efficient, Explicatory, and Equitable: Why Qualitative Researchers Should Embrace AI, but Cautiously
- Bennis & Mouwafaq (2025) — Advancing AI-Driven Thematic Analysis: A Comparative Study of Nine Generative Models
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Brailas (2025) — AI in Qualitative Research: Beyond Outsourcing Data Analysis to the Machine
- Carlsen & Ralund (2022) — Computational Grounded Theory Revisited: From Computer-Led to Computer-Assisted Text Analysis
- Chatzichristos (2025) — Qualitative Research in the Era of AI: A Return to Positivism or a New Paradigm?
- Christou (2023) — How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research?
- Christou (2024) — Thematic Analysis through Artificial Intelligence (AI)
- Computational Grounded Theory
- Costa et al. (2025) — AI as a Co-researcher in the Qualitative Research Workflow: Transforming Human-AI Collaboration
- Dahal (2024) — How Can Generative AI Enhance or Hinder Qualitative Studies? A Critical Appraisal from South Asia, Nepal
- Davison et al. (2024) — The Ethics of Using Generative AI for Qualitative Data Analysis
- Empirical Findings
- Epistemic Flattening
- Fischer & Biemann (2024) — Exploring Large Language Models for Qualitative Data Analysis
- Friese (2026) — From Coding to Conversation: A New Methodological Framework for AI-Assisted Qualitative Analysis
- Goyanes et al. (2025) — Thematic Analysis of Interview Data with ChatGPT: Designing and Testing a Reliable Research Protocol
- Hamilton et al. (2023) — Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data
- AI in Qualitative Research
- Index
- Intercoder Agreement
- Jowsey et al. (2025) — Frankenstein, Thematic Analysis and Generative AI: Quality Appraisal Methods
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Naeem et al. (2025) — Thematic Analysis and Artificial Intelligence: A Step-by-Step Process for Using ChatGPT
- Nelson (2020) — Computational Grounded Theory: A Methodological Framework
- Nguyen-Trung (2025) — ChatGPT in Thematic Analysis: GAITA and the ACTOR Framework
- Nicmanis & Spurrier (2025) — Getting Started with AI-Assisted Qualitative Analysis: An Introductory Guide
- Paulus & Marone (2024) — "In Minutes Instead of Weeks": Discursive Constructions of Generative AI and Qualitative Data Analysis
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Prescott et al. (2024) — Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses
- Prompt Engineering
- Qualitative AI Methods — A Living Taxonomy
- Reeping et al. (2025) — Interrogating the Use of LLMs in Qualitative Research Using the Q3 Framework
- Sakaguchi et al. (2025) — Evaluating ChatGPT in Qualitative Thematic Analysis in the Japanese Clinical Context
- Salazar et al. (2025) — Comparison of Qualitative Analyses Conducted by Artificial Intelligence Versus Traditional Methods
- Sinha et al. (2024) — The Role of Generative AI in Qualitative Research: GPT-4's Contributions to a Grounded Theory Analysis
- Übellacker (2024) — AcademiaOS: Automating Grounded Theory Development with Large Language Models
- Wheeler (2026) — Technological Reflexivity in Practice: How MAXQDA, NVivo, and ChatGPT Shape Qualitative Survey Analysis
- Williams (2024) — Paradigm Shifts: Exploring AI's Influence on Qualitative Inquiry and Analysis
- Xu (2026) — Doing Thematic Analysis in the Age of Generative AI: Practices, Ethics and Reflexivity
- Yang & Ma (2025) — Artificial Intelligence in Qualitative Analysis: A Practical Guide Using GPT-4 on Substance Use Interview Data
- Zhang et al. (2025) — Harnessing the Power of AI in Qualitative Research: Exploring, Using and Redesigning ChatGPT