| TL;DR | The empirical AI-TA literature measures reliability (consistency) while the qualitative methods literature demands validity (accuracy and meaningfulness). These are not the same criterion, and conflating them is the central unacknowledged methodological problem across most of the corpus. The critical literature — Carlsen & Ralund, Brailas, Jowsey — makes this gap explicit and argues that current reliability-focused evaluation cannot detect the most important failures. |
|---|---|
The fundamental distinction
Reliability (used in the empirical literature): consistency of coding. Inter-rater reliability (κ, Jaccard) measures whether two coders — human-human, human-AI, or AI-AI — assign the same codes to the same segments. High reliability means the coding is consistent. It does not mean the coding is correct.
Validity (foregrounded in qualitative methods): accuracy and meaningfulness. Does the coding capture what it claims to capture? Do the themes represent participants’ actual meanings? Are the categories genuine constructs of the social world, or artifacts of the researcher’s (or model’s) assumptions?
Trustworthiness (qualitative tradition’s replacement/supplement for validity): Lincoln & Guba’s four criteria:
- Credibility — confidence that findings accurately represent participants’ experiences (analog to internal validity)
- Transferability — the degree to which findings apply in other contexts (analog to external validity)
- Dependability — the research process is documented and consistent (analog to reliability)
- Confirmability — findings reflect participant voices, not researcher bias (analog to objectivity)
The empirical AI-TA literature almost exclusively measures dependability (reliability/consistency). Credibility, transferability, and confirmability are rarely operationalized.
Where the studies stand
Reliability-focused approaches (κ, Jaccard)
The most common evaluation in the corpus: run AI coding, compute Cohen's κ or Jaccard index against human coding, and interpret the result.
What this establishes: The AI and human coder agree at a specified rate on these codes, for this data, at this time.
What this does not establish:
- Whether the coding scheme is valid — codes themselves could be wrong, and AI-human agreement on wrong codes is still wrong
- Whether the AI would produce the same codes tomorrow (temporal instability, noted in multiple sources)
- Whether the coding captures minority voices or only dominant patterns (epistemic-flattening)
- Whether themes have the semantic depth that qualitative analysis aims for
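Both agreement metrics are mechanical, which is part of the point. A minimal sketch (with hypothetical codes, not data from any cited study) shows that Jaccard can be perfect while κ is not, and that neither says anything about whether the codes themselves are valid:

```python
# Two coders' labels for the same six segments (hypothetical).
from collections import Counter

human = ["stigma", "access", "stigma", "cost", "access", "stigma"]
ai    = ["stigma", "access", "cost",   "cost", "access", "stigma"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two coders on the same segments."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement by chance, from each coder's marginal code frequencies.
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

def jaccard(a, b):
    """Overlap of the code *sets* — ignores how often each code was applied."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

print(round(cohens_kappa(human, ai), 2))  # 0.75 — segment-level disagreement remains
print(jaccard(human, ai))                 # 1.0 — identical code sets
```

Here Jaccard = 1.0 because both coders used the same three codes somewhere, even though they disagree on a segment; and both scores would be unchanged if the coding scheme itself were invalid.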
Key benchmarks:
- κ 0.72–0.82 inductive, 0.58–0.73 deductive: bijker-chatgpt-qca-2024
- Jaccard = 1.00 (best of nine models): bennis-ai-thematic-analysis-2025
- 71% inductive theme match, 37–47% reliability: prescott-ai-thematic-analysis-2024
- 80% descriptive agreement, ~30% culturally embedded: sakaguchi-chatgpt-japanese-2025
The Jowsey finding — the validity gap made concrete
jowsey-frankenstein-ai-ta-2025 provides the most direct evidence of the reliability/validity gap. Copilot was tested blind — given the same interview data as published human TA studies, without access to those studies’ findings. Results:
- Minimal overlap with human analysis themes
- Only first 2–3 pages of transcripts engaged
- 58% of quoted excerpts were fabricated — specific quotes attributed to participants that did not appear in the data
This is a validity failure — the AI produced content that does not represent the data — and one that intercoder reliability metrics would miss entirely if the AI's coding were evaluated only for internal consistency (e.g., AI-AI agreement across runs). Quote fabrication in particular represents an ethical failure that κ cannot measure.
The CALM approach — direct validation
carlsen-ralund-computational-grounded-theory-2022 argues that indirect validation (CGT’s approach of correlating topic measures with external variables) cannot detect systematic measurement error. If an LDA model consistently miscodes documents from a particular community, and those miscoded documents still correlate with external variables (because the miscoding is systematic rather than random), the correlation passes — but the measurement is invalid.
Direct validation: Human coders produce a gold-standard coding of a random sample → classifier output compared against this gold standard → measurement error quantified.
This is the only approach that can detect the kind of systematic bias that epistemic-flattening predicts. Indirect validation is a statistical relationship; it says nothing about whether individual documents are correctly coded.
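The direct-validation step can be sketched concretely. Assuming a hypothetical gold-standard sample and model output (the names and codes below are illustrations, not data from the paper), the key move is to quantify error *per community* rather than in aggregate, so systematic miscoding of one group becomes visible:

```python
# Direct validation in the Carlsen & Ralund sense: compare classifier output
# against a human-coded random sample and break error out by subgroup.
from collections import defaultdict

gold = [  # (doc_id, community, human_code) — hypothetical gold standard
    ("d1", "majority", "policy"), ("d2", "majority", "policy"),
    ("d3", "majority", "health"), ("d4", "minority", "health"),
    ("d5", "minority", "policy"), ("d6", "minority", "health"),
]
model = {"d1": "policy", "d2": "policy", "d3": "health",
         "d4": "policy", "d5": "policy", "d6": "policy"}

errors = defaultdict(lambda: [0, 0])  # community -> [wrong, total]
for doc, community, truth in gold:
    errors[community][1] += 1
    errors[community][0] += model[doc] != truth

for community, (wrong, total) in sorted(errors.items()):
    print(f"{community}: error rate {wrong / total:.0%}")
```

In this toy case the aggregate error looks moderate, but the per-community breakdown shows the classifier fails only on the minority community's documents — exactly the systematic pattern that an external correlation could leave intact and undetected. The same breakdown doubles as the equity check discussed below.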
Trustworthiness in Big-Q approaches
For interpretivist research, trustworthiness criteria (Lincoln & Guba) replace or supplement reliability metrics. These are rarely operationalized in the AI-TA literature, but some sources gesture toward them:
Credibility:
- Member checking (not mentioned in the AI-TA literature, but standard in qualitative research)
- Prolonged engagement with data — threatened when AI summarizes instead of researcher reads
- Triangulation — the dual-analyst design (perkins-roe-genai-inductive-2024) approaches this by comparing human and AI-assisted coding
Dependability:
- Audit trail — most thoroughly addressed in the corpus
- christou-ta-through-ai-2024: “an audit trail showing every step of the process”
- wheeler-technological-reflexivity-2026: prompts are methodological decisions requiring documentation
- xu-ai-thematic-analysis-2026: reflexive memos at every phase
- nguyen-trung-nita-2026: the PERFECT monitoring framework is the most structured audit trail mechanism in the corpus — seven components with documented researcher responses at each stage, including explicit Check & Reflect and Tune stages that produce a record of how AI outputs were evaluated and modified. PERFECT externalizes the reflexive process rather than leaving it implicit in memos.
Confirmability:
- Reflexive memos documenting how AI output was interrogated and where human judgment overrode it
- Rare in the empirical literature; operationalized in CAAI (friese-caai-framework-2026) and AbductivAI (costa-abductivai-2025)
- wise-et-al-2026-ai-not-the-enemy proposes systematic AI-assisted surfacing of disconfirming instances as a new confirmability mechanism, enabling more thorough triangulation than is manually feasible
A proposed new criterion — Transparency in analytic decision-making: wise-et-al-2026-ai-not-the-enemy argues that AI-in-the-loop analysis requires a new criterion beyond Lincoln & Guba’s original four: explicit documentation of (1) what data context was included in the model’s active context, (2) which model and parameters (including temperature) were selected and why, and (3) how iterative prompting was enacted. This addresses the “illusion of neutrality” — the misperception that AI is more objective than human analysis. Without this criterion, AI use becomes a black box precisely where methodological transparency is most needed.
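The three documentation requirements above lend themselves to a structured record. A minimal sketch of such a record as a data structure — the field names and example values are our illustration, not a schema published by Wise et al.:

```python
# Sketch of a transparency record covering the three elements of the proposed
# criterion: data context, model/parameters with rationale, and prompt history.
from dataclasses import dataclass, field

@dataclass
class AnalysisAuditRecord:
    model: str                  # which model and version was used
    temperature: float          # sampling parameter actually set
    rationale: str              # why this model/parameter choice
    data_in_context: list[str] = field(default_factory=list)  # (1) what the model saw
    prompt_history: list[str] = field(default_factory=list)   # (3) iterative prompting

record = AnalysisAuditRecord(
    model="gpt-4o-2024-08-06",  # hypothetical choice
    temperature=0.2,
    rationale="low temperature to reduce run-to-run variation in codes",
)
record.data_in_context.append("interview_07_full_transcript")
record.prompt_history.append("Generate initial codes for segment 3 ...")
```

Keeping such records per analysis session would give reviewers the black-box-opening detail the criterion demands, in the same spirit as the PERFECT audit trail above.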
The eight-criteria framework
montrosse-moorhead-ai-evaluation-2023 offers the most comprehensive framework for AI evaluation in qualitative-adjacent work, derived from Teasdale’s Criteria Domains Framework:
Three implementation criteria:
- Purpose alignment — is AI used for tasks it can perform well, and do those tasks align with research goals?
- Methodological appropriateness — does AI fit the research design and data type?
- Transparency — is AI use documented clearly enough for peer scrutiny?
Five outcome criteria:
- Accuracy — does AI produce correct outputs?
- Credibility — are outputs validated beyond initial benchmarks?
- Equity — does AI perform differentially across population subgroups?
- Efficiency — do speed gains justify the validation burden?
- Ethical integrity — are participant rights, privacy, and consent protected?
The equity criterion is specifically underemphasized in the rest of the corpus. sakaguchi-chatgpt-japanese-2025's finding — AI performs well on descriptive English themes but poorly (~30%) on culturally embedded Japanese themes — is the clearest empirical evidence of criterion 6 failure. dahal-genai-qualitative-nepal-2024's Global South perspective and epistemic-flattening address the structural dimensions.
The Q3 Framework
reeping-llm-quality-framework-2025 adapts the Q3 (Qualifying Qualitative Research Quality) Framework to LLM use. Eight dimensions of quality evaluated processually — at each phase of the research, not just at the output stage:
- Researcher stance and assumptions
- Alignment of research questions, design, and method
- Integrity of the analytic process
- Adequacy of data
- Legitimacy of inferences
- Significance and contribution
- Ethical dimensions
- Communication and transparency
The novel contribution: Q3 requires researchers to articulate their positionality relative to the AI as part of criterion 1. Not just “I am a feminist researcher” but “I am a feminist researcher who used ChatGPT-4o with these prompts, and here is how I interrogated and resisted the AI’s framings.” AI positionality is a new methodological concept this framework introduces.
Common validity failures documented in the corpus
| Failure mode | Evidence | Source |
|---|---|---|
| Quote fabrication | 58% of quotes invented | jowsey-frankenstein-ai-ta-2025 |
| Incomplete data coverage | AI reads first 2–3 pages only | jowsey-frankenstein-ai-ta-2025 |
| Low-frequency code omission | AI misses rare events | salazar-gpt4-qualitative-2025, prescott-ai-thematic-analysis-2024 |
| Cultural bias | Japanese themes: 30% vs. 80% descriptive | sakaguchi-chatgpt-japanese-2025 |
| Language performance gap | German weaker than English | fischer-llm-qda-2024 |
| Indirect validation miss | External correlation ≠ valid measurement | carlsen-ralund-computational-grounded-theory-2022 |
| Epistemic flattening | Dominant patterns amplified; minority suppressed | brailas-ai-qualitative-research-2025, epistemic-flattening |
| Illusion of meaning | Outputs appear interpretively meaningful but are algorithmically derived | dellafiore-et-al-2025-expert-interviews |
| Tool-design validity mismatch | ATLAS.ti/ChatGPT produce post-positivist output for Big-Q research | ayik-et-al-2026-human-vs-ai-ta-tools |
What adequate validation looks like
Drawing on the critical literature (carlsen-ralund-computational-grounded-theory-2022, montrosse-moorhead-ai-evaluation-2023, reeping-llm-quality-framework-2025):
- Human scheme development before AI coding — not AI generation of categories
- Direct validation against human-coded random sample, not just correlation with external variables
- Audit trail documenting prompts, AI outputs, and where human judgment overrode AI
- Quote verification — every AI-identified quote checked against original transcript
- Equity check — performance evaluated separately for different subgroups, languages, or discourse communities
- Reflexive memos — researcher documents how AI shaped what they saw and how they interrogated that influence
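Of these, quote verification is the most mechanizable. A minimal sketch — motivated by the 58% fabrication finding, using hypothetical quotes and transcript text — normalizes whitespace and case, then checks each AI-attributed quote for verbatim presence:

```python
import re

def normalize(text):
    """Collapse whitespace and lowercase, so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_quotes(quotes, transcript):
    """Return the AI-attributed quotes that do NOT appear in the transcript."""
    haystack = normalize(transcript)
    return [q for q in quotes if normalize(q) not in haystack]

transcript = "I felt the clinic staff really listened. The wait was long, though."
ai_quotes = [
    "the clinic staff really listened",
    "I was treated like a number",  # fabricated — not in the transcript
]
print(verify_quotes(ai_quotes, transcript))  # flags only the fabricated quote
```

Exact-match checking of this kind catches outright fabrication but not subtle paraphrase presented as quotation; the latter still requires human review against the source passage.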
See also
- empirical-findings — synthesis of all empirical studies in the corpus; the data underlying the reliability/validity gap
- intercoder-agreement — reliability metrics in detail
- human-ai-collaboration — frameworks that build in validation
- epistemology — validity criteria vary by epistemological tradition
- contested-claims — whether any AI approach can achieve qualitative validity
- computational-grounded-theory — direct vs. indirect validation
- epistemic-flattening — the validity risk that reliability metrics can’t detect
- ai-research-ethics — ethical dimensions of validity failures
- ayik-et-al-2026-human-vs-ai-ta-tools — empirical validity test: four tools vs. human TA; tool design encodes epistemological orientation
- jowsey-et-al-2025-we-reject — categorical position: AI-generated analysis cannot satisfy Big-Q validity criteria in principle
- dellafiore-et-al-2025-expert-interviews — “illusion of meaning” as practitioner-identified validity risk
What links here
- AI Research Ethics
- Ayik et al. (2026) — Human vs. AI: Evaluating TA With ChatGPT, QInsights, ATLAS.ti AI, and MAXQDA AI Assist
- Bennis & Mouwafaq (2025) — Advancing AI-Driven Thematic Analysis: A Comparative Study of Nine Generative Models
- Carlsen & Ralund (2022) — Computational Grounded Theory Revisited: From Computer-Led to Computer-Assisted Text Analysis
- Christou (2023) — How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research?
- Christou (2024) — Thematic Analysis through Artificial Intelligence (AI)
- Computational Grounded Theory
- Contested Claims
- Costa et al. (2025) — AI as a Co-researcher in the Qualitative Research Workflow: Transforming Human-AI Collaboration
- Empirical Findings
- Epistemic Flattening
- Epistemology — Stances Across the Literature
- Fischer & Biemann (2024) — Exploring Large Language Models for Qualitative Data Analysis
- Friese (2026) — From Coding to Conversation: A New Methodological Framework for AI-Assisted Qualitative Analysis
- Goyanes et al. (2025) — Thematic Analysis of Interview Data with ChatGPT: Designing and Testing a Reliable Research Protocol
- AI in Qualitative Research
- Human-AI Collaboration — Frameworks and Models
- Index
- Jowsey et al. (2025) — Frankenstein, Thematic Analysis and Generative AI: Quality Appraisal Methods
- LLMs for Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Naeem et al. (2025) — Thematic Analysis and Artificial Intelligence: A Step-by-Step Process for Using ChatGPT
- Nelson (2020) — Computational Grounded Theory: A Methodological Framework
- Nguyen-Trung (2025) — ChatGPT in Thematic Analysis: GAITA and the ACTOR Framework
- Nguyen-Trung & Nguyen (2026) — Narrative-Integrated Thematic Analysis (NITA)
- Nicmanis & Spurrier (2025) — Getting Started with AI-Assisted Qualitative Analysis: An Introductory Guide
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Prescott et al. (2024) — Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses
- Qualitative AI Methods — A Living Taxonomy
- Reeping et al. (2025) — Interrogating the Use of LLMs in Qualitative Research Using the Q3 Framework
- Sakaguchi et al. (2025) — Evaluating ChatGPT in Qualitative Thematic Analysis in the Japanese Clinical Context
- Salazar et al. (2025) — Comparison of Qualitative Analyses Conducted by Artificial Intelligence Versus Traditional Methods
- Wheeler (2026) — Technological Reflexivity in Practice: How MAXQDA, NVivo, and ChatGPT Shape Qualitative Survey Analysis
- Wise et al. (2026) — Why AI is Not the Enemy: Trustworthy AI-in-the-Loop Analysis
- Xu (2026) — Doing Thematic Analysis in the Age of Generative AI: Practices, Ethics and Reflexivity
- Yang & Ma (2025) — Artificial Intelligence in Qualitative Analysis: A Practical Guide Using GPT-4 on Substance Use Interview Data