TL;DR: The empirical AI-TA literature measures reliability (consistency) while the qualitative methods literature demands validity (accuracy and meaningfulness). These are not the same criterion, and conflating them is the central unacknowledged methodological problem across most of the corpus. The critical literature — Carlsen & Ralund, Brailas, Jowsey — makes this gap explicit and argues that current reliability-focused evaluation cannot detect the most important failures.

The fundamental distinction

Reliability (used in the empirical literature): consistency of coding. Inter-rater reliability (κ, Jaccard) measures whether two coders — human-human, human-AI, or AI-AI — assign the same codes to the same segments. High reliability means the coding is consistent. It does not mean the coding is correct.

Validity (foregrounded in qualitative methods): accuracy and meaningfulness. Does the coding capture what it claims to capture? Do the themes represent participants’ actual meanings? Are the categories genuine constructs of the social world, or artifacts of the researcher’s (or model’s) assumptions?

Trustworthiness (qualitative tradition’s replacement/supplement for validity): Lincoln & Guba’s four criteria:

  • Credibility — confidence that findings accurately represent participants’ experiences (analog to internal validity)
  • Transferability — the degree to which findings apply in other contexts (analog to external validity)
  • Dependability — the research process is documented and consistent (analog to reliability)
  • Confirmability — findings reflect participant voices, not researcher bias (analog to objectivity)

The empirical AI-TA literature almost exclusively measures dependability (reliability/consistency). Credibility, transferability, and confirmability are rarely operationalized.

Where the studies stand

Reliability-focused approaches (κ, Jaccard)

The most common evaluation in the corpus: run AI coding, compute Cohen’s κ or a Jaccard index against human coding, and interpret the result (a minimal code sketch of this computation appears after the list below).

What this establishes: The AI and human coder agree at a specified rate on these codes, for this data, at this time.

What this does not establish:

  • Whether the coding scheme is valid — codes themselves could be wrong, and AI-human agreement on wrong codes is still wrong
  • Whether the AI would produce the same codes tomorrow (temporal instability, noted in multiple sources)
  • Whether the coding captures minority voices or only dominant patterns (epistemic-flattening)
  • Whether themes have the semantic depth that qualitative analysis aims for

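A minimal sketch of that standard computation, assuming one code per segment; the segments and code labels are invented for illustration, and scikit-learn’s cohen_kappa_score is used for κ:

```python
# Minimal sketch of the corpus's standard evaluation: agreement between a
# human coder and an AI coder on the same segments. Labels are illustrative.
from sklearn.metrics import cohen_kappa_score

human = ["barrier", "barrier", "coping", "support", "coping", "support"]
ai    = ["barrier", "coping",  "coping", "support", "coping", "barrier"]

# Cohen's kappa: agreement corrected for chance.
print(f"kappa = {cohen_kappa_score(human, ai):.2f}")

# Jaccard index over the set of codes each coder applied per segment.
# With one code per segment this reduces to exact match; with multi-label
# coding each entry would be a set of codes.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

scores = [jaccard({h}, {a}) for h, a in zip(human, ai)]
print(f"mean Jaccard = {sum(scores) / len(scores):.2f}")
# High values certify consistency only: two coders agreeing on an
# invalid scheme would score just as well.
```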

The Jowsey finding — the validity gap made concrete

jowsey-frankenstein-ai-ta-2025 provides the most direct evidence of the reliability/validity gap. Copilot was tested blind — given the same interview data as published human TA studies, without access to those studies’ findings. Results:

  • Minimal overlap with human analysis themes
  • Only first 2–3 pages of transcripts engaged
  • 58% of quoted excerpts were fabricated — specific quotes attributed to participants that did not appear in the data

This is a validity failure (the AI produced content that does not represent the data) that intercoder reliability metrics would miss entirely if the AI’s coding were evaluated only against its own output: a model that fabricates consistently still scores as reliable. Quote fabrication in particular represents an ethical failure that κ cannot measure.
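
A quote-verification check of the kind this failure motivates is straightforward to sketch. Everything below is illustrative, and real transcripts would likely need fuzzy matching for near-verbatim quotes:

```python
# Sketch: flag AI-attributed quotes that do not appear in the source data.
import re

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, strip punctuation so trivial
    transcription differences do not register as fabrication."""
    text = re.sub(r"\s+", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def verify_quotes(quotes: list[str], transcript: str) -> dict[str, bool]:
    """For each AI-attributed quote, check whether it appears verbatim
    (after normalization) anywhere in the source transcript."""
    haystack = normalize(transcript)
    return {q: normalize(q) in haystack for q in quotes}

transcript = "I felt supported by my team, honestly, every single day."
quotes = [
    "I felt supported by my team",      # present in the data  -> True
    "my manager never listened to me",  # fabricated           -> False
]
print(verify_quotes(quotes, transcript))
```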

The CALM approach — direct validation

carlsen-ralund-computational-grounded-theory-2022 argues that indirect validation (CGT’s approach of correlating topic measures with external variables) cannot detect systematic measurement error. If an LDA model consistently miscodes documents from a particular community, and those miscoded documents still correlate with external variables (because the miscoding is systematic rather than random), the correlation passes — but the measurement is invalid.

Direct validation: human coders produce a gold-standard coding of a random sample → the classifier’s output is compared against this gold standard → measurement error is quantified.

This is the only approach that can detect the kind of systematic bias that epistemic-flattening predicts. Indirect validation establishes only a statistical relationship; it says nothing about whether individual documents are coded correctly.
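
To make the contrast concrete, here is a minimal sketch of direct validation. The documents and both coders are invented stand-ins, with the classifier deliberately given a systematic bias against one community to show what the per-group breakdown reveals:

```python
# Direct validation sketch: score classifier output against a human
# gold standard on a random sample, overall and per community.
import random
from collections import defaultdict

random.seed(7)

docs = [{"id": i,
         "community": random.choice(["A", "B"]),
         "gold": random.choice(["barrier", "support"])}  # human gold code
        for i in range(1000)]

def model_code(doc):
    # Stand-in classifier with a *systematic* error: it miscodes 40% of
    # community B's documents. Aggregate correlations can hide this.
    if doc["community"] == "B" and random.random() < 0.4:
        return "other"
    return doc["gold"]

sample = random.sample(docs, 100)  # random sample for gold-standard coding

errors = defaultdict(list)
for d in sample:
    errors[d["community"]].append(model_code(d) != d["gold"])

for group, errs in sorted(errors.items()):
    print(f"community {group}: error rate {sum(errs) / len(errs):.2f}")
print(f"overall: {sum(e for v in errors.values() for e in v) / len(sample):.2f}")
```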

Trustworthiness in Big-Q approaches

For interpretivist research, trustworthiness criteria (Lincoln & Guba) replace or supplement reliability metrics. These are rarely operationalized in the AI-TA literature, but some sources gesture toward them:

Credibility:

  • Member checking (not mentioned in the AI-TA literature, but standard in qualitative research)
  • Prolonged engagement with data — threatened when the AI summarizes in place of the researcher’s own reading
  • Triangulation — the dual-analyst design (perkins-roe-genai-inductive-2024) approaches this by comparing human and AI-assisted coding

Dependability:

  • Audit trail — most thoroughly addressed in the corpus
  • christou-ta-through-ai-2024: “an audit trail showing every step of the process”
  • wheeler-technological-reflexivity-2026: prompts are methodological decisions requiring documentation
  • xu-ai-thematic-analysis-2026: reflexive memos at every phase
  • nguyen-trung-nita-2026: the PERFECT monitoring framework is the most structured audit trail mechanism in the corpus — seven components with documented researcher responses at each stage, including explicit Check & Reflect and Tune stages that produce a record of how AI outputs were evaluated and modified. PERFECT externalizes the reflexive process rather than leaving it implicit in memos.

Confirmability:

  • Reflexive memos documenting how AI output was interrogated and where human judgment overrode it
  • Rare in the empirical literature; operationalized in CAAI (friese-caai-framework-2026) and AbductivAI (costa-abductivai-2025)
  • wise-et-al-2026-ai-not-the-enemy proposes systematic AI-assisted surfacing of disconfirming instances as a new confirmability mechanism, enabling more thorough triangulation than is manually feasible

A proposed new criterion — Transparency in analytic decision-making: wise-et-al-2026-ai-not-the-enemy argues that AI-in-the-loop analysis requires a new criterion beyond Lincoln & Guba’s original four: explicit documentation of (1) what data context was included in the model’s active context, (2) which model and parameters (including temperature) were selected and why, and (3) how iterative prompting was enacted. This addresses the “illusion of neutrality” — the misperception that AI is more objective than human analysis. Without this criterion, AI use becomes a black box precisely where methodological transparency is most needed.
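
One way to operationalize the proposed criterion is a structured decision log. This is a minimal sketch; every field name is an assumption derived from the three documentation requirements above, not a schema from the source:

```python
# Sketch of a per-decision audit record covering data context, model and
# parameters, prompting, and the human response to the AI output.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class AnalyticDecision:
    phase: str          # e.g. "initial coding", "theme review"
    data_context: str   # what was in the model's active context
    model: str          # which model was selected
    temperature: float  # sampling parameter, chosen deliberately
    prompt: str         # the exact prompt issued
    ai_output: str      # what the model returned
    human_action: str   # accepted / modified / overridden, and why
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = AnalyticDecision(
    phase="initial coding",
    data_context="Interview 3, full transcript",
    model="gpt-4o",
    temperature=0.2,
    prompt="Suggest candidate codes for this segment ...",
    ai_output="Codes: isolation, workload, peer support",
    human_action="Rejected 'workload': artifact of the prompt's framing",
)

# Append to a JSON Lines audit trail kept alongside reflexive memos.
with open("audit_trail.jsonl", "a") as f:
    f.write(json.dumps(asdict(record)) + "\n")
```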

The eight-criteria framework

montrosse-moorhead-ai-evaluation-2023 offers the most comprehensive framework for AI evaluation in qualitative-adjacent work, derived from Teasdale’s Criteria Domains Framework:

Three implementation criteria:

  1. Purpose alignment — is AI used for tasks it can perform well, and do those tasks align with research goals?
  2. Methodological appropriateness — does AI fit the research design and data type?
  3. Transparency — is AI use documented clearly enough for peer scrutiny?

Five outcome criteria:

  4. Accuracy — does AI produce correct outputs?
  5. Credibility — are outputs validated beyond initial benchmarks?
  6. Equity — does AI perform equitably across population subgroups?
  7. Efficiency — do speed gains justify the validation burden?
  8. Ethical integrity — are participant rights, privacy, and consent protected?

The equity criterion is notably underemphasized in the rest of the corpus. sakaguchi-chatgpt-japanese-2025's finding — that AI performs well on descriptive English themes but poorly (~30%) on culturally embedded Japanese themes — is the clearest empirical evidence of failure on criterion 6. dahal-genai-qualitative-nepal-2024's Global South perspective and epistemic-flattening address the structural dimensions.
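
Operationally, the equity criterion means stratifying whatever agreement metric is in use before reporting a single aggregate. A minimal sketch, with invented data that echoes the Sakaguchi pattern rather than the study’s actual results:

```python
# Equity check sketch: stratify the agreement metric by theme type
# instead of reporting one pooled number. All data here is illustrative.
from collections import defaultdict

# (theme type, human theme, AI theme) -- assumed evaluation triples
results = [
    ("descriptive",         "exercise habits",   "exercise habits"),
    ("descriptive",         "dietary change",    "dietary change"),
    ("culturally embedded", "enryo (restraint)", "politeness"),
    ("culturally embedded", "giri (obligation)", "family duty"),
]

matches = defaultdict(list)
for stratum, human_theme, ai_theme in results:
    matches[stratum].append(human_theme == ai_theme)

for stratum, hits in matches.items():
    print(f"{stratum}: {sum(hits) / len(hits):.0%} agreement")
# A single pooled score (here 50%) would hide that one stratum sits at
# 100% and the other at 0%.
```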

The Q3 Framework

reeping-llm-quality-framework-2025 adapts the Q3 (Qualifying Qualitative Research Quality) Framework to LLM use. Eight dimensions of quality evaluated processually — at each phase of the research, not just at the output stage:

  1. Researcher stance and assumptions
  2. Alignment of research questions, design, and method
  3. Integrity of the analytic process
  4. Adequacy of data
  5. Legitimacy of inferences
  6. Significance and contribution
  7. Ethical dimensions
  8. Communication and transparency

The novel contribution: Q3 requires researchers to articulate their positionality relative to the AI as part of criterion 1. Not just “I am a feminist researcher” but “I am a feminist researcher who used ChatGPT-4o with these prompts, and here is how I interrogated and resisted the AI’s framings.” AI positionality is a new methodological concept this framework introduces.

Common validity failures documented in the corpus

Failure mode | Evidence | Source
Quote fabrication | 58% of quoted excerpts invented | jowsey-frankenstein-ai-ta-2025
Incomplete data coverage | Only the first 2–3 pages of transcripts engaged | jowsey-frankenstein-ai-ta-2025
Low-frequency code omission | AI misses rare events | salazar-gpt4-qualitative-2025; prescott-ai-thematic-analysis-2024
Cultural bias | ~30% on culturally embedded Japanese themes vs. ~80% on descriptive themes | sakaguchi-chatgpt-japanese-2025
Language performance gap | German weaker than English | fischer-llm-qda-2024
Indirect validation miss | External correlation ≠ valid measurement | carlsen-ralund-computational-grounded-theory-2022
Epistemic flattening | Dominant patterns amplified; minority voices suppressed | brailas-ai-qualitative-research-2025; epistemic-flattening
Illusion of meaning | Outputs appear interpretively meaningful but are algorithmically derived | dellafiore-et-al-2025-expert-interviews
Tool-design validity mismatch | ATLAS.ti/ChatGPT produce post-positivist output for Big-Q research | ayik-et-al-2026-human-vs-ai-ta-tools

What adequate validation looks like

Drawing on the critical literature (carlsen-ralund-computational-grounded-theory-2022, montrosse-moorhead-ai-evaluation-2023, reeping-llm-quality-framework-2025):

  1. Human scheme development before AI coding — not AI generation of categories
  2. Direct validation against human-coded random sample, not just correlation with external variables
  3. Audit trail documenting prompts, AI outputs, and where human judgment overrode AI
  4. Quote verification — every AI-identified quote checked against original transcript
  5. Equity check — performance evaluated separately for different subgroups, languages, or discourse communities
  6. Reflexive memos — researcher documents how AI shaped what they saw and how they interrogated that influence

See also