TL;DR: The empirical AI-TA literature measures reliability (consistency) while the qualitative methods literature demands validity (accuracy and meaningfulness). These are not the same criterion, and conflating them is the central unacknowledged methodological problem across most of the corpus. The critical literature — Carlsen & Ralund, Brailas, Jowsey — makes this gap explicit and argues that current reliability-focused evaluation cannot detect the most important failures.

The fundamental distinction

Reliability (used in the empirical literature): consistency of coding. Inter-rater reliability (κ, Jaccard) measures whether two coders — human-human, human-AI, or AI-AI — assign the same codes to the same segments. High reliability means the coding is consistent. It does not mean the coding is correct.

Validity (foregrounded in qualitative methods): accuracy and meaningfulness. Does the coding capture what it claims to capture? Do the themes represent participants’ actual meanings? Are the categories genuine constructs of the social world, or artifacts of the researcher’s (or model’s) assumptions?

Trustworthiness (qualitative tradition’s replacement/supplement for validity): Lincoln & Guba’s four criteria:

  • Credibility — confidence that findings accurately represent participants’ experiences (analog to internal validity)
  • Transferability — the degree to which findings apply in other contexts (analog to external validity)
  • Dependability — the research process is documented and consistent (analog to reliability)
  • Confirmability — findings reflect participant voices, not researcher bias (analog to objectivity)

The empirical AI-TA literature almost exclusively measures dependability (reliability/consistency). Credibility, transferability, and confirmability are rarely operationalized.

Where the studies stand

Reliability-focused approaches (κ, Jaccard)

The most common evaluation in the corpus: run AI coding, compute Cohen’s κ or a Jaccard index against human coding, and interpret the result (a minimal code sketch of this computation appears after the list below).

What this establishes: The AI and human coder agree at a specified rate on these codes, for this data, at this time.

What this does not establish:

  • Whether the coding scheme is valid — codes themselves could be wrong, and AI-human agreement on wrong codes is still wrong
  • Whether the AI would produce the same codes tomorrow (temporal instability, noted in multiple sources)
  • Whether the coding captures minority voices or only dominant patterns (epistemic-flattening)
  • Whether themes have the semantic depth that qualitative analysis aims for

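A minimal sketch of that standard computation, assuming one code per segment; the segments and code labels are invented for illustration, and scikit-learn’s cohen_kappa_score is used for κ:

```python
# Minimal sketch of the corpus's standard evaluation: agreement between a
# human coder and an AI coder on the same segments. Labels are illustrative.
from sklearn.metrics import cohen_kappa_score

human = ["barrier", "barrier", "coping", "support", "coping", "support"]
ai    = ["barrier", "coping",  "coping", "support", "coping", "barrier"]

# Cohen's kappa: agreement corrected for chance.
print(f"kappa = {cohen_kappa_score(human, ai):.2f}")

# Jaccard index over the set of codes each coder applied per segment.
# With one code per segment this reduces to exact match; with multi-label
# coding each entry would be a set of codes.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

scores = [jaccard({h}, {a}) for h, a in zip(human, ai)]
print(f"mean Jaccard = {sum(scores) / len(scores):.2f}")
# High values certify consistency only: two coders agreeing on an
# invalid scheme would score just as well.
```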

The Jowsey finding — the validity gap made concrete

jowsey-frankenstein-ai-ta-2025 provides the most direct evidence of the reliability/validity gap. Copilot was tested blind — given the same interview data as published human TA studies, without access to those studies’ findings. Results:

  • Minimal overlap with human analysis themes
  • Only first 2–3 pages of transcripts engaged
  • 58% of quoted excerpts were fabricated — specific quotes attributed to participants that did not appear in the data

This is a validity failure (the AI produced content that does not represent the data) that intercoder reliability metrics would miss entirely if the AI’s coding were evaluated only against its own output: a model that fabricates consistently still scores as reliable. Quote fabrication in particular represents an ethical failure that κ cannot measure.
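
A quote-verification check of the kind this failure motivates is straightforward to sketch. Everything below is illustrative, and real transcripts would likely need fuzzy matching for near-verbatim quotes:

```python
# Sketch: flag AI-attributed quotes that do not appear in the source data.
import re

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, strip punctuation so trivial
    transcription differences do not register as fabrication."""
    text = re.sub(r"\s+", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def verify_quotes(quotes: list[str], transcript: str) -> dict[str, bool]:
    """For each AI-attributed quote, check whether it appears verbatim
    (after normalization) anywhere in the source transcript."""
    haystack = normalize(transcript)
    return {q: normalize(q) in haystack for q in quotes}

transcript = "I felt supported by my team, honestly, every single day."
quotes = [
    "I felt supported by my team",      # present in the data  -> True
    "my manager never listened to me",  # fabricated           -> False
]
print(verify_quotes(quotes, transcript))
```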

The CALM approach — direct validation

carlsen-ralund-computational-grounded-theory-2022 argues that indirect validation (CGT’s approach of correlating topic measures with external variables) cannot detect systematic measurement error. If an LDA model consistently miscodes documents from a particular community, and those miscoded documents still correlate with external variables (because the miscoding is systematic rather than random), the correlation passes — but the measurement is invalid.

Direct validation: human coders produce a gold-standard coding of a random sample → the classifier’s output is compared against this gold standard → measurement error is quantified.

This is the only approach that can detect the kind of systematic bias that epistemic-flattening predicts. Indirect validation establishes only a statistical relationship; it says nothing about whether individual documents are coded correctly.
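
To make the contrast concrete, here is a minimal sketch of direct validation. The documents and both coders are invented stand-ins, with the classifier deliberately given a systematic bias against one community to show what the per-group breakdown reveals:

```python
# Direct validation sketch: score classifier output against a human
# gold standard on a random sample, overall and per community.
import random
from collections import defaultdict

random.seed(7)

docs = [{"id": i,
         "community": random.choice(["A", "B"]),
         "gold": random.choice(["barrier", "support"])}  # human gold code
        for i in range(1000)]

def model_code(doc):
    # Stand-in classifier with a *systematic* error: it miscodes 40% of
    # community B's documents. Aggregate correlations can hide this.
    if doc["community"] == "B" and random.random() < 0.4:
        return "other"
    return doc["gold"]

sample = random.sample(docs, 100)  # random sample for gold-standard coding

errors = defaultdict(list)
for d in sample:
    errors[d["community"]].append(model_code(d) != d["gold"])

for group, errs in sorted(errors.items()):
    print(f"community {group}: error rate {sum(errs) / len(errs):.2f}")
print(f"overall: {sum(e for v in errors.values() for e in v) / len(sample):.2f}")
```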

Trustworthiness in Big-Q approaches

For interpretivist research, trustworthiness criteria (Lincoln & Guba) replace or supplement reliability metrics. These are rarely operationalized in the AI-TA literature, but some sources gesture toward them:

Credibility:

  • Member checking (not mentioned in the AI-TA literature, but standard in qualitative research)
  • Prolonged engagement with data — threatened when the AI summarizes in place of the researcher’s own reading
  • Triangulation — the dual-analyst design (perkins-roe-genai-inductive-2024) approaches this by comparing human and AI-assisted coding

Dependability:

  • Audit trail — most thoroughly addressed in the corpus
  • christou-ta-through-ai-2024: “an audit trail showing every step of the process”
  • wheeler-technological-reflexivity-2026: prompts are methodological decisions requiring documentation
  • xu-ai-thematic-analysis-2026: reflexive memos at every phase
  • nguyen-trung-nita-2026: the PERFECT monitoring framework is the most structured audit trail mechanism in the corpus — seven components with documented researcher responses at each stage, including explicit Check & Reflect and Tune stages that produce a record of how AI outputs were evaluated and modified. PERFECT externalizes the reflexive process rather than leaving it implicit in memos.

Confirmability:

  • Reflexive memos documenting how AI output was interrogated and where human judgment overrode it
  • Rare in the empirical literature; operationalized in CAAI (friese-caai-framework-2026) and AbductivAI (costa-abductivai-2025)
  • wise-et-al-2026-ai-not-the-enemy proposes systematic AI-assisted surfacing of disconfirming instances as a new confirmability mechanism, enabling more thorough triangulation than is manually feasible

A proposed new criterion — Transparency in analytic decision-making: wise-et-al-2026-ai-not-the-enemy argues that AI-in-the-loop analysis requires a new criterion beyond Lincoln & Guba’s original four: explicit documentation of (1) what data context was included in the model’s active context, (2) which model and parameters (including temperature) were selected and why, and (3) how iterative prompting was enacted. This addresses the “illusion of neutrality” — the misperception that AI is more objective than human analysis. Without this criterion, AI use becomes a black box precisely where methodological transparency is most needed.
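
One way to operationalize the proposed criterion is a structured decision log. This is a minimal sketch; every field name is an assumption derived from the three documentation requirements above, not a schema from the source:

```python
# Sketch of a per-decision audit record covering data context, model and
# parameters, prompting, and the human response to the AI output.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class AnalyticDecision:
    phase: str          # e.g. "initial coding", "theme review"
    data_context: str   # what was in the model's active context
    model: str          # which model was selected
    temperature: float  # sampling parameter, chosen deliberately
    prompt: str         # the exact prompt issued
    ai_output: str      # what the model returned
    human_action: str   # accepted / modified / overridden, and why
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = AnalyticDecision(
    phase="initial coding",
    data_context="Interview 3, full transcript",
    model="gpt-4o",
    temperature=0.2,
    prompt="Suggest candidate codes for this segment ...",
    ai_output="Codes: isolation, workload, peer support",
    human_action="Rejected 'workload': artifact of the prompt's framing",
)

# Append to a JSON Lines audit trail kept alongside reflexive memos.
with open("audit_trail.jsonl", "a") as f:
    f.write(json.dumps(asdict(record)) + "\n")
```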

The eight-criteria framework

montrosse-moorhead-ai-evaluation-2023 offers the most comprehensive framework for AI evaluation in qualitative-adjacent work, derived from Teasdale’s Criteria Domains Framework:

Three implementation criteria:

  1. Purpose alignment — is AI used for tasks it can perform well, and do those tasks align with research goals?
  2. Methodological appropriateness — does AI fit the research design and data type?
  3. Transparency — is AI use documented clearly enough for peer scrutiny?

Five outcome criteria:

  4. Accuracy — does AI produce correct outputs?
  5. Credibility — are outputs validated beyond initial benchmarks?
  6. Equity — does AI perform equitably across population subgroups?
  7. Efficiency — do speed gains justify the validation burden?
  8. Ethical integrity — are participant rights, privacy, and consent protected?

The equity criterion is notably underemphasized in the rest of the corpus. sakaguchi-chatgpt-japanese-2025's finding — that AI performs well on descriptive English themes but poorly (~30%) on culturally embedded Japanese themes — is the clearest empirical evidence of failure on criterion 6. dahal-genai-qualitative-nepal-2024's Global South perspective and epistemic-flattening address the structural dimensions.
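
Operationally, the equity criterion means stratifying whatever agreement metric is in use before reporting a single aggregate. A minimal sketch, with invented data that echoes the Sakaguchi pattern rather than the study’s actual results:

```python
# Equity check sketch: stratify the agreement metric by theme type
# instead of reporting one pooled number. All data here is illustrative.
from collections import defaultdict

# (theme type, human theme, AI theme) -- assumed evaluation triples
results = [
    ("descriptive",         "exercise habits",   "exercise habits"),
    ("descriptive",         "dietary change",    "dietary change"),
    ("culturally embedded", "enryo (restraint)", "politeness"),
    ("culturally embedded", "giri (obligation)", "family duty"),
]

matches = defaultdict(list)
for stratum, human_theme, ai_theme in results:
    matches[stratum].append(human_theme == ai_theme)

for stratum, hits in matches.items():
    print(f"{stratum}: {sum(hits) / len(hits):.0%} agreement")
# A single pooled score (here 50%) would hide that one stratum sits at
# 100% and the other at 0%.
```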

The Q3 Framework

reeping-llm-quality-framework-2025 adapts the Q3 (Qualifying Qualitative Research Quality) Framework to LLM use. Eight dimensions of quality evaluated processually — at each phase of the research, not just at the output stage:

  1. Researcher stance and assumptions
  2. Alignment of research questions, design, and method
  3. Integrity of the analytic process
  4. Adequacy of data
  5. Legitimacy of inferences
  6. Significance and contribution
  7. Ethical dimensions
  8. Communication and transparency

The novel contribution: Q3 requires researchers to articulate their positionality relative to the AI as part of criterion 1. Not just “I am a feminist researcher” but “I am a feminist researcher who used ChatGPT-4o with these prompts, and here is how I interrogated and resisted the AI’s framings.” AI positionality is a new methodological concept this framework introduces.

Common validity failures documented in the corpus

Failure mode | Evidence | Source
Quote fabrication | 58% of quoted excerpts invented | jowsey-frankenstein-ai-ta-2025
Incomplete data coverage | Only the first 2–3 pages of transcripts engaged | jowsey-frankenstein-ai-ta-2025
Low-frequency code omission | AI misses rare events | salazar-gpt4-qualitative-2025; prescott-ai-thematic-analysis-2024
Cultural bias | ~30% on culturally embedded Japanese themes vs. ~80% on descriptive themes | sakaguchi-chatgpt-japanese-2025
Language performance gap | German weaker than English | fischer-llm-qda-2024
Indirect validation miss | External correlation ≠ valid measurement | carlsen-ralund-computational-grounded-theory-2022
Epistemic flattening | Dominant patterns amplified; minority voices suppressed | brailas-ai-qualitative-research-2025; epistemic-flattening
Illusion of meaning | Outputs appear interpretively meaningful but are algorithmically derived | dellafiore-et-al-2025-expert-interviews
Tool-design validity mismatch | ATLAS.ti/ChatGPT produce post-positivist output for Big-Q research | ayik-et-al-2026-human-vs-ai-ta-tools

What adequate validation looks like

Drawing on the critical literature (carlsen-ralund-computational-grounded-theory-2022, montrosse-moorhead-ai-evaluation-2023, reeping-llm-quality-framework-2025):

  1. Human scheme development before AI coding — not AI generation of categories
  2. Direct validation against human-coded random sample, not just correlation with external variables
  3. Audit trail documenting prompts, AI outputs, and where human judgment overrode AI
  4. Quote verification — every AI-identified quote checked against original transcript
  5. Equity check — performance evaluated separately for different subgroups, languages, or discourse communities
  6. Reflexive memos — researcher documents how AI shaped what they saw and how they interrogated that influence

See also