| TL;DR | Across roughly 15 empirical studies — benchmarks, head-to-head comparisons, practitioner interviews, multi-tool analyses — the record supports four conclusions: AI codes structured, mechanical tasks with high consistency; it fails systematically at cultural and linguistic depth; hallucination risk is almost entirely a function of prompt design, not model capability; and different tools produce structurally different output in ways that reflect implicit epistemological commitments. The empirical record is better at characterizing failure modes than at establishing what AI-assisted analysis is actually good for. |
|---|---|
Scope: This page covers studies that collected data, ran comparisons, or systematically observed AI behavior in research contexts. It excludes theoretical critiques, framework proposals, commentary, and open letters — even where those works contain secondary empirical references. For epistemological debates about what these findings mean, see validity-trustworthiness and contested-claims.
Reliability benchmarks — AI vs. human coding
The largest cluster of empirical work asks a simple comparative question: how consistently does AI coding match human coding?
bijker-chatgpt-qca-2024 is the most methodologically clean test in the corpus. GPT-3.5 Turbo was used to code social media data using both inductive (categories derived from data) and deductive (categories mapped to a predefined framework) schemes. Results: κ 0.72–0.82 for inductive coding; κ 0.58–0.73 for deductive coding. The gap matters: inductive coding benefits from AI’s ability to generate rich, example-laden category labels that anchor consistent coding. Deductive coding suffers from overlapping framework categories and sparse semantic labels — precisely the conditions where AI struggles to disambiguate.
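To make the metric concrete: Cohen's κ is raw agreement corrected for chance, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected from each coder's label distribution. A minimal sketch with toy labels (illustrative only, not data from the Bijker study):

```python
# Illustrative sketch only: toy labels, not data from the Bijker study.
from sklearn.metrics import cohen_kappa_score

human = ["barrier", "facilitator", "barrier", "neutral", "facilitator",
         "barrier", "neutral", "facilitator", "barrier", "neutral"]
ai    = ["barrier", "facilitator", "barrier", "facilitator", "facilitator",
         "barrier", "neutral", "neutral", "barrier", "neutral"]

# kappa = (p_o - p_e) / (1 - p_e): raw agreement (here 8/10) corrected for
# the agreement expected by chance given each coder's label distribution.
print(round(cohen_kappa_score(human, ai), 2))  # 0.7
```

By the common Landis and Koch reading, κ in the 0.61–0.80 range counts as substantial agreement; Bijker's deductive range (0.58–0.73) straddles the lower edge of that band.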
bennis-ai-thematic-analysis-2025 is the most aggressive benchmark. Nine LLMs — including GPT-4o, Claude 3.5, ChatGPT o1-Pro, and open-source alternatives — were evaluated against expert human analysis. The best models reached Jaccard index = 1.00 (perfect concordance). ChatGPT o1-Pro led across metrics. The finding that near-perfect concordance is achievable raises a ceiling question: if reliability can reach 1.00, what does the remaining disagreement between AI and human analysis consist of? The paper treats this as a reliability success. Critics (see validity-trustworthiness) read it as revealing the limits of reliability as a criterion.
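The Jaccard index behind the 1.00 figure is plain set overlap, |A ∩ B| / |A ∪ B|, computed here over theme sets. A minimal sketch with hypothetical theme names (not the benchmark's data):

```python
# Illustrative sketch only: hypothetical themes, not the Bennis benchmark.
human_themes = {"workload", "autonomy", "peer support", "burnout"}
model_themes = {"workload", "autonomy", "peer support", "burnout"}

# Jaccard index = |A ∩ B| / |A ∪ B|; 1.0 means identical theme sets.
jaccard = len(human_themes & model_themes) / len(human_themes | model_themes)
print(jaccard)  # 1.0: the reported "perfect concordance"
```

Note what the metric cannot see: a Jaccard of 1.0 at the theme-set level says nothing about how individual segments were assigned within those themes, which is one reason set-level concordance can coexist with the segment-level disagreement reported next.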
prescott-ai-thematic-analysis-2024 reports more modest numbers in a real-world deployment: 71% inductive theme match between ChatGPT/Bard and human analysts; only fair-to-moderate reliability (κ 0.37–0.47) despite the theme match. Speed: 28× faster than human analysis. The gap between theme match and reliability is the methodologically interesting finding — it suggests AI and humans arrive at similar categories through different internal logics, and those logics produce different within-category decisions.
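That divergence is easy to reproduce in miniature: two coders can share an identical category set (a 100% theme match) while assigning individual segments differently, which drives κ down. A toy illustration with fabricated assignments, not the study's data:

```python
# Illustrative sketch only: fabricated codings, not the Prescott study's data.
from sklearn.metrics import cohen_kappa_score

human = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
ai    = ["A", "A", "B", "C", "B", "B", "A", "C", "C", "B"]

assert set(human) == set(ai)                   # identical theme sets: 100% "theme match"
print(round(cohen_kappa_score(human, ai), 2))  # 0.4: within-theme decisions diverge
```

Despite perfect category overlap, the toy κ lands squarely in the study's fair-to-moderate band.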
salazar-gpt4-qualitative-2025 tests GPT-4 against expert human coding on pharmacy education qualitative data. Broad alignment on dominant themes; systematic undercounting of low-frequency codes. This low-frequency failure is predicted by epistemic-flattening — AI optimization toward statistically probable outputs structurally suppresses rare but analytically significant material.
hamilton-ai-qualitative-2023 is the first paper to reframe the comparison as complementarity. In an early ChatGPT comparison on guaranteed income interview data, human and AI analyses each caught themes the other missed. Neither was strictly superior; the combination provided richer coverage. This finding — later formalized by Perkins & Roe — shifts the question from “can AI match humans?” to “what does the combination produce?”
Cultural and linguistic limitations
The reliability numbers above mostly come from English-language, Western-context data. The studies that test beyond that find a different picture.
sakaguchi-chatgpt-japanese-2025 is the critical test. ChatGPT was applied to Japanese healthcare interview data. Descriptive themes — straightforward thematic content — achieved >80% agreement with human analysis. Culturally embedded themes — concepts like gaman (endurance), fate, social harmony — achieved approximately 30% agreement. The drop is not marginal; it is a structural limitation. Training data is English-dominant and Western-hegemonic; culturally specific meaning is precisely what statistical pre-training cannot capture.
fischer-llm-qda-2024 benchmarks open-source LLMs (Llama, Gemma, Mistral) on qualitative data analysis tasks in two languages. Performance on English-language data was promising and comparable to closed models. German-language performance was substantially weaker. Language coverage in pre-training data directly predicts where reliability will hold and where it will degrade. This has obvious implications for non-English research contexts — see dahal-genai-qualitative-nepal-2024 for a practitioner perspective.
Hallucination and data coverage
jowsey-frankenstein-ai-ta-2025 is the most alarming empirical study in the corpus. Copilot was given the same interview data as published human thematic analyses and asked to produce its own analyses blind — without access to the published results. Three findings:
- Minimal overlap with the human analysis in any of the tested studies.
- Partial coverage — Copilot engaged with only the first 2–3 pages of the transcripts, ignoring later material entirely.
- 58% of quoted excerpts were fabricated — specific quotes attributed to participants that did not appear in the source data.
The Copilot study used no explicit constraints against hallucination. The contrasting result from ayik-et-al-2026-human-vs-ai-ta-tools — zero hallucinations across four tools when the prompt explicitly specified "do not incorporate any external sources or data beyond these" — supports the inference that hallucination risk is a property of prompt design, not an inherent property of the model.
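What "explicit constraints" look like is worth making concrete. The sketch below is a hypothetical prompt template: only the quoted grounding clause comes from the Ayik et al. prompt; the other constraints are illustrative assumptions, including one aimed at the partial-coverage failure above.

```python
# Hypothetical prompt template. Only the quoted grounding clause is from the
# Ayik et al. study; all other wording is an illustrative assumption.
ANALYSIS_PROMPT = """\
You are assisting with thematic analysis of the interview transcripts below.

Transcripts:
{transcripts}

Constraints:
- Do not incorporate any external sources or data beyond these.
- Quote participants verbatim only; every excerpt you cite must appear
  word-for-word in the transcripts above.
- Consider the transcripts in full, not only the opening pages.
"""
```

Whether an instruction alone secures full-transcript coverage is untested in these studies; chunking long transcripts to fit the model's context window is a plausible complementary safeguard.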
Multi-tool comparison
ayik-et-al-2026-human-vs-ai-ta-tools is the only study that systematically compares structurally distinct AI tools (rather than comparing models within a single tool type) against a validated human baseline. Four tools — ChatGPT-4o, QInsights, ATLAS.ti AI, MAXQDA AI Assist — were each given identical prompts and identical data (25 K-12 STEM teachers discussing formative assessment), analyzed against a six-theme human-coded baseline.
Results by tool:
| Tool | Exact theme matches | Partial matches | Notable |
|---|---|---|---|
| ChatGPT-4o | 1/6 (17%) | 5/6 | Treated a single teacher's repeated responses as multiple data points |
| QInsights | 1/6 (17%) | 5/6 | Generated themes before receiving research questions |
| ATLAS.ti AI | 1/6 (17%) | 5/6 | Generated 1,001 codes in under 2 minutes; missed multimodality theme |
| MAXQDA AI Assist | 3/6 (50%) | 1/6 | Required researcher engagement before producing output |
Hallucinations: None across all four tools (careful prompting applied).
The epistemological finding: The authors read tool behavior as reflecting implicit epistemological stances. ATLAS.ti and ChatGPT emphasized frequency and pattern recurrence — a post-positivist orientation. QInsights and MAXQDA foregrounded researcher interaction and dialogue — an interpretivist orientation. Tool choice is a methodological decision with epistemological consequences, not a neutral instrument selection. This connects to chatzichristos-ai-positivism-2025's concern about positivism creep and to paulus-marone-qdas-discourse-2024's analysis of QDAS marketing.
wheeler-technological-reflexivity-2026 provides an independent three-tool comparison on a different dataset: 1,300+ qualitative survey responses from young people imagining the future under climate change, analyzed with MAXQDA, NVivo, and ChatGPT. Rather than measuring theme match, Wheeler examines how each tool mediated her own analytic decisions — what she noticed, what she wrote, what she treated as a finding. Key finding: ChatGPT’s conversational interface made it easier to accept synthetic interpretations without tracing them to specific data segments. This is epistemic-flattening operationalized as a methodological risk in a concrete workflow.
Practitioner behavior studies
dellafiore-et-al-2025-expert-interviews conducted semi-structured interviews with 14 Italian expert qualitative researchers in socio-anthropological and healthcare contexts. All 14 were selected for their established reputation for human-led analysis. 13 of 14 admitted using AI in their research — though several had initially presented themselves as non-users. The dominant emotion was shame. Two substantive findings:
- Task split: Experts accept AI for transcription, translation, literature review, and writing support; AI use is contested or refused for coding, theming, and analytic interpretation. Embodied fieldwork is categorically excluded.
- “Illusion of meaning”: Practitioners coined this term to name a specific risk distinct from hallucination. AI output appears interpretively meaningful — it has the vocabulary, structure, and register of qualitative analysis — but is algorithmically derived rather than developed through researcher immersion. The illusion is that the output represents understanding. The danger is subtle enough that experienced researchers may not detect it.
zhang-ai-qualitative-research-2025 reports on interviews with qualitative researchers plus a co-design study developing AI-assisted workflows. Finding: transparency in AI processes and explicit guidance on prompt design were the primary drivers of researcher trust in AI tools. Researchers who felt they understood what the AI was doing were more willing to integrate it; opacity — characteristic of closed models and QDAS black-box implementations — produced resistance.
Dual-analyst and complementarity designs
perkins-roe-genai-inductive-2024 runs parallel analyses — one human analyst, one GPT-4-assisted analyst — on the same educational dataset, then synthesizes the results. The synthesis produces broader thematic coverage than either analysis alone. Divergences between human and AI outputs are treated as analytically interesting rather than as error: AI-unique themes flag potential human blind spots; human-unique themes flag contextually embedded meaning that AI missed.
Hallucinations were documented: GPT-4 occasionally generated fabricated quotations not present in the source data. Every AI-generated claim required verification against source transcripts — a validation burden that partially offsets efficiency gains but was not quantified.
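That verification step is mechanically simple even when it is tedious: an AI-attributed excerpt either appears verbatim in the source transcripts or it does not. A minimal sketch of such a check (the normalization choices are assumptions, not the authors' documented procedure):

```python
# Illustrative sketch, not the Perkins & Roe verification procedure.
import re

def normalize(text: str) -> str:
    """Collapse whitespace and case so formatting differences don't mask a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(ai_quotes: list[str], transcripts: list[str]) -> dict[str, bool]:
    """Map each AI-attributed excerpt to whether it appears verbatim in any transcript."""
    corpus = [normalize(t) for t in transcripts]
    return {quote: any(normalize(quote) in doc for doc in corpus)
            for quote in ai_quotes}

# Any False value flags a potential fabrication for manual review.
# Exact substring matching is deliberately strict; paraphrases also fail it.
```

Strict substring matching trades false flags (punctuation or transcription cleanup breaks a match) for the guarantee that nothing fabricated slips through; a fuzzy-matching variant would reverse that trade.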
hamilton-ai-qualitative-2023 reached the same complementarity conclusion earlier (2023, guaranteed income interviews), establishing this as a reproducible finding rather than a single result.
What the empirical record shows
Synthesizing across all studies:
Confirmed:
- AI codes structured data reliably and efficiently against human-developed schemes at the small-q end (κ 0.70+)
- AI fails systematically at cultural depth, low-frequency codes, and non-Western/non-English content
- Hallucination risk is nearly eliminated by explicit constraints — it is a design problem, not a model problem
- Tool architecture encodes epistemological orientation; dialogic tools produce interpretivist output; frequency-based tools produce post-positivist output
- Complementarity (human + AI > either alone) is reproducible in inductive designs
- Expert practitioners use AI more than they disclose, and the technical/interpretive task split is real and operating in practice
Not established by the empirical record:
- Whether AI-assisted analysis produces valid qualitative findings (as distinct from reliable ones) — see validity-trustworthiness
- Whether any tool achieves validity comparable to researcher-immersive analysis for Big-Q research
- Long-term effects on researcher skill development
- Equity effects: differential performance across populations, languages, and research contexts remains understudied beyond the Japanese-language and Global South examples
The gap between reliability and validity runs through the entire empirical corpus. High intercoder agreement does not establish that the coding scheme captures what it claims to capture, that themes represent participant meaning, or that the analysis would survive member checking. This gap is the central methodological problem the critical literature addresses — see validity-trustworthiness for the full treatment.
See also
- validity-trustworthiness — reliability vs. validity; what these studies establish and what they miss
- intercoder-agreement — κ, Jaccard, and their interpretation
- epistemic-flattening — the structural mechanism behind low-frequency code omission and cultural failure
- contested-claims — Claims 1, 3, and 4 address AI quality, speed, and hallucination directly
- human-ai-collaboration — frameworks that build adequate validation into the workflow
- llm-qualitative-research — broader landscape including critical and theoretical sources
- ayik-et-al-2026-human-vs-ai-ta-tools — the four-tool comparison in full
- jowsey-frankenstein-ai-ta-2025 — the hallucination and coverage study
- sakaguchi-chatgpt-japanese-2025 — the cultural depth failure
- dellafiore-et-al-2025-expert-interviews — the practitioner behavior study