| url | https://doi.org/10.1186/s12911-025-02961-5 |
|---|---|
| raw | raw/Bennis_s12911-025-02961-5.pdf |
TL;DR: The most optimistic empirical benchmark in the corpus. Nine state-of-the-art AI models compared against expert human thematic analysis on psychosocial data. Several models achieve a Jaccard index of 1.00 (perfect concordance). ChatGPT o1-Pro leads. The pace of improvement is startling — months, not years.
Problem
Every comparative study of AI-assisted qualitative analysis is a snapshot in a fast-moving field. Bijker et al. (2024) tested GPT-3.5 Turbo. The question Bennis & Mouwafaq address is: what happens when you run a comparable evaluation with the best models available at the end of 2024? And what happens when you test multiple models simultaneously rather than one?
The domain is also important: this is not forum posts or SMS messages, but psychosocial data on the lived experience of cutaneous leishmaniasis (CL) scars in Moroccan high school students — data that requires sensitivity to embodied experience, stigma, and cultural context. If AI can handle this, it can probably handle most structured qualitative datasets.
Approach
The dataset comprised 448 direct student quotations from a prior study (Bennis et al. 2017), already coded in NVivo by a multinational team of anthropologists, sociologists, and public health specialists. This existing expert analysis serves as the reference standard (Reference A).
Nine models were tested across two cohorts:
- July 2024 cohort: Llama 3.1 405B, Claude 3.5 Sonnet, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-preview
- December 2024 cohort: ChatGPT o1 (replacing preview), GrokV2, DeepSeekV3, Gemini 2.0 Advanced, ChatGPT o1-Pro
Concordance was assessed with Cohen's kappa (computed in Jamovi) and the Jaccard index (computed in Python). Gender-specific subgroup analyses were also conducted, making this one of the few studies to examine within-sample variation in AI performance.
A second human analyst (the paper’s second author) also conducted manual analysis, providing a human-versus-human baseline.
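A minimal sketch of how the two concordance metrics relate to coded quotations, assuming one theme label per quotation per coder. The theme names are invented and the scikit-learn kappa stands in for the paper's Jamovi computation; this is not the authors' pipeline.

```python
# Minimal sketch: invented theme labels, scikit-learn kappa in place of Jamovi.
from sklearn.metrics import cohen_kappa_score

def jaccard_index(themes_a: set, themes_b: set) -> float:
    """Overlap between the theme sets identified by two coders: |A∩B| / |A∪B|."""
    if not themes_a and not themes_b:
        return 1.0
    return len(themes_a & themes_b) / len(themes_a | themes_b)

# Hypothetical per-quotation theme assignments: human reference (A) vs. one model.
reference = ["stigma", "self-esteem", "scarring", "stigma", "social-withdrawal"]
model     = ["stigma", "self-esteem", "scarring", "stigma", "stigma"]

kappa = cohen_kappa_score(reference, model)        # chance-corrected agreement per quotation
jac   = jaccard_index(set(reference), set(model))  # agreement on which themes exist at all

print(f"Cohen's kappa = {kappa:.2f}, Jaccard index = {jac:.2f}")
```

On this reading, a Jaccard index of 1.00 would mean a model surfaced exactly the reference's theme set, while kappa tracks quotation-level assignment.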
AI’s role
AI is positioned here as a potential replacement for human expert analysis — or at least as capable of approximating it closely enough to serve as a reliable substitute for structured thematic tasks. The framing is pragmatic and optimistic: if concordance is high enough, AI can reduce the time and cost of analysis without compromising validity (as measured against the expert reference standard).
This is a more ambitious claim than most of the corpus makes. Bijker et al. positioned ChatGPT as a “second coder” assisting humans; Bennis & Mouwafaq are implicitly asking whether it can be a primary coder.
Epistemological stance
Post-positivist, with a strong quantitative evaluation logic. Concordance metrics (κ, Jaccard) are the primary evidence. The reference standard is treated as ground truth. The paper does not engage with interpretive or constructionist epistemologies — whether the expert human analysis itself is “valid” rather than simply authoritative is not a question the study raises.
The psychosocial domain introduces complexity that sits at the edge of what concordance metrics can capture: the "fragile circle of vulnerabilities" framework (a theoretical contribution the paper makes) emerges from interpretive synthesis, not from matching codes to a reference list.
Rigor and trustworthiness
The Jaccard index of 1.00 for several models is genuinely striking. Within the paper’s own framework — concordance with a prior expert standard — this represents near-perfect performance. The kappa calculations add nuance: some models show high concordance on one gender subgroup but not the other, revealing that aggregate concordance can mask within-sample variation.
The gender-specific analysis is methodologically important. Most AI evaluation studies report aggregate statistics; Bennis & Mouwafaq show that performance can differ meaningfully across subgroups, with implications for equity in AI-assisted research (→ montrosse-moorhead-ai-evaluation-2023).
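As a toy illustration of that masking effect (invented labels, not the study's data), pooled kappa can look moderate even when one subgroup's agreement is at chance level:

```python
# Toy illustration with invented labels -- NOT data from Bennis & Mouwafaq.
# Pooled (aggregate) kappa can look moderate even when one subgroup shows
# chance-level agreement between the human reference and the model.
from sklearn.metrics import cohen_kappa_score

girls_ref = ["stigma", "stigma", "self-esteem", "marriage-prospects", "stigma", "self-esteem"]
girls_ai  = ["stigma", "stigma", "self-esteem", "marriage-prospects", "stigma", "self-esteem"]

boys_ref  = ["stigma", "self-esteem", "scarring", "stigma", "scarring", "self-esteem"]
boys_ai   = ["scarring", "self-esteem", "stigma", "stigma", "self-esteem", "scarring"]

print("girls :", round(cohen_kappa_score(girls_ref, girls_ai), 2))   # 1.0  (perfect)
print("boys  :", round(cohen_kappa_score(boys_ref, boys_ai), 2))     # 0.0  (chance level)
print("pooled:", round(cohen_kappa_score(girls_ref + boys_ref,
                                         girls_ai + boys_ai), 2))    # ~0.51 (looks moderate)
```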
The five-month gap between cohorts is also methodologically revealing: models improved substantially between July and December 2024. This makes any specific reliability figures quickly obsolete — the point is not the numbers, but the trajectory.
Limitations
The reference standard is a prior human analysis, not an independent ground truth. If the original expert team made systematic interpretive choices that reflected their theoretical orientation (as qualitative researchers inevitably do), then high AI concordance may mean AI replicates those choices, not that it captures some objective truth about the data. This is the validity problem in a different form.
The paper does not engage with the epistemological implications of the “fragile circle of vulnerabilities” framework the AI helped generate. If AI participated in theory generation, the provenance of that theory — and what it means to claim it — becomes complicated.
Reproducibility concerns are flagged but not resolved: the study calls for a standardized reporting checklist but does not provide one.
Connections
- intercoder-agreement — Cohen’s κ and Jaccard index used throughout
- bijker-chatgpt-qca-2024 — the earlier benchmark this study extends and updates
- llm-qualitative-research — broader landscape
- epistemic-flattening — the concordance-with-reference-standard design cannot detect whether AI is replicating biases embedded in the reference
- validity-trustworthiness — the reliability-without-validity gap is most acute here, where concordance figures are highest
- contested-claims — whether Jaccard = 1.00 represents genuine analytic success or successful imitation of one team’s interpretive choices
What links here
- AI Research Ethics
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Contested Claims
- Empirical Findings
- Epistemology — Stances Across the Literature
- Goyanes et al. (2025) — Thematic Analysis of Interview Data with ChatGPT: Designing and Testing a Reliable Research Protocol
- Hamilton et al. (2023) — Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data
- AI in Qualitative Research
- Index
- Jowsey et al. (2025) — We Reject the Use of Generative AI for Reflexive Qualitative Research
- Jowsey et al. (2025) — Frankenstein, Thematic Analysis and Generative AI: Quality Appraisal Methods
- LLMs for Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Prescott et al. (2024) — Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses
- Qualitative AI Methods — A Living Taxonomy
- Sakaguchi et al. (2025) — Evaluating ChatGPT in Qualitative Thematic Analysis in the Japanese Clinical Context
- Validity and Trustworthiness