| url | https://doi.org/10.1186/s12911-025-02961-5 |
|---|---|
| raw | raw/Bennis_s12911-025-02961-5.pdf |
TL;DR: The most optimistic empirical benchmark in the corpus. Nine state-of-the-art AI models compared against expert human thematic analysis on psychosocial data. Several models achieve a Jaccard index of 1.00 (perfect concordance). ChatGPT o1-Pro leads. The pace of improvement is startling — months, not years.
Problem
Every comparative study of AI-assisted qualitative analysis is a snapshot in a fast-moving field. Bijker et al. (2024) tested GPT-3.5 Turbo. The question Bennis & Mouwafaq address is: what happens when you run a comparable evaluation with the best models available at the end of 2024? And what happens when you test multiple models simultaneously rather than one?
The domain is also important: this is not forum posts or SMS messages, but psychosocial data on the lived experience of cutaneous leishmaniasis (CL) scars in Moroccan high school students — data that requires sensitivity to embodied experience, stigma, and cultural context. If AI can handle this, it can probably handle most structured qualitative datasets.
Approach
The dataset comprised 448 direct student quotations from a prior study (Bennis et al. 2017), already coded in NVivo by a multinational team of anthropologists, sociologists, and public health specialists. This existing expert analysis serves as the reference standard (Reference A).
Nine models were tested across two cohorts:
- July 2024 cohort: Llama 3.1 405B, Claude 3.5 Sonnet, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-preview
- December 2024 cohort: ChatGPT o1 (replacing preview), GrokV2, DeepSeekV3, Gemini 2.0 Advanced, ChatGPT o1-Pro
Concordance was assessed with Cohen's kappa (computed in Jamovi) and the Jaccard index (computed in Python). Gender-specific subgroup analyses were also conducted, making this one of the few studies to examine within-sample variation in AI performance.
A second human analyst (the paper’s second author) also conducted manual analysis, providing a human-versus-human baseline.
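A minimal sketch of how the two concordance metrics relate to coded quotations, assuming one theme label per quotation per coder. The theme names are invented and the scikit-learn kappa stands in for the paper's Jamovi computation; this is not the authors' pipeline.

```python
# Minimal sketch: invented theme labels, scikit-learn kappa in place of Jamovi.
from sklearn.metrics import cohen_kappa_score

def jaccard_index(themes_a: set, themes_b: set) -> float:
    """Overlap between the theme sets identified by two coders: |A∩B| / |A∪B|."""
    if not themes_a and not themes_b:
        return 1.0
    return len(themes_a & themes_b) / len(themes_a | themes_b)

# Hypothetical per-quotation theme assignments: human reference (A) vs. one model.
reference = ["stigma", "self-esteem", "scarring", "stigma", "social-withdrawal"]
model     = ["stigma", "self-esteem", "scarring", "stigma", "stigma"]

kappa = cohen_kappa_score(reference, model)        # chance-corrected agreement per quotation
jac   = jaccard_index(set(reference), set(model))  # agreement on which themes exist at all

print(f"Cohen's kappa = {kappa:.2f}, Jaccard index = {jac:.2f}")
```

On this reading, a Jaccard index of 1.00 would mean a model surfaced exactly the reference's theme set, while kappa tracks quotation-level assignment.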
AI’s role
AI is positioned here as a potential replacement for human expert analysis — or at least as capable of approximating it closely enough to serve as a reliable substitute for structured thematic tasks. The framing is pragmatic and optimistic: if concordance is high enough, AI can reduce the time and cost of analysis without compromising validity (as measured against the expert reference standard).
This is a more ambitious claim than most of the corpus makes. Bijker et al. positioned ChatGPT as a “second coder” assisting humans; Bennis & Mouwafaq are implicitly asking whether it can be a primary coder.
Epistemological stance
Post-positivist, with a strong quantitative evaluation logic. Concordance metrics (κ, Jaccard) are the primary evidence. The reference standard is treated as ground truth. The paper does not engage with interpretive or constructionist epistemologies — whether the expert human analysis itself is “valid” rather than simply authoritative is not a question the study raises.
The psychosocial domain introduces complexity that sits at the edge of what concordance metrics can capture: the "fragile circle of vulnerabilities" framework (a theoretical contribution the paper makes) emerges from interpretive synthesis, not from matching codes to a reference list.
Rigor and trustworthiness
The Jaccard index of 1.00 for several models is genuinely striking. Within the paper’s own framework — concordance with a prior expert standard — this represents near-perfect performance. The kappa calculations add nuance: some models show high concordance on one gender subgroup but not the other, revealing that aggregate concordance can mask within-sample variation.
The gender-specific analysis is methodologically important. Most AI evaluation studies report aggregate statistics; Bennis & Mouwafaq show that performance can differ meaningfully across subgroups, with implications for equity in AI-assisted research (→ montrosse-moorhead-ai-evaluation-2023).
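As a toy illustration of that masking effect (invented labels, not the study's data), pooled kappa can look moderate even when one subgroup's agreement is at chance level:

```python
# Toy illustration with invented labels -- NOT data from Bennis & Mouwafaq.
# Pooled (aggregate) kappa can look moderate even when one subgroup shows
# chance-level agreement between the human reference and the model.
from sklearn.metrics import cohen_kappa_score

girls_ref = ["stigma", "stigma", "self-esteem", "marriage-prospects", "stigma", "self-esteem"]
girls_ai  = ["stigma", "stigma", "self-esteem", "marriage-prospects", "stigma", "self-esteem"]

boys_ref  = ["stigma", "self-esteem", "scarring", "stigma", "scarring", "self-esteem"]
boys_ai   = ["scarring", "self-esteem", "stigma", "stigma", "self-esteem", "scarring"]

print("girls :", round(cohen_kappa_score(girls_ref, girls_ai), 2))   # 1.0  (perfect)
print("boys  :", round(cohen_kappa_score(boys_ref, boys_ai), 2))     # 0.0  (chance level)
print("pooled:", round(cohen_kappa_score(girls_ref + boys_ref,
                                         girls_ai + boys_ai), 2))    # ~0.51 (looks moderate)
```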
The five-month gap between cohorts is also methodologically revealing: models improved substantially between July and December 2024. This makes any specific reliability figures quickly obsolete — the point is not the numbers, but the trajectory.
Limitations
The reference standard is a prior human analysis, not an independent ground truth. If the original expert team made systematic interpretive choices that reflected their theoretical orientation (as qualitative researchers inevitably do), then high AI concordance may mean AI replicates those choices, not that it captures some objective truth about the data. This is the validity problem in a different form.
The paper does not engage with the epistemological implications of the “fragile circle of vulnerabilities” framework the AI helped generate. If AI participated in theory generation, the provenance of that theory — and what it means to claim it — becomes complicated.
Reproducibility concerns are flagged but not resolved: the study calls for a standardized reporting checklist but does not provide one.
Connections
- intercoder-agreement — Cohen’s κ and Jaccard index used throughout
- bijker-chatgpt-qca-2024 — the earlier benchmark this study extends and updates
- llm-qualitative-research — broader landscape
- epistemic-flattening — the concordance-with-reference-standard design cannot detect whether AI is replicating biases embedded in the reference
- validity-trustworthiness — the reliability-without-validity gap is most acute here, where concordance figures are highest
- contested-claims — whether Jaccard = 1.00 represents genuine analytic success or successful imitation of one team’s interpretive choices
What links here
- AI Research Ethics
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Contested Claims
- Empirical Findings
- Epistemology — Stances Across the Literature
- Goyanes et al. (2025) — Thematic Analysis of Interview Data with ChatGPT: Designing and Testing a Reliable Research Protocol
- Hamilton et al. (2023) — Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data
- AI in Qualitative Research
- Index
- Jowsey et al. (2025) — We Reject the Use of Generative AI for Reflexive Qualitative Research
- Jowsey et al. (2025) — Frankenstein, Thematic Analysis and Generative AI: Quality Appraisal Methods
- LLMs for Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Prescott et al. (2024) — Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses
- Qualitative AI Methods — A Living Taxonomy
- Sakaguchi et al. (2025) — Evaluating ChatGPT in Qualitative Thematic Analysis in the Japanese Clinical Context
- Validity and Trustworthiness