| url | https://doi.org/10.1371/journal.pone.0330217 |
|---|---|
| raw | raw/Jowsey_journal.pone.0330217.pdf |
TL;DR: The sharpest empirical counter-evidence in the corpus. Microsoft Copilot applied to five openly published qualitative datasets produced minimal theme overlap with human researchers, fabricated quotes in 58% of cases, drew almost exclusively from the first 2–3 pages of data, and showed no capacity for discursive thematic analysis. The authors cannot recommend Copilot for thematic analysis. This is the critical corrective to optimistic concordance studies.
Problem
The empirical literature on AI-assisted thematic analysis has a systematic design bias. Studies that show high concordance (Bijker et al.: κ = 0.72–0.82; Bennis & Mouwafaq: Jaccard = 1.00; see the worked example at the end of this section) share a common feature: the human-coded data were available to the researchers who designed the AI prompts. The AI was not analyzing data cold; it was working in a context where the research team understood the coding scheme and could design prompts informed by prior knowledge of the findings.
This design gives AI substantial advantages. A genuinely blind comparison — where AI analyzes data without any contextual cues from the prior human analysis — is much harder to arrange. It requires published datasets from prior studies where the analysis is also published, allowing genuine comparison without contamination.
Jowsey et al. exploit exactly this design. Their systematic search identified five studies that had published both their thematic analysis findings (in peer-reviewed journals) and the underlying data (in open repositories). Copilot was then prompted to analyze the same data without any guidance from the prior findings — a true blind comparison.
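To make the concordance metrics concrete: the Jaccard index over two theme sets is |A ∩ B| / |A ∪ B|, so Jaccard = 1.00 means the AI-derived and human theme sets matched exactly. A minimal Python sketch, with invented theme labels; in practice, deciding when two differently worded themes "match" is itself a human judgment.

```python
def jaccard(themes_a: set[str], themes_b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over two sets of theme labels."""
    if not themes_a and not themes_b:
        return 1.0
    return len(themes_a & themes_b) / len(themes_a | themes_b)

# Invented labels, for illustration only.
human = {"stigma", "access barriers", "trust in clinicians"}
print(jaccard(human, human))                      # 1.0: identical theme sets
print(jaccard(human, {"cost", "waiting times"}))  # 0.0: no overlap
```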
Approach
Search strategy: Three databases (UK Data Service, Figshare, Google Scholar) and five journals (PLOS ONE, Social Science & Medicine, Qualitative Inquiry, Qualitative Research, Health Sociology Review) were searched for health-related studies meeting strict inclusion criteria: human TA + published peer-reviewed findings + published open dataset.
Evaluation measures:
- Time: How long did Copilot take? (Compared to human analysis time)
- Accuracy: Did Copilot’s output reflect what was in the data?
- Theme overlap: How many of Copilot’s themes matched human-identified themes?
- Quote reliability: Were the quotes Copilot attributed to participants actually in the data? (A check of this kind is sketched below.)
- Participant spread: Did Copilot distribute themes across the full range of participants?
- Data coverage: Did Copilot draw from the full dataset or a subset?
Comparison group: Human analysis from the five published studies, treated as the reference standard.
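One way to operationalize the quote-reliability measure, as a minimal sketch and not the paper's actual procedure: count a quote as fabricated unless its normalized text appears verbatim in the source transcript. The function names and the normalization rule are illustrative assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    trivial transcription differences do not register as fabrication."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def fabrication_rate(attributed_quotes: list[str], transcript: str) -> float:
    """Fraction of attributed quotes not found verbatim in the transcript."""
    if not attributed_quotes:
        return 0.0
    haystack = normalize(transcript)
    missing = sum(1 for q in attributed_quotes if normalize(q) not in haystack)
    return missing / len(attributed_quotes)
```

Exact matching is deliberately strict: a lightly paraphrased quote counts as fabricated here, which is one reason a check like this ultimately needs human adjudication.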
Key Results
| Measure | Copilot | Human |
|---|---|---|
| Theme overlap | Minimal | — (reference standard) |
| Discursive TA capability | None (standard TA in all five cases) | 40% of studies (2 of 5) used discursive TA |
| Quote errors | 58% fabricated (SD = 45%) | 21% incorrect (SD = 27%) |
| Participant spread | Absent | Present in all five studies |
| Data coverage | First 2–3 pages only | Full dataset |
The 58% fabrication rate is the finding that demands the most attention. More than half of the quotes Copilot attributed to participants were not in the data. In qualitative research, where quotes are the primary evidence supporting analytic claims, a 58% fabrication rate makes the analysis fundamentally unreliable as a research product.
The data coverage finding — that Copilot drew almost exclusively from the first 2–3 pages of each dataset — explains why theme overlap was minimal: Copilot was not analyzing the data, it was analyzing its beginning.
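The coverage finding suggests a simple diagnostic, sketched here under the same illustrative assumptions: locate each verifiable quote in the transcript and report how far into the text it sits. Depths clustered near 0 indicate the model is drawing only on the dataset's opening.

```python
def quote_depths(quotes: list[str], transcript: str) -> list[float]:
    """Relative start position (0.0 = beginning, 1.0 = end) of each quote
    that can be located verbatim, case-insensitively, in the transcript."""
    hay = transcript.lower()
    positions = [hay.find(q.lower()) for q in quotes]
    return [p / max(len(hay), 1) for p in positions if p >= 0]

# A Copilot-like pattern would be depths such as [0.01, 0.03, 0.05]:
# everything drawn from the first few pages of the dataset.
```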
AI’s Role
AI appears in this paper as the subject of critical evaluation — evaluated against a rigorous standard and found substantially deficient. The paper does not position AI as a useful tool with caveats; it concludes that Copilot, in its current version, cannot be recommended for thematic analysis.
The Frankenstein metaphor in the title is interpretively pointed: a creation built from assembled parts that does not work as intended, and whose creator may not fully appreciate the consequences of its outputs.
Epistemological Stance
Post-positivist / empiricist, with an explicit quality-appraisal agenda. The paper’s evaluation criteria are concrete and measurable: fabrication rate, theme overlap, data coverage. The reference standard is published human analysis, treated as the benchmark.
The paper raises a secondary concern that is epistemologically interesting: not just that Copilot performs poorly, but that human TA itself shows quality problems — 21% incorrect quotes, variable participant spread. This positions the study as raising concerns about qualitative research quality generally, not just AI-assisted quality.
Rigor and Trustworthiness
The blind-comparison design is the paper’s strongest methodological feature — it is the most genuine test of AI capability in the corpus, precisely because it removes the contextual cues that make guided AI comparison easier. The systematic search procedure ensures the five datasets are not cherry-picked.
The decision to compare against published human analyses rather than against a fresh human analysis of the same data means the comparison conflates two things: Copilot's limitations and the analytic choices of the specific published studies. A published TA from 2019 may not use the approach a 2025 researcher would, but it is the best available reference for studies where the original researchers are not involved.
The five-dataset sample is small, but its size is dictated by a rigorous search procedure rather than by cherry-picking. The inclusion criteria are demanding: studies that publish both their findings and an open dataset are rare enough that five was the achievable sample size.
Limitations
The study tests Microsoft Copilot specifically, not ChatGPT-4, Claude, or Gemini. Model differences are substantial, and the findings cannot be generalized to other LLMs without replication. The paper acknowledges this, framing generalization beyond Copilot as a warranted concern rather than a supported conclusion.
The five datasets are all health-related, which may affect generalizability to social science, education, or humanities research contexts. Health datasets often contain sensitive, personal narratives — the stakes of fabrication are particularly high here.
The data coverage finding (first 2–3 pages) may reflect a Copilot-specific context window constraint rather than a general LLM limitation. Models with larger context windows may handle full datasets differently.
Connections
- llm-qualitative-research — the most critical empirical finding in the landscape; this paper is essential context for any optimistic concordance claim
- intercoder-agreement — why direct validation (human-coded reference standard, genuine blind comparison) matters; the design illustrates the limitations of guided comparison designs
- ai-research-ethics — 58% fabricated quotes is an ethics violation: manufactured evidence misleads readers and harms participants whose voices are falsely represented
- bennis-ai-thematic-analysis-2025 — the optimistic counterpoint; Jaccard = 1.00 under guided conditions vs. minimal overlap under blind conditions — compare the study designs
- bijker-chatgpt-qca-2024 — another high-concordance study; same design-bias concern applies
- salazar-gpt4-qualitative-2025 — hallucination documented here too; Jowsey provides the systematic frequency estimate
- perkins-roe-genai-inductive-2024 — hallucination noted as a challenge; Jowsey provides the systematic frequency estimate
- validity-trustworthiness — the paper raises quality concerns about both AI-generated and human-generated TA; a corrective to the reliability-without-validity gap
- contested-claims — the contrast between Jowsey and Bennis/Bijker is the sharpest empirical controversy in the corpus
What links here
- Anis & French (2023) — Efficient, Explicatory, and Equitable: Why Qualitative Researchers Should Embrace AI, but Cautiously
- Ayik et al. (2026) — Human vs. AI: Evaluating TA With ChatGPT, QInsights, ATLAS.ti AI, and MAXQDA AI Assist
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Christou (2023) — How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research?
- Contested Claims
- Davison et al. (2024) — The Ethics of Using Generative AI for Qualitative Data Analysis
- De Paoli (2026) — Why We Should Reject to Reject the Use of Generative AI in Qualitative Analysis
- Empirical Findings
- AI in Qualitative Research
- Human-AI Collaboration — Frameworks and Models
- Index
- Jowsey et al. (2025) — We Reject the Use of Generative AI for Reflexive Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Qualitative AI Methods — A Living Taxonomy
- Reeping et al. (2025) — Interrogating the Use of LLMs in Qualitative Research Using the Q3 Framework
- Salazar et al. (2025) — Comparison of Qualitative Analyses Conducted by Artificial Intelligence Versus Traditional Methods
- Validity and Trustworthiness