| url | https://doi.org/10.1371/journal.pone.0330217 |
|---|---|
| raw | raw/Jowsey_journal.pone.0330217.pdf |
TL;DR: The sharpest empirical counter-evidence in the corpus. Microsoft Copilot applied to five openly published qualitative datasets produced minimal theme overlap with human researchers, fabricated quotes in 58% of cases, drew almost exclusively from the first 2–3 pages of data, and showed no capacity for discursive thematic analysis. The authors cannot recommend Copilot for thematic analysis. This is the critical corrective to optimistic concordance studies.
Problem
The empirical literature on AI-assisted thematic analysis has a systematic design bias. Studies that show high concordance (Bijker et al.: κ = 0.72–0.82; Bennis & Mouwafaq: Jaccard = 1.00; see the worked example at the end of this section) share a common feature: the human-coded data were available to the researchers who designed the AI prompts. The AI was not analyzing data cold; it was working in a context where the research team understood the coding scheme and could design prompts informed by prior knowledge of the findings.
This design gives AI substantial advantages. A genuinely blind comparison — where AI analyzes data without any contextual cues from the prior human analysis — is much harder to arrange. It requires published datasets from prior studies where the analysis is also published, allowing genuine comparison without contamination.
Jowsey et al. exploit exactly this design. Their systematic search identified five studies that had published both their thematic analysis findings (in peer-reviewed journals) and the underlying data (in open repositories). Copilot was then prompted to analyze the same data without any guidance from the prior findings — a true blind comparison.
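To make the concordance metrics concrete: the Jaccard index over two theme sets is |A ∩ B| / |A ∪ B|, so Jaccard = 1.00 means the AI-derived and human theme sets matched exactly. A minimal Python sketch, with invented theme labels; in practice, deciding when two differently worded themes "match" is itself a human judgment.

```python
def jaccard(themes_a: set[str], themes_b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over two sets of theme labels."""
    if not themes_a and not themes_b:
        return 1.0
    return len(themes_a & themes_b) / len(themes_a | themes_b)

# Invented labels, for illustration only.
human = {"stigma", "access barriers", "trust in clinicians"}
print(jaccard(human, human))                      # 1.0: identical theme sets
print(jaccard(human, {"cost", "waiting times"}))  # 0.0: no overlap
```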
Approach
Search strategy: Three databases (UK Data Service, Figshare, Google Scholar) and five journals (PLOS ONE, Social Science & Medicine, Qualitative Inquiry, Qualitative Research, Health Sociology Review) were searched for health-related studies meeting strict inclusion criteria: human TA + published peer-reviewed findings + published open dataset.
Evaluation measures:
- Time: How long did Copilot take? (Compared to human analysis time)
- Accuracy: Did Copilot’s output reflect what was in the data?
- Theme overlap: How many of Copilot’s themes matched human-identified themes?
- Quote reliability: Were the quotes Copilot attributed to participants actually in the data? (A check of this kind is sketched below.)
- Participant spread: Did Copilot distribute themes across the full range of participants?
- Data coverage: Did Copilot draw from the full dataset or a subset?
Comparison group: Human analysis from the five published studies, treated as the reference standard.
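One way to operationalize the quote-reliability measure, as a minimal sketch and not the paper's actual procedure: count a quote as fabricated unless its normalized text appears verbatim in the source transcript. The function names and the normalization rule are illustrative assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    trivial transcription differences do not register as fabrication."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def fabrication_rate(attributed_quotes: list[str], transcript: str) -> float:
    """Fraction of attributed quotes not found verbatim in the transcript."""
    if not attributed_quotes:
        return 0.0
    haystack = normalize(transcript)
    missing = sum(1 for q in attributed_quotes if normalize(q) not in haystack)
    return missing / len(attributed_quotes)
```

Exact matching is deliberately strict: a lightly paraphrased quote counts as fabricated here, which is one reason a check like this ultimately needs human adjudication.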
Key Results
| Measure | Copilot | Human |
|---|---|---|
| Theme overlap | Minimal | — (reference standard) |
| Discursive TA capability | None (standard TA in all five cases) | 40% of studies (2 of 5) used discursive TA |
| Quote errors | 58% fabricated (SD = 45%) | 21% incorrect (SD = 27%) |
| Participant spread | Absent | Present in all five studies |
| Data coverage | First 2–3 pages only | Full dataset |
The 58% fabrication rate is the finding that demands the most attention. More than half of the quotes Copilot attributed to participants were not in the data. In qualitative research, where quotes are the primary evidence supporting analytic claims, a 58% fabrication rate makes the analysis fundamentally unreliable as a research product.
The data coverage finding — that Copilot drew almost exclusively from the first 2–3 pages of each dataset — explains why theme overlap was minimal: Copilot was not analyzing the data, it was analyzing its beginning.
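The coverage finding suggests a simple diagnostic, sketched here under the same illustrative assumptions: locate each verifiable quote in the transcript and report how far into the text it sits. Depths clustered near 0 indicate the model is drawing only on the dataset's opening.

```python
def quote_depths(quotes: list[str], transcript: str) -> list[float]:
    """Relative start position (0.0 = beginning, 1.0 = end) of each quote
    that can be located verbatim, case-insensitively, in the transcript."""
    hay = transcript.lower()
    positions = [hay.find(q.lower()) for q in quotes]
    return [p / max(len(hay), 1) for p in positions if p >= 0]

# A Copilot-like pattern would be depths such as [0.01, 0.03, 0.05]:
# everything drawn from the first few pages of the dataset.
```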
AI’s Role
AI appears in this paper as the subject of critical evaluation — evaluated against a rigorous standard and found substantially deficient. The paper does not position AI as a useful tool with caveats; it concludes that Copilot, in its current version, cannot be recommended for thematic analysis.
The Frankenstein metaphor in the title is interpretively pointed: a creation built from assembled parts that does not work as intended, and whose creator may not fully appreciate the consequences of its outputs.
Epistemological Stance
Post-positivist / empiricist, with an explicit quality-appraisal agenda. The paper’s evaluation criteria are concrete and measurable: fabrication rate, theme overlap, data coverage. The reference standard is published human analysis, treated as the benchmark.
The paper raises a secondary concern that is epistemologically interesting: not just that Copilot performs poorly, but that human TA itself shows quality problems — 21% incorrect quotes, variable participant spread. This positions the study as raising concerns about qualitative research quality generally, not just AI-assisted quality.
Rigor and Trustworthiness
The blind-comparison design is the paper’s strongest methodological feature — it is the most genuine test of AI capability in the corpus, precisely because it removes the contextual cues that make guided AI comparison easier. The systematic search procedure ensures the five datasets are not cherry-picked.
The decision to compare against published human analyses rather than against a fresh human analysis of the same data means the comparison conflates two things: Copilot's limitations and the analytic choices of the specific published studies. A published TA from 2019 may not use the approach a 2025 researcher would, but it is the best available reference for studies where the original researchers are not involved.
The five-dataset sample is small, but its size is dictated by a rigorous search procedure rather than by cherry-picking. The inclusion criteria are demanding: studies that publish both their findings and an open dataset are rare enough that five was the achievable sample size.
Limitations
The study tests Microsoft Copilot specifically, not ChatGPT-4, Claude, or Gemini. Model differences are substantial, and the findings cannot be generalized to other LLMs without replication. The paper acknowledges this, framing generalization beyond Copilot as a warranted concern rather than a supported conclusion.
The five datasets are all health-related, which may affect generalizability to social science, education, or humanities research contexts. Health datasets often contain sensitive, personal narratives — the stakes of fabrication are particularly high here.
The data coverage finding (first 2–3 pages) may reflect a Copilot-specific context window constraint rather than a general LLM limitation. Models with larger context windows may handle full datasets differently.
Connections
- llm-qualitative-research — the most critical empirical finding in the landscape; this paper is essential context for any optimistic concordance claim
- intercoder-agreement — why direct validation (human-coded reference standard, genuine blind comparison) matters; the design illustrates the limitations of guided comparison designs
- ai-research-ethics — 58% fabricated quotes is an ethics violation: manufactured evidence misleads readers and harms participants whose voices are falsely represented
- bennis-ai-thematic-analysis-2025 — the optimistic counterpoint; Jaccard = 1.00 under guided conditions vs. minimal overlap under blind conditions — compare the study designs
- bijker-chatgpt-qca-2024 — another high-concordance study; same design-bias concern applies
- salazar-gpt4-qualitative-2025 — hallucination documented here too; Jowsey provides the systematic frequency estimate
- perkins-roe-genai-inductive-2024 — hallucination noted as a challenge; Jowsey provides the systematic frequency estimate
- validity-trustworthiness — the paper raises quality concerns about both AI-generated and human-generated TA; a corrective to the reliability-without-validity gap
- contested-claims — the contrast between Jowsey and Bennis/Bijker is the sharpest empirical controversy in the corpus
What links here
- Anis & French (2023) — Efficient, Explicatory, and Equitable: Why Qualitative Researchers Should Embrace AI, but Cautiously
- Ayik et al. (2026) — Human vs. AI: Evaluating TA With ChatGPT, QInsights, ATLAS.ti AI, and MAXQDA AI Assist
- Bijker et al. (2024) — ChatGPT for Automated Qualitative Research: Content Analysis
- Christou (2023) — How to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research?
- Contested Claims
- Davison et al. (2024) — The Ethics of Using Generative AI for Qualitative Data Analysis
- De Paoli (2026) — Why We Should Reject to Reject the Use of Generative AI in Qualitative Analysis
- Empirical Findings
- AI in Qualitative Research
- Human-AI Collaboration — Frameworks and Models
- Index
- Jowsey et al. (2025) — We Reject the Use of Generative AI for Reflexive Qualitative Research
- Montrosse-Moorhead (2023) — Evaluation Criteria for Artificial Intelligence
- Perkins & Roe (2024) — The Use of Generative AI in Qualitative Analysis: Inductive Thematic Analysis with ChatGPT
- Qualitative AI Methods — A Living Taxonomy
- Reeping et al. (2025) — Interrogating the Use of LLMs in Qualitative Research Using the Q3 Framework
- Salazar et al. (2025) — Comparison of Qualitative Analyses Conducted by Artificial Intelligence Versus Traditional Methods
- Validity and Trustworthiness