Source
url: https://doi.org/10.1016/j.ajpe.2025.101882
raw: raw/Salazar_1-s2.0-S0002945925005285-main.pdf

TL;DR: A brief report comparing GPT-4 against human-led qualitative analysis on two health professions education datasets. GPT-4 generally aligned with human-identified codes and themes at the descriptive level but failed systematically on low-frequency codes, code relationships, and interpretive nuance. Hallucination of quotations and code names was documented. Conclusion: GPT-4 can support but not replace human-led analysis, and requires a researcher with enough methodological background to identify its gaps.

Problem

Qualitative research is underutilized in health professions education despite its capacity to generate rich insight into the beliefs, experiences, and social circumstances that shape learning and patient care. The barriers are structural: insufficient training in qualitative methods among health professions faculty, and time-intensive analytic processes that compete with clinical and teaching demands.

Generative AI, and GPT-4 specifically, has attracted attention as a potential solution to both barriers: it could accelerate analysis and lower the technical threshold for qualitative work. But the evidence base in health professions education was thin. At the time of publication, most existing comparisons had been conducted in social science contexts with interview or focus group data; their applicability to health professions datasets — particularly open-ended survey responses from pharmacy faculty and interview data from patient populations — was unknown.

Salazar et al. address this gap with a reanalysis design: two previously human-analyzed datasets (one survey-based, one interview-based) are submitted to GPT-4, and the outputs are systematically compared to the original human analysis across multiple quality dimensions.

Approach

Two datasets provided the analytic material:

Dataset 1 (EIQ): 36 survey responses from UCSF pharmacy faculty about an initiative to improve exam item quality. Previously analyzed using inductive content analysis by two investigators.

Dataset 2 (HTN): Transcripts from seven 1-hour interviews with Black or African American patients with hypertension receiving pharmacist-led care. Previously analyzed using inductive thematic analysis with multiple coders and collaborative reconciliation.

GPT-4 was accessed through Versa — a UCSF-developed private, HIPAA-compliant platform running GPT-4 Turbo (version: turbo-2024-04-09) — which adds a layer of data security relevant to health research contexts.

Prompts were structured using the ACTOR framework: Actor (GPT-4 as study investigator), Context (research questions and qualitative approach), Task (generate codes and themes), Output (organized results), Reference (supporting materials). Prompts were iteratively refined — when GPT-4 initially produced themes where codes were requested, the researchers revised the prompt to include a definition of “code.” This iterative refinement process is documented in an appendix, making the prompting decisions transparent.
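
The structure is concrete enough to sketch. A minimal illustration of an ACTOR-shaped prompt for the EIQ survey data follows; the five components are from the paper, but the wording, the `build_actor_prompt` helper, and the placeholder text are illustrative assumptions, not the authors' appendix prompt.

```python
# Illustrative sketch of an ACTOR-structured prompt for the EIQ survey data.
# The five components (Actor, Context, Task, Output, Reference) come from the
# paper; the wording below is a hypothetical reconstruction, not the authors'
# appendix prompt.

def build_actor_prompt(research_question: str, responses: list[str]) -> str:
    actor = "You are a study investigator on a qualitative research team."
    context = (
        f"Research question: {research_question} "
        "The study uses inductive content analysis."
    )
    # The paper notes that defining "code" in the task was a key refinement
    # after GPT-4 initially returned themes where codes were requested.
    task = (
        "Generate codes from the data below. A code is a short label for a "
        "discrete concept appearing in a response, narrower than a theme."
    )
    output = "Return each code with a brief definition and one supporting quotation."
    reference = "Data:\n" + "\n".join(f"- {r}" for r in responses)
    return "\n\n".join([actor, context, task, output, reference])
```

Note how the definition of “code” lives in the Task component; per the paper's refinement log, that single addition is what stopped GPT-4 from returning themes where codes were requested.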

Three researchers independently compared GPT-4 outputs to prior human findings on accuracy, alignment, relevance, and appropriateness, using dichotomous ratings with explanatory notes. Discrepancies were resolved collaboratively with additional arbitrators present.
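
The four evaluation dimensions and the dichotomous ratings lend themselves to a simple record structure. A hypothetical sketch of how per-item ratings and arbitration flagging could be represented (field names and the `needs_arbitration` helper are assumptions, not the authors' instrument):

```python
# Hypothetical data structure for the comparison process: three evaluators
# give yes/no ratings on four dimensions, and any non-unanimous dimension is
# flagged for collaborative arbitration. Field names are assumptions.
from dataclasses import dataclass

DIMENSIONS = ("accuracy", "alignment", "relevance", "appropriateness")

@dataclass
class Rating:
    evaluator: str
    scores: dict[str, bool]  # dimension -> yes/no judgment
    notes: str               # explanatory note accompanying the ratings

def needs_arbitration(ratings: list[Rating]) -> list[str]:
    """Return the dimensions on which the evaluators disagree."""
    return [d for d in DIMENSIONS if len({r.scores[d] for r in ratings}) > 1]
```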

AI’s Role

AI is positioned as a potential analytic supplement to human-led research — not as an independent analyst. The paper’s framing is cautious: GPT-4 can support the process, but its outputs require interpretation by a researcher who already understands qualitative methods well enough to identify what is missing, redundant, or wrong.

This is a more conservative claim than Bennis & Mouwafaq’s optimistic benchmarks. The ACTOR prompting framework treats GPT-4 explicitly as an AI study investigator — the prompts are designed to simulate methodological context — but the outputs still require substantial human evaluation.

The hallucination finding is particularly significant for AI’s role: GPT-4 occasionally generated quotations and code names that did not correspond to anything in the actual data. In a health professions education context, where data includes sensitive patient information and faculty feedback, undetected hallucinations could corrupt the research record. This makes the requirement for human oversight non-negotiable.

Epistemological Stance

Post-positivist, within a health sciences research framework. The evaluation criteria — accuracy, alignment, relevance, appropriateness — are operationalized dichotomously (yes/no ratings), reflecting a quantitative sensibility applied to a qualitative task. The paper does not engage with interpretivist or constructionist epistemologies.

This is appropriate for the domain. Health professions education research typically operates within a post-positivist framework where reliability and systematic comparison are standard quality criteria. The paper’s scope is explicitly bounded: it is not claiming that GPT-4 can replace interpretive qualitative research, but that it can support structured qualitative tasks in applied health education contexts.

Rigor and Trustworthiness

The multi-evaluator comparison design — three independent researchers rating GPT-4 outputs against human analysis, with arbitration for disagreements — provides reasonable rigor within the paper’s framework. The inclusion of evaluators who were not involved in the original human analysis is methodologically important: it reduces the risk that comparison ratings are shaped by familiarity with the original findings.

The ACTOR prompting framework, documented in the appendix, makes the prompting process replicable. The iterative refinement log (initial prompt, GPT-4 output, revised prompt, revised output) is among the more transparent prompt documentation in the corpus.

The brief report format, however, limits depth. The results section provides qualitative description and tabular summaries rather than systematic quantification. The number of aligned vs. non-aligned codes is reported, but without the κ statistics or Jaccard indices that characterize the more rigorous benchmarks in the corpus.
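
For context, the set-overlap statistic the brief format omits is cheap to compute once codes have been matched across analysts. A minimal sketch of the Jaccard index over human and GPT-4 code sets (matching code labels that differ in wording but not meaning is the genuinely hard step and is assumed away here; the example labels are invented):

```python
# Minimal sketch: Jaccard index between human and GPT-4 code sets.
# Assumes code labels have already been matched across analysts,
# which is the genuinely hard step.

def jaccard(human_codes: set[str], gpt_codes: set[str]) -> float:
    if not human_codes and not gpt_codes:
        return 1.0
    return len(human_codes & gpt_codes) / len(human_codes | gpt_codes)

# Invented example labels:
human = {"item clarity", "peer review burden", "faculty buy-in", "time cost"}
gpt = {"item clarity", "time cost", "assessment culture"}
print(jaccard(human, gpt))  # 2 shared / 5 total = 0.4
```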

Limitations

The two datasets are small by any standard: 36 survey responses and 7 interview transcripts. Even combined, this is a minimal corpus for evaluating AI performance, and low-frequency findings in small datasets are particularly vulnerable to noise. GPT-4's failure to detect 7 infrequently used codes in the EIQ data (each applied 1–2 times) is consistent with a pattern across the literature, but it cannot be ruled out that some of those codes were marginal rather than analytically important.

The GPT-4 version used (turbo-2024-04-09) was already dated at the time of publication. Model capabilities shift substantially between versions, and the specific failure patterns documented here — hallucination, difficulty with code relationships, coarse theme granularity — may not characterize more recent models.

The paper does not quantify the reliability of the human-comparison process itself. Whether the three evaluators agreed with one another on what constituted “alignment” with the original analysis is not reported, which introduces an unacknowledged layer of uncertainty.
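
Quantifying that layer would have been straightforward with the raw ratings in hand. A sketch of Fleiss' κ for three raters making yes/no judgments, using the standard formula on invented counts:

```python
# Minimal sketch of Fleiss' kappa for three raters and two categories
# (yes/no). yes_counts[i] is the number of "yes" votes (0-3) item i received.
# The counts below are invented; the paper reports no such data.

def fleiss_kappa(yes_counts: list[int], n_raters: int = 3) -> float:
    n_items = len(yes_counts)
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_bar = sum(
        (y * y + (n_raters - y) ** 2 - n_raters) / (n_raters * (n_raters - 1))
        for y in yes_counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    p_yes = sum(yes_counts) / (n_items * n_raters)
    p_e = p_yes ** 2 + (1 - p_yes) ** 2
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical: 10 items, mostly unanimous, with a few split decisions.
print(fleiss_kappa([3, 3, 0, 3, 2, 3, 0, 3, 1, 3]))  # ≈ 0.68
```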

The equity implications of using a private, institutional AI platform (Versa/UCSF) versus public ChatGPT are not addressed. Researchers at institutions without comparable infrastructure would need to navigate different privacy and capability constraints.

Connections

  • prescott-ai-thematic-analysis-2024 — parallel empirical study in digital health; same failure pattern (descriptive alignment, interpretive gap) with a larger dataset
  • bijker-chatgpt-qca-2024 — the more rigorous benchmark; GPT-3.5 Turbo with extensive κ measurement; compare the methodological difference
  • anis-french-ai-qualitative-research-2023 — reframes the low-frequency code failure as potentially analytically productive rather than just a limitation
  • jowsey-frankenstein-ai-ta-2025 — the critical counterpart; hallucination documented more systematically and framed as a fundamental reliability threat
  • llm-qualitative-research — the broader landscape
  • prompt-engineering — the ACTOR framework documented here is one of the better domain-specific prompting approaches in the corpus
  • ai-research-ethics — hallucination in health research contexts is an ethical issue, not just a quality one
  • validity-trustworthiness — the alignment-without-reliability gap; the paper demonstrates that descriptive code alignment does not guarantee interpretive accuracy