Source
url: https://ai.jmir.org/2024/1/e54482
raw: raw/Prescott_ai-2024-1-e54482.pdf

TL;DR: ChatGPT and Bard matched 71% of human-identified themes inductively, but intercoder reliability was only fair-to-moderate (37–47%). AI was 28× faster. The headline concordance figure looks promising; the reliability figure reveals the gap. Conclusion: hybrid approaches are necessary; AI is more reliable at descriptive than interpretive themes.

Problem

Qualitative methods are recognized as essential to digital health intervention research, but their labor intensity creates a bottleneck. In implementation science, where timely feedback from community data often determines whether an intervention succeeds, months-long coding processes are a real constraint. Rapid qualitative approaches have emerged as a partial solution, but most still require substantial human labor.

Generative AI offers a potential alternative: faster, cheaper, and potentially consistent. But “potentially” is doing significant work in that sentence. At the time of this study (2024), the empirical evidence on AI reliability for qualitative thematic analysis was thin and methodologically uneven. Most existing comparisons had studied one AI system with one dataset and reported concordance without drilling into category-level variation.

Prescott et al. address this gap in a domain where the stakes are concrete: HIV medication adherence interventions targeting methamphetamine users. The data — 40 brief SMS text message prompts from a digital health intervention — are messy, contextually specific, and emotionally weighted. If AI performs poorly here, it will not serve as a practical alternative in clinical implementation contexts.

Approach

The design is a direct human-AI comparison across two analytic modes:

Inductive thematic analysis: Two independent human coding teams developed themes from the data without a predefined framework. An independent human analyst then conducted the same analysis using ChatGPT and Bard separately, treating each AI as an independent coder.

Deductive thematic analysis: Both human teams and AI applied a predefined theoretical framework to the same data. This tests whether AI can accurately map complex health behavior theory onto real participant language. (A sketch of both modes follows below.)
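
For flavor, a minimal sketch of what "treating the AI as an independent coder" could look like through an API. Everything here is hypothetical: the prompts, the model name, and the use of the OpenAI Python SDK are my illustration, not the study's actual procedure or instructions.

```python
# Hypothetical sketch of the two analytic modes; not the study's actual prompts.
# Assumes the OpenAI Python SDK (v1) and a placeholder model name.
from openai import OpenAI

client = OpenAI()
sms_prompts = ["Time for your meds -- you've got this.", "..."]  # the 40 intervention messages

INDUCTIVE = (
    "You are an independent qualitative coder. Read the messages below and "
    "identify themes inductively, without any predefined framework."
)
DEDUCTIVE = (
    "You are an independent qualitative coder. Code each message below against "
    "this predefined framework: <framework constructs here>."  # placeholder
)

def run_coder(instruction: str) -> str:
    # One call per analytic mode; the full corpus fits easily in context at 40 messages.
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder; the paper tested ChatGPT and Bard as available in 2024
        messages=[{"role": "user", "content": instruction + "\n\n" + "\n".join(sms_prompts)}],
    )
    return resp.choices[0].message.content

inductive_themes = run_coder(INDUCTIVE)
deductive_codes = run_coder(DEDUCTIVE)
```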

Concordance was measured in two ways: theme consistency (did the AI identify the same themes as humans, at the theme level?) and intercoder reliability (when coding the same segment, did human and AI assign the same code?). This distinction matters. Theme consistency operates at the schema level — a relatively forgiving comparison. Reliability operates at the instance level — where disagreement is visible in every coded segment.
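
To make the distinction concrete, here is a minimal sketch (toy data; the theme names and code assignments are invented, not the study's) of how the two metrics come apart: two coders can share most of a codebook at the schema level while agreeing on only half the segments at the instance level.

```python
# Schema level: did the AI's codebook contain the same themes as the humans'?
human_themes = {"side effects", "stigma", "routine", "social support", "cost", "privacy", "motivation"}
ai_themes    = {"side effects", "stigma", "routine", "social support", "motivation"}

theme_consistency = len(human_themes & ai_themes) / len(human_themes)
print(f"Theme consistency: {theme_consistency:.0%}")  # 5/7 ≈ 71%

# Instance level: coding the SAME segments, did human and AI assign the same code?
human_codes = ["stigma", "routine", "side effects", "routine", "motivation", "stigma", "cost", "routine"]
ai_codes    = ["stigma", "motivation", "side effects", "social support", "motivation", "routine", "privacy", "routine"]

agreement = sum(h == a for h, a in zip(human_codes, ai_codes)) / len(human_codes)
print(f"Percent agreement: {agreement:.0%}")  # 4/8 = 50% here; the study found 37-47%
```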

Time was also measured: AI required an average of 20 minutes (SD 3.5); humans required 567 minutes (SD 106.5), a roughly 28× difference (567/20 ≈ 28). The efficiency gain held for both AI systems.

AI’s Role

AI is positioned as a potential efficiency substitute — the study is asking whether AI could replace or substantially supplement human coders in health implementation contexts where speed matters. The framing is explicitly pragmatic: the question is not epistemological but operational. Can AI produce analysis fast enough to be useful in real implementation cycles without sacrificing accuracy?

The answer is nuanced. AI can identify the major themes. AI cannot reliably code individual instances within those themes, which means it cannot substitute for human coders in tasks where instance-level accuracy matters — as it often does in clinical and behavioral research.

Epistemological Stance

Post-positivist, operating within a digital health and implementation science framework. Concordance metrics (percent agreement, consistency) are the primary evidence. The study does not engage with interpretivist or constructionist epistemologies, and its evaluation criteria would not be recognized as appropriate by Big Q researchers. This is not a limitation within the paper’s scope — implementation science has its own validity criteria, and the paper is explicit about its goals — but it means the conclusions should not be read as applying to reflexive or constructionist qualitative work.

The deductive approach, in particular, assumes that pre-existing health behavior theory provides a valid structure for interpreting participant language. Whether that mapping is itself valid is not a question the study raises.

Rigor and Trustworthiness

The study’s most methodologically careful move is separating theme consistency from intercoder reliability. Most AI-TA comparisons report one or the other; reporting both reveals the gap between schema-level agreement and instance-level agreement.

Theme consistency at 71% (inductive) is higher than most small-corpus studies with older models. But the reliability figures are the frank part: at 37–47% agreement, human and AI coders assign the same code to the same segment less than half the time. This is a finding about the limits of AI consistency, not just its speed.
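
A back-of-envelope way to read those figures, assuming a hypothetical codebook of eight roughly equiprobable codes (the paper's marginal distributions are not reproduced here): chance-correcting the raw agreement with Cohen's kappa lands in the fair band of the usual Landis and Koch scale.

```python
# Back-of-envelope (assumptions mine, not the paper's): what 37-47% raw
# agreement implies once chance agreement is netted out. Cohen's kappa:
#   kappa = (p_o - p_e) / (1 - p_e)
# where p_e is the agreement expected by chance from the coders' marginals.

n_codes = 8          # assumed codebook size, for illustration only
p_e = 1 / n_codes    # chance agreement if both coders use codes equiprobably

for p_o in (0.37, 0.47):
    kappa = (p_o - p_e) / (1 - p_e)
    print(f"p_o = {p_o:.0%} -> kappa = {kappa:.2f}")
# p_o = 37% -> kappa = 0.28; p_o = 47% -> kappa = 0.39 (both "fair" on Landis-Koch)
```

Real marginal distributions are rarely uniform, so the actual kappa could sit higher or lower; the point is only that agreement in this range is weak once chance is accounted for.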

The between-AI comparison is also useful: ChatGPT and Bard performed similarly to each other (100% inductive theme consistency, 83% deductive). This consistency between AI systems, combined with inconsistency versus humans, suggests the divergence is structural — both AIs are capturing something about the data, but something different from what humans capture.

Limitations

The dataset is deliberately minimal — 40 SMS prompts. This is sufficient for initial thematic analysis but does not test AI performance on the larger corpora where efficiency gains are most valuable. Whether 71% theme consistency holds at 500 or 5,000 messages is unknown.

The models tested (ChatGPT-4, Bard) represent a snapshot in 2024. Model improvements since then — and the dramatic performance gains documented in bennis-ai-thematic-analysis-2025 — mean the specific reliability figures here are already dated. The methodological contribution is the design and the gap it reveals, not the numbers.

The paper does not investigate why AI failed on specific themes. The finding that deductive approaches produce lower consistency (50–58%) than inductive (71%) replicates bijker-chatgpt-qca-2024's pattern, but neither paper unpacks the mechanism. The consistent failure on nuanced and interpretive themes, versus success on descriptive themes, is documented but not explained.

Connections

  • intercoder-agreement — percent agreement used throughout; the gap between theme consistency and reliability is the study’s key methodological contribution
  • bijker-chatgpt-qca-2024 — parallel empirical comparison with higher κ but different data type (forum posts vs. SMS); both find deductive harder than inductive
  • bennis-ai-thematic-analysis-2025 — the trajectory study: models tested five months later achieve near-perfect concordance, contextualizing what 71% meant in mid-2024
  • salazar-gpt4-qualitative-2025 — similar comparison in health professions education; consistent failure on low-frequency codes and nuanced themes
  • anis-french-ai-qualitative-research-2023 — reframes the failure cases as analytically productive rather than just a limitation
  • llm-qualitative-research — broader landscape
  • validity-trustworthiness — the reliability-without-validity gap is acute here; consistency with human themes does not guarantee the themes are meaningful